Now that I have a lot of charts up, I want to explain in further detail how they were made. To do this, I need to talk a bit more about kernel density estimation. KDE, as I have said previously, is a way to estimate the probability density function (pdf), which, when integrated, gives the probability of a data point falling within a specific range. A mouthful, yes, but let me explain it another way. Today in Grand Rapids, MI the temperature ranged from 12°F to 28°F (even though, technically, it is spring). The pdf for today's temperature could then tell us the probability of the temperature being between any two values in that range. Over the full range, 12°F to 28°F, the pdf would integrate to 1; over a smaller range, it would integrate to something less. Using KDE to estimate the pdf of the called strikes lets us do the same thing for an area in the x-z plane at the front of home plate, and we can do the same for pitches the umpire called balls. This is quite a nice thing to be able to do, and it's worth looking further into the mathematics of KDE.
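To make the "integrate the pdf over a range" idea concrete, here is a tiny numeric sketch. The bell-shaped pdf and its numbers (mean 20°F, spread 4°F) are made up purely for illustration; they are not fitted to any real temperature data.

```python
import numpy as np

# Hypothetical pdf for the day's temperature: a normal bump centered at
# 20°F with a spread of 4°F (made-up numbers, purely for illustration).
def temp_pdf(t, mu=20.0, sigma=4.0):
    return np.exp(-0.5 * ((t - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Integrating the pdf between two temperatures gives the probability of
# the temperature falling in that range.
t = np.linspace(15, 25, 1001)
prob = np.trapz(temp_pdf(t), t)   # probability of 15°F to 25°F
print(round(prob, 3))             # ≈ 0.789 for this made-up pdf
```

Integrating over a wider and wider range pushes the result toward 1, which is exactly the "full range integrates to 1" property above.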
KDE works by attaching a function called a kernel to each data point. The kernel is itself a pdf; think of it as saying that wherever there is a data point, there is a little hill or bump centered on that point. KDE then sums up all the kernels and divides by the number of data points. That division means the summation integrates to 1, just like a pdf should. KDE can be done on 1-D, 2-D, and even higher-dimensional data sets. When I analyze pitch data, I use the 2-D form of KDE, which produces a 3-D plot.
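The sum-the-bumps-and-divide recipe is short enough to write out directly. This is a minimal 1-D sketch with Gaussian kernels and toy data, not the code behind the charts:

```python
import numpy as np

def kde_1d(data, x, h):
    """Sum a Gaussian kernel centered on each data point, then divide
    by n so the estimate integrates to 1 like a proper pdf."""
    data = np.asarray(data, dtype=float)
    # One Gaussian "bump" per data point, evaluated on the grid x
    kernels = np.exp(-0.5 * ((x - data[:, None]) / h) ** 2) / (h * np.sqrt(2 * np.pi))
    return kernels.sum(axis=0) / len(data)

# Toy data; the estimate should integrate to ~1 over a wide grid
pts = [12.0, 14.5, 20.0, 26.0, 28.0]
grid = np.linspace(0, 40, 2001)
density = kde_1d(pts, grid, h=2.0)
print(np.trapz(density, grid))  # ≈ 1.0
```

Without the final division by `len(data)` the curve would integrate to n instead of 1, which is why that step matters.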
The bandwidth, h, controls the smoothing: how far the kernel extends from the data point before it gets really small. Choosing h is a really important step in KDE. In one dimension h is just a number; in two dimensions it becomes a 2×2 matrix. The matrix allows the kernel to change shape and orientation, which is really useful, especially when combined with a method for optimizing the bandwidth: it lets the kernel adjust to clustering and orientation trends in the data. The .gif below shows a two-dimensional KDE using an elliptical kernel with eruption data from Old Faithful Geyser in Yellowstone National Park (click on the image to watch it play).
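Here is a bare-bones sketch of what a full 2×2 bandwidth matrix buys you (this is not the R ks-package implementation, just the underlying formula): the off-diagonal entries of H are what tilt the elliptical kernel. The matrix values below are made up.

```python
import numpy as np

def kde_2d(data, x, H):
    """2-D KDE with a full 2x2 bandwidth matrix H, so each kernel can be
    elliptical and tilted rather than a circular bump."""
    H = np.asarray(H, dtype=float)
    Hinv = np.linalg.inv(H)
    const = 1.0 / (2 * np.pi * np.sqrt(np.linalg.det(H)))
    vals = np.zeros(len(x))
    for d in data:                      # one elliptical bump per data point
        diff = x - d                    # shape (m, 2)
        # Quadratic form diff^T Hinv diff for every evaluation point
        vals += const * np.exp(-0.5 * np.einsum('ij,jk,ik->i', diff, Hinv, diff))
    return vals / len(data)

# Made-up bandwidth matrix: nonzero off-diagonals rotate the ellipse
H = np.array([[0.09, 0.04],
              [0.04, 0.16]])
data = np.array([[0.0, 2.0], [-0.5, 2.5], [0.3, 1.8]])
print(kde_2d(data, np.array([[0.0, 2.0]]), H))
```

With a diagonal H the ellipse axes stay aligned to x and z; the full matrix is what lets the kernel follow a tilted cluster of pitches.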
This is similar to the process that I use to generate my strike zone maps. The difference is that when I generate a strike zone map, I use a full bandwidth matrix, which means that my kernel is elliptical and can change orientation. The size and orientation of the kernel are selected for each data set using the plug-in bandwidth selector and the SAMSE pilot kernel. Sounds like a mouthful, but this is where R comes in really handy.
Once I have generated the KDEs for an umpire's called balls and called strikes, I want to combine them to account for the fact that umpires don't always make the same call on the same pitch. I do this by normalizing the strike and ball densities to the total number of pitches the umpire has called. Equations 2 and 3 are the normalized KDEs for the called-ball and called-strike data sets, respectively.
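The normalization described above amounts to weighting each density by that call's share of all pitches. Since Equations 2 and 3 appear as images, here is my hedged reading of them in code; the names (`f_strike`, `f_ball`, `n_strike`, `n_ball`) are mine, not from the original post.

```python
def normalize_densities(f_strike, f_ball, n_strike, n_ball):
    """Weight each KDE by its fraction of total called pitches, so the
    two surfaces together integrate to 1 (sketch of Eqs. 2 and 3)."""
    n_total = n_strike + n_ball
    g_strike = (n_strike / n_total) * f_strike
    g_ball = (n_ball / n_total) * f_ball
    return g_strike, g_ball

# Toy example: 600 called strikes and 400 called balls, with made-up
# density values at one point in the x-z plane
g_strike, g_ball = normalize_densities(0.5, 0.2, 600, 400)
print(g_strike, g_ball)  # 0.3 0.08
```

The inputs can just as well be whole numpy grids as single values; the weighting is pointwise either way.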
After I normalize the two densities, I subtract equation 2 from equation 3, which essentially gives me the strike zone maps I have published. In reality, to make the plots easier to read, I divide the difference by its maximum value. This produces a plot that ranges from about -1 to 1 and is really easy to visualize. However, it can be somewhat misleading: it is not correct to say that a pitch thrown at the 0.75 contour line will be called a strike 75% of the time. Rather, the 0.75 contour marks where the difference between the strike and ball densities is 75% of the maximum difference. The final division is simply an attempt to make the plot easier to read by showing normalized values instead of the actual density differences, which are an order of magnitude smaller.
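The subtract-and-rescale step is two lines; the toy grids below are made up, and the docstring repeats the caveat that contours are fractions of the peak difference, not call probabilities.

```python
import numpy as np

def strike_zone_map(g_strike, g_ball):
    """Subtract the normalized ball density from the normalized strike
    density, then divide by the maximum difference so the map tops out
    at 1. Contour values are fractions of the peak difference, NOT the
    probability of a strike call."""
    diff = g_strike - g_ball
    return diff / diff.max()

# Made-up normalized densities on a tiny 2x2 grid
g_strike = np.array([[0.02, 0.10], [0.30, 0.05]])
g_ball = np.array([[0.20, 0.05], [0.02, 0.15]])
zmap = strike_zone_map(g_strike, g_ball)
print(zmap.max())  # 1.0
```

Positive values mark regions where strike calls dominate, negative values where ball calls dominate, and the peak is pinned to exactly 1 by the rescaling.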
That's the more in-depth description of how I have gone about mapping the strike zone. In future posts I will start to share some of the code I have written to collect the data, carry out the analyses, and visualize the results. In short, I use Perl to send data to R, which handles the math, and then Perl again to send the results to GMT, which generates the plots. For more on GMT and R, go here and here, respectively.