How I Map the Strike Zone

Now that I have a lot of charts up, I want to explain in further detail how they were made. To do this, I need to talk a bit more about kernel density estimation. KDE, as I have said previously, is a way to estimate the probability density function (pdf), which, when integrated, gives the probability of a data point falling within a specific range. A mouthful, yes, but let me explain it another way. Today in Grand Rapids, MI the temperature ranged from 12°F to 28°F (even though, technically, it is spring). The pdf for today’s temperature would tell us the probability of the temperature being between any two values in that range. Over the full range, 12°F to 28°F, the pdf would integrate to 1. Over a smaller range, it would integrate to less. Using KDE to estimate the pdf of the called strikes lets us do the same thing for an area in the x-z plane at the front of home plate. We can do the same for pitches called a ball by the umpire. This is quite a nice thing to be able to do, and it’s worth looking further into the mathematics of KDE.

KDE works by attaching a function called a kernel to each data point. The kernel is itself a pdf. Think of this as saying that wherever there is a data point, there is a little hill or bump centered on that point. KDE then sums up all the kernels and divides by the number of data points. This extra step of division means that the summation integrates to 1, just like a pdf should. KDE can be done on 1-D, 2-D and even higher dimensional data sets. When I analyze pitch data, I use the 2-D form of KDE, which produces a 3-D plot.
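The sum-and-divide step can be sketched in a few lines of Python. This is only an illustration: the Gaussian kernel, the width `h`, and the temperature readings are all assumptions made up for the example, not the data or settings behind my maps.

```python
import math

def gaussian_kernel(u):
    """Standard normal pdf: a kernel that itself integrates to 1."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kde_1d(x, data, h):
    """Average of one kernel bump of width h centered on each data point."""
    n = len(data)
    return sum(gaussian_kernel((x - xi) / h) / h for xi in data) / n

def integrate(f, lo, hi, steps=4000):
    """Trapezoid-rule integral of f over [lo, hi]."""
    dx = (hi - lo) / steps
    total = 0.5 * (f(lo) + f(hi))
    for k in range(1, steps):
        total += f(lo + k * dx)
    return total * dx

# Made-up temperature readings (°F), purely for illustration.
temps = [12, 13, 15, 18, 21, 24, 26, 28, 27, 23, 19, 16]
h = 2.0  # hand-picked width (assumption), not an optimized value

def density(x):
    return kde_1d(x, temps, h)

# A Gaussian bump spills a little past the observed 12-28 °F range,
# so the "whole pdf" check integrates over a much wider interval.
total_area = integrate(density, -20.0, 60.0)
p_20_to_25 = integrate(density, 20.0, 25.0)  # probability of 20-25 °F
print(round(total_area, 4), round(p_20_to_25, 4))
```

The first number should come out essentially equal to 1, and the second is the estimated probability of a reading between 20°F and 25°F.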

In one dimension, the KDE equation takes the form shown in equation 1, where n is the number of data points and h is the bandwidth.
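For reference, the standard one-dimensional estimator is usually written as follows. The symbols K (the kernel) and x_i (the i-th data point) are notation assumed here; n and h are as defined above.

```latex
\hat{f}_h(x) = \frac{1}{n h} \sum_{i=1}^{n} K\!\left( \frac{x - x_i}{h} \right) \tag{1}
```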

Bandwidth controls the smoothing, or how far the kernel extends from the data point before it gets really small. Choosing h is a really important step in KDE. In one dimension h is just a number. In two dimensions h becomes a 2×2 matrix. The matrix allows the kernel to change shape and orientation. This is really useful, especially when combined with a method for optimizing the bandwidth. It lets the kernel adjust to clustering and orientation trends in the data. The .gif below shows a two-dimensional KDE using an elliptical kernel with eruption data from Old Faithful Geyser in Yellowstone National Park.

[Animation: two-dimensional KDE of the Old Faithful eruption data]

This is similar to the process that I use to generate my strike zone maps. The difference is that when I generate a strike zone map, I use a full bandwidth matrix, which means that my kernel is elliptical and can change orientation. The size and orientation of the kernel are selected for each data set using the plug-in bandwidth selector and the SAMSE pilot kernel. Sounds like a mouthful, but this is where R comes in really handy.
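To show what a full bandwidth matrix does, here is a minimal Python sketch of a 2-D KDE with a Gaussian kernel. The matrix `H` is hand-picked and the pitch locations are made up; nothing here reproduces the plug-in selector. In R, a plug-in selector with a SAMSE pilot is available as `Hpi(x, pilot = "samse")` in the ks package, though treating that as the exact call behind these maps is an assumption.

```python
import math

def gaussian_kernel_2d(dx, dz, H):
    """Bivariate normal pdf with covariance H = [[a, b], [b, c]] at offset (dx, dz)."""
    (a, b), (_, c) = H
    det = a * c - b * b
    # Quadratic form d^T H^{-1} d via the closed-form inverse of a 2x2 matrix.
    q = (c * dx * dx - 2.0 * b * dx * dz + a * dz * dz) / det
    return math.exp(-0.5 * q) / (2.0 * math.pi * math.sqrt(det))

def kde_2d(x, z, points, H):
    """Average of one elliptical bump per data point."""
    return sum(gaussian_kernel_2d(x - px, z - pz, H) for px, pz in points) / len(points)

# Made-up pitch locations (feet) that trend from low-left to high-right.
pitches = [(-0.5, 2.0), (-0.2, 2.2), (0.0, 2.5), (0.3, 2.8), (0.6, 3.1)]

# Hand-picked full bandwidth matrix (assumption). The nonzero off-diagonal
# terms tilt the ellipse so its long axis follows the trend in the data.
H = [[0.04, 0.03], [0.03, 0.04]]

along = kde_2d(0.2, 2.7, pitches, H)   # offset from (0, 2.5) along the trend
across = kde_2d(0.2, 2.3, pitches, H)  # same distance, but across the trend

# Midpoint Riemann sum over a box that comfortably contains all the bumps.
step = 0.05
total = 0.0
for i in range(120):
    for j in range(100):
        x = -3.0 + (i + 0.5) * step
        z = 0.0 + (j + 0.5) * step
        total += kde_2d(x, z, pitches, H) * step * step

print(along > across, round(total, 3))
```

The two test locations sit the same distance from the middle pitch, yet the density is higher along the trend than across it. A diagonal matrix could not do that, because its ellipse stays aligned with the axes.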

Once I have generated the KDE for the called balls and called strikes for an umpire, I then want to combine them to take into account the fact that umpires don’t always make the same call for the same pitch. I do this by normalizing the strike and ball densities to the total number of pitches that the umpire has called. Equations 2 and 3 are the normalized KDE for the called balls and called strikes data sets, respectively.
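One reading of this normalization, offered as an assumption about the form of equations 2 and 3 rather than a transcription of them, is to weight each density by its share of the umpire's calls. Here n_B and n_S are the numbers of called balls and called strikes, and the hatted f terms are the KDEs of the two data sets (all notation assumed).

```latex
\hat{f}^{*}_{B}(\mathbf{x}) = \frac{n_B}{n_B + n_S}\, \hat{f}_{B}(\mathbf{x}) \tag{2}

\hat{f}^{*}_{S}(\mathbf{x}) = \frac{n_S}{n_B + n_S}\, \hat{f}_{S}(\mathbf{x}) \tag{3}
```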

After I normalize the two densities, I subtract equation 2 from equation 3, which essentially gives me the strike zone maps that I have published. In practice, to make the plots easier to read, I divide the difference by its maximum value. This creates a plot that ranges from about -1 to 1, which is really easy to visualize. However, it can be somewhat misleading. It is not correct to say that a pitch thrown at the 0.75 contour line will be called a strike 75% of the time. The 0.75 contour line represents where the difference between the strike and ball densities is 75% of the maximum difference. The final division is there only for readability: it shows normalized values instead of the actual density differences, which are an order of magnitude smaller.
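The whole recipe, and the caveat about the 0.75 contour, can be sketched as follows. Everything here is an assumption made for illustration: the pitch locations are made up, the kernel is a simple circular Gaussian with a hand-picked bandwidth, and the weights are each call type's share of the total. The `strike_fraction` helper is hypothetical and is not something my maps show; it is only there for contrast.

```python
import math

def kde_2d(x, z, points, h):
    """Circular Gaussian KDE with scalar bandwidth h, to keep the sketch short."""
    norm = 2.0 * math.pi * h * h * len(points)
    return sum(math.exp(-((x - px) ** 2 + (z - pz) ** 2) / (2.0 * h * h))
               for px, pz in points) / norm

# Made-up called pitches (feet, catcher's view): strikes in the middle, balls around.
strikes = [(0.0, 2.5), (0.3, 2.2), (-0.4, 2.8), (0.5, 3.0),
           (-0.6, 2.0), (0.1, 3.2), (-0.2, 1.8)]
balls = [(1.4, 2.5), (-1.5, 2.6), (0.2, 4.2), (0.0, 0.8), (1.2, 3.8),
         (-1.3, 1.0), (1.6, 1.2), (-1.1, 4.0), (0.9, 0.9)]
h = 0.3  # hand-picked bandwidth (assumption)

# Weight each density by its share of all called pitches (assumed normalization).
w_s = len(strikes) / (len(strikes) + len(balls))
w_b = len(balls) / (len(strikes) + len(balls))

def weighted(x, z):
    return w_s * kde_2d(x, z, strikes, h), w_b * kde_2d(x, z, balls, h)

# Evaluate strike-minus-ball on a grid, then divide by the maximum difference.
grid = [(-2.0 + 0.1 * i, 0.5 + 0.1 * j) for i in range(41) for j in range(41)]
diff = {}
for p in grid:
    s, b = weighted(*p)
    diff[p] = s - b
max_diff = max(diff.values())
scaled = {p: d / max_diff for p, d in diff.items()}

def strike_fraction(x, z):
    """Hypothetical helper: share of the local weighted density that is strikes."""
    s, b = weighted(x, z)
    return s / (s + b)

# Grid point closest to the 0.75 contour, and the strike share at that point.
near_075 = min(grid, key=lambda p: abs(scaled[p] - 0.75))
print(near_075, round(scaled[near_075], 3), round(strike_fraction(*near_075), 3))
```

The scaled map tops out at exactly 1 by construction, while its minimum depends on the data. At the grid point nearest the 0.75 contour, the strike share of the local density need not be anywhere near 0.75, which is why the contour values should not be read as call probabilities.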

That’s the more in-depth description of how I have gone about mapping the strike zone. In future posts I will start to share some of the code I have written to collect data, carry out the analyses, and visualize the results. In short, I use Perl to send data to R, which carries out the math, and then I use Perl again to send the results to GMT, which generates the plots. For more on GMT and R go here and here, respectively.


Mapping the strike zone

In the last post, I discussed some ways people have mapped the strike zone before, notably at Brooks Baseball and Baseball Heat Maps. They do a pretty good job, but I laid out some ways that I think it can be done better. Specifically, I said that an improved strike zone mapping method should do several things:

  1. Quantify the probability of a pitch being called a strike or a ball using a continuous function
  2. Not make any prior assumptions about the shape of the strike zone
  3. Look good and be easy to read

I believe I can use kernel density estimation (KDE from here on) to make strike zone maps that meet each of these criteria.

There are a lot of resources that explain the theory behind KDE and will do a much better job than I am going to do here, but here is the twenty-second blurb. KDE is a statistical method to estimate the probability density function of a set of data points. It does this by attaching a mathematical function, called a kernel, centered at each data point. Then, where the kernels overlap, they are summed together. If the kernel function is chosen carefully, the resulting summation is a function that, when integrated, gives the probability of having a data point (say, a pitch called a strike) within a specific area.

So, KDE can be used to generate maps of the distribution of called strikes or called balls for games, umps, seasons, teams, or whatever division is interesting. Below are two plots, from the catcher’s perspective, of the called strikes and called balls for the Cubs’ 2012 season opener as called by Dana DeMuth. These plots are informative, but they aren’t really new to strike zone mapping. Others have already generated similar plots using similar methods.

KDE of pitches called a ball by Dana DeMuth on April 5, 2012.


KDE of pitches called a strike by Dana DeMuth on April 5, 2012.


Let’s consider a pitch that just catches the black on the outside corner of the plate. An ump has to make a decision to call this pitch a strike or a ball. Even if this pitch is called a strike this time, there is no guarantee that if the umpire sees the exact same pitch again he will call it a strike. A strike zone map has to take into account the inconsistencies inherent in umpiring.

I did this by normalizing the strike distribution and the ball distribution to the total number of pitches that the umpire has called. Then, I subtracted the ball density from the strike density and divided by the maximum value of the difference. The result is a plot that ranges from about -1 to 1 and gives the relative likelihood of a pitch being called a strike versus a ball. A positive value means that a pitch at that location is more likely to be called a strike than a ball. The plot is color coded as well: warm colors indicate positive values and cool colors indicate negative values.

Strike zone map for Dana DeMuth on April 5, 2012.


So, using this method, I generated the strike zone map for the Cubs home opener in 2012 for Dana DeMuth. I argue that this method fulfills all of the criteria I set out to meet. The probability of a pitch being called a strike or a ball is quantified and displayed in a good-looking, easy-to-read strike zone map. The edges of the strike zone are free to mimic the distribution of the data, and I did not make any assumptions about the shape of the strike zone beforehand.