In the last post, I discussed some ways people have mapped the strike zone before, notably at Brooks Baseball and Baseball Heat Maps. They do a pretty good job, but I laid out some ways that I think it can be done better. Specifically, I said that an improved strike zone mapping method should do several things:
- Quantify the probability of a pitch being called a strike or a ball using a continuous function
- Not make any prior assumptions about the shape of the strike zone
- Look good and be easy to read
I believe I can use kernel density estimation (KDE from here on) to make strike zone maps that meet each of these criteria.
There are a lot of resources that explain the theory behind KDE and will do a much better job than I am going to do here, but here is the twenty-second blurb. KDE is a statistical method for estimating the probability density function of a set of data points. It does this by centering a mathematical function, called a kernel, on each data point. Where the kernels overlap, they are summed together. If each kernel integrates to one and the sum is divided by the number of data points, the result is a function that, when integrated over a specific area, gives the probability of a data point (say, a pitch called a strike) falling within that area.
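To make the blurb concrete, here is a minimal sketch of a two-dimensional KDE in Python with NumPy. It is an illustration of the general technique, not the code behind the plots in this post. The Gaussian kernel, the bandwidth of 0.25 feet, the grid limits, the function name `gaussian_kde_2d`, and the randomly generated pitch locations are all assumptions made for the example.

```python
import numpy as np

def gaussian_kde_2d(points, grid_x, grid_z, bandwidth):
    """Evaluate a 2D Gaussian kernel density estimate on a grid.

    points: array of shape (n, 2) holding (x, z) pitch locations.
    Returns an array of shape (len(grid_z), len(grid_x)).
    """
    points = np.asarray(points, dtype=float)
    xx, zz = np.meshgrid(grid_x, grid_z)
    # Squared distance from every grid node to every data point: (nz, nx, n)
    d2 = (xx[..., None] - points[:, 0]) ** 2 + (zz[..., None] - points[:, 1]) ** 2
    # One isotropic Gaussian kernel per data point, each integrating to 1
    kernels = np.exp(-d2 / (2 * bandwidth ** 2)) / (2 * np.pi * bandwidth ** 2)
    # Sum the overlapping kernels and divide by n so the surface integrates to 1
    return kernels.sum(axis=-1) / len(points)

# Made-up called-strike locations in feet (assumption: not real pitch data)
rng = np.random.default_rng(0)
strikes = rng.normal(loc=[0.0, 2.5], scale=[0.5, 0.5], size=(60, 2))

grid_x = np.linspace(-4, 4, 161)
grid_z = np.linspace(-1.5, 6.5, 161)
density = gaussian_kde_2d(strikes, grid_x, grid_z, bandwidth=0.25)

# Numerically integrate the density over the grid as a sanity check
cell_area = (grid_x[1] - grid_x[0]) * (grid_z[1] - grid_z[0])
total = density.sum() * cell_area
print(round(total, 3))
```

The printed total should come out very close to 1, which is the check that the summed kernels really do form a probability density. The bandwidth controls how much each pitch is smeared out: a larger value gives a smoother map, a smaller one a spikier map.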
So, KDE can be used to generate maps of the distribution of called strikes or called balls for games, umps, seasons, teams, or whatever division is interesting. Below are two plots, from the catcher’s perspective, of the called strikes and called balls for the Cubs’ 2012 season opener as called by Dana DeMuth. These plots are informative, but they aren’t really new to strike zone mapping. Others have already generated similar plots using similar methods.
Let’s consider a pitch that just catches the black on the outside corner of the plate. An ump has to make a decision to call this pitch a strike or a ball. Even if this pitch is called a strike this time, there is no guarantee that if the umpire sees the exact same pitch again he will call it a strike. A strike zone map has to take into account the inconsistencies inherent in umpiring.
I did this by normalizing the strike distribution and the ball distribution to the total number of pitches that the umpire has called. Then, I subtracted the ball density from the strike density and divided by the maximum value of the difference. The result is a plot on a scale from -1 to 1 that gives the relative likelihood of a pitch being called a strike versus a ball. A positive value means that a pitch at that location is more likely to be called a strike than a ball, and a negative value means the opposite. The plot is color coded as well: warm colors indicate positive values and cool colors indicate negative values.
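The steps in the previous paragraph can be sketched roughly as follows. This is a hedged illustration, not the exact code used for the map. The helper names `kde_on_grid` and `strike_zone_map`, the Gaussian kernel, the bandwidth, and the toy "umpire" that generates the data are all assumptions. In particular, I divide by the largest absolute difference so the result is guaranteed to stay between -1 and 1, which is one reading of "the maximum value of the difference."

```python
import numpy as np

def kde_on_grid(points, xx, zz, bandwidth):
    """Gaussian KDE of (n, 2) points evaluated on meshgrid arrays xx, zz."""
    points = np.asarray(points, dtype=float)
    d2 = (xx[..., None] - points[:, 0]) ** 2 + (zz[..., None] - points[:, 1]) ** 2
    kernels = np.exp(-d2 / (2 * bandwidth ** 2)) / (2 * np.pi * bandwidth ** 2)
    return kernels.sum(axis=-1) / len(points)

def strike_zone_map(strikes, balls, xx, zz, bandwidth=0.25):
    """Scaled strike-minus-ball density, between -1 and 1."""
    n_total = len(strikes) + len(balls)
    # Weight each density by its share of all called pitches
    strike_part = kde_on_grid(strikes, xx, zz, bandwidth) * len(strikes) / n_total
    ball_part = kde_on_grid(balls, xx, zz, bandwidth) * len(balls) / n_total
    diff = strike_part - ball_part
    # Assumption: scale by the largest absolute difference
    return diff / np.abs(diff).max()

# Toy data (assumption): uniform pitches, called a strike inside a fixed box
rng = np.random.default_rng(1)
pitches = rng.uniform(low=[-2.0, 0.5], high=[2.0, 4.5], size=(300, 2))
in_zone = (
    (np.abs(pitches[:, 0]) < 0.83)
    & (pitches[:, 1] > 1.5)
    & (pitches[:, 1] < 3.5)
)
strikes, balls = pitches[in_zone], pitches[~in_zone]

xx, zz = np.meshgrid(np.linspace(-2.5, 2.5, 101), np.linspace(0.0, 5.0, 101))
zone = strike_zone_map(strikes, balls, xx, zz)

# Positive over the heart of the plate, negative well off the plate
print(zone[50, 50] > 0, zone[50, 20] < 0)
```

To get the warm and cool coloring, an array like `zone` could be passed to matplotlib's `pcolormesh` with a diverging colormap such as `RdBu_r` and the color limits fixed at -1 and 1, so that zero always lands on the neutral color.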
Using this method, I generated the strike zone map for the Cubs home opener in 2012 as called by Dana DeMuth. I argue that this method fulfills all of the criteria I set out to meet. The probability of a pitch being called a strike or a ball is quantified and displayed in a good-looking, easy-to-read strike zone map. The edges of the strike zone are free to mimic the distribution of the data, and I did not make any assumptions about the shape of the strike zone beforehand.