So three weeks later…

Well, it seems life has caught up with me again. Since my last post, my wife took a job in another part of the country, I had several interviews for a teaching position in the fall, and we moved 1,700 miles to Cedar City, Utah. So I apologize for the delay and the lack of posts over the last few weeks. I will be getting back in the saddle shortly, as we just got the internet hooked up in our new apartment.


Back in the saddle

So, I must apologize for my recent absence from this blog, and for the lack of warning. I got married 10 days ago, so the last week and a half has been a blur of last-minute wedding preparations, the wedding itself, our honeymoon up to Traverse City, and then moving and getting our house settled. So forgive me for being a little preoccupied lately. Unfortunately our new house, which is actually pretty old, has a pretty slow internet connection, and that has deterred me from posting as well. But I have some spare time at the moment, so I hope to resume regular posts again. Enjoy!

Variable Bandwidth

Well, for the last two days I have been messing around with an R package that is new to me, called “locfit”. This package does regression and density estimation using a variable bandwidth parameter instead of the fixed one that I have been using thus far. With a variable bandwidth, the amount of smoothing adapts to how much data is nearby: less smoothing where pitches are dense, more where they are sparse. The good news is that I have updated my scripts to take advantage of this new capability. The bad news is that it’s going to take me a little bit of time to update the maps I have posted already. So in the next couple of days I hope to redo my umpire season plots from 2012. After that, I hope to begin doing maps for individual games again. So bear with me as I transition to this new methodology.
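To make the fixed versus variable bandwidth distinction concrete, here is a minimal sketch in Python rather than R. It is not locfit's algorithm and not my actual script; it just illustrates one common way to vary the bandwidth, by setting it to the distance to the k-th nearest data point. The function names and the `frac` parameter are made up for this illustration.

```python
import math

def gaussian_kernel(u):
    """Standard normal density, used as the smoothing kernel."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kde_fixed(x, data, h):
    """Density estimate at x using the same bandwidth h everywhere."""
    return sum(gaussian_kernel((x - xi) / h) for xi in data) / (len(data) * h)

def knn_bandwidth(x, data, frac):
    """Hypothetical helper: bandwidth = distance from x to its k-th
    nearest data point, where k = ceil(frac * n)."""
    k = max(1, math.ceil(frac * len(data)))
    dists = sorted(abs(x - xi) for xi in data)
    return dists[k - 1]

def kde_variable(x, data, frac, h_min=1e-6):
    """Density estimate at x with a bandwidth that widens where data is
    sparse and narrows where data is dense (assumed nearest-neighbor rule)."""
    h = max(knn_bandwidth(x, data, frac), h_min)  # guard against h == 0
    return kde_fixed(x, data, h)

# Toy data: four points clustered near 0 and one isolated point at 5.
data = [0.0, 0.1, 0.2, 0.3, 5.0]
h_dense = knn_bandwidth(0.0, data, 0.5)   # small bandwidth inside the cluster
h_sparse = knn_bandwidth(5.0, data, 0.5)  # large bandwidth near the lone point
```

In the toy data, the bandwidth near the cluster comes out much smaller than the bandwidth near the isolated point, which is the behavior a fixed bandwidth cannot give you.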

Posts past and posts future

Today I posted the strike zone maps for five umpires who are on the 2013 MLB roster. These strike zone maps were created using all the called strikes and balls for each umpire over the course of the 2012 regular season and postseason. I hope to have all the 2012 maps for the 2013 MLB umpires posted before the start of the regular season. All of these have been posted in a new category called season maps, and they are tagged with “2012 Season” and the name of the umpire.

Next, during the regular season, I hope to post game strike zone maps each day for the previous day’s MLB games. I say “I hope” because I haven’t yet finished the script that will control that process. I’ll also have to see how much time it takes each day. I am trying to automate as much of the process as I can, but the script is still a work in progress.

During the regular season I also hope to do some other analysis of pitching and hitting, using the PITCHf/x data and the methods I apply to the strike zone. Those posts will be less regular, and that, again, depends on the time I have during the course of the regular season. But that’s my goal for the coming season: regular strike zone maps with some other analyses interspersed.

Finally, I hope to write more posts detailing exactly how I have generated these maps. These will include more detail on the mathematics behind the plots and some more information about the software I use. I will be including some of my code that does specific tasks, but I don’t plan on releasing all of my code at the moment. Hopefully though, if people are interested, they will be able to generate their own plots using the code I share and some ingenuity of their own.

On the shoulders of giants

Mapping strike zones as called by umpires is not a new task. Several people have done this before me. In this post, I want to show what has been done already, and then, in the next post, contrast that with the method I have employed. To start, let’s look at the strike zone tool available at Brooks Baseball.

As an example, let’s look at the strike zone called by Dana DeMuth on April 5, 2012. This was the home opener for the Cubs, who played the Washington Nationals. The Brooks Baseball strike zone map is here. On this plot, which is from the catcher’s perspective, strikes are plotted in red and balls are plotted in green. There is a box that denotes the strike zone as defined by MLB rules and a box that shows the strike zone as called by Dana DeMuth.

I have two major issues with this plot. First, it is a qualitative analysis of the umpire’s performance. We can see a green pitch that is located inside the strike zone and know that he messed up, but we can’t estimate how often that happens. We also can’t say how often a pitch is called a strike in a certain part of the zone. Second, this plot assumes the strike zone is rectangular. This may be true for the strike zone defined by MLB, but umpires do not see the strike zone that way. So this tool makes some basic assumptions about the shape of the strike zone and does not provide us with a way to quantify the umpire’s performance.

Another way to map the strike zone is available at Baseball Heat Maps. We can again use Dana DeMuth as an example. Their plot for left-handed hitters is here, and the right-handed hitter plot is here for the same game. Notice that they quantify the frequency with which Dana DeMuth calls strikes in different parts of the zone. The shape of the strike zone is also allowed to vary based on the performance of the umpire. So they do a pretty good job at Baseball Heat Maps, but I think it can be improved.

First, the plots aren’t allowed to freely mimic the distribution of pitches called by the umpire. Notice the straight lines and sharp angles where there is a change in direction. This is a product of the method they used to generate the graph. A more flexible method would allow the edges to curve; in other words, the estimated strike zone would be a smooth, continuous function. Second, and this might be nitpicky, their color scheme is, well, ugly and not that informative.

So if I want to improve upon these strike zone maps, I need a method that allows the strike zone map to be a continuous function, quantifies the frequency with which pitches get called strikes or balls, and has a better color scheme than the other two methods shown here. Fortunately, all these things can be done using kernel density estimation, but that’s for the next post.
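As a rough preview of the idea, here is a minimal Python sketch of how kernel weighting can turn called pitches into a smooth strike frequency at any location. This is a generic kernel-weighted average, not necessarily the method I will describe in the next post. The function name, the Gaussian kernel, the bandwidth of 0.25 feet, and the four toy pitches are all assumptions made up for illustration; none of it is real PITCHf/x data.

```python
import math

def strike_probability(px, pz, pitches, h=0.25):
    """Estimate the fraction of called pitches near (px, pz) that were
    called strikes, weighting each pitch by a Gaussian kernel of its
    distance from (px, pz).

    pitches: list of (x, z, is_strike) tuples, locations in feet (assumed).
    h: bandwidth in feet (assumed value, not tuned).
    """
    num = 0.0  # kernel weight from called strikes
    den = 0.0  # kernel weight from all called pitches
    for x, z, is_strike in pitches:
        d2 = (px - x) ** 2 + (pz - z) ** 2
        w = math.exp(-0.5 * d2 / (h * h))
        den += w
        if is_strike:
            num += w
    # If no pitch carries any weight here, the estimate is undefined.
    return num / den if den > 0.0 else float("nan")

# Made-up toy pitches: two called strikes near the middle of the plate,
# two called balls well off the plate.
toy_pitches = [
    (0.0, 2.5, True),
    (0.1, 2.4, True),
    (1.5, 2.5, False),
    (1.6, 2.6, False),
]
p_middle = strike_probability(0.0, 2.5, toy_pitches)
p_outside = strike_probability(1.5, 2.5, toy_pitches)
p_between = strike_probability(0.8, 2.5, toy_pitches)
```

Because the weights change smoothly with location, the estimate changes smoothly too, so contours of equal strike frequency can curve instead of following a box or a grid. And because the result is a number between 0 and 1, it directly answers the question of how often a pitch in that spot gets called a strike.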