FanPost

Intro to Analytics using Statcast data


Since I have finally been able to do a deep dive into pitch data in my spare time, I figured I might as well share what I can and even crowdsource ideas for future projects. Some of y'all may know me, but I played baseball in college and have been a data scientist for the last 7-8 years. Additionally, I taught Python for Data Science at UTD for 3 semesters in their MS in Business Analytics program before moving from the metroplex. Currently, I am a Chief AI Engineer for a company in DC.

This is going to be a series of posts where I attempt to walk everyone through how to pull your own data, where to go for it, and some sample analyses. We will be focusing on the output from Statcast. As a bonus, you may even pick up some elementary to intermediate coding skills as we go (I am going to try to keep it super simple)!

This first post is an intro, and my goal is to post twice a week. I will be providing links to GitHub repositories that contain both R and Python code. A lot of this work is based on amazing packages built by Bill Petti, Carson Sievert, et al.

First of all, I realize that you can get a lot of this information on amazing sites such as FanGraphs and Baseball Savant. However, sometimes you want to take it beyond descriptive statistics and/or look at the data in a different way. The best way to do this is with R and/or Python, the two programming languages I use daily and arguably the most widely used for this kind of work. Both have their strengths and weaknesses.

Each language has a library/package that is great for this: in R it is called baseballr (https://github.com/BillPetti/baseballr) and in Python there is pybaseball (https://github.com/jldbc/pybaseball). I honestly go back and forth on which one I use. I find Petti's baseballr more robust, but pybaseball has its advantages as well. Also, check out the Shiny app that Bill Petti has written (https://billpetti.shinyapps.io/live_pbp_viewer/). It grabs live data and lets you visualize it on the second tab at the top (Visualize).
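For anyone who wants a preview of the Python side, here is a minimal sketch of pulling a few days of pitch-level data with pybaseball. It assumes pybaseball is installed (pip install pybaseball) and that you have an internet connection. `pull_statcast` is a hypothetical helper name used only for illustration; `statcast` is the function the package itself provides.

```python
from datetime import date


def pull_statcast(start_dt, end_dt):
    """Fetch pitch-level Statcast rows for a date range.

    Dates are ISO strings such as "2019-06-01". Returns whatever
    pybaseball.statcast returns (a pandas DataFrame).
    """
    # Parse first so a typo in a date fails fast, before any download starts.
    start = date.fromisoformat(start_dt)
    end = date.fromisoformat(end_dt)
    if end < start:
        raise ValueError("end_dt must not be earlier than start_dt")

    # Imported here so the date checks above work even without pybaseball.
    from pybaseball import statcast

    return statcast(start_dt=start.isoformat(), end_dt=end.isoformat())


# Example call (needs pybaseball and a network connection, so it is left commented out):
# df = pull_statcast("2019-06-01", "2019-06-03")
# print(df.shape)
```

Keeping the date range short is a good habit while experimenting, since every day of games adds thousands of pitches to the download.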

The data dictionary for the Statcast data is here: https://app.box.com/v/statcast-pitchfx-glossary-pett . As you can see, Statcast outputs about 85-90 features, some of which have been deprecated.
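One quick way to get a feel for those features is to count the columns in a Statcast CSV export and flag the ones that never contain a value. The sketch below uses only the Python standard library. The three rows are made up for illustration (they are not real pitches), the helper name `empty_columns` is hypothetical, and treating an always-blank column as a sign of a deprecated field is an assumption, not a rule.

```python
import csv
import io

# Tiny made-up sample shaped like a Statcast CSV export (illustrative values only).
SAMPLE_CSV = """pitch_type,release_speed,spin_rate_deprecated,events
FF,94.1,,
SL,85.3,,strikeout
FF,95.0,,single
"""


def empty_columns(csv_text):
    """Return (all column names, columns that are blank in every row)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = list(reader.fieldnames or [])
    has_value = {name: False for name in columns}
    for row in reader:
        for name in columns:
            if (row.get(name) or "").strip():
                has_value[name] = True
    empty = [name for name in columns if not has_value[name]]
    return columns, empty


columns, empty = empty_columns(SAMPLE_CSV)
print(len(columns), "columns; always empty:", empty)
```

To run the same check on a real export, read the file's text and pass it to `empty_columns` in place of `SAMPLE_CSV`.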

What's next: I am going to show y'all how to install R, RStudio, and Anaconda. Having these on your computer should let you do everything I do. Then we are going to do our first analysis, which will include some data visualizations. I am open to suggestions on what that should be.

I am also open to any questions or comments you have as I walk everyone through how to analyze Statcast data yourself in R and Python. All of the analyses will either focus on the Rangers or provide context for the Rangers' performance, individually or as a team.

PS - Apologies if I jump around a little in this post. This is just something I have been wanting to do for a while, and I figured I might as well start today. Both data science and Statcast data itself are massive topics, so I will be doing my best to keep each post focused.