FanPost

Sabermetrics and sigma

I've been reading Sabermetric-leaning writers (whom I greatly respect) for almost 15 years, ever since I found Rob Neyer's column on ESPN. Through it all, I've learned a lot and grown in my appreciation of baseball. However, I finally felt the need to write a bit about a couple of things that have always bothered me.

First is the lack of scalability in the Pythagorean method for calculating expected wins and losses (expected W% = RS² / (RS² + RA²)). In other words, if you divide a team's schedule into parts, the sum of the expected records for the parts does not mathematically equal the expected record for the season taken as a whole. The quadratic nature of the formula prevents this from being true.
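To make that non-additivity concrete, here's a quick Python sketch. The run totals (500/400 in one half, 300/350 in the other) are made up for illustration, not any real team's numbers:

```python
# Pythagorean expectation: expected W% = RS^2 / (RS^2 + RA^2)
def pyth_wins(rs, ra, games):
    """Expected wins over a stretch of `games` games."""
    return games * rs**2 / (rs**2 + ra**2)

# Hypothetical season split into two 81-game halves:
# first half: 500 runs scored, 400 allowed; second: 300 scored, 350 allowed.
half1 = pyth_wins(500, 400, 81)
half2 = pyth_wins(300, 350, 81)
whole = pyth_wins(500 + 300, 400 + 350, 162)

print(round(half1 + half2, 1))  # sum of the halves: 83.7
print(round(whole, 1))          # season as a whole: 86.2
```

Same runs, same games, yet the two approaches disagree by about two and a half wins; because of the squaring, how the runs are split across the schedule matters.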

Now I can live with this as a simplification of a more complicated model, but I was curious what a more accurate (and more complicated) model might look like. For example, the model might use the runs-for/runs-against information to calculate the probability that a team wins any given game, and then extrapolate that into an expected win/loss record. However, to do this, you'd have to make assumptions about how runs are distributed across the various games.
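As a sketch of what such a model might look like, here's a small Monte Carlo in Python. The choice of a Poisson distribution for runs per game is my assumption purely for illustration (real run distributions are more skewed), and the per-game averages are made up:

```python
import math
import random

def sim_win_pct(rs_per_game, ra_per_game, sims=100_000, seed=1):
    """Estimate P(win a given game), assuming runs scored and allowed
    are independent Poisson draws (a simplifying assumption).
    Tied draws are redrawn, crudely standing in for extra innings."""
    rng = random.Random(seed)

    def poisson(lam):
        # Knuth's method; fine for baseball-sized run averages.
        threshold, k, p = math.exp(-lam), 0, 1.0
        while True:
            p *= rng.random()
            if p <= threshold:
                return k
            k += 1

    wins = 0
    for _ in range(sims):
        while True:
            rf, ra = poisson(rs_per_game), poisson(ra_per_game)
            if rf != ra:
                break
        wins += rf > ra
    return wins / sims

# A hypothetical team scoring 5.0 and allowing 4.3 runs per game:
p = sim_win_pct(5.0, 4.3)
print(round(162 * p))  # extrapolated expected wins over a full season
```

Swap in any run distribution you like for the `poisson` helper; that's exactly the assumption the paragraph above says you'd be forced to make.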

The problem is that sabermetrics doesn't seem to pay much attention to standard deviation, which seems important. To illustrate with an extreme scenario: if a team scored more runs than it gave up, but had a standard deviation of 0 for both runs for and runs against, how many games would you expect the team to have won? The answer is all of them, since every game would end with the same score in its favor.
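A quick simulation makes the same point less extremely. Here runs for and against are modeled as normal draws (my assumption, chosen only to isolate the sigma effect); the team's edge is fixed at half a run per game while the shared standard deviation varies:

```python
import random

def win_pct(mean_rf, mean_ra, sd, sims=200_000, seed=2):
    """Fraction of games won when per-game runs for/against are modeled
    as independent normal draws with a shared standard deviation `sd`.
    (A toy model: runs aren't really normal, but it isolates sigma.)"""
    rng = random.Random(seed)
    wins = 0
    for _ in range(sims):
        rf = rng.gauss(mean_rf, sd)
        ra = rng.gauss(mean_ra, sd)
        wins += rf > ra  # continuous draws, so ties are negligible
    return wins / sims

# Same +0.5 run-per-game edge, increasingly volatile run distributions:
for sd in (0.0, 1.5, 3.0, 6.0):
    print(sd, round(win_pct(5.0, 4.5, sd), 3))
```

At sigma = 0 the team wins everything, matching the thought experiment above; as sigma grows, the same run differential buys fewer and fewer wins, which is exactly the kind of information a runs-only formula averages away.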

It seems to me that standard deviation could potentially help explain puzzles like "Why did the Angels consistently outperform their Pythagorean expected record in the 2000s?" or open up additional ways of evaluating players that complement WAR.

Without reading more or doing any further work (since I do have a day job), does anyone know whether the lack of attention to standard deviation is because:

  1. Calculations would become so complicated that most people would be turned off by sabermetrics.
  2. The standard deviation for run distribution doesn't vary much from team to team and year to year (and thus can be ignored).
  3. Standard deviation varies too much (i.e., it's not easily predicted and thus difficult to project).
  4. Been there, done that: this was already addressed by authors X, Y, and Z 50 years ago and I'm an idiot.
  5. Lack of resources to explore the idea more fully.

Thanks.