In "Quantifying the added importance of recent data", Eric T. starts with the statement "It almost certainly has to be true that the recent results matter more than the older ones, but how much more? If you're trying to guess how a goalie will do this year, is last year's performance 10 percent more important than the year before that, twice as important, or ten times as important?"
This is a completely unjustified assumption. It would make more sense to first ask whether prior performance has any significant relationship to current results at all.
Let's see whether there is any correlation between different years. For each year, I'm looking only at goalies with a four-year track record leading up to the year of analysis. So for 2003, Year-1 is 2002, Year-2 is 2001, Year-3 is 2000, and Year-4 is 1999. To be included in the analysis, a goalie had to play in all five seasons. I'm also just skipping over the lost season of 2004, so for 2005, Year-1 is 2003. Every value in these tables is a correlation coefficient. If you are bothered by the treatment of 2004, the years 2001-2003 and 2009-2012 are unaffected.
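For concreteness, here is a minimal R sketch of the computation behind each row of the table below. The data frame and column names (`goalies`, `sv0` for the target year, `sv1` through `sv4` for the prior seasons) are stand-ins I've invented, and the random data is only there so the snippet runs end to end:

```r
# Hedged sketch of one row of the correlation table. `goalies` stands in
# for the real data: one row per qualifying goalie, sv0 = save % in the
# target year, sv1..sv4 = the four prior seasons. Random data for illustration.
set.seed(42)
goalies <- as.data.frame(matrix(rbinom(40 * 5, 1200, 0.92) / 1200, ncol = 5,
                                dimnames = list(NULL, paste0("sv", 0:4))))
sapply(paste0("sv", 1:4), function(col) cor(goalies$sv0, goalies[[col]]))
```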
| Year | Year-1 | Year-2 | Year-3 | Year-4 |
| --- | --- | --- | --- | --- |
| 2001 | 0.2715 | 0.2236 | -0.0209 | -0.1878 |
| 2002 | 0.3376 | 0.1083 | 0.2405 | -0.0515 |
| 2003 | 0.4379 | 0.1310 | 0.2037 | 0.4460 |
| 2005 | 0.3814 | 0.0427 | 0.0813 | 0.1523 |
| 2006 | 0.1918 | 0.0818 | -0.0049 | 0.0400 |
| 2007 | 0.2496 | 0.1637 | -0.0540 | -0.1452 |
| 2008 | 0.3063 | 0.0975 | 0.0455 | 0.0373 |
| 2009 | 0.5158 | -0.0114 | 0.1230 | 0.0230 |
| 2010 | 0.1702 | 0.4042 | 0.2027 | -0.0708 |
| 2011 | 0.1741 | 0.1957 | 0.1336 | 0.1592 |
| 2012 | 0.4895 | 0.4167 | 0.3788 | 0.1765 |
| Average | 0.3205 | 0.1685 | 0.1208 | 0.0526 |
So Year-1 is a little better than the rest, but there is not much predictive power on average, and Years -2, -3, and -4 are just about useless. I would also point out that 2012 has a strange pattern: it correlates with Years -2, -3, and -4 to a degree much higher than average.
A correlation of 0.32 may sound impressive, but there really is not much relationship there. In any given year, the number of goalies is small enough that 0.32 is not quite statistically significant. Even if it were, it would mean that one variable explains only about 10% of the variability in the other (R-squared = 0.32² ≈ 0.10).
Eric T. continues: "One way to try to answer this is a direct analysis of how things have turned out for goalies in recent years, how their eventual performance compared to their most recently completed seasons."
This is what is called retrospective analysis. Essentially, you are looking at your data after the fact and drawing some lines to connect the dots. Sometimes you might come up with meaningful relationships. Sometimes you might not. If you look at enough random relationships some will seem to have associations. If you find some relationships, the way to determine whether these relationships are meaningful or not is to apply your analysis to another set of data. This is called a "validation sample" or a "hold-out sample". (You "hold" this data "out" of the first analysis.) Meaningful relationships will continue to be present in the validation sample. Spurious ones will not be there anymore.
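To see what a hold-out check looks like in practice, here is a minimal sketch using the same stand-in column names as above. The data is random noise, which is exactly the case where the holdout correlation should collapse:

```r
# Fit weights on one simulated "season," then test them on a season the
# fit never saw. A real relationship keeps most of its correlation in the
# holdout; a spurious one collapses. Data here is pure noise, for illustration.
make_season <- function(n = 40) as.data.frame(
  matrix(rbinom(n * 5, 1200, 0.92) / 1200, ncol = 5,
         dimnames = list(NULL, paste0("sv", 0:4))))
train   <- make_season()
holdout <- make_season()
fit <- lm(sv0 ~ sv1 + sv2 + sv3 + sv4, data = train)
cor(predict(fit, newdata = holdout), holdout$sv0)  # typically near zero:
                                                   # nothing real to find
```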
Eric T. suggests using the following system to predict a goalie's performance. "If we are predicting 2013, 2012 gets a weight of 100, 2011 gets a weight of 70, 2010 gets a weight of 50, and 2009 gets a weight of 30."
It is certainly true that for any year, there is some set of a, b, c, and d that optimizes the fit of the equation ESSP(year) = a*(year-1) + b*(year-2) + c*(year-3) + d*(year-4). If there are no real trends in the data, those optimal weights will vary from year to year. Conversely, if you try to use a single set of weightings across the board, the results will not be very good. If you use his formula to "predict" 2012, it looks pretty good: R is 0.618. The problem is that this weighting is designed to optimize the "prediction" of 2012. Because it is tuned to that one year, it is not robust. Let's look at the validation samples:
| Year | Eric T.'s Formula |
| --- | --- |
| 2001 | 0.2631 |
| 2002 | 0.3391 |
| 2003 | 0.4596 |
| 2005 | 0.2927 |
| 2006 | 0.1802 |
| 2007 | 0.2161 |
| 2008 | 0.2699 |
| 2009 | 0.3866 |
| 2010 | 0.3682 |
| 2011 | 0.2383 |
| Average | 0.3014 |
His equation "predicts" 2012 because it was derived from the 2012 data. Its overall performance in years other than 2012 is a little worse than how Year-1 does by itself.
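For reference, each row of the table above comes from a computation like the following sketch, reusing the hypothetical `goalies` data frame from earlier (the weights are Eric T.'s, normalized by their sum):

```r
# Eric T.'s fixed weights (100/70/50/30) applied to the four prior
# seasons, then correlated against the target season. `goalies` is the
# same hypothetical one-row-per-goalie data frame sketched earlier.
w <- c(100, 70, 50, 30)
pred <- as.matrix(goalies[, paste0("sv", 1:4)]) %*% w / sum(w)
cor(goalies$sv0, pred)
```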
Global Modeling
What do we get if we try to model the entire database as ESSP(year) = a*(year-1) + b*(year-2) + c*(year-3) + d*(year-4)? Here Year = y, Year-1 = x1, Year-2 = x2, etc.
> LinearModel.1 = lm(y ~ x1 + x2 + x3 + x4, data=CorrAll)
> summary(LinearModel.1)
Call:
lm(formula = y ~ x1 + x2 + x3 + x4, data = CorrAll)
Residuals:
Min 1Q Median 3Q Max
-0.139067 -0.006588 0.002703 0.011412 0.042301
Coefficients:
Variable Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.33635 0.09443 3.562 0.000411 ***
x1 0.39723 0.06366 6.240 1.09e-09 ***
x2 0.14869 0.07396 2.010 0.045047 *
x3 0.04766 0.05619 0.848 0.396821
x4 0.03565 0.05040 0.707 0.479819
Residual standard error: 0.01855 on 411 degrees of freedom
Multiple R-squared: 0.1128, Adjusted R-squared: 0.1042
F-statistic: 13.06 on 4 and 411 DF, p-value: 5.036e-10
> anova(LinearModel.1)
Analysis of Variance Table
Response: y
Df Sum Sq Mean Sq F value Pr(>F)
x1 1 0.015708 0.0157081 45.6259 4.906e-11 ***
x2 1 0.001751 0.0017508 5.0854 0.02465 *
x3 1 0.000359 0.0003589 1.0425 0.30785
x4 1 0.000172 0.0001722 0.5002 0.47982
Residuals 411 0.141499 0.0003443
The R-squared is 0.1128, meaning we are explaining only 11.28% of the variability. Since Year-3 and Year-4 (x3 and x4) really aren't contributing to the model, we can remove these without losing anything.
> LinearModel.2 = lm(y ~ x1 + x2, data=CorrAll)
> summary(LinearModel.2)
Call:
lm(formula = y ~ x1 + x2, data = CorrAll)
Residuals:
Min 1Q Median 3Q Max
-0.138896 -0.006842 0.002741 0.011437 0.042763
Coefficients:
Variable Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.39441 0.08165 4.830 1.92e-06 ***
x1 0.40171 0.06333 6.343 5.92e-10 ***
x2 0.16424 0.07279 2.256 0.0246 *
Residual standard error: 0.01854 on 413 degrees of freedom
Multiple R-squared: 0.1095, Adjusted R-squared: 0.1052
F-statistic: 25.38 on 2 and 413 DF, p-value: 4.006e-11
> anova(LinearModel.1, LinearModel.2)
Analysis of Variance Table
Model 1: y ~ x1 + x2 + x3 + x4
Model 2: y ~ x1 + x2
Res.Df RSS Df Sum of Sq F Pr(>F)
1 411 0.14150
2 413 0.14203 -2 -0.00053111 0.7713 0.4631
The model comparison confirms it: dropping Year-3 and Year-4 changes the fit by essentially nothing (F = 0.77, p = 0.46).
Career Save Percentage
You might argue that Eric T.'s formula is an approximation of Career Save Percentage. What if you just use Career Save Percentage? To do this, I used both the straightforward Frequentist approach, which is just observed saves divided by observed shots, and a Bayesian approach, which adjusts the Frequentist result toward the league average, since average goalies are the most common. The adjustments are generally small: these goalies all have 5 seasons of data or they wouldn't be in the analysis, and most of the differences between the two estimates are in the range of 0.002 to 0.004.
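The post doesn't spell out the prior, but shrinkage toward the league average is usually done with a beta prior. Here is a minimal sketch in that spirit; the prior strength is an illustrative assumption, not the value actually used:

```r
# Empirical-Bayes shrinkage of a career save percentage toward the
# league average, assuming a Beta prior. The prior strength of 1000
# pseudo-shots is an illustrative assumption, not the author's choice.
league_avg  <- 0.920
prior_shots <- 1000
alpha0 <- league_avg * prior_shots
beta0  <- (1 - league_avg) * prior_shots

bayes_career_svpct <- function(saves, shots) {
  (saves + alpha0) / (shots + alpha0 + beta0)
}

# A .930 observed career on 5000 shots shrinks to about .928:
bayes_career_svpct(saves = 4650, shots = 5000)
```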
Before I even did this part, I was sure of two things. (1) If the Bayesian approach was better, the effect would be negligible: a single season of data isn't enough to tell a 0.920 goalie from a 0.930 goalie, let alone distinguish a prediction based on a 0.926 goalie from one based on a 0.923 goalie. (2) Neither approach would do better than a correlation of about 0.3 on average.
| Year | Bayes | Frequentist |
| --- | --- | --- |
| 2001 | 0.0955 | 0.1900 |
| 2002 | 0.0782 | 0.1691 |
| 2003 | 0.1828 | 0.3047 |
| 2005 | 0.2049 | 0.3879 |
| 2006 | 0.5311 | 0.4863 |
| 2007 | 0.3261 | 0.1623 |
| 2008 | 0.3216 | 0.1481 |
| 2009 | 0.1914 | 0.0644 |
| 2010 | 0.2789 | 0.1897 |
| 2011 | 0.2567 | 0.1055 |
| 2012 | 0.4746 | 0.4802 |
| Average | 0.2674 | 0.2444 |
I love it when I'm right.
> LinearModel.4 = lm(y ~ FCareer, data=CorrPlus)
> summary(LinearModel.4)
Call:
lm(formula = y ~ FCareer, data = CorrPlus)
Residuals:
Min 1Q Median 3Q Max
-0.156420 -0.006965 0.002247 0.011310 0.050006
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.2250 0.1378 1.633 0.103
FCareer 0.7494 0.1500 4.995 8.71e-07 ***
Residual standard error: 0.01906 on 414 degrees of freedom
Multiple R-squared: 0.05683, Adjusted R-squared: 0.05455
F-statistic: 24.95 on 1 and 414 DF, p-value: 8.713e-07
> LinearModel.5 = lm(y ~ BAYES, data=CorrPlus)
> summary(LinearModel.5)
Call:
lm(formula = y ~ BAYES, data = CorrPlus)
Residuals:
Min 1Q Median 3Q Max
-0.152609 -0.006546 0.002930 0.011241 0.042726
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.09502 0.17877 -0.532 0.595
BAYES 1.09724 0.19455 5.640 3.16e-08 ***
Residual standard error: 0.01891 on 414 degrees of freedom
Multiple R-squared: 0.07135, Adjusted R-squared: 0.06911
F-statistic: 31.81 on 1 and 414 DF, p-value: 3.156e-08
Mistaking Randomness for a Pattern
Finally, let's look at a simulation: 600 goalies, each facing 1,200 shots in each of two seasons. Correlating Year1 against Year2, we get
Pearson's product-moment correlation
data: CorrSim$Year1 and CorrSim$Year2
t = 7.3861, df = 598, p-value = 5.103e-13
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.2140495 0.3608331
sample estimates:
cor
0.2891399
The correlation is about what Eric T.'s method gets. But this is just a random number generator. Here's where the correlation comes from: I think the NHL goalie population is about 10% elite (0.930) goalies, 80% average (0.920) goalies, and 10% below-average (0.910) goalies, and this simulated population mirrored that distribution. Isolating the average goalies:
Pearson's product-moment correlation
data: Corr920$Year1 and Corr920$Year2
t = -1.3745, df = 478, p-value = 0.1699
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.15139486 0.02690697
sample estimates:
cor
-0.06274459
Nothing there. Now the others:
Pearson's product-moment correlation
data: Corr930$Year1 and Corr930$Year2
t = 10.9827, df = 118, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.6093943 0.7895948
sample estimates:
cor
0.7109766
The talent distribution in the population is creating the apparent correlation. We're not predicting anything beyond the reality that an elite goalie is likely to wind up with a higher save percentage than a below-average goalie.
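Here is a hedged reconstruction of that simulation; the seed and exact mechanics are my assumptions, but the population mix and shot counts follow the description above:

```r
# 600 goalies: 10% true-.930, 80% true-.920, 10% true-.910, each facing
# 1200 binomially distributed shots in each of two seasons. The seed is
# arbitrary; the original simulation's details are unknown.
set.seed(1)
n <- 600
talent <- sample(c(0.930, 0.920, 0.910), n, replace = TRUE,
                 prob = c(0.10, 0.80, 0.10))
shots <- 1200
CorrSim <- data.frame(talent = talent,
                      Year1 = rbinom(n, shots, talent) / shots,
                      Year2 = rbinom(n, shots, talent) / shots)

cor.test(CorrSim$Year1, CorrSim$Year2)                          # mixed population
with(subset(CorrSim, talent == 0.920), cor.test(Year1, Year2))  # average goalies only
```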
So how to predict goalies?
You can use Eric T.'s formula. You can use 0.336 + 0.397*(Year-1) + 0.149*(Year-2) + 0.048*(Year-3) + 0.036*(Year-4). You could use 0.394 + 0.402*(Year-1) + 0.164*(Year-2). You could use the Frequentist Career Save Percentage (specifically, 0.225 + 0.749*FCareer). You could use the Bayesian Career Save Percentage (specifically, -0.095 + 1.097*Bayes). Hell, you could just use last year's save percentage. It really doesn't matter. None of these formulas work very well.
The problem is a signal-to-noise issue. These formulas are trying to predict the differences between goalies. Unfortunately, the differences are much smaller than the random variation from season to season.
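To put numbers on that: for a true .920 goalie facing 1,200 shots, one season's save percentage has a standard deviation of about 0.008 from shot luck alone, nearly as large as the 0.010 gap between an average and an elite goalie assumed above.

```r
# One-season standard deviation of save percentage for a true .920
# goalie on 1200 shots, from binomial shot luck alone.
p <- 0.920; shots <- 1200
sqrt(p * (1 - p) / shots)  # ~0.0078, versus a 0.010 talent gap between tiers
```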