Up next: can you predict Corsi? Going in, I expect the answer is yes (to a degree), and I expect it to work better than predicting goalies. Although we don't have as many years of data as we have for save percentage, a single season of Corsi can separate skaters into above-average, average, and below-average. To the extent that Corsi production reflects an underlying individual skill, I would expect the effect to persist from season to season.
Does last year's Corsi predict this year's Corsi?
It would make sense that it might. On the one hand, I think that Corsi reflects a skill that probably stays pretty constant. On the other hand, there is a contribution from the team. As teams get better (or worse) Corsi production will tend to trend with that. We saw that in Kopitar and Phaneuf when I looked at Corsi and Age. Once again, the correlations, at least with Year-1 and Year-2, look pretty promising.
| Correlation | Corsi2008 | Corsi2009 | Corsi2010 | Corsi2011 | Corsi2012 | Corsi2013 |
| --- | --- | --- | --- | --- | --- | --- |
| Corsi2007 | 0.5540634 | 0.3611624 | 0.4104081 | 0.2822338 | 0.3422943 | 0.2945781 |
| Corsi2008 | | 0.4710181 | 0.4296215 | 0.3784631 | 0.300416 | 0.3622723 |
| Corsi2009 | | | 0.5272602 | 0.5038071 | 0.3489298 | 0.3933111 |
| Corsi2010 | | | | 0.5890743 | 0.4596954 | 0.3975749 |
| Corsi2011 | | | | | 0.5676378 | 0.5376208 |
| Corsi2012 | | | | | | 0.5329138 |
One of the problems here, though, is that R tosses out anybody with missing values. Thus this correlation is not looking at all 1500 or so players in the database but only the 308 who have played in all 7 seasons.
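To see how much the handling of missing values can matter, here is a minimal sketch on simulated data. It assumes the matrix above came from something like `cor(..., use = "complete.obs")`, which is one way R ends up dropping every player with a missing season. The data frame `wide`, its columns, and the 30% missingness rate are invented for illustration.

```r
# Simulated stand-in for the wide player table: one row per player,
# one Corsi column per season, NA where the player has no data.
set.seed(1)
n <- 200
skill <- rnorm(n, sd = 5)                      # persistent per-player level
seasons <- paste0("Corsi", 2007:2010)
wide <- as.data.frame(sapply(seasons, function(s) skill + rnorm(n, sd = 6)))

# Blank out 60 random players in each season (30% missing per column)
for (s in seasons) wide[sample(n, 60), s] <- NA

# Listwise deletion: only players with every season present are used
listwise <- cor(wide, use = "complete.obs")

# Pairwise deletion: each cell uses every player who has both seasons
pairwise <- cor(wide, use = "pairwise.complete.obs")

# How many players stand behind each approach
n_listwise <- sum(complete.cases(wide))
present <- ifelse(is.na(as.matrix(wide)), 0, 1)
n_pairwise <- t(present) %*% present           # player counts per season pair

print(n_listwise)
print(n_pairwise)
print(round(pairwise, 2))
```

Pairwise deletion puts more players behind each cell, at the cost that different cells are based on different sets of players.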
Modeling
First, I just looked at the 308 players who played in all 7 seasons. If we use every prior year to predict 2013 we get:
> LinearModel.2 = lm(Corsi2013 ~ Corsi2012 + Corsi2011 + Corsi2010 +
+ Corsi2009 + Corsi2008 + Corsi2007 + 0, data=All7)
> summary(LinearModel.2)
Call:
lm(formula = Corsi2013 ~ Corsi2012 + Corsi2011 + Corsi2010 +
Corsi2009 + Corsi2008 + Corsi2007 + 0, data = All7)
Residuals:
Min 1Q Median 3Q Max
-39.590 -6.040 -0.738 4.909 31.497
Coefficients:
Estimate Std. Error t value Pr(>|t|)
Corsi2012 0.31622 0.05621 5.626 4.22e-08 ***
Corsi2011 0.30002 0.07001 4.285 2.46e-05 ***
Corsi2010 -0.01986 0.08182 -0.243 0.8084
Corsi2009 0.09721 0.06925 1.404 0.1614
Corsi2008 0.10954 0.05763 1.901 0.0583 .
Corsi2007 0.02143 0.06379 0.336 0.7371
Residual standard error: 8.369 on 302 degrees of freedom
(1242 observations deleted due to missingness)
Multiple R-squared: 0.3862, Adjusted R-squared: 0.374
F-statistic: 31.67 on 6 and 302 DF, p-value: < 2.2e-16
So Year-1 and Year-2 are highly significant, and Year-5 (Corsi2008, p = 0.058) is in the gray zone. What if we just use Year-1 and Year-2?
> LinearModel.2 = lm(Corsi2013 ~ Corsi2012 + Corsi2011 + 0, data=All7)
> summary(LinearModel.2)
Call:
lm(formula = Corsi2013 ~ Corsi2012 + Corsi2011 + 0, data = All7)
Residuals:
Min 1Q Median 3Q Max
-40.026 -5.882 -0.427 5.090 32.975
Coefficients:
Estimate Std. Error t value Pr(>|t|)
Corsi2012 0.33795 0.05494 6.151 2.41e-09 ***
Corsi2011 0.37503 0.06062 6.186 1.97e-09 ***
Residual standard error: 8.464 on 306 degrees of freedom
Multiple R-squared: 0.3639, Adjusted R-squared: 0.3597
F-statistic: 87.52 on 2 and 306 DF, p-value: < 2.2e-16
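Since the two-season model is nested inside the six-season one, an F-test via `anova()` is one way to check whether the four dropped seasons add anything beyond noise. The sketch below uses simulated data in place of `All7` (the data frame `sim` and its noise levels are made up), so its numbers say nothing about real players.

```r
# Simulated stand-in for All7: 308 players, seven seasons, no missing values
set.seed(2)
n <- 308
skill <- rnorm(n, sd = 6)
sim <- data.frame(sapply(paste0("Corsi", 2007:2013),
                         function(s) skill + rnorm(n, sd = 7)))

# Same two no-intercept specifications as in the text
full    <- lm(Corsi2013 ~ Corsi2012 + Corsi2011 + Corsi2010 +
                Corsi2009 + Corsi2008 + Corsi2007 + 0, data = sim)
reduced <- lm(Corsi2013 ~ Corsi2012 + Corsi2011 + 0, data = sim)

# F-test for the four seasons dropped from the reduced model
cmp <- anova(reduced, full)
print(cmp)

# Size of the change in R-squared, to judge practical importance
r2_full    <- summary(full)$r.squared
r2_reduced <- summary(reduced)$r.squared
print(r2_full - r2_reduced)
```

A small p-value in the `anova()` table together with a small change in R-squared would be the "statistically but not practically significant" situation.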
So the explanatory power of the model has dropped off a bit: R-squared has gone from 0.3862 to 0.3639. The drop is statistically significant but not of practical significance. However, when we fit the same model to all 592 players with data in 2011-2013, the fit gets much worse.
> LinearModel.3 = lm(Corsi2013 ~ Corsi2012 + Corsi2011 + 0, data=PredCorsi)
> summary(LinearModel.3)
Call:
lm(formula = Corsi2013 ~ Corsi2012 + Corsi2011 + 0, data = PredCorsi)
Residuals:
Min 1Q Median 3Q Max
-73.306 -6.350 -0.035 4.813 36.148
Coefficients:
Estimate Std. Error t value Pr(>|t|)
Corsi2012 0.31981 0.03982 8.032 5.24e-15 ***
Corsi2011 0.23964 0.04132 5.800 1.08e-08 ***
Residual standard error: 10.07 on 590 degrees of freedom
(958 observations deleted due to missingness)
Multiple R-squared: 0.2181, Adjusted R-squared: 0.2154
F-statistic: 82.28 on 2 and 590 DF, p-value: < 2.2e-16
R-squared drops to 0.2181. Next, I recoded the data and expanded the analysis to all years:
> LinearModel.2 = lm(Current ~ Year-1 + Year-2 + 0, data=CorsiYear)
> summary(LinearModel.2)
Call:
lm(formula = Current ~ Year-1 + Year-2 + 0, data = CorsiYear)
Residuals:
Min 1Q Median 3Q Max
-74.718 -6.031 -0.394 4.631 44.023
Coefficients:
Estimate Std. Error t value Pr(>|t|)
Year-1 0.36107 0.01794 20.12 <2e-16 ***
Year-2 0.18226 0.01688 10.80 <2e-16 ***
Residual standard error: 9.62 on 3009 degrees of freedom
Multiple R-squared: 0.216, Adjusted R-squared: 0.2155
F-statistic: 414.6 on 2 and 3009 DF, p-value: < 2.2e-16
Not great. Plus, this overestimates the explanatory power of the equation: it is optimized to this data set and may not generalize. The magnitude of the coefficients suggests that a lot of regression to the mean is happening, too.
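One way to gauge how optimistic the in-sample R-squared is would be to fit on part of the data and score the rest. The sketch below does that on simulated seasons (a fixed skill plus season-to-season noise, with made-up variances). `Prev1` and `Prev2` are hypothetical stand-ins for the Year-1 and Year-2 columns.

```r
# Simulated player-seasons: observed Corsi = persistent skill + noise
set.seed(3)
n <- 1000
skill <- rnorm(n, sd = 5)
noisy <- function() skill + rnorm(n, sd = 8)   # one observed season
sim <- data.frame(Current = noisy(), Prev1 = noisy(), Prev2 = noisy())

# Hold out 30% of the rows
train_rows <- sample(n, 700)
train <- sim[train_rows, ]
test  <- sim[-train_rows, ]

# Same no-intercept specification as in the text
fit  <- lm(Current ~ Prev1 + Prev2 + 0, data = train)
pred <- predict(fit, newdata = test)

# R-squared measured about zero, as summary() does for a no-intercept fit
r2_in  <- 1 - sum(residuals(fit)^2) / sum(train$Current^2)
r2_out <- 1 - sum((test$Current - pred)^2) / sum(test$Current^2)

print(coef(fit))
print(c(in_sample = r2_in, held_out = r2_out))
```

With these settings the two slopes sum to well under 1, which is the regression to the mean: a noisy single season is only partly carried forward into the prediction.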
Requiring 300+ Minutes
I removed all seasons with fewer than 300 minutes and reran the analysis. Not surprisingly, the model did better, explaining about 34% of the total variability.
> LinearModel.1 = lm(Current ~ Year-1 + Year-2 + 0, data=CorsiYr300)
> summary(LinearModel.1)
Call:
lm(formula = Current ~ Year-1 + Year-2 + 0, data = CorsiYr300)
Residuals:
Min 1Q Median 3Q Max
-28.618 -4.985 -0.081 4.469 30.136
Coefficients:
Estimate Std. Error t value Pr(>|t|)
Year-1 0.47468 0.02215 21.430 <2e-16 ***
Year-2 0.19478 0.02296 8.485 <2e-16 ***
Residual standard error: 7.135 on 2091 degrees of freedom
Multiple R-squared: 0.3413, Adjusted R-squared: 0.3407
F-statistic: 541.8 on 2 and 2091 DF, p-value: < 2.2e-16
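The recoding into Current / Year-1 / Year-2 rows and the minutes cutoff could be done along these lines. This is only a sketch: the toy table, the `Min2010` through `Min2013` column names, and `Prev1`/`Prev2` are assumptions, not the layout of the actual data set.

```r
# Hypothetical wide table: one row per player, Corsi and minutes per season
wide <- data.frame(
  Player    = c("A", "B", "C", "D"),
  Corsi2010 = c( 3.0,  1.0,   NA, -4.0),
  Corsi2011 = c( 5.0, -3.0,   NA, -6.0),
  Corsi2012 = c( 4.0, -1.0,  2.0, -5.0),
  Corsi2013 = c( 6.0, -2.0,  1.0, -3.0),
  Min2010   = c( 850,  700,   NA,  120),
  Min2011   = c( 900,  250,   NA,  640),
  Min2012   = c( 950,  400,  600,  710),
  Min2013   = c( 800,  500,  700,  690)
)
seasons <- 2010:2013
min_minutes <- 300

# Treat any season below the minutes cutoff as missing
for (y in seasons) {
  mins  <- wide[[paste0("Min", y)]]
  short <- !is.na(mins) & mins < min_minutes
  wide[short, paste0("Corsi", y)] <- NA
}

# Stack one row per player-season, for seasons with two prior seasons
long <- do.call(rbind, lapply(seasons[-(1:2)], function(y) {
  data.frame(Player  = wide$Player,
             Season  = y,
             Current = wide[[paste0("Corsi", y)]],
             Prev1   = wide[[paste0("Corsi", y - 1)]],
             Prev2   = wide[[paste0("Corsi", y - 2)]])
}))

# Keep only rows where all three seasons survived the cutoff
long <- long[complete.cases(long), ]
print(long)
```

The resulting `long` table can then go straight into a call like `lm(Current ~ Prev1 + Prev2 + 0, data = long)`.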
Does Career Corsi predict this year's Corsi?
Once again, if you look at the correlations, it looks pretty good. As before, these correlations are not across the board but only for the 308 players who have played in all 7 seasons.
| Year | Correlation |
| --- | --- |
| 2008 | 0.554034 |
| 2009 | 0.4802796 |
| 2010 | 0.5648495 |
| 2011 | 0.559835 |
| 2012 | 0.5339196 |
| 2013 | 0.5659859 |
Modeling
If we use Career Corsi to predict this year's Corsi, we get
> LinearModel.1 = lm(Current ~ Career + 0, data=CorsiCareer)
> summary(LinearModel.1)
Call:
lm(formula = Current ~ Career + 0, data = CorsiCareer)
Residuals:
Min 1Q Median 3Q Max
-145.966 -6.537 -0.655 5.138 101.528
Coefficients:
Estimate Std. Error t value Pr(>|t|)
Career 0.39940 0.01631 24.48 <2e-16 ***
Residual standard error: 11.55 on 4405 degrees of freedom
Multiple R-squared: 0.1198, Adjusted R-squared: 0.1196
F-statistic: 599.4 on 1 and 4405 DF, p-value: < 2.2e-16
The coefficient here, 0.39940, suggests that even career numbers are regressing to the mean quite a bit. Although the model is highly significant, it explains only 12% of the total variability.
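As a quick worked example of what that slope implies, the snippet below applies it to a few made-up career values. Because the model has no intercept, predictions shrink toward zero.

```r
b <- 0.39940                       # slope from the Career model above
career <- c(-10, -5, 0, 5, 10)     # illustrative career Corsi values (made up)

# Prediction from the no-intercept model: Current = b * Career
predicted <- b * career

# How far each prediction is pulled back toward zero
shrink <- data.frame(career, predicted, pulled_back = career - predicted)
print(shrink)
```

Under this fit, roughly 60% of a player's distance from zero is given back in the prediction.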
Conclusions
You can (sort of) predict Corsi using either the last two years or Corsi for the player's career. Using the last two years gives better answers. Limiting the analysis to players with at least 300 minutes helps.
I have a couple of thoughts on ways to construct pseudo-Bayesian estimates of Career Corsi. I suspect they will work a little better than the unadjusted career numbers. It will take me a while to run them, but once I do I will post the results.