Up next, can you predict Corsi? Going into this I expect that the answer is yes (to a degree). I expect it to work better than predicting goalies did. Although we don't have as many years of data as we have for save percentage, a single season of Corsi can separate skaters into above-average, average, and below-average. To the extent that Corsi production reflects an underlying individual skill, I would expect the effect to persist from season to season.
Does last year's Corsi predict this year's Corsi?
It would make sense that it might. On the one hand, I think that Corsi reflects a skill that probably stays pretty constant. On the other hand, there is a contribution from the team. As teams get better (or worse), Corsi production will tend to trend with that. We saw that in Kopitar and Phaneuf when I looked at Corsi and Age. Once again, the correlations, at least with Year-1 and Year-2 (one and two seasons back), look pretty promising.
| Correlation | Corsi2008 | Corsi2009 | Corsi2010 | Corsi2011 | Corsi2012 | Corsi2013 |
|---|---|---|---|---|---|---|
| Corsi2007 | 0.5540634 | 0.3611624 | 0.4104081 | 0.2822338 | 0.3422943 | 0.2945781 |
| Corsi2008 | | 0.4710181 | 0.4296215 | 0.3784631 | 0.300416 | 0.3622723 |
| Corsi2009 | | | 0.5272602 | 0.5038071 | 0.3489298 | 0.3933111 |
| Corsi2010 | | | | 0.5890743 | 0.4596954 | 0.3975749 |
| Corsi2011 | | | | | 0.5676378 | 0.5376208 |
| Corsi2012 | | | | | | 0.5329138 |
One of the problems here, though, is that R tosses out anybody with missing values. Thus this correlation is not looking at all 1500 or so players in the database but only the 308 who have played in all 7 seasons.
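To see how much the missing-value handling matters, here is a minimal sketch on made-up data. The data frame `wide` and every number in it are invented for illustration; this is not the real player table. It assumes the matrix above came from something like `cor(..., use = "complete.obs")`, and compares that with `use = "pairwise.complete.obs"`, which lets each cell use every player who has both of the two seasons involved.

```r
# Synthetic stand-in for the wide player table (one row per player,
# one column per season). All values here are simulated assumptions.
set.seed(1)
n <- 200
skill <- rnorm(n, sd = 6)
wide <- data.frame(
  Corsi2011 = skill + rnorm(n, sd = 6),
  Corsi2012 = skill + rnorm(n, sd = 6),
  Corsi2013 = skill + rnorm(n, sd = 6)
)

# Blank out some seasons to mimic players who did not play every year.
wide$Corsi2011[sample(n, 80)] <- NA
wide$Corsi2013[sample(n, 60)] <- NA

# Complete cases: every cell is computed on the same players,
# namely those with a value in all three seasons.
cc <- cor(wide, use = "complete.obs")
n_complete <- sum(complete.cases(wide))

# Pairwise: each cell uses every player who has both seasons of that pair.
pw <- cor(wide, use = "pairwise.complete.obs")
n_pair_12_13 <- sum(!is.na(wide$Corsi2012) & !is.na(wide$Corsi2013))

print(round(cc, 3))
print(round(pw, 3))
cat("players in the complete-case matrix:", n_complete, "\n")
cat("players with both 2012 and 2013:   ", n_pair_12_13, "\n")
```

The pairwise version uses more players per cell, but the cells are no longer computed on the same group, so they are not strictly comparable with each other.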
Modeling
First, I just looked at the 308 players who played in all 7 seasons. If we use every prior year to predict 2013 we get:
> LinearModel.2 = lm(Corsi2013 ~ Corsi2012 + Corsi2011 + Corsi2010 +
+ Corsi2009 + Corsi2008 + Corsi2007 + 0, data=All7)
> summary(LinearModel.2)
Call:
lm(formula = Corsi2013 ~ Corsi2012 + Corsi2011 + Corsi2010 +
Corsi2009 + Corsi2008 + Corsi2007 + 0, data = All7)
Residuals:
Min 1Q Median 3Q Max
-39.590 -6.040 0.738 4.909 31.497
Coefficients:
Estimate Std. Error t value Pr(>|t|)
Corsi2012 0.31622 0.05621 5.626 4.22e-08 ***
Corsi2011 0.30002 0.07001 4.285 2.46e-05 ***
Corsi2010 0.01986 0.08182 0.243 0.8084
Corsi2009 0.09721 0.06925 1.404 0.1614
Corsi2008 0.10954 0.05763 1.901 0.0583 .
Corsi2007 0.02143 0.06379 0.336 0.7371
Residual standard error: 8.369 on 302 degrees of freedom
(1242 observations deleted due to missingness)
Multiple R-squared: 0.3862, Adjusted R-squared: 0.374
F-statistic: 31.67 on 6 and 302 DF, p-value: < 2.2e-16
So Year-1 and Year-2 (Corsi2012 and Corsi2011) are highly significant. Year-5 (Corsi2008) is in the gray zone at p = 0.058. What if we just use Year-1 and Year-2?
> LinearModel.2 = lm(Corsi2013 ~ Corsi2012 + Corsi2011 + 0, data=All7)
> summary(LinearModel.2)
Call:
lm(formula = Corsi2013 ~ Corsi2012 + Corsi2011 + 0, data = All7)
Residuals:
Min 1Q Median 3Q Max
-40.026 -5.882 0.427 5.090 32.975
Coefficients:
Estimate Std. Error t value Pr(>|t|)
Corsi2012 0.33795 0.05494 6.151 2.41e-09 ***
Corsi2011 0.37503 0.06062 6.186 1.97e-09 ***
Residual standard error: 8.464 on 306 degrees of freedom
Multiple R-squared: 0.3639, Adjusted R-squared: 0.3597
F-statistic: 87.52 on 2 and 306 DF, p-value: < 2.2e-16
So the explanatory power of the model has dropped off a bit: R-squared has gone from 0.3862 to 0.3639. The model is still highly significant, and a drop that small is not of practical significance. However, when we use this model on all 592 players with data in 2011-2013, the results drop way off.
> LinearModel.3 = lm(Corsi2013 ~ Corsi2012 + Corsi2011 + 0, data=PredCorsi)
> summary(LinearModel.3)
Call:
lm(formula = Corsi2013 ~ Corsi2012 + Corsi2011 + 0, data = PredCorsi)
Residuals:
Min 1Q Median 3Q Max
-73.306 -6.350 0.035 4.813 36.148
Coefficients:
Estimate Std. Error t value Pr(>|t|)
Corsi2012 0.31981 0.03982 8.032 5.24e-15 ***
Corsi2011 0.23964 0.04132 5.800 1.08e-08 ***
Residual standard error: 10.07 on 590 degrees of freedom
(958 observations deleted due to missingness)
Multiple R-squared: 0.2181, Adjusted R-squared: 0.2154
F-statistic: 82.28 on 2 and 590 DF, p-value: < 2.2e-16
R-squared drops to 0.2181. Next, I recoded the data so that each row pairs a season (Current) with the player's two preceding seasons (Year1 and Year2), and expanded the analysis to all years:
> LinearModel.2 = lm(Current ~ Year1 + Year2 + 0, data=CorsiYear)
> summary(LinearModel.2)
Call:
lm(formula = Current ~ Year1 + Year2 + 0, data = CorsiYear)
Residuals:
Min 1Q Median 3Q Max
-74.718 -6.031 0.394 4.631 44.023
Coefficients:
Estimate Std. Error t value Pr(>|t|)
Year1 0.36107 0.01794 20.12 <2e-16 ***
Year2 0.18226 0.01688 10.80 <2e-16 ***
Residual standard error: 9.62 on 3009 degrees of freedom
Multiple R-squared: 0.216, Adjusted R-squared: 0.2155
F-statistic: 414.6 on 2 and 3009 DF, p-value: < 2.2e-16
Not great. Plus, this overestimates the explanatory power of the equation: the coefficients are optimized to this data set and may not generalize. The magnitude of the coefficients (they sum to about 0.54) suggests that a lot of regression to the mean is happening, too.
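One way to get a feel for how much the in-sample number flatters the model is to fit on part of the data and score the rest. The sketch below does that on simulated data; the data frame `d`, the 70/30 split, and the noise levels are all assumptions for illustration, not anything from the real CorsiYear table. Because the model has no intercept, `summary()` reports R-squared relative to zero rather than to the mean, so the held-out version uses the same definition.

```r
# Simulated player-seasons: a persistent "skill" plus season-to-season noise.
set.seed(42)
n <- 1000
skill <- rnorm(n, sd = 6)
d <- data.frame(
  Current = skill + rnorm(n, sd = 8),
  Year1   = skill + rnorm(n, sd = 8),
  Year2   = skill + rnorm(n, sd = 8)
)

# Hold out 30% of the rows; fit only on the other 70%.
train_idx <- sample(n, size = round(0.7 * n))
train <- d[train_idx, ]
test  <- d[-train_idx, ]

fit  <- lm(Current ~ Year1 + Year2 + 0, data = train)
pred <- predict(fit, newdata = test)

# In-sample R-squared as lm reports it for a no-intercept model
# (1 - RSS / sum(y^2)), and the same quantity on the held-out rows.
r2_in  <- summary(fit)$r.squared
r2_out <- 1 - sum((test$Current - pred)^2) / sum(test$Current^2)

print(coef(fit))
cat("in-sample R-squared: ", round(r2_in, 3), "\n")
cat("held-out R-squared:  ", round(r2_out, 3), "\n")
```

With real data, a natural split would be by season (fit on earlier years, score the latest one) rather than at random, since that is how the prediction would actually be used.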
Requiring 300+ Minutes
I removed all seasons with fewer than 300 minutes and reran the analysis. Not surprisingly, the model did better, explaining about 34% of the total variability.
> LinearModel.1 = lm(Current ~ Year1 + Year2 + 0, data=CorsiYr300)
> summary(LinearModel.1)
Call:
lm(formula = Current ~ Year1 + Year2 + 0, data = CorsiYr300)
Residuals:
Min 1Q Median 3Q Max
-28.618 -4.985 0.081 4.469 30.136
Coefficients:
Estimate Std. Error t value Pr(>|t|)
Year1 0.47468 0.02215 21.430 <2e-16 ***
Year2 0.19478 0.02296 8.485 <2e-16 ***
Residual standard error: 7.135 on 2091 degrees of freedom
Multiple R-squared: 0.3413, Adjusted R-squared: 0.3407
F-statistic: 541.8 on 2 and 2091 DF, p-value: < 2.2e-16
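The recoding and the minutes filter are not shown above, so here is a hedged sketch of one way they could be done in base R. Everything in it is hypothetical: the toy table `wide`, the `TOI` column names for minutes, and the helper `make_lagged()`. It also takes one particular reading of the filter, keeping a row only if the current season and both prior seasons clear the threshold.

```r
# Toy wide table: Corsi and minutes (TOI) per season. The column names and
# values are made up; the actual data layout is not shown in the post.
wide <- data.frame(
  Player    = c("A", "B", "C"),
  Corsi2010 = c(5.1, -3.2, NA),
  Corsi2011 = c(4.0, -1.0, 2.5),
  Corsi2012 = c(6.3, NA, 1.0),
  Corsi2013 = c(3.8, -2.2, 0.4),
  TOI2010   = c(900, 450, NA),
  TOI2011   = c(950, 450, 280),
  TOI2012   = c(870, NA, 700),
  TOI2013   = c(910, 500, 650)
)

# Build one row per player-season: Current plus the two preceding seasons.
# Rows with a missing season, or with any of the three seasons under
# min_toi minutes, are dropped.
make_lagged <- function(wide, years, min_toi = 0) {
  rows <- lapply(years, function(y) {
    col <- function(prefix, yr) wide[[paste0(prefix, yr)]]
    out <- data.frame(
      Player  = wide$Player,
      Season  = y,
      Current = col("Corsi", y),
      Year1   = col("Corsi", y - 1),
      Year2   = col("Corsi", y - 2)
    )
    toi  <- cbind(col("TOI", y), col("TOI", y - 1), col("TOI", y - 2))
    keep <- complete.cases(out, toi) &
      rowSums(toi < min_toi, na.rm = TRUE) == 0
    out[keep, ]
  })
  do.call(rbind, rows)
}

all_rows <- make_lagged(wide, 2012:2013)
rows_300 <- make_lagged(wide, 2012:2013, min_toi = 300)
print(all_rows)
print(rows_300)

# With a real table, the fit would then be along the lines of:
# lm(Current ~ Year1 + Year2 + 0, data = rows_300)
```

In this toy example player C's 2013 row survives the unfiltered version but is dropped at 300 minutes, because the 2011 season feeding its Year2 value is only 280 minutes.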
Does Career Corsi predict this year's Corsi?
Once again, if you look at the correlations, it looks pretty good. As before, these correlations are not across the board but only for the 308 players who have played in all 7 seasons.
| Year | Correlation |
|---|---|
| 2008 | 0.554034 |
| 2009 | 0.4802796 |
| 2010 | 0.5648495 |
| 2011 | 0.559835 |
| 2012 | 0.5339196 |
| 2013 | 0.5659859 |
Modeling
If we use Career Corsi to predict this year's Corsi, we get:
> LinearModel.1 = lm(Current ~ Career + 0, data=CorsiCareer)
> summary(LinearModel.1)
Call:
lm(formula = Current ~ Career + 0, data = CorsiCareer)
Residuals:
Min 1Q Median 3Q Max
-145.966 -6.537 0.655 5.138 101.528
Coefficients:
Estimate Std. Error t value Pr(>|t|)
Career 0.39940 0.01631 24.48 <2e-16 ***
Residual standard error: 11.55 on 4405 degrees of freedom
Multiple R-squared: 0.1198, Adjusted R-squared: 0.1196
F-statistic: 599.4 on 1 and 4405 DF, p-value: < 2.2e-16
The coefficient here, 0.39940, suggests that even career numbers regress to the mean quite a bit: the model predicts a player will keep only about 40% of his career Corsi. Although the model is highly significant, it explains only 12% of the total variability seen.
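To make the regression to the mean concrete, here is a small worked example that plugs a few round Corsi values into the coefficients reported above. The input values are arbitrary, chosen only for illustration. Since the models have no intercept, the prediction is just the coefficient times the input, so a coefficient (or sum of coefficients) below 1 pulls every prediction toward zero.

```r
# Coefficients copied from the model summaries in this post.
coef_career <- 0.39940              # Current ~ Career + 0
coef_year1  <- 0.47468              # Current ~ Year1 + Year2 + 0 (300+ minutes)
coef_year2  <- 0.19478

# Arbitrary illustrative inputs, in whatever units the Corsi data use.
corsi_in <- c(-10, -5, 0, 5, 10)

# Career model: the prediction keeps about 40% of the career number.
pred_career <- coef_career * corsi_in

# Two-year model, for a player who posted the same Corsi in both prior
# seasons: the prediction keeps about 67% of it.
pred_two_year <- (coef_year1 + coef_year2) * corsi_in

print(data.frame(
  Input         = corsi_in,
  CareerModel   = round(pred_career, 2),
  TwoYearModel  = round(pred_two_year, 2)
))
```

So a player at +10 is predicted at roughly +4 by the career model and roughly +6.7 by the two-year, 300-minute model, which is another way of seeing that the last two seasons carry more signal.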
Conclusions
You can (sort of) predict Corsi using either the last two years or Corsi for the player's career. Using the last two years gives better answers. Limiting the analysis to players with at least 300 minutes helps.
I have a couple of thoughts on ways to construct pseudo-Bayesian estimates of Career Corsi. I suspect they will work a little better than the unadjusted career numbers. It will take me a while to run them, but once I do I will post the results.