Predicting Corsi

You can (sort of) predict Corsi using either the last two years or Corsi for the player's career.

Steen and Oshie generating Corsi events.
Dennis Wierzbicki-USA TODAY Sports

Up next: can you predict Corsi? Going in, I expect the answer to be yes, to a degree, and I expect it to work better than predicting goalies. Although we don't have as many years of data as we have for save percentage, a single season of Corsi can separate skaters into above-average, average, and below-average. To the extent that Corsi production reflects an underlying individual skill, I would expect the effect to persist from season to season.

Does last year's Corsi predict this year's Corsi?

It would make sense that it might. On the one hand, I think that Corsi reflects a skill that probably stays pretty constant. On the other hand, there is a contribution from the team. As teams get better (or worse), Corsi production will tend to move with them. We saw that in Kopitar and Phaneuf when I looked at Corsi and Age. Once again, the correlations, at least at lags of one and two years (Year-1 and Year-2), look pretty promising.

| Correlation | Corsi2008 | Corsi2009 | Corsi2010 | Corsi2011 | Corsi2012 | Corsi2013 |
|---|---|---|---|---|---|---|
| Corsi2007 | 0.5540634 | 0.3611624 | 0.4104081 | 0.2822338 | 0.3422943 | 0.2945781 |
| Corsi2008 | | 0.4710181 | 0.4296215 | 0.3784631 | 0.300416 | 0.3622723 |
| Corsi2009 | | | 0.5272602 | 0.5038071 | 0.3489298 | 0.3933111 |
| Corsi2010 | | | | 0.5890743 | 0.4596954 | 0.3975749 |
| Corsi2011 | | | | | 0.5676378 | 0.5376208 |
| Corsi2012 | | | | | | 0.5329138 |

One of the problems here, though, is that R tosses out anybody with missing values. Thus this correlation is not looking at all 1500 or so players in the database but only the 308 who have played in all 7 seasons.
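To make the missing-value point concrete, here is a minimal sketch with made-up numbers (the six players and their values are hypothetical, not from the database). It contrasts dropping every player who has any missing season with the pairwise option of `cor()`, which keeps every player who has both seasons in a given pair.

```r
# Toy data: six hypothetical players, three seasons; NA = did not play.
corsi <- data.frame(
  Corsi2011 = c(5, -3,  8, NA,  1, -6),
  Corsi2012 = c(4, -1, NA,  2,  3, -4),
  Corsi2013 = c(6, -2,  7,  1, NA, -5)
)

# Listwise deletion: a player counts only if every season is present.
n_listwise <- sum(complete.cases(corsi))

# Pairwise deletion: for one pair of seasons, a player counts if both
# of those two seasons are present.
n_pair_2011_2013 <- sum(complete.cases(corsi[, c("Corsi2011", "Corsi2013")]))

# The same two policies applied to the whole correlation matrix.
r_listwise <- cor(corsi, use = "complete.obs")
r_pairwise <- cor(corsi, use = "pairwise.complete.obs")

print(c(listwise = n_listwise, pairwise_2011_2013 = n_pair_2011_2013))
```

With pairwise deletion each cell of the matrix can rest on a different set of players, so the cells are not strictly comparable. That is the price of using more of the data.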

Modeling

First, I just looked at the 308 players who played in all 7 seasons. If we use every prior year to predict 2013 we get:

> LinearModel.2 = lm(Corsi2013 ~ Corsi2012 + Corsi2011 + Corsi2010 +
+   Corsi2009 + Corsi2008 + Corsi2007 + 0, data=All7)
> summary(LinearModel.2)

Call:
lm(formula = Corsi2013 ~ Corsi2012 + Corsi2011 + Corsi2010 +
    Corsi2009 + Corsi2008 + Corsi2007 + 0, data = All7)

Residuals:
    Min      1Q  Median      3Q     Max
-39.590  -6.040  -0.738   4.909  31.497

Coefficients:
          Estimate Std. Error t value Pr(>|t|)
Corsi2012  0.31622    0.05621   5.626 4.22e-08 ***
Corsi2011  0.30002    0.07001   4.285 2.46e-05 ***
Corsi2010 -0.01986    0.08182  -0.243   0.8084
Corsi2009  0.09721    0.06925   1.404   0.1614
Corsi2008  0.10954    0.05763   1.901   0.0583 .
Corsi2007  0.02143    0.06379   0.336   0.7371

Residual standard error: 8.369 on 302 degrees of freedom
  (1242 observations deleted due to missingness)
Multiple R-squared: 0.3862, Adjusted R-squared: 0.374
F-statistic: 31.67 on 6 and 302 DF, p-value: < 2.2e-16

So Year-1 and Year-2 (Corsi2012 and Corsi2011) are highly significant, and Year-5 (Corsi2008, p = 0.0583) is in the gray zone. What if we just use Year-1 and Year-2?

> LinearModel.2 = lm(Corsi2013 ~ Corsi2012 + Corsi2011 + 0, data=All7)
> summary(LinearModel.2)

Call:
lm(formula = Corsi2013 ~ Corsi2012 + Corsi2011 + 0, data = All7)

Residuals:
    Min      1Q  Median      3Q     Max
-40.026  -5.882  -0.427   5.090  32.975

Coefficients:
          Estimate Std. Error t value Pr(>|t|)
Corsi2012  0.33795    0.05494   6.151 2.41e-09 ***
Corsi2011  0.37503    0.06062   6.186 1.97e-09 ***

Residual standard error: 8.464 on 306 degrees of freedom
Multiple R-squared: 0.3639, Adjusted R-squared: 0.3597
F-statistic: 87.52 on 2 and 306 DF, p-value: < 2.2e-16
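Assuming both fits use the same 308 players, as the degrees of freedom suggest, the two-predictor model is nested in the six-predictor one. The printed summaries then contain enough to sketch a partial F-test for the four dropped seasons. This is a reconstruction from rounded output, not something run on the original data, and the variable names are illustrative.

```r
# Residual standard errors and degrees of freedom copied from the two summaries.
rse_full <- 8.369; df_full <- 302   # Corsi2012 ... Corsi2007
rse_red  <- 8.464; df_red  <- 306   # Corsi2012 + Corsi2011 only

# Residual sum of squares = (residual standard error)^2 * residual df.
rss_full <- rse_full^2 * df_full
rss_red  <- rse_red^2  * df_red

# Partial F statistic for the four predictors that were dropped.
df_diff <- df_red - df_full
f_stat  <- ((rss_red - rss_full) / df_diff) / (rss_full / df_full)
p_val   <- pf(f_stat, df_diff, df_full, lower.tail = FALSE)

print(c(F = f_stat, p = p_val))
```

With the rounded values printed above, the statistic comes out near 2.75 on 4 and 302 degrees of freedom, and the p-value lands below 0.05. With the original data frame at hand, `anova()` on the two fitted models would give the exact version of this comparison.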

So the explanatory power of the model has dropped off a bit: R-squared has gone from 0.3862 to 0.3639. The drop is statistically significant but too small to be of practical significance. However, when we fit the same two-year model to all 592 players with data in 2011-2013, the fit drops way off.

> LinearModel.3 = lm(Corsi2013 ~ Corsi2012 + Corsi2011 + 0, data=PredCorsi)
> summary(LinearModel.3)

Call:
lm(formula = Corsi2013 ~ Corsi2012 + Corsi2011 + 0, data = PredCorsi)

Residuals:
    Min      1Q  Median      3Q     Max
-73.306  -6.350  -0.035   4.813  36.148

Coefficients:
          Estimate Std. Error t value Pr(>|t|)
Corsi2012  0.31981    0.03982   8.032 5.24e-15 ***
Corsi2011  0.23964    0.04132   5.800 1.08e-08 ***

Residual standard error: 10.07 on 590 degrees of freedom
  (958 observations deleted due to missingness)
Multiple R-squared: 0.2181, Adjusted R-squared: 0.2154
F-statistic: 82.28 on 2 and 590 DF, p-value: < 2.2e-16

R-squared drops to 0.2181. Next, I recoded the data into Current, Year-1, and Year-2 columns and expanded the analysis to all years.
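The recoding step is not shown. One way it could be done, sketched here on a made-up three-player table, is to stack one row per player-season that has both preceding seasons available. The column names `Year1` and `Year2` are stand-ins for the Year-1 and Year-2 variables in the model below, and the `Player` and `Season` columns are illustrative.

```r
# Toy wide table: one row per hypothetical player, one column per season.
wide <- data.frame(
  Player    = c("A", "B", "C"),
  Corsi2010 = c( 3, NA,  2),
  Corsi2011 = c( 4, -1, NA),
  Corsi2012 = c( 6, -2,  1),
  Corsi2013 = c( 5, -1,  0)
)

seasons <- 2010:2013
rows <- list()
# The first two seasons have no two-year history, so targets start at the third.
for (yr in seasons[-(1:2)]) {
  rows[[length(rows) + 1]] <- data.frame(
    Player  = wide$Player,
    Season  = yr,
    Current = wide[[paste0("Corsi", yr)]],
    Year1   = wide[[paste0("Corsi", yr - 1)]],
    Year2   = wide[[paste0("Corsi", yr - 2)]]
  )
}

# Stack the per-season pieces and drop any row with a missing value.
long <- na.omit(do.call(rbind, rows))
print(long)
```

Each player can now contribute several rows, one per qualifying season, which is why the pooled fit has far more observations than the single-season fits.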

> LinearModel.2 = lm(Current ~ `Year-1` + `Year-2` + 0, data=CorsiYear)
> summary(LinearModel.2)

Call:
lm(formula = Current ~ `Year-1` + `Year-2` + 0, data = CorsiYear)

Residuals:
    Min      1Q  Median      3Q     Max
-74.718  -6.031  -0.394   4.631  44.023

Coefficients:
         Estimate Std. Error t value Pr(>|t|)
`Year-1`  0.36107    0.01794   20.12   <2e-16 ***
`Year-2`  0.18226    0.01688   10.80   <2e-16 ***

Residual standard error: 9.62 on 3009 degrees of freedom
Multiple R-squared: 0.216, Adjusted R-squared: 0.2155
F-statistic: 414.6 on 2 and 3009 DF, p-value: < 2.2e-16

Not great. Plus, this overestimates the explanatory power of the equation: the R-squared is computed on the same data the coefficients were fit to, so the model is optimized to this data set and may not generalize. The magnitude of the coefficients suggests that a lot of regression to the mean is happening, too.
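The regression to the mean is easy to see by plugging numbers into the fitted equation. The model has no intercept, so a prediction is just a weighted sum of the two prior seasons. The function name and the example inputs below are illustrative; the two coefficients are the ones from the all-years fit.

```r
# Coefficients from the all-years fit.
b_year1 <- 0.36107
b_year2 <- 0.18226

# Predicted current-season Corsi from the two preceding seasons.
predict_corsi <- function(year1, year2) {
  b_year1 * year1 + b_year2 * year2
}

# A hypothetical skater who posted +10 in each of the last two seasons,
# and one who posted -10 in each.
good_player <- predict_corsi(10, 10)
poor_player <- predict_corsi(-10, -10)

# The weights sum to well under 1, so predictions shrink toward zero.
total_weight <- b_year1 + b_year2

print(c(good = good_player, poor = poor_player, weight = total_weight))
```

The two weights sum to about 0.54, so a player who was consistently at +10 is predicted to land a little above +5, and the same pull toward zero applies on the negative side.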

Requiring 300+ Minutes

I removed all seasons with fewer than 300 minutes and reran the analysis. Not surprisingly, the model did better, explaining about 34% of the total variability.

> LinearModel.1 = lm(Current ~ `Year-1` + `Year-2` + 0, data=CorsiYr300)
> summary(LinearModel.1)

Call:
lm(formula = Current ~ `Year-1` + `Year-2` + 0, data = CorsiYr300)

Residuals:
    Min      1Q  Median      3Q     Max
-28.618  -4.985  -0.081   4.469  30.136

Coefficients:
         Estimate Std. Error t value Pr(>|t|)
`Year-1`  0.47468    0.02215  21.430   <2e-16 ***
`Year-2`  0.19478    0.02296   8.485   <2e-16 ***

Residual standard error: 7.135 on 2091 degrees of freedom
Multiple R-squared: 0.3413, Adjusted R-squared: 0.3407
F-statistic: 541.8 on 2 and 2091 DF, p-value: < 2.2e-16
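The minutes filter itself is a one-line subset. In this sketch the `Minutes` column name and the four rows are hypothetical; the only thing taken from the text is the 300-minute cutoff, with a season at exactly 300 minutes kept.

```r
# Toy player-season table with a hypothetical ice-time column.
seasons <- data.frame(
  Player  = c("A", "A", "B", "C"),
  Minutes = c(950, 120, 300, 299),
  Corsi   = c(4.2, 15.0, -1.3, -9.8)
)

# Keep only seasons with at least 300 minutes played.
kept <- subset(seasons, Minutes >= 300)
print(kept)
```

In the toy table the two most extreme Corsi values belong to the two shortest seasons, which are the ones the filter removes.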

Does Career Corsi predict this year's Corsi?

Once again, if you look at the correlations, it looks pretty good. As before, these correlations are not across the board but only for the 308 players who have played in all 7 seasons.

| Year | Correlation |
|---|---|
| 2008 | 0.554034 |
| 2009 | 0.4802796 |
| 2010 | 0.5648495 |
| 2011 | 0.559835 |
| 2012 | 0.5339196 |
| 2013 | 0.5659859 |

Modeling

If we use Career Corsi to predict this year's Corsi, we get

> LinearModel.1 = lm(Current ~ Career + 0, data=CorsiCareer)
> summary(LinearModel.1)

Call:
lm(formula = Current ~ Career + 0, data = CorsiCareer)

Residuals:
     Min       1Q   Median       3Q      Max
-145.966   -6.537   -0.655    5.138  101.528

Coefficients:
       Estimate Std. Error t value Pr(>|t|)
Career  0.39940    0.01631   24.48   <2e-16 ***

Residual standard error: 11.55 on 4405 degrees of freedom
Multiple R-squared: 0.1198, Adjusted R-squared: 0.1196
F-statistic: 599.4 on 1 and 4405 DF, p-value: < 2.2e-16

The coefficient here, 0.39940, suggests that even career numbers regress to the mean quite a bit. Although the model is highly significant, it explains only 12% of the total variability.
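How the Career variable was built is not spelled out. As a rough sketch under one possible assumption, namely that Career is the unweighted mean of all of a player's earlier seasons, the predictor and the resulting prediction could be computed like this. The four season values are made up; only the 0.39940 coefficient comes from the fit above.

```r
# Hypothetical season-by-season Corsi for one player.
corsi <- c("2007" = 4, "2008" = 6, "2009" = 2, "2010" = 8)
n <- length(corsi)

# Running mean through each season.
running_mean <- cumsum(corsi) / seq_along(corsi)

# Career-to-date entering each season: the running mean through the
# previous season (assumed definition; undefined for the first season).
career_prior <- c(NA, running_mean[-n])
names(career_prior) <- names(corsi)

# Prediction from the no-intercept career model.
predicted <- 0.39940 * career_prior
print(rbind(career_prior, predicted))
```

Under this assumption the player enters 2010 with a career mark of 4 and is predicted at about 1.6, which shows how strongly the single coefficient pulls career numbers toward zero.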

Conclusions

You can (sort of) predict Corsi using either the last two years or Corsi for the player's career. Using the last two years gives better answers. Limiting the analysis to players with at least 300 minutes helps.

I have a couple of thoughts on ways to construct pseudo-Bayesian estimates of Career Corsi. I suspect they will work a little better than the unadjusted career numbers. It will take me a while to run them, but once I do I will post the results.