Does Corsi change with age? It's a good question, but trying to answer it runs into several problems. First, a player's Corsi varies a lot from season to season. Second, there is a lot of "censoring", meaning data that exists but that we haven't captured. By convention, the arrow of time flies from left to right. We have "right censored" data, in that most players in the database will play next season and in seasons after that. Corsi data only starts in 2007, so we also have "left censored" data: a large number of players in the database played prior to 2007, but we have no way to know what their results were. Finally, the data has cohort issues. Players who were 18 in 2007 were only 24 in 2013, so they have no overlap in age with players who were 25 or older in 2007. Comparing players at age 20 to players at age 28 may be apples to oranges.
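To make the cohort problem concrete, here is a minimal sketch in R of the ages at which a player can show up in a 2007 to 2013 data set. The three starting ages and the names windows and overlaps are made up for illustration; they are not from the actual analysis.

```r
# Observation window of the data set described above (seasons 2007 through 2013).
first_season <- 2007
last_season  <- 2013

# Hypothetical ages in the first season, chosen only for illustration.
age_in_2007 <- c(18, 25, 31)

# A player is seen at most once per season, so the observable ages run from
# the 2007 age up to that age plus the length of the window.
windows <- data.frame(
  age_in_2007  = age_in_2007,
  youngest_obs = age_in_2007,
  oldest_obs   = age_in_2007 + (last_season - first_season)
)
print(windows)

# Two groups can be compared at the same age only if their windows overlap.
overlaps <- function(w, a, b) {
  max(w$youngest_obs[a], w$youngest_obs[b]) <=
    min(w$oldest_obs[a], w$oldest_obs[b])
}

# Players who were 18 in 2007 versus players who were 25 in 2007.
no_overlap_18_vs_25 <- !overlaps(windows, 1, 2)
print(no_overlap_18_vs_25)
```

The 18-year-olds top out at 24 while the 25-year-olds start at 25, so the two groups are never observed at the same age.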
Why it might change with age
Young players have to learn the ropes. They have to fill out their lanky frames, grow into their bodies, and pay their dues. Older players lose a step. I'm sure there are other adages I'm forgetting. You might think the curve of Corsi versus Age looks like:
Why it might not
That all sounds good, but the NHL is a cut-throat business. Teams generally don't have the leeway to let players learn on the job or fade away gracefully. Even if the Corsi versus Age curve truly looks like figure 1, the NHL part of it is probably the broad flat top. We don't get to see the ends because they take place in other leagues.

Models
In all these analyses, I'm looking at Corsi as a rate. Average = 0. First, all players, unweighted:

> LinearModel.1 = lm(CORSION ~ Age, data=CorsiAges)
> summary(LinearModel.1)

Call:
lm(formula = CORSION ~ Age, data = CorsiAges)

Residuals:
     Min       1Q   Median       3Q      Max
-143.443   -6.295    0.828    7.370  101.517

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.36217    1.01118  -2.336   0.0195 *
Age          0.01980    0.03725   0.532   0.5950

Next, all players, but weighted by minutes played:

> LinearModel.2 = lm(CORSION ~ Age, data=CorsiAges, weights=Minutes)
> summary(LinearModel.2)

Call:
lm(formula = CORSION ~ Age, data = CorsiAges, weights = Minutes)

Weighted Residuals:
    Min      1Q  Median      3Q     Max
-824.62 -156.34  -22.65  113.37  967.59

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.34161    0.69307   3.379 0.000733 ***
Age         -0.07941    0.02493  -3.185 0.001455 **

So maybe. If so, a 20 year-old has an expected Corsi of 0.75 (2.34161 - 0.07941 × 20) and a 40 year-old has an expected Corsi of -0.83 (2.34161 - 0.07941 × 40). If real, that's not enough to worry about. Plus there's the apples to oranges issue.

Next, nesting age within each player and weighting by minutes:

> LinearModel.13 = lm(CORSION ~ NAME/Age, data=CorsiAges, weights=Minutes)
> anova(LinearModel.13)
Analysis of Variance Table

Response: CORSION
            Df    Sum Sq Mean Sq F value    Pr(>F)
NAME      1549 173078767  111736  4.5021 < 2.2e-16 ***
NAME:Age  1256  49032978   39039  1.5730 < 2.2e-16 ***
Residuals 3150  78178478   24819

It looks like the NAME:Age interaction is significant off the charts. But this is an overparameterized model. Looking at the NAME term, the top 5 and bottom 5 players are not exactly a Who's Who of Corsi.
Variable          Estimate    StdError  tvalue  Pr(>|t|)
TREVORGILLIES     7.031e+02  1.368e+03   0.514  0.607313
MIKEIGGULDEN      1.067e+03  2.093e+03   0.51   0.610228
KYLEGREENTREE     8.946e+02  1.793e+03   0.499  0.617921
DARRENMCCARTY     8.703e+02  1.803e+03   0.483  0.62943
IVANVISHNEVSKIY   6.704e+02  1.442e+03   0.465  0.641978
MIKKOLEHTONEN    -2.346e+03  2.427e+03  -0.967  0.3337
RICKARDRAKELL    -1.313e+03  1.317e+03  -0.997  0.31908
BARRYTALLACKSON  -1.847e+03  1.760e+03  -1.05   0.293805
JONMATSUMOTO     -2.779e+03  2.133e+03  -1.303  0.192674
JAREDROSS        -2.054e+03  1.554e+03  -1.322  0.186429

The top 5/bottom 5 list on the NAME:Age term looks more familiar.

Variable              Estimate    StdError  tvalue  Pr(>|t|)
ANZEKOPITAR:Age      4.679e+00  8.827e-01   5.301  1.23e-07
DUSTINBROWN:Age      3.864e+00  9.345e-01   4.135  3.64e-05
RYANOREILLY:Age      1.037e+01  2.594e+00   3.999  6.52e-05
PATRICEBERGERON:Age  4.482e+00  1.199e+00   3.738  0.000189
EVGENIMALKIN:Age     3.539e+00  9.584e-01   3.693  0.000226
MANNYMALHOTRA:Age   -4.798e+00  1.156e+00  -4.15   3.41e-05
NIKOLAIKULEMIN:Age  -5.760e+00  1.264e+00  -4.555  5.42e-06
ANDREASLILJA:Age    -7.884e+00  1.690e+00  -4.666  3.20e-06
ALEXOVECHKIN:Age    -4.353e+00  8.701e-01  -5.003  5.96e-07
DIONPHANEUF:Age     -5.222e+00  8.221e-01  -6.351  2.44e-10

Just looking at Kopitar and Phaneuf, yes the trends are there. Looking at Corsi Rel, it suggests that some of this is being driven by changes in the quality of their teams.
Year  NAME         CORSIREL  CORSION  CORSIOFF  Minutes  Birth  Age
2007  ANZEKOPITAR       5.5    -4.99    -10.45  1225.9    1987   20
2008  ANZEKOPITAR      12.3     9.5      -2.77  1175.06   1987   21
2009  ANZEKOPITAR      11.2     8.94     -2.24  1267.72   1987   22
2010  ANZEKOPITAR       8.3     8.68      0.4   1140.75   1987   23
2011  ANZEKOPITAR      13.8    19.35      5.5   1225.08   1987   24
2012  ANZEKOPITAR      18.4    25.43      7.06   726.62   1987   25
2013  ANZEKOPITAR      15.4    25.24      9.88  1205.4    1987   26
2007  DIONPHANEUF       8       9.25      1.22  1446.48   1985   22
2008  DIONPHANEUF      -1.6    10.45     12.01  1395.2    1985   23
2009  DIONPHANEUF       5.6     6.36      0.74  1415.07   1985   24
2010  DIONPHANEUF      -1.5    -6.2      -4.73  1228.92   1985   25
2011  DIONPHANEUF       3.3    -0.36     -3.66  1521.1    1985   26
2012  DIONPHANEUF      -7.3   -18.16    -10.85   859.2    1985   27
2013  DIONPHANEUF      -5.8   -20.04    -14.27  1326.4    1985   28

About as many players go up with age as go down with age. The pattern doesn't look any different in players under 30 versus players over 30. Another year or two of data will help sort this out. I doubt Kopitar keeps going up, up, up or Phaneuf down, down, down.

Finally, looking at age cohorts
For this part, I limited the analysis to players who were in the league in 2007-08 and broke the data into age cohorts. I separated the players by their age in 2007 into 18-20, 21-25, 26-30, 31-35, and 35+. I then did the analysis separately for each group. For the three youngest cohorts, there is no age effect (weighted or unweighted).

> LinearModel.1 = lm(CORSION ~ Age, data=Cohort1820)
> summary(LinearModel.1)

Call:
lm(formula = CORSION ~ Age, data = Cohort1820)

Residuals:
    Min      1Q  Median      3Q     Max
-54.865  -6.236   0.856   6.776  41.859

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -7.6241     6.6880  -1.140    0.255
Age           0.3683     0.2961   1.244    0.214

Residual standard error: 10.87 on 320 degrees of freedom
Multiple R-squared: 0.004812, Adjusted R-squared: 0.001703
F-statistic: 1.547 on 1 and 320 DF, p-value: 0.2144

> LinearModel.3 = lm(CORSION ~ Age, data=Cohort2125)
> summary(LinearModel.3)

Call:
lm(formula = CORSION ~ Age, data = Cohort2125)

Residuals:
     Min       1Q   Median       3Q      Max
-143.678   -6.203    1.166    7.219   87.351

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.21332    3.51140   0.061    0.952
Age         -0.07019    0.13613  -0.516    0.606

Residual standard error: 13.25 on 1718 degrees of freedom
Multiple R-squared: 0.0001547, Adjusted R-squared: -0.0004273
F-statistic: 0.2658 on 1 and 1718 DF, p-value: 0.6062

> LinearModel.5 = lm(CORSION ~ Age, data=Cohort2630)
> summary(LinearModel.5)

Call:
lm(formula = CORSION ~ Age, data = Cohort2630)

Residuals:
    Min      1Q  Median      3Q     Max
-79.126  -6.011   0.341   6.581  36.899

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   3.0374     4.2088   0.722    0.471
Age          -0.1349     0.1388  -0.972    0.331

Residual standard error: 11.16 on 1178 degrees of freedom
Multiple R-squared: 0.0008006, Adjusted R-squared: -4.763e-05
F-statistic: 0.9438 on 1 and 1178 DF, p-value: 0.3315

In the two older cohorts, there seems to be an age effect (weighted or unweighted). Interestingly, Corsi goes up as these players get older.

> LinearModel.7 = lm(CORSION ~ Age, data=Cohort3135)
> summary(LinearModel.7)

Call:
lm(formula = CORSION ~ Age, data = Cohort3135)

Residuals:
    Min      1Q  Median      3Q     Max
-34.790  -6.116   0.241   5.764  32.481

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -19.7605     6.7889  -2.911  0.00375 **
Age           0.5191     0.1969   2.636  0.00863 **

Residual standard error: 9.555 on 552 degrees of freedom
Multiple R-squared: 0.01243, Adjusted R-squared: 0.01064
F-statistic: 6.948 on 1 and 552 DF, p-value: 0.008628

> LinearModel.9 = lm(CORSION ~ Age, data=Cohort35up)
> summary(LinearModel.9)

Call:
lm(formula = CORSION ~ Age, data = Cohort35up)

Residuals:
    Min      1Q  Median      3Q     Max
-38.249  -7.743   0.234   6.841  34.104

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -54.4234    19.7084  -2.761  0.00675 **
Age           1.3990     0.5139   2.722  0.00755 **

Residual standard error: 10.84 on 109 degrees of freedom
Multiple R-squared: 0.06366, Adjusted R-squared: 0.05507
F-statistic: 7.41 on 1 and 109 DF, p-value: 0.007552

I think some of this is selection bias. Some of this is Chris Chelios skewing the data. Looking at the scatter plot tends to confirm this. (It also tends to highlight what an outlier Chris Chelios was. Those 3 dots out at 45, 46, and 47 are him.)

Eric T looked at this a couple months ago and said "The average player peaks at a bit over 51 percent Corsi, which is something like 60th percentile among regulars. By age 34 or 35, he's dropped to around 47 percent, which would be about 20th percentile." Here's his figure for this. I've made one little change.

Obviously, I'm very skeptical about this. But let's suppose that somehow he has managed to stumble upon The Truth and the average player really does peak at 51% and gradually drop off to 47%. The line I drew in is the lower limit of the 95% Confidence Interval for a full season of Corsi. The upper limit didn't fit on his figure, but it is up at 57%, which is roughly where it says "Eric T looked at this". So a "change" from 51% to 47% is a lot less than the magnitude of the randomness. It's not significant.
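The comparison between the supposed decline and the width of the interval can be spelled out in a few lines of R. This is only a sketch using the three percentages quoted above. The lower limit is not stated in the text, so the code assumes the interval is symmetric around the 51% peak; that assumption and all variable names are mine.

```r
peak_pct <- 51  # peak Corsi percentage from the Eric T quote
late_pct <- 47  # Corsi percentage at age 34 or 35 from the same quote
upper_95 <- 57  # upper limit of the 95% interval stated in the text

# Half-width of the interval, measured from the peak to the upper limit.
half_width <- upper_95 - peak_pct

# ASSUMPTION: the interval is symmetric about the peak, which puts the
# lower limit the same distance below it.
lower_95 <- peak_pct - half_width

# Size of the supposed age-related decline.
decline <- peak_pct - late_pct

print(c(decline = decline, half_width = half_width, lower_95 = lower_95))

# Is the decline smaller than the single-season noise?
print(decline < half_width)
# Is the late-career value still inside the assumed interval?
print(late_pct > lower_95)
```

Under that symmetry assumption, a 4-point drop sits inside a band that extends 6 points either side of the peak.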
So if a team is considering acquiring a new player, and they have three options, a 29 year-old, a 31 year-old, and a 33 year-old, I would not choose among them based on a concern that Corsi would change as they age.