RAPM aging curve
Hey
Here's the analysis of the influence of age on player performance that I originally submitted to SSAC '13 but did not present. It uses matchup data from '00 to Dec. 2012.
What I did was treat player age as an additional player on the floor, then compute a coefficient for each age with RAPM.
So, essentially, instead of the matchup files looking like
LeBron Wade Bosh .. ..  Anthony Smith Chandler .. ..  1 0
they'd look like
LeBron AGE_28 Wade AGE_31 Bosh AGE_28 .. ..  Anthony AGE_28 Smith AGE_27 Chandler AGE_30 .. ..  1 0
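To make the encoding concrete, here is a minimal sketch of how a stint could be turned into a regression row with age tokens. The function names are made up for illustration, and the +1/-1 sign convention for the two sides is an assumption; the actual matchup file format (including the trailing "1 0") is not spelled out in the post.

```python
def add_age_tokens(players):
    """Interleave an AGE_n token after each (name, age) pair."""
    tokens = []
    for name, age in players:
        tokens.extend([name, f"AGE_{age}"])
    return tokens

def encode_stint(home, away):
    """Sparse regression row: +1 per home token, -1 per away token.

    Age tokens accumulate, so two same-age opponents cancel and two
    same-age teammates count twice. (Assumed convention, see lead-in.)
    """
    row = {}
    for sign, side in ((1, home), (-1, away)):
        for token in add_age_tokens(side):
            row[token] = row.get(token, 0) + sign
    return {k: v for k, v in row.items() if v != 0}

# Names and ages taken from the example line; the lineups are truncated there too.
home = [("LeBron", 28), ("Wade", 31), ("Bosh", 28)]
away = [("Anthony", 28), ("Smith", 27), ("Chandler", 30)]

print(" ".join(add_age_tokens(home)))
print(encode_stint(home, away))
```

The resulting rows would then go into the same ridge regression as ordinary RAPM, with the AGE_n columns getting coefficients just like players do.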
Survivor bias is a problem here. These numbers don't represent "expected performance", but instead "expected performance IF the coach actually decides to give that player minutes".
The individual lines for offense and defense are jerkier.
It also has to be noted that the y-axis actually represents influence per 200 possessions, not 100.
That is, if one, like me, assumes ~192 possessions total in a game.
Re: RAPM aging curve
I'll assume the low value for age 19 is because of the lack of priors, right?
The polynomial fit says peak NBA age is 28, which is close to the consensus (at least that I've seen).
I assume you used integer ages instead of decimal ages (incorporating days). Then ages 19 and older would all be dummy variables. Did you think about making age a continuous variable? If you want to model a polynomial of that shape, you can use something easy like beta1*age - beta2*age^2 + beta3.
Do offense and defense peak at different ages?
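As a rough illustration of the continuous-age idea, the sketch below fits a quadratic in age with `numpy.polyfit` and reads the peak age off the vertex. The per-age values are hypothetical (a concave curve peaking at 28 with a small wiggle standing in for noise), not numbers from this analysis.

```python
import numpy as np

# Hypothetical per-age values, NOT the thread's estimates: a concave curve
# peaking at 28 plus a small alternating wiggle standing in for noise.
ages = np.arange(19, 37)
values = 1.0 - 0.02 * (ages - 28.0) ** 2 + 0.05 * (-1.0) ** ages

# np.polyfit returns the highest power first: values ~ a*age^2 + b*age + c.
a, b, c = np.polyfit(ages, values, deg=2)

# A concave quadratic (a < 0) peaks where its derivative 2*a*age + b is zero.
peak_age = -b / (2.0 * a)
print(f"a={a:.4f} b={b:.4f} c={c:.4f} peak age={peak_age:.2f}")
```

Nothing here requires integer ages, so decimal ages (years plus days) could be passed in the same way.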
Re: RAPM aging curve
Jerry,
This is a pretty cool concept. Is everything still done with "Bayesian" updating in this iteration of RAPM? If so, I would guess that this graph could be used to represent year-to-year deltas.
If not, the deltas would be something more like Y-value(year1) minus Y-value(year2).
Re: RAPM aging curve
In this specific analysis there are no priors involved except for the standard zero prior everyone gets in standard RAPM.
It's one large regression over multiple years of data, with coefficients for age added in
Re: RAPM aging curve
Just wanted to add a few observations (mostly questions) about this very interesting result.
(1) Does it make sense that the average (fitted) performance of players is positive between about ages of 20 and 35? Isn't there some adding-up constraint that makes overall average performance equal zero? The vast majority of possessions are played by that cohort, and shouldn't that be balanced out by the negative performances of the young and old? What am I missing?
(2) Do age adjustments really matter "much"? I get that including them improves the fit of +/- regressions, but how much of an adjustment does it impart to an average player rating?
Consider the case of Kevin Durant. He is a young player, one whose ratings are diminished by not taking into account his age-related improvement. As I read Jeremias' plot, the average 24-year-old was expected to be about 0.3 points per 100 possessions better than the year before. How much might his age-adjusted (x)RAPM improve by incorporating this fact?
Here's my stab at an answer. KD played 42% of his 2012-13 minutes as part of the following lineup: Durant (age 24), Ibaka (23), Perkins (28), Sefolosha (29), and Westbrook (24). And this is a pretty good representation of the age structure of the remaining 58%. And what you get when you take the average of the expected age-related changes of all these players and subtract it from those of the individuals is, well, not much.
KD's expected 24-year-old improvement of about 0.3 is shared in part with Perkins and Sefolosha, but he in turn is subsidized to a smaller degree by Ibaka (23-year-olds having a slightly larger expected improvement than 24-year-olds). In this instance, given how I eyeballed the plot, I get that KD's rating, per this lineup, is about 0.06 lower than a non-age-adjusted rating would be. And this would seem to be a pretty good representative figure for the rest of his minutes.
Now, such cross-subsidies will vary across lineups depending upon the age structure. Pity the young player who only plays with teammates significantly over 29 years of age (and whose average competition is also over the hill). He would be more screwed. But such results should be very rare, and the upper bound is still not "large".
Is my intuition incorrect on this point? Is it possible for age-adjusted +/- measures to differ more from their unadjusted counterparts, and if so, why?
(3) Coaches are the real beneficiaries of the failure to age-adjust +/- regressions.
The average age in the NBA is about 27, and according to Jeremias' results (assuming I am interpreting them correctly) such players are expected to improve by about 0.1 per 100 possessions. Perhaps I am not thinking about this correctly, but when you then throw coaches into an age-unadjusted +/- regression, all expected age-related player improvements will be assigned to coaches, implying that on average coaches' ratings are bumped up by about 0.5.
Now, J.E.'s coaches regression is no longer available, but as I recall the results, the average rating of a coach was already slightly negative (about -0.5 or so). This argument implies that the "true" average coaching contribution is lower still. (This average negative result, by the way, is consistent with other research I have seen.)
And I suppose in this context I should make particular note of Gregg Popovich, because I do recall having previously commented on his slightly negative rating in the aforementioned regression. Gregg Popovich, of course, is well known (or should be) for having consistently coached above-average-age rosters. I am too lazy to check at the moment, but I think they have averaged slightly above 29 years. As such, in line with the argument above, GP might not in fact be a "net negative" coach. Rather, adjusting for his slightly past-prime players, he could be expected to have an approximately zero rating, which would make him a bit above average.
Re: RAPM aging curve
schtevie, for the aging curve I artificially adjusted the chart so the data points (somewhat) run from -2 to +2. It's supposed to be interpreted so that a player who is a +X (in xRAPM or whatever) now is projected to perform at
+X + (y_coordinate of player age next season - y_coordinate of player age last season)
(and take the whole thing *0.85 because of regression to the mean)
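The projection rule can be written as a small function. This is only a sketch of how I read the formula: the function name is made up, "last season" is taken to be the season in which the player was a +X, and the curve values below are hypothetical chart readings rather than the fitted numbers.

```python
def project_next_season(rating, age, curve, regression=0.85):
    """Project next season's rating from the current rating and an aging curve.

    `curve` maps an integer age to the chart's y-coordinate. Following the
    post, the age delta is added first and the whole sum is then shrunk.
    """
    age_delta = curve[age + 1] - curve[age]
    return regression * (rating + age_delta)

# Hypothetical y-coordinates read off a chart; not the actual fitted values.
curve = {23: 0.9, 24: 1.2, 25: 1.4}

# A +3.0 player who just finished his age-24 season.
print(project_next_season(3.0, 24, curve))
```

Because only the difference between two y-coordinates enters, shifting the whole chart up or down (as was done to make it run from -2 to +2) does not change the projection.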
Quote: Now, J.E.'s coaches regression is no longer available, but as I recall the results, the average rating of a coach was already slightly negative (about -0.5 or so). This argument implies that the "true" average coaching contribution is lower still. (This average negative result, by the way, is consistent with other research I have seen.)
I think the coach ratings back then were just not centered correctly, so the average coach rating should probably not have been negative.
Re: RAPM aging curve
Small update
Reran the numbers with more data from 2012-13 and 2013-14.
The coefficients for offense make sense, for the most part, until you reach age 41.
I'm happy that there are very few conflicting data points. On the upslope (18-23) each year has a more positive coefficient than the preceding year, and on the downslope (31 and after) most years have a more negative coefficient than the preceding year, with the exceptions of coeff(39) > coeff(38) > coeff(37).
After removing the coefficients for 41 and over, polynomial fitting (thanks to http://www.arachnoid.com/polysolve/) leads to
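Code:
def age_infl_off(age):
    # Minus signs were stripped from this post; the signs here are reconstructed
    # so the curve peaks in the mid-20s (the constant's sign is a guess).
    return (-5.1855886560913811e-001 * pow(age, 0)
            + 4.9112390028866172e-002 * pow(age, 1)
            - 1.4598588208904030e-003 * pow(age, 2)
            + 1.3428060693723941e-005 * pow(age, 3))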
Unfortunately the coefficients for defense are not as 'pretty'.
They go up almost steadily until age 29, then steadily drop until age 33. After that, the coefficients are all over the place.
I've decided not to use the coefficients for ages 37 and over in the polynomial fit.
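Code:
def age_infl_def(age):
    # Minus signs were stripped from this post; the signs here are reconstructed
    # so the curve peaks near age 28-29 (the constant's sign is a guess).
    return (-1.3905924679440346e-001 * pow(age, 0)
            + 1.2958074760491843e-002 * pow(age, 1)
            - 3.5330169150904782e-004 * pow(age, 2)
            + 2.9414942568037581e-006 * pow(age, 3))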
There are two reasons for the inconsistent coefficients at age 37 and above:
- Sample size: There simply aren't many players who play after age 37, let alone age 40. The smaller the number of players of that age group in our sample, the harder it becomes for the regression to estimate a reasonable coefficient.
  Example: Suppose we had only one single player who played at age 42 and 43. For one single player it is not entirely unlikely that he, for random reasons, has better +/- numbers (after adjusting for teammates) at age 43 compared to 42. Since the regression has only his performance to go by for 42/43-year-olds, the coefficient for 'Age_43' would be higher than the coefficient for 'Age_42'. If we had more players who had played at 42 and 43, chances are that most of them played a little worse at age 43 and the coefficients would look more reasonable.
- Survivor bias: Only those players who play exceptionally well up to a very high age get playing time at that age, and are thus in our sample. Players who were more heavily (negatively) affected by age don't remain in the league as long, and are thus not in the matchup data and not in the sample. This skews the results.
Re: RAPM aging curve
Hey, this is great. Regarding aging and Offense:
Quote: On the upslope (18-23) each year has a more positive coefficient than the preceding year, and on the downslope (31 and after) most years have a more negative coefficient than their preceding year, ..
The curve is intuitively appealing, but it looks like a straight-line increase up to age 23, and a straight drop-off by year after 30. Basically a plateau from 23-30.
The equivalent defensive plateau age range looks like 25-32, just about 2 years later/older than for offense. Intuitively about right.
A single integer for age in a given season is rather arbitrary. A lot of smoothing could be had by assigning, for example, to age 25 the sum of 1/4 of the value of the age 24 group, 1/2 of age 25, and 1/4 of age 26.
Oftentimes a player goes 1/4 or more of a season after a birthday.
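The 1/4, 1/2, 1/4 weighting can be sketched in a few lines. The coefficients below are hypothetical, and renormalising the weights at the youngest and oldest ages is my own assumption, since the suggestion doesn't say how to handle the edges.

```python
def smooth_by_age(coefs):
    """Apply 1/4, 1/2, 1/4 weights to each age and its two neighbours.

    `coefs` maps integer age -> coefficient. At the youngest and oldest
    ages only the available neighbours are used and the weights are
    renormalised (an assumption; the post does not say how to treat edges).
    """
    smoothed = {}
    for age in coefs:
        total = 0.0
        weight_sum = 0.0
        for neighbour, weight in ((age - 1, 0.25), (age, 0.5), (age + 1, 0.25)):
            if neighbour in coefs:
                total += weight * coefs[neighbour]
                weight_sum += weight
        smoothed[age] = total / weight_sum
    return smoothed

# Hypothetical per-age coefficients with one out-of-line value at 25.
raw = {23: 1.0, 24: 1.2, 25: 2.0, 26: 1.2, 27: 1.0}
smoothed = smooth_by_age(raw)
print(smoothed[25])
```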
Re: RAPM aging curve
J.E. wrote:
- Sample size: There simply aren't many players that play after age 37, let alone age 40. The smaller the number of players of that age group in our sample, the harder it becomes for the regression to estimate a reasonable coefficient.
  Example: Suppose we had only one single player that played at age 42 and 43. For one single player it is not entirely unlikely that he, for random reasons, has better +/- numbers (after adjusting for teammates) at age 43 compared to 42. Since the regression has only his performance to go by for 42/43-year-olds, the coefficient for 'Age_43' would be higher than the coefficient for 'Age_42'. If we had more players that had played at 42 and 43, chances are that most of them played a little worse at age 43 and the coefficients would look more reasonable
- Survivor bias: Only those players that play exceptionally well up to a very high age do get some playing time at high age, and are thus in our sample. Players which were more heavily (negatively) affected by age don't remain in the league as long, are thus not in the matchup data and not in the sample. This skews results
The two issues are related. A player on the down side of his career will only get to play in year Y+1 if he was good (luckily good) in Y. So the observed deltas will be biased large.
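A toy simulation shows how the selection effect inflates observed year-over-year declines: a player only returns in year Y+1 if his year-Y rating looked good. Everything here is made up for illustration (the true decline of 0.5, the noise level, the cutoff), and it models simple year-to-year deltas rather than the age-dummy regression itself.

```python
import random

def simulate_observed_delta(n_players=20000, true_delta=-0.5, noise_sd=2.0,
                            cutoff=0.0, seed=1):
    """Mean observed year-over-year change among players who keep their job.

    Every player has true talent 0 in year Y and true_delta in year Y+1.
    Observed ratings add independent noise each year, and a player only
    appears in year Y+1 if his observed year-Y rating beat `cutoff`.
    """
    rng = random.Random(seed)
    deltas = []
    for _ in range(n_players):
        observed_y = rng.gauss(0.0, noise_sd)
        if observed_y <= cutoff:
            continue  # dropped from the league: no year Y+1 observation
        observed_y1 = true_delta + rng.gauss(0.0, noise_sd)
        deltas.append(observed_y1 - observed_y)
    return sum(deltas) / len(deltas)

observed = simulate_observed_delta()
print(f"true delta: -0.50, observed among survivors: {observed:.2f}")
```

The survivors' year-Y ratings are partly luck, and that luck doesn't carry over, so the average observed drop comes out well beyond the true decline.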
There are several possible solutions discussed at length in baseball research, where the randomness effect is far more pronounced.
Another question: are your polynomial fit curves weighted by the number of observations of each delta, or just based on the points shown with no weighting?
Great work once again, J.E.!
Developer of Box Plus/Minus
APBRmetrics Forum Administrator
GodismyJudgeOK.com/DStats/
Twitter.com/DSMok1
Re: RAPM aging curve
Mike G wrote: A lot of smoothing could be had by assigning, for example to age 25: the sum of 1/4 of the value of the age 24 group, 1/2 of age 25, and 1/4 of age 26.
Maybe I'm misunderstanding, but doesn't the polynomial fit provide the smoothing we need? If one of the coefficients seems "out of line", like in this case the one for 'age 39' on offense, the fitted curve provides a more reasonable number.
DSMok1 wrote: Another question: are your polynomial fit curves weighted by the number of observations of each delta, or just based on the points shown with no weighting?
Not weighted. Good point though. It's on my to-do list.
Quote: There are several possible solutions discussed at length in baseball research, where the randomness effect is far more pronounced.
Do you have links to papers or blog posts that deal with this issue?

Re: RAPM aging curve
I was going to say that weighting by number of possessions would help the model fit the points for ages 39 and higher. I've done that a lot to deal with the extreme ranges of datasets.
So you're using an integer age and not decimal? Maybe you can use "exact" decimal age by putting the polynomial factors (age, age^2, age^3, etc.) in the RAPM model instead of separate coefficients for each year.
The problem with survivor bias is this: say there are three players, A, B, and C. Player A is a weird Stockton-type who plays well until he's 40. Player B falls off sharply at age 35 and is out of the league by 36. Player C declines every year from age 32 to 35 before retiring. He tries to make a team at age 36 but no one signs him.
We know that age 36 isn't a good age given our set of players, but say Player A has a better season at age 36 than 35 (or hardly declines). That means the model doesn't see age 36 as a problem at all. But we know from the other two players there's a significant decline at age 36, or else they'd be able to play. Or imagine a new player, D, who plays a lot during age 35 but barely at all at age 36 because he's much worse. Yet because he has a limited number of possessions, he won't change the model results (I think, given what you're doing).
So there has to be a way to penalize an age that causes players to retire or drop off in playing time. I'm not sure how to do that, however.... (Helpful, I know.) But this is an area of stats I do want to learn more about.
Re: RAPM aging curve
Obviously not the same level of research, as I am a sophomore in college and had very limited time and slightly incomplete data, but I did a project on the aging of offensive and defensive xRAPM and got some (obviously flawed but) potentially interesting results (link below).
Obviously sample bias was the overriding issue because of the qualifier, and the discussion about getting around it is more important than the results themselves, but I still think there are some useful takeaways, at least for further research.
First off: more evidence that age is a very significant predictor when projecting NBA performance from college, as younger players are generally better talents, hence they are drafted earlier.
Interesting but may or may not be real: it looks like the discrepancy in defense between players drafted at younger ages vs. older ages is, on average, much bigger than the discrepancy in offense (without accounting for variance, which will be larger for offense). It also looks like, on average, defense declines much more gradually than offense. Of course part of this will be explained by the fact that players drafted at younger ages are higher picks and more athletic. And the usual caveats with defensive metrics apply.
Potential improvements include: adding an Injury variable, adding a Cumulative Minutes Played or Cumulative Possessions variable (getting rid of the Experience term), making Possessions a lag variable, estimating the rate of change rather than the actual value, and using "total value over replacement" rather than the rate stat. The real value is going to be found in adjusting for position/player type, as most empirical and anecdotal evidence says offense peaks earlier and defense (particularly big man defense) peaks later and declines more gradually.
Link includes summary and paper. Paper is long and technical as the intended audience was the professor; summary is much more to the point.
https://www.dropbox.com/s/qelbcxgga6wm4 ... BPaper.pdf
Would love to hear what everyone thinks. Thanks.
Re: RAPM aging curve
Did the weighting by # of observations, and figured out the 'optimal' polynomial degree empirically through out-of-sample testing, instead of choosing it arbitrarily.
For defense, a polynomial degree of 2 had the lowest out-of-sample error. For offense it was 3.
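For anyone wanting to try something similar, here is one possible way to combine observation weighting with an out-of-sample degree check, using leave-one-age-out validation on top of `numpy.polyfit`. This is a sketch and not necessarily how the numbers above were produced; the data, the counts, and the choice of leave-one-out are all assumptions.

```python
import numpy as np

def loo_error(ages, values, counts, degree):
    """Leave-one-age-out squared error of a weighted polynomial fit."""
    sq_errors = []
    for i in range(len(ages)):
        keep = np.arange(len(ages)) != i
        # np.polyfit weights multiply the unsquared residuals, so pass
        # sqrt(count) to weight squared errors by the observation count.
        fit = np.polyfit(ages[keep], values[keep], degree,
                         w=np.sqrt(counts[keep]))
        pred = np.polyval(fit, ages[i])
        sq_errors.append(counts[i] * (pred - values[i]) ** 2)
    return np.sum(sq_errors) / np.sum(counts)

# Hypothetical data: a quadratic curve with a deterministic wiggle, and far
# fewer observations at the oldest ages. Not the thread's coefficients.
ages = np.arange(19.0, 41.0)
values = 1.0 - 0.02 * (ages - 28.0) ** 2 + 0.1 * np.sin(7.0 * ages)
counts = np.where(ages < 36, 400.0, 20.0)

errors = {d: loo_error(ages, values, counts, d) for d in (1, 2, 3)}
best_degree = min(errors, key=errors.get)
print(errors, best_degree)
```

With the old ages down-weighted, a couple of odd coefficients at 39+ pull on the fit far less than they would in an unweighted fit.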
Re: RAPM aging curve
Jeremias, can you post a graph w. O + D?
http://pointsperpossession.com/
@PPPBasketball
Re: RAPM aging curve
Excellent work on the new curves!
J.E. wrote:
Quote: There are several possible solutions discussed at length in baseball research, where the randomness effect is far more pronounced.
Do you have links to papers or blogposts that deal with this issue?
Well, I'd consider Tangotiger the best public authority on this, but his blog can be hard to search. Some recent articles on this issue, known as survivor or survivorship bias:
http://nhlnumbers.com/2012/12/6/goalie ... rshipbias
(A real quick look at it)
http://tangotiger.com/index.php/site/co ... witterfeed
http://www.insidethebook.com/ee/index.p ... ias_issue/
http://www.insidethebook.com/ee/index.p ... ing_study/
http://www.insidethebook.com/ee/index.p ... ing_curve/
(Read comments, follow links on all of these)
There are many more threads at Tango's blog covering this issue, but I can't uncover them all.