Predicting RAPM with stats.nba

jc114
Posts: 28
Joined: Wed May 10, 2023 5:02 pm

Predicting RAPM with stats.nba

Post by jc114 »

Hi,
Just wanted to write a small note: following my previous post about calculating RAPM, I computed 3-year look-back RAPM at the end of each season from 2016-17 through 2021-22 as the training target, with 2022-23 as the test set. Inputs are stats from stats.nba.com with some pre-processing.
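For context, the setup is roughly the following, as a minimal sketch; the file name, column names and the plain ridge penalty are placeholders rather than my exact pipeline:

Code: Select all

import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.preprocessing import StandardScaler

# One row per player-season: box-score features plus the 3-year look-back
# ORAPM/DRAPM computed at that season's end (file and columns are placeholders).
df = pd.read_csv("player_season_stats_with_rapm.csv")
feature_cols = [c for c in df.columns if c not in ("PLAYER_NAME", "season", "ORAPM", "DRAPM")]

train = df[df["season"] <= "2021-22"]   # 2016-17 through 2021-22
test = df[df["season"] == "2022-23"]    # held-out season

scaler = StandardScaler().fit(train[feature_cols])
X_train = scaler.transform(train[feature_cols])
X_test = scaler.transform(test[feature_cols])

# Heavily regularized linear baseline for the offensive half; repeat for DRAPM.
model = Ridge(alpha=100.0).fit(X_train, train["ORAPM"])
pred = model.predict(X_test)
print(mean_absolute_error(test["ORAPM"], pred), mean_squared_error(test["ORAPM"], pred))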

With traditional stats + linear regression (I had some issues getting linear regression to work beyond traditional box stats, even with extremely large regularization): https://docs.google.com/spreadsheets/d/ ... CS/pubhtml
Top 15

Code: Select all

name                   | Pred ORAPM | Pred DRAPM | Overall pred RAPM
---------------------------------------------------------------------
Nikola Jokic           | 3.761047   | 1.328958   | 5.090005
Damian Lillard         | 5.549969   |-0.772136   | 4.777833
Jimmy Butler           | 3.791392   | 0.597812   | 4.389204
James Harden           | 3.558098   | 0.804808   | 4.362906
Joel Embiid            | 3.177790   | 1.020164   | 4.197954
Luka Doncic            | 3.352444   | 0.472218   | 3.824662
Tyrese Haliburton      | 3.616887   | 0.077168   | 3.694054
Chris Paul             | 2.130101   | 1.465629   | 3.595731
Kyrie Irving           | 3.432706   |-0.168301   | 3.264405
Jayson Tatum           | 2.970075   | 0.260106   | 3.230181
Shai Gilgeous-Alexander| 3.273488   |-0.056528   | 3.216960
Kristaps Porzingis     | 2.214370   | 0.976559   | 3.190929
Stephen Curry          | 3.604152   |-0.422366   | 3.181787
Fred VanVleet          | 2.258881   | 0.910750   | 3.169631
Domantas Sabonis       | 2.050337   | 1.050200   | 3.100536
With all stats + xgboost: https://docs.google.com/spreadsheets/d/ ... lS/pubhtml
Top 15

Code: Select all

name                    | ORAPM_pred | DRAPM_pred | RAPM_pred
--------------------------------------------------------------
Nikola Jokic            | 4.042871   | 1.870391   | 5.913263
Joel Embiid             | 2.923295   | 2.552115   | 5.475410
Kevin Durant            | 4.286930   | 0.696512   | 4.983441
James Harden            | 4.180523   | 0.768122   | 4.948646
Stephen Curry           | 5.023733   |-0.450640   | 4.573093
Kawhi Leonard           | 3.970355   | 0.561263   | 4.531618
Giannis Antetokounmpo   | 2.378320   | 1.748083   | 4.126403
Darius Garland          | 3.167522   | 0.715953   | 3.883475
Steven Adams            | 1.584521   | 2.242042   | 3.826563
Jayson Tatum            | 3.381459   | 0.427889   | 3.809348
Anthony Davis           | 1.978206   | 1.805436   | 3.783642
Luka Doncic             | 3.776522   |-0.136686   | 3.639837
Jalen Brunson           | 3.875503   |-0.247337   | 3.628166
Damian Lillard          | 5.427719   |-1.883968   | 3.543751
Paul George             | 3.007780   | 0.393586   | 3.401366
XGBoost has considerably lower MAE and MSE on both train and test, and the output seems relatively reasonable.

Calculating RAPM with an x-year look-back on every game day, with each stat of each player being an exponential moving average, would probably be better, but I haven't had time to write the code for that yet.
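For the EMA idea, it would be roughly this with pandas (the game-log file and column names are placeholders; the ewm call is the only real point):

Code: Select all

import pandas as pd

# Placeholder game-log table: one row per player per game, sorted chronologically.
logs = pd.read_csv("player_game_logs.csv").sort_values(["player_id", "game_date"])

stat_cols = ["PTS", "AST", "REB", "STL", "BLK", "TO"]

# Exponential moving average of each stat per player, as of each game day.
# span sets the effective look-back; adjust=False gives the recursive EMA form.
ema = (logs.groupby("player_id")[stat_cols]
           .transform(lambda s: s.ewm(span=30, adjust=False).mean()))
logs[[c + "_EMA" for c in stat_cols]] = ema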
Crow
Posts: 10533
Joined: Thu Apr 14, 2011 11:10 pm

Re: Predicting RAPM with stats.nba

Post by Crow »

Using this data to talk about contenders:

The 2 methods yield an average of 3 conference finalist teams with a top 5 rated player. By historical standards, that appears to be the list of qualified / "true" / final title contenders. Nuggets and Sixers made both top 5s. Celtics and Heat made one.

12 teams have a top 12 guy on one of the lists. So they may seem close... but ultimately probably not close enough to the first / most important criteria, at least this year. If you don't have a top 12 guy, you probably are not close enough to close enough.

Raptors, Knicks and Kings are close to top 12 but if really going for title, need to see that close to close enough to close enough is probably too far away as is. Hawks are not close enough to close enough to close enough. Neither are the Nets, Bulls, Pelicans or Timberwolves. If realistic, they need major change, maybe as far as essentially starting over.

The bottom 10 certainly need major change and most or all should essentially start over, if really focused on a legit chance for a title.
v-zero
Posts: 520
Joined: Sat Oct 27, 2012 12:30 pm

Re: Predicting RAPM with stats.nba

Post by v-zero »

Are you including the same players in the test set as the training set? I am aware you are using a later season for the test set, but if there is overlap between the train and test sets in terms of players, then you have a leakage issue which xgboost will be able to take advantage of more than the linear regression - xgboost's nonlinear tree-learning structure will lend itself to identifying the same player rather than necessarily identifying what it is that makes players good or bad.

If my hunch is correct about how you have run this, then I would suggest instead splitting your train and test sets by players rather than by years; this will be far more robust.
jc114
Posts: 28
Joined: Wed May 10, 2023 5:02 pm

Re: Predicting RAPM with stats.nba

Post by jc114 »

v-zero wrote: Mon May 29, 2023 6:02 pm Are you including the same players in the test set as the training set? I am aware you are using a later season for the test set, but if there is overlap between the train and test sets in terms of players, then you have a leakage issue which xgboost will be able to take advantage of more than the linear regression - xgboost's nonlinear tree-learning structure will lend itself to identifying the same player rather than necessarily identifying what it is that makes players good or bad.

If my hunch is correct about how you have run this, then I would suggest instead splitting your train and test sets by players rather than by years, this will be far more robust.
Yeah that's a good point. I spent some time thinking about this issue, but the reason I decided to do an early/later season split instead of a player split is fourfold: a) there aren't enough *really* top players, and particularly because they come in different archetypes it might be challenging to leave a proportion of them out; b) with the right subsampling and tree-depth hyperparameters it should hopefully be more robust; c) this more closely mimics the true training procedure, where you could re-train every day; d) the MAE and MSE are still pretty significant and the train-test gap is not too large.

Let me try player-split instead as well and see how it goes. Thanks!
v-zero
Posts: 520
Joined: Sat Oct 27, 2012 12:30 pm

Re: Predicting RAPM with stats.nba

Post by v-zero »

I may as well just tell you that I have done both, and that the hyperparameter tuning tends towards trees that identify individuals rather than trees that identify player archetypes if they are tuned with the same players in the train and test sets. Frankly I would be avoiding a tree depth greater than three. I would suggest you find your hyperparameters by using cross-validation leaving out 10% of all players on each run, and stratify your players by RAPM rating when you create your splits.
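For concreteness, a rough sketch of that kind of split (the table and column names are hypothetical; the point is just binning players by RAPM and holding out about 10% of players from each bin):

Code: Select all

import pandas as pd

# Placeholder player-season table with a "player" column and a RAPM target.
df = pd.read_csv("player_season_stats_with_rapm.csv")

# One summary RAPM per player, binned into quintiles for stratification.
players = df.groupby("player", as_index=False)["RAPM"].mean()
players["rapm_bin"] = pd.qcut(players["RAPM"], q=5, labels=False)

folds = []
for seed in range(10):
    # Hold out ~10% of players from every RAPM bin, so strong players show up in test too.
    held_out = players.groupby("rapm_bin").sample(frac=0.10, random_state=seed)["player"]
    test_mask = df["player"].isin(held_out)
    folds.append((df.index[~test_mask], df.index[test_mask]))
# Feed each (train_idx, test_idx) pair into the hyperparameter search.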

It's also worth mentioning, I think, that I haven't found regression trees to be as good at generalising player performance as I have well-formed penalized linear regression models with interaction terms. However, this is all just me.
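In scikit-learn terms that would look something like this (a sketch, not my actual model; X_train/y_train are whatever feature matrix and RAPM target you already have):

Code: Select all

from sklearn.linear_model import ElasticNetCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Pairwise interaction terms only (no squared terms), standardized, then penalized.
model = make_pipeline(
    PolynomialFeatures(degree=2, interaction_only=True, include_bias=False),
    StandardScaler(),
    ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9], cv=5),
)
model.fit(X_train, y_train)   # X_train: box-score features, y_train: RAPM target
print(model.score(X_test, y_test))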

Lastly, it's not as big an issue as you might think that there is a lack of top players, as boosted trees are additive rather than being bound by the average of a set of leaves. In other words, if the trees find positive (or negative) value in two different training samples (A and B) for two different characteristics, but both of those characteristics are present in player C, then C will benefit both from the trees that favour players like A, and trees that favour players like B, but only when those trees aren't being grown to a level of specificity (depth) that destroys their generality.

P.S. It's nice to see somebody playing around with all of this, nobody ever learns anything new by following all of the advice people give them.
jc114
Posts: 28
Joined: Wed May 10, 2023 5:02 pm

Re: Predicting RAPM with stats.nba

Post by jc114 »

v-zero wrote: Mon May 29, 2023 8:45 pm I may as well just tell you that I have done both, and that the hyperparameter tuning tends towards trees that identify individuals rather than trees that identify player archetypes if they are tuned with the same players in the train and test sets. Frankly I would be avoiding a tree depth greater than three. I would suggest you find your hyperparameters by using cross-validation leaving out 10% of all players on each run, and stratify your players by RAPM rating when you create your splits.

It's also worth mentioning, I think, that I haven't found regression trees to be as good at generalising player performance as I have well-formed penalized linear regression models with interaction terms. However, this is all just me.

Lastly, it's not as big an issue as you might think that there is a lack of top players, as boosted trees are additive rather than being bound by the average of a set of leaves. In other words, if the trees find positive (or negative) value in two different training samples (A and B) for two different characteristics, but both of those characteristics are present in player C, then C will benefit both from the trees that favour players like A, and trees that favour players like B, but only when those trees aren't being grown to a level of specificity (depth) that destroys their generality.

P.S. It's nice to see somebody playing around with all of this, nobody ever learns anything new by following all of the advice people give them.
Ohh very interesting. Thanks for all the advice and discussion. Indeed I found that too with tree depth - in that training setting I ended up with tree depth=3, subsample=0.8, colsample_bytree=0.8 and colsample_bylevel=0.8.
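In xgboost terms that is roughly this configuration (dtrain/dtest being DMatrix objects built from my feature table; eta and the number of rounds here are placeholders, not values I actually tuned):

Code: Select all

import xgboost as xgb

params = {
    "objective": "reg:squarederror",
    "max_depth": 3,            # shallow trees, per the discussion above
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    "colsample_bylevel": 0.8,
    "eta": 0.05,               # placeholder learning rate
    "eval_metric": "mae",
}
booster = xgb.train(params, dtrain, num_boost_round=500,
                    evals=[(dtrain, "train"), (dtest, "test")],
                    early_stopping_rounds=25)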

I was also thinking about including interaction features / feature crosses, but I'm worried about the feature-space size relative to the training-set size with my current setup, which uses only season-end stats and RAPM. I'm thinking that if I compute an EMA of stats and recompute RAPM on every game day with a suitable look-back, that will increase the training set significantly, but I'll need to rewrite my code a bit.
jc114
Posts: 28
Joined: Wed May 10, 2023 5:02 pm

Re: Predicting RAPM with stats.nba

Post by jc114 »

Crow wrote: Mon May 29, 2023 1:06 pm Using this data to talk about contenders:

The 2 methods yield an average of 3 conference finalist teams with a top 5 rated player. By historical standards, that appears to be the list of qualified / "true" / final title contenders. Nuggets and Sixers made both top 5s. Celtics and Heat made one.

12 teams have a top 12 guy on one of the lists. So they may seem close... but ultimately probably not close enough to the first / most important criteria, at least this year. If you don't have a top 12 guy, you probably are not close enough to close enough.

Raptors, Knicks and Kings are close to top 12 but if really going for title, need to see that close to close enough to close enough is probably too far away as is. Hawks are not close enough to close enough to close enough. Neither are the Nets, Bulls, Pelicans or Timberwolves. If realistic, they need major change, maybe as far as essentially starting over.

The bottom 10 certainly need major change and most or all should essentially start over, if really focused on a legit chance for a title.
To add: using the original methodology with xgboost, except with leave-one-season-out splits, I get the following stats for playoffs starting with the 2016-17 season, using the rank of the best, second-best and third-best player on each team.
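Roughly, the leave-one-season-out loop looks like this before the playoff summaries below (the table, feature list and params here are placeholder names rather than my exact code):

Code: Select all

import pandas as pd
import xgboost as xgb

# Placeholder player-season table; feature_cols excludes the id/target columns.
df = pd.read_csv("player_season_stats_with_rapm.csv")
feature_cols = [c for c in df.columns if c not in ("PLAYER_NAME", "season", "RAPM")]
params = {"objective": "reg:squarederror", "max_depth": 3, "subsample": 0.8}

oof_preds = {}
for held_out in sorted(df["season"].unique()):
    train = df[df["season"] != held_out]
    test = df[df["season"] == held_out]
    booster = xgb.train(params,
                        xgb.DMatrix(train[feature_cols], label=train["RAPM"]),
                        num_boost_round=300)
    oof_preds[held_out] = booster.predict(xgb.DMatrix(test[feature_cols]))
# Rank players within each season by these out-of-sample predictions, then summarize
# every playoff team by the rank of its best / second-best / third-best player.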

Made it to and were eliminated in round 1 (first round)
Best XGB-PM: mean 26.895833333333332, median 24.5, worst 100
Second best XGB-PM: mean 52.458333333333336, median 48.0, worst 167
Third best XGB-PM: mean 80.39583333333333, median 72.0, worst 228

Made it to and were eliminated in round 2 (conference semis)
Best XGB-PM: mean 9.0, median 7.0, worst 24
Second best XGB-PM: mean 25.416666666666668, median 24.0, worst 52
Third best XGB-PM: mean 37.416666666666664, median 37.5, worst 71

Made it to and were eliminated in round 3 (conference finals)
Best XGB-PM: mean 13.583333333333334, median 10.5, worst 38
Second best XGB-PM: mean 25.833333333333332, median 23.0, worst 75
Third best XGB-PM: mean 46.5, median 41.5, worst 90

Made it to and were eliminated in round 4 (finals)
Best XGB-PM: mean 6.833333333333333, median 6.5, worst 12
Second best XGB-PM: mean 24.0, median 25.0, worst 35
Third best XGB-PM: mean 33.0, median 32.5, worst 50

Won championship
Best XGB-PM: mean 1.8333333333333333, median 1.5, worst 4
Second best XGB-PM: mean 9.666666666666666, median 10.0, worst 16
Third best XGB-PM: mean 22.5, median 21.0, worst 41

These teams in the past seasons satisfied the criteria of best player top 5, second best player top 20, third best top 50:

Code: Select all

Team | Year    | Final round (5 = won title)
--------------------------------------------
GSW  | 2016-17 | 5
CLE  | 2016-17 | 4
LAC  | 2016-17 | 1
GSW  | 2017-18 | 5
HOU  | 2017-18 | 3
TOR  | 2018-19 | 5
GSW  | 2018-19 | 4
MIL  | 2018-19 | 3
LAL  | 2019-20 | 5
LAC  | 2019-20 | 2
MIL  | 2019-20 | 2
LAC  | 2020-21 | 3
BKN  | 2020-21 | -
MIL  | 2020-21 | 5
GSW  | 2021-22 | 5
MIL  | 2021-22 | 2
PHX  | 2022-23 | -
However, notice this list doesn't include DEN 2023, probably partly because my RAPM calculation has some train-test bleed, but also because Jokic is so ridiculously good.
jc114
Posts: 28
Joined: Wed May 10, 2023 5:02 pm

Re: Predicting RAPM with stats.nba

Post by jc114 »

jc114 wrote: Tue May 30, 2023 12:31 am
v-zero wrote: Mon May 29, 2023 8:45 pm I may as well just tell you that I have done both, and that the hyperparameter tuning tends towards trees that identify individuals rather than trees that identify player archetypes if they are tuned with the same players in the train and test sets. Frankly I would be avoiding a tree depth greater than three. I would suggest you find your hyperparameters by using cross-validation leaving out 10% of all players on each run, and stratify your players by RAPM rating when you create your splits.

It's also worth mentioning, I think, that I haven't found regression trees to be as good at generalising player performance as I have well-formed penalized linear regression models with interaction terms. However, this is all just me.

Lastly, it's not as big an issue as you might think that there is a lack of top players, as boosted trees are additive rather than being bound by the average of a set of leaves. In other words, if the trees find positive (or negative) value in two different training samples (A and B) for two different characteristics, but both of those characteristics are present in player C, then C will benefit both from the trees that favour players like A, and trees that favour players like B, but only when those trees aren't being grown to a level of specificity (depth) that destroys their generality.

P.S. It's nice to see somebody playing around with all of this, nobody ever learns anything new by following all of the advice people give them.

Ohh very interesting. Thanks for all the advice and discussion. Indeed I found that too with tree depth - in that training setting I ended up with tree depth=3, subsample=0.8, colsubsample by tree=0.8 and by level=0.8.

I was also thinking about including interaction features/feature cross, but I'm worried about the feature space size relative to training set size with my current training setup using only season end stats and RAPM. I'm thinking if I compute EMA of stats and recompute RAPM on every game day with suitable lookback that'll increase the training set significantly but I'll need to rewrite my code a bit.
Top 10 using new train/test split:

Code: Select all

name                    | ORAPM_pred | DRAPM_pred | RAPM_pred
--------------------------------------------------------------
Nikola Jokic            | 4.818526   | 2.113113   | 6.931639
Stephen Curry           | 5.411965   |-0.364840   | 5.047125
Kawhi Leonard           | 4.316070   | 0.693176   | 5.009245
Joel Embiid             | 2.745814   | 2.096955   | 4.842770
Giannis Antetokounmpo   | 3.483582   | 1.271373   | 4.754956
Paul George             | 2.829726   | 1.809602   | 4.639328
Kevin Durant            | 3.974816   | 0.503653   | 4.478468
Jayson Tatum            | 4.258130   | 0.008529   | 4.266659
Jrue Holiday            | 3.215147   | 0.880576   | 4.095723
Alex Caruso             | 0.964562   | 3.039121   | 4.003683
Anthony Davis           | 1.791420   | 2.208642   | 4.000063
v-zero
Posts: 520
Joined: Sat Oct 27, 2012 12:30 pm

Re: Predicting RAPM with stats.nba

Post by v-zero »

Looks good. Did you find much of a change hyperparameter-wise? Did you run a penalized linear regression using the same splits to compare?
jc114
Posts: 28
Joined: Wed May 10, 2023 5:02 pm

Re: Predicting RAPM with stats.nba

Post by jc114 »

v-zero wrote: Tue May 30, 2023 7:50 pm Looks good, did you find much of a change hyperparameter wise? Did you run a penalized linear regression using the same splits to compare?
Hyperparameters were mostly the same, but the biggest change was much larger regularization, which supports what you previously mentioned about memorizing players.

I did re-run penalized linear regression without any additional feature engineering. The aggregate statistics are a bit better than xgboost. However, the quality of the results depends on filtering out players with low minutes whereas xgboost takes care of that automatically. For example, I filtered out players with <5 min/game and <30 games a season, but then Boban slipped through and made it to my top 5 for 2022-2023 :)
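For reference, the filter is just something like this (a sketch; the column names are from my pre-processing and df is the player-season table):

Code: Select all

# Keep only rotation-level players before fitting the linear model
# (thresholds as above: at least 5 minutes per game and 30 games played).
eligible = df[(df["MIN_PER_GAME"] >= 5) & (df["GP"] >= 30)].copy()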
v-zero
Posts: 520
Joined: Sat Oct 27, 2012 12:30 pm

Re: Predicting RAPM with stats.nba

Post by v-zero »

I'd be curious to see what your features are, are they total statistics, or per 100 statistics, or both? Do you include mpg as a feature, and whether somebody starts? It is difficult for a purely linear model to capture players well in the box score, especially without a fiddle like many stats use to force players to sum to team margin, SRS or similar. Using higher dimensionality does help to alleviate this, but it's far from perfect.

Edit: in case you are interested here is what my season-ending pure box ratings look like, I have been playing around with them recently so I thought I would post them for comparison. They aren't adjusted to any team margins or anything like that, and use per-100 stats which are a moving average, so they are more of a now-cast of the final day of the regular season than they are an average of the season as a whole (hence Embiid over Jokic etc).

Code: Select all

Name		 Box Rating
Joel Embiid		7.5
Nikola Jokic		6.8
Shai Gilgeous-Alexander	6.1
Giannis Antetokounmpo	5.2
Jaren Jackson Jr.	5.0
Anthony Davis		4.9
Luka Doncic		4.7
Kevin Durant		4.6
Kristaps Porzingis	4.6
Jimmy Butler		4.5
colts18
Posts: 313
Joined: Fri Aug 31, 2012 1:52 am

Re: Predicting RAPM with stats.nba

Post by colts18 »

Kinda OT but still relevant to this topic,

Nothing more frustrating to me than people not rounding their stats to the proper decimal place. Nobody needs to see an RAPM down to 10 decimal places. No one is making decisions based on that level of precision. Keep it at 2 decimal places at most so it's readable. Too many numbers make people skip your dataset.
jc114
Posts: 28
Joined: Wed May 10, 2023 5:02 pm

Re: Predicting RAPM with stats.nba

Post by jc114 »

colts18 wrote: Wed May 31, 2023 1:48 pm Kinda OT but still relevant to this topic,

Nothing more frustrating to me than people not rounding up their stats to the proper decimal place. Nobody needs to need to see an RAPM down to 10 decimal places. No one is making decision based on that level of precision. Keep it at 2 decimal places at most so its readable. Too many numbers makes people skip your dataset.
For sure, I'm just used to doing it since in my field a 0.001% improvement is already significant $$$. Will keep that in mind.
jc114
Posts: 28
Joined: Wed May 10, 2023 5:02 pm

Re: Predicting RAPM with stats.nba

Post by jc114 »

v-zero wrote: Wed May 31, 2023 8:11 am I'd be curious to see what your features are, are they total statistics, or per 100 statistics, or both? Do you include mpg as a feature, and whether somebody starts? It is difficult for a purely linear model to capture players well in the box score, especially without a fiddle like many stats use to force players to sum to team margin, SRS or similar. Using higher dimensionality does help to alleviate this, but it's far from perfect.

Edit: in case you are interested here is what my season-ending pure box ratings look like, I have been playing around with them recently so I thought I would post them for comparison. They aren't adjusted to any team margins or anything like that, and use per-100 stats which are a moving average, so they are more of a now-cast of the final day of the regular season than they are an average of the season as a whole (hence Embiid over Jokic etc).

Code: Select all

Name		 Box Rating
Joel Embiid		7.5
Nikola Jokic		6.8
Shai Gilgeous-Alexander	6.1
Giannis Antetokounmpo	5.2
Jaren Jackson Jr.	5.0
Anthony Davis		4.9
Luka Doncic		4.7
Kevin Durant		4.6
Kristaps Porzingis	4.6
Jimmy Butler		4.5
Cool, seems like your projection is pretty darn good :)

I got the vast majority of the stats off stats.nba.com, then turned them into per-possession stats and standardized them to mean 0, std 1 (roughly the pre-processing sketched at the end of this post):

Per possession stats:
'CONTESTED_SHOTS', 'CONTESTED_SHOTS_2PT', 'CONTESTED_SHOTS_3PT',
'DEFLECTIONS', 'CHARGES_DRAWN', 'SCREEN_ASSISTS', 'SCREEN_AST_PTS',
'OFF_LOOSE_BALLS_RECOVERED', 'DEF_LOOSE_BALLS_RECOVERED',
'LOOSE_BALLS_RECOVERED', 'OFF_BOXOUTS', 'DEF_BOXOUTS',
'BOX_OUT_PLAYER_TEAM_REBS', 'BOX_OUT_PLAYER_REBS', 'BOX_OUTS'

'PTS_OFF_TOV', 'PTS_2ND_CHANCE', 'PTS_FB', 'PTS_PAINT',
'OPP_PTS_OFF_TOV', 'OPP_PTS_2ND_CHANCE', 'OPP_PTS_FB', 'OPP_PTS_PAINT',
'BLK', 'BLKA', 'PF', 'PFD',

'E_OFF_RATING', 'E_DEF_RATING',
'E_NET_RATING', 'AST_PCT', 'AST_TOV', 'AST_RATIO',
'OREB_PCT', 'DREB_PCT', 'REB_PCT', 'TM_TOV_PCT', 'EFG_PCT', 'TS_PCT',
'E_USG_PCT', 'PIE',
'FG3M', 'FG3A', 'FTM', 'FTA','FG2M', 'FG2A',
'OREB', 'DREB', 'REB', 'AST', 'STL', 'TO', 'PTS'


Non-per possession stats:
'PCT_AST_2PM', 'PCT_UAST_2PM', 'PCT_AST_3PM', 'PCT_UAST_3PM',
'PCT_PTS_2PT', 'PCT_PTS_2PT_MR',
'PCT_PTS_3PT', 'PCT_PTS_FB', 'PCT_PTS_FT', 'PCT_PTS_OFF_TOV',
'PCT_PTS_PAINT', 'MIN_PER_GAME', 'GP', 'GS', 'FT_PCT', 'FG3_PCT', 'FG2_PCT', 'PLAYER_AGE'


There are some that I didn't handle properly, e.g. TS_PCT shouldn't have been converted to per-possession the way I did.
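Roughly, the pre-processing amounts to something like this (a simplified sketch; POSS is an assumed per-player-season possessions column, the column groupings are stand-ins, and TS_PCT is shown left as a rate per the caveat above):

Code: Select all

from sklearn.preprocessing import StandardScaler

# Stand-in column groups; POSS is an assumed per-player-season possessions column.
count_cols = ["PTS", "AST", "REB", "STL", "TO", "BLK", "DEFLECTIONS"]
rate_cols = ["TS_PCT", "AST_PCT", "MIN_PER_GAME", "GP", "PLAYER_AGE"]  # left as rates/counts

# Counting stats -> per-100-possession rates, then everything to mean 0 / std 1.
df[count_cols] = df[count_cols].div(df["POSS"], axis=0) * 100
feature_cols = count_cols + rate_cols
df[feature_cols] = StandardScaler().fit_transform(df[feature_cols])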
v-zero
Posts: 520
Joined: Sat Oct 27, 2012 12:30 pm

Re: Predicting RAPM with stats.nba

Post by v-zero »

Yeah, managing to get that model from the box score alone has been quite pleasing. It's not reinventing the wheel, but combined with my streaming variant of a plus-minus model it has pretty strong predictive power (though I have a model which uses an extended play-by-play box score, which is what I usually use; this was mostly for fun and to see what adding higher dimensionality could do for a box score model).

Interesting to see the dataset you're using; I can now understand how you managed to get Caruso to rate so highly. He's an excellent player whom box score metrics really struggle to home in on, but with those additional stats in there I can see what your model has managed to latch onto.

Is this all mostly a curiosity for you, or do you intend for this work to go somewhere in particular?