Guides to Creating RAPM

ampersand5 · Post by **ampersand5** » Tue Feb 24, 2015 1:36 am

As far as I know, these are all the guides that I can find for creating RAPM

The latest by Evan Zamir (2015):
http://nbviewer.ipython.org/gist/EvanZ/ ... eb14f28d58

This is for the 2014-2015 season and onwards

Jacob Frankel's guide to creating RAPM (2014):
http://www.hickory-high.com/how-to-calculate-rapm/

This theoretically works for all seasons, as it relies on using your own play-by-play data (Frankel uses only play-by-play data from BasketballValue, which is no longer updated)

Eli Witus's guide for creating the original adjusted plus minus stat (2008) using basketballvalue data:
http://www.countthebasket.com/blog/2008 ... lus-minus/

kpascual wrote: I would also like to point out that my NBA data scraping scripts are open-sourced, so you can get the data on your own computer. https://github.com/kpascual/nbascrape

------
As far as I know, these are all of the guides for creating RAPM. Please let me know if there are any others that I have missed. Moreover, if anyone else wants to include a guide to scraping data so users can create their own play-by-play stats (guides already exist, I just don't have any links right now), please feel free.
Similarly, if anyone has any comments or feedback on these guides, please post them here. If anyone wants to try to follow these guides and post their progress here, that would also be most welcome.

Crow · Post by **Crow** » Tue Feb 24, 2015 4:21 am

Good to tack this reference up. Thanks.

dwm8 · Post by **dwm8** » Thu Jul 30, 2015 6:45 pm

Does anyone have an alternative way to access Jacob Frankel's guide on Hickory High? The site appears to be down. On this same note, anyone know why the site isn't working?

Crow · Post by **Crow** » Thu Jul 30, 2015 7:27 pm

http://nyloncalculus.com/2015/07/29/gue ... -a-how-to/

Crow · Post by **Crow** » Thu Jul 30, 2015 7:45 pm

I sent Jacob a tweet asking for a copy.

Crow · Post by **Crow** » Thu Jul 30, 2015 7:54 pm

https://web.archive.org/web/20140717131 ... late-rapm/

How To Calculate RAPM

Posted by Jacob Frankel on Apr 21, 2014 | 1 comment

Flickr | adavey

The general concepts behind ESPN’s new stat, RPM, have been pretty well covered. Here’s their introductory post and here’s Kevin Ferrigan with some more on how it works. But I’ve seen a lot of people wondering about the actual mechanics of how it’s calculated, so I’ll delve into that, showing people how to actually calculate the bare bones form of RAPM. View it as an updated (for RAPM) and more easily implementable version of this breakdown of APM by Eli Witus, now of the Houston Rockets.

We’ll calculate RAPM from the 2010-11 season and playoffs, the last non-lockout year that matchup data is available on basketballvalue. I’ve already cleaned up the data, and you can download it in zipped form here.

The basis of the APM approach is a massive regression. A regression is a statistical method to explain how one or more variables (the independent variables) affect another single variable (the dependent variable). The APM regression tries to explain margin of victory (the dependent variable) over stints of possessions with no substitutions, using who is on the court (the independent variables). In our dataset there are about 37,000 of these stints, each of which will show up as a row on the excel sheet you downloaded above. The regression also weights each stint by its number of possessions. The ridge element of the regression regresses all the independent variable coefficients farther towards zero.
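To make the shrinkage concrete, here is a minimal sketch in base R of a possession-weighted ridge fit on made-up data. The stints, margins, possession counts, and the `ridge_fit` helper are all hypothetical, and the closed-form solve is not what glmnet does internally (glmnet scales lambda differently and fits an intercept), so the lambda values here are not comparable to glmnet's.

```r
# Toy stints (hypothetical): +1 = home unit, -1 = away unit, 0 = off the floor.
x <- matrix(c( 1, -1,  0,  0,  0,
               1,  0, -1,  0,  0,
               0,  1, -1,  0,  0,
               0,  0,  1, -1,  0,
               1,  0,  0, -1,  0,
               0,  0,  0, -1,  1),
            ncol = 5, byrow = TRUE,
            dimnames = list(NULL, c("p1", "p2", "p3", "p4", "p5")))
margin <- c(5, 8, 2, -4, 3, 200)    # margin per 100 possessions in each stint
poss   <- c(40, 35, 30, 25, 30, 1)  # possessions per stint, used as weights

# Weighted ridge in closed form: solve (X'WX + lambda * I) b = X'Wy.
ridge_fit <- function(x, y, w, lambda) {
  xtwx <- crossprod(x, w * x)  # X'WX; w * x scales row i of x by w[i]
  xtwy <- crossprod(x, w * y)  # X'Wy
  b <- drop(solve(xtwx + lambda * diag(ncol(x)), xtwy))
  names(b) <- colnames(x)
  b
}

b_weak   <- ridge_fit(x, margin, poss, lambda = 0.001)  # almost no penalty
b_strong <- ridge_fit(x, margin, poss, lambda = 50)     # heavier penalty

print(round(rbind(b_weak, b_strong), 1))
```

Player p5 appears in a single one-possession stint with a huge margin. With almost no penalty his rating is enormous; with the heavier penalty it is pulled most of the way back towards zero, while the players with many possessions move much less.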

Now we have to set up the independent variables. Each of the 400 players who play in our dataset is an independent variable and there’s a column for each independent variable. Download this sheet, that has a list of player IDs. Copy and use the transpose paste option to put that list along the top row of the initial sheet (should be J1:RC1), then fill in this formula in cells J2:RC37264:

=IF(ISNUMBER(FIND(" "&J$1&" ",$B2)),1,IF(ISNUMBER(FIND(" "&J$1&" ",$C2)),-1,0))

This step is pretty processor intensive, so it’d be best to close other running applications and fill the formula in in chunks. The formula looks at the two units on the floor and puts a one if the player for that column is on the home unit, negative-one if he’s on the away unit, and zero if he’s not on the floor at all. At this point every player column in your sheet should hold a 1, -1, or 0 in each row.
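For anyone who would rather skip the spreadsheet fill, the same 1 / -1 / 0 matrix can be sketched in base R. This is only an illustration under assumptions: the player IDs and unit strings below are invented, and `on_court` is a hypothetical helper, not part of the original guide.

```r
# Hypothetical stint data: each unit is a space-separated string of player IDs,
# mirroring the home (B) and away (C) unit columns the formula searches.
home <- c("101 102 103 104 105", "101 102 103 104 106")
away <- c("201 202 203 204 205", "201 202 203 204 105")
ids  <- c("101", "102", "103", "104", "105", "106",
          "201", "202", "203", "204", "205")

# One row per stint, one column per player: 1 if the ID is in the unit, else 0.
on_court <- function(units, ids) {
  split_units <- strsplit(trimws(units), " +")
  flags <- vapply(split_units, function(u) as.integer(ids %in% u),
                  integer(length(ids)))
  t(flags)  # vapply returns players x stints, so transpose
}

# Home players get +1, away players get -1, everyone else stays 0.
x <- on_court(home, ids) - on_court(away, ids)
colnames(x) <- ids
print(x)
```

Matching whole IDs with `%in%` plays the same role as padding the ID with spaces in the FIND formula: it keeps an ID like 10 from matching inside 101.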

Now, we’re going to actually calculate RAPM. I do it in the free, open-source statistical program R. It takes a bit of time to get used to, especially for those, like me, who have trouble thinking of stats outside of the context of a spreadsheet. But R is an incredibly powerful tool, and even though you don’t need to know much about it to calculate RAPM, I’d encourage everybody to at least try it out. You can download R at the above link, and it should be pretty easy to get running.

You’ll need to install one package to calculate RAPM, the glmnet package. Installing it is as simple as entering this command in R:

install.packages("glmnet")

You’ll also need to get the data we’ve put together above into a more importable setting. Open a new excel file and copy and paste the possession, rebound rate, and margin columns. Copy all the player columns and use the paste values option to put those columns values into the new document. Save this file as a csv. If you want to skip all the above steps, you can download that CSV here. Now we can run the regression. Here’s my R code, with annotations behind the pound symbols:

library(glmnet) #load glmnet package
data=read.csv("/users/jfrankel16/Desktop/importrapm.csv") #imports csv and makes it a data frame. you'll have to change the file directory
Marg=data$MarginPer100 #create a separate vector for margin
Poss=data$Possessions #create a separate vector for possessions
RebMarg=(data$RebRateHome-(100-data$RebRateHome)) #create a separate vector for rebound rate differential
data$Possessions=NULL #remove the possessions column from the data frame
data$RebRateHome=NULL #remove the home rebound rate column from the data frame
data$MarginPer100=NULL #remove the margin column from the data frame
x=data.matrix(data) #turn the data frame (which is now just 1s, -1s, and 0s) into a matrix
lambda=cv.glmnet(x,Marg,weights=Poss,alpha=0,nfolds=5) #cross-validate to find the lambda values. these determine how far towards 0 the coefficients are shrunk. alpha=0 so lambda is tuned for the ridge penalty rather than glmnet's default lasso (alpha=1)
lambda.min=lambda$lambda.min #store the lambda value that gives the smallest error in an object called lambda.min
ridge=glmnet(x,Marg,family="gaussian",weights=Poss,alpha=0,lambda=lambda.min) #run the ridge regression. x is the matrix of independent variables, Marg is the dependent variable, Poss are the weights. alpha=0 indicates the ridge penalty.
coef(ridge,s=lambda.min) #extract the coefficient for each of the independent variables (players) for the lambda with the minimum error


That last command spits out all the player IDs along with their coefficients in predicting point margin, AKA RAPM. You can then just copy and paste this into the player ID spreadsheet and match the RAPMs with names.
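The matching step can also be done in R instead of by copy and paste. The sketch below uses invented numbers and names: `coefs` stands in for a named coefficient vector such as `as.matrix(coef(ridge))[, 1]` would give, and `players` stands in for the player ID sheet. It assumes the columns came in through `read.csv` with default settings, which prefixes numeric column headers with an "X".

```r
# Hypothetical coefficients, named the way read.csv names numeric headers.
coefs <- c("(Intercept)" = 1.2, X101 = 3.4, X102 = -0.7, X201 = 0.9)

# Hypothetical player ID sheet.
players <- data.frame(id = c(101, 102, 201),
                      name = c("Player A", "Player B", "Player C"),
                      stringsAsFactors = FALSE)

# Drop the intercept, strip the "X" prefix, and join on player ID.
coefs   <- coefs[names(coefs) != "(Intercept)"]
ratings <- data.frame(id = as.numeric(sub("^X", "", names(coefs))),
                      rapm = unname(coefs))
ranked  <- merge(players, ratings, by = "id")
ranked  <- ranked[order(-ranked$rapm), ]  # best rating first

print(ranked, row.names = FALSE)
```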

Here were my top-20 for the 2010-11 season and playoffs:

There’s a Jeff Foster here, a Chase Budinger, but subjectively everything looks pretty good.

Remember, this is just the bare bones RAPM framework. There are endless possibilities as to what can be incorporated. Coaches, arenas, number of rest days, point differential at the time of the possession can all be added. You can experiment using different lambda values. And the “x” in xRAPM comes from a box score prior. The box score gives us a decent amount of knowledge on how good players are, so instead of the ridge penalty regressing everybody to zero, it regresses them to our prior knowledge of their abilities.

The RAPM framework can be used for other stats too, like rebounding. You can sub the RebMarg vector in where Marg is used to calculate a rebounding plus-minus. In that, somebody like Nene can get his due. While he ranked only 90th in the league in TRB%, the ridge regression estimates that he makes a 7.5% impact on TRB% margin, which ranks 17th in the league. Jeremias Engelmann, the guy behind RPM, has posted all sorts of cool stuff like this at his site.

I hope this explainer can become a resource for those interested in calculating their own RAPM and those just curious about the actual mechanics behind it. If you have any questions or if you try to repeat what I did and something goes wrong, let me know.

Crow · Post by **Crow** » Fri Jul 31, 2015 12:27 am

Jacob sent this https://docs.google.com/document/d/1Tp5 ... obilebasic

dwm8 · Post by **dwm8** » Mon Aug 03, 2015 4:41 pm

Awesome, thanks for the help! I don't think I came across it in any of these guides, but does anyone know the steps required to add a prior to the regression?

sndesai1 · Post by **sndesai1** » Tue Aug 04, 2015 3:49 pm

based on reading the glmnet documentation, i would think it involves creating a vector of your priors and using it as the "offset" parameter

dwm8 · Post by **dwm8** » Tue Aug 04, 2015 6:52 pm

sndesai1 wrote:based on reading the glmnet documentation, i would think it involves creating a vector of your priors and using it as the "offset" parameter

Thanks for your response. I was looking through that as well, but noticed that it says the offset vector is "A vector of length nobs," which doesn't make sense to me. Wouldn't a vector for priors be a vector of length nvars, i.e. a vector with each value corresponding to a player, not a lineup?

sndesai1 · Post by **sndesai1** » Wed Aug 05, 2015 1:08 am

sorry, i totally skimmed over that.
i'm pretty sure my math is wrong, but would using the predicted margin based on priors for each 10 man stint as the offset work? and then once you finish running the regression, add the resulting coefficients for each player back to their priors? i might just be babbling nonsense...

maybe bayesglm using prior.mean?
https://cran.r-project.org/web/packages/arm/arm.pdf

EvanZ · Post by **EvanZ** » Wed Aug 05, 2015 4:10 am

I've never done priors with RAPM, but I would think you could just add a feature for each player for each season. So for example if you were doing 2-year RAPM, one matchup might be GSW vs CLE (from the finals):

Steph15 + b*Steph14 + Klay15 + b*Klay14 + Barnes15 + b*Barnes14 + ...

LeBron15 + b*LeBron14 + TT15 + b*TT14 + Mozgov15 + b*Mozgov14 + ...

So for each stint you would have 10 features representing players in 2015 and 10 features (weighted by a parameter b) representing the same players' values from 2014. The parameter b could be chosen by cross-validation and would represent some suitable time decay from one year to the next.

Alternatively, one could run a regression for the previous (single) year and add those ratings explicitly into the values for the features given above. Maybe this is the more obvious way to do it? Not sure if these two ways of doing it are equivalent or not.
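One possible reading of the first idea, sketched in base R with invented data: 2014 stints only touch the 2014 columns, while each 2015 stint carries a 1 in the player's 2015 column and b in his 2014 column. The players, stints, and the value of b are all hypothetical, and this layout is an interpretation of the description above rather than a tested method.

```r
# Hypothetical +1/-1/0 stint matrices for the same three players in two seasons.
players <- c("A", "B", "C")
x14 <- matrix(c( 1, -1,  0,
                 0,  1, -1),
              ncol = 3, byrow = TRUE,
              dimnames = list(NULL, paste0(players, "14")))
x15 <- matrix(c( 1,  0, -1,
                -1,  1,  0,
                 0,  1, -1),
              ncol = 3, byrow = TRUE,
              dimnames = list(NULL, paste0(players, "15")))
b <- 0.5  # time-decay weight; in practice it would be chosen by cross-validation

# 2014 rows: zeros in the 2015 columns, the usual entries in the 2014 columns.
zeros <- matrix(0, nrow(x14), ncol(x15), dimnames = list(NULL, colnames(x15)))

# 2015 rows: the usual entries in the 2015 columns, b times them in the 2014 columns.
decayed <- b * x15
colnames(decayed) <- colnames(x14)

design <- rbind(cbind(zeros, x14), cbind(x15, decayed))
print(design)
```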

mystic · Post by **mystic** » Fri Aug 07, 2015 11:14 am

dwm8 wrote:
sndesai1 wrote:based on reading the glmnet documentation, i would think it involves creating a vector of your priors and using it as the "offset" parameter
Thanks for your response. I was looking through that as well, but noticed that it says the offset vector is "A vector of length nobs," which doesn't make sense to me. Wouldn't a vector for priors be a vector of length nvars, i.e. a vector with each value corresponding to a player, not a lineup?

I also just saw your PM and will reply in more detail there, but I want to reply here to that specific question. You basically calculate the expected value of a specific matchup (5 vs. 5) and then subtract it from the real value in order to create a prior-informed RAPM. In order to do that in GLMNET you can create the vector of the prior values for each player and then combine that with the design matrix via matrix multiplication (design matrix %*% prior). This creates the necessary vector of length nobs, which is then used as the offset in order to calculate the prior-informed RAPM.
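A minimal base R sketch of that idea, with invented stints and prior ratings. It uses a closed-form ridge solve (the hypothetical `ridge_fit` helper) instead of glmnet so it runs without any packages; with glmnet, `p0` is the vector that would be passed as `offset`, and the fitted coefficients would be added back to the priors in the same way.

```r
# Toy stints (hypothetical): +1 = home unit, -1 = away unit, 0 = off the floor.
x <- matrix(c( 1, -1,  0,  0,
               1,  0, -1,  0,
               0,  1, -1,  0,
               0,  1,  0, -1,
               1,  0,  0, -1),
            ncol = 4, byrow = TRUE,
            dimnames = list(NULL, c("p1", "p2", "p3", "p4")))
margin <- c(6, 9, 1, 4, 7)       # margin per 100 possessions in each stint
poss   <- c(40, 35, 30, 25, 30)  # possessions per stint, used as weights
prior  <- c(p1 = 4, p2 = 0, p3 = -2, p4 = 1)  # hypothetical prior ratings

# Expected margin of each stint under the prior: one value per stint (length nobs).
p0 <- drop(x %*% prior)

# Weighted ridge in closed form: solve (X'WX + lambda * I) b = X'Wy.
ridge_fit <- function(x, y, w, lambda) {
  b <- drop(solve(crossprod(x, w * x) + lambda * diag(ncol(x)),
                  crossprod(x, w * y)))
  names(b) <- colnames(x)
  b
}

# Regress what the prior does not explain, then add the prior back.
delta      <- ridge_fit(x, margin - p0, poss, lambda = 20)
rapm_prior <- prior + delta
print(round(rbind(prior, delta, rapm_prior), 2))
```

Because the penalty now acts on `delta`, a heavier penalty pulls each player towards his prior rather than towards zero. The prior has to be in the same units as the margin (here, points per 100 possessions) for the subtraction to make sense.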

Assuming you want RAPM for year y with a prior based only on year x, you can instead add the data from year x to the set and run the regression on the whole sample. The results should be nearly identical, and the differences are explained by rounding.

I want to add that when using GLMNET for RAPM, I do the cross-validation with nfolds=10; that's when the resulting lambda is usually constant. Also, lambda.1se should be used for running the ridge regression in order to get better predictive values. Using lambda.min with a big enough sample would result in values not much different from using no lambda at all (which means APM values instead).
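For reference, the difference between the two choices can be sketched in base R. The numbers below are made up; `lambda`, `cvm`, and `cvsd` mimic the fields of the same names on a `cv.glmnet` result (candidate lambdas, mean cross-validated error, and its standard error).

```r
# Hypothetical cross-validation summary, largest lambda first.
lambda <- c(100, 50, 20, 10, 5, 1)
cvm    <- c(140, 128, 121, 118, 119, 123)  # mean cross-validated error
cvsd   <- c(4, 4, 4, 4, 4, 4)              # standard error of that mean

# lambda.min: the lambda with the lowest mean error.
i_min      <- which.min(cvm)
lambda_min <- lambda[i_min]

# lambda.1se: the largest lambda whose error is within one standard error
# of the minimum, i.e. the most shrinkage that is not clearly worse.
lambda_1se <- max(lambda[cvm <= cvm[i_min] + cvsd[i_min]])

print(c(lambda_min = lambda_min, lambda_1se = lambda_1se))
```

With these numbers lambda.1se is the larger of the two, so it shrinks the coefficients more than lambda.min does.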

dwm8 · Post by **dwm8** » Thu Aug 13, 2015 2:29 pm

Thanks for the help everyone, and especially thanks to mystic, from whom I've received a lot of tips via PM. I sent this message to him yesterday, but I thought I'd post it here in case anyone had an idea of what I may be doing wrong with my RAPM calculations.

So I downloaded the 2012 regular season matchup data from basketballvalue.com and attempted to calculate RAPM using that season's BPM as a prior. I cleaned up the data a bit in MATLAB to get all of the matrices in the correct form and then exported everything to R to analyze using glmnet. Unfortunately, my results were a bit wacky, and they can be seen in the link below:

https://docs.google.com/spreadsheets/d/ ... sp=sharing

That spreadsheet also includes tabs to show what my x, y, and weight matrices look like (the x matrix is only the first 250 rows of the lineup data, and I didn't have enough space to include the offset vector in the sheet). As you can see, Keith Benson and Hamady N'Diaye with +70 RAPM's lead me to believe I may have done something wrong. Have you gotten results that are usually this noisy? I expected some low-minutes guys to be all over the place, but even high-minute stars have weird results (LeBron is at 0.2, Durant at -4.4). Below is the exact code I used in R after importing x, y, weights, and p0:

Code: Select all

> lambda=cv.glmnet(x,y,weights=weights,offset=p0,nfolds=100)
> lambda.1se=lambda$lambda.1se
> ridge=glmnet(x,y,family=c("gaussian"),weights=weights,offset=p0,alpha=0,lambda=lambda.1se)
> coef(ridge,s=lambda.1se)

The results are equally bad if not worse when I exclude the offset vector. Do you see anything out of the ordinary? I know that single year RAPM is supposed to be noisy, but not this noisy, right?

DSMok1 · Post by **DSMok1** » Thu Aug 13, 2015 2:46 pm

dwm8 wrote:Thanks for the help everyone, and especially thanks to mystic, from whom I've received a lot of tips via PM. I sent this message to him yesterday, but I thought I'd post it here in case anyone had an idea of what I may be doing wrong with my RAPM calculations.

So I downloaded the 2012 regular season matchup data from basketballvalue.com and attempted to calculate RAPM using that season's BPM as a prior. I cleaned up the data a bit in MATLAB to get all of the matrices in the correct form and then exported everything to R to analyze using glmnet. Unfortunately, my results were a bit wacky, and they can be seen in the link below:

https://docs.google.com/spreadsheets/d/ ... sp=sharing

That spreadsheet also includes tabs to show what my x, y, and weight matrices look like (the x matrix is only the first 250 rows of the lineup data, and I didn't have enough space to include the offset vector in the sheet). As you can see, Keith Benson and Hamady N'Diaye with +70 RAPM's lead me to believe I may have done something wrong. Have you gotten results that are usually this noisy? I expected some low-minutes guys to be all over the place, but even high-minute stars have weird results (LeBron is at 0.2, Durant at -4.4). Below is the exact code I used in R after importing x, y, weights, and p0:
Code: Select all
> lambda=cv.glmnet(x,y,weights=weights,offset=p0,nfolds=100)
> lambda.1se=lambda$lambda.1se
> ridge=glmnet(x,y,family=c("gaussian"),weights=weights,offset=p0,alpha=0,lambda=lambda.1se)
> coef(ridge,s=lambda.1se)
The results are equally bad if not worse when I exclude the offset vector. Do you see anything out of the ordinary? I know that single year RAPM is supposed to be noisy, but not this noisy, right?

These results look like no regression to the prior/mean is occurring (i.e. this is pure APM). I'm almost certain that's what's happening. I don't know enough about the code to troubleshoot it though. Anyone else?

APBRmetrics
