https://web.archive.org/web/20140717131 ... late-rapm/
How To Calculate RAPM
Posted by Jacob Frankel on Apr 21, 2014
Flickr | adavey
The general concepts behind ESPN’s new stat, RPM, have been pretty well covered. Here’s their introductory post and here’s Kevin Ferrigan with some more on how it works. But I’ve seen a lot of people wondering about the actual mechanics of how it’s calculated, so I’ll delve into that, showing people how to actually calculate the bare bones form of RAPM. View it as an updated (for RAPM) and more easily implementable version of this breakdown of APM by Eli Witus, now of the Houston Rockets.
We’ll calculate RAPM from the 2010-11 season and playoffs, the last non-lockout year that matchup data is available on basketballvalue. I’ve already cleaned up the data, and you can download it in zipped form here.
The basis of the APM approach is a massive regression. A regression is a statistical method to explain how one or more variables (the independent variables) affect another single variable (the dependent variable). The APM regression tries to explain margin of victory (the dependent variable) over stints of possessions with no substitutions, using who is on the court (the independent variables). In our dataset there are about 37,000 of these stints, each of which will show up as a row on the Excel sheet you downloaded above. The regression also weights each stint by its number of possessions. The ridge element of the regression shrinks all the independent variable coefficients toward zero.
Now we have to set up the independent variables. Each of the 400 players who play in our dataset is an independent variable, and there’s a column for each one. Download this sheet, which has a list of player IDs. Copy it and use the transpose paste option to put that list along the top row of the initial sheet (should be J1:RC1), then fill in this formula in cells J2:RC37264:
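To make the weighting and the shrinkage concrete, here is a minimal base-R sketch of a possession-weighted ridge fit on a toy dataset. Everything in it is made up for illustration: the four players (`p1` to `p4`), the six stints, the margins, and the helper name `ridge_fit` are all assumptions, not part of the real data. It also leaves out the intercept, standardization, and lambda scaling that glmnet applies, so the numbers will not line up with glmnet output.

```r
# Toy data (made up): 6 two-on-two stints, 4 hypothetical players.
# +1 = on the floor for the home team, -1 = on the floor for the away team.
x <- matrix(c( 1,  1, -1, -1,
               1, -1,  1, -1,
               1, -1, -1,  1,
              -1, -1,  1,  1,
              -1,  1, -1,  1,
              -1,  1,  1, -1),
            ncol = 4, byrow = TRUE,
            dimnames = list(NULL, c("p1", "p2", "p3", "p4")))
margin <- c(12, 4, -3, -8, 2, -6)  # home margin per 100 possessions (made up)
poss   <- c(10, 6, 8, 12, 5, 9)    # possessions in each stint (made up)

# Weighted ridge in closed form: solve (X'WX + lambda * I) b = X'Wy,
# where W is the diagonal matrix of possession weights.
ridge_fit <- function(x, y, w, lambda) {
  xtwx <- t(x) %*% (w * x)  # w * x scales row i of x by w[i]
  xtwy <- t(x) %*% (w * y)
  drop(solve(xtwx + lambda * diag(ncol(x)), xtwy))
}

small <- ridge_fit(x, margin, poss, lambda = 1)     # light shrinkage
large <- ridge_fit(x, margin, poss, lambda = 1000)  # heavy shrinkage
print(round(rbind(small, large), 3))
```

The second row of the printout sits much closer to zero than the first, which is all the ridge penalty does: the larger the lambda, the harder every coefficient is pulled toward zero.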
=IF(ISNUMBER(FIND(" "&J$1&" ",$B2)),1,IF(ISNUMBER(FIND(" "&J$1&" ",$C2)),-1,0))
This step is pretty processor intensive, so it’d be best to close other running applications and fill the formula in in chunks. The formula looks at the two units on the floor and puts a one if the column’s player is on the home unit, a negative one if he’s on the away unit, and a zero if he’s not on the floor at all. At this point every player column in your sheet should contain only 1s, -1s, and 0s.
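If the spreadsheet chokes on that many formulas, the same 1/-1/0 matrix can be built in R instead. This is only a sketch under assumptions: `build_design` is a hypothetical helper, and it assumes each lineup is stored as a single string of space-separated player IDs, which is what the spreadsheet formula implies. The IDs in the example are invented.

```r
# Hypothetical helper: build the stint-by-player matrix from lineup strings.
# home / away: character vectors, one string of space-separated IDs per stint.
# player_ids: every player ID that should get a column.
build_design <- function(home, away, player_ids) {
  # Pad with spaces so an ID only matches as a whole token,
  # mirroring the " "&J$1&" " trick in the spreadsheet formula.
  home <- paste0(" ", home, " ")
  away <- paste0(" ", away, " ")
  ids <- as.character(player_ids)
  x <- matrix(0, nrow = length(home), ncol = length(ids),
              dimnames = list(NULL, ids))
  for (j in seq_along(ids)) {
    token <- paste0(" ", ids[j], " ")
    x[grepl(token, away, fixed = TRUE), j] <- -1  # on the away unit
    x[grepl(token, home, fixed = TRUE), j] <- 1   # on the home unit
  }
  x
}

# Two made-up stints; ID 1 never plays and must not match inside "101".
home <- c("101 102", "103 101")
away <- c("103 104", "102 104")
x <- build_design(home, away, c(101, 102, 103, 104, 1))
print(x)
```

The space padding matters: without it, a short ID like 1 would be "found" inside 101 and the matrix would be silently wrong.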
Now, we’re going to actually calculate RAPM. I do it in the free, open-source statistical program R. It takes a bit of time to get used to, especially for those, like me, who have trouble thinking of stats outside of the context of a spreadsheet. But R is an incredibly powerful tool, and even though you don’t need to know much about it to calculate RAPM, I’d encourage everybody to at least try it out. You can download R at the above link, and it should be pretty easy to get running.
You’ll need to install one package to calculate RAPM, the glmnet package. Installing it is as simple as entering this command in R:
install.packages("glmnet")
You’ll also need to get the data we’ve put together above into a more importable format. Open a new Excel file and copy and paste the possessions, rebound rate, and margin columns. Then copy all the player columns and use the paste values option to put those columns’ values into the new document. Save this file as a CSV. If you want to skip all the above steps, you can download that CSV here. Now we can run the regression. Here’s my R code, with annotations behind the pound symbols:
library(glmnet) #load glmnet package
data=read.csv("/users/jfrankel16/Desktop/importrapm.csv") #imports csv and makes it a data frame. you'll have to change the file directory
Marg=data$MarginPer100 #create a separate vector for margin
Poss=data$Possessions #create a separate vector for possessions
RebMarg=(data$RebRateHome-(100-data$RebRateHome)) #create a separate vector for rebound rate differential
data$Possessions=NULL #remove the possessions column from the data frame
data$RebRateHome=NULL #remove the home rebound rate column from the data frame
data$MarginPer100=NULL #remove the margin column from the data frame
x=data.matrix(data) #turn the data frame (which is now just 1s, -1s, and 0s) into a matrix
lambda=cv.glmnet(x,Marg,weights=Poss,nfolds=5,alpha=0) #find the lambda values. these determine how far towards 0 the coefficients are shrunk. alpha=0 so lambda is tuned for the ridge penalty
lambda.min=lambda$lambda.min #store the lambda value that gives the smallest error in an object called lambda.min
ridge=glmnet(x,Marg,family="gaussian",weights=Poss,alpha=0,lambda=lambda.min) #run the ridge regression. x is the matrix of independent variables, Marg is the dependent variable, Poss are the weights. alpha=0 indicates the ridge penalty.
coef(ridge,s=lambda.min) #extract the coefficient for each of the independent variables (players) for the lambda with the minimum error
That last command spits out all the player IDs along with their coefficients in predicting point margin, AKA RAPM. You can then just copy and paste this into the player ID spreadsheet and match the RAPMs with names.
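The copy-and-paste matching can also be done in R. The sketch below is an assumption-laden convenience, not part of the original workflow: `rapm_table` is a hypothetical helper, and it assumes the player ID sheet has been read into a data frame with columns called `id` and `name`. It strips the leading `X` because `read.csv` puts one in front of column names that start with a digit. The example runs on a small stand-in matrix with invented values, so it works without fitting anything.

```r
# Hypothetical helper: turn a one-column coefficient matrix into a ranked table.
# coefs: coefficient matrix with player IDs as row names.
# players: data frame with assumed columns `id` and `name`.
rapm_table <- function(coefs, players) {
  coefs <- as.matrix(coefs)
  out <- data.frame(id = rownames(coefs), rapm = coefs[, 1],
                    row.names = NULL, stringsAsFactors = FALSE)
  out <- out[out$id != "(Intercept)", ]  # the intercept is not a player
  out$id <- sub("^X", "", out$id)        # undo read.csv's X prefix
  players$id <- as.character(players$id)
  out <- merge(out, players, by = "id")
  out[order(-out$rapm), ]                # best players first
}

# Stand-in for the regression output (made-up numbers and names).
coefs <- matrix(c(0.5, 2.1, -1.3, 0.4), ncol = 1,
                dimnames = list(c("(Intercept)", "X101", "X102", "X103"), "s1"))
players <- data.frame(id = c(101, 102, 103),
                      name = c("Player A", "Player B", "Player C"))
tab <- rapm_table(coefs, players)
print(tab)
```

With the real fit, the call would look like `rapm_table(coef(ridge, s = lambda.min), players)`, and `head(tab, 20)` would give a top 20.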
Here were my top-20 for the 2010-11 season and playoffs:
There’s a Jeff Foster here, a Chase Budinger, but subjectively everything looks pretty good.
Remember, this is just the bare bones RAPM framework. There are endless possibilities as to what can be incorporated. Coaches, arenas, number of rest days, and point differential at the time of the possession can all be added. You can experiment using different lambda values. And the “x” in xRAPM comes from a box score prior. The box score gives us a decent amount of knowledge on how good players are, so instead of the ridge penalty regressing everybody to zero, it regresses them to our prior knowledge of their abilities.
The RAPM framework can be used for other stats too, like rebounding. You can sub the RebMarg vector in where Marg is used to calculate a rebounding plus-minus. In that, somebody like Nene can get his due. While he ranked only 90th in the league in TRB%, the ridge regression estimates that he makes a 7.5% impact on TRB% margin, which ranks 17th in the league. Jeremias Engelmann, the guy behind RPM, has posted all sorts of cool stuff like this at his site.
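One common way to shrink toward a prior instead of toward zero is to subtract what the prior already predicts, run the ordinary ridge on what is left, and add the prior back. The sketch below illustrates that idea in base R on made-up data. It is not the actual xRAPM implementation, which isn't spelled out here; `ridge_fit`, `ridge_with_prior`, the players, and the prior values are all hypothetical.

```r
# Weighted ridge in closed form (no intercept), as a building block.
ridge_fit <- function(x, y, w, lambda) {
  xtwx <- t(x) %*% (w * x)  # w * x scales row i of x by w[i]
  xtwy <- t(x) %*% (w * y)
  drop(solve(xtwx + lambda * diag(ncol(x)), xtwy))
}

# Shrink toward `prior` rather than zero: fit the part of the margin
# the prior does not explain, then add the prior back.
ridge_with_prior <- function(x, y, w, lambda, prior) {
  unexplained <- y - drop(x %*% prior)
  prior + ridge_fit(x, unexplained, w, lambda)
}

# Toy data (made up): 6 two-on-two stints, 4 hypothetical players.
x <- matrix(c( 1,  1, -1, -1,
               1, -1,  1, -1,
               1, -1, -1,  1,
              -1, -1,  1,  1,
              -1,  1, -1,  1,
              -1,  1,  1, -1),
            ncol = 4, byrow = TRUE,
            dimnames = list(NULL, c("p1", "p2", "p3", "p4")))
margin <- c(12, 4, -3, -8, 2, -6)
poss   <- c(10, 6, 8, 12, 5, 9)
prior  <- c(p1 = 2, p2 = 0, p3 = -1, p4 = 0.5)  # made-up box score ratings

light <- ridge_with_prior(x, margin, poss, lambda = 1, prior = prior)
heavy <- ridge_with_prior(x, margin, poss, lambda = 1e9, prior = prior)
print(round(rbind(prior, light, heavy), 3))
```

With a huge lambda the estimates collapse onto the prior rather than onto zero, and with a prior of all zeros the function reduces to the plain ridge fit.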
I hope this explainer can become a resource for those interested in calculating their own RAPM and those just curious about the actual mechanics behind it. If you have any questions or if you try to repeat what I did and something goes wrong, let me know.