How To Calculate RAPM

Posted by Jacob Frankel on Apr 21, 2014 | 1 comment

Flickr | adavey


The general concepts behind ESPN’s new stat, RPM, have been pretty well covered. Here’s their introductory post and here’s Kevin Ferrigan with some more on how it works. But I’ve seen a lot of people wondering about the actual mechanics of how it’s calculated, so I’ll delve into that, showing people how to actually calculate the bare bones form of RAPM. View it as an updated (for RAPM) and more easily implementable version of this breakdown of APM by Eli Witus, now of the Houston Rockets.

We’ll calculate RAPM from the 2010-11 season and playoffs, the last non-lockout season for which matchup data is available on basketballvalue. I’ve already cleaned up the data, and you can download it in zipped form here.

The basis of the APM approach is a massive regression. A regression is a statistical method that explains how one or more variables (the independent variables) affect another, single variable (the dependent variable). The APM regression tries to explain margin of victory (the dependent variable) over stints, stretches of possessions with no substitutions, using who is on the court (the independent variables). In our dataset there are about 37,000 of these stints, each of which shows up as a row in the Excel sheet you downloaded above. The regression also weights each stint by its number of possessions. The ridge element of the regression shrinks all the independent-variable coefficients toward zero.
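To make those mechanics concrete, here is a minimal sketch (in Python, outside the original Excel-and-R workflow) of a possession-weighted ridge regression on a few made-up stints: three hypothetical players, +1/-1/0 on-court indicators as the independent variables, margin per 100 possessions as the dependent variable, and possessions as the weights. It uses the closed form beta = (X'WX + lambda*I)^(-1) X'Wy rather than any particular library.

```python
# Toy weighted ridge regression mirroring the RAPM setup (hypothetical data).
# Each row of X is a stint: +1 = player on home unit, -1 = away unit, 0 = off floor.

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def ridge(X, y, w, lam):
    """Closed-form weighted ridge: (X'WX + lam*I)^(-1) X'Wy."""
    n, p = len(X), len(X[0])
    XtWX = [[sum(X[i][a] * w[i] * X[i][b] for i in range(n)) + (lam if a == b else 0.0)
             for b in range(p)] for a in range(p)]
    XtWy = [sum(X[i][a] * w[i] * y[i] for i in range(n)) for a in range(p)]
    return solve(XtWX, XtWy)

# Three hypothetical players; y is margin per 100 possessions, w is possessions.
X = [[1, -1, 0],   # player 0 (home) vs player 1 (away)
     [1, 0, -1],   # player 0 (home) vs player 2 (away)
     [0, 1, -1]]   # player 1 (home) vs player 2 (away)
y = [8.0, 4.0, -3.0]
w = [20, 35, 15]
coefs = ridge(X, y, w, lam=10.0)  # each coefficient is that player's "RAPM"
```

Raising `lam` shrinks every coefficient further toward zero; that shrinkage is the entire difference between RAPM and plain APM.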

Now we have to set up the independent variables. Each of the 400 players who appear in our dataset is an independent variable, and each gets its own column. Download this sheet, which has a list of player IDs. Copy that list and paste it with the transpose option along the top row of the initial sheet (it should fill J1:RC1), then fill in this formula in cells J2:RC37264:

=IF(ISNUMBER(FIND(" "&J$1&" ",$B2)),1,IF(ISNUMBER(FIND(" "&J$1&" ",$C2)),-1,0))

This step is pretty processor intensive, so it’s best to close other running applications and fill in the formula in chunks. The formula looks at the two units on the floor and puts a 1 if the player for that column is on the home unit, a -1 if he’s on the away unit, and a 0 if he’s not on the floor at all. At this point your sheet should look something like this:
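If you’d rather build the design matrix outside Excel, the same indicator logic is easy to script. Here’s a sketch in Python with made-up player IDs, doing exactly what the FIND formula above does:

```python
# Build one row of the RAPM design matrix from a stint's two lineups.
def stint_row(player_ids, home_unit, away_unit):
    """+1 if the player is in the home unit, -1 if in the away unit, else 0."""
    row = []
    for pid in player_ids:
        if pid in home_unit:
            row.append(1)
        elif pid in away_unit:
            row.append(-1)
        else:
            row.append(0)
    return row

# Hypothetical IDs: players 101 and 202 are home, 303 is away, 404 sits.
players = ["101", "202", "303", "404"]
home = {"101", "202"}
away = {"303"}
print(stint_row(players, home, away))  # [1, 1, -1, 0]
```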

Now we’re going to actually calculate RAPM. I do it in R, the free, open-source statistical program. It takes a bit of time to get used to, especially for those who, like me, have trouble thinking about stats outside the context of a spreadsheet. But R is an incredibly powerful tool, and even though you don’t need to know much about it to calculate RAPM, I’d encourage everybody to at least try it out. You can download R at the above link, and it should be pretty easy to get running.

You’ll need to install one package to calculate RAPM, the glmnet package. Installing it is as simple as entering this command in R:

install.packages("glmnet")

You’ll also need to get the data we’ve put together above into a more importable format. Open a new Excel file and copy and paste the possession, rebound rate, and margin columns. Then copy all the player columns and use the paste-values option to put those columns’ values into the new document. Save this file as a CSV. If you want to skip all the above steps, you can download that CSV here. Now we can run the regression. Here’s my R code, with annotations after the pound signs:

library(glmnet) #load the glmnet package

data=read.csv("/users/jfrankel16/Desktop/importrapm.csv") #imports the csv as a data frame. you'll have to change the file path

Marg=data$MarginPer100 #create a separate vector for margin

Poss=data$Possessions #create a separate vector for possessions

RebMarg=(data$RebRateHome-(100-data$RebRateHome)) #create a separate vector for rebound rate differential

data$Possessions=NULL #remove the possessions column from the data frame

data$RebRateHome=NULL #remove the home rebound rate column from the data frame

data$MarginPer100=NULL #remove the margin column from the data frame

x=data.matrix(data) #turn the data frame (which is now just 1s, -1s, and 0s) into a matrix

lambda=cv.glmnet(x,Marg,weights=Poss,alpha=0,nfolds=5) #cross-validate to find lambda, which determines how far toward 0 the coefficients are shrunk. alpha=0 so the search is over ridge fits

lambda.min=lambda$lambda.min #store the lambda value that gives the smallest cross-validation error in an object called lambda.min

ridge=glmnet(x,Marg,family="gaussian",weights=Poss,alpha=0,lambda=lambda.min) #run the ridge regression. x is the matrix of independent variables, Marg is the dependent variable, Poss are the weights. alpha=0 indicates the ridge penalty

coef(ridge,s=lambda.min) #extract the coefficient for each of the independent variables (players) at the lambda with the minimum error

That last command spits out all the player IDs along with their coefficients for predicting point margin, a.k.a. RAPM. You can then copy and paste those into the player ID spreadsheet and match the RAPMs with names.

Here were my top-20 for the 2010-11 season and playoffs:

There’s a Jeff Foster here and a Chase Budinger there, but subjectively everything looks pretty good.

Remember, this is just the bare-bones RAPM framework. There are endless possibilities as to what can be incorporated. Coaches, arenas, number of rest days, and point differential at the time of the possession can all be added. You can experiment with different lambda values. And the “x” in xRAPM comes from a box score prior: the box score gives us a decent amount of knowledge about how good players are, so instead of the ridge penalty regressing everybody to zero, it regresses them toward our prior knowledge of their abilities.
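One common way to implement that kind of prior (a sketch of the general technique, not necessarily how RPM itself does it) is to fit the ridge to the residual y - X*prior and then add the prior back, so that low-minute players land near their box-score estimate instead of near zero. Here’s a one-player toy in Python with made-up numbers, using the one-variable closed form for weighted ridge:

```python
# One-variable weighted ridge: beta = sum(w*x*y) / (sum(w*x^2) + lam).
def ridge_1d(x, y, w, lam):
    num = sum(wi * xi * yi for xi, yi, wi in zip(x, y, w))
    den = sum(wi * xi * xi for xi, wi in zip(x, w)) + lam
    return num / den

x = [1, 1, -1]          # on-court indicator for one hypothetical player
y = [5.0, 3.0, -4.0]    # observed margins for those stints
w = [10, 10, 10]        # possessions
prior = 4.0             # hypothetical box-score estimate of his impact

plain = ridge_1d(x, y, w, lam=50.0)  # penalty pulls the estimate toward 0

# Prior trick: regress the residual, then add the prior back.
resid = [yi - xi * prior for xi, yi in zip(x, y)]
toward_prior = prior + ridge_1d(x, resid, w, lam=50.0)  # pulled toward 4.0 instead
```

Here the unpenalized fit happens to agree with the prior exactly, so `toward_prior` stays at 4.0 while `plain` gets dragged down to 1.5; with real data the result lands between the two.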

The RAPM framework can be used for other stats too, like rebounding. You can sub the RebMarg vector in where Marg is used to calculate a rebounding plus-minus. There, somebody like Nene can get his due: while he ranked only 90th in the league in TRB%, the ridge regression estimates that he makes a 7.5% impact on TRB% margin, which ranks 17th in the league. Jeremias Engelmann, the guy behind RPM, has posted all sorts of cool stuff like this at his site.

I hope this explainer can become a resource for those interested in calculating their own RAPM and those just curious about the actual mechanics behind it. If you have any questions or if you try to repeat what I did and something goes wrong, let me know.

Statistics: Posted by Crow — Thu Jul 30, 2015 7:54 pm


Statistics: Posted by Crow — Thu Jul 30, 2015 7:27 pm


Statistics: Posted by EvanZ — Thu Jul 30, 2015 4:54 pm


Your article had not been shared here previously (to my recollection) and I hadn't seen it.

There are a number of good deeper analytic articles out there that don't get posted here or discussed. Many get found outside here, but some don't. Always author choice, but the decline in article posting is one part of the current situation. I used to post more of them. I am less inclined to do that, for others. It is not that hard but it is not that hard for others either.

Statistics: Posted by Crow — Thu Jul 30, 2015 3:24 pm


Statistics: Posted by rlee — Thu Jul 30, 2015 2:52 pm


The way the Kings are being run right now is more of a farce than anything else. It's unfortunate that this happened to Dean

The list of missteps this franchise has taken (under the new owner) in terms of trades/signings is long. Sometimes they're so bad that I wonder whether Kings management is trying to entertain all the non-Kings fans

It's insane. Why/how do so many teams make so many obvious bad decisions?

I hope they have to pay him every penny of that contract no matter what he does now.

Statistics: Posted by Statman — Thu Jul 30, 2015 1:41 pm


I am not ready to do a 2016 title contention analysis. This thread can be used for that, if / when any are inclined. But I wanted a place to put the following data, fwiw.

Golden State's net offensive-minus-defensive efficiency rating in the playoffs was 4th best of the last ten titlists, but the change in net rating from regular season to playoffs was 3rd worst (the 2008 Celtics and 2007 Spurs had worse) at about -4. The last two repeat champs had about a -1 and 0 regular-season-to-playoff falloff in their first title seasons. That net change improved in one case and slipped in the other, but was still small. The 2000 Lakers had a -6 slippage, though. In 2001 the slippage was a massive 10 points. In 2002 it was a more modest -3.5. Regular season to playoff net rating change may not mean that much. Obviously it can be affected by strength of opponents faced and injuries in either period. But I thought I'd check to see if the data has a strong lean or not. It is a potentially worrisome sign, but it is apparently survivable for first titles or additional ones, especially if the regular season performance base is real strong. GSW comes out looking conflicted and middling among recent champs by these data points.

Golden State had an adjusted efficiency differential of +13.7, one of the better playoff runs recently. See my post at http://godismyjudgeok.com/DStats/2015/n ... m-ratings/

That's several points above their regular season number, which was one of the best of all time.

Got to account for strength of schedule!

Statistics: Posted by DSMok1 — Thu Jul 30, 2015 12:43 pm


This has the potential to make a big impact on our community. Thank you so incredibly much Evan!

Statistics: Posted by ampersand5 — Thu Jul 30, 2015 3:43 am


Statistics: Posted by Crow — Wed Jul 29, 2015 10:02 pm


I think this is a good demonstration for organizations, but I agree that using the machine learning labs on AWS or Azure is a good place for the enthusiast who probably doesn't have their own personal cluster. I just started using what Azure has for a Kaggle project and have enjoyed it thus far.

Statistics: Posted by nileriver — Wed Jul 29, 2015 8:23 pm


I see that AWS means Amazon Web Services http://aws.amazon.com/free/

I haven't gone that far yet.

Statistics: Posted by Crow — Wed Jul 29, 2015 6:52 pm


I think AWS gives something like 750 free compute hours to new users. You could try that out, if you haven't already used them up.

Statistics: Posted by EvanZ — Wed Jul 29, 2015 6:41 pm
