
A few Questions about RAPM

Posted: Fri Dec 12, 2014 11:32 pm
by Blackmill
Hello, I recently became very interested in RAPM, and I was hoping to generate some multi-year and weighted results. However, I am fairly inexperienced at coding and at building more complex models, which brings me to my questions:

1. Firstly, I was wondering where I might be able to buy or download the needed data sets. I have found a few websites, but none provide play-by-play data going back further than 2005.
2. If I wanted to generate the average RAPM over multiple years (let's say 2001-2010), would I need to create one data set containing every season?
3. I have found the R code for basic RAPM, but I was wondering how to weight certain seasons, or just the playoffs, more heavily.

Code:

library(glmnet)

# Response and weight vectors taken from the stint-level data
Marg    <- data$MarginPer100   # point margin per 100 possessions (not used below)
Poss    <- data$Possessions
RebMarg <- data$RebRateHome - (100 - data$RebRateHome)

# Drop the non-player columns so only the player indicators remain in x
data$Possessions  <- NULL
data$RebRateHome  <- NULL
data$MarginPer100 <- NULL
x <- data.matrix(data)

# Cross-validate the penalty with alpha = 0 as well, so the chosen lambda
# actually corresponds to the ridge fit below; stints are weighted by possessions
cv <- cv.glmnet(x, RebMarg, weights = Poss, alpha = 0, nfolds = 5)
lambda.min <- cv$lambda.min

# Ridge regression at the chosen penalty
ridge <- glmnet(x, RebMarg, family = "gaussian", weights = Poss,
                alpha = 0, lambda = lambda.min)
coef(ridge, s = lambda.min)
4. I found that certain websites yield different RAPM results. For example, the 2002 data according to http://stats-for-the-nba.appspot.com is very different from https://sites.google.com/site/rapmstats/2002-rapm. Which website should I use to cross-reference my (single-season) results? Are the differences due to the priors that were chosen?
5. Looking at pre-existing RAPM data, I noticed that some seasons seem to have depressed results, and I was wondering how that might impact multi-year RAPM. Would it be best to normalize the results (standard deviations away from the mean), and if so, how would I do that in RStudio?

Thanks for any answers.

Re: A few Questions about RAPM

Posted: Sat Dec 13, 2014 12:43 am
by permaximum
1. Certain people have matchup and play-by-play data for 1996-2014, but they don't share it. At the moment there is only 2005-2012 on basketballvalue.com. If I had it, I would share it, but unfortunately I don't.
2. Yes.
3. There was something in R for different weighting, but I don't remember it at the moment. For now, just multiply the possessions (see the sketch below). Some people weight the playoffs twice; some don't differentiate them from the regular season.
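
A minimal sketch of that idea, assuming a hypothetical logical column data$Playoffs flags playoff stints (it would have to be dropped from data before building x) and Poss is the possession vector from the code in the first post:

Code:

# Double the weight of playoff stints; the result then replaces Poss as the
# weights argument in cv.glmnet/glmnet
w <- Poss * ifelse(data$Playoffs, 2, 1)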
4. http://stats-for-the-nba.appspot.com has a different version of RAPM. It's the most advanced one. However, even basic RAPM can differ because of the lambda used in the ridge regression. There are also different methods of calculating penalized regressions; for example, the results of the glmnet package should be a bit different from others.
5. You don't need to normalize the results of RAPM. You probably checked the values on http://stats-for-the-nba.appspot.com. That version involves box-score data (height and minutes included), previous seasons' RAPM and box-score data, and adjustments such as age, point margin, etc.

Re: A few Questions about RAPM

Posted: Sat Dec 13, 2014 2:22 am
by Blackmill
You mentioned that play-by-play data goes back to 1996, but http://stats-for-the-nba.appspot.com has values dating back to 1991. Are those results just estimates using ASPM? If so, for which years are the RAPM results "true" RAPM?

Never mind, I just found the thread which I think discusses how RAPM was computed for the 90s.

Re: A few Questions about RAPM

Posted: Sat Dec 13, 2014 9:24 am
by permaximum
On that site, 2001-2014 are based on real play-by-play data, so the 1996-2000 results are estimates as well, just like the earlier years. BTW, the person running that version of RAPM calls it xRAPM or RPM.

Re: A few Questions about RAPM

Posted: Sat Dec 13, 2014 11:46 am
by mystic
1. https://bbmetrics.wordpress.com/pbp-and ... 2001-2006/

Those are pbp and matchup files contained in 7z files (they are just renamed as gif to be able to upload them). The matchup files aren't particularly good (I assume they are the old matchup files once uploaded by J.E.), but the pbp files should be fine. Unfortunately, I had an HDD crash last year when I moved and was not able to recover the better matchup files. I did not check each of those pbp files, but they all seem to be OK.

Other than that, you can find pbp on NBA.com starting from the 1997 season, like this one: http://stats.nba.com/game/#!/0029600596/playbyplay/
Also, basketball-reference.com has pbp starting in 2001.

2. It depends. If you include a useful development curve, you can run the regression on a multiyear file (containing all of the wanted years). If you don't have such a development curve, you will have trouble with player development and will see shifts for players (positive or negative) over that interval (if you understand the superposition principle for waves, you likely understand that issue better). You could also run the regression on a yearly basis, then normalize the values and find an appropriate linear combination of those years. The last possibility is running the regression while using the prior season's results as a prior vector; if I understand the description for glmnet correctly, that would be the offset vector you can include (a sketch follows below).
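
A minimal sketch of that last option, assuming x, Marg and Poss are built as in the first post's code, and prior.rapm is a hypothetical named vector of last season's coefficients whose names all appear in colnames(x):

Code:

library(glmnet)

# Spread the prior over all player columns; players without a prior get 0
prior <- setNames(rep(0, ncol(x)), colnames(x))
prior[names(prior.rapm)] <- prior.rapm

# Offset per stint: the margin implied by the prior ratings of the players
# on the floor (x codes home players +1, away players -1)
off <- as.numeric(x %*% prior)

# The ridge regression now estimates deviations from the prior
cv    <- cv.glmnet(x, Marg, weights = Poss, offset = off, alpha = 0, nfolds = 5)
ridge <- glmnet(x, Marg, family = "gaussian", weights = Poss,
                offset = off, alpha = 0, lambda = cv$lambda.1se)
dev   <- as.matrix(coef(ridge, s = cv$lambda.1se))[-1, 1]   # drop the intercept

# Prior-informed estimate = prior + estimated deviation
pi.rapm <- prior + dev

Because the penalty shrinks the deviations toward zero, the coefficients are pulled toward the prior instead of toward 0.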

3. You already use weights for the possessions, so if you want to use glmnet, you would need to apply the desired additional weight to that possession vector before running the regression. Another hint here: the idea of calculating a lambda via cross-validation is that you do it once on a sufficient sample and then reuse that lambda for the ridge regression. That saves computing time, and when the lambda is calculated correctly, it is supposed to work for similar samples. It makes no sense to run the CV again and again and then the ridge on top of it, because the CV already calculates the coefficients for the different lambda values anyway, and for a big multiyear sample that may take your computer hours to come up with a result. Also, glmnet tends to calculate the lambda in a way that overfitting may occur when using lambda.min (especially with bigger sample sizes); it may even end up not much different from OLS, with the same kinds of issues involved. I strongly suggest using lambda.1se instead, because that is the biggest lambda for which the error is within one standard deviation of the best fit. That should give you sufficiently reliable coefficient sets when running the ridge regression. (A sketch combining these hints follows the links below.)
Another hint: check out parallel computing with R. cv.glmnet has a parallel option, and there are some packages that enable it. Here are some links:
http://cran.r-project.org/web/views/Hig ... uting.html
http://www.rparallel.org/
http://cran.r-project.org/web/packages/ ... index.html

Additionally, you may not want to use all of the cores for such a procedure, as that may freeze up your computer. And depending on the sample size, you may need to handle the memory limits manually; read up on that topic if needed. (That is all written under the assumption that you don't have a modern supercomputer at your disposal.)
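
A minimal sketch combining those hints, assuming x and Marg are built as in the first post's code and w is a possession-based weight vector that already contains any extra playoff/season weighting:

Code:

library(glmnet)
library(doParallel)

# Register a parallel backend, but leave a couple of cores for the system
registerDoParallel(cores = max(1, parallel::detectCores() - 2))

# Cross-validate once on a sufficiently large sample
cv <- cv.glmnet(x, Marg, weights = w, alpha = 0, nfolds = 5, parallel = TRUE)

# Reuse the more conservative lambda.1se for ridge fits on similar samples
ridge <- glmnet(x, Marg, family = "gaussian", weights = w,
                alpha = 0, lambda = cv$lambda.1se)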


4. and 5. First, when you want to compare different results, it is better to normalize the values (using z-scores should be the best way to do that; see the sketch further below). Beyond that, you get different results depending on the sample used (How good is the matchup file? How are low-possession players handled? What kind of weighting scheme is used for playoffs, close-game situations, garbage time?), on the chosen penalties (the lambda, and some also call elastic net/lasso results RAPM, which means they use an alpha != 0), and on the chosen distribution (glmnet lets you choose different distributions via "family"). Also, as you said, the results will differ when different prior values are used. Some may use boxscore-based information, prior season information, or even both (like xRAPM/RPM by J.E., now presented with daily updates on ESPN). There may also be various adjustments for age, coaches, player roles, etc. included, which obviously changes the outcome of those regressions. So I don't know whether there is a "best" RAPM available somewhere that you can use as a reference.
For prior-informed RAPM, I suspect these values are the reference: http://www.gotbuckets.com/statistics/rapm/2014-rapm/
talkingpractice is providing those RAPM values, and given their resources, they should have the best matchup files available; their posts here (and comments elsewhere) also suggest that they know what they are doing. Other than that, you can check your results via an out-of-sample test.
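
A minimal sketch of the z-score normalization mentioned under 4. and 5., assuming rapm is a numeric vector of one season's estimated player coefficients:

Code:

# Express each season's results as standard deviations from that season's mean
rapm.z <- (rapm - mean(rapm)) / sd(rapm)
# scale() does the same but returns a one-column matrix; drop it to a plain vector
rapm.z <- as.numeric(scale(rapm))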

Regarding http://stats-for-the-nba.appspot.com/: the values (except for those specifically named pure RAPM and pure SPM) are xRAPM, which uses a prior based on boxscore information as well as prior season results. J.E. only has real pbp data for 2001 and later; everything earlier is based on simulated matchups. xRAPM (and subsequently RPM) represents an improvement over pure RAPM in terms of predictive power, so comparing your pure RAPM results to those values doesn't make much sense. J.E. previously presented pure RAPM and prior-informed RAPM on his website, and some of that is contained in colts18's Google spreadsheet and can also be found here: http://shutupandjam.net/nba-ncaa-stats/npi-rapm/ (both the NPI and PI RAPM seem to be collected from J.E., but that might be a better overview than the Google spreadsheet). Those NPI and PI RAPM values might be a good starting point for your desired cross-reference.