Using Spark to calculate RAPM

EvanZ · Post by **EvanZ** » Sun Aug 09, 2015 4:28 am

bchaikin wrote: which is an honest straight-forward question - how should the results be interpreted? if someone is going to present a listing - and it is not a listing of who is better or who played better - then what is it a list of?...

I don't feel the need to say the same thing twice. I post for the people who are interested in calculating these types of ratings. They know who they are.

EvanZ · Post by **EvanZ** » Sun Aug 09, 2015 4:31 am

For now, I hope you think of this article more as a recipe or strategy than a final final thingy. It’s obviously not that. But the idea going forward should be pretty clear. Get a bunch of data and computers and modelize all the things. Winning basketball will ensue as a result.

As far as disclaimers go, I thought this one was fairly clear.

This post was about applying a new technology to a standard technique. If people have problems with RAPM, they can take it up with the dozens of other people over the years who have worked on it.

If there are Spark-related questions, I'm happy to entertain those.

mystic · Post by **mystic** » Sun Aug 09, 2015 12:30 pm

EvanZ wrote: Mike, the way to interpret the rankings is that they are the result of a calculation with very explicit (unbiased) methodology.

Well, mathematically speaking RAPM is explicitly biased, but we know the bias pretty well and it is entirely mathematically explained. Other than that your statement is exactly what my answer would look like: The numbers are the results of a calculation. We can specify the meaning by saying that those numbers are presenting the change the on-court presence of a player makes over an artificial average player to the team's overall result per 100 poss, but does not attempt to explain how such change is accomplished. In order to justify statements like "player a is better than player b" the rate of the change is not sufficient, because the "how" is not just trivial, but depends on various factors including but not restricted to the role and minutes a player gets assigned to by the coaches.
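As a brief aside on what "explicitly biased" means here, the following is the standard textbook form of the ridge estimator, not something taken from the thread. It assumes a linear model $y = X\beta + \varepsilon$ with $E[\varepsilon] = 0$, a fixed design matrix $X$ (rows are stints, columns are players), and a penalty $\lambda > 0$; all symbols are introduced only for illustration.

$$
\hat{\beta}_\lambda = (X^\top X + \lambda I)^{-1} X^\top y,
\qquad
E[\hat{\beta}_\lambda] - \beta = -\lambda \, (X^\top X + \lambda I)^{-1} \beta .
$$

The bias is zero only when $\lambda = 0$ (the unregularized case), and its size is a known function of $\lambda$ and $X$, which is one way to read the statement that the bias is well understood.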

Overall, and that's probably what the use of the word "unbiased" in your statement is for, there is no specific sentiment of good or bad included, unlike boxscore-based metrics, which give specific attributes to certain entries accompanied by the human perception of good or bad (like getting steals is good (positive value), turning the ball over is bad (negative value)). In fact the algorithm used for ridge regression does not know per se about such things at all. Players are treated like anonymous variables and the algorithm doesn't get changed based on the resulting coefficients for individual players.
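To illustrate the point that players enter the calculation only as anonymous variables, here is a minimal sketch of a closed-form ridge fit with NumPy. It is not Evan's Spark code: the function name `fit_rapm`, the 2-on-2 toy stint matrix, and the margins in `y` are all made up for illustration.

```python
import numpy as np

def fit_rapm(X, y, lam):
    """Closed-form ridge fit: solve (X^T X + lam * I) beta = X^T y."""
    n_players = X.shape[1]
    gram = X.T @ X + lam * np.eye(n_players)
    return np.linalg.solve(gram, X.T @ y)

# Toy stint matrix (hypothetical): one row per stint, one column per player.
# +1 = on court for the home side, -1 = on court for the away side.
X = np.array([
    [ 1.0,  1.0, -1.0, -1.0],
    [ 1.0, -1.0,  1.0, -1.0],
    [ 1.0, -1.0, -1.0,  1.0],
    [-1.0,  1.0,  1.0, -1.0],
])
# Made-up home-minus-away margin per 100 possessions for each stint.
y = np.array([10.0, -4.0, 6.0, -8.0])

beta = fit_rapm(X, y, lam=1.0)

# Relabelling the players (permuting columns) only permutes the coefficients:
# the fit has no notion of who a column is or what that player does on court.
perm = [2, 0, 3, 1]
beta_perm = fit_rapm(X[:, perm], y, lam=1.0)
print(np.allclose(beta_perm, beta[perm]))
```

A larger `lam` pulls every coefficient toward zero regardless of which player the column stands for, which is the sense in which the algorithm is not adjusted based on individual results.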

EvanZ wrote: If you would take the time to understand the methodology and view the results as simply that, maybe you could come to peace with RAPM.

I completely agree with that, and I tried to explain the math behind that in (for my taste) sufficient detail here: http://www.apbr.org/metrics/viewtopic.p ... 239#p15811

To give a specific answer to Mike:

Mike G wrote: Well please enlighten us on how to interpret the rankings?

It is not per se a ranking going from best to worst, but just a list of numbers, which are the result of the calculation and should be treated as such. The advantage those numbers are giving comes from the fact that they predict the outcome of future games better than other methods (specifically just OLS-based calculation or boxscore-based approaches). A player A with a higher number just changes the result of the game per 100 poss to a higher degree in his specific role and minutes than a player B with a lower number assigned to him. Interpreting such a situation as "player A is better than player B" reduces the meaning of "better" in terms of "playing basketball in a 5on5 situation" to a degree which the algorithm itself does not allow nor is designed to attempt.

bchaikin wrote: why did you post this if you did not want a discussion of its results? not everyone understands the methodology but can't we discuss the listing as you presented and ask what it means?

What those numbers mean was already said by Evan in the article itself, thus it makes little sense to discuss the values in a fashion which would exceed their validity range. And it is tough to discuss results in a meaningful way, if the methodology is not appreciated. What can we discuss? We can discuss whether Evan's approach gives a better prediction than other ridge regression based approaches. For that we would need a sufficient test, but ultimately that's what the usage of ridge regression is for, not to argue that player A is per se better than player B, because the coefficient says so.
We could also discuss whether specific changes to the algorithm can be made in order to increase the predictive power, like J.E. did with his RPM metric for example, which used additional information, which makes a determination on who is better as a player more plausible, even though a higher RPM value shouldn't be interpreted as a per se better player either. Certain nuances are just not incorporated and have to be applied during interpretation in order to make use of such numbers in a meaningful way (but that is rather trivial from my perspective and has to be applied to every numerical attempt at calculating a specific player value).
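One rough way such a predictive test could be set up is a simple hold-out comparison: fit on earlier stints, then score the predictions on later ones. The sketch below uses synthetic data only, and every name and number in it (`ridge_fit`, `holdout_rmse`, 30 players, 400 stints, the noise level) is an assumption for illustration, not anything from the article.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge coefficients for penalty lam."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def holdout_rmse(X, y, lam, n_train):
    """Fit on the first n_train stints, return RMSE on the remaining ones."""
    beta = ridge_fit(X[:n_train], y[:n_train], lam)
    resid = y[n_train:] - X[n_train:] @ beta
    return float(np.sqrt(np.mean(resid ** 2)))

# Synthetic stints (hypothetical): 5 home players (+1) and 5 away players (-1).
rng = np.random.default_rng(0)
n_stints, n_players = 400, 30
X = np.zeros((n_stints, n_players))
for row in X:
    on_court = rng.choice(n_players, size=10, replace=False)
    row[on_court[:5]] = 1.0
    row[on_court[5:]] = -1.0
true_beta = rng.normal(0.0, 2.0, n_players)
y = X @ true_beta + rng.normal(0.0, 12.0, n_stints)

# A very small penalty approximates the unregularized (APM-style) fit.
scores = {lam: holdout_rmse(X, y, lam, n_train=300) for lam in (1e-3, 10.0, 100.0)}
for lam, rmse in scores.items():
    print(f"lambda={lam:g}  hold-out RMSE={rmse:.2f}")
```

With real play-by-play data the split would more naturally be by date or season, and the penalty would be chosen by cross-validation rather than from a fixed list.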

bchaikin · Post by **bchaikin** » Sun Aug 09, 2015 6:00 pm

We can specify the meaning by saying that those numbers are presenting the change the on-court presence of a player makes

that a single player "makes"? this is a bold claim - or is it in reality simply that which "occurs" when a single player plays in specific time intervals? inferring the first is far different than what the second actually measures...

but does not attempt to explain how such change is accomplished

"does not attempt" and "can't" are two different things, which in fact is it? can this methodology "attempt" to explain why players have high or low numbers as presented?...

In order to justify statements like "player a is better than player b" the rate of the change is not sufficient, because the "how" is not just trivial, but depends on various factors including but not restricted to the role and minutes a player gets assigned to by the coaches.

what "various factors" are you alluding to? if 2 players have identical raw stats, including player defense outside of steals, blocks, and defensive rebounding (as best we can measure it), can they have different "numbers" in your methodology? even widely different numbers? and if the answer is yes, how does your methodology explain that?...

The advantage those numbers are giving comes from the fact that they predict the outcome of future games better than other methods

again this is a bold claim - just how would you explain this to an NBA GM or coach?...

A player A with a higher number just changes the result of the game per 100 poss to a higher degree in his specific role and minutes than a player B with a lower number assigned to him. Interpreting such a situation as "player A is better than player B" reduces the meaning of "better" in terms of "playing basketball in a 5on5 situation" to a degree which the algorithm itself does not allow nor is designed to attempt.

so does this mean the methodology, based on the numbers presented, says a team won more last season with lamarcus aldridge, dirk nowitzki, zach randolph, or draymond green, than with anthony davis, based on last season's numbers?...

And it is tough to discuss results in a meaningful way, if the methodology is not appreciated.

is this how you would approach an NBA GM or coach? i am often asked by my employer whether any of the numerous iterations of adjusted plus/minus should be considered, and in playing devil's advocate the responses i get are often more defensive and evasive than explanatory...

how does one do due diligence to appreciate the results of this methodology? telling my superiors they have to take the time to "appreciate" the methodology doesn't fly - they want results they themselves can interpret...

when i try to explain for example espn's real plus/minus to them, but they then see zaza pachulia listed 16th in the league (and 2nd among Cs), the conversation stops. so just what is that ranking supposed to be telling us?...

it listed pachulia as 5th in ORPM among Cs - despite shooting less than 46% on 2s, not being able to draw a foul, and with a low scoring rate. he was a good offensive rebounder. however andre drummond shot much better, drew far more fouls, was more efficient on offense, and scored better, but ranks much lower among Cs in ORPM. i can understand that by happenstance the bucks as a team played better on offense than detroit did with drummond, but to then give credit for that to pachulia makes little sense. and the line that pachulia does more of the unmeasurable things to help his team on offense than does drummond doesn't fly either, especially if one can't explain what those things are...

same here with this methodology and aldridge, nowitzki, randolph, and green ranked above anthony davis...

Certain nuances are just not incorporated and have to be applied during interpretation in order to make use of such numbers in a meaningful way

excellent - just what are these "certain nuances"? and how exactly - as you say - are they applied during interpretation to make use of these numbers in a meaningful way?...

mystic · Post by **mystic** » Sun Aug 09, 2015 8:45 pm

bchaikin wrote: We can specify the meaning by saying that those numbers are presenting the change the on-court presence of a player makes

that a single player "makes"? this is a bold claim - or is it in reality simply that which "occurs" when a single player plays in specific time intervals? inferring the first is far different than what the second actually measures...

Well, I used the word presence for a reason. The presence makes the change. If I had wanted to say that the player himself makes that change, I would have said so.

bchaikin wrote: "does not attempt" and "can't" are two different things, which in fact is it? can this methodology "attempt" to explain why players have high or low numbers as presented?...

Well, the results depend on the raw data. It is really that simple, but how is the algorithm supposed to know what the player is doing on the court, which can explain the change? That is a pretty good example of a question that would be completely unnecessary if you understood the method used. Unfortunately, it seems like you would rather quibble about something instead of reading up on the math.

bchaikin wrote: what "various factors" are you alluding to?

Well, I gave two examples for a reason.

bchaikin wrote: if 2 players have identical raw stats, including player defense outside of steals, blocks, and defensive rebounding (as best we can measure it), can they have different "numbers" in your methodology? even widely different numbers? and if the answer is yes, how does your methodology explain that?...

Well, what does "identical raw stats" mean? The pbp is used as raw data for the regression. Is that included? Or do you only talk about the boxscore? The question would also be: Do they achieve those "identical raw stats" in the same fashion or are there distinct differences in their playing style (like moving off the ball, setting screens, timing for help defense, etc.)?

bchaikin wrote: again this is a bold claim - just how would you explain this to an NBA GM or coach?...

Well, I presented the math in the other thread, which is actually proven. Thus, I know for a fact that RAPM predicts better than APM. If that coach or GM has the necessary basic math knowledge, I can assume he understands that without me saying anything. If not, he probably just has to trust me on that (and I could also present some research results regarding that as well).

bchaikin wrote: so does this mean the methodology, based on the numbers presented, says a team won more last season with lamarcus aldridge, dirk nowitzki, zach randolph, or draymond green, than with anthony davis, based on last season's numbers?...

No, the methodology itself doesn't say anything at all besides what I previously described. If you want to estimate the team wins from those numbers, you had better take the number of possessions someone played into account as well.
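A small worked example of why possessions matter, assuming the rating is read as a point margin per 100 possessions. The helper name `net_points` and both players' numbers are invented for illustration and do not come from the posted results.

```python
def net_points(rating_per_100, possessions):
    """Scale a per-100-possession rating by the possessions actually played."""
    return rating_per_100 * possessions / 100.0

# Hypothetical players: (rating per 100 possessions, possessions played).
players = {
    "Player A": (4.0, 2000),   # higher rating, limited role
    "Player B": (2.5, 5000),   # lower rating, heavy minutes
}

totals = {name: net_points(rating, poss) for name, (rating, poss) in players.items()}
for name, points in totals.items():
    print(f"{name}: {points:+.1f} points over the season")
```

Under these made-up numbers the lower-rated player accounts for more total points simply because he was on the court far more often. Turning points into wins would need a further assumption that is not covered here.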

bchaikin wrote: is this how you would approach an NBA GM or coach?

Why would I do that? Makes no sense to even ask such a question, because HOW I explain something is dependent on how much the person knows. When I discuss such a thing on this message board, I do not imagine talking to some mysterious coach or GM, but hope to engage in a discussion with people who have the basic motivation to actually understand what they want to talk about. Right now I do not get that impression from you at all, but as I mentioned before, it seems like you like quibbling way too much for my taste.

bchaikin wrote: i am often asked by my employer whether any of the numerous iterations of adjusted plus/minus should be considered, and in playing devil's advocate the responses i get are often more defensive and evasive than explanatory...

I'm not part of your discussions ... no idea how you are able to communicate that at all. From my experience, knowing what I'm talking about helps tremendously in such a discussion. Also, in a face-to-face conversation, I can take care of things in a different fashion than on a message board.

bchaikin wrote: how does one do due diligence to appreciate the results of this methodology? telling my superiors they have to take the time to "appreciate" the methodology doesn't fly - they want results they themselves can interpret...

Well, there is a big difference between a message board and your real-life conversations, isn't there? I think it is reasonable to expect someone trying to engage in a discussion on a specific topic on this message board to have at least a tiny bit of motivation to learn about the topic. I most certainly don't expect that in a different environment in the same way. So far I was usually able to explain the results of a ridge regression to various people with different math skills, and most people appreciated my effort. The few who didn't were usually fans of specific players and didn't like that their respective favorite player didn't end up with the best values. I did not expect such behaviour on this particular message board, where I see a myriad of people being able to understand the method and engage in a meaningful discussion.

bchaikin wrote: when i try to explain for example espn's real plus/minus to them, but they then see zaza pachulia listed 16th in the league (and 2nd among Cs), the conversation stops. so just what is that ranking supposed to be telling us?...

Well, maybe you aren't as good at explaining it as you believe you are? Anyway, Pachulia played limited minutes in a specific role. Now you can look into his role and can try to assess whether Pachulia's own actions are responsible or maybe the dynamic of the game changed when Pachulia was on the court in different ways, which for example led to better positions for his teammates in order to succeed.
And when the conversation stops at a point where the preconception of the GM or Coach is in disagreement with the results, it becomes really tough. Using math is usually motivated by getting information which is not otherwise easily accessible. In fact, the whole exercise would be meaningless if the GM and Coach are only paying attention when the tool that is supposed to give further information only yields information they already possess. And I seriously doubt that a coach or GM who only looks for confirmation of his own preconception is pretty good at his job.

bchaikin wrote: it listed pachulia as 5th in ORPM among Cs - despite shooting less than 46% on 2s, not being able to draw a foul, and with a low scoring rate. he was a good offensive rebounder. however andre drummond shot much better, drew far more fouls, was more efficient on offense, and scored better, but ranks much lower among Cs in ORPM. i can understand that by happenstance the bucks as a team played better on offense than detroit did with drummond, but to then give credit for that to pachulia makes little sense.

If Pachulia enables the team to play the way they played, while a different player is not changing the dynamic in the same fashion, sure as hell Pachulia should get the credit for that as well.

Overall, your reasoning gives me the impression that you don't fully appreciate the fact that the game is actually 5on5 and not some 1on1 exercise between Pachulia and Drummond. The result of the team overall matters, not so much whether player A has the higher FG% (which also heavily depends on the shot distance) or draws more fouls (where drawing more fouls while not converting the necessary foul shots plays a role as well). Pachulia for example is a much better shooter from midrange, which helps with the necessary spacing and allows his teammates to attack the basket from the perimeter. Pachulia also is the much better passer and has better control over the ball. So, the mistake here is to limit the view on the offensive contribution to some selected boxscore entries while ignoring the game itself.

bchaikin wrote: and the line that pachulia does more of the unmeasurable things to help his team on offense than does drummond doesn't fly either, especially if one can't explain what those things are...

Well, "you can't explain" does not equate to "one can't explain".

Obviously we can measure something, and there are various ways to explain the measurement within the constraints of a 5on5 basketball game.

bchaikin wrote: excellent - just what are these "certain nuances"? and how exactly - as you say - are they applied during interpretation to make use of these numbers in a meaningful way?...

Well, as I pointed out, the role a player is used in is one aspect. The specific skills the teammates possess are another aspect. If I have someone increasing the spacing of the overall team, but the ball handler isn't able to take advantage of that, such a skill isn't worth as much to that team as it would be for a different team. There is more, but right now I suspect the effort I put in isn't appreciated enough to justify spending even more time on the answer ...

Anyway, for a possible future conversation, it would be nice, if you could use the quote-function on this board. Your posts are often not as easy to read as they could be. Thanks.

APBRmetrics
