OK, prompted by a question someone asked on another site, I went ahead and compared LambdaPM's ability to predict the home team's margin of victory against Vegas. Basically, you can treat the Vegas line as an estimator and see how well the two compare.
Of course, LambdaPM "cheats" a bit in the following two ways:
- (A) When training the LambdaPM algorithm, it uses the full end-of-season box score rather than the league box score as it stood at that point in the season (just because it's a hassle getting the correct league-wide box score as of, say, the Nth game of the season. It isn't impossible to do, just a hassle I didn't want to deal with.) I don't think this is too big a deal, though; you don't expect league-wide per-36-minute stats to change that much.
- (B) When coming up with a prediction for a game, LambdaPM is given access to how many possessions each player plays in that game. But if you think about it, this isn't a dealbreaker either, because it probably isn't hard to predict how many minutes important players will play in each game.
Anyway, I wrote a bit of code to grab all the Vegas game predictions for the 2010-2011 season (the website covers.com has them available).
Then I used each of the techniques I considered in the paper (the home-court-advantage predictor, APM, LambdaPM(R,1), LambdaPM(R,2)) as a rule for placing bets.
Basically, if the technique's prediction differs from Vegas's by at least the cutoff (I tried cutoff=1, 2, 3, 4, 5), then I place a "bet." Since we also know the final margin, we can see whether the bet was a winning one or a losing one.
As an example, the HCA estimate predicts that the home team will win by roughly 3 points. If the Vegas line is -6, this means Vegas is predicting the home team to win by 6 points. If my cutoff is 1, then the 3-point difference between the HCA prediction and Vegas's prediction is big enough for me to place a bet.
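In code, the rule is something like the following sketch (the function name and the sign convention for the line are just for illustration):

```python
def should_bet(pred_margin, vegas_line, cutoff):
    """Bet only if our predicted home margin differs from Vegas's
    implied home margin by at least `cutoff` points. A line of -6
    means the home team is favored by 6, so Vegas's implied home
    margin is -vegas_line."""
    return abs(pred_margin - (-vegas_line)) >= cutoff

# The example above: HCA predicts home by ~3, the line is -6,
# and with cutoff=1 we have |3 - 6| = 3 >= 1, so we bet.
print(should_bet(3.0, -6.0, cutoff=1))  # True
```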
So in short: we train each of these algorithms on the first 820 games of the regular season (and, separately, on the first 410 and the first 205), then see how well they would have done if we'd used them to gamble on the last 410 (subject to caveats A and B above, of course).
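Settling the bets and tallying the winning percentage then looks roughly like this sketch (illustrative names again; I'm also assuming an exact push just refunds the bet, a detail I glossed over above):

```python
def settle_bet(pred_margin, vegas_line, actual_margin):
    """Settle one bet against the spread: bet the home side if the
    model expects the home team to beat Vegas's implied margin, the
    away side otherwise. Returns True/False for win/loss, or None
    on an exact push (assumed to refund the bet)."""
    vegas_margin = -vegas_line
    if actual_margin == vegas_margin:
        return None                             # push: no win, no loss
    bet_home = pred_margin > vegas_margin
    home_covers = actual_margin > vegas_margin
    return bet_home == home_covers

def winning_pct(games, predict, cutoff):
    """Winning percentage over held-out games, where `games` is a
    list of (features, vegas_line, actual_margin) tuples and
    `predict` maps features to a predicted home margin."""
    outcomes = []
    for feats, line, margin in games:
        pred = predict(feats)
        if abs(pred - (-line)) >= cutoff:       # the bet-placement rule above
            result = settle_bet(pred, line, margin)
            if result is not None:
                outcomes.append(result)
    return sum(outcomes) / len(outcomes) if outcomes else float("nan")
```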
Here are the results for training on the first 820 games and evaluating on the last 410:
The green and the blue rows are the most interesting ones. Let's focus on cutoff=5. Setting the cutoff that high means we want a disagreement of at least 5 points between the technique in question and Vegas's estimate before we decide to place a bet. As one sort of expects, the HCA estimate is pretty bad even with this cutoff; your winning percentage is no better than a coin-flipper's. APM, LambdaPM(R,1), and LambdaPM(R,2) all do pretty well, getting winning percentages of 54.1%, 57.1%, and 55.1%, respectively.
Of course, we can't draw too much from this since I'm only evaluating on 410 games...the sample size is too small to say anything conclusive. But it is kind of interesting still, I think.
You can also take a look at training_size=410 and training_size=205 here:
https://spreadsheets.google.com/spreads ... y=CJn-xegG
Take a look at training_size=205 especially (the third sheet). APM, LambdaPM(R,1), and LambdaPM(R,2) do much worse than before, posting winning percentages of 48.5%, 51.7%, and 52.6%, respectively (probably not much better than coin-flipping).
I have two guesses regarding this poor performance:
- 1) Perhaps 205 games is simply too few to build up a good model of the NBA. You could probably improve the performance a lot by incorporating data from the 2009-2010 season. I'm actually a bit curious to see how much this would help.
- 2) It is also possible that the regularization parameters I chose are poor; I just reused the regularization parameters obtained from cross-validation on 820 games. So there might be some improvement from re-estimating them on the smaller training set, as in the sketch below.
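Re-estimating the penalty could look something like this, using scikit-learn's RidgeCV as a stand-in for whatever LambdaPM actually fits, and synthetic filler data in place of the real design matrix:

```python
import numpy as np
from sklearn.linear_model import RidgeCV

# Synthetic stand-in: 205 "games" with 50 player columns; the real
# design matrix would come from the LambdaPM features.
rng = np.random.default_rng(0)
X = rng.normal(size=(205, 50))
y = X @ rng.normal(size=50) + rng.normal(scale=5.0, size=205)

# Pick the ridge penalty by cross-validation on the small training
# set itself, instead of reusing the value tuned on 820 games.
alphas = np.logspace(-2, 4, 25)
model = RidgeCV(alphas=alphas, cv=5).fit(X, y)
print("chosen penalty:", model.alpha_)
```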
I also took a look at some of the statistical properties of Vegas's estimate of the final margin of victory versus LambdaPM's.
The same story seems to be going on...LambdaPM (arguably) outperforms the Vegas line quite handily when it has 820 games to train on, but gets its teeth kicked in with only 205 (see Table #2 and Table #7 of the paper linked on the first page:
https://docs.google.com/viewer?a=v&pid= ... y=CJ6UzpUB).
I think the next step is for me to go back and take a look at this 2009-2010 NBA dataset and figure out a good way of pooling old and new data; one simple option is sketched below. I guess doing this also means I can have something for this retrodiction challenge.
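One simple pooling scheme is to stack both seasons and down-weight the older games via sample weights. Sketched below with placeholder data; the 0.5 weight and alpha value are arbitrary illustrations, not tuned choices:

```python
import numpy as np
from sklearn.linear_model import Ridge

# Placeholder data standing in for two seasons of design matrices.
rng = np.random.default_rng(0)
X_old, y_old = rng.normal(size=(1230, 50)), rng.normal(size=1230)  # full 2009-2010
X_new, y_new = rng.normal(size=(205, 50)), rng.normal(size=205)    # 2010-2011 so far

# Stack the seasons and discount the older one, so stale rosters
# and aging effects count for less than current data.
X = np.vstack([X_old, X_new])
y = np.concatenate([y_old, y_new])
w = np.concatenate([np.full(len(y_old), 0.5), np.ones(len(y_new))])

model = Ridge(alpha=100.0).fit(X, y, sample_weight=w)
```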