Re: the state of APBR

Posted: Tue Dec 02, 2014 2:40 pm
by Statman
mystic wrote: I think the biggest obstacle will be getting a clean matchup file in order to generate those numbers. Right now I adjust each pbp file by manually adding the lineups at the start of each quarter (meaning I have to watch the game video, because I have not found a reliable source to extract that information automatically). That helps a lot with the parsing process and limits the sources of error. If that were a group effort and the result could be made publicly available, that would be a really good start.
This is an important issue to me, the quality of the data. When it comes down to it - MANY of us are producing results from DIFFERENT data. I've learned this with the college player data - it seems that no one wants to take the time & try to make sure their data is as accurate as possible. There are tons of errors in full-season college player data - I can't even imagine the number of errors in pbp data - or errors in extracting the data (for one example, points from a FT should be applied to the players who were in the game when the foul happened - and NOT to players who subbed into the game and stood there while the FTs were made).
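
To make the FT example concrete, here is a minimal Python sketch of one way a parser could handle it. The event format, the field names and the function name credit_points are all made up for illustration (no real pbp feed looks exactly like this), and it only tracks one team's points to keep it short. The idea is simply to remember who was on the floor when the foul was called and credit made FTs to that lineup, not to whoever subbed in afterwards.

```python
from collections import defaultdict


def credit_points(events, starters):
    """Credit points to the players who were on the floor when they were earned.

    `events` is a list of dicts in game order, in a hypothetical format:
      {"type": "shot", "points": 2}            made field goal
      {"type": "foul"}                         shooting foul, FTs follow
      {"type": "sub", "out": "A", "in": "B"}   substitution
      {"type": "ft", "made": True}             free throw attempt
    """
    on_court = set(starters)
    lineup_at_foul = None          # snapshot taken when the foul is called
    points_for = defaultdict(int)

    for ev in events:
        kind = ev["type"]
        if kind == "sub":
            on_court.discard(ev["out"])
            on_court.add(ev["in"])
        elif kind == "foul":
            # Freeze the lineup: later subs must not change FT credit.
            lineup_at_foul = frozenset(on_court)
        elif kind == "ft":
            # Fall back to the current lineup if no foul was recorded.
            lineup = lineup_at_foul if lineup_at_foul is not None else on_court
            if ev["made"]:
                for player in lineup:
                    points_for[player] += 1
        elif kind == "shot":
            lineup_at_foul = None  # live play resumed, drop the snapshot
            for player in on_court:
                points_for[player] += ev["points"]

    return dict(points_for)
```

With this, a player who checks in between the foul and the FTs gets nothing for the FTs, and the player he replaced still gets the credit.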

It seems that many want to get their data and create their metric as quickly as humanly possible. They want a metric that takes seconds to reproduce. That seems to lead many to skip the care needed to make sure their data is as accurate to real life as possible.

I got into a discussion about Doug McDermott months ago on Twitter with some other draft guys who have their own NBA prospect rankings. I (or I should say my metric) "saw" McDermott's crazy historically low steal & block rates (not to mention tied to a mediocre assist rate) as an indicator of a player who will almost certainly fall short of lottery pick expectations (maybe lack of athleticism/quickness/instincts?). Anyway - one of the guys kept bringing up that McDermott's college stl & blk rate wasn't the lowest ever, John Pinone's was. I knew this wasn't close to true - so I told him to check his data. Every time he mentioned Pinone - I'd tell him to check his data, that he obviously had an error. Finally - I had to actually GIVE him Pinone's rates for every year of college before he realized he had the wrong data for John Pinone.

Now, I know no one wants to comb through over 70,000 player seasons (or 70,000 PbPs) to make sure their data is accurate - but at the very least there should be statistical checklists that raise red flags when things appear to compile weirdly (like, in the example above, Pinone somehow having a 0% stl & block rate in over 900 minutes - that's a pretty darn obvious red flag). I compile player data to the team level - so I am able to check team totals (make sure player minutes sum to a "correct" amount of team minutes, double check my team compiled totals against cbb-reference, etc). In compiling pbp, your compiled game totals should at least match box score data you trust.
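
The kind of checklist described above could look something like this rough Python sketch. It is not my actual process, just an illustration: the field names ("min", "stl", "blk"), the 500-minute cutoff and the 1-minute rounding tolerance are assumptions, and red_flags is a made-up function name. It runs three checks: player minutes should sum to five times the team minutes, nobody with real playing time should show zero steals AND zero blocks, and compiled totals should match a box score you trust.

```python
def red_flags(players, team_minutes, box_totals, min_minutes=500):
    """Return a list of (kind, detail) tuples for suspicious compiled data.

    players      : list of dicts with "name", "min" and counting stats
    team_minutes : minutes the team played (40 per regulation college game)
    box_totals   : trusted team totals, e.g. {"pts": 2000}
    min_minutes  : assumed cutoff above which zero stl AND blk is suspicious
    """
    flags = []

    # Five players are on the floor at all times.
    total_min = sum(p["min"] for p in players)
    if abs(total_min - 5 * team_minutes) > 1:  # assumed rounding tolerance
        flags.append(("minutes", f"{total_min} vs expected {5 * team_minutes}"))

    # Zero steals and zero blocks in heavy minutes usually means missing data.
    for p in players:
        if p["min"] >= min_minutes and p["stl"] == 0 and p["blk"] == 0:
            flags.append(("zero_stl_blk", p["name"]))

    # Compiled player totals should match the trusted box score.
    for stat, expected in box_totals.items():
        compiled = sum(p[stat] for p in players)
        if compiled != expected:
            flags.append((stat, f"compiled {compiled} vs box {expected}"))

    return flags
```

An empty list doesn't prove the data is right, but anything that does show up is worth a look before you run a metric on it.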

It just seems many run their PbP data miner, then run their metric on that data w/o any real checks & balances on the accuracy of the data - & voila, players ranked in 5 minutes. This is why I'm sometimes dubious about results - it's not just the metric I question, but sometimes how the data was compiled to begin with.

Re: the state of APBR

Posted: Tue Dec 02, 2014 5:23 pm
by Crow
There are 17 links about APM in the "past threads worth reading" thread. The search tool can probably yield others.