Re: the state of APBR
Posted: Tue Dec 02, 2014 2:40 pm
mystic wrote: I think the biggest obstacle will be getting that clean matchupfile in order to generate those numbers. Right now I adjust each pbp-file by adding the lineup formations at the start of each quarter manually (meaning, I have to watch the game video, because I have not found a reliable source to extract those information automatically), that helps a lot with the parsing process and is limiting error sources. If that would be a group effort and it could be made publically available, that would be a really good start.
This is an important issue to me: the quality of the data. When it comes down to it - MANY of us are producing results from DIFFERENT data. I've learned this with the college player data - it seems that no one wants to take the time & try to make sure their data is as accurate as possible. There are tons of errors in player full season college data - I can't even imagine the number of errors in pbp data - or errors in extracting the data (for one example, getting FT attribution wrong: points from a FT should go to the players who were in the game when the foul happened - and NOT to players who subbed into the game and stood there while the FTs were made).
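To make the free-throw attribution point concrete, here is a minimal sketch of one way a parser could handle it. The event format (dicts with a "kind" key), the function name credit_points, and the rule for when a foul's lineup stops applying are all assumptions for illustration, not any real pbp feed's schema. It only tracks one team's points and lineup.

```python
def credit_points(events, starters):
    """Credit one team's points to the players treated as 'on court'.

    Free throws are credited to the lineup that was on the floor when
    the foul happened, not to players who subbed in before the FTs.
    The event dicts here are a made-up format for illustration.
    """
    on_court = set(starters)
    lineup_at_foul = None          # snapshot taken when a foul is whistled
    credited = {}

    def add(players, pts):
        for p in players:
            credited[p] = credited.get(p, 0) + pts

    for ev in events:
        kind = ev["kind"]
        if kind == "sub":
            on_court.discard(ev["out"])
            on_court.add(ev["in"])
        elif kind == "foul":
            lineup_at_foul = frozenset(on_court)
        elif kind == "ft":
            # Use the snapshot if there is one, else the current lineup.
            lineup = lineup_at_foul if lineup_at_foul is not None else on_court
            add(lineup, ev["points"])
        else:
            # Any live-ball event ends the free-throw trip (assumption).
            lineup_at_foul = None
            add(on_court, ev.get("points", 0))
    return credited


# Made-up sequence: foul, made FT, substitution, made FT, then a 2-pt basket.
example = [
    {"kind": "foul"},
    {"kind": "ft", "points": 1},
    {"kind": "sub", "out": "E", "in": "F"},
    {"kind": "ft", "points": 1},
    {"kind": "shot", "points": 2},
]
result = credit_points(example, ["A", "B", "C", "D", "E"])
```

In this made-up sequence, E still gets credit for the second free throw even though F is standing on the floor when it goes in. A parser that credits the current lineup instead would give that point to F.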
It seems that many want to get their data and create their metric as quickly as humanly possible. They want a metric that takes seconds to reproduce. As a result, many don't seem to take care in ensuring their data is as accurate to real life as possible.
I got into a discussion about Doug McDermott months ago on Twitter with some other draft guys who have their own NBA prospect rankings. I (or I should say my metric) "saw" McDermott's crazy historically low steal & block rates (not to mention tied to a mediocre assist rate) as an indicator of a player who will almost certainly fall short of lottery pick expectations (maybe lack of athleticism/quickness/instincts?). Anyway - one of the guys kept bringing up that McDermott's college stl & blk rates weren't the lowest ever, John Pinone's were. I knew this wasn't close to true - so I told him to check his data. Every time he mentioned Pinone - I'd tell him to check his data, that he obviously had an error. Finally - I had to actually GIVE him Pinone's rates for every year of college before he realized he had the wrong data for John Pinone.
Now, I know no one wants to comb through over 70,000 player seasons (or 70,000 PbPs) to make sure their data is accurate - but at the very least there should be statistical checks that raise red flags when something compiles weirdly (like, in the example above, Pinone somehow having a 0% stl & block rate in over 900 minutes - that's a pretty darn obvious red flag). I compile player data to the team level - so I am able to check team totals (make sure player minutes sum to a "correct" amount of team minutes, double check my team compiled totals against cbb-reference, etc). In compiling pbp, your compiled game totals should at least match box score data you trust.
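For what it's worth, checks like the ones described above don't take much code. This is only a rough sketch: the function names, the dict layout, and the 5-minute tolerance are my own assumptions, and the sample rows are made up rather than real player data. The 900-minute cutoff comes from the example above, and the expected-minutes math assumes 40-minute college games with 5-minute overtimes and 5 players on the floor.

```python
def zero_rate_flags(seasons, min_minutes=900):
    """Return players with zero steals AND zero blocks in heavy minutes."""
    return [
        s["player"]
        for s in seasons
        if s["minutes"] >= min_minutes and s["steals"] == 0 and s["blocks"] == 0
    ]


def expected_team_minutes(games, overtime_periods=0):
    """Team minutes for a season: 5 players x (40 min per game + 5 per OT)."""
    return 5 * (40 * games + 5 * overtime_periods)


def team_minutes_ok(player_minutes, games, overtime_periods=0, tolerance=5):
    """True if summed player minutes are within `tolerance` of the expected total."""
    expected = expected_team_minutes(games, overtime_periods)
    return abs(sum(player_minutes) - expected) <= tolerance


def totals_mismatch(compiled, trusted):
    """Return {stat: (compiled, trusted)} for every stat that disagrees."""
    return {
        stat: (compiled.get(stat), value)
        for stat, value in trusted.items()
        if compiled.get(stat) != value
    }


# Made-up rows, purely to show the checks firing.
seasons = [
    {"player": "Player X", "minutes": 950, "steals": 0, "blocks": 0},
    {"player": "Player Y", "minutes": 950, "steals": 30, "blocks": 10},
    {"player": "Player Z", "minutes": 100, "steals": 0, "blocks": 0},
]
flags = zero_rate_flags(seasons)
diff = totals_mismatch({"pts": 70, "reb": 35}, {"pts": 70, "reb": 36})
```

Only Player X gets flagged here, since Player Z's zeros come in too few minutes to mean anything. A flag doesn't prove the row is wrong; it just tells you which rows to go look up by hand before trusting them.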
It just seems many run their PbP data miner, then run their metric on that data w/o any real checks & balances on the accuracy of the data - & voilà, players ranked in 5 minutes. This is why I'm sometimes dubious about results - it's not just the metric I question, but sometimes how the data was compiled to begin with.