NCAA->NBA ML using raw text of historical scouting reports: done!

pmaymin
Posts: 27
Joined: Sat Oct 01, 2011 2:22 pm

Post by pmaymin » Thu Jun 13, 2019 3:05 am

I just finished training nearly one thousand LOOCV individual random forests machine learning models, one for each collegiate NBA draft prospect who was in the ESPN Top 100 prospects list prior to the draft, using their college efficiency and production stats, combine measurement information, high school RSCI and NBA mock draft placements, handedness, ethnicity as estimated from a separate machine learning model based on their name (appeler/ethnicolr on github), and also scouting evaluations and raw scouting text scraped from nbadraft.net for 2006-2019 and processed through term frequency–inverse document frequency (TFIDF) then doubly dimension reduced. The forecast variable is an average of three win production measures (Win Shares, Wins Produced, and Estimated Wins Added) over each player's three years after the draft, or zero for years they did not play.
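A minimal sketch of the TF-IDF step mentioned above, in plain Python. The tokenizer (lowercase, split on whitespace), the unsmoothed weighting tf * log(N / df), and the sample snippets are all assumptions for illustration; the post does not say which variant or library was used, and the two dimension-reduction passes are not shown.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: how many reports contain each term at least once.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors

# Hypothetical snippets, not real scouting text.
reports = [
    "elite athlete with raw jumper",
    "smooth jumper but average athlete",
    "elite rim protector with long wingspan",
]
weights = tfidf(reports)
```

Terms that appear in only one report (here "raw") get more weight than terms shared across reports (here "jumper"), which is the point of the inverse-document-frequency factor. Library implementations typically use smoothed variants of this formula.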

The correlation between predicted and actual NBA production is 63%. Every team except the Denver Nuggets would have benefited from drafting based on this model rather than the decisions they actually made, even though the model does not have hindsight bias and can only draft NCAA prospects. The average model pick outperformed the actual pick by 70% and the average team lost out on $100 million worth of on-court production.
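To make the evaluation concrete, here is a rough sketch of leave-one-out cross-validation followed by a Pearson correlation between out-of-sample predictions and actual values. A one-variable least-squares line stands in for the random forest, and the numbers are made up; both are assumptions for illustration only, not the actual model or data.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def fit_line(xs, ys):
    """Least-squares fit of y = a + b * x (stand-in for a random forest)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return my - b * mx, b

def loocv_predictions(xs, ys):
    """Predict each point from a model trained on all the other points."""
    preds = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        a, b = fit_line(train_x, train_y)
        preds.append(a + b * xs[i])
    return preds

# Hypothetical data: one college stat and three-year NBA production.
college_stat = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
nba_wins = [0.5, 1.0, 2.5, 2.0, 4.5, 5.0]

preds = loocv_predictions(college_stat, nba_wins)
r = pearson(preds, nba_wins)
```

Because each prediction comes from a model that never saw that player, `r` measures out-of-sample agreement rather than fit to the training data.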

Bottom line: NBA teams should immediately incorporate their own internal historical scouting reports into their projection models. It's a treasure trove.

eminence
Posts: 141
Joined: Sun Sep 10, 2017 8:20 pm

Re: NCAA->NBA ML using raw text of historical scouting reports: done!

Post by eminence » Thu Jun 13, 2019 3:47 am

Looks interesting for sure, I'm sure I'll think of more questions, but one first one - how do you handle players with negatives in a stat like WS?

vzografos
Posts: 24
Joined: Thu Sep 06, 2018 10:42 am

Re: NCAA->NBA ML using raw text of historical scouting reports: done!

Post by vzografos » Thu Jun 13, 2019 5:33 am

pmaymin wrote:
Thu Jun 13, 2019 3:05 am
The forecast variable is an average of three win production measures (Win Shares, Wins Produced, and Estimated Wins Added) over each player's three years after the draft, or zero for years they did not play.
Interesting model/approach.
OK, I do not really follow (or claim to understand) NBA drafts, but I was wondering what those three variables are. Can you explain what they mean, especially Estimated Wins Added?

I was really wondering how you avoid existing team performance bias. In other words, I believe that a good team will do well despite drafting a bad rookie, at least in the first few years. Likewise, a bad team might not directly benefit from a good rookie early on. So a rookie's contribution to team wins is a delayed variable. Perhaps you could regress individual rookie performance (i.e. player stats) instead of team contribution?

If you really want team contribution, would it be better instead to predict the DIFFERENCE between the wins BEFORE the rookie was drafted and the wins AFTER, over a few years? Or is that what Estimated Wins Added means?

Good stuff though

tarrazu
Posts: 72
Joined: Mon Aug 04, 2014 5:02 pm

Re: NCAA->NBA ML using raw text of historical scouting reports: done!

Post by tarrazu » Thu Jun 13, 2019 6:08 am

pmaymin wrote:
Thu Jun 13, 2019 3:05 am
I just finished training nearly one thousand LOOCV individual random forests machine learning models... and also scouting evaluations and raw scouting text scraped from nbadraft.net for 2006-2019 and processed through term frequency–inverse document frequency (TFIDF) then doubly dimension reduced.
This sounds a lot like what the Astros attempted to do where they combined subjective information from their scouts with quantitative models on their way to winning the 2017 World Series.

This is explored in more detail in the book Astroball: The New Way to Win It All by Ben Reiter:

"When new Astros general manager Jeff Luhnow and his top analyst, the former rocket scientist Sig Mejdal, arrived in Houston in 2011, they had already spent more than half a decade trying to understand how human instinct and expertise could be blended with hard numbers such as on-base percentage and strikeout rate to guide their decision-making. In Houston, they had free rein to remake the club. No longer would scouts, with all their subjective, hard-to-quantify opinions, be forced into opposition with the stats guys. Instead, Luhnow and Sig wanted to correct for the biases inherent in human observation, and then roll their scouts’ critical thoughts into their process. The numbers had value—but so did the gut."

Crow
Posts: 6250
Joined: Thu Apr 14, 2011 11:10 pm

Re: NCAA->NBA ML using raw text of historical scouting reports: done!

Post by Crow » Fri Jun 14, 2019 4:45 pm

If you isolated combine measurement information from everything else, is the correlation with performance stronger or weaker in the playoffs compared to the regular season?


I assume Estimated Wins Added is based off PER as described here http://insider.espn.com/nba/hollinger/s ... sort/VORPe

pmaymin
Posts: 27
Joined: Sat Oct 01, 2011 2:22 pm

Re: NCAA->NBA ML using raw text of historical scouting reports: done!

Post by pmaymin » Fri Jun 14, 2019 8:34 pm

eminence wrote:
Thu Jun 13, 2019 3:47 am
Looks interesting for sure, I'm sure I'll think of more questions, but one first one - how do you handle players with negatives in a stat like WS?
It's just negative. ¯\_(ツ)_/¯

Just means a player contributed net losses to his team through his performance (eg turnovers, fouls, missed shots, etc.).

pmaymin
Posts: 27
Joined: Sat Oct 01, 2011 2:22 pm

Re: NCAA->NBA ML using raw text of historical scouting reports: done!

Post by pmaymin » Fri Jun 14, 2019 8:37 pm

vzografos wrote:
Thu Jun 13, 2019 5:33 am
Interesting model/approach.
OK, I do not really follow (or claim to understand) NBA drafts, but I was wondering what those three variables are. Can you explain what they mean, especially Estimated Wins Added?
Crow's answer is exactly right about EWA, based off of PER.

All three of these methods try to re-express an arbitrary vector of individual performance stats into a single number that proxies for how many of the wins the team had that year were attributable to that player. They differ in terms of adjustments and weights and other things, but they are basically all translations of box scores to wins.

pmaymin
Posts: 27
Joined: Sat Oct 01, 2011 2:22 pm

Re: NCAA->NBA ML using raw text of historical scouting reports: done!

Post by pmaymin » Fri Jun 14, 2019 8:40 pm

Crow wrote:
Fri Jun 14, 2019 4:45 pm
If you isolated combine measurement information from everything else, is the correlation with performance stronger or weaker in the playoffs compared to the regular season?
That's a really interesting question.

I didn't look at playoffs at all, only regular season NBA performance, to keep it identical for all players.

Crow
Posts: 6250
Joined: Thu Apr 14, 2011 11:10 pm

Re: NCAA->NBA ML using raw text of historical scouting reports: done!

Post by Crow » Fri Jun 14, 2019 9:15 pm

Could this be a factor in the Raptors' defeat of the Warriors in the Finals?

I am not sure how a fair and thorough analysis of the physical dimensions and athletic markers of the two teams would turn out, but I have subjective preconceptions about it that I'd want to challenge with more information.

eminence
Posts: 141
Joined: Sun Sep 10, 2017 8:20 pm

Re: NCAA->NBA ML using raw text of historical scouting reports: done!

Post by eminence » Fri Jun 14, 2019 10:48 pm

pmaymin wrote:
Fri Jun 14, 2019 8:34 pm
eminence wrote:
Thu Jun 13, 2019 3:47 am
Looks interesting for sure, I'm sure I'll think of more questions, but one first one - how do you handle players with negatives in a stat like WS?
It's just negative. ¯\_(ツ)_/¯

Just means a player contributed net losses to his team through his performance (eg turnovers, fouls, missed shots, etc.).
Hmm, that was kind of what I thought and I'm not sure I agree, as it penalizes players who play but play poorly more harshly than players that don't play at all (if I'm reading it correctly).

pmaymin
Posts: 27
Joined: Sat Oct 01, 2011 2:22 pm

Re: NCAA->NBA ML using raw text of historical scouting reports: done!

Post by pmaymin » Sun Jun 16, 2019 1:48 am

eminence wrote:
Fri Jun 14, 2019 10:48 pm
Hmm, that was kind of what I thought and I'm not sure I agree, as it penalizes players who play but play poorly more harshly than players that don't play at all (if I'm reading it correctly).
That's exactly the right reading. And you are also right about your qualms. It's part of a bigger issue around minutes played. What if you are a more efficient per-minute player, but your stubborn coach won't let you out there? I have also run the same forecasting analysis on a per-48-minute basis. There are some changes here and there, but the main takeaways are the same: teams are inefficient with drafting and are squandering their priceless historical data.
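A small sketch of the issue being discussed, under stated assumptions: the target is taken as the mean of the three win measures within each season, then the mean across three seasons, with 0 for seasons not played. Whether the actual model averages or sums across seasons is not stated in the thread, and the field names and numbers below are hypothetical.

```python
def three_year_target(seasons):
    """Average win production over three post-draft seasons.

    Each entry is None (did not play) or a dict with keys
    "ws", "wp" and "ewa" (hypothetical field names).
    """
    padded = (list(seasons) + [None] * 3)[:3]
    yearly = []
    for season in padded:
        if season is None:
            yearly.append(0.0)  # no games played counts as zero wins
        else:
            yearly.append((season["ws"] + season["wp"] + season["ewa"]) / 3)
    return sum(yearly) / 3

def per48(wins, minutes):
    """Rescale a win total to a per-48-minute rate."""
    return 48 * wins / minutes

# A player who never appears vs. one who plays one poor season.
never_played = [None, None, None]
played_poorly = [{"ws": -0.3, "wp": -0.6, "ewa": -0.3}, None, None]

target_never = three_year_target(never_played)
target_poor = three_year_target(played_poorly)

# Totals vs. rates: an efficient bench player and a high-minute starter.
bench_rate = per48(1.0, 240)
starter_rate = per48(4.0, 2400)
```

With totals, `target_poor` comes out below `target_never`, which is the penalty described above. On a per-48 basis the bench player rates higher than the starter despite fewer total wins, which is why the two framings can rank players differently.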

Crow
Posts: 6250
Joined: Thu Apr 14, 2011 11:10 pm

Re: NCAA->NBA ML using raw text of historical scouting reports: done!

Post by Crow » Sun Jun 16, 2019 8:53 am

Which do you think is the bigger issue... GMs not having the right analytics, not getting strong enough guidance from analysts, or just not using whatever they have well enough because of their own thoughts or scouting guidance?

RyanRiot
Posts: 23
Joined: Wed Oct 19, 2016 2:26 am

Re: NCAA->NBA ML using raw text of historical scouting reports: done!

Post by RyanRiot » Mon Jun 17, 2019 4:41 pm

Who does the model like this year?

eminence
Posts: 141
Joined: Sun Sep 10, 2017 8:20 pm

Re: NCAA->NBA ML using raw text of historical scouting reports: done!

Post by eminence » Mon Jun 17, 2019 5:18 pm

RyanRiot wrote:
Mon Jun 17, 2019 4:41 pm
Who does the model like this year?
The important questions

pmaymin
Posts: 27
Joined: Sat Oct 01, 2011 2:22 pm

Re: NCAA->NBA ML using raw text of historical scouting reports: done!

Post by pmaymin » Tue Jun 18, 2019 10:07 am

Crow wrote:
Sun Jun 16, 2019 8:53 am
Which do you think is the bigger issue... GMs not having the right analytics, not getting strong enough guidance from analysts, or just not using whatever they have well enough because of their own thoughts or scouting guidance?
I think it's a combination of a lack of rigor, illusion of control, and availability-hindsight-attribution bias.

Analytics isn't the answer to a question, it's a process of decision making. If you're not going to subject your entire decision making process to rigorous evaluation, then the analytics department just becomes a justification engine.

The illusion of control is seductive. It feels good, and it's really, really hard to disconfirm it, whether you're chasing your child, or trading stocks, or running a sports team. Risk isn't properly assessed without analytics.

And availability-hindsight-attribution bias lets people quickly recall their own past successes and analytics' past failures, and that quickness to recall is incorrectly interpreted as a likelihood or probability.
