I've opened up a subset of my basketball GitHub repository to the public:
https://github.com/galizur/basketball-public
The code is in R, Ruby, Bash and SQL and presumes a PostgreSQL database. This isn't meant to be anything special or cutting edge, just a little introduction to some data science using basketball. My day job is doing quantitative analysis for the San Diego Padres.
It includes basic college and pro game and performance data, basic feature detection for predicting pro performance from college performance, and power rankings for amateur and pro teams. The power rankings for college teams pool teams within divisions, then measure pool strength from 2002-2012. This lets you measure relative team strength between D1, D2 and D3 and estimate all NCAA teams on the same scale. Also included are home/away factors, plus a demonstration that the distance teams travel affects performance. Distances are calculated by geocoding cities with Yahoo's PlaceFinder API, then computing the great-circle distance between them. There are lots of obvious improvements that can be made.
I've included play-by-play data for 10,000 NCAA games in XML, plus parsed versions in CSV files. I haven't done anything with this data yet, however.
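For readers unfamiliar with the great-circle step, here is a minimal sketch using the haversine formula. It is not the repository's code: the function names, the mean Earth radius constant and the approximate city coordinates are all assumptions made for illustration, and the geocoding step is skipped entirely.

```ruby
# Great-circle distance via the haversine formula (illustrative sketch).
EARTH_RADIUS_KM = 6371.0 # mean Earth radius; an assumed constant

def to_radians(degrees)
  degrees * Math::PI / 180.0
end

# Each point is a [latitude, longitude] pair in decimal degrees.
def great_circle_km(a, b)
  lat1, lon1 = a.map { |d| to_radians(d) }
  lat2, lon2 = b.map { |d| to_radians(d) }
  dlat = lat2 - lat1
  dlon = lon2 - lon1
  h = Math.sin(dlat / 2)**2 +
      Math.cos(lat1) * Math.cos(lat2) * Math.sin(dlon / 2)**2
  2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(h))
end

# Approximate coordinates, for illustration only.
syracuse  = [43.05, -76.15]
san_diego = [32.72, -117.16]
puts great_circle_km(syracuse, san_diego).round
```

The haversine form treats the Earth as a sphere, which should be accurate enough for a travel-distance effect measured between cities.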
My Twitter:
https://twitter.com/octonion
My (sometimes) blog:
http://angrystatistician.blogspot.com
-Chris
Public basketball data and analysis github
Re: Public basketball data and analysis github
Pretty cool but I think it's a little unclear where to find what.
I wasn't able to find the NCAA PBP, for example. A clearer naming scheme might make navigation a little easier.
Re: Public basketball data and analysis github
Very nice, Chris. I'm not sure what to do with it (yet) but this is an excellent source of data/scripts. Sooner or later I've got to learn SQL...
Re: Public basketball data and analysis github
J.E. wrote: "Pretty cool but I think it's a little unclear where to find what. I wasn't able to find the NCAA PBP, for example. A clearer naming scheme might make navigation a little easier."
Yes, it's still a work in progress with respect to user friendliness.
The NCAA play-by-play is under the "cstv" directory. There's a compressed tar archive of the raw XML files together with a compressed tar archive of these parsed into CSV files. I've included the basic Ruby script I wrote to do the parsing.
Sample from one of the XML files:
<plays format="tokens">
<period number="1" time="00:00">
<special vh="V" pts_to="" pts_ch2="" pts_paint="" pts_fastb="" pts_bench="" ties="" leads="" poss_count="" poss_time="" score_count="" score_time="" large_lead="" large_lead_t=""></special>
<special vh="H" pts_to="" pts_ch2="" pts_paint="" pts_fastb="" pts_bench="" ties="" leads="" poss_count="" poss_time="" score_count="" score_time="" large_lead="" large_lead_t=""></special>
<summary vh="V" fgm="9" fga="29" fgm3="4" fga3="10" ftm="4" fta="5" tp="26" blk="" stl="" ast="" oreb="" dreb="" treb="" pf="9" tf="" to=""></summary>
<summary vh="H" fgm="22" fga="38" fgm3="7" fga3="12" ftm="1" fta="4" tp="52" blk="" stl="" ast="" oreb="" dreb="" treb="" pf="7" tf="" to=""></summary>
<play vh="V" time="19:36" uni="24" team="MANHAT" checkname="BEAMON,GEORGE" action="GOOD" type="JUMPER" paint="Y" fastb="" vscore="2" hscore="0" play_id="1"></play>
<play vh="H" time="19:09" uni="20" team="SYR" checkname="TRICHE,BRANDON" action="MISS" type="JUMPER" play_id="2"></play>
<play vh="H" time="19:09" uni="51" team="SYR" checkname="MELO,FAB" action="REBOUND" type="OFF" play_id="3"></play>
<play vh="H" time="18:35" uni="11" team="SYR" checkname="JARDINE,SCOOP" action="MISS" type="LAYUP" play_id="4"></play>
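To illustrate how a `<play>` element maps onto a CSV row, here is a rough sketch. It is not the parsing script from the repository. The column order is inferred from the CSV sample in this thread, the game id `12345` is a placeholder, and a regular expression stands in for a real XML parser so that the sketch has no dependencies and works on fragments like the one above.

```ruby
# Attribute order inferred from the CSV sample; an assumption, not the
# repository's actual schema. Note that "fastb" does not appear in the CSV.
PLAY_COLUMNS = %w[vh time uni team checkname action type paint
                  vscore hscore play_id].freeze

# Collect the name="value" pairs on one <play ...></play> line into a hash.
def play_attributes(line)
  line.scan(/(\w+)="([^"]*)"/).to_h
end

# Quote a field only when it contains a comma or a double quote.
def csv_field(value)
  return value unless value =~ /[",]/
  '"' + value.gsub('"', '""') + '"'
end

# Missing attributes become empty fields, as in the CSV sample.
def play_to_csv_row(game_id, period, line)
  attrs = play_attributes(line)
  fields = [game_id.to_s, period.to_s] +
           PLAY_COLUMNS.map { |name| attrs.fetch(name, "") }
  fields.map { |field| csv_field(field) }.join(",")
end

line = '<play vh="V" time="19:36" uni="24" team="MANHAT" ' \
       'checkname="BEAMON,GEORGE" action="GOOD" type="JUMPER" paint="Y" ' \
       'fastb="" vscore="2" hscore="0" play_id="1"></play>'
puts play_to_csv_row(12345, 1, line)
# prints: 12345,1,V,19:36,24,MANHAT,"BEAMON,GEORGE",GOOD,JUMPER,Y,2,0,1
```

For complete, well-formed files a proper XML parser would be the safer choice; the regular expression is only meant to show the mapping.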
Sample from one of the parsed CSV files:
1000617,1,H,19:53,31,WASH,"ROSS,TERRENCE",GOOD,JUMPER,Y,0,2,1
1000617,1,V,19:26,33,SDSU,"CALLAHAN,GRIFFAN",GOOD,JUMPER,Y,2,2,2
1000617,1,V,19:26,42,SDSU,"DYKSTRA,JORDAN",ASSIST,,,,,3
1000617,1,H,19:11,00,WASH,"GADDY,ABDUL",MISS,JUMPER,,,,4
1000617,1,V,19:11,03,SDSU,"WOLTERS,NATE",REBOUND,DEF,,,,5
1000617,1,V,19:00,33,SDSU,"CALLAHAN,GRIFFAN",GOOD,3PTR,,5,2,6
1000617,1,H,18:57,TM,WASH,TEAM,TIMEOUT,30SEC,,,,7
1000617,1,V,18:33,03,SDSU,"WOLTERS,NATE",FOUL,,,,,8
1000617,1,H,18:31,23,WASH,"WILCOX,CJ",MISS,JUMPER,,,,9
1000617,1,V,18:31,12,SDSU,"CARLSON,BRAYDEN",REBOUND,DEF,,,,10
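As one possible first step with the parsed files, the sketch below tallies made and attempted shots per team from a few of the rows above. The column names are guesses inferred from the sample (the rows are assumed to have no header), and every GOOD or MISS event is counted, so any free throws in the real data would need separate handling.

```ruby
require "csv"

# Column names are assumptions inferred from the sample rows.
COLUMNS = %i[game_id period vh time uni team player action type paint
             vscore hscore play_id].freeze

sample = <<~ROWS
  1000617,1,H,19:53,31,WASH,"ROSS,TERRENCE",GOOD,JUMPER,Y,0,2,1
  1000617,1,V,19:26,33,SDSU,"CALLAHAN,GRIFFAN",GOOD,JUMPER,Y,2,2,2
  1000617,1,H,19:11,00,WASH,"GADDY,ABDUL",MISS,JUMPER,,,,4
  1000617,1,V,19:00,33,SDSU,"CALLAHAN,GRIFFAN",GOOD,3PTR,,5,2,6
  1000617,1,H,18:31,23,WASH,"WILCOX,CJ",MISS,JUMPER,,,,9
ROWS

# Count GOOD and MISS events per team; other actions are ignored.
def shooting_by_team(csv_text)
  totals = Hash.new { |hash, team| hash[team] = { made: 0, attempted: 0 } }
  CSV.parse(csv_text).each do |fields|
    play = COLUMNS.zip(fields).to_h
    next unless %w[GOOD MISS].include?(play[:action])
    totals[play[:team]][:attempted] += 1
    totals[play[:team]][:made] += 1 if play[:action] == "GOOD"
  end
  totals
end

shooting_by_team(sample).each do |team, t|
  puts "#{team}: #{t[:made]}/#{t[:attempted]}"
end
```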
Re: Public basketball data and analysis github
You don't really want to look at playing time, of course, but there are a variety of statistical measures in the data set that you can regress against. You also need to include the college strength of schedule; the quality of the estimate skyrockets if you do this. Ideally you'd want to adjust the player performance numbers by the strength of offense and defense faced, but it's unclear (to me) what the best way to do this is.
What I'd suggest doing:
1) Better matching on the college/pro data - you could probably expand the size of the data set by 50% over my automatic matching. Going back a larger number of college years would be easy to do for the NBA given the small size of the draft (vs baseball).
2) Use a better target than playing time as a first-year pro.
3) Adjust player college stats using college strength of offense and defense faced. You might want to do this on the pro side, too.
4) For this type of research you'd want to estimate college strength of opponents using the entire season. I have it set to exclude March and April games for prediction testing.
-Chris
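The March/April exclusion in point 4 amounts to a date-based holdout. A minimal sketch of that split follows; the `Game` struct, its field names and the dates are hypothetical placeholders, not the repository's schema or real results.

```ruby
require "date"

# Hypothetical game record; field names are illustrative only.
Game = Struct.new(:date, :home, :away)

HELD_OUT_MONTHS = [3, 4].freeze # March and April

# Games in the held-out months form the test set; the rest are used to
# estimate team strength.
def split_for_testing(games)
  test, train = games.partition { |g| HELD_OUT_MONTHS.include?(g.date.month) }
  { train: train, test: test }
end

# Placeholder dates and pairings, made up for the example.
games = [
  Game.new(Date.new(2011, 11, 14), "SYR", "MANHAT"),
  Game.new(Date.new(2012, 1, 21), "WASH", "SDSU"),
  Game.new(Date.new(2012, 3, 15), "SYR", "WASH"),
  Game.new(Date.new(2012, 4, 2), "SDSU", "MANHAT")
]

split = split_for_testing(games)
puts "train: #{split[:train].size}, test: #{split[:test].size}"
```

For the research use described in point 4, you would skip the split and estimate opponent strength from the full list.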