Public basketball data and analysis github

Home for all your discussion of basketball statistical analysis.
Eternal
Posts: 62
Joined: Sun Nov 11, 2012 11:35 pm
Location: San Diego, CA

Public basketball data and analysis github

Post by Eternal »

I've opened up a subset of my basketball github to the public:

https://github.com/galizur/basketball-public

The code is in R, Ruby, Bash and SQL and presumes a PostgreSQL database. This isn't meant to be anything special or cutting edge, just a little introduction to some data science using basketball. My day job is doing quantitative analysis for the San Diego Padres.

It includes basic college and pro game and performance data, basic feature detection for predicting pro performance from college performance, and power rankings for amateur and pro teams. The power rankings for college teams pool teams within divisions, then measure pool strength over 2002-2012. This lets you measure relative team strength across D1, D2 and D3 and rate all NCAA teams on the same scale. Also included are home/away factors, plus a demonstration that the distance a team travels affects its performance. Distances are calculated by geocoding cities with Yahoo's PlaceFinder API, then computing the great-circle distance between them. There are lots of obvious improvements that can be made.
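
A minimal sketch of the great-circle step, for anyone who wants to try the travel-distance idea without the geocoding part. It uses the haversine formula, which may or may not be what the repo uses; the function names, the Earth radius value and the approximate city coordinates are all assumptions made for illustration.

```ruby
# Mean Earth radius in kilometres (a common approximation, assumed here).
EARTH_RADIUS_KM = 6371.0

def to_radians(degrees)
  degrees * Math::PI / 180.0
end

# Haversine great-circle distance between two points given as
# (latitude, longitude) pairs in decimal degrees.
def great_circle_km(lat1, lon1, lat2, lon2)
  phi1 = to_radians(lat1)
  phi2 = to_radians(lat2)
  dphi = to_radians(lat2 - lat1)
  dlambda = to_radians(lon2 - lon1)
  a = Math.sin(dphi / 2)**2 +
      Math.cos(phi1) * Math.cos(phi2) * Math.sin(dlambda / 2)**2
  # Clamp to 1.0 so rounding error cannot push asin out of its domain.
  2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt([a, 1.0].min))
end

# Approximate, hand-entered coordinates standing in for geocoder output.
san_diego = [32.72, -117.16]
syracuse = [43.05, -76.15]
puts great_circle_km(*san_diego, *syracuse).round
```

The result is in kilometres; swap in a radius in miles if that is the unit you want to regress against.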

I've included play-by-play data for 10,000 NCAA games in XML, plus parsed versions in CSV files. I haven't done anything with this data yet, however.

My Twitter:

https://twitter.com/octonion

My (sometimes) blog:

http://angrystatistician.blogspot.com

-Chris
J.E.
Posts: 852
Joined: Fri Apr 15, 2011 8:28 am

Re: Public basketball data and analysis github

Post by J.E. »

Pretty cool but I think it's a little unclear where to find what.
I wasn't able to find the NCAA PBP, for example. A clearer naming scheme might make navigation a little easier.
DSMok1
Posts: 1119
Joined: Thu Apr 14, 2011 11:18 pm
Location: Maine

Re: Public basketball data and analysis github

Post by DSMok1 »

Very nice, Chris. I'm not sure what to do with it (yet) but this is an excellent source of data/scripts. Sooner or later I've got to learn SQL...
Developer of Box Plus/Minus
APBRmetrics Forum Administrator
Twitter.com/DSMok1
Eternal
Posts: 62
Joined: Sun Nov 11, 2012 11:35 pm
Location: San Diego, CA

Re: Public basketball data and analysis github

Post by Eternal »

J.E. wrote: Pretty cool but I think it's a little unclear where to find what.
I wasn't able to find the NCAA PBP, for example. A clearer naming scheme might make navigation a little easier.
Yes, it's still a work in progress with respect to user friendliness.

The NCAA play-by-play is under the "cstv" directory. It contains a compressed tar archive of the raw XML files and another of the same files parsed into CSV. I've included the basic Ruby script I wrote to do the parsing.

Sample from one of the XML files:

<plays format="tokens">
   <period number="1" time="00:00">
    <special vh="V" pts_to="" pts_ch2="" pts_paint="" pts_fastb="" pts_bench="" ties="" leads="" poss_count="" poss_time="" score_count="" score_time="" large_lead="" large_lead_t=""></special>
    <special vh="H" pts_to="" pts_ch2="" pts_paint="" pts_fastb="" pts_bench="" ties="" leads="" poss_count="" poss_time="" score_count="" score_time="" large_lead="" large_lead_t=""></special>
    <summary vh="V" fgm="9" fga="29" fgm3="4" fga3="10" ftm="4" fta="5" tp="26" blk="" stl="" ast="" oreb="" dreb="" treb="" pf="9" tf="" to=""></summary>
    <summary vh="H" fgm="22" fga="38" fgm3="7" fga3="12" ftm="1" fta="4" tp="52" blk="" stl="" ast="" oreb="" dreb="" treb="" pf="7" tf="" to=""></summary>
    <play vh="V" time="19:36" uni="24" team="MANHAT" checkname="BEAMON,GEORGE" action="GOOD" type="JUMPER" paint="Y" fastb="" vscore="2" hscore="0" play_id="1"></play>
    <play vh="H" time="19:09" uni="20" team="SYR" checkname="TRICHE,BRANDON" action="MISS" type="JUMPER" play_id="2"></play>
    <play vh="H" time="19:09" uni="51" team="SYR" checkname="MELO,FAB" action="REBOUND" type="OFF" play_id="3"></play>
    <play vh="H" time="18:35" uni="11" team="SYR" checkname="JARDINE,SCOOP" action="MISS" type="LAYUP" play_id="4"></play>
Sample from one of the resulting CSV files (play.csv):

1000617,1,H,19:53,31,WASH,"ROSS,TERRENCE",GOOD,JUMPER,Y,0,2,1
1000617,1,V,19:26,33,SDSU,"CALLAHAN,GRIFFAN",GOOD,JUMPER,Y,2,2,2
1000617,1,V,19:26,42,SDSU,"DYKSTRA,JORDAN",ASSIST,,,,,3
1000617,1,H,19:11,00,WASH,"GADDY,ABDUL",MISS,JUMPER,,,,4
1000617,1,V,19:11,03,SDSU,"WOLTERS,NATE",REBOUND,DEF,,,,5
1000617,1,V,19:00,33,SDSU,"CALLAHAN,GRIFFAN",GOOD,3PTR,,5,2,6
1000617,1,H,18:57,TM,WASH,TEAM,TIMEOUT,30SEC,,,,7
1000617,1,V,18:33,03,SDSU,"WOLTERS,NATE",FOUL,,,,,8
1000617,1,H,18:31,23,WASH,"WILCOX,CJ",MISS,JUMPER,,,,9
1000617,1,V,18:31,12,SDSU,"CARLSON,BRAYDEN",REBOUND,DEF,,,,10
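
For readers who want to see roughly how the XML maps to the CSV rows above, here is a small sketch using Ruby's REXML and CSV libraries. It is not the script from the repo. The column order is inferred from the play.csv sample, the game id 9999999 is a placeholder, and the XML is a trimmed copy of the sample with the closing tags added so that it parses.

```ruby
require 'csv'
require 'rexml/document'

# Trimmed copy of the XML sample, closed off so it is well-formed.
SAMPLE_XML = <<~XML_DOC
  <plays format="tokens">
    <period number="1" time="00:00">
      <play vh="V" time="19:36" uni="24" team="MANHAT" checkname="BEAMON,GEORGE" action="GOOD" type="JUMPER" paint="Y" fastb="" vscore="2" hscore="0" play_id="1"></play>
      <play vh="H" time="19:09" uni="20" team="SYR" checkname="TRICHE,BRANDON" action="MISS" type="JUMPER" play_id="2"></play>
    </period>
  </plays>
XML_DOC

# Assumed column order after game id and period, inferred from play.csv.
PLAY_ATTRS = %w[vh time uni team checkname action type paint
                vscore hscore play_id].freeze

# Returns one array per <play> element: game id, period number, then the
# attributes listed in PLAY_ATTRS (nil where an attribute is absent or empty).
def plays_to_rows(xml, game_id)
  doc = REXML::Document.new(xml)
  rows = []
  doc.elements.each('plays/period') do |period|
    period.elements.each('play') do |play|
      values = PLAY_ATTRS.map do |name|
        value = play.attributes[name]
        value.nil? || value.empty? ? nil : value
      end
      rows << [game_id, period.attributes['number'], *values]
    end
  end
  rows
end

# 9999999 is a placeholder game id, not a real one from the data set.
rows = plays_to_rows(SAMPLE_XML, 9_999_999)
rows.each { |row| puts CSV.generate_line(row) }
```

Empty attributes are turned into nil so that they come out as empty CSV fields, as in the sample, rather than as quoted empty strings.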
-Chris
Crow
Posts: 10533
Joined: Thu Apr 14, 2011 11:10 pm

Re: Public basketball data and analysis github

Post by Crow »

Anyone want to summarize or comment on this?

https://github.com/galizur/basketball-p ... ection.txt
Eternal
Posts: 62
Joined: Sun Nov 11, 2012 11:35 pm
Location: San Diego, CA

Re: Public basketball data and analysis github

Post by Eternal »

You don't really want to look at playing time, of course, but there are a variety of statistical measures in the data set that you can regress against. You also need to include the college strength of schedule - the quality of the estimate skyrockets if you do this. Ideally you'd want to adjust the player performance numbers by the strength of offense and defense faced, but the best way to do this is unclear (to me).

What I'd suggest doing:

1) Better matching on the college/pro data - you could probably expand the size of the data set by 50% over my automatic matching. Going back a larger number of college years would be easy to do for the NBA given the small size of the draft (vs baseball).
2) Use a better target than playing time as a first-year pro.
3) Adjust player college stats using college strength of offense and defense faced. You might want to do this on the pro side, too.
4) For this type of research you'd want to estimate college strength of opponents using the entire season. I have it set to exclude March and April games for prediction testing.
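
As an illustration of the March/April exclusion in point 4, here is a minimal sketch of splitting a game list into games used for estimating strength and games held out for prediction testing. The Game struct, its field names and the four games are made up for the example and are not the repo's schema or data.

```ruby
require 'date'

# Hypothetical game record; field names are illustrative only.
Game = Struct.new(:date, :home, :away, :home_pts, :away_pts)

# Months held out of the strength estimate for prediction testing.
HOLDOUT_MONTHS = [3, 4].freeze

# Splits games into those used to estimate team strength and those
# held out (March and April) to test predictions against.
def split_games(games)
  holdout, training = games.partition do |game|
    HOLDOUT_MONTHS.include?(game.date.month)
  end
  { training: training, holdout: holdout }
end

# Made-up games, only to show the split.
games = [
  Game.new(Date.new(2011, 11, 15), 'Team A', 'Team B', 70, 65),
  Game.new(Date.new(2012, 1, 7), 'Team B', 'Team C', 58, 61),
  Game.new(Date.new(2012, 3, 16), 'Team A', 'Team C', 66, 72),
  Game.new(Date.new(2012, 4, 2), 'Team C', 'Team B', 80, 77)
]

split = split_games(games)
puts "training: #{split[:training].size}, holdout: #{split[:holdout].size}"
```

For the full-season estimate suggested in point 4, you would skip the split and feed every game into the strength calculation.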

-Chris