Re: nba.com now has play by play data back to 1997

Posted: Wed Feb 20, 2013 7:28 pm
by kpascual
sbs wrote:With some clever programming, the boxscore range should be able to resolve all of the problems caused by the play-by-play listing last names only.
That's what I've been doing, but I wouldn't call it clever:
https://github.com/kpascual/nbascrape/b ... _nbacom.py
colts18 wrote:No problem. Will you be able to parse out the pbp data from 97-00?
I can't promise I have time to do it, but I bet EvanZ either has already done it or is doing it right now.

Re: nba.com now has play by play data back to 1997

Posted: Wed Feb 20, 2013 10:34 pm
by colts18
kpascual wrote:
colts18 wrote:No problem. Will you be able to parse out the pbp data from 97-00?
I can't promise I have time to do it, but I bet EvanZ either has already done it or is doing it right now.
What does the parsing out process entail? I have no knowledge on this so I want to know. How long does it usually take to parse out this data?

Re: nba.com now has play by play data back to 1997

Posted: Thu Feb 21, 2013 1:26 am
by EvanZ
kpascual wrote:
I can't promise I have time to do it, but I bet EvanZ either has already done it or is doing it right now.
Haha. I'm now definitely considering it. :D

It would be great for my site to have all the data going back that far. Although, it could get rather expensive to host what would probably be about 10 GB worth of pbp and matchup data.

Re: nba.com now has play by play data back to 1997

Posted: Thu Feb 21, 2013 8:36 am
by AcrossTheCourt
I just wanted to say I'd love to see the end result of this work. You can estimate the plus/minus value of Bulls Jordan and get a more complete understanding of prime Shaq! That's a gold mine. I've always wanted to attempt an adjusted plus/minus model, but unfortunately I have too much real research work now. (If they pay you, you kinda have to do it.)

Re: nba.com now has play by play data back to 1997

Posted: Thu Feb 21, 2013 12:08 pm
by DSMok1
EvanZ wrote:
kpascual wrote:
I can't promise I have time to do it, but I bet EvanZ either has already done it or is doing it right now.
Haha. I'm now definitely considering it. :D

It would be great for my site to have all the data going back that far. Although, it could get rather expensive to host what would probably be about 10 GB worth of pbp and matchup data.
The raw, parsed files could be hosted elsewhere (but you were referring to NBAWOWY?)

Re: nba.com now has play by play data back to 1997

Posted: Thu Feb 21, 2013 2:04 pm
by colts18
EvanZ wrote:
kpascual wrote:
I can't promise I have time to do it, but I bet EvanZ either has already done it or is doing it right now.
Haha. I'm now definitely considering it. :D

It would be great for my site to have all the data going back that far. Although, it could get rather expensive to host what would probably be about 10 GB worth of pbp and matchup data.

How does this parsing out process work? Is there some kind of code to make it work?


Does NBAWOWY extend to before this season? Could this raw parsed-out 1997-2000 pbp be added to NBAWOWY, or do you need a different kind of pbp?

Re: nba.com now has play by play data back to 1997

Posted: Thu Feb 21, 2013 3:00 pm
by EvanZ
@Daniel, the way my site works is that the app is running off Nodejitsu and the database is running off MongoLab. The plan I'm on right now is 1 GB per month storage and that's about $10/mo. Not a big deal. But it goes up steeply from there. 4 GB per month is $40/mo. Every GB after that is another $11, so 10 GB would be over $100/mo. Since I don't make money off the site, that's $1200/yr coming out of my pocket. And for all I know the database might end up being around 15 GB to go back to '97.
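
A quick sketch of that arithmetic in Python. The tier boundaries are an assumption pieced together from the figures above (about $10 up to 1 GB, $40 up to 4 GB, then $11 per extra GB); `monthly_cost` is a made-up helper name, not anything from the hosting provider.

```python
def monthly_cost(gb):
    """Estimated monthly cost in dollars for a database of `gb` whole GB.

    Assumed tiers: ~$10 up to 1 GB, $40 up to 4 GB, +$11 per GB after that.
    """
    if gb <= 1:
        return 10
    if gb <= 4:
        return 40
    return 40 + 11 * (gb - 4)  # $11 for each GB past the 4 GB tier


for size in (1, 4, 10, 15):
    per_month = monthly_cost(size)
    print(f"{size} GB: ${per_month}/mo, ${12 * per_month}/yr")
```

Under these assumptions 10 GB comes to $106/mo (a bit over $1200/yr), and 15 GB to $161/mo.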

@colts18, right now nbawowy is just this season. My play-by-play is coming from NBC Sports which was the cleanest, easiest source to parse that I could find (for example, providing full names on every play). Unfortunately, before I could scrape prior seasons from the site (which I think they had at one point), they took them all down (I'm assuming at the request of the NBA).
How does this parsing out process work? Is there some kind of code to make it work?
Yes, there is some kind of code. The primary challenge of parsing play-by-play is determining who is on the court. Using the NBC dataset, I've got an error rate that is very, very low. It's almost a perfect process with very little manual correction involved. Every day it takes me about 5 minutes to update the site. The one good thing about going back and dealing with old data, is that once you do it, you don't have to mess with it again. I'd love to add it to my site, but like Ken said, it's a matter of finding time.
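
To give a rough idea of what "determining who is on the court" involves, here is a minimal sketch. It assumes the starters are already known (for example from the boxscore) and that every substitution shows up as an event naming the player going in and the player coming out. The event layout and the `track_lineups` name are hypothetical, not taken from any real parser; real play-by-play is messier (missing subs, changes between quarters), which is where the manual correction comes in.

```python
def track_lineups(starters, events):
    """Tag each event with a snapshot of who is on the court.

    starters: dict mapping team -> list of the five starting players
    events:   list of dicts; substitutions look like
              {"type": "sub", "team": ..., "in": ..., "out": ...}
    Returns a list of (event, on_court) pairs, where on_court maps each
    team to a frozenset of players after the event has been applied.
    """
    on_court = {team: set(players) for team, players in starters.items()}
    tagged = []
    for event in events:
        if event["type"] == "sub":
            lineup = on_court[event["team"]]
            if event["out"] not in lineup:
                # The data disagrees with our state: flag it for manual review.
                raise ValueError(f"{event['out']} is not on the court")
            lineup.remove(event["out"])
            lineup.add(event["in"])
        tagged.append((event, {t: frozenset(p) for t, p in on_court.items()}))
    return tagged


# Made-up players and events, purely for illustration.
starters = {
    "HOME": ["H1", "H2", "H3", "H4", "H5"],
    "AWAY": ["A1", "A2", "A3", "A4", "A5"],
}
events = [
    {"type": "shot", "team": "HOME", "player": "H1"},
    {"type": "sub", "team": "HOME", "in": "H6", "out": "H1"},
    {"type": "shot", "team": "AWAY", "player": "A2"},
]
tagged = track_lineups(starters, events)
print(sorted(tagged[-1][1]["HOME"]))
```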

Hey, Ken. There's a D3 meetup in SF tonight at Trulia. Any chance you're going? I'll be there.

Re: nba.com now has play by play data back to 1997

Posted: Fri Feb 22, 2013 6:46 pm
by colts18
EvanZ wrote:@Daniel, the way my site works is that the app is running off Nodejitsu and the database is running off MongoLab. The plan I'm on right now is 1 GB per month storage and that's about $10/mo. Not a big deal. But it goes up steeply from there. 4 GB per month is $40/mo. Every GB after that is another $11, so 10 GB would be over $100/mo. Since I don't make money off the site, that's $1200/yr coming out of my pocket. And for all I know the database might end up being around 15 GB to go back to '97.

@colts18, right now nbawowy is just this season. My play-by-play is coming from NBC Sports which was the cleanest, easiest source to parse that I could find (for example, providing full names on every play). Unfortunately, before I could scrape prior seasons from the site (which I think they had at one point), they took them all down (I'm assuming at the request of the NBA).
How does this parsing out process work? Is there some kind of code to make it work?
Yes, there is some kind of code. The primary challenge of parsing play-by-play is determining who is on the court. Using the NBC dataset, I've got an error rate that is very, very low. It's almost a perfect process with very little manual correction involved. Every day it takes me about 5 minutes to update the site. The one good thing about going back and dealing with old data, is that once you do it, you don't have to mess with it again. I'd love to add it to my site, but like Ken said, it's a matter of finding time.

Hey, Ken. There's a D3 meetup in SF tonight at Trulia. Any chance you're going? I'll be there.
How long does it take to parse out a season's worth of pbp data? Are you able to do an APM on that data?

Re: nba.com now has play by play data back to 1997

Posted: Fri Feb 22, 2013 7:57 pm
by EvanZ
colts18 wrote:
How long does it take to parse out a season's worth of pbp data? Are you able to do an APM on that data?
The short answer is it doesn't hardly take any time at all once you've written the code. The longer answer is it takes a lot of time to write the code.

And yes, once you do that, you can calculate APM or RAPM or whatever.
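
For reference, the usual textbook formulation (the notation here is an assumption, not necessarily what anyone in this thread uses): each row of $X$ is one stint with a fixed set of ten players on the court, with $+1$ for each home player on the court, $-1$ for each away player, and $0$ otherwise; $y$ is the home team's point margin per possession over that stint. APM is the ordinary least-squares fit, and RAPM adds a ridge penalty $\lambda$:

```latex
\hat{\beta}_{\mathrm{APM}}
  = \arg\min_{\beta} \lVert y - X\beta \rVert_2^2,
\qquad
\hat{\beta}_{\mathrm{RAPM}}
  = \arg\min_{\beta} \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_2^2
  = (X^\top X + \lambda I)^{-1} X^\top y .
```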

Re: nba.com now has play by play data back to 1997

Posted: Fri Feb 22, 2013 10:52 pm
by kpascual
EvanZ wrote: Hey, Ken. There's a D3 meetup in SF tonight at Trulia. Any chance you're going? I'll be there.
Dammit, I missed it. I was signed up to go, but forgot I had a rec league basketball game. But yeah we should actually hang out sometime. Sloan conference? Other meetups?
EvanZ wrote:
colts18 wrote:
How long does it take to parse out a season's worth of pbp data? Are you able to do an APM on that data?
The short answer is it doesn't hardly take any time at all once you've written the code. The longer answer is it takes a lot of time to write the code.

And yes, once you do that, you can calculate APM or RAPM or whatever.
This. The primary constraint isn't time, it's usually effort.

Re: nba.com now has play by play data back to 1997

Posted: Sat Feb 23, 2013 3:17 am
by colts18
EvanZ wrote:
colts18 wrote:
How long does it take to parse out a season's worth of pbp data? Are you able to do an APM on that data?
The short answer is it doesn't hardly take any time at all once you've written the code. The longer answer is it takes a lot of time to write the code.

And yes, once you do that, you can calculate APM or RAPM or whatever.
Do you think you would be able to do it? Or is it too hard?

Re: nba.com now has play by play data back to 1997

Posted: Sat Feb 23, 2013 3:36 am
by EvanZ
It's not about whether it's "hard". For me it's just getting the time to do it. Maybe, maybe not. Can't guarantee anything.

Re: nba.com now has play by play data back to 1997

Posted: Sun Feb 24, 2013 10:41 am
by J.E.
If someone converts the data into text files with the format

gameid TAB linenumber TAB time TAB [team_id] description

I can take a crack at it using the parser I wrote for bbr PBP


(I don't want to do the html-> text conversion because, as I said, I think at some point the PBP will appear on bbr, for which I already have a crawler/converter)
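
A minimal sketch of writing one line in that format, assuming the square brackets around the team id are literal and that tabs or newlines inside the description should be flattened to spaces. `format_row` and the sample values are made up for illustration.

```python
def format_row(game_id, line_number, clock, team_id, description):
    """Build one line: gameid TAB linenumber TAB time TAB [team_id] description."""
    # Collapse any whitespace (including stray tabs/newlines) so the
    # description cannot break the tab-separated layout.
    clean = " ".join(description.split())
    return "\t".join([str(game_id), str(line_number), clock, f"[{team_id}] {clean}"])


# Hypothetical sample values.
line = format_row("0029700001", 1, "12:00", "CHI", "Jump ball")
print(line)
```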

Re: nba.com now has play by play data back to 1997

Posted: Sun Feb 24, 2013 6:55 pm
by colts18
J.E. wrote:If someone converts the data into text files with the format

gameid TAB linenumber TAB time TAB [team_id] description

I can take a crack at it using the parser I wrote for bbr PBP


(I don't want to do the html-> text conversion because, as I said, I think at some point the PBP will appear on bbr, for which I already have a crawler/converter)
How long would it take to do that for the 4 seasons? If it's not too long, and someone taught me how to do it, I guess I could try.

Re: nba.com now has play by play data back to 1997

Posted: Sun Feb 24, 2013 7:32 pm
by J.E.
Python with urllib is a good place to start.
Further, you could use Python's BeautifulSoup or go the laborious way with str.split and str.replace
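
A rough sketch of the "laborious way", using only the standard library. In Python 3 the fetching lives in `urllib.request`. The HTML in `SAMPLE` is invented for illustration and is not the actual nba.com markup, so the tag names to split on are an assumption; a real HTML parser such as BeautifulSoup will be far more forgiving of messy pages.

```python
import urllib.request


def fetch_html(url):
    """Download a page and return its text (not called in this example)."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def rows_from_html(html):
    """Pull the cell text out of every <tr> using only split/replace."""
    rows = []
    for chunk in html.split("<tr")[1:]:
        row_html = chunk.split("</tr>")[0]
        cells = []
        for cell_chunk in row_html.split("<td")[1:]:
            cell = cell_chunk.split("</td>")[0]
            cell = cell.split(">", 1)[1]  # drop the rest of the opening <td ...> tag
            cells.append(cell.replace("&nbsp;", " ").strip())
        rows.append(cells)
    return rows


# Invented sample markup, standing in for a downloaded play-by-play page.
SAMPLE = (
    '<table><tr><td class="clock">12:00</td><td>Jump ball</td></tr>'
    "<tr><td>11:42</td><td>Smith&nbsp;misses layup</td></tr></table>"
)
for row in rows_from_html(SAMPLE):
    print("\t".join(row))
```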