@Daniel, the way my site works is that the app is running off Nodejitsu and the database is running off MongoLab. The plan I'm on right now is 1 GB per month storage and that's about $10/mo. Not a big deal. But it goes up steeply from there. 4 GB per month is $40/mo. Every GB after that is another $11, so 10 GB would be over $100/mo. Since I don't make money off the site, that's $1200/yr coming out of my pocket. And for all I know the database might end up being around 15 GB to go back to '97.
@colts18, right now nbawowy is just this season. My play-by-play is coming from NBC Sports, which was the cleanest, easiest source to parse that I could find (for example, providing full names on every play). Unfortunately, before I could scrape prior seasons from the site (which I think they had at one point), they took them all down (I'm assuming at the request of the NBA).
How does this parsing out process work? Is there some kind of code to make it work?
Yes, there is some kind of code. The primary challenge of parsing play-by-play is determining who is on the court. Using the NBC dataset, I've got an error rate that is very, very low. It's almost a perfect process with very little manual correction involved. Every day it takes me about 5 minutes to update the site. The one good thing about going back and dealing with old data is that once you do it, you don't have to mess with it again. I'd love to add it to my site, but like Ken said, it's a matter of finding time.
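To give a feel for the on-court tracking problem, here's a minimal sketch in JavaScript (the site's own stack). The event shape and player names are hypothetical, not the actual NBC feed or nbawowy's code; the idea is just to show why full names matter: you start from the known lineup and apply each substitution in order, and any sub involving a player you don't think is on the floor signals a parsing error to correct by hand.

```javascript
// Hypothetical substitution events; the field names here are
// illustrative, not the real feed's schema.
function trackLineup(starters, events, team) {
  const onCourt = new Set(starters); // the five players currently on the floor
  for (const ev of events) {
    if (ev.type !== "substitution" || ev.team !== team) continue;
    if (!onCourt.has(ev.out)) {
      // A sub for someone we don't have on the floor means our state
      // is wrong; flag it for manual correction instead of guessing.
      throw new Error(`inconsistent sub: ${ev.out} is not on court`);
    }
    onCourt.delete(ev.out);
    onCourt.add(ev.in);
  }
  return [...onCourt].sort();
}

// Example with made-up names:
const lineup = trackLineup(
  ["A. Guard", "B. Guard", "C. Wing", "D. Forward", "E. Center"],
  [{ type: "substitution", team: "BOS", out: "E. Center", in: "F. Backup" }],
  "BOS"
);
console.log(lineup); // starters with F. Backup in place of E. Center
```

This is where ambiguous sources bite: if a feed only gives last names, "Smith out" can match two players and the lineup state silently drifts, which is why a clean source keeps the error rate so low.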
Hey, Ken. There's a D3 meetup in SF tonight at Trulia. Any chance you're going? I'll be there.