
Parsing play-by-play data

Posted: Sun Nov 08, 2015 6:21 am
by ethanluo
Hi, I have been working on basketball analytics for quite a while. Some of the data I need has to be parsed directly from the play-by-play.

A few people have written code to extract the play-by-play from stats.nba.com or ESPN, but I noticed that they do not usually have tools to parse the play-by-play into usable CSV for statistics. So I have been implementing my own parser to do that job, and I hope to share the codebase as open source with this community to make the process easier.

I noticed that different sources use different formats for the pbp, so what people usually do is write a separate set of regular expressions for each site, which I believe is tedious to maintain. Furthermore, there may be some outliers that the patterns miss. I hope to implement a universal parser that can be quickly adapted to different websites. To do that, I did some very simple natural language processing and tokenization of the text, and after that I will classify each play via machine learning.
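
To give a rough idea of what I mean, here is a minimal sketch of the tokenize-then-classify idea, not my actual codebase. The example descriptions, the label names, and the choice of a small Naive Bayes classifier are all assumptions made up for illustration; real pbp text from each site will look different.

```python
import math
import re
from collections import Counter, defaultdict

# Words and digit runs become tokens; punctuation and hyphens are dropped.
TOKEN_RE = re.compile(r"[A-Za-z]+|\d+")


def tokenize(description):
    """Lowercase a play description and replace every number with <num>."""
    return ["<num>" if tok.isdigit() else tok.lower()
            for tok in TOKEN_RE.findall(description)]


class NaiveBayesPlayClassifier:
    """Bag-of-words Naive Bayes with add-one smoothing (illustrative only)."""

    def __init__(self):
        self.token_counts = defaultdict(Counter)  # label -> token -> count
        self.label_counts = Counter()             # label -> number of examples
        self.vocab = set()

    def fit(self, examples):
        for text, label in examples:
            self.label_counts[label] += 1
            for tok in tokenize(text):
                self.token_counts[label][tok] += 1
                self.vocab.add(tok)

    def predict(self, text):
        tokens = tokenize(text)
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.label_counts.items():
            # log prior + sum of smoothed log likelihoods
            score = math.log(n / total)
            denom = sum(self.token_counts[label].values()) + len(self.vocab)
            for tok in tokens:
                score += math.log((self.token_counts[label][tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical hand-labeled lines; not taken from any real site.
TRAINING = [
    ("Player A makes 2-pt shot from 5 ft", "made_shot"),
    ("Player B makes 3-pt jump shot", "made_shot"),
    ("Player A misses 2-pt layup", "missed_shot"),
    ("Player C misses 3-pt jump shot from 25 ft", "missed_shot"),
    ("Defensive rebound by Player B", "rebound"),
    ("Offensive rebound by Player C", "rebound"),
    ("Turnover by Player A (bad pass)", "turnover"),
    ("Player B lost ball turnover", "turnover"),
]

clf = NaiveBayesPlayClassifier()
clf.fit(TRAINING)
print(clf.predict("Player D misses 2-pt jump shot"))
```

The point of the sketch is that supporting a new site would mean supplying labeled lines from that site rather than writing new patterns, which is why I need prepared data.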

It works okay at the moment, but I definitely need some help. To assess the reliability of the parser and to finish it, I need prepared (labeled) data. I noticed that NBAStuffer has data in the form I want for training the parser. But to complete the parser for websites such as ESPN, I will probably need someone to manually prepare data in a format similar to that of NBAStuffer. I am not sure whether someone already has it.

Does anyone have any idea how I should proceed from here?

Re: Parsing play-by-play data

Posted: Mon Nov 09, 2015 4:28 pm
by browning
Hey, sounds like a good cause. I'm curious how effective NLP will be compared with regexes, though, because each site has a pretty standard format, which makes regexes work well.
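
To illustrate the regex approach, here is a minimal sketch. The line format ("clock, player, makes/misses, N-pt") and the names SHOT_RE and parse_shot are made up for the example; they are not the actual ESPN or stats.nba.com text or my actual parser.

```python
import re

# Hypothetical line format, e.g. "10:32 Player A makes 2-pt shot".
# Real sites use different wording, so each one needs its own pattern.
SHOT_RE = re.compile(
    r"^(?P<clock>\d{1,2}:\d{2})\s+"    # game clock, mm:ss
    r"(?P<player>.+?)\s+"              # player name (lazy match)
    r"(?P<result>makes|misses)\s+"     # shot outcome
    r"(?P<points>[23])-pt\b"           # shot value
)


def parse_shot(line):
    """Return a dict of fields for a shot line, or None if it does not match."""
    m = SHOT_RE.match(line)
    if m is None:
        return None
    row = m.groupdict()
    row["points"] = int(row["points"])
    return row


print(parse_shot("10:32 Player A makes 2-pt shot"))
```

Lines that return None are the outliers; with a fixed format there tend to be few of them.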

Anyway, I'm happy to help you get the data. I have parsers for both the ESPN and stats.nba.com play-by-play pages.

You can email me at bwbrowning@gmail.com and we can talk about the format you would like the data in.

Cheers