
Statistical Team Comps (acollard, 2010)

Posted: Fri Apr 15, 2011 2:50 am
by Crow

Author Message
acollard



Joined: 22 Sep 2010
Posts: 49
Location: MA

Posted: Wed Dec 29, 2010 11:17 am Post subject: Statistical Team Comps
I've done some work on assessing the similarity between teams. I know Neil Paine did a good job of this last year, using the Four Factors both for and against a team. I thought it might be interesting to use more parameters to capture playing style as well as playing efficiency. I used per-100-possession numbers for every basic stat (FG, FGA, FG%, 3P, 3PA, 3P%, AST, BLK, PTS, etc.) for every team and their opponents. To scale them I ranked them either within the season (to allow comparisons between eras) or within the entire dataset (to get the closest similarity). A longer explanation of this is in my article at DailyThunder.com.
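
For anyone who wants to try the rank-scaling idea, here is a minimal Python sketch. It assumes plain Euclidean distance on the ranked stats, which the post does not confirm, and the team codes, stat subset and numbers in ROWS are made up for illustration. The `within_season` switch corresponds to the two scaling options described above.

```python
from math import sqrt

# Hypothetical per-100-possession rows: (team, season, {stat: value}).
# The values are invented for illustration, not real team stats.
ROWS = [
    ("AAA", "2009-10", {"FGA": 80.0, "3PA": 20.0, "AST": 22.0}),
    ("BBB", "2009-10", {"FGA": 85.0, "3PA": 15.0, "AST": 20.0}),
    ("AAA", "2010-11", {"FGA": 82.0, "3PA": 25.0, "AST": 24.0}),
    ("BBB", "2010-11", {"FGA": 84.0, "3PA": 18.0, "AST": 21.0}),
]


def percentile_ranks(values):
    """Map each value to a rank in [0, 1]; tied values share the average rank."""
    n = len(values)
    if n == 1:
        return [0.5]
    order = sorted(values)
    ranks = []
    for v in values:
        first = order.index(v)                  # first sorted position of v
        last = n - 1 - order[::-1].index(v)     # last sorted position of v
        ranks.append(((first + last) / 2) / (n - 1))
    return ranks


def scale(rows, within_season):
    """Rank every stat either inside each season or across the whole dataset."""
    stats = sorted(rows[0][2])
    groups = {}
    for team, season, vals in rows:
        key = season if within_season else "ALL"
        groups.setdefault(key, []).append((team, season, vals))
    scaled = {(team, season): {} for team, season, _ in rows}
    for members in groups.values():
        for stat in stats:
            ranks = percentile_ranks([vals[stat] for _, _, vals in members])
            for (team, season, _), r in zip(members, ranks):
                scaled[(team, season)][stat] = r
    return scaled


def distance(a, b):
    """Euclidean distance between two scaled stat dictionaries (an assumption)."""
    return sqrt(sum((a[k] - b[k]) ** 2 for k in a))


def comps(scaled, target, n=3):
    """Return the n team-seasons closest to the target."""
    others = [(distance(scaled[target], v), k) for k, v in scaled.items() if k != target]
    return [k for _, k in sorted(others)[:n]]
```

Calling `comps(scale(ROWS, within_season=True), ("AAA", "2010-11"))` gives the season-ranked comps; passing `within_season=False` gives the whole-dataset version.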

The spreadsheet of everything is here (big file).

The spreadsheet with just the comp results is here (four methods total).

Some interesting results (besides the '11 Thunder being very similar to Washington's mid-00s teams):

Heat '11 comps:
Code:
Box ALL      Box Season   FF ALL       FF Season
ORL*2008-09  SAS*1994-95  ORL*2008-09  SAS*2000-01
ORL*2009-10  SAS*2000-01  MIA*2004-05  SAS*1999-00
CLE*2009-10  MIA*2004-05  ORL*2009-10  CLE*2009-10
SAS*2006-07  POR*1998-99  CLE*2009-10  SAS*2001-02
MIA*2005-06  SAS*1999-00  SAS*2000-01  LAL*1990-91
ORL*2007-08  UTA*1991-92  CLE*2008-09  LAL*1989-90
CLE*2008-09  CLE*2008-09  DAL2010-11   UTA*1993-94
LAL*2007-08  MIL*1982-83  SAS*2006-07  DET*1987-88
CLE*2005-06  DET*1987-88  IND*1999-00  UTA*1991-92
LAC*2005-06  SEA*1981-82  ORL*2007-08  IND*1997-98


Mavs '11 comps (same):

Code:
Box ALL      Box Season   FF ALL       FF Season
LAL*2007-08  BOS*1986-87  ORL*2009-10  CHH*1994-95
ATL2010-11   HOU*1992-93  SAS*2006-07  CLE*2009-10
SAS*2007-08  LAL*1990-91  ORL*2008-09  MIA*2004-05
SAS*2006-07  HOU*1993-94  HOU*2008-09  MIA*1998-99
CLE*2009-10  CHH*1994-95  CLE*2009-10  HOU*2004-05
HOU*2008-09  LAL*1986-87  ATL2010-11   SAS*2006-07
SAS2010-11   SEA*1982-83  IND*1999-00  CLE*1992-93
ORL*2009-10  LAL*2007-08  SAS*2009-10  SAS*1992-93
DAL*2009-10  HOU*1996-97  ORL*2007-08  SAS*1999-00
CLE*2008-09  CHI*1988-89  CLE*2008-09  SEA*1982-83


Let me know what you think! Also any tips on network analysis software (preferably in R) would be helpful for visualizations of the data, a la David Sparks here. I'll probably just have to fool around with the sna package until something works.
Crow



Joined: 20 Jan 2009
Posts: 773


Posted: Thu Dec 30, 2010 11:42 am
What do you think about also including other team-level data in the similarity system, such as:

1) performance against top 10, all playoff level teams or top 8 in your conference

2) maybe split to against top offenses and defenses as well (I guess you could go to factor level if you really wanted to push it)

3) Home / road W-L

4) Close and blowout games

5) Overall team performance consistency as measured by point differential or efficiency differential.

6) Consistency of the biggest boxscore stats or all of them in addition to the average performances.

?


I'd think that would take the team similarity question even further, and probably very usefully.

(Theoretically a similar extension of most or all of this could help deepen player similarity studies.)
acollard



Joined: 22 Sep 2010
Posts: 49
Location: MA

Posted: Thu Dec 30, 2010 11:50 am
I think consistency would be something interesting to look at.

However, I envision the performance aspects (close vs. blowout, top 10 team performance, etc.) as more outputs than inputs. You see, you play a certain way, you defend a certain way, with certain tendencies, and these lead to better performance, so I'm hoping to capture that in my similarity scores.

Ideally, when I do finish with the network analysis, I'll get clusters of particular teams together, and visually, using color-coding, I want it to become clear in what areas the best teams excel. Does that make sense?
Crow



Joined: 20 Jan 2009
Posts: 773


Posted: Thu Dec 30, 2010 11:59 am
I can understand your perspective, for describing regular season data, at season level.

I was mainly thinking in terms of using the similarity dataset to project playoff performance. In that context the additional team data I suggested adding is "input" to me.
Crow



Joined: 20 Jan 2009
Posts: 773


Posted: Fri Dec 31, 2010 2:40 pm
If I counted right (moving quickly, I might have miscounted something somewhere), Miami has 3 title-winning comps in its set of 40 comps, while the Mavs have 4.

The Celtics have 4, Magic 4, Utah 0, Atlanta 4, Chicago 1, Denver 0, Lakers 7, Hornets 1, OKC 0, Spurs 4.

Atlanta surprises me some being that high.


Require at least 3 title-winning team comps and at least 2 current top-30 players and, depending on which player metric you use, the list shrinks to as few as 4 contenders who meet both qualifications. But switch the player metric and the 3 other teams highest on the title-winning comp criterion can rejoin those 4, making it 7 main contenders again.
acollard



Joined: 22 Sep 2010
Posts: 49
Location: MA

Posted: Tue Jan 25, 2011 12:42 pm
It's taken a while, but I've used the team comparisons to make a network diagram of team seasons, and it's actually pretty informative. I'll be elaborating on methods, etc. tonight when I have more time to use updated numbers and clean up the graphs, but here's a peek for interest and hopefully for some feedback.

A caveat: these comparisons use numbers that are about 2 weeks old, but here's a look at which teams are nearest to the 2011 Heat.

Gold = Champ, Green = Runner-up, Red = 2011 team, Blue = Playoff, Beige = Lottery. Size is proportional to W-L%.

Spotlight Heat '11


Lots of champions nearby. Bodes well. But again, this doesn't take into account the most recent rough patch.




Spotlight OKC '11


Not a great group to be paired with, a lot of teams didn't even make the playoffs! This is further evidence that OKC has slipped this year, despite a good record.



Link to preliminary graph (zoom for clearer picture). Still need to work on overlapping labels. https://docs.google.com/viewer?a=v&pid= ... y=CNSi4NMH


Also, credit goes to David Sparks for a lot of the work he did with player comparisons. Despite being over two years old, http://arbitrarian.wordpress.com/ is still more innovative than anything else I've come across.
Crow



Joined: 20 Jan 2009
Posts: 773


Posted: Tue Jan 25, 2011 1:23 pm
Thanks for the image of the full cloud. But even at full magnification I can't read some of the labels. Maybe that says more about my eyes but what about the possibility of max blowups of the 4 quadrants (overlapping a bit)? What is the current team in red to the NW of the Heat on their local map?



Are all the included stats being given equal weight? I am not sure, but I have that impression. If they are, I'd really suggest a weighting based on relative importance to the scoreboard outcome. That would change the cloud in many ways, the net result of which would be interesting to see. Lack of "impact weighting" is the main issue I have with some player similarity systems. Neil's 4 factor team study was reasonably weighted.

I would think the other performance elements I noted above would also impact the cloud significantly and appropriately, in general and for a team like the Thunder. How to weight them into the mix would be subjective, but start with modest weights and you might be able to improve upon that through trial and error or more sophisticated means. What is improvement? Wouldn't it be the champs and runners-up at the extremes and / or closer together (by computed average distances)?
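
One way to put a number on "closer together (by computed average distances)" is sketched below in Python. It is only an illustration of that criterion under assumptions: each team is a tuple of already-scaled stats (or 2-D layout coordinates), distance is Euclidean, and the function names are hypothetical. A ratio below 1 means the champs and runners-up sit closer to each other than teams do on average, so a weighting scheme that lowers the ratio would count as an improvement in this sense.

```python
from itertools import combinations
from math import dist


def mean_pairwise_distance(points):
    """Average Euclidean distance over every unordered pair of points."""
    pairs = list(combinations(points, 2))
    if not pairs:
        raise ValueError("need at least two points")
    return sum(dist(a, b) for a, b in pairs) / len(pairs)


def clustering_ratio(vectors, is_finalist):
    """Mean distance among finalists divided by mean distance among all teams."""
    finalists = [v for v, flag in zip(vectors, is_finalist) if flag]
    return mean_pairwise_distance(finalists) / mean_pairwise_distance(vectors)
```

Running `clustering_ratio` once per candidate weighting, on the same set of teams, gives a simple way to compare weightings by trial and error.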
acollard



Joined: 22 Sep 2010
Posts: 49
Location: MA

Posted: Tue Jan 25, 2011 2:05 pm
Crow, I agree weighting would be better, but I find it hard to figure out how to assign weights, and I'm very averse to weighting arbitrarily. I basically went with the logic that the more important each measurable is, the more often it will appear in BB-Reference's database. I'm not sure whether that is true.

So, take blocks for instance: they are only reported once in the team stats. But FG% is also used in computing TS% and eFG%. In effect, I guess, FG% is weighted maybe 2x-3x as much as BLK%.

I think one of the best proofs of the method is the clustering of teams like Utah, even with very different players, but still under Coach Sloan and a very consistent system.

As for champs and runners-up clustering together, hopefully that will happen to a degree, but that isn't my end goal. I'm glad there are different clusters with different types of champs. The CHI teams cluster with some dominant Lakers teams, the defensive teams like SAS and the Bad Boy Pistons group together, and then outliers like the lockout Knicks team and the '10 Celts (majorly backed into the playoffs) are pretty far away.

The team to the NW of MIA is ORL'11. I don't know why the zooming doesn't go further, it should be vector based. I'll fix it before I post the final product tonight.

Thanks for the advice!
Crow



Joined: 20 Jan 2009
Posts: 773


Posted: Tue Jan 25, 2011 3:50 pm
Thanks for the label identification.

Another option might be a cloud with just teams over 40-45 wins. It would be less crowded, still satisfy the most common interests and the labels could be bigger.
acollard



Joined: 22 Sep 2010
Posts: 49
Location: MA

Posted: Tue Jan 25, 2011 4:25 pm
I was thinking that, or cut off at 1990 instead of 1980. For full zooming capabilities you can also download the .pdf instead of using google.
EvanZ



Joined: 22 Nov 2010
Posts: 205


Posted: Tue Jan 25, 2011 5:43 pm
Have you tried this using the four factors on offense and defense? You could use the weights I derived here:

Code:

Factor Weight
Shooting 54%
Turnovers 22%
Rebounding 15%
Foul Rate 10%

_________________
http://www.thecity2.com
http://www.ibb.gatech.edu/evan-zamir
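
A minimal Python sketch of how the Four Factor weights listed above could be plugged into a team distance. The weighted Euclidean form, the dictionary keys and the function name are assumptions for illustration, not something taken from the linked derivation. The listed percentages sum to 101 (presumably rounding), so the sketch normalizes them, and it expects each factor to be pre-scaled (ranks or z-scores) so that the weights, not the raw units, set the relative importance.

```python
from math import sqrt

# Weights as posted, in percent; they are normalized inside the function.
WEIGHTS = {"shooting": 54, "turnovers": 22, "rebounding": 15, "foul_rate": 10}


def weighted_distance(a, b, weights=WEIGHTS):
    """Weighted Euclidean distance between two teams' scaled Four Factors."""
    total = sum(weights.values())
    return sqrt(sum((w / total) * (a[k] - b[k]) ** 2 for k, w in weights.items()))
```

To cover both offense and defense, the same function could be called with eight keys (four for, four against) and a matching weights dictionary.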
acollard



Joined: 22 Sep 2010
Posts: 49
Location: MA

Posted: Tue Jan 25, 2011 6:02 pm
I use the Four Factors (for and against) as part of the data, but not exclusively. As I wrote in my article on DailyThunder:

"There have been a couple implementations of NBA statistical similarity used in the past. Neil Paine of Basketball-Reference had an article with NBA team similarity scores based on the Four Factors (eFG%, ORB%, FTR, and TOV%). The article was pretty great, and his methods seem to produce good results comparing the most important parts of NBA success. However, the Four Factors aren’t everything. They capture the areas of efficiency for a team, but they don’t capture much of how the efficiency happens. For example, eFG% is the most important Four Factor, but there are many different ways to achieve a high eFG%. Good 3-pt shooting, post play, or a fast pace are all common methods to generate a high eFG%, but these playing styles are very different. The thinking is, as a season goes on, and as a team enters the playoffs, certain combinations of strengths and weaknesses could perhaps translate better to success, and the Four Factors method may or may not miss this."

I believe Neil used pretty much exactly the process you suggest. I hoped to capture more than that, and I hope I did, in the fact that teams from year to year were so similar to each other. I wanted to distinguish between how teams score, defend, etc.
acollard



Joined: 22 Sep 2010
Posts: 49
Location: MA

Posted: Mon Jan 31, 2011 3:16 pm
Final network diagram and post at http://www.dailythunder.com. Sorry for the huge image.

Crow



Joined: 20 Jan 2009
Posts: 773


Posted: Mon Jan 31, 2011 4:44 pm
Following your earlier suggestion, I downloaded the .pdf file; it gives the maximum enlargement and I can read the labels clearly that way.

NBA champions are coded yellow, non-playoff teams are tan / orange. A different color choice for one would make it easier for the champs to stand out from the also-rans. Maybe make non-playoff teams white / clear?


Third closest to the '11 Magic is the champion '06 Heat with Shaq. Second closest to the '11 Lakers is the champion '07 Spurs. The '06 and '07 champs didn't get a chance to face off, but maybe their similars might.

Closest to these two among '11 teams are the Heat, Spurs and Mavs.

Being higher is clearly better. Being further to the right seems generally better. Any current explanations of what these 2 display dimensions generally represent, if anything consistently?

Checking the team W% and other stats of a team's neighbors within a certain distance, as we have been discussing, would be helpful. But it would probably also be very helpful to split it out by distance along the x and y dimensions alone, to try to understand what happens statistically on average along those axes. It would probably be possible to prepare some sort of table, chart or motion chart with the average rate of change in the discrete stat elements (or some of the main ones as a start) for each distance along either axis. It might also be possible (I'd think) to use some advanced technique to summarize how dozens of stats (dimensions) are being collapsed into these 2. The impact of even simple and crude weights (with small, modest or large variation in scale) on the stats would also be interesting to check.
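
A rough Python sketch of the per-axis check suggested above: correlate each stat with the x and y layout coordinates to see which stats, if any, change consistently along an axis. Pearson correlation is a swapped-in simple stand-in for the "table of average rate of change"; the function names and input shapes are hypothetical, and the result only means something for one fixed layout.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def axis_profile(coords, stats):
    """For each stat, return (correlation with x, correlation with y).

    coords: list of (x, y) layout positions, one per team.
    stats:  {stat_name: [value per team, in the same order as coords]}.
    """
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return {name: (pearson(xs, vals), pearson(ys, vals)) for name, vals in stats.items()}
```

Stats with correlations near zero on both axes are not being expressed by the layout's orientation; stats with large values on one axis hint at what that direction represents.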

The graph is a handy summary but it can also be a starting point for more study.
huevonkiller



Joined: 25 May 2010
Posts: 11
Location: Miami, Fl

Posted: Tue Feb 01, 2011 7:53 am
A good looking graph and pretty interesting. I think the premise could be very helpful.


acollard



Joined: 22 Sep 2010
Posts: 56
Location: MA

Posted: Tue Feb 01, 2011 11:46 am
Crow>

The problem with white is that it blends into the background without outlines, and outlines ruin the readability of an already noisy graph. Perhaps pink would have been better, but I figured that due to spacing/size, championship teams and non-playoff teams wouldn't get confused too much.

Orientation is not factored in at all. The way the algorithm works is that it starts the teams in a big circle, then iteratively moves them to "normalize the size between the edges", which the authors of the algorithm argue makes it cluster the points yet space them so they are easier to read in general. So basically, if you run the algorithm again, it gives essentially the same spacing but sometimes very different orientations. I ran it a few times until the best teams were generally up, because I thought that was clearest.
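
Since only the spacing carries information, a layout can be rotated freely without changing any team-to-team distance. The Python sketch below shows that, plus a deterministic alternative to re-running the layout until the best teams land on top: rotate the finished layout so the direction in which W-L% increases points up. This is not the procedure used for the posted graph; the function names and inputs are hypothetical.

```python
from math import atan2, cos, dist, pi, sin


def rotate(points, angle):
    """Rotate 2-D points counter-clockwise about the origin by angle (radians)."""
    c, s = cos(angle), sin(angle)
    return [(x * c - y * s, x * s + y * c) for x, y in points]


def pairwise(points):
    """All pairwise distances, in a fixed order, for before/after comparison."""
    return [dist(a, b) for i, a in enumerate(points) for b in points[i + 1:]]


def orient_best_up(points, win_pct):
    """Center the layout, then rotate it so higher win% tends toward +y."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    mean_w = sum(win_pct) / n
    # Direction across the layout along which win% increases.
    dx = sum((w - mean_w) * (x - cx) for (x, _), w in zip(points, win_pct))
    dy = sum((w - mean_w) * (y - cy) for (_, y), w in zip(points, win_pct))
    angle = pi / 2 - atan2(dy, dx)  # turn that direction onto the +y axis
    centered = [(x - cx, y - cy) for x, y in points]
    return rotate(centered, angle)
```

Comparing `pairwise(points)` with `pairwise(orient_best_up(points, win_pct))` confirms that the rotation leaves the spacing untouched.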
Crow



Joined: 20 Jan 2009
Posts: 807


Posted: Tue Feb 01, 2011 1:02 pm
If white and outlines for the lottery team dots are not acceptable, then pink or whatever other color for the also-rans would be fine, of course. I can pick out yellow from tan / orange at full magnification and focus, but it is hard at anything less.


Thanks for the orientation response. I recognize that my spatial interpretation of the chart might be too strong or off but I guess I wanted to ask about it to learn more. I'd probably still want to see the spatial data suggested above first before I conceded this point entirely.