
Positions, Dimensions, and Factor Analysis (Ed Küpfer. 2005)

Posted: Fri Apr 15, 2011 10:05 am
by Crow
Ed Küpfer



Joined: 30 Dec 2004
Posts: 785
Location: Toronto

Posted: Wed Nov 09, 2005 5:45 pm Post subject: Positions, Dimensions, and Factor Analysis
(Note: I was in hospital for a little while, held incommunicado. Without a computer, if you can believe it! If I owe you an email, just hang on — I'm still catching up on my correspondence.)

Whenever I hear the words "true point guard" or "true center," my teeth start to gnash, my vision clouds over, and my girlfriend leans over to ask me why I've been muttering "platonic twaddle" over and over. Traditional positions represent what I think of as top-down one-dimensional categories. Top-down because we think of an ideal for each position, and we try to stuff each player into the box he most resembles. One-dimensional because these ideals range across a spectrum from PG to C, and although sometimes we have more positions than the traditional five, the categories still represent a ladder. I am here to propose a different kind of positional categorisation.

First of all, the top-down approach: there's no real need to stick with this. The game, its players and strategy, have evolved so much over time that traditional positions don't mean the same things as they did in the past. Does true centerness represent an ideal to which all players should aspire? Is there any evidence that, say, Steve Francis would be any more effective if he became more of a true point guard? Maybe, but I want to get away from that. I want to start focusing on what players actually do, instead of what we think they should do. A bottom-up approach to positional categories would look at how players play their positions and try to find the natural categories from within, rather than impose the positions from without.

Nor is there any particular reason to think of positions only as lying along a spectrum roughly correlated to the size of the players. Positional categories can be on any number of dimensions. For example, height along one axis and possession usage along another. Or, an axis for passing ability, for scoring ability, and for defensive ability. The possibilities are literally endless, which makes this a potentially daunting task. Luckily, there are some statistical tools built for this kind of mess.

FACTOR ANALYSIS

I don't want to get too technical here, mostly because I am not at all confident that I will get the details right. Factor analysis is a technique used to collapse a large set of variables into a smaller set by running axes through the data points in the directions that best explain the variance. I hope that made sense, because that's the best explanation I can give. If you want more, try here. I used this page to get the technique working in Minitab.

What I wanted to do is construct a bottom-up two-dimensional positional "map." I settled on two dimensions because they can be displayed graphically. More dimensions would probably yield more useful information, but people can easily grasp the idea of a positional map. Factor analysis will give me two axes, one to be plotted along the X axis and the other along the Y. The similarity between two players is reflected in their geometric distance from each other on the map (the closer, the more similar), something which can be readily eye-balled.[*]
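A minimal sketch of the distance idea, assuming each player has already been reduced to a pair of factor scores. The player names and coordinates below are made up for illustration and are not results from the analysis.

```python
import math

# Hypothetical (Factor 1, Factor 2) coordinates; not real results.
positions = {
    "Player A": (-1.2, 0.4),
    "Player B": (-1.0, 0.1),
    "Player C": (1.5, -0.8),
}

def map_distance(p, q):
    """Euclidean distance between two points on the positional map."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def most_similar(name, positions):
    """Return the other player who sits closest to `name` on the map."""
    others = (n for n in positions if n != name)
    return min(others, key=lambda n: map_distance(positions[name], positions[n]))

print(most_similar("Player A", positions))
```

A smaller distance means the two players are more alike on the two dimensions the map captures, and says nothing about dimensions the map leaves out.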

The next step was to decide on which stats to include. There is nothing objective about the way I went about this: I wanted to include enough stats so that meaningful information could be returned, but I wanted the stats to be easily found so as to cut down on the amount of work that had to be done. I settled on the stats on the 82games.com "By Position" pages, like this one. By including identical offensive and defensive variables, the page covers a wide variety of stats. I also wanted to include a variable that isn't on that page, but seemed vitally important when categorising a player: minutes per game.

On the other hand, I didn't want to include stats that "fixed" the results; that is, I didn't want to hand the analysis the answer. When I first tried this out, I included Height and Weight as variables. But these stats are merely another way of categorising players along traditional lines, which is what I wanted to get away from in the first place. I want to see if the regular boxscore-type stats would categorise players meaningfully without reference to anything else. For that same reason, I removed iFG as a variable in my analysis. I also did not include PER, because that stat is essentially a summary of all the other stats, so it would be redundant to include it.

In summary, here are the stats I included (for every stat except the first I used both Offensive and Defensive, prefixed with "o" and "d" respectively):

Code:
MPG
FGA/48
eFG%
FTA/48
Reb/48
Ast/48
T/O/48
Blk/48
PF/48
Pts/48



All of those, save MPG, can be found on the 82games.com player "By Position" pages.

ANALYSIS

I assembled my data, which included every player-team-season combination from the 02-03 season to the 04-05 season (n = 1360). I standardised the stats by subtracting the season-specific mean from each observation, and dividing the difference by the season-specific standard deviation. (Factor analysis will return weird, but usable, results if you don't standardise, but what the hell. The means and standard deviations I used can be found in the appendix to this post.)
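The standardisation step can be sketched as below. The MPG means and standard deviations are copied from the stats appendix of this post; the function name and the 36.0 MPG player are hypothetical, chosen only to show the arithmetic.

```python
# Season-specific (mean, standard deviation) for MPG, copied from the
# stats appendix of this post.
MPG_STATS = {
    2003: (21.671, 10.104),
    2004: (22.094, 9.772),
    2005: (21.595, 9.845),
}

def standardise(value, season, stats=MPG_STATS):
    """Z-score: subtract the season mean, then divide by the season SD."""
    mean, sd = stats[season]
    return (value - mean) / sd

# Hypothetical player averaging 36.0 MPG in each of the three seasons.
for season in sorted(MPG_STATS):
    print(season, round(standardise(36.0, season), 3))
```

Because each season has its own mean and spread, the same raw 36.0 MPG maps to a slightly different standardised value in each year.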

The way factor analysis works, it returns a bunch of numbers with funky names like "eigenvalues" and "loadings" and "communalities." But the ones I want are the "factor score coefficients." Here's what I got:

Code:

Factor Score Coefficients

Variable   Factor1   Factor2
MPG          0.015    -0.225
oFGA         0.002    -0.223
oeFG         0.049    -0.144
oFTA         0.095    -0.225
oReb         0.159    -0.039
oAst        -0.130    -0.032
oT/O        -0.017    -0.101
oBlk         0.135    -0.020
oPF          0.078     0.131
oPts         0.041    -0.283
dFGA        -0.084    -0.039
deFG         0.015     0.019
dFTA         0.042     0.091
dReb         0.159     0.003
dAst        -0.153     0.012
dT/O        -0.074     0.063
dBlk         0.151     0.013
dPF          0.143    -0.079
dPts        -0.049     0.016



(One of the problems with factor analysis is that it doesn't return p-values for the factor coefficients. Some of those coefficients look awful small. However, I will employ the age-old solution to such technical statistical problems: I'll ignore it.)

[Technical Note: I used Minitab to get those factors. I used the Principal Components method of extraction with Varimax rotation to extract 2 factors. I would appreciate anyone who knows more about factor analysis letting me know if other options would be more appropriate.]
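For readers without Minitab, here is a rough sketch of the same kind of procedure in NumPy: principal-component extraction from the correlation matrix, followed by a varimax rotation. It is not a reproduction of Minitab's routine. The data are synthetic, the function names are my own, and the output is a loadings matrix rather than the factor score coefficients shown above.

```python
import numpy as np

def pca_loadings(z, n_factors):
    """Unrotated principal-component loadings from standardised data z."""
    corr = np.corrcoef(z, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)         # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_factors]   # keep the largest ones
    return eigvecs[:, order] * np.sqrt(eigvals[order])

def varimax(loadings, max_iter=100, tol=1e-8):
    """Orthogonal varimax rotation; returns (rotated loadings, rotation)."""
    p, k = loadings.shape
    rotation = np.eye(k)
    criterion = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        target = rotated**3 - rotated @ np.diag((rotated**2).sum(axis=0)) / p
        u, s, vt = np.linalg.svd(loadings.T @ target)
        rotation = u @ vt
        new_criterion = s.sum()
        if new_criterion < criterion * (1 + tol):
            break  # the criterion has stopped improving
        criterion = new_criterion
    return loadings @ rotation, rotation

# Synthetic stand-in data: 200 "players" by 6 "stats" driven by 2 hidden traits.
rng = np.random.default_rng(0)
traits = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 6))
raw = traits @ mixing + 0.5 * rng.normal(size=(200, 6))
z = (raw - raw.mean(axis=0)) / raw.std(axis=0)

unrotated = pca_loadings(z, n_factors=2)
rotated, rotation = varimax(unrotated)
print(np.round(rotated, 3))
```

The rotation is orthogonal, so it changes how the explained variance is shared between the two factors without changing how much of each variable the pair explains together.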

Although it doesn't look like it, Factor1 and Factor2 will be used to construct the x- and y-axes of a positional map.

ENOUGH OF THAT STUFF, WHADDYA GOT FOR ME?

Take a look at this chart:
Image




That is a sample of 30 players. The stats used were averages of their stats over the 02-03 to 04-05 seasons, weighted by minutes played. Each player's stats were multiplied by the coefficients from above, and plotted on the x-axis (Factor 1) and the y-axis (Factor 2). Are similar players grouped together? Does the map make sense to you?
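The arithmetic behind a single point on the chart can be sketched like this. The coefficients are copied from the Factor Score Coefficients table above. The helper names and the example player are hypothetical, and I am assuming the minutes weighting is applied to already-standardised stats, which the description does not spell out.

```python
# Factor score coefficients (Factor1, Factor2), copied from the table above.
COEFFS = {
    "MPG": (0.015, -0.225), "oFGA": (0.002, -0.223), "oeFG": (0.049, -0.144),
    "oFTA": (0.095, -0.225), "oReb": (0.159, -0.039), "oAst": (-0.130, -0.032),
    "oT/O": (-0.017, -0.101), "oBlk": (0.135, -0.020), "oPF": (0.078, 0.131),
    "oPts": (0.041, -0.283), "dFGA": (-0.084, -0.039), "deFG": (0.015, 0.019),
    "dFTA": (0.042, 0.091), "dReb": (0.159, 0.003), "dAst": (-0.153, 0.012),
    "dT/O": (-0.074, 0.063), "dBlk": (0.151, 0.013), "dPF": (0.143, -0.079),
    "dPts": (-0.049, 0.016),
}

def weighted_average(season_rows):
    """Minutes-weighted average of standardised stats across seasons.

    season_rows is a list of (minutes, {stat: z_score}) pairs.
    """
    total = sum(minutes for minutes, _ in season_rows)
    return {
        stat: sum(minutes * z[stat] for minutes, z in season_rows) / total
        for stat in COEFFS
    }

def map_position(z_scores):
    """Weighted sum of standardised stats for each factor: (x, y)."""
    x = sum(COEFFS[s][0] * z_scores[s] for s in COEFFS)
    y = sum(COEFFS[s][1] * z_scores[s] for s in COEFFS)
    return x, y

# Hypothetical player: league average in everything for 1000 minutes, then
# one SD above average in oReb and dReb for 2000 minutes.
average = {s: 0.0 for s in COEFFS}
rebounder = dict(average, oReb=1.0, dReb=1.0)
x, y = map_position(weighted_average([(2000, rebounder), (1000, average)]))
print(round(x, 3), round(y, 3))
```

A player who is exactly league average in every stat lands at the origin, and extra rebounding moves him along Factor 1 much more than along Factor 2.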

There are some things in there that stick out to me. The first is that the horizontal x-axis displays, very roughly, the players as a function of their size/position. On the left we have the PGs, Nash and Childs and Brunson. There are exceptions, of course: Factor Analysis returns numbers along an abstract axis, one that doesn't exist. On this invented axis, players go left-to-right from Rigaudeau --> Brevin --> Lue --> Cassell on one end to Boozer --> Mourning --> Fortson --> Shaq on the other end.

Another thing that sticks out is that the vertical axis corresponds very roughly to how many touches or minutes a player got. From bottom to top we have Eschmeyer --> Grant Long --> Omar Cook --> Paul Shirley to Carmelo --> Lebron --> Duncan --> Kobe --> Iverson at the other end.

I want to emphasise that the players' placement on the map is derived solely from their stats, and not from any measures of position. I think the fact that we can categorise players into positions using only their boxscore stats like shooting accuracy and rebounds and minutes played is pretty cool.

My idea is this: find clusters of observations on the chart. These would represent positional "islands," a different way of categorising positions. You won't be able to do that from the chart above, as I chose the players in that sample purposely to represent as many different areas on the positional map as I could while being spread apart. Here's what the total distribution of positional factors looks like:

Image

Because the stats were standardised, they are almost all gathered within two standard deviations of x=0, y=0 (the tick marks each represent a single standard deviation). Positional islands are hard to see in that chart, but when the players' names are placed on the chart the clusters are a little clearer. I'll show some charts like that in later posts.
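One way to look for islands without eye-balling is a clustering routine. I haven't settled on a method, so the sketch below swaps in plain k-means purely as an illustration. The two synthetic "islands", the number of clusters, and the function names are all assumptions, not part of the analysis above.

```python
import numpy as np

def farthest_point_init(points, k):
    """Deterministic start: take the first point, then the farthest ones."""
    centres = [points[0]]
    while len(centres) < k:
        nearest = np.min(
            [np.linalg.norm(points - c, axis=1) for c in centres], axis=0
        )
        centres.append(points[nearest.argmax()])
    return np.array(centres)

def kmeans(points, k, max_iter=100):
    """Plain Lloyd's algorithm; returns (labels, centres)."""
    centres = farthest_point_init(points, k)
    for _ in range(max_iter):
        # Distance from every point to every centre, then nearest centre.
        dists = np.linalg.norm(points[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centres = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centres[j]
            for j in range(k)
        ])
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return labels, centres

# Two synthetic, well-separated islands of (Factor 1, Factor 2) scores.
rng = np.random.default_rng(1)
island_a = rng.normal(loc=(-1.5, 0.0), scale=0.1, size=(50, 2))
island_b = rng.normal(loc=(1.5, 0.5), scale=0.1, size=(50, 2))
points = np.vstack([island_a, island_b])

labels, centres = kmeans(points, k=2)
print(np.round(centres, 2))
```

K-means needs the number of islands up front, and the real factor scores form one dense cloud around the origin, so any islands there would be far less clear-cut than in this toy data.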

SUMMARY

In this post I want to put forward the idea that a) positions should be designated not by what we think players should be doing, but by how they actually play, and b) that positions are more usefully categorised along 2 dimensions rather than the traditional single dimension. I put forward a method of collapsing a bunch of offensive and defensive stats into a more manageable form. I believe a graphical presentation of these 2-dimensional positional categories is both useful and intuitive.

I'll be adding more thoughts to this thread in future posts.

STATS APPENDIX

Descriptive stats of the variables used in the factor analysis:

Code:
Variable  YEAR     Mean    StDev       Q1   Median       Q3
MPG       2003   21.671   10.104   13.044   20.804   29.798
          2004   22.094    9.772   13.364   20.797   30.583
          2005   21.595    9.845   13.280   20.492   29.768

oFGA      2003   15.198    4.231   12.021   15.029   17.900
          2004   14.923    4.176   11.949   14.710   18.113
          2005   14.804    4.292   11.781   14.406   17.844

oeFG      2003  0.45299  0.06565  0.42700  0.46017  0.49478
          2004  0.45728  0.06003  0.42707  0.46483  0.49318
          2005  0.46268  0.06966  0.43200  0.46944  0.50363

oFTA      2003    4.389    2.246    2.674    4.055    5.511
          2004   4.3304   2.1167   2.7605   4.0957   5.6414
          2005    4.569    2.446    2.900    4.076    6.017

oReb      2003    8.485    3.603    5.552    8.009   11.109
          2004    8.462    3.582    5.432    8.002   11.200
          2005    8.329    3.567    5.300    7.795   11.000

oAst      2003    3.946    2.626    1.992    3.135    5.374
          2004    3.841    2.581    1.904    2.947    5.519
          2005    3.786    2.652    1.900    2.987    5.029

oT/O      2003   2.8723   1.0980   2.1000   2.8000   3.4000
          2004   2.7544   0.8982   2.1165   2.6623   3.3723
          2005   2.7002   0.9688   2.0000   2.7000   3.3000

oBlk      2003   1.0032   1.0217   0.3000   0.6333   1.3712
          2004   1.0045   1.0299   0.3000   0.6372   1.4000
          2005   0.9988   1.0720   0.2975   0.6000   1.3467

oPF       2003   4.9274   1.8071   3.7012   4.6889   5.8985
          2004   4.8508   1.8069   3.5914   4.5216   5.9637
          2005   5.2288   2.1100   3.8073   4.8500   6.1000

oPts      2003   17.012    5.526   13.300   16.450   20.104
          2004   16.884    5.226   13.421   16.510   20.138
          2005   17.090    5.684   13.173   16.640   20.607

dFGA      2003   16.094    2.299   14.600   16.181   17.331
          2004   15.718    2.049   14.400   15.762   17.073
          2005   15.723    2.111   14.586   15.900   17.015

deFG      2003  0.47526  0.05265  0.45026  0.47300  0.49903
          2004  0.47515  0.05139  0.44989  0.47654  0.50101
          2005  0.48352  0.05527  0.45546  0.48163  0.50619

dFTA      2003   5.3071   1.5699   4.3439   5.0000   5.9192
          2004   5.0467   1.4408   4.2654   4.9060   5.6630
          2005   5.5619   1.5782   4.6305   5.3038   6.1420

dReb      2003    8.696    3.090    5.947    8.253   11.490
          2004    8.665    3.288    5.715    8.002   11.688
          2005    8.636    3.299    5.644    7.838   11.700

dAst      2003    4.209    2.094    2.650    3.552    5.373
          2004    4.113    2.137    2.538    3.428    5.179
          2005    4.047    2.268    2.400    3.363    5.054

dT/O      2003   2.9217   0.7318   2.5000   2.8330   3.2659
          2004   2.8844   0.7261   2.4709   2.8000   3.2000
          2005   2.7856   0.7340   2.3593   2.7192   3.1000

dBlk      2003   1.1062   0.8846   0.4050   0.7897   1.6820
          2004   1.0694   0.8503   0.3917   0.8000   1.6925
          2005   1.0748   0.8680   0.4000   0.7614   1.6701

dPF       2003   4.3433   1.2582   3.4633   4.2000   5.0996
          2004   4.2424   1.1671   3.4225   4.1023   5.0608
          2005   4.4649   1.2531   3.6777   4.2523   5.2000

dPts      2003   19.261    2.935   17.459   19.062   20.684
          2004   18.688    2.670   17.048   18.637   20.196
          2005   19.371    2.923   17.600   19.351   20.900



[* A Note On Player Similarities.

This is a tough thing to wrap your head around, much less to get the data to return useful answers. I don't want to knock anyone who has done work in this area. From personal experience, I know there isn't a single right answer, and just the fact that you got an answer is reason enough to celebrate.

But.

I think you're all going about it wrong. (Howzat for chutzpah?) One can categorise all stats into two types: successes and opportunities. FG% measures a success rate. MPG measures an opportunity rate. FT% measures a success rate. FTA/g measures an opportunity rate. Every method of similarity I've seen mixes the two types, successes and opportunities. I think this is wrong. I believe that a player who shoots 12 FGA per game is very similar to another player shooting 12 FGA per game, regardless of the difference in their respective success rates.

That's not quite true. What I mean is, shooting 12 FGA per game is a kind of similarity, a meaningful similarity untracked by the similarity methods I've seen. Another kind of similarity, of course, is the difference in the two players' FG%.

The two types of similarity should be noted, accounted for somehow. My factor analysis approach can be used for this purpose, although the method above isn't quite right for that. But it can be used.

I know there's a lot of people out there working with similarity, trying to develop good and meaningful methods. I want you to think about a multi-dimensional approach. I really think that's the way to go.]
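To make the two-kinds-of-similarity idea concrete, here is a small sketch that reports an opportunity distance and a success distance separately instead of blending them into one score. The grouping of stats follows the examples in the note above. The function names and the standardised values for the two players are hypothetical.

```python
import math

# Split follows the note: opportunity stats say how much a player does,
# success stats say how well he does it.
OPPORTUNITY = ("MPG", "FGA/g", "FTA/g")
SUCCESS = ("FG%", "FT%")

def distance(a, b, stats):
    """Euclidean distance between two players over the given stats."""
    return math.sqrt(sum((a[s] - b[s]) ** 2 for s in stats))

def similarity_profile(a, b):
    """Keep the two kinds of similarity apart rather than mixing them."""
    return {
        "opportunity": distance(a, b, OPPORTUNITY),
        "success": distance(a, b, SUCCESS),
    }

# Hypothetical standardised stats: same role, different efficiency.
player_1 = {"MPG": 0.8, "FGA/g": 0.5, "FTA/g": 0.2, "FG%": 1.1, "FT%": 0.4}
player_2 = {"MPG": 0.8, "FGA/g": 0.5, "FTA/g": 0.2, "FG%": -0.9, "FT%": -0.3}

print(similarity_profile(player_1, player_2))
```

Here the two players are identical in opportunity and far apart in success, which a single blended similarity score would hide.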

[edited for formatting, 6:44 PM 05-11-09]
_________________
ed

Last edited by Ed Küpfer on Wed Nov 09, 2005 10:28 pm; edited 2 times in total
Eli W



Joined: 01 Feb 2005
Posts: 402


Posted: Wed Nov 09, 2005 6:03 pm
Very interesting stuff.

Regarding the note on similarity scores, that's something I've been thinking about for a while. I was planning on building a similarity system solely based on things like usage rate, FTA/FGA, 3PA/FGA, etc. Under such a system if two players rated as very similar it would suggest they were similar types of players but not necessarily similar quality players.
HoopStudies



Joined: 30 Dec 2004
Posts: 705
Location: Near Philadelphia, PA

Posted: Wed Nov 09, 2005 6:57 pm
Ed --

Glad you finally got a chance to sit down and do this. I think this is something that came up years ago in apbr_analysis, we talked about it, said it's interesting, then went on with the more important things in life, like getting a seat for the Cal-USC football game. Or whatever the analogous thing is in Canada.

One of the things I've thought of as I've thought about this approach is that the stats are missing certain relevant things -- the defensive side. We've talked about how positions often reflect on defense more than offense. Yet we're missing a lot of relevant defensive stats. So if we had, say, fto, dfgm, etc., how would that move guys around? I'd say it would change things around a fair amount. I actually have seen clustering with analysis of detailed defensive data.

Further, a more minor point, what would happen if you did throw in height into your analysis? Does that start changing alignments a little?

DeanO
_________________
Dean Oliver
Author, Basketball on Paper
The postings are my own & don't necess represent positions, strategies or opinions of employers.
Ed Küpfer



Joined: 30 Dec 2004
Posts: 785
Location: Toronto

Posted: Wed Nov 09, 2005 9:30 pm
HoopStudies wrote:
we talked about it, said it's interesting, then went on with the more important things in life, like getting a seat for the Cal-USC football game. Or whatever the analogous thing is in Canada.


That would be the much anticipated showdown between me and the bar staff, as I attempt to secure one single TV set on which I can catch the Raps game. I usually end up huddled beside a stack of empties, watching it on a 30-year-old black-and-white security monitor with enormous rabbit ears and no sound, while the rest of the bar patrons get the hockey game on the big screen projection sets.

Quote:
So if we had, say, fto, dfgm, etc., how would that move guys around? I'd say it would change things around a fair amount. I actually have seen clustering with analysis of detailed defensive data.


Yeah, I can't see how that extra data would hurt. Give me a chance to play around with Roland's numbers, though. It may just be that there's enough there to get a real good look.

Quote:
Further, a more minor point, what would happen if you did throw in height into your analysis? Does that start changing alignments a little?


Very little. Let's look at the team you know best, and the one I know best. Here they are without the Height and Weight factors:

Image

And here they are with HT and WT:

Image

It's hard to eye-ball the difference. Here I've isolated the 10 players with the greatest difference once HT and WT factors were added. The large point represents the point with HT and WT:

Image

It's interesting that the y-axis is nearly unaffected, but beyond that the differences are negligible — we're talking less than half a standard deviation.
_________________
ed

Last edited by Ed Küpfer on Wed Nov 09, 2005 10:31 pm; edited 1 time in total
Kevin Pelton
Site Admin


Joined: 30 Dec 2004
Posts: 978
Location: Seattle

Posted: Wed Nov 09, 2005 10:20 pm
Ed, do you think you could post those graphs as thumbnails that link to the larger version? The post becomes a bit hard to follow because the page is stretched.
mtamada



Joined: 28 Jan 2005
Posts: 376


Posted: Thu Nov 10, 2005 2:31 am
Hmm, four comments:

Comment 1. I have a different interpretation of your Factor Two: it's not really about usage, it's about *quality*. Granted Iverson at the top is a little bizarre, but most purely statistical studies seem to turn up at least one "laugh test" failure (it's not nearly as funny as Dennis Rodman being at the top). But look at the top 8 players: Iverson, Shaq, Duncan, Pierce, LeBron, Garnett, and Nowitzki. Except for Pierce, and Iverson's odd #1 rating, we've pretty much got the first 8 answers that most people would give to the question: who's the best player in the NBA?

I do agree with your interpretation of your Factor One.


Given those interpretations, these results are almost identical to ones that I got when I did a principal components analysis (a simple version of factor analysis) some 25 years ago. The first (and most important) principal component was one which was very similar to your Factor 2 and was clearly measuring quality: players who did more good things and fewer bad things. In my experience, most Principal Components and Factor Analyses of these sorts of data do the same thing: the first princ. comp., or first factor, is one which simply measures quality (in a rough way that is; I'm certainly not proposing that PC or FA be used as a way to identify the NBA's best player, because neither technique is designed to correlate with winning, player quality, etc.).

The second principal component was almost identical to your Factor 1: a tallness measure, with rebounders and shotblockers at one end, and assisters at the other.

It's possible that the Varimax Rotation that you did might explain why Iverson came out on top; if you do no rotation at all, who comes out on top?


HoopStudies wrote:
we talked about it, said it's interesting, then went on with the more important things in life, like getting a seat for the Cal-USC football game. Or whatever the analogous thing is in Canada.


Comment 2: (getting back to Ed's critique about the similarity measures that are out there): without question, multivariate techniques can be of value here, helping us put reduced weights on measures which might be redundant with each other, and increased weights on measures which may be more important for distinguishing between players. That's why I've been skeptical of the similarity scores that are out there ... I don't see that they've been derived using sophisticated techniques. Also there's the question of "similarity in what ways, or for what purposes?" Are they supposed to measure similarity in terms of player quality level, or similarity in terms of predicting future performance, or what?


Ed Küpfer wrote:
Quote:
So if we had, say, fto, dfgm, etc., how would that move guys around? I'd say it would change things around a fair amount. I actually have seen clustering with analysis of detailed defensive data.


Yeah, I can't see how that extra data would hurt. Give me a chance to play around with Roland's numbers, though. It may just be that there's enough there to get a real good look.


Comment 3, actually it's a question: I'm confused, it looked like Ed's results ALREADY are using defensive stats. If not, what are these results about (cutting and pasting from part of the statistical results):

dFGA -0.084 -0.039
deFG 0.015 0.019
dFTA 0.042 0.091
dReb 0.159 0.003
dAst -0.153 0.012
dT/O -0.074 0.063
dBlk 0.151 0.013
dPF 0.143 -0.079
dPts -0.049 0.016


Comment 4, going back to the interpretations, and to Ed's original question of how to categorize players: if my interpretations are correct, then Ed's two factors don't really enable us to re-design NBA player categories. The "tallness" dimension is highly consistent with the old-fashioned PG-SG-SF-PF-C spectrum. And the "quality" dimension is always implicit whenever we talk about players: we've always had good PGs and bad PGs, good Centers and bad Centers. So those two factors are simply another way of measuring what we've always known and talked about: players have different roles (based largely though not completely on height) and different quality levels.

P.S. When I did my Principal Components analysis, the third principal component, and all other PCs, were undecipherable. I did not attempt any rotations (partly because I've never understood, and thus always been mistrustful of, the methodology behind data rotations). I guess this leads to Comment 5/Question 2: if you picked out a third factor, what does it look like? Can it be given a sensible interpretation? And how important is that third factor -- what eigenvalue does it have, or what does your scree chart look like? My recollection of my results is fuzzy, but I think that the first two principal components were the only really important ones, and the components beyond that had little variation left to explain.
jkubatko



Joined: 05 Jan 2005
Posts: 702
Location: Columbus, OH

Posted: Thu Nov 10, 2005 2:25 pm
Michael's mention of his principal component analysis (PCA) motivated me to do a quick and dirty PCA using data from 2005. I included players who played 41 or more games, and used the following statistics (per game): points, rebounds, assists, steals, and blocks. I got results that were similar to Michael's.

* The first principal component (prin1) was a measure of all around ability, as each variable had a positive loading. The players with the five highest prin1 scores were Allen Iverson, LeBron James, Kevin Garnett, Larry Hughes, and Tracy McGrady. The five lowest prin1 scores belonged to Darvin Ham, Aaron Williams, Mike Wilks, Ryan Bowen, and Ervin Johnson.

* The second principal component (prin2) separated "bigs" from "smalls", as rebounds and blocks had positive loadings, while assists and steals had negative loadings. (The loading for points was close to zero, and extremely small relative to the other loadings.) The five highest prin2 scores belonged to Marcus Camby, Tim Duncan, Ben Wallace, Shaquille O'Neal, and Joel Przybilla. The players with the five lowest prin2 scores were Brevin Knight, Steve Nash, Allen Iverson, Stephon Marbury, and Baron Davis. The first two principal components accounted for 83.56% of the total variability.

* The third principal component (prin3) only accounted for 6.87% of the total variability, but since it was somewhat interpretable I looked at it as well. Prin3 gave negative loadings to points and rebounds, and positive loadings to assists, steals, and blocks. This separates players whose value comes mainly from points and/or rebounds from players whose value comes from other statistics. The five highest prin3 scores belonged to Andrei Kirilenko, Marcus Camby, Brevin Knight, Theo Ratliff, and Ben Wallace; the five lowest belonged to Zach Randolph, Michael Redd, Antawn Jamison, Troy Murphy, and Corey Maggette.

* The fourth and fifth principal components did not have natural interpretations.
_________________
Regards,
Justin Kubatko
Basketball-Reference.com
Ed Küpfer



Joined: 30 Dec 2004
Posts: 785
Location: Toronto

Posted: Thu Nov 10, 2005 2:36 pm
mtamada wrote:
Comment 1. I have a different interpretation of your Factor Two: it's not really about usage, it's about *quality*. Granted Iverson at the top is a little bizarre, but most purely statistical studies seem to turn up at least one "laugh test" failure (it's not nearly as funny as Dennis Rodman being at the top). But look at the top 8 players: Iverson, Shaq, Duncan, Pierce, LeBron, Garnett, and Nowitzki. Except for Pierce, and Iverson's odd #1 rating, we've pretty much got the first 8 answers that most people would give to the question: who's the best player in the NBA?


I dunno. Here's a closeup of the area surrounding Iverson:

Image

Quote:
In my experience, most Principal Components and Factor Analyses of these sorts of data do the same thing: the first princ. comp., or first factor, is one which simply measures quality....

The second principal component was almost identical to your Factor 1: a tallness measure, with rebounders and shotblockers at one end, and assisters at the other.


That is fascinating. I wouldn't have expected it. I'm glad that two different analyses, using different numbers, returned similar findings.

Quote:
It's possible that the Varimax Rotation that you did might explain why Iverson came out on top; if you do no rotation at all, who comes out on top?


The Varimax rotation:

Image


The Unrotated version:

Image

Looks like you may be on to something.

Quote:
Comment 3, actually it's a question: I'm confused, it looked like Ed's results ALREADY are using defensive stats. If not, what are these results about (cutting and pasting from part of the statistical results):


Hmm. I thought Dean was talking about hand-coded matchup stats. Dean?

Quote:
if my interpretations are correct, then Ed's two factors don't really enable us to re-design NBA player categories. The "tallness" dimension is highly consistent with the old-fashioned PG-SG-SF-PF-C spectrum. And the "quality" dimension is always implicit whenever we talk about players: we've always had good PGs and bad PGs, good Centers and bad Centers. So those two factors are simply another way of measuring what we've always known and talked about: players have different roles (based largely though not completely on height) and different quality levels.


You may very well be right about this. My post above was just to lay out the method, rather than to explore the results of that method. Remember that the chart above shows the results for the players' numbers averaged over three seasons. What I want to do is look closer at individual player-seasons, compare them, see if they tell us more about how the player played, in the positional sense that I was talking about. I'll be looking into it further over the next few days.

Quote:
if you picked out a third factor, what does it look like? Can it be given a sensible interpretation? And how important is that third factor -- what eigenvalue does it have, or what does your scree chart look like? My recollection of my results is fuzzy, but I think that the first two principal components were the only really important ones, and the components beyond that had little variation left to explain.


Unrotated, extracting 3 factors. I plotted Factor 2 (the "ability" measure) against Factor 3. I can't figure out what the hell it could be telling us:

Image
_________________
ed
HoopStudies



Joined: 30 Dec 2004
Posts: 705
Location: Near Philadelphia, PA

Posted: Thu Nov 10, 2005 2:54 pm
Ed Küpfer wrote:


Quote:
Comment 3, actually it's a question: I'm confused, it looked like Ed's results ALREADY are using defensive stats. If not, what are these results about (cutting and pasting from part of the statistical results):


Hmm. I thought Dean was talking about hand-coded matchup stats. Dean?



I was. Hand-coded defensive scoresheet data, like the stuff KevinB has been collecting. Much much higher quality data.
_________________
Dean Oliver
Author, Basketball on Paper
The postings are my own & don't necess represent positions, strategies or opinions of employers.
James



Joined: 10 Nov 2005
Posts: 1


Posted: Thu Nov 10, 2005 4:09 pm
I think the factor analysis approach is useful, but is better used in a slightly different fashion than has been proposed. I am not sure how much time I have at the moment to expand on things, but I will begin now, and perhaps resume at a later time.
I have done this a number of times in previous years using a standardized data set from Doug Steele's site. The conclusions you reach are somewhat dependent on how you standardize. For example, if you want to compare players' games, a reasonable way is to standardize stats per minute; but an alternative might be to emphasize other aspects of an individual's game, e.g., stats per shot.
I do not find it so interesting defining positions, so I accept outside assignments from sources like Steele and then search for similarities amongst the players at these positions.
You have dismissed the most important parts of the analysis when you do not consider things like eigenvalues and loadings. The value of the eigenvalue for each factor divided by the total number of variables you analyze tells you the percentage of the total variation that factor explains. Typically, any factor that has an eigenvalue less than one is useless.
The loadings tell you the correlation between the factor and the variable. A rule of thumb is that if the loading is under 0.7 the factor is not an adequate replacement for the variable. This is because 0.7 squared represents the percentage of variation in the variable that is explained by concomitant variation in the factor. .7*.7 is about 50%.
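The two rules of thumb above can be sketched as follows. The eigenvalues and loadings are made-up numbers for a hypothetical five-variable analysis, and the function names are mine. Dividing each eigenvalue by the number of variables assumes the analysis was run on the correlation matrix, so that the eigenvalues sum to the number of variables.

```python
import numpy as np

def variance_explained(eigenvalues):
    """Share of total variance per factor: eigenvalue / number of variables."""
    eigenvalues = np.asarray(eigenvalues, dtype=float)
    return eigenvalues / eigenvalues.size

def factors_to_keep(eigenvalues):
    """Count the factors whose eigenvalue is at least one."""
    return int(np.sum(np.asarray(eigenvalues, dtype=float) >= 1.0))

def well_represented(loadings, threshold=0.7):
    """True for each variable whose largest absolute loading meets the threshold."""
    return np.abs(np.asarray(loadings, dtype=float)).max(axis=1) >= threshold

# Hypothetical eigenvalues for a 5-variable correlation matrix (they sum to 5).
eigenvalues = [2.6, 1.4, 0.5, 0.3, 0.2]
# Hypothetical loadings: rows are variables, columns are the retained factors.
loadings = [[0.8, 0.1], [0.5, 0.6], [-0.75, 0.2]]

print(variance_explained(eigenvalues))
print(factors_to_keep(eigenvalues))
print(well_represented(loadings))
print(0.7 ** 2)  # share of a variable's variance explained at a loading of 0.7
```

The sign of a loading only gives the direction of the relationship, which is why the check uses the absolute value.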
What you want is factors that have unambiguously high loadings on some variables; then you can define these factors by these variables and try to interpret what the factor represents. Any variables that do not have sufficiently high loadings on any of the significant factors are not adequately represented by the analysis and must be considered separately. Rotations can greatly improve interpretability because they often will load some variables highly on one factor and low on all others. This greatly reduces ambiguity.
As I remember my analysis, one factor is often interpretable as aggressiveness. A second factor often indicates the tendency to be an inside player versus perimeter player.
Ed Küpfer



Joined: 30 Dec 2004
Posts: 785
Location: Toronto

Posted: Thu Nov 10, 2005 4:53 pm
James wrote:
I have done this a number of times in previous years using a standardized data set from Doug Steele's site. The conclusions you reach are somewhat dependent on how you standardize. For example, if you want to compare players' games, a reasonable way is to standardize stats per minute; but an alternative might be to emphasize other aspects of an individual's game, e.g., stats per shot.


The stats I used (except for minutes/game and EFG%) are per-minute stats, standardized by subtracting the mean and dividing the difference by the standard deviation. Per-shot stats wouldn't work because many of the stats (e.g. rebounds or assists) have nothing to do with how many shots a player takes.
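For anyone following along, a minimal sketch of that standardization in Python. The rebound rates are made up, and the use of the population standard deviation (`pstdev`) is an assumption, since the post does not say whether the population or sample version was used.

```python
from statistics import mean, pstdev


def standardize(values):
    """Return z-scores: (value - mean) / standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population SD; the post does not specify
    return [(v - mu) / sigma for v in values]


# Hypothetical per-48-minute rebound rates for five players.
reb_per_48 = [4.0, 6.0, 8.0, 10.0, 12.0]
z = standardize(reb_per_48)
print([round(x, 2) for x in z])
```

After this step every stat has mean 0 and standard deviation 1, so stats measured on very different scales can be compared directly.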


Quote:
You have dismissed the most important parts of the analysis when you do not consider things like eigenvalues and loadings. The value of the eigenvalue for each factor divided by the total number of variables you analyze tells you the percentage of the total variation that factor explains. Typically, any factor that has an eigenvalue less than one is useless.


Okay. I'll redo the analysis, omitting the unimportant factors.

Quote:
As I remember my analysis, one factor is often interpretable as aggressiveness. A second factor often indicates the tendency to be an inside player versus perimeter player


I don't see anything like your first factor. That may be because you limited your defensive stats to the usual boxscore stats (STL, REB, BLK), while I included many more.

I greatly appreciate your comments on my analysis. Will incorporate your suggestions in the future.
_________________
ed
Mark



Joined: 20 Aug 2005
Posts: 807


Posted: Sat Nov 12, 2005 9:20 pm Post subject: Players by quartiles
Using your "Descriptive stats of the variables used in the factor analysis" chart, I looked at some player examples by quartile on offense and defense (rounding off all fractions for simplicity):

a player at the exact mean for offensive and defensive points would appear to lose the matchup 16 to 19, or about -3;

a player at the 75th percentile on offense but the 25th percentile on defense would roughly tie 20 to 20, or 0;

a player at the 25th percentile on offense but the 75th percentile on defense would lose 13 to 17, or -4;

a player at the 75th percentile on offense but the 50th percentile on defense would win 20 to 19, or +1;

a player at the 50th percentile on offense but the 75th percentile on defense would lose 16 to 17, or -1;

a player at the 50th percentile on offense but the 25th percentile on defense would lose 16 to 20, or -4; and

a player at the 25th percentile on offense but the 50th percentile on defense would lose 13 to 19, or -6.


If player data were fixed, unaffected by team, this would suggest that in general you would mainly want players above the 75th percentile on offense, as long as they are at least above the 25th percentile on defense.

But of course player data is not fixed: team interactions matter, and multiple good or bad offensive or defensive players affect results in more than a linear way. So this method is too static and not reliable. I would not be surprised if some GMs thought this way, though. The most successful ones would seem likely to be those with a better feel for team dynamics and the relative value of offensive and defensive advantages by position, especially against other top contenders.

Still, it is interesting that it seems to take at least a 25-percentile-point advantage between the two sides of the game to gain even a single point, and even then only under certain combinations, not others, and maybe not under most.

A very good two-way player at the 75th percentile on both offense and defense would win 20 to 17, or +3: three times the advantage of an offense-biased player (75th percentile on offense, 50th on defense). Such players should be much more sought after.
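The matchup arithmetic in this post can be reproduced with a short Python sketch. The point values are the rounded figures quoted above, with the mean treated as the 50th percentile (the post uses 16 and 19 for both); the function and variable names are just for illustration.

```python
# Rounded points from the examples above, keyed by percentile.
# Offense: points scored. Defense: points allowed (higher percentile = fewer allowed).
offense_pts = {25: 13, 50: 16, 75: 20}
defense_pts_allowed = {25: 20, 50: 19, 75: 17}


def margin(off_pct, def_pct):
    """Points scored minus points allowed for a given offense/defense percentile mix."""
    return offense_pts[off_pct] - defense_pts_allowed[def_pct]


for off_pct in (75, 50, 25):
    for def_pct in (75, 50, 25):
        print(f"offense {off_pct}th / defense {def_pct}th: {margin(off_pct, def_pct):+d}")
```

The loop also prints the one combination not listed above (25th percentile on both sides), which follows from the same rounded figures.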

More information about how much extra bang for the buck you get from players in the 90th percentile is not available from this chart but I assume the gains are much more dramatic and explain the emphasis on stars and especially superstars.

Offensive bias is slightly more favorable in the boxscore.

It would be interesting to know the average salaries for these player examples and how much of a premium you pay for expected gains on the court. Maybe you could use some form of maximization analysis over the offense/defense combinations of free-agent choices and fill out your roster as a team mix of offense and defense within your budget frontier, rather than viewing each player-buying decision piecemeal. Some buys might make more Moneyball sense than others in the team context, and the best choice of player for a particular team might not be the absolute best player based on weighted stats but rather one who fills more of the underfilled needs, especially if he can do so at a favorable price.
gabefarkas



Joined: 31 Dec 2004
Posts: 1313
Location: Durham, NC

Posted: Wed Jan 11, 2006 9:58 am Post subject: Re: Positions, Dimensions, and Factor Analysis
This is one of the most fascinating things I've ever read. I don't know why or how I missed it the first time around. A few thoughts/questions...

Ed Küpfer wrote:
In summary, here are the stats I included (for every stat except the first I used both Offensive and Defensive, prefixed with "o" and "d" respectively):

Code:
MPG
FGA/48
eFG%
FTA/48
Reb/48
Ast/48
T/O/48
Blk/48
PF/48
Pts/48



All of those, save MPG, can be found on the 82games.com player "By Position" pages.

ANALYSIS

I assembled my data, which included every player-team-season combination from the 02-03 season to the 04-05 season (n = 1360). I standardised the stats by subtracting the season-specific mean from each observation, and dividing the difference by the season-specific standard deviation. (Factor analysis will return weird, but usable, results if you don't standardise, but what the hell. The means and standard deviations I used can be found in the appendix to this post.)

The way factor analysis works, it returns a bunch of numbers with funky names like "eigenvalues" and "loadings" and "communalities." But the ones I want are the "factor score coefficients." Here's what I got:

Code:

Factor Score Coefficients

Variable  Factor1  Factor2
MPG         0.015   -0.225
oFGA        0.002   -0.223
oeFG        0.049   -0.144
oFTA        0.095   -0.225
oReb        0.159   -0.039
oAst       -0.130   -0.032
oT/O       -0.017   -0.101
oBlk        0.135   -0.020
oPF         0.078    0.131
oPts        0.041   -0.283
dFGA       -0.084   -0.039
deFG        0.015    0.019
dFTA        0.042    0.091
dReb        0.159    0.003
dAst       -0.153    0.012
dT/O       -0.074    0.063
dBlk        0.151    0.013
dPF         0.143   -0.079
dPts       -0.049    0.016




OK, I'm with you up until right around here. How do you take nineteen "statistics" and relate them to two axes? You take a player's value for said statistic, normalize it, and multiply it by the Factor rating above, but then what? Do you sum all of the values for statistic*Factor1 and that becomes one axis, and sum all the values for statistic*Factor2 and that becomes the second axis?

Could you explain further please?
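For what it's worth, the usual reading of factor score coefficients is exactly that weighted sum: each standardized stat is multiplied by its coefficient, and the products are summed once per factor. Below is a minimal Python sketch under that assumption; the quoted post does not confirm that Ed's software applies no further scaling, and the example player is hypothetical.

```python
# Factor score coefficients copied from the table quoted above: (Factor1, Factor2).
COEFFICIENTS = {
    "MPG": (0.015, -0.225),
    "oFGA": (0.002, -0.223),
    "oeFG": (0.049, -0.144),
    "oFTA": (0.095, -0.225),
    "oReb": (0.159, -0.039),
    "oAst": (-0.130, -0.032),
    "oT/O": (-0.017, -0.101),
    "oBlk": (0.135, -0.020),
    "oPF": (0.078, 0.131),
    "oPts": (0.041, -0.283),
    "dFGA": (-0.084, -0.039),
    "deFG": (0.015, 0.019),
    "dFTA": (0.042, 0.091),
    "dReb": (0.159, 0.003),
    "dAst": (-0.153, 0.012),
    "dT/O": (-0.074, 0.063),
    "dBlk": (0.151, 0.013),
    "dPF": (0.143, -0.079),
    "dPts": (-0.049, 0.016),
}


def factor_scores(z_scores):
    """Sum of (standardized stat * coefficient), computed once per factor."""
    f1 = sum(z_scores[name] * c[0] for name, c in COEFFICIENTS.items())
    f2 = sum(z_scores[name] * c[1] for name, c in COEFFICIENTS.items())
    return f1, f2


# Hypothetical player: league average everywhere (z = 0) except one standard
# deviation above the mean in both rebounding stats.
player = {name: 0.0 for name in COEFFICIENTS}
player["oReb"] = 1.0
player["dReb"] = 1.0

f1, f2 = factor_scores(player)
print(round(f1, 3), round(f2, 3))
```

Each player then becomes a single point (Factor1 score, Factor2 score), which is how nineteen stats end up on two axes.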

Ed Küpfer wrote:

In this post I want to put forward the idea that a) positions should be designated not by what we think players should be doing, but by how they actually play, and b) that positions are more usefully categorised along 2 dimensions rather than the traditional single dimension. I put forward a method of collapsing a bunch of offensive and defensive stats into a more manageable form. I believe a graphical presentation of these 2-dimensional positional categories is both useful and intuitive.


Right off the bat, I think the clumping is beginning to work. Draw a circle with center exactly at (0,0) and expand it until you get a meaningful sample. That's your first position. Then, what remains in each of the four quadrants represents the other four positions.

What do you think?
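A rough Python sketch of that circle-plus-quadrants idea, to make it concrete. The player coordinates and the target sample size are made up, and the group labels are placeholders rather than position names.

```python
import math


def radius_for_sample(points, min_count):
    """Smallest radius around (0, 0) that encloses at least min_count points."""
    if not 1 <= min_count <= len(points):
        raise ValueError("min_count must be between 1 and the number of points")
    distances = sorted(math.hypot(x, y) for x, y in points)
    return distances[min_count - 1]


def classify(x, y, radius):
    """Central group if inside the circle, otherwise one of the four quadrants."""
    if math.hypot(x, y) <= radius:
        return "center"
    if x >= 0:
        return "upper right" if y >= 0 else "lower right"
    return "upper left" if y >= 0 else "lower left"


# Made-up (Factor1, Factor2) coordinates for six hypothetical players.
players = [(0.1, 0.1), (0.3, -0.4), (1.0, 1.0), (-1.2, 0.8), (-0.9, -1.1), (1.5, -0.7)]
radius = radius_for_sample(players, min_count=2)
groups = [classify(x, y, radius) for x, y in players]
print(groups)
```

Points lying exactly on an axis are pushed to the non-negative side here; that tie-break is arbitrary.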


Ed Küpfer wrote:

I think you're all going about it wrong. (Howzat for chutzpah?) One can categorise all stats into two types: successes and opportunities. FG% measures a success rate. MPG measures an opportunity rate. FT% measures a success rate. FTA/g measures an opportunity rate. Every method of similarity I've seen mixes the two types, successes and opportunities. I think this is wrong. I believe that a player who shoots 12 FGA per game is very similar to another player shooting 12 FGA per game, regardless of the difference in their respective success rates.

That's not quite true. What I mean is, shooting 12 FGA per game is a kind of similarity, a meaningful similarity untracked by the similarity methods I've seen. Another kind of similarity, of course, is the difference in the two players' FG%.


The question that immediately pops into my mind is: does every opportunity measured have to have a success related to it that is also measured? For instance, I could posit that FGA and FG% are opportunity and success measurements that would be related. However, what success would be related to MPG? What opportunity is related to steals/game?

Also, how does this opportunity/success model relate, if at all, to the axes described earlier in your post? Perhaps another model could be created that more directly plots players based on one axis being a measurement of opportunities and the other being a measurement of successes?
Mark



Joined: 20 Aug 2005
Posts: 807


Posted: Wed Jan 11, 2006 1:49 pm
"Draw a circle with center exactly at (0,0) and expand it until you get a meaningful sample. That's your first position. Then, what remains in each of the four quadrants represents the other four positions. "

I agree that Ed's chart seems to divide into four positions. But there is plenty of room in each quadrant for variation and maybe a few slip over the line to the next position.

The fifth doesn't have to be centered at (0,0). Is it the small forward? Likely, but it is not required to be. I think it could be in any quadrant depending on the mix of the other 4 guys. Maybe the small forward can be thought of as generally near the y axis (because half big / half perimeter) but not necessarily around (0,0), and I am not sure there is any strong reason it is desirable to be close to it.

I like the greater freedom of 4 positions, not 5, and seeing where players fall in each quad, since you can see the relative amount of the other positions in them by how much closer they are to the other quadrants than is typical for others at their position (e.g. Antonio Daniels is more SG-like than Luke Ridnour).

I wonder, if you total the x and y positions of the 5 starters on each team, how much team variation there would be. The differences from the average would tell you if the team as a whole played "taller/bigger" or better offensively.
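A sketch of that team-total idea in Python, using two invented lineups. All coordinates are made up, and reading one axis as size and the other as offense is my interpretation of the chart, not something established in the thread.

```python
# Hypothetical (x, y) factor scores for the five starters of two made-up teams.
teams = {
    "Team A": [(-1.0, 0.2), (-0.5, -0.4), (0.1, 0.0), (0.8, 0.3), (1.4, -0.1)],
    "Team B": [(-0.6, -0.5), (0.0, -0.3), (0.4, 0.1), (1.0, 0.2), (1.2, 0.9)],
}


def lineup_total(starters):
    """Sum the x and y coordinates of a lineup."""
    return (sum(x for x, _ in starters), sum(y for _, y in starters))


totals = {name: lineup_total(starters) for name, starters in teams.items()}

# Average lineup total across teams, then each team's difference from it.
avg_x = sum(t[0] for t in totals.values()) / len(totals)
avg_y = sum(t[1] for t in totals.values()) / len(totals)
deviations = {name: (tx - avg_x, ty - avg_y) for name, (tx, ty) in totals.items()}

for name, (dx, dy) in deviations.items():
    print(f"{name}: x {dx:+.2f}, y {dy:+.2f} versus the average lineup")
```

With real data, the spread of these deviations across all teams would answer the question of how much team variation there is.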
gabefarkas



Joined: 31 Dec 2004
Posts: 1313
Location: Durham, NC

Posted: Thu Jan 12, 2006 9:26 am
Mark wrote:

I like the greater freedom of 4 positions, not 5, and seeing where players fall in each quad, since you can see the relative amount of the other positions in them by how much closer they are to the other quadrants than is typical for others at their position (e.g. Antonio Daniels is more SG-like than Luke Ridnour).


Well, yeah. If you have a 2-axis plot, of course dividing it in 4 makes sense. Unfortunately, there are 5 guys on the floor at any one time, each with a distinct positional name attributed to him. Thus, we need to figure out how to divide the 2-axis plot into 5.