Contrasting Ridge Regression and OLS
Posted: Tue May 14, 2013 3:50 pm
This post was split off from the thread Demystifying Ridge Regression.
Preface: If you don't want to get confused by math, please ignore my post and jump to the next one.
v-zero wrote: All ridge regression is, in a nutshell, the insertion of dummy measurements into a weighted multiple linear regression (the standard APM regression in this case, with dummy measurements for each individual player). That's it. In the case of standard RAPM the dummy measurements are set to zero (these dummy measurements are also referred to as priors, if you weren't sure). However, there is no compunction for them to be set to zero, it is merely the most basic version. The other factor that may confuse people is the lambda value. All the lambda value is is a way to apportion a certain weight to the dummy measurements, depending on your confidence in them relative to the data.

Mathematically speaking that is incorrect. Ridge regression refers to Tikhonov regularization, which is just the introduction of lambda as a constraint (Hoerl and Kennard, 1970), based on Tikhonov's work on the stabilization of inverse problems (1943). The dummy measurements you are speaking of are not per se part of ridge regression, but a prior distribution of the coefficients based on Bayesian probability (which is the basis of Bayesian linear regression).
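Numerically, the two descriptions do coincide, though. Here is a small numpy sketch with toy data and an arbitrary lambda (not a real matchup file): appending sqrt(λ)·I dummy rows to the design matrix with zero targets and then running plain OLS on the augmented data reproduces the ridge closed-form solution.

Code: Select all
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 200, 5, 10.0                 # observations, coefficients, lambda (demo values)
X = rng.normal(size=(n, p))              # design matrix
y = rng.normal(size=n)                   # response vector

# Ridge closed form: beta = (X'X + lambda*I)^(-1) X'y
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# "Dummy measurement" view: append sqrt(lambda)*I rows to X and the prior
# (here zero) to y, then run ordinary least squares on the augmented data.
X_aug = np.vstack([X, np.sqrt(lam) * np.eye(p)])
y_aug = np.concatenate([y, np.zeros(p)])
beta_aug = np.linalg.lstsq(X_aug, y_aug, rcond=None)[0]

print(np.allclose(beta_ridge, beta_aug))   # True: both views give the same coefficients

Replace the zero targets with sqrt(λ) times a prior vector and the same trick shrinks the coefficients toward that prior instead of toward zero.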
The important thing about ridge regression is the existence theorem: there always exists a lambda such that the set of coefficients derived via ridge regression has a lower MSE than the set of coefficients derived via OLS. That this is true was shown by Chawla, 1988. What we essentially have is regression to the mean, which increases the predictive power.
Now, in basketball, the set of equations derived from the play-by-play presents an ill-posed problem. Especially in our case, multicollinearity is a big issue. Ridge regression limits that problem by putting a constraint on how far a coefficient can differ from the mean. That gives better predictive value, but it comes at the cost of an introduced bias, which can be calculated. In essence, you could set lambda to infinity, which would give everyone 0 as a coefficient. The other corollary was already part of a previous post: as lambda -> 0, the set of coefficients goes to the OLS results. That can be easily understood by looking at the following equations (and there is a small simulation further down that illustrates the MSE point).
OLS
Code: Select all
β = (X^(T)X)^(-1)X^(T)y
β is the coefficient vector (the result)
X is the design matrix
X^T is the transposed design matrix
y is the response vector
For ridge regression it looks basically the same, with the exception of the introduced lambda.
I is the identity matrix
Code: Select all
β = (X^(T)X + λI)^(-1)X^(T)y
Well, this shows that the basic essence of your post is correct; someone who understands OLS should not have an issue with ridge regression. The equations shown above make it easy to see why lambda = 0 gives the same result as OLS: λI for λ = 0 is the zero matrix.
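If anyone wants to check the two limiting cases numerically, here is a small sketch with made-up data (values chosen only for illustration): λ = 0 reproduces the OLS coefficients, and a growing λ pulls the coefficients toward zero.

Code: Select all
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 4
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

def ridge(X, y, lam):
    # beta = (X'X + lambda*I)^(-1) X'y; lam = 0 is plain OLS
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.allclose(ridge(X, y, 0.0), beta_ols))     # True: lambda = 0 is OLS

for lam in (1.0, 100.0, 1e6):
    print(lam, np.linalg.norm(ridge(X, y, lam)))   # norm of the coefficients shrinks toward 0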
Now, let us call U = (X^(T)X + λI)^(-1); then the bias is:
Code: Select all
bias(β) = -λUβ
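Just to fill in the step (assuming the usual model where E[y] = Xβ, with β the true coefficient vector):

Code: Select all
E[β_ridge] = U X^(T) E[y] = U X^(T)X β = U (X^(T)X + λI - λI) β = β - λUβ
bias(β) = E[β_ridge] - β = -λUβ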
Pretty straightforward, I guess. Oh, and I wouldn't ignore the matrix algebra, because it really helps to solve the problem rather quickly. It also makes it easier to understand, for my taste. And I really think that someone should understand matrix algebra first before trying to understand ridge regression.
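And to put a number on the existence-theorem claim from the beginning, here is a small simulation sketch; the design, true coefficients and lambda are all made up for illustration. With strongly collinear columns, the ridge coefficients are on average much closer to the true coefficients than the OLS ones.

Code: Select all
import numpy as np

rng = np.random.default_rng(2)
n, p, lam, sigma, reps = 100, 5, 5.0, 2.0, 500   # arbitrary demo settings
beta_true = rng.normal(size=p)

mse_ols = mse_ridge = 0.0
for _ in range(reps):
    # Strongly collinear design: all columns share one latent factor.
    z = rng.normal(size=(n, 1))
    X = z + 0.1 * rng.normal(size=(n, p))
    y = X @ beta_true + sigma * rng.normal(size=n)

    b_ols = np.linalg.lstsq(X, y, rcond=None)[0]
    b_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

    mse_ols += np.sum((b_ols - beta_true) ** 2) / reps
    mse_ridge += np.sum((b_ridge - beta_true) ** 2) / reps

print("coefficient MSE, OLS:  ", mse_ols)
print("coefficient MSE, ridge:", mse_ridge)   # typically far smaller in this collinear setup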
References:
Tikhonov, 1943: http://a-server.math.nsc.ru/IPP/BASE_WORK/tihon_en.html
Hoerl and Kennard, 1970: http://www.jstor.org/discover/10.2307/1 ... 2215866301
Chawla, 1988: http://www.sciencedirect.com/science/ar ... 5288900399
Btw, if I understood Engelmann's approach correctly, the one thing which helped him improve the results further was a machine learning algorithm. He showed (in a Sloan paper) that a boxscore-based stat improves in terms of predictive power even over regularized +/- by using a finite state machine model. I guess he applied that approach to the ridge regression model as well in order to improve the predictive power further. I'm not entirely sure whether that is true, but his results in comparison to mine (albeit with a not particularly good matchup file) suggest it.
AcrossTheCourt wrote: I've heard of using -3 or -2 for rookies as the standard. But there are clearly rookies who have an impact right away (guys who spent a while in school like Duncan or, oddly enough, Rubio), and there are clearly rookies who are over their heads (Austin Rivers, Morrison). So it doesn't seem ideal to use the same prior for everyone.

You could use a different prior for each rookie. What you would need is a reliable approximation, which can be based on draft position, height, age and college (high school) stats. I applied a regression to a statistical +/- model and found that a higher draft position usually means a higher value, that more height over the league-average height for a specific position (I use only 3 positions: lead guards, wings and pivots) usually results in a better value, and that age also has a positive effect (meaning older rookies tend to have a better value).
Using the same prior for each rookie is just an estimate, and it gives better predictive value than using no prior. I doubt that anyone would argue that this is ideal or that no better way exists. It is just a rather easy solution to an existing problem.
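To show what a non-zero prior means mechanically, here is one last small sketch; the on/off design, the per-player priors and the lambda are all invented for illustration. Shrinking toward a prior vector instead of toward zero just means adding λ times the prior on the right-hand side of the ridge system, or, equivalently, regressing the residual y - X*prior and adding the prior back.

Code: Select all
import numpy as np

rng = np.random.default_rng(3)
n, p, lam = 300, 6, 20.0                          # stints, players, lambda (illustration only)
X = rng.choice([-1.0, 0.0, 1.0], size=(n, p))     # toy on/off design, not a real matchup file
y = rng.normal(size=n)                            # toy point margins

# Hypothetical per-player priors, e.g. two rookies at -2, everyone else at 0.
prior = np.array([0.0, 0.0, -2.0, 0.0, -2.0, 0.0])

# Ridge shrinking toward the prior: beta = (X'X + lambda*I)^(-1) (X'y + lambda*prior)
A = X.T @ X + lam * np.eye(p)
beta = np.linalg.solve(A, X.T @ y + lam * prior)

# Equivalent view: shrink the adjustment relative to the prior toward zero, then add the prior back.
beta_alt = prior + np.linalg.solve(A, X.T @ (y - X @ prior))

print(np.allclose(beta, beta_alt))   # True
print(beta)                          # coefficients get pulled toward their individual priors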