Contrasting Ridge Regression and OLS
Posted: Tue May 14, 2013 3:50 pm
This post was split off from the thread Demystifying Ridge Regression.
Preface: If you don't want to get confused by math, please ignore my post and jump to the next one.
v-zero wrote: All ridge regression is, in a nutshell, the insertion of dummy measurements into a weighted multiple linear regression (the standard APM regression in this case, with dummy measurements for each individual player). That's it. In the case of standard RAPM the dummy measurements are set to zero (these dummy measurements are also referred to as priors, if you weren't sure). However, there is no compunction for them to be set to zero, it is merely the most basic version. The other factor that may confuse people is the lambda value. All the lambda value is is a way to apportion a certain weight to the dummy measurements, depending on your confidence in them relative to the data.

Mathematically speaking that is incorrect. Ridge regression refers to Tikhonov regularization, which is just the introduction of lambda as a constraint (Hoerl and Kennard, 1970), based on Tikhonov's work on the stabilization of inverse problems (1943). The dummy measurements you are speaking of are not per se part of ridge regression, but a prior distribution of the coefficients based on Bayesian probability (which is the basis of Bayesian linear regression).
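Numerically, the two descriptions do coincide, though. Here is a small numpy sketch with toy data and an arbitrary lambda (not a real matchup file): appending sqrt(λ)·I dummy rows to the design matrix with zero targets and then running plain OLS on the augmented data reproduces the ridge closed-form solution.

Code: Select all
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 200, 5, 10.0                 # observations, coefficients, lambda (demo values)
X = rng.normal(size=(n, p))              # design matrix
y = rng.normal(size=n)                   # response vector

# Ridge closed form: beta = (X'X + lambda*I)^(-1) X'y
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# "Dummy measurement" view: append sqrt(lambda)*I rows to X and the prior
# (here zero) to y, then run ordinary least squares on the augmented data.
X_aug = np.vstack([X, np.sqrt(lam) * np.eye(p)])
y_aug = np.concatenate([y, np.zeros(p)])
beta_aug = np.linalg.lstsq(X_aug, y_aug, rcond=None)[0]

print(np.allclose(beta_ridge, beta_aug))   # True: both views give the same coefficients

Replace the zero targets with sqrt(λ) times a prior vector and the same trick shrinks the coefficients toward that prior instead of toward zero.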
The important thing about ridge regression is the existence theorem: there always exists a lambda such that the set of coefficients derived via ridge regression has a lower MSE than the set of coefficients derived via OLS. That this is true was shown by Chawla, 1988. What we essentially have is regression to the mean, which increases the predictive power.
Now, in basketball, the set of equations derived from the play-by-play presents an ill-posed problem. Especially in our case, multicollinearity is a big issue. Ridge regression limits that problem by putting a constraint on how far a coefficient can differ from the mean. That gives better predictive value, but it comes at the cost of an introduced bias, which can be calculated. In essence, you could set lambda to infinity, which would give everyone 0 as a coefficient. The other corollary was already part of a previous post: as lambda -> 0, the set of coefficients goes to the OLS results. That can be easily understood by looking at the following equations (and there is a small simulation further down that illustrates the MSE point).
OLS
Code: Select all
β = (X^(T)X)^(-1)X^(T)y
β is the coefficient vector (the result)
X is the design matrix
X^T is the transposed design matrix
y is the response vector
For ridge regression it looks basically the same, with the exception of the introduced lambda.
I is the identity matrix
Code: Select all
β = (X^(T)X + λI)^(-1)X^(T)y
Well, this shows that the basic essence of your post is correct; someone who understands OLS should not have an issue with ridge regression. The equations shown above make it easy to see why lambda = 0 gives the same result as OLS: λI for λ = 0 is the zero matrix.
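If anyone wants to check the two limiting cases numerically, here is a small sketch with made-up data (values chosen only for illustration): λ = 0 reproduces the OLS coefficients, and a growing λ pulls the coefficients toward zero.

Code: Select all
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 4
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

def ridge(X, y, lam):
    # beta = (X'X + lambda*I)^(-1) X'y; lam = 0 is plain OLS
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.allclose(ridge(X, y, 0.0), beta_ols))     # True: lambda = 0 is OLS

for lam in (1.0, 100.0, 1e6):
    print(lam, np.linalg.norm(ridge(X, y, lam)))   # norm of the coefficients shrinks toward 0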
Now, let us call U = (X^(T)X + λI)^(-1); then the bias is:
Code: Select all
bias(β) = -λUβ
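Just to fill in the step (assuming the usual model where E[y] = Xβ, with β the true coefficient vector):

Code: Select all
E[β_ridge] = U X^(T) E[y] = U X^(T)X β = U (X^(T)X + λI - λI) β = β - λUβ
bias(β) = E[β_ridge] - β = -λUβ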
Pretty straightforward, I guess. Oh, and I wouldn't ignore the matrix algebra, because it really helps to solve the problem rather quickly. It also makes it easier to understand, for my taste. And I really think that someone should understand matrix algebra first before trying to understand ridge regression.
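And to put a number on the existence-theorem claim from the beginning, here is a small simulation sketch; the design, true coefficients and lambda are all made up for illustration. With strongly collinear columns, the ridge coefficients are on average much closer to the true coefficients than the OLS ones.

Code: Select all
import numpy as np

rng = np.random.default_rng(2)
n, p, lam, sigma, reps = 100, 5, 5.0, 2.0, 500   # arbitrary demo settings
beta_true = rng.normal(size=p)

mse_ols = mse_ridge = 0.0
for _ in range(reps):
    # Strongly collinear design: all columns share one latent factor.
    z = rng.normal(size=(n, 1))
    X = z + 0.1 * rng.normal(size=(n, p))
    y = X @ beta_true + sigma * rng.normal(size=n)

    b_ols = np.linalg.lstsq(X, y, rcond=None)[0]
    b_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

    mse_ols += np.sum((b_ols - beta_true) ** 2) / reps
    mse_ridge += np.sum((b_ridge - beta_true) ** 2) / reps

print("coefficient MSE, OLS:  ", mse_ols)
print("coefficient MSE, ridge:", mse_ridge)   # typically far smaller in this collinear setup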
References:
Tikhonov, 1943: http://a-server.math.nsc.ru/IPP/BASE_WORK/tihon_en.html
Hoerl and Kennard, 1970: http://www.jstor.org/discover/10.2307/1 ... 2215866301
Chawla, 1988: http://www.sciencedirect.com/science/ar ... 5288900399
Btw, if I understood Engelmann's approach correctly, the one thing which helped him improve the results further was a machine learning algorithm. He showed (in a Sloan paper) that a boxscore-based stat improves in terms of predictive power even over regularized +/- by using a finite state machine model. I guess he applied that approach to the ridge regression model as well in order to improve the predictive power further. I'm not entirely sure whether that is true, but his results in comparison to mine (albeit with a not particularly good matchup file) suggest it.
AcrossTheCourt wrote: I've heard of using -3 or -2 for rookies as the standard. But there are clearly rookies who have an impact right away (guys who spent a while in school like Duncan or, oddly enough, Rubio), and there are clearly rookies who are over their heads (Austin Rivers, Morrison). So it doesn't seem ideal to use the same prior for everyone.

You could use a different prior for each rookie. What you would need is a reliable approximation, which can be based on draft position, height, age and college (high school) stats. I applied a regression to a statistical +/- model and found that a higher draft position usually means a higher value, that more height over the league-average height for a specific position (I use only 3 positions: lead guards, wings and pivots) usually results in a better value, and that age also has a positive effect (meaning older rookies tend to have a better value).
Using the same prior for each rookie is just an estimate, and it gives better predictive value than using no prior. I doubt that anyone would argue that this is ideal or that no better way exists. It is just a rather easy solution to an existing problem.
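To show what a non-zero prior means mechanically, here is one last small sketch; the on/off design, the per-player priors and the lambda are all invented for illustration. Shrinking toward a prior vector instead of toward zero just means adding λ times the prior on the right-hand side of the ridge system, or, equivalently, regressing the residual y - X*prior and adding the prior back.

Code: Select all
import numpy as np

rng = np.random.default_rng(3)
n, p, lam = 300, 6, 20.0                          # stints, players, lambda (illustration only)
X = rng.choice([-1.0, 0.0, 1.0], size=(n, p))     # toy on/off design, not a real matchup file
y = rng.normal(size=n)                            # toy point margins

# Hypothetical per-player priors, e.g. two rookies at -2, everyone else at 0.
prior = np.array([0.0, 0.0, -2.0, 0.0, -2.0, 0.0])

# Ridge shrinking toward the prior: beta = (X'X + lambda*I)^(-1) (X'y + lambda*prior)
A = X.T @ X + lam * np.eye(p)
beta = np.linalg.solve(A, X.T @ y + lam * prior)

# Equivalent view: shrink the adjustment relative to the prior toward zero, then add the prior back.
beta_alt = prior + np.linalg.solve(A, X.T @ (y - X @ prior))

print(np.allclose(beta, beta_alt))   # True
print(beta)                          # coefficients get pulled toward their individual priors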