Problems with Linearity Assumptions

Posted: Mon Dec 29, 2014 8:20 pm
by DSMok1
Nick Neuteufel has been raising some issues on Twitter with the linearity assumption of the generalized linear models behind such stats as Box Plus/Minus, SPM, and others.

In particular, DRB% fails to show any linear link function to RAPM, or even any linear link function for any transformation of DRB%.

An example of the issue:
[Image]
(p-val of GVLMA is 6.068e-13.)
DRB% is a non-linear predictor that violates the linearity assumption of linear regression. More on that: http://t.co/PAhWovJFUl
Links to some of the tweets so the conversation can be followed:
https://twitter.com/Neuteufel/status/548135893354442754
https://twitter.com/Neuteufel/status/549630822232653826

Nick promised an upcoming post on the subject, so when that appears I'll link to it here.

EDIT: Paper on Global Validation of Linear Model Assumptions (a good read): http://www.google.com/url?sa=t&rct=j&q= ... 1339,d.aWw

Re: Problems with Linearity Assumptions

Posted: Mon Dec 29, 2014 8:25 pm
by nrestifo
I talked to him about this. I've been looking forward to reading it.

Re: Problems with Linearity Assumptions

Posted: Wed Dec 31, 2014 3:24 am
by mtamada
Well if transformations don't work, then an easy and standard patch is to use what econometricians call slope dummies, i.e. allow the regression line to have one or more kinks in it. That results in a piecewise linear regression line; if we really need to have a curved regression line then we can resort to cubic splines.
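A minimal sketch of the slope-dummy idea described above, in plain Python. Everything specific here is an assumption made for illustration: the knot at DRB% = 20, the noise-free synthetic data, and the helper names. It fits y = b0 + b1*x + b2*max(0, x - knot), so b2 is the change in slope past the kink; it is not the regression anyone in this thread actually ran.

```python
# Sketch only: knot location, data, and function names are hypothetical.

def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_slope_dummy(xs, ys, knot):
    """Ordinary least squares for y = b0 + b1*x + b2*max(0, x - knot)."""
    rows = [[1.0, x, max(0.0, x - knot)] for x in xs]
    k = 3
    # Normal equations: (X'X) b = X'y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict(coefs, x, knot):
    """Evaluate the fitted piecewise linear line at x."""
    b0, b1, b2 = coefs
    return b0 + b1 * x + b2 * max(0.0, x - knot)


# Synthetic example: slope 0.10 below the knot, 0.02 above it.
knot = 20.0
drb = [float(v) for v in range(5, 36)]
value = [-2.0 + 0.10 * x - 0.08 * max(0.0, x - knot) for x in drb]
coefs = fit_slope_dummy(drb, value, knot)
```

With real data the knot would have to be chosen (or searched over), and more kinks can be added as extra max(0, x - knot_k) columns.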

If the non-linearity is truly serious, then modeling it this way will give us much better estimates of the effects of defensive rebounding.

But eyeballing that graph, the non-linearity doesn't look that large to me. Plus or minus half a point throughout almost all of the range. Yes half a point is large in some contexts, but plus-minus regressions have large standard errors to begin with. Will an improved functional form lead to a large revision in our estimate of the effect of defensive rebounds?

Maybe; and given the statistical significance that he's found it could well be worthwhile to use a better-fitting functional form.

The other question is whether this improved functional form will lead to different estimates of the other coefficients. That's one of the hidden pitfalls of mis-specifying the functional form for defensive rebounds: the other estimates become inaccurate. Maybe we find out that the old regressions have been mis-evaluating assists, shooting, etc. Again we won't know until we re-run the regressions, but my guess is that this won't cause major revision of the estimates.

Re: Problems with Linearity Assumptions

Posted: Wed Dec 31, 2014 12:38 pm
by v-zero
Methods such as the use of boosted decision stumps (with linear regressions at their terminal nodes if required) can avoid the linearity question for the most part.
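For anyone unfamiliar with the approach, here is a rough sketch of boosted decision stumps for a single predictor, in plain Python. It is gradient boosting with squared-error loss and constant-valued leaves (not the linear-regression leaves mentioned above); the learning rate, the number of rounds, the synthetic kinked data, and all function names are assumptions for illustration.

```python
# Sketch only: hyperparameters, data, and function names are hypothetical.

def fit_stump(xs, residuals):
    """Find the single split that minimizes squared error on the residuals.

    Returns (threshold, left_mean, right_mean).
    """
    best = None
    candidates = sorted(set(xs))
    for lo, hi in zip(candidates, candidates[1:]):
        thr = (lo + hi) / 2.0
        left = [r for x, r in zip(xs, residuals) if x <= thr]
        right = [r for x, r in zip(xs, residuals) if x > thr]
        lm = sum(left) / len(left)
        rm = sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, thr, lm, rm)
    return best[1:]


def boost_stumps(xs, ys, n_rounds=200, learning_rate=0.1):
    """Fit stumps one at a time, each to the current residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        thr, lm, rm = fit_stump(xs, residuals)
        stumps.append((thr, lm, rm))
        preds = [p + learning_rate * (lm if x <= thr else rm)
                 for x, p in zip(xs, preds)]
    return base, learning_rate, stumps


def predict_boosted(model, x):
    """Sum the shrunken stump outputs on top of the base prediction."""
    base, learning_rate, stumps = model
    return base + learning_rate * sum(lm if x <= thr else rm
                                      for thr, lm, rm in stumps)


def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


# Synthetic kinked relationship between a rate stat and player value.
xs = [float(v) for v in range(5, 36)]
ys = [-2.0 + 0.10 * x - 0.08 * max(0.0, x - 20.0) for x in xs]
model = boost_stumps(xs, ys)
mse_boosted = mse(ys, [predict_boosted(model, x) for x in xs])
mse_mean_only = mse(ys, [sum(ys) / len(ys)] * len(ys))
```

The comparison at the end is in-sample only, so it says nothing about out-of-sample stability; that would need a holdout set.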

Re: Problems with Linearity Assumptions

Posted: Wed Dec 31, 2014 10:23 pm
by Crow
What would be the objections to modeling defensive rebounds by a different form than most or all of the rest?

Re: Problems with Linearity Assumptions

Posted: Wed Dec 31, 2014 11:42 pm
by xkonk
If memory serves, some potentially helpful predictors for SPM were discarded simply because the results were undesirable (e.g., Dennis Rodman was rated really, really highly). Why not make the executive decision that this non-linearity won't be adjusted for?

Re: Problems with Linearity Assumptions

Posted: Thu Jan 01, 2015 12:46 pm
by DSMok1
xkonk wrote:If memory serves, some potentially helpful predictors for SPM were discarded simply because the results were undesirable (e.g., Dennis Rodman was rated really, really highly). Why not make the executive decision that this non-linearity won't be adjusted for?
That is not the case; BPM was not adjusted to choose predictors based on outputs.

Early on, I experimented with a lot of non-linearity, but it was not stable out of sample at all.

Re: Problems with Linearity Assumptions

Posted: Thu Jan 01, 2015 4:09 pm
by xkonk
I wasn't referring to BPM specifically; I had your ASPM more in mind. I think this thread might be what I was thinking of: viewtopic.php?f=2&t=21&p=34&hilit=Rodman+DRB#p34 . It raises a fair question though: if you were willing to change the form of the regression for one metric, why not another?

Re: Problems with Linearity Assumptions

Posted: Fri Jan 02, 2015 7:12 am
by mtamada
DSMok1 wrote: Early on, I experimented with a lot of non-linearity, but it was not stable out of sample at all.
Yeah, there are a ton of examples where non-linear models have more weaknesses than strengths.

I suspect that this is one of them, or that some mildly alternative functional form will cover most of the non-linearity, and won't lead to radically different overall results.

But I don't know that for sure. When we detect non-linearity, we need to investigate ways of dealing with it. So by all means NickN should continue his research; it's potentially important. But I don't expect it to be, especially after hearing that you looked into it extensively already.


Another way to describe it: to assume a linear functional form is absurd and leads to inherently weak models. But when we use a non-linear functional form, it often (not always, but often) ends up being even worse than the linear one. Typically I'll try to find a simple patch of some sort that deals with the worst of the non-linearities, such as slope dummies (this is assuming that transformations failed to deal with the problem, which is evidently the case here).

Re: Problems with Linearity Assumptions

Posted: Fri Jan 02, 2015 1:25 pm
by DSMok1
xkonk wrote:I wasn't referring to BPM specifically; I had your ASPM more in mind. I think this thread might be what I was thinking of: viewtopic.php?f=2&t=21&p=34&hilit=Rodman+DRB#p34 . It raises a fair question though: if you were willing to change the form of the regression for one metric, why not another?
Right, that was really early on, when I was just starting out. I was going by smell test back then... :)

Re: Problems with Linearity Assumptions

Posted: Fri Jan 02, 2015 9:40 pm
by colts18
I don't think you should take Defensive rebounds away from the regression. It helps identify the good defenders. It even does a good job with perimeter defenders. MJ, Kobe, LeBron, Wade, Carter were all guys with really good DReb for guards and they were generally good defenders. Dreb is a good proxy for height which is correlated with defense for perimeter players.

Here are some suggestions to add to your model.
1. Usage%. I assume usage% is not linear just like Dreb%. Look to see if you can model usage% better. I'm not sure a 40% usage% is 2x more valuable than a 20% usage%.

2. Games played and games started. Games played is a good proxy for injuries which makes players less effective. You should look into it. Games started is a proxy for being a good player plus it does a good job of adjusting for competition faced.

3. Adjust 3 point rate for height. A big man with a high 3 point rate provides more spacing than a guard. Of course you can use that same regression to downgrade those big men on defense because they are generally not good on defense (ex: Ryan Anderson, Mullens, etc.)

4. Adjust free throw rate for usage. I think you didn't get any correlation for free throw rate because big men with low usage rates generally have high FT rates. Give more credit to high usage guys (usage that subtracts FTA) with high FTr. Instead of FTr, try using FTA per 100 possessions. That will reward high FTA guys instead of guys with 2 FTA/g on 3 FGA/g.

5. Look into high-minutes, low-offensive-stat guys without the defensive counting stats (blks, stls). Joe Dumars never had a positive DBPM during his career. Adjust the stat so that high-MPG players like him can get credit for defense if they aren't producing much on offense, because those kinds of players wouldn't be playing 35-40 MPG if they couldn't play offense or defense.

Re: Problems with Linearity Assumptions

Posted: Tue Jan 13, 2015 9:52 pm
by Crow
I asked Nick about it last week. He said he planned to do something this past weekend... but he didn't seem that committed to it. So at this point, I wouldn't count on it.