Incorporating "gamestate" into APM/RAPM, design matrix ?s

A Gravity Well · Post by **A Gravity Well** » Tue Nov 24, 2015 11:29 pm

Sample design matrix for 2v2 game Golden State Warriors AT Atlanta Hawks

I'm using 1/-1 dummy variables for offense/defense not as a matter of preference but so the regression "sees" who the home team is. When Atlanta is on offense, their players are assigned 1s and HCA is set to 1, differentiating HCA from Golden State, whose players are set to -1 on defense. When Atlanta is on defense, their players are assigned -1s and HCA is set to -1, differentiating HCA from Golden State, whose players are set to 1 on offense. This is an attempt to capture the cumulative effect of HCA on both ends of the court: having two columns for hca.o and hca.d results in hca.d being dropped (and rightfully so).

Is it "wrong" to do it like this? The "gamestate" information (in the above example: score margin at the beginning of the possession) is set up the same way, so that if the away team is ahead, the sign of the dummy variable in the respective margin column is OPPOSITE the sign of HCA for that row. In the fifth row (not including headers), the score entering that possession is 5-4 in favor of Golden State, who has the ball. Since Golden State has the ball, HCA is at -1, signifying that any HCA rests with the defense (whose dummy variables are -1), and the margin column m.1 has a positive 1, again signed OPPOSITE of HCA, in an attempt to show that it is the team that isn't at home who has the lead.

Conceptually, I've always seen this as

 (Home Team Rating - HCA) - Road Team Rating = Score Differential

rather than

Home Team Rating - Road Team Rating + Home Court Advantage = Score Differential

as the former understands who the home team is and subtracts home court advantage from them. This is the hurdle I'm trying to clear in implementing gamestate data: shouldn't the regression understand who is up and who is down?
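The signed encoding described in this post can be sketched in Python with NumPy. This is only an illustration under assumptions: the column order, the `encode_possession` helper, and the single `m.1` margin column are hypothetical, and only the sign conventions are taken from the post.

```python
import numpy as np

# Columns: Atlanta players, Golden State players, home-court term, margin dummy.
# Player initials follow the mock-up; the column order is an assumption.
COLUMNS = ["jt", "pm", "ts", "sc", "dg", "ab", "hca", "m.1"]
HOME = {"jt", "pm", "ts"}  # Atlanta is the home team


def encode_possession(offense, defense, leader=None, margin=0):
    """Return one signed design-matrix row.

    offense / defense: initials of the players on the floor.
    leader: "home", "away", or None when the score is tied.
    margin: size of the lead entering the possession.
    """
    row = dict.fromkeys(COLUMNS, 0)
    for player in offense:
        row[player] = 1
    for player in defense:
        row[player] = -1
    # hca carries the home team's sign: 1 on offense, -1 on defense.
    hca = 1 if set(offense) <= HOME else -1
    row["hca"] = hca
    if leader is not None and margin == 1:
        # Same sign as hca when the home team leads, opposite otherwise.
        row["m.1"] = hca if leader == "home" else -hca
    return np.array([row[c] for c in COLUMNS])


# Golden State has the ball and leads 5-4 (the fifth row of the mock-up).
row = encode_possession(["sc", "dg"], ["jt", "pm"], leader="away", margin=1)
print(dict(zip(COLUMNS, row.tolist())))
```

For that possession the sketch sets hca to -1 and m.1 to 1, which matches the fifth row as described in the post.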

Crow · Post by **Crow** » Sat Nov 28, 2015 3:24 am

Anyone have feedback to offer on this?

J.E. · Post by **J.E.** » Tue Dec 01, 2015 9:47 am

This is extremely confusing to me.

What's m, jt, pm, ts, sc, dg, ab?

A Gravity Well · Post by **A Gravity Well** » Tue Dec 01, 2015 9:46 pm

J.E. wrote:This is extremely confusing to me.

What's m, jt, pm, ts, sc, dg, ab?

m <- margin. m2 if team is up 2, m1 if up 1, etc

Remaining initials are those of the players for this example -- Jeff Teague, Paul Millsap, Thabo Sefolosha. I only did 2 vs 2 for the purposes of this mock-up; the actual data set is full 5v5 with unique IDs for each column.

permaximum · Post by **permaximum** » Wed Dec 02, 2015 12:31 am

I have never used HCA as a variable at the regression stage to calculate APM/RAPM since I haven't cared about it. I doubt J.E. does it this way, but if I added HCA as a variable in the regression, I would do it differently than what you describe.

HCA.o = 1, HCA.d = 0 on offense along with home players and

HCA.o = 0, HCA.d = -1 on defense along with home players. Having two columns for HCA shouldn't be a problem. Think about it as another player. The same principle goes for gamestate information, but it should be a little bit more complex. There are possibly diminishing returns and so on, so implementing it as a linear regression variable (whether it's penalized or not) doesn't sound right to me. I don't know how J.E. does it.

Still I have a feeling with "gamestate" you just mean the score margin not the gamestate adjustment on player values such as the effect of leading by 9 with 4 mins left or trailing by 38 with 21 mins left. The gamestate adjustment I mentioned can't be linear.

As for those formulas, the latter is true if I guessed what "rating" means correctly. The home team's rating shouldn't have HCA in it when you adjust for that.

Also, I agree that your post was rather confusing. Hopefully, I understood it right.

EvanZ · Post by **EvanZ** » Wed Dec 02, 2015 1:17 am

Why are you treating the "game state" (point margin) as a dummy variable and not just a continuous variable? Do you have reason to believe the effect (if any) will be non-linear? I mean, maybe it will be, but are you really going to have a column for each possible value of point margin? Seems like overkill and possibly quite prone to overfitting.

A Gravity Well · Post by **A Gravity Well** » Wed Dec 02, 2015 2:53 am

permaximum wrote:I have never used HCA as a variable at the regression stage to calculate APM/RAPM since I haven't cared about it. I doubt J.E. does it this way, but if I added HCA as a variable in the regression, I would do it differently than what you describe.

What are the alternatives to not including it as a variable?

permaximum wrote:HCA.o = 1, HCA.d = 0 on offense along with home players and

HCA.o = 0, HCA.d = -1 on defense along with home players. Having two columns for HCA shouldn't be a problem. Think about it as another player. The same principle goes for gamestate information, but it should be a little bit more complex. There are possibly diminishing returns and so on, so implementing it as a linear regression variable (whether it's penalized or not) doesn't sound right to me. I don't know how J.E. does it.

Still I have a feeling with "gamestate" you just mean the score margin not the gamestate adjustment on player values such as the effect of leading by 9 with 4 mins left or trailing by 38 with 21 mins left. The gamestate adjustment I mentioned can't be linear.

For this mock-up, I meant score margin, but "the effect of leading by 9 with 4 mins left or trailing by 38 with 21 mins left" was next on the list for implementation. Guess I have to re-think HOW to implement that, then.

permaximum wrote:As for those formulas, the latter is true if I guessed what "rating" means correctly. The home team's rating shouldn't have HCA in it when you adjust for that.

Also, I agree that your post was rather confusing. Hopefully, I understood it right.

You have it right. Since posting I've come to terms with the "seeing" conundrum I'd inflicted upon myself.

A Gravity Well · Post by **A Gravity Well** » Wed Dec 02, 2015 3:01 am

EvanZ wrote:Why are you treating the "game state" (point margin) as a dummy variable and not just a continuous variable? Do you have reason to believe the effect (if any) will be non-linear? I mean, maybe it will be, but are you really going to have a column for each possible value of point margin? Seems like overkill and possibly quite prone to overfitting.

The idea was to bin combinations of point margin and time remaining (up 29 with 3 left, down 7 with 4 left, and so on). If being up anywhere from 15 to 30 with 2 minutes left had essentially the same effect, those states would all be placed in one bin; if being up 9 and up 10 with 11 minutes left had the same effect, they would share a bin, and so forth.
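A minimal sketch of that kind of binning, using `numpy.digitize`. The bin edges below are hypothetical placeholders; in practice they would come from checking which margin and time combinations actually behave alike.

```python
import numpy as np

# Hypothetical bin edges (assumptions, not values from the thread).
MARGIN_EDGES = np.array([-15, -5, 0, 5, 15])  # points, offense's point of view
TIME_EDGES = np.array([2, 6, 12, 24])         # minutes remaining in the game


def gamestate_bin(margin, minutes_left):
    """Map a (margin, minutes left) pair to a single integer bin id."""
    m = np.digitize(margin, MARGIN_EDGES)      # 0 .. len(MARGIN_EDGES)
    t = np.digitize(minutes_left, TIME_EDGES)  # 0 .. len(TIME_EDGES)
    return int(m * (len(TIME_EDGES) + 1) + t)


# Up 15 and up 30 with 2 minutes left fall in the same bin under these edges.
print(gamestate_bin(15, 2) == gamestate_bin(30, 2))
```

Each bin id would then become one dummy column in the design matrix, so the number of columns is set by the number of bins rather than by every possible margin.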

J.E. · Post by **J.E.** » Wed Dec 02, 2015 9:54 am

I'm more a fan of having a single column for HCA and toggling it for home (1) and away (0) possessions

Or, adjust the results vector depending on if it was a home or away possession (probably better with Ridge and small sample sizes)

As for your m's, I would not put -1s in there. Just switch the m's on/off, from m.down_30 through m.up_30
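One way to read this suggestion is as plain 0/1 columns: a single HCA flag and one-hot margin columns. The sketch below assumes the margin is taken from the offense's point of view, that margins beyond 30 are folded into the end columns, and that a tied score gets its own `m.tied` column; none of those details are stated in the thread.

```python
import numpy as np

MAX_MARGIN = 30  # margins beyond +/-30 share the end columns (assumption)
MARGIN_COLS = (
    [f"m.down_{k}" for k in range(MAX_MARGIN, 0, -1)]
    + ["m.tied"]  # hypothetical column for a tied score
    + [f"m.up_{k}" for k in range(1, MAX_MARGIN + 1)]
)


def margin_dummies(margin):
    """One-hot row: exactly one margin column is 1, every other is 0."""
    clipped = int(np.clip(margin, -MAX_MARGIN, MAX_MARGIN))
    row = np.zeros(len(MARGIN_COLS), dtype=int)
    row[clipped + MAX_MARGIN] = 1
    return row


def hca_flag(home_on_offense):
    """Single HCA column: 1 for home possessions, 0 for away possessions."""
    return 1 if home_on_offense else 0


# Offense trails by 7: only the m.down_7 column is switched on.
active = MARGIN_COLS[int(np.argmax(margin_dummies(-7)))]
print(active)
```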

permaximum · Post by **permaximum** » Wed Dec 02, 2015 10:18 am

EvanZ wrote:Why are you treating the "game state" (point margin) as a dummy variable and not just a continuous variable? Do you have reason to believe the effect (if any) will be non-linear? I mean, maybe it will be, but are you really going to have a column for each possible value of point margin? Seems like overkill and possibly quite prone to overfitting.

In theory, it shouldn't be linear. I haven't tested it, but I'm 100% sure on this.

Q: What's the effect of leading by 24 and trailing by 19 with 1:24 mins left? What if the garbage time lineup is on the floor or if they are not?

A Gravity Well wrote:What are the alternatives to not including it as a variable?

Calculate HCA at the team level with SRS home/away difference compared to league average, but don't forget to include the effect of B2B. This way you'll have all the data you need for all arenas instead of being limited to PBP data 2000+ (1996+ if you're lucky). When you find HCA (per 100 offensive and defensive possessions) for all arenas, adjust possession scores (the results vector J.E. pointed out above) in the regression, e.g. 300 (possession score * 100) - 2.4 (HCA) with the home team on offense, or 400 + 2.4 (HCA) with the home team on defense. This way you won't have different values for HCA on defense and offense, but you'll have a more reliable value.

BTW score/mov/rating (whatever you call it) for each poss should always be positive in your regression if you're following what I described.

Edit: If I were you I would just use the values J.E. found in this thread. There are HCA values for teams along with B2B and rest effects. Since he calculated it via one big regression (2002-2014, I guess), the sample size should be enough. Then I would do what I described above.

Edit2: It looks like those HCA values are not per 100 possessions but per game. So you simply need to do: "HCA*2/1.95". Remove the Charlotte Hornets from the list and use the Charlotte Bobcats' value. Then normalize those values. HCA values should never be negative.
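The results-vector adjustment described in this post might look like the following sketch. The function names are hypothetical, and the `HCA*2/1.95` conversion is reproduced exactly as written above; the reasoning behind its constants is not spelled out in the thread.

```python
def per_game_to_per_100(hca_per_game):
    """Conversion quoted in the post: HCA * 2 / 1.95."""
    return hca_per_game * 2 / 1.95


def adjust_possession_score(points, hca_per_100, home_on_offense):
    """Take home-court advantage out of one possession's result.

    points: points scored on the possession.
    hca_per_100: the arena's HCA on a per-100-possession scale.
    Returns the possession score (points * 100) with HCA removed.
    """
    score = points * 100
    if home_on_offense:
        return score - hca_per_100  # home offense gets slightly less credit
    return score + hca_per_100      # road offense gets the HCA added back


# The two examples from the post, with HCA = 2.4:
print(adjust_possession_score(3, 2.4, home_on_offense=True))
print(adjust_possession_score(4, 2.4, home_on_offense=False))
```

Because the adjustment is applied to the target rather than fitted as a column, the regression itself no longer needs an HCA variable.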

A Gravity Well · Post by **A Gravity Well** » Wed Dec 02, 2015 7:03 pm

J.E. wrote:I'm more a fan of having a single column for HCA and toggling it for home (1) and away (0) possessions

If you were to do two columns for HCA, how would you do it? The 1/-1 way (it looks good to me)? For matters of prediction/interaction with other units, knowing a team plays the same on offense but is -3 on defense would be useful information.

J.E. wrote:As for your m's, I would not put -1s in there. Just switch the m's on/off, from m.down_30 through m.up_30

Do you do any time-and-margin considerations?

EvanZ · Post by **EvanZ** » Wed Dec 02, 2015 7:12 pm

Yeah, I have to agree with JE about the HCA factor. I don't see any good reason to use two columns. It's not two different factors. And if you treat it as two different factors, they are 100% correlated, so it doesn't add any information.

Also, why not just use win probability as a proxy for game state? That's how I would start doing this.
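The point about the two HCA columns being perfectly correlated can be checked numerically. The sketch below assumes the 1/0 and 0/-1 coding proposed earlier in the thread and a made-up sequence of eight possessions.

```python
import numpy as np

# 1 = home team on offense, 0 = home team on defense (made-up sequence).
home_on_offense = np.array([1, 0, 1, 0, 1, 1, 0, 0])

hca_o = home_on_offense.astype(float)  # 1 on home offense, otherwise 0
hca_d = hca_o - 1.0                    # -1 on home defense, otherwise 0

corr = np.corrcoef(hca_o, hca_d)[0, 1]

# With an intercept the three columns are linearly dependent, because
# hca_o - hca_d equals the intercept column on every row.
X = np.column_stack([np.ones_like(hca_o), hca_o, hca_d])
rank = np.linalg.matrix_rank(X)
print(corr, rank)
```

The rank comes out one short of the number of columns, which is consistent with one of the two HCA columns being dropped when the model includes an intercept.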

permaximum · Post by **permaximum** » Wed Dec 02, 2015 8:11 pm

Let alone using HCA as one column in the regression, I wouldn't even bother to capture its effect while calculating RAPM unless I were running a regression on a data set like 1996-2015. It's overkill. However, that doesn't mean it's useless. I don't remember if somebody tested HCA's effect separately for both ends of the floor. It shouldn't be meaningfully different, but I would like to see a test.

A Gravity Well · Post by **A Gravity Well** » Fri Dec 04, 2015 8:28 am

EvanZ wrote:Also, why not just use win probability as a proxy for game state? That's how I would start doing this.

I love that.

Another question:

Last night, the Spurs played the second game of a back to back at Memphis. The night before, they were home against Milwaukee, making this a HomeRoad back-to-back. Memphis had a full day of rest, last playing on the road on Tuesday in New Orleans. Are rest effects "interaction terms"? Or are rest effects just another gamestate/HCA-like column, where the effect is only switched on when the team suffering from the back to back or the team enjoying the day of rest is on offense (but then how does the regression know what the defense is suffering from or enjoying)? Or should I ask this question of J.E. at the previously linked thread on HCA/rest?

A Gravity Well · Post by **A Gravity Well** » Wed Dec 09, 2015 12:35 am

permaximum wrote:
A Gravity Well wrote:What are the alternatives to not including it as a variable?
Calculate HCA at the team level with SRS home/away difference compared to league average, but don't forget to include the effect of B2B. This way you'll have all the data you need for all arenas instead of being limited to PBP data 2000+ (1996+ if you're lucky). When you find HCA (per 100 offensive and defensive possessions) for all arenas, adjust possession scores (the results vector J.E. pointed out above) in the regression, e.g. 300 (possession score * 100) - 2.4 (HCA) with the home team on offense, or 400 + 2.4 (HCA) with the home team on defense. This way you won't have different values for HCA on defense and offense, but you'll have a more reliable value.

BTW score/mov/rating (whatever you call it) for each poss should always be positive in your regression if you're following what I described.

Edit: If I were you I would just use the values J.E. found in this thread. There are HCA values for teams along with B2B and rest effects. Since he calculated it via one big regression (2002-2014, I guess), the sample size should be enough. Then I would do what I described above.

Edit2: It looks like those HCA values are not per 100 possessions but per game. So you simply need to do: "HCA*2/1.95". Remove the Charlotte Hornets from the list and use the Charlotte Bobcats' value. Then normalize those values. HCA values should never be negative.

If I'm understanding you and the fine posters here correctly:

To calculate HCA for each team over N years:

1/0/-1 dummy variables: 1 for Offense, -1 for Defense

Result vector of points per 100 possessions for the offensive unit of the matchup

b2b.rh.ot = second night of a back to back (b2b) of the road/home (rh) variety, first game finished in overtime (ot)
r.2.r = rest (r) of last playing two (2) days ago, last game played on the road (r)

First game: Boston Celtics @ Atlanta Hawks
Atlanta last played three days ago on the road, r.3.r is switched on (1 when on offense, -1 when on defense)
Boston last played two days ago at home, r.2.h is switched on (1 when on offense, -1 when on defense)
hca.o is a 1 when Atlanta is on offense
hca.d is a -1 when Atlanta is on defense
Atlanta tallies a 105.7 offensive rating
Boston tallies a 99.9 offensive rating
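As a sketch, the Boston @ Atlanta game above would contribute two rows to a team-level design matrix, one per offense. The `ATL` and `BOS` team columns and the column order are assumptions for illustration; the rest and HCA signs follow the description above.

```python
import numpy as np

COLUMNS = ["ATL", "BOS", "r.3.r", "r.2.h", "hca.o", "hca.d"]

# Row 0: Atlanta on offense, Boston on defense.
# Row 1: Boston on offense, Atlanta on defense.
X = np.array([
    # ATL  BOS  r.3.r  r.2.h  hca.o  hca.d
    [  1,  -1,    1,    -1,     1,     0],
    [ -1,   1,   -1,     1,     0,    -1],
])

# Result vector: points per 100 possessions for the offensive unit.
y = np.array([105.7, 99.9])

for row, rating in zip(X, y):
    print(dict(zip(COLUMNS, row.tolist())), rating)
```

Note that the team and rest columns cancel exactly between the two rows, while the hca.o and hca.d entries do not.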

Modifications
J.E. mentioned not using 1/-1 for the same variable -- should rest effects, then, be split into offensive and defensive halves?
Don't use hca.o and hca.d for each team -- just use one column "hca", set to 1 when the home offense is on the court
Go possession by possession rather than game by game to account for gamestate considerations (More on this to come)
Travel effects, but implementation seems beyond arduous -- would need a team's travel schedule to track whether they return home between road games four days apart. Would then need to track whether they arrive the day of or the day before or earlier. Have columns that are bins of distances from home or distances from last played game? 0-150, 151-300...

And then
After getting the values for rest effects and each team's home court over N years, when running RAPM for players for a specific year, adjust the result vector for each possession, or for each lineup stint (a series of possessions), by the values of the previously found effects present there: home court, rest effects, travel effects (?) and gamestate. (If using gamestate, likely just go possession by possession, as a stint can have multiple gamestates within it.)
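A rough sketch of that two-stage idea: subtract the previously estimated context effects from each possession's result, then fit player ratings with ridge regression. The function names, the penalty value, and the toy data are all hypothetical; centering X and y stands in for an unpenalized intercept.

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Closed-form ridge regression with an unpenalized intercept,
    obtained by centering X and y before solving."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    n_features = X.shape[1]
    return np.linalg.solve(Xc.T @ Xc + lam * np.eye(n_features), Xc.T @ yc)


def adjusted_rapm(X_players, points, context_effects, lam=1.0):
    """Remove known context effects, then estimate player ratings.

    X_players: (n_possessions, n_players), +1 on offense / -1 on defense.
    points: points scored on each possession.
    context_effects: per-100 effect favoring the offense on each possession
        (home court, rest, gamestate) taken from the earlier regression.
    """
    y = 100.0 * np.asarray(points, dtype=float)
    y = y - np.asarray(context_effects, dtype=float)
    return ridge_fit(np.asarray(X_players, dtype=float), y, lam)


# Toy data: two "players" who always oppose each other.
X_toy = [[1, -1], [-1, 1], [1, -1], [-1, 1]]
ratings = adjusted_rapm(X_toy, points=[1, 0, 1, 0], context_effects=[0, 0, 0, 0])
print(ratings)
```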

To drill down further, run the same regression as earlier which found each team's home court advantage, but run it for general league-wide home court advantage instead of team by team -- not for the numbers themselves, but for the ratio between the offensive and defensive values. Use that ratio to then divide up each team's HCA among offensive and defensive possessions when adjusting the result vector.
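The ratio-based split could be sketched as follows. The `split_hca` helper and the example numbers are hypothetical; nothing here is an estimate from real data.

```python
def split_hca(team_hca, league_hca_off, league_hca_def):
    """Split a team's total HCA into offensive and defensive parts,
    using the league-wide offense-to-defense ratio."""
    total = league_hca_off + league_hca_def
    if total == 0:
        raise ValueError("league-wide HCA components sum to zero")
    off_share = league_hca_off / total
    return team_hca * off_share, team_hca * (1 - off_share)


# Made-up league values giving a 60/40 split of a team HCA of 3.0.
team_off, team_def = split_hca(3.0, league_hca_off=1.5, league_hca_def=1.0)
print(team_off, team_def)
```

The two parts always sum back to the team's total, so only the allocation between offensive and defensive possessions depends on the league-wide ratio.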
