Thursday, June 14, 2018

"Is Your Opponent Underrated?" methodology

The complete list of tournaments used in the sample:

Atlantic Open 2017 (U2100, U1900, U1700, U1500)
Bradley Open 2017 (U2100, U1800, U1500)
Cherry Blossom Classic 2018 (U2200, U1900, U1600)
Chesapeake Bay Open 2018 (U2200, U1800, U1600)
Chicago Open 2018 (U2300, U1900, U1700, U1500)
Chicago Class 2017 (Expert, A, B, C)
National Chess Congress 2017 (U2200, U2000, U1800, U1600, U1400)
Continental Open 2017 (U2100, U1900, U1700, U1500)
US Amateur East Individual 2018 (U2200, U1800, U1400)
Eastern Class 2018 (Expert, A, B, C)
Eastern Chess Congress 2017 (U2100, U1900, U1700, U1500)
Evans Memorial 2017 (Expert, A, B, C)
George Washington Open 2017 (U2100, U1800, U1500)
Kings Island Open 2017 (U2100, U1900, U1700, U1500)
Liberty Bell Open 2018 (U2100, U1900, U1700, U1500)
Manhattan Open 2017 (U2200, U2000, U1800, U1600, U1400)
National Open 2017 (U2300, U2100, U1900, U1700, U1500)
North American Open 2017 (U2300, U2100, U1900, U1700, U1500)
North Eastern Open 2017 (U2050, U1650)
Pacific Coast Open 2017 (U2100, U1900, U1700, U1500)
Pan-American Intercollegiate 2017
Philadelphia Open 2018 (U2200, U2000, U1800, U1600, U1400)
Potomac Open 2017 (U2300, U2100, U1900, U1700, U1500)
Southern Open 2017 (U2100, U1800, U1500)
Southwestern Class 2018 (Expert, A, B, C)
US Amateur Team North 2018
US Open 2017
World Open 2017 (U2200, U2000, U1800, U1600)
World Amateur Team 2017

Total = 22,828 games
________________________________________________________________

You can see that I often dropped the open section. Why is that? GMs and IMs frequently travel to large tournaments, so their opponents come from all over the country. Thus, their ratings will probably not be affected by local inflation or deflation. Regional differences in ratings will only be detected among players who compete primarily in local tournaments, as explained in the article. I also typically dropped the bottom sections of the tournaments, since ratings are very volatile at those levels.

Define the "Elo residual" to be (actual score - expected score). Let "D" be (Player A's rating - Player B's rating), i.e., the difference in the ratings. Then Player A's expected score against Player B is:

Expected score = 1 / (1 + 10^(-D/400))

This comes from Section 4.2 of "The US Chess Rating System" by Glickman and Doan (April 24, 2017), a description of the USCF rating system.

Example: In the article, I said that a 1700's expected score against a 1500 is about 0.75. Let's suppose that the 1700 won the game. In that case, his actual score is 1. Therefore, his Elo residual = actual score - expected score = 1 - 0.75 = 0.25. We can say that the 1700 scored 0.25 more points than expected. On average, the Elo residual is zero.
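The example above can be checked with a few lines of Python. This is only a sketch: the function names are mine, and it assumes the standard logistic expected-score formula with a 400-point scale.

```python
def expected_score(rating_a, rating_b):
    """Expected score of Player A against Player B (logistic, 400-point scale)."""
    d = rating_a - rating_b  # "D" in the text
    return 1.0 / (1.0 + 10.0 ** (-d / 400.0))


def elo_residual(actual_score, rating_a, rating_b):
    """Actual score minus expected score, from Player A's point of view."""
    return actual_score - expected_score(rating_a, rating_b)


# A 1700 beats a 1500: the expected score is roughly 0.76,
# so the residual for the win is roughly 0.24.
print(round(expected_score(1700, 1500), 2))
print(round(elo_residual(1.0, 1700, 1500), 2))
```

Note that the two players' expected scores always sum to 1, so Player B's residual in the same game is the negative of Player A's.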

This brings us to a crucial idea. If a player is truly underrated, then he should score better than his rating indicates when he goes to a national tournament. In other words, his Elo residual should be positive. If players from a certain state consistently have positive Elo residuals, then that state is underrated.

I created dummy variables for all 50 states, British Columbia, Ontario, and "other" (for foreign players). For Player A, the dummy variable for a state equals 1 if Player A is from that state and -1 if Player A's opponent is from that state. Exception: if both players are from the same state, then that state's dummy is zero. In all other cases, the dummy is zero.

Example: Suppose that Player A is from Montana. He faces an opponent from Kansas. The dummy variable for Montana equals 1. The dummy variable for Kansas is -1. The dummy variables for all the other states are 0.
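Here is a minimal sketch of that coding rule in Python. The short region list and the function names are hypothetical and only for illustration; the real sample has a dummy for every state plus British Columbia, Ontario, and "other".

```python
# Hypothetical subset of region codes, for illustration only.
REGIONS = ["MT", "KS", "CA", "other"]


def to_region(state):
    """Map anything outside the known list to the catch-all "other" category."""
    return state if state in REGIONS else "other"


def dummy_row(state_a, state_b):
    """Return {region: dummy} for one game, seen from Player A's side."""
    a, b = to_region(state_a), to_region(state_b)
    row = {region: 0 for region in REGIONS}
    if a != b:          # same-region games leave every dummy at zero
        row[a] = 1      # Player A's home region
        row[b] = -1     # the opponent's home region
    return row


print(dummy_row("MT", "KS"))  # {'MT': 1, 'KS': -1, 'CA': 0, 'other': 0}
```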

The first step would be to estimate a linear regression of the following form:

Elo residual = (BetaAlabama)(Alabama dummy) + (BetaAlaska)(Alaska dummy) + (BetaArizona)(Arizona dummy) + ... + epsilon

Here, "BetaAlabama" is the coefficient on the Alabama dummy; that is what the model is trying to estimate. If Alabama players are underrated, then "BetaAlabama" will be positive. That's because when "BetaAlabama" is positive, then Alabama players have positive Elo residuals, which means that they outperform their ratings. It also means that players facing an Alabama opponent will tend to underperform. "BetaAlaska" is the coefficient on the Alaska dummy, "BetaArizona" is the coefficient on the Arizona dummy, and so on for all the other states. As usual, epsilon is the disturbance term.
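To make the regression concrete, here is a toy sketch using NumPy least squares. Everything in it is made up for illustration: the three regions, the six games, and the "true" coefficients are not from my sample. One assumption to flag: because every row contains a +1 and a -1, the full set of dummies is collinear, so the sketch leaves out "other" as the baseline category and the coefficients are measured relative to it.

```python
import numpy as np

# Hypothetical regions with a column in the design matrix.
# "other" is the omitted baseline, so it gets no column.
regions = ["AL", "AK", "AZ"]


def dummy_row(state_a, state_b):
    """+1 for Player A's region, -1 for the opponent's, 0 elsewhere."""
    row = [0.0] * len(regions)
    if state_a != state_b:
        if state_a in regions:
            row[regions.index(state_a)] = 1.0
        if state_b in regions:
            row[regions.index(state_b)] = -1.0
    return row


# Invented games: (Player A's region, opponent's region). Not real data.
games = [("AL", "AK"), ("AK", "AZ"), ("AZ", "other"),
         ("other", "AL"), ("AL", "AZ"), ("AK", "other")]

# Invented coefficients, used only to generate noise-free residuals.
true_beta = np.array([0.04, -0.02, 0.01])

X = np.array([dummy_row(a, b) for a, b in games])
y = X @ true_beta  # Elo residuals with the disturbance term set to zero

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(dict(zip(regions, beta_hat.round(3))))
```

With no noise, least squares recovers the invented coefficients exactly. With real game results, the estimates come with standard errors, which is where the confidence intervals come from.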

However, there is one issue with this approach. Consider someone who plays 9 games in the US Open. Each game is one observation in my sample. However, those observations might not be independent, which can lead to - how do I say this in plain English? - let's say it leads to problems. If you don't have a stats background and you have read this far, I admire your persistence and curiosity. Unfortunately, the rest of this article isn't going to make much sense if you haven't taken - at the very least - an advanced undergrad course in stats. Preferably a graduate level course.

To correct for these issues, I can insert fixed effects or random effects for each player in each tournament:

EloResidual_ij = Delta + alpha_i + (BetaAlabama)(Alabama dummy_ij) + (BetaAlaska)(Alaska dummy_ij) + ... + epsilon_ij

Delta is the intercept and alpha_i is the fixed or random effects term for player i. EloResidual_ij is the Elo residual in the game between player i and player j. Due to multicollinearity, the dummy for "other" was dropped.

A Lagrange multiplier test soundly rejected the null of no random effects (test statistic = 279.13, p-value < 0.0001). Then I performed a Hausman test (test statistic = 43.87, p-value = 0.7164); the assumptions of the random effects model were not rejected. Therefore, I based my results on the random effects model.

One last issue. The coefficients in the model above are in terms of the Elo residual. E.g., the intercept plus the coefficient for Washington state was about 0.05, which means that Washington players tend to score 0.05 points more than their ratings would indicate.


In order to convert this into rating points, I took a first-order Taylor series approximation of the expected score formula. The rating adjustment for state k is the following.


This transformation was also applied to the confidence intervals in order to generate the second graph.
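For readers who want to see the size of the conversion, here is a rough sketch. It assumes the Taylor expansion is taken around D = 0 (two evenly matched players), where the slope of the logistic expected-score curve is ln(10)/1600 per rating point. The expansion point and the function names are my assumptions for this illustration. The second function inverts the expected-score formula exactly, for comparison.

```python
import math

# Slope of the expected-score curve at D = 0 (assumed expansion point):
# dE/dD = ln(10)/400 * E * (1 - E), with E = 0.5.
SLOPE_AT_ZERO = math.log(10) / 1600.0


def residual_to_rating_points(residual):
    """First-order (linear) conversion of an Elo residual into rating points."""
    return residual / SLOPE_AT_ZERO


def residual_to_rating_points_exact(residual):
    """Exact rating gap that turns an expected score of 0.5 into 0.5 + residual."""
    e = 0.5 + residual
    return 400.0 * math.log10(e / (1.0 - e))


# A residual of 0.05 (the Washington example) is roughly 35 rating points
# under either version.
print(round(residual_to_rating_points(0.05), 1))
print(round(residual_to_rating_points_exact(0.05), 1))
```

Because the linear conversion is monotone, applying it to the two endpoints of a confidence interval for the residual gives an interval in rating points.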

________________________________________________________________

UPDATE: It looks like some of the graphs had to be cut from the original article. It's probably because the magazine was running out of space. The figures below show the 95% confidence intervals for each state. Here is an example of how to interpret them. Take the first state, Alabama (AL). Our best estimate is that their players are underrated by about 30 points. However, they could be anywhere from 10 points overrated to as much as 70 points underrated. There is a fair amount of uncertainty with smaller states. That is because their sample sizes are small. I get a much bigger sample from states like California (CA), so their confidence interval is not so wide.






________________________________________________________________

UPDATE 2: Due to a software upgrade, I can now estimate a nonlinear model with random effects.


Though this bypasses the need for a linear approximation, there are other challenges. A key assumption is that the random effects are uncorrelated with the regressors. In the linear model, I verified that with a Hausman test. That doesn't work for nonlinear models because the fixed effects estimator is not consistent. The results for the nonlinear model are in the graphs below, but be aware that they rest on an assumption that I haven't been able to test.




2 comments:

  1. Regarding inflation over time, and your comment in the Chess Life article "if I want to know if a 2100 today is the same as a 2100 in 1990, I have no way to arrange a match between them", it occurs to me that you could actually use computer/software players rated against human players to "travel back in time". It should be possible to use a hardware emulator to replicate the original environment and ensure stable playing strength.

    1. I thought about that, but I don't know of any computers that played regularly in human tournaments. Usually programs come with an estimate of their strength, but those estimates can be very unreliable. For example, when I was 1900, I could score around 30-40% against a computer that was supposedly 2450.
