Friday, May 6, 2022

New Methodology

Before the Carlsen-Nepo match, I had noticed that my model was underestimating the draw rate. I improved it before the World Championship, but there was more work to do. The original model was estimated on a large database in which most games involved strong amateurs rated in the 2300s and 2400s; elite tournaments were a tiny minority, so the model was not tuned to them. However, my forecasts are always for top tournaments. The first step was a better sample: I kept only games where both players were rated 2500+, then estimated the model on games from 2010-2021 in which at least one player was 2700+.
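The sample construction above can be sketched in pandas. The column names here are assumptions for illustration, not the actual database schema:

```python
import pandas as pd

# Toy stand-in for the game database; column names are hypothetical.
games = pd.DataFrame({
    "white_elo": [2760, 2520, 2450, 2710],
    "black_elo": [2705, 2510, 2720, 2650],
    "year":      [2015, 2012, 2019, 2009],
})

# Keep games where both players are 2500+ ...
sample = games[(games.white_elo >= 2500) & (games.black_elo >= 2500)]
# ... at least one player is 2700+ ...
sample = sample[(sample.white_elo >= 2700) | (sample.black_elo >= 2700)]
# ... and the game was played in 2010-2021.
sample = sample[sample.year.between(2010, 2021)]
```

Only the first toy game survives all three filters: the second has no 2700+ player, the third has a sub-2500 player, and the fourth falls outside the period.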

The rest of the details are more technical. Each chess game has three possible outcomes, which suggests an ordered logit model. However, several sources report that the proportional odds assumption is often violated in chess data, so I used a generalized ordered logit instead. The explanatory variables were:

- the rating gap (i.e. White's rating - Black's rating)

- the average rating of the two players

- the year in which the game was played

- "elite": equals 1 if one player is 2750+ and the other isn't; otherwise 0. This variable could matter if 2750s are overrated. Normally, an overrated player's rating should self-correct: when it sits above their true strength, they can't perform well enough to maintain it. However, top players mostly compete in round robins against each other rather than mixing with the rest of the pool, and if overrated players only face each other, the ratings don't adjust.

- interaction and quadratic terms (gap x avg, gap x year, avg x year, elite x gap, elite x avg, elite x year, gap^2, avg^2, year^2)
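As a sketch of what "generalized ordered logit" means here: the proportional-odds constraint is dropped by fitting a separate binary logit to each cumulative split of the outcome, so every variable gets its own coefficient at each threshold. Everything below (data, coefficients, cutoffs) is synthetic illustration, not the actual estimates:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5000
# Hypothetical features mirroring the post's variables.
gap   = rng.normal(0, 50, n)                    # White Elo - Black Elo
avg   = rng.normal(2700, 60, n)                 # average rating of the pair
year  = rng.integers(2010, 2022, n).astype(float)
elite = (rng.random(n) < 0.1).astype(float)     # crude stand-in indicator

X = np.column_stack([gap, avg, year, elite,
                     gap*avg, gap*year, avg*year,
                     elite*gap, elite*avg, elite*year,
                     gap**2, avg**2, year**2])
X = (X - X.mean(axis=0)) / X.std(axis=0)        # standardize for the solver

# Synthetic outcomes: 0 = Black wins, 1 = draw, 2 = White wins.
latent = 0.004 * gap + rng.logistic(0, 1, n)
y = np.digitize(latent, [-1.0, 1.0])

# Generalized ordered logit as two unconstrained cumulative logits:
# model P(y >= 1) and P(y >= 2) with separate coefficient vectors.
m_ge1 = LogisticRegression(max_iter=2000).fit(X, (y >= 1).astype(int))
m_ge2 = LogisticRegression(max_iter=2000).fit(X, (y >= 2).astype(int))

def predict_wdl(Xnew):
    p_ge1 = m_ge1.predict_proba(Xnew)[:, 1]
    p_ge2 = m_ge2.predict_proba(Xnew)[:, 1]
    # Note: without constraints the implied draw probability can go
    # negative; dedicated gologit software handles this more carefully.
    return np.column_stack([1 - p_ge1, p_ge1 - p_ge2, p_ge2])

probs = predict_wdl(X)   # columns: P(loss), P(draw), P(win) for White
```

The two cumulative models share nothing, which is the fully unconstrained case; software like Stata's gologit2 also allows constraining some variables to satisfy proportional odds while freeing others.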

The terms involving Elite were jointly statistically significant (p-value = 0.0004). Here are the results (N = 25534 games):



Interpreting ordered logit coefficients is notoriously difficult. Based on this table, I can't say whether Elite players are overrated or underrated, but the Elite variable is clearly significant.

Next, I checked whether the new model fixes the original problem of underestimating the draw rate. In games where both players are 2750+, the model's draw rate is only 0.19% below the observed rate, a statistically insignificant difference (p-value = 0.832). Though the model is aimed at 2700s, an organizer will sometimes invite a local 2600 to an elite round robin. Does the model still work for them? I tested it in the subsample where one player is 2700+ and the other isn't: the model's draw rate is 0.14% too high, again statistically insignificant (p-value = 0.709).
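One way to run this kind of calibration check (a sketch of the idea, not necessarily the exact test used here): compare the mean predicted draw probability with the observed draw frequency, treating games as independent Bernoulli trials under the model:

```python
import math
import numpy as np

def draw_calibration(pred_draw, is_draw):
    """pred_draw: model draw probabilities; is_draw: 0/1 draw outcomes."""
    diff = pred_draw.mean() - is_draw.mean()
    # Variance of the gap under the model, games treated as independent.
    var = np.sum(pred_draw * (1 - pred_draw)) / len(pred_draw) ** 2
    z = diff / math.sqrt(var)
    p = math.erfc(abs(z) / math.sqrt(2))   # two-sided normal p-value
    return diff, p

# Synthetic demo: outcomes drawn from the predicted probabilities,
# so the "model" is well calibrated by construction.
rng = np.random.default_rng(1)
pred = rng.uniform(0.4, 0.7, 2000)
outc = (rng.random(2000) < pred).astype(float)
diff, p = draw_calibration(pred, outc)
```

On calibrated synthetic data like this, `diff` should hover near zero and `p` should rarely be small.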

The new model estimates the win, draw, and loss probabilities directly. The traditional model estimated only the draw rate; the win and loss probabilities were then extrapolated from Elo's expected score formula. The new model therefore lets me test the validity of the expected score formula. I will discuss that in a different post, since most readers probably gave up somewhere around the second paragraph.
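For reference, the extrapolation step in the traditional approach works like this: Elo's formula gives White's expected score E, and since a game scores 1, 0.5, or 0, we have E = P(win) + 0.5·P(draw), so a draw estimate pins down the other two probabilities:

```python
def expected_score(gap):
    """Elo expected score for White, given gap = White Elo - Black Elo."""
    return 1 / (1 + 10 ** (-gap / 400))

def traditional_wdl(gap, draw_rate):
    """Extrapolate win/loss from a draw estimate via E = P(win) + draw/2."""
    e = expected_score(gap)
    win = e - 0.5 * draw_rate
    return win, draw_rate, 1 - win - draw_rate

expected_score(0)          # equal ratings -> 0.5
traditional_wdl(0, 0.5)    # -> (0.25, 0.5, 0.25)
```

The new model needs no such identity, which is exactly what makes the comparison a test of the expected score formula.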

