Monday, May 30, 2022
Updated forecast after Tari replaced Rapport
Friday, May 20, 2022
Norway Chess 2022
Anand returns to classical chess! Though many players are skipping the tournament in order to focus on the Candidates, the field is still very strong.
Saturday, May 7, 2022
Are 2700s overrated? Insights from the new model
In my last post, I described the new model. Now we will look at more results, though this time I will try to stay away from all the technical jargon.
If you perform better than your expected score, your rating goes up. But is the expected score formula accurate? The model can tell us. Here is the graph for a 2650 player facing opponents rated from 2600 to 2900. When a 2650 plays another 2650, the expected score is 0.5. Not surprising - when you play someone with the same rating, you have equal chances. The model and Elo's formula agree on that. But then the two lines diverge. On the far left, we have a 2650 playing a 2600. Elo's formula ("theory") says that the 2650's expected score is about 0.57. But in the data, the 2650 scores slightly worse - roughly 0.55. This means that in real life, the 2650 will, on average, lose rating points when playing weaker players. On the other side of the graph, the pattern reverses: when facing stronger opponents, 2650s perform better than their rating predicts and gain points.
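For reference, the "theory" line in that graph is just Elo's expected-score formula, E = 1 / (1 + 10^((Rb - Ra) / 400)). Here is a small Python sketch (not part of the model, just the textbook formula) that reproduces the theoretical curve for the example above:

def elo_expected_score(rating_a, rating_b):
    # Standard Elo expected score of player A against player B.
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

# The example from the text: a 2650 facing opponents rated 2600 to 2900.
for opp in range(2600, 2901, 50):
    print(f"2650 vs {opp}: theory says {elo_expected_score(2650, opp):.3f}")

# 2650 vs 2600 gives about 0.571 - the ~0.57 "theory" value above - while
# the empirical curve estimated by the model sits closer to 0.55 there.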
Friday, May 6, 2022
New Methodology
Before the Carlsen-Nepo match, I had noticed that my model was underestimating the draw rate. I improved it before the World Championship, but there was more work to do. The original model was based on a large database, and most of the games involved strong amateurs, such as 2300s and 2400s. Elite tournaments were a tiny minority, so the model was not focused on them. However, my forecasts are always for top tournaments. First, I needed a better sample: I kept only games where both players were 2500+. Then I estimated the model on games from 2010-2021 where at least one player was 2700+.
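To make the filtering step concrete, here is a minimal pandas sketch. The file name and the column names (white_elo, black_elo, year) are hypothetical - the post does not describe the database format:

import pandas as pd

games = pd.read_csv("games.csv")  # hypothetical file and column names

# Step 1: keep only games where both players are rated 2500+.
sample = games[(games["white_elo"] >= 2500) & (games["black_elo"] >= 2500)]

# Step 2: estimation sample - 2010-2021 games with at least one 2700+ player.
estimation = sample[
    sample["year"].between(2010, 2021)
    & ((sample["white_elo"] >= 2700) | (sample["black_elo"] >= 2700))
]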
The rest of the details are more technical. Each chess game has three possible outcomes (win, draw, loss), which suggests that an ordered logit model is appropriate. However, several sources said that the proportional odds assumption is often violated in the data, so instead I used a generalized ordered logit (a rough sketch of the specification follows the variable list below). The explanatory variables were:
-the rating gap (i.e. White's rating - Black's rating)
-the average rating of the two players
-the year in which the game was played
-"elite": this variable equals 1 if one of the players is 2750+ and other other isn't. Otherwise it equals 0. This variable could matter if 2750's are overrated. Normally if a player is overrated, then their rating should adjust. When their rating is above their true strength, they won't be able to perform well enough to maintain it. However, top players mostly compete in round robins against each other. They don't mix with the rest of the pool. If overrated players face each other, then the ratings don't adjust.
-interaction and squared terms (gap x avg, gap x year, avg x year, elite x gap, elite x avg, elite x year, gap^2, avg^2, year^2)
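Below is a rough Python sketch of this specification, with two caveats: the column names and result coding are assumptions (the post does not give them), and since statsmodels has no built-in generalized ordered logit, the sketch approximates one by fitting a separate binary logit for each cumulative split of the outcome, so the coefficients are free to differ across thresholds rather than being forced to satisfy proportional odds. The software actually used to estimate the model is not specified in the post.

import statsmodels.api as sm

df = estimation.copy()  # the 2500+/2700+ sample from the earlier sketch
# Assumed result coding: 1 = White win, 0.5 = draw, 0 = Black win.

df["gap"] = df["white_elo"] - df["black_elo"]
df["avg"] = (df["white_elo"] + df["black_elo"]) / 2
df["elite"] = (
    ((df["white_elo"] >= 2750) & (df["black_elo"] < 2750))
    | ((df["white_elo"] < 2750) & (df["black_elo"] >= 2750))
).astype(int)

# Centering avg and year keeps the squared and interaction terms well scaled.
df["avg"] -= 2700
df["year"] -= 2015

X = df[["gap", "avg", "year", "elite"]].copy()
X["gap_x_avg"] = X["gap"] * X["avg"]
X["gap_x_year"] = X["gap"] * X["year"]
X["avg_x_year"] = X["avg"] * X["year"]
X["elite_x_gap"] = X["elite"] * X["gap"]
X["elite_x_avg"] = X["elite"] * X["avg"]
X["elite_x_year"] = X["elite"] * X["year"]
X["gap_sq"] = X["gap"] ** 2
X["avg_sq"] = X["avg"] ** 2
X["year_sq"] = X["year"] ** 2
X = sm.add_constant(X)

# One binary logit per cumulative threshold; the proportional odds
# restriction is not imposed because each threshold gets its own betas.
fit_draw_or_better = sm.Logit((df["result"] >= 0.5).astype(int), X).fit(disp=0)
fit_white_win = sm.Logit((df["result"] == 1.0).astype(int), X).fit(disp=0)

print(fit_draw_or_better.summary())
print(fit_white_win.summary())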
The terms involving elite were statistically significant (p-value = 0.0004). Here are the results (N = 25,534 games):