Reforming the Candidates Cycle: Methodology
By
Matthew S. Wilson
For each game, the model has to
know the probability of a win, of a loss, and of a draw. Here I describe where
those probabilities come from and the tests that justify the model’s assumptions.
After a few unexciting preliminaries, we’ll look at the more interesting
question of whether there is a confidence effect from winning your previous
game.
First, I searched my database for
games in which both players are rated 2750 and above. I used this rating cutoff
since only top players can be serious contenders for the World Championship.
Then I threw out games that were blitz, rapid, blindfold, Chess960, or involved
a computer. This is because the Candidates cycle will consist almost entirely
of classical games between humans. These searches left about 1600 games for
analysis.
Of these 1600 games, 62.5% ended in draws. That is the draw probability I used in the model. Most players
know Elo’s famous formula:
rating change = K(actual score - expected score)
When you do better than expected (i.e., actual score > expected score), you gain rating points. The expected scores come from the Elo system (Elo's book The Rating of Chessplayers, Past and Present goes into much greater depth). The expected scores also
have another interpretation:
expected score = (1 x probability of a win) + (0.5 x probability of a draw)
Plugging in the draw
rate of 62.5% and the expected score from Elo’s system lets us solve for the
probability of a win. With that in hand, we can also solve for the probability of a loss, since the three probabilities must sum to 1.
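To make the arithmetic concrete, here is a small Python sketch of that calculation. The Elo expected-score formula and the 62.5% draw rate come from the discussion above; the 30-point rating edge in the example is purely illustrative.

```python
def elo_expected_score(rating_a, rating_b):
    """Elo expected score for player A against player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def game_probabilities(rating_a, rating_b, p_draw=0.625):
    """Split the Elo expected score into win/draw/loss probabilities.

    expected score = 1 * P(win) + 0.5 * P(draw), so
    P(win) = expected score - 0.5 * P(draw),
    and the three probabilities must sum to 1.
    """
    expected = elo_expected_score(rating_a, rating_b)
    p_win = expected - 0.5 * p_draw
    p_loss = 1.0 - p_win - p_draw
    return p_win, p_draw, p_loss

# Illustration only: a 30-point rating edge for player A.
print(game_probabilities(2780, 2750))
```

For that 30-point edge, the sketch gives the higher-rated player roughly a 23% chance of winning, a 62.5% chance of drawing, and about a 14% chance of losing.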
You may have noticed an
implicit assumption. I’m treating the games as independent, and this can
certainly be questioned. If you win your previous game, you might gain
confidence and play better in the next game. Top players often say that
confidence is important. So should the probabilities change if you won your
last game?
I tested this in my
database of top players. Here is how the tests were run. We want to know if the
result of your previous game influences the outcome of the next game. First, we
take into account ratings and color (whether you played white or black). We
adjust for ratings by using the expected score:
ER = actual score - expected score
ER is positive if you
perform above your rating (perhaps due to confidence) and negative if you
underperform your rating. Elo ratings don't take color into account, but there is a simple solution to this. I divided the data into 6 bins: Bins 1, 2, and 3 hold White's games sorted by the result of his previous game (win, draw, or loss), and Bins A, B, and C do the same for Black.
The test: if there is a
confidence effect, then the average in Bin 1 will be higher than the averages
in Bins 2 and 3. Similarly, the average in Bin A will be greater than the
averages in Bins B and C.
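The article does not spell out the statistical machinery, so the following is only a sketch of how such a test could be run in Python, using hypothetical column names for a table of games (the player's color, the result of his previous game, and the ER value defined above). A two-sample t-test of Bin 1 against Bins 2 and 3 combined is one reasonable choice.

```python
import pandas as pd
from scipy import stats

# Hypothetical columns: 'color' ('white'/'black'), 'prev_result'
# ('win'/'draw'/'loss' in the player's previous game), and 'er'
# (actual score minus Elo expected score for the current game).
games = pd.read_csv("top_player_games.csv")  # placeholder file name

white = games[games["color"] == "white"]

# Bin 1: White won his previous game; Bins 2 and 3: he drew or lost.
bin1 = white.loc[white["prev_result"] == "win", "er"]
bins23 = white.loc[white["prev_result"].isin(["draw", "loss"]), "er"]

# A confidence effect predicts a higher mean ER in Bin 1.
t_stat, p_value = stats.ttest_ind(bin1, bins23, equal_var=False)
print(bin1.mean(), bins23.mean(), p_value)

# The analogous comparison for Black uses Bins A, B, and C.
```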
The result? No
significant difference between Bins 1, 2, and 3! If anything, White scored
slightly worse if he won his previous
game – the opposite of a confidence effect. But the difference was too small to
pass tests of statistical significance. Similarly, Bins A, B, and C had no
statistically significant differences. There is no evidence that the result in
the previous game affects the outcome of the next game.
But perhaps this should
not be surprising. In order to become a top player, you might have to be able
to put the last game behind you and not let your emotions influence your play. Importantly,
these tests were done in a database where all players were rated 2750+. That is
the right group to study for the Candidates cycle model. But it doesn’t answer
the other interesting question, which is whether there is a confidence effect
for lower-rated players. It is uncertain whether the model can be applied to the
Women’s World Championship (a caveat for that part of the essay). This may be a
fruitful area for more study.
I explored another avenue for a confidence effect to manifest itself. The previous test only considered the game immediately following a win or a loss. What about the games two or three rounds later in a tournament? A confidence boost from an early win might affect those results. Also, an early loss could indicate poor form, and this could show up in the results from later rounds. Both
of these effects would cause the players’ scores to be more spread out.
Compared to reality, the model would have too many players near 50% and not enough
at the extremes.
The test:
Step 1. Find the average distance from a 50% score in top tournaments. Let's say you scored 6.5/10 in a tournament and I got 3.5/10. Your distance from a 50% score is 6.5 - 5 = 1.5. Mine is 5 - 3.5 = 1.5. Our average distance from 50% is (1.5 + 1.5)/2 = 1.5. The top tournaments in the test came from the 2015 Grand Chess Tour.
Step 2. Find the average distance from a 50% score in the model's simulations.
Step 3. Run a statistical test to check if the results from Steps 1 and 2 are significantly different. The numbers found in Step 1 should be larger than their counterparts in Step 2 if there is a confidence effect.
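Again the exact test statistic is not given in the text, so here is one way the comparison could be sketched in Python. The scores below are placeholders for a ten-player round-robin, not the 2015 Grand Chess Tour data or the model's actual output.

```python
import numpy as np
from scipy import stats

def distances_from_even(scores, num_games):
    """Each player's distance from a 50% score, e.g. |6.5 - 5| = 1.5 over 10 games."""
    return np.abs(np.asarray(scores, dtype=float) - num_games / 2.0)

# Step 1: final scores from a real top tournament (placeholder values).
observed = distances_from_even(
    [6.0, 5.5, 5.5, 5.0, 4.5, 4.5, 4.0, 4.0, 3.5, 2.5], num_games=9)

# Step 2: final scores from the model's simulations, where games are
# independent (placeholder values).
simulated = distances_from_even(
    [5.5, 5.0, 5.0, 5.0, 4.5, 4.5, 4.5, 4.0, 3.5, 3.5], num_games=9)

# Step 3: a confidence effect predicts larger observed distances on average.
t_stat, p_value = stats.ttest_ind(observed, simulated, equal_var=False)
print(observed.mean(), simulated.mean(), p_value)
```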
But it turned out that
the difference was statistically insignificant. There really does not seem to
be a confidence effect, at least in top tournaments. In the simulations, the
games are independent, and the evidence does not contradict that.
Strictly speaking, I
have not disproven the confidence effect. Rather, the tests have looked for and
failed to find any evidence of such an effect. Why did they not find any
confidence effect? It could be that the effect is simply not there. It could
also be that it is too small for the tests to find. But a small confidence
effect is not a problem for the model. Small changes in the draw and win
probabilities have very small effects on the result. It doesn’t matter very
much if Player A’s chances of reaching the World Championship match are 88.95%
instead of 89.66%.
There is another
assumption worth testing. In the model, the probability of a draw is the same in every game, but perhaps it should depend on the players' ratings. The probabilities of a
win and of a loss were already adjusted for rating based on Elo’s formulas. I’ll
call the gap between white’s rating and black’s rating the rating gap. It’s reasonable to expect a higher draw rate between
evenly matched players. If the players are far apart in skill, draws should be
rarer. The further apart the players’ ratings are, the lower the draw rate; I
will call this the rating gap effect.
This was easy to
establish in a large database (classical time controls and both players 2200+).
However, if I only look at games between those rated 2750+, the rating gap
effect is no longer statistically significant. It should be noted that a rating
gap effect would be very hard to
detect in this database. That is because when both players are 2750+, there is
hardly any variation in the rating gap. Except for some of Carlsen’s and
Kasparov’s games, the database consists of battles between players very evenly
matched. It will be tough to find a rating gap effect when there isn’t much of
a rating gap.
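The text does not say which test was used here, so the following is only one plausible sketch: a logistic regression of a draw indicator on the absolute rating gap, with hypothetical column names. A significantly negative coefficient on the gap would support the rating gap effect.

```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical columns: 'white_rating', 'black_rating', and 'result'
# ('1-0', '1/2-1/2', or '0-1') for classical games with both players 2200+.
games = pd.read_csv("classical_2200plus.csv")  # placeholder file name

games["is_draw"] = (games["result"] == "1/2-1/2").astype(int)
games["rating_gap"] = (games["white_rating"] - games["black_rating"]).abs()

# Logistic regression: does a larger rating gap lower the draw probability?
X = sm.add_constant(games[["rating_gap"]])
model = sm.Logit(games["is_draw"], X).fit()
print(model.summary())

# Repeating the fit on only the 2750+ games shows whether the effect is
# still detectable when there is hardly any variation in the rating gap.
```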
Suppose we remain
convinced that the rating gap effect is real but too small for the tests to
pick up. To estimate its size, I have to go back to my larger database
(classical time controls and both players 2200+). This suggested that the draw
rate should be a couple percentage points lower when Player A is involved and a
few points higher when he is not. What would that imply for the model? Not much.
Small changes in the draw rate only lead to very small changes in the outcome.
Furthermore, the changes act in opposite directions. Lowering the draw rate for
Player A makes upsets more likely. But raising the draw rate for the other
games makes it harder for the weaker players to overtake him. Overall, Player
A’s chances of reaching the World Championship match are still very close to
90%.
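The article's simulation details and the proposed tournament format are not described in this section, so the following is only a minimal Monte Carlo sketch of how such a robustness check could be run: a single round-robin among eight invented players, simulated once with a flat 62.5% draw rate and once with the draw rate nudged down in the favorite's games and up elsewhere. The ratings and the size of the adjustment are illustrative, not the article's numbers; comparing the two printed figures is what checks the sensitivity claim.

```python
import random

def game_probabilities(rating_a, rating_b, p_draw=0.625):
    """Win/draw/loss probabilities for player A, derived as in the expected-score section."""
    expected = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    p_win = expected - 0.5 * p_draw
    return p_win, p_draw, 1.0 - p_win - p_draw

def top_seed_first_place_rate(ratings, draw_rate, trials=20000):
    """Fraction of simulated single round-robins in which player 0 takes first place
    (ties broken at random).  draw_rate(i, j) gives the draw probability for the
    game between players i and j."""
    wins = 0
    n = len(ratings)
    for _ in range(trials):
        scores = [0.0] * n
        for i in range(n):
            for j in range(i + 1, n):
                p_win, p_draw, _ = game_probabilities(ratings[i], ratings[j], draw_rate(i, j))
                r = random.random()
                if r < p_win:
                    scores[i] += 1.0
                elif r < p_win + p_draw:
                    scores[i] += 0.5
                    scores[j] += 0.5
                else:
                    scores[j] += 1.0
        best = max(scores)
        leaders = [k for k, s in enumerate(scores) if s == best]
        if random.choice(leaders) == 0:
            wins += 1
    return wins / trials

# Invented field: player 0 is the heavy favorite.
ratings = [2850, 2780, 2775, 2770, 2765, 2760, 2755, 2750]

flat = top_seed_first_place_rate(ratings, lambda i, j: 0.625)
# Rating gap effect: slightly fewer draws in the favorite's games, slightly more elsewhere.
adjusted = top_seed_first_place_rate(ratings, lambda i, j: 0.60 if 0 in (i, j) else 0.65)
print(flat, adjusted)
```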
A final concern is that
the model ignores whether a player has white or black. But everyone plays the same number of games with each color, so any color effects will cancel out.
As you can see, there
is room to debate the model’s assumptions. However, the results barely budge if
the model is tweaked in response to these potential criticisms. The proposed
format for the Candidates cycle would greatly increase the chance that the best
player wins.