Methodology
Friday, April 14, 2017
Grenke Chess Classic
The Grenke Chess Classic starts soon. The 8-player round robin features 3 players in the 2800 club: Carlsen, Caruana, and MVL.
Thursday, March 23, 2017
US Chess Championship 2017
The US Chess Championship begins next week. Almost surely, the title will go to one of the "Big Three": So, Caruana, or Nakamura. So and Caruana have been doing very well in the last year; both of them have pushed their ratings into the 2800s. Nakamura has fallen about 20 points behind them, though he should definitely not be counted out.
Thursday, February 2, 2017
Reforming the Candidates Cycle: Methodology
I am thrilled that my article, "Reforming the Candidates Cycle," was published in Chess Life. To save space (an important consideration for magazines), the methodology and discussion sections are posted here online instead of in print.
Reforming the Candidates Cycle: Methodology
By Matthew S. Wilson
For each game, the model has to
know the probability of a win, of a loss, and of a draw. Here I describe where
those probabilities come from and the tests that justify the model’s assumptions.
After a few unexciting preliminaries, we’ll look at the more interesting
question of whether there is a confidence effect from winning your previous
game.
First, I searched my database for
games in which both players are rated 2750 and above. I used this rating cutoff
since only top players can be serious contenders for the World Championship.
Then I threw out games that were blitz, rapid, blindfold, Chess960, or involved
a computer. This is because the Candidates cycle will consist almost entirely
of classical games between humans. These searches left about 1600 games for
analysis.
In these 1600 games, 62.5% ended in
draws. That is how I chose the probability of a draw in the model. Most players
know Elo’s famous formula:
rating change = K(actual score - expected score)
When you do better than
expected (i.e., actual score > expected score), then you gain rating points.
The expected scores come from the Elo system (Elo’s book The Rating of Chess Players,
Past and Present goes into much greater depth). The expected scores also
have another interpretation:
expected score = (1 x probability of a win) + (0.5 x probability of a draw)
Plugging in the draw
rate of 62.5% and the expected score from Elo’s system lets us solve for the
probability of a win. With that in hand, we can also solve for the probability
of a loss.
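The two equations above can be turned into a short calculation. This is only a sketch: the function names are mine, the logistic form of the Elo expected score is an assumption (Elo's book and the FIDE tables use slightly different curves), and the 50-point rating gap is just an illustrative input.

```python
def elo_expected_score(rating_diff):
    """Expected score for a player whose rating minus the opponent's is rating_diff.

    Assumption: logistic approximation to the Elo curve.
    """
    return 1.0 / (1.0 + 10.0 ** (-rating_diff / 400.0))


def outcome_probabilities(rating_diff, draw_rate=0.625):
    """Return (p_win, p_draw, p_loss) for the player described by rating_diff."""
    expected = elo_expected_score(rating_diff)
    # expected score = 1 * p_win + 0.5 * p_draw, so solve for p_win.
    p_win = expected - 0.5 * draw_rate
    # The three probabilities must sum to 1.
    p_loss = 1.0 - draw_rate - p_win
    if p_win < 0 or p_loss < 0:
        raise ValueError("draw rate is too high for this expected score")
    return p_win, draw_rate, p_loss


p_win, p_draw, p_loss = outcome_probabilities(50)
print(f"win {p_win:.3f}, draw {p_draw:.3f}, loss {p_loss:.3f}")
```

With a fixed draw rate, a larger rating gap shifts probability from losses to wins while the draw probability stays where it is.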
You may have noticed an
implicit assumption. I’m treating the games as independent, and this can
certainly be questioned. If you win your previous game, you might gain
confidence and play better in the next game. Top players often say that
confidence is important. So should the probabilities change if you won your
last game?
I tested this in my
database of top players. Here is how the tests were run. We want to know if the
result of your previous game influences the outcome of the next game. First, we
take into account ratings and color (whether you played white or black). We
adjust for ratings by using the expected score:
ER = actual score - expected score
ER is positive if you
perform above your rating (perhaps due to confidence) and negative if you
underperform your rating. Elo ratings don’t take color into account, but there
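is a simple solution to this. I divided the data into 6 bins: Bins 1, 2, and 3 for games played with White and Bins A, B, and C for games played with Black, sorted within each color by the result of the player's previous game (Bin 1 and Bin A hold the games that followed a win).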
The test: if there is a
confidence effect, then the average in Bin 1 will be higher than the averages
in Bins 2 and 3. Similarly, the average in Bin A will be greater than the
averages in Bins B and C.
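The bookkeeping behind this test can be sketched as follows. The record layout (the keys `color`, `prev_result`, `rating`, `opp_rating`, `score`), the logistic expected-score formula, and the sample games are all assumptions made for illustration. They are not the actual database or its results.

```python
from collections import defaultdict


def elo_expected_score(rating_diff):
    # Assumption: logistic approximation to the Elo curve.
    return 1.0 / (1.0 + 10.0 ** (-rating_diff / 400.0))


def average_er_by_bin(games):
    """Average ER (actual score minus expected score) per (color, previous result) bin."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for g in games:
        expected = elo_expected_score(g["rating"] - g["opp_rating"])
        er = g["score"] - expected
        # Splitting by color handles the fact that Elo ignores who has White.
        key = (g["color"], g["prev_result"])
        totals[key] += er
        counts[key] += 1
    return {key: totals[key] / counts[key] for key in totals}


# Made-up records, only to show the shape of the input.
sample_games = [
    {"color": "white", "prev_result": "won", "rating": 2800, "opp_rating": 2800, "score": 1.0},
    {"color": "white", "prev_result": "won", "rating": 2800, "opp_rating": 2800, "score": 0.5},
    {"color": "black", "prev_result": "lost", "rating": 2760, "opp_rating": 2760, "score": 0.5},
]
averages = average_er_by_bin(sample_games)
print(averages)
```

A confidence effect would show up as a higher average in the "won" bin than in the other two bins of the same color.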
The result? No
significant difference between Bins 1, 2, and 3! If anything, White scored
slightly worse if he won his previous
game – the opposite of a confidence effect. But the difference was too small to
pass tests of statistical significance. Similarly, Bins A, B, and C had no
statistically significant differences. There is no evidence that the result in
the previous game affects the outcome of the next game.
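The article does not say which significance test was used. As one possible stand-in, here is a sketch of a two-sided permutation test on the difference between two bin averages. The function name, the default number of permutations, and the ER values are all hypothetical.

```python
import random


def permutation_p_value(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in means between two groups."""
    rng = random.Random(seed)
    observed = abs(sum(group_a) / len(group_a) - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        # Reassign the bin labels at random and recompute the gap in means.
        rng.shuffle(pooled)
        mean_a = sum(pooled[:n_a]) / n_a
        mean_b = sum(pooled[n_a:]) / (len(pooled) - n_a)
        if abs(mean_a - mean_b) >= observed - 1e-12:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_permutations


# Made-up ER values for two bins; not the article's data.
er_after_win = [0.4, -0.1, 0.0, -0.6, 0.4, -0.1]
er_after_draw = [0.4, -0.1, -0.1, 0.4, -0.6, 0.0]
print(permutation_p_value(er_after_win, er_after_draw))
```

A large p-value means the observed gap between the bins is the kind of gap that random relabeling produces routinely, which is what "no significant difference" amounts to here.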
But perhaps this should
not be surprising. In order to become a top player, you might have to be able
to put the last game behind you and not let your emotions influence your play. Importantly,
these tests were done in a database where all players were rated 2750+. That is
the right group to study for the Candidates cycle model. But it doesn’t answer
the other interesting question, which is whether there is a confidence effect
for lower rated players. It is uncertain if the model can be applied to the
Women’s World Championship (a caveat for that part of the essay). This may be a
fruitful area for more study.
I explored another
avenue for a confidence effect to manifest itself. The previous test only
considered the results in the game immediately after the previous game. What
about the games two or three rounds later in a tournament? A confidence boost
from an early win might affect those results. Also, an early loss could
indicate poor form, and this could show up in the results from later rounds. Both
of these effects would cause the players’ scores to be more spread out.
Compared to reality, the model would have too many players near 50% and not enough
at the extremes.
The test: Step 1. Find the average distance from a
50% score in top tournaments. Let’s say you scored 6.5/10 in a tournament and I
got 3.5/10. Your distance from a 50% score is 6.5 – 5 = 1.5. Mine is 5 – 3.5 =
1.5. Our average distance from 50% is (1.5 + 1.5)/2 = 1.5. The top tournaments
in the test came from the 2015 Grand Chess Tour. Step 2. Find the average distance from a 50% score in the model’s
simulations. Step 3. Run a
statistical test to check if the results from Steps 1 and 2 are significantly
different. The numbers found in Step 1 should be larger than their counterparts
in Step 2 if there is a confidence effect.
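Steps 1 and 2 can be sketched in a few lines. The distance function follows the worked example above. The simulation is a simplified stand-in for the model: it assumes equally rated players, ignores color, and uses a single round robin, none of which is claimed to match the actual simulations.

```python
import itertools
import random


def average_distance_from_even(scores, games_played):
    """Average absolute distance of each score from 50% of the games played."""
    half = games_played / 2
    return sum(abs(s - half) for s in scores) / len(scores)


def simulate_round_robin(n_players, p_win, p_draw, rng):
    """Single round robin with independent games between equally rated players.

    p_win is the chance that the first-listed player wins a given pairing.
    """
    scores = [0.0] * n_players
    for i, j in itertools.combinations(range(n_players), 2):
        r = rng.random()
        if r < p_win:
            scores[i] += 1.0
        elif r < p_win + p_draw:
            scores[i] += 0.5
            scores[j] += 0.5
        else:
            scores[j] += 1.0
    return scores


# The example from the text: 6.5/10 and 3.5/10 are each 1.5 away from 50%.
print(average_distance_from_even([6.5, 3.5], 10))

# One simulated 10-player event with a 62.5% draw rate and equal win/loss chances.
rng = random.Random(0)
simulated = simulate_round_robin(10, p_win=0.1875, p_draw=0.625, rng=rng)
print(average_distance_from_even(simulated, games_played=9))
```

Repeating the simulation many times gives the Step 2 distribution to compare against the real tournaments.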
But it turned out that
the difference was statistically insignificant. There really does not seem to
be a confidence effect, at least in top tournaments. In the simulations, the
games are independent, and the evidence does not contradict that.
Strictly speaking, I
have not disproven the confidence effect. Rather, the tests have looked for and
failed to find any evidence of such an effect. Why did they not find any
confidence effect? It could be that the effect is simply not there. It could
also be that it is too small for the tests to find. But a small confidence
effect is not a problem for the model. Small changes in the draw and win
probabilities have very small effects on the result. It doesn’t matter very
much if Player A’s chances of reaching the World Championship match are 88.95%
instead of 89.66%.
There is another
assumption worth testing. The probability of a draw is the same in every game,
but perhaps it should depend on the players’ ratings. The probabilities of a
win and of a loss were already adjusted for rating based on Elo’s formulas. I’ll
call the gap between white’s rating and black’s rating the rating gap. It’s reasonable to expect a higher draw rate between
evenly matched players. If the players are far apart in skill, draws should be
rarer. The further apart the players’ ratings are, the lower the draw rate; I
will call this the rating gap effect.
This was easy to
establish in a large database (classical time controls and both players 2200+).
However, if I only look at games between those rated 2750+, the rating gap
effect is no longer statistically significant. It should be noted that a rating
gap effect would be very hard to
detect in this database. That is because when both players are 2750+, there is
hardly any variation in the rating gap. Except for some of Carlsen’s and
Kasparov’s games, the database consists of battles between players very evenly
matched. It will be tough to find a rating gap effect when there isn’t much of
a rating gap.
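One way to look for a rating gap effect is to group games by the size of the gap and compare draw rates. The sketch below assumes a hypothetical input format of `(white_rating, black_rating, result)` tuples with PGN-style result strings, uses the absolute size of the gap, and picks a 50-point bucket width arbitrarily.

```python
from collections import defaultdict


def draw_rate_by_gap(games, bucket_width=50):
    """Draw rate per rating-gap bucket; keys are the lower edge of each bucket."""
    draws = defaultdict(int)
    totals = defaultdict(int)
    for white, black, result in games:
        bucket = abs(white - black) // bucket_width * bucket_width
        totals[bucket] += 1
        if result == "1/2-1/2":
            draws[bucket] += 1
    return {b: draws[b] / totals[b] for b in sorted(totals)}


# Made-up games, only to show the input format.
example_games = [
    (2800, 2790, "1/2-1/2"),
    (2760, 2755, "1-0"),
    (2850, 2760, "0-1"),
]
print(draw_rate_by_gap(example_games))
```

When nearly every game lands in the lowest bucket, as in a 2750+ database, there is little to compare, which is the detection problem described above.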
Suppose we remain
convinced that the rating gap effect is real but too small for the tests to
pick up. To estimate its size, I have to go back to my larger database
(classical time controls and both players 2200+). This suggested that the draw
rate should be a couple of percentage points lower when Player A is involved and a
few points higher when he is not. What would that imply for the model? Not much.
Small changes in the draw rate only lead to very small changes in the outcome.
Furthermore, the changes act in opposite directions. Lowering the draw rate for
Player A makes upsets more likely. But raising the draw rate for the other
games makes it harder for the weaker players to overtake him. Overall, Player
A’s chances of reaching the World Championship match are still very close to
90%.
A final concern is that
the model ignores whether a player has white or black. But everyone plays the
same number of games with each color, so any of those issues will cancel out.
As you can see, there
is room to debate the model’s assumptions. However, the results barely budge if
the model is tweaked in response to these potential criticisms. The proposed
format for the Candidates cycle would greatly increase the chance that the best
player wins.
Reforming the Candidates Cycle: Discussion
I am thrilled that my article, "Reforming the Candidates Cycle," was published in Chess Life. To save space (an important consideration for magazines), the methodology and discussion sections are posted here online instead of in print.
Reforming the Candidates Cycle: Additional Discussion
By Matthew S. Wilson
The proposed Candidates cycle begins with 12 players
competing in 4 Grand Prix tournaments. Why use 12 players instead of a
different number? This is based on general considerations rather than a
statistical model. Ratings fluctuate – for instance, around early 2015,
Grischuk rose all the way to #3 in the world, but within a year he dropped out
of the top 10. We want to invite enough players so that we are sure that the
best one is not accidentally excluded. On the other hand, inviting too many
players can also create problems. Pardon me for a moment while I take one last
swipe at the knockout world championships. At the beginning, there were 128
players. Garry Kasparov famously referred to some of them as “chess tourists.”
Many participants can only play the role of the spoiler, and by chance a few of
them will eliminate a top player. This just makes it less likely that the
tournament will successfully crown the best player.
So what is the right way to navigate the extremes of too few
players and too many players? I felt that 12 was the right number. It ensures
that no one in the top 10 will be excluded. The most recent Candidates
Tournament featured 8 players and somehow that did not include World #2,
Vladimir Kramnik. That suggests that 8 players are not enough. FIDE recently
expanded the Grand Prix to 24 players, but that necessarily involves
participants who are not in the top 20. It is highly unlikely that someone
outside the top 20 is the best in the world, so that is probably too many
players. However, my cutoff of 12 players is a bit arbitrary, and reasonable
people can differ on the best number.
It is preferable that players are invited on the basis of
rating rather than on success in a qualifying tournament. That is because
ratings reward consistently strong performance, rather than a strong
performance in a single tournament. As a practical matter, to find sponsorship
it may be necessary to allow the organizers to nominate a player. More
discussion of that later.
I would be surprised if this proposal were adopted with no
alterations. Since some modifications are almost certain to occur, it’s
important to know which changes would be minor and which would be major.
-The “artificial”
environment in the statistical model. In the model, Player A is exactly 50
rating points stronger than all of his rivals. It is very unlikely that a real
tournament will precisely replicate this situation. Earlier, I discussed the
possibility of Player A’s nearest rival being fewer than 50 points behind him.
Now for the other possibilities: (1) all the opponents are more than 50 points
weaker; (2) one or more opponents are 50 points weaker and the rest are more
than 50 points weaker.
The goal of the World Championship cycle is to identify the
best player. Clearly, Option #1 just makes that even more likely to occur. The
more that Player A towers over his rivals, the more likely it is that he will
become World Champion. So if the model says that Player A has a 90% chance of
reaching the World Championship match, then his actual chance will be above
90%. That is not at all bad for the chess world – my proposed Candidates cycle
turns out to be better than advertised.
Option #2 leads to a similar outcome. “Player B” is the
individual 50 rating points below Player A. The other contenders are “Player
C,” “Player D,” etc. In the model, Players B, C, D, etc. were all equally good,
so they had equal chances of winning.
Now instead we’ll suppose that Players G and H are more than
50 points weaker than Player A, as in Option #2. This lowers the chances of
Players G and H – weaker players are less likely to win. However, this
necessarily raises the chances of all the other players, including Player A.
So once again the proposed Candidates cycle works better
than advertised. Player A earns the right to play in the World Championship
match even more often than predicted.
The practical implications? It is not a big problem if the
organizers nominate one or two players who are slightly weaker than the rest.
But there is one possible issue. Having more organizer nominees will crowd out
players who would have earned the right to participate due to their rating.
This raises the risk that the best player in the world is not even invited to
the Candidates cycle. As a general guideline, I think that it would be fine if
there were one or two players nominated by the organizers, but not many more
than that.
-The 6-game rapid
tiebreaks. Changing this would have a small impact. Tiebreaks are first
applied after the Grand Prix cycle of four 12-player round robins. After that many
games, ties are less likely. Changing the tiebreak system to 4-game rapid
matches or 2-game rapid matches will lower Player A’s chances by about 1%.
Other tiebreak systems also have a very small effect. Ideally, ties would be
broken by a short match at classical time controls, but that may be impractical
since it would take a few extra days. Organizers would need a contingency plan
to find accommodations and a playing hall in case of ties. Though 6-game rapid
matches may be the best solution, it is not critical for the Candidates cycle
proposal.
-The scoring system.
Each game is scored as 1/0.5/0 for win/draw/loss, and each player’s total score
is the sum of the points scored in all
the previous tournaments in the cycle. For instance, if you scored 25/44 in the
Grand Prix and qualified for the Candidates Tournaments, you don’t start with
0/0 in the next stage – you start with the 25/44 you earned in the Grand Prix.
This is a very important feature of the proposal.
Consider two players who have succeeded in the first stage
of a Candidates cycle, qualifying them to continue to the second stage. Player
1 qualified by scoring a dominating 9.5/11. Player 2 managed 6.5/11, barely
enough to qualify. There is definitely evidence that Player 1 is better than
Player 2. But if we have them both start from 0/0 in the second stage, we have
discarded useful information. The fresh start for both does not reflect the
fact that Player 1 did better in the previous tournament. Having them start
from their 9.5/11 and 6.5/11 scores is the simplest and best way to account for
Player 1’s superior performance in the earlier stage.
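The difference between carrying scores over and starting from 0/0 can be shown with a small sketch. The first-stage scores are the 9.5/11 and 6.5/11 from the example. The second-stage results and the 10-game stage length are hypothetical, chosen only to show how the two rules can disagree.

```python
def standings(stage_scores, carry_over=True):
    """Total (points, games) per player.

    stage_scores maps a player name to a list of (points, games) per stage.
    With carry_over=False only the latest stage counts (a fresh start at 0/0).
    """
    table = {}
    for name, stages in stage_scores.items():
        counted = stages if carry_over else stages[-1:]
        points = sum(p for p, _ in counted)
        games = sum(g for _, g in counted)
        table[name] = (points, games)
    return table


results = {
    "Player 1": [(9.5, 11), (5.0, 10)],  # second stage is hypothetical
    "Player 2": [(6.5, 11), (5.5, 10)],  # second stage is hypothetical
}
print(standings(results, carry_over=True))
print(standings(results, carry_over=False))
```

With the scores carried over, Player 1 stays ahead on 14.5/21 against 12/21. With a fresh start, Player 2 finishes ahead on 5.5/10 against 5/10 and the first-stage evidence is thrown away.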
Giving everyone a fresh start at 0/0 in each stage of the
cycle has large implications for the 50 Point Principle. Player A’s chances of
making it into the World Championship match plummet to 73.6%. We want the best
players to qualify for the World Championship match. But there is always a
chance of an upset. Player A has a good chance (about 55%) of winning the Grand
Prix. If we use the total score as in the proposal, then Player A’s performance
in the Grand Prix can provide a buffer against upsets in the later stages.
Similarly, his probable victory in the Candidates Tournaments can offset potential
poor form in the Final. Starting everyone at 0/0 in each stage takes away that
buffer. Adopting the proposed scoring system is a virtually costless change
that can greatly increase the chance that the best player wins.
I cannot recommend the Grand Prix Points scoring system.
First of all, it is more complicated than the 1/0.5/0 system for win/draw/loss
that every player knows. More importantly, distortions can arise.
A few examples will illustrate what I mean by “distortions.”
Clearly, it should be better to win a Grand Prix Tournament with 8/11 than with
7/11. But either result yields the same number of Grand Prix Points (170).
Also, it is hard to see why the gap between 1st and 2nd
is 30 points, but the gap between 4th and 5th is just 10.
These distortions are not just hypothetical problems; they had a decisive
impact on the 2014-2015 Grand Prix cycle.
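The first distortion is easy to reproduce. The sketch below uses only the two values given in the text (170 points for first place and a 30-point gap to second). The crosstables are hypothetical, and the sharing of points between tied players is deliberately left out.

```python
# Partial points table: only the places whose values follow from the text.
GRAND_PRIX_POINTS = {1: 170, 2: 140}


def points_by_place(final_scores, points_table):
    """Map each player to the points for their finishing place (no tie handling)."""
    ranked = sorted(final_scores, key=final_scores.get, reverse=True)
    return {
        name: points_table[place]
        for place, name in enumerate(ranked, start=1)
        if place in points_table
    }


# Two hypothetical events: the winner scores 8/11 in one and 7/11 in the other.
dominant_win = {"Winner": 8.0, "Runner-up": 6.5}
narrow_win = {"Winner": 7.0, "Runner-up": 6.5}

print(points_by_place(dominant_win, GRAND_PRIX_POINTS))
print(points_by_place(narrow_win, GRAND_PRIX_POINTS))
```

Both calls give the winner 170 points, so the extra game point from the 8/11 result leaves no trace in the standings.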
The top two players from the Grand Prix qualified for the
Candidates. According to the Grand Prix Points system, those players were Caruana
and Nakamura (the tables below only show the top four players).
Tomashevsky finished fourth in Grand Prix Points, even though he did just as
well as Caruana and Nakamura! The reason? Tomashevsky “wasted” a point by
winning Tbilisi with 8/11 (1.5 points above his nearest rival), when 7/11 would
have sufficed and given him just as many Grand Prix Points. Caruana and
Nakamura spread out their victories more “efficiently” and thus transformed the
same 19/33 score into a larger number of Grand Prix Points. And thus Caruana
and Nakamura qualified for the Candidates while Tomashevsky was left out.
Simply using the 1/0.5/0 system is easier, fairer, and it
doesn’t cost FIDE anything.
-The World Champion
participates in the Candidates cycle. This is a break from tradition that
may be unpopular. The tradition can be preserved, though it will take one more
event to maintain the 50 Point Principle. This may well be a cost that the
chess world is willing to bear.
Variation: The World Champion does not play in the
Candidates cycle. Instead, he plays a match against the winner of the cycle.
The Candidates cycle is the same as before, except for two changes. First, the
top 2 players from the Final face off in a 24-game match. The top player from the
Final will have draw odds in case the match is tied 12-12. The winner of this
match will be the Challenger in the World Championship. The Challenger and the World Champion then play a 26-game
match. If drawn, there will be a 4-game tiebreaker at classical time controls.
This approximately satisfies the 50 Point Principle, though as you can see, it
takes an extra match in order to do so.
What are the benefits of having the World Champion
participate in the Candidates cycle? First of all, when the Candidates cycle
whittles down the field to one player instead of two, there is a greater chance
that the best player will be eliminated. That is why the original proposed
Candidates cycle reduces the field to two players, who then play a match for
the title. Second, having the World Champion play alongside the candidates
grants us useful information. That information can be used in assigning draw
odds in case the match is tied. Tiebreak systems tend to be very arbitrary, but
in this case, the draw odds are earned
by performing better against the same opposition. If the World Champion does
not participate in the cycle, then this information is never obtained, so there
is no basis for granting draw odds. We are left with the unpleasant question of
how to break ties. Many fans would rather not see the World Championship title
determined by rapid games, which has already happened in Topalov-Kramnik and
Anand-Gelfand. And wouldn’t participation by the World Champion increase
interest in the Candidates cycle?
Nevertheless, this is not an essential feature. The
variation above can accommodate the more traditional approach.
-The draw rate in the
statistical model. As described earlier, the simulations assume that there
is a 62.5% probability of a draw. In the rapid tiebreaks, the draw probability
is 40%. This second number comes from the draw rate in the Paris rapid section
of the 2016 Grand Chess Tour. Small changes in the draw rate have only a small
impact on the results. This is especially true for the rapid tiebreaks, since
they are frequently not needed in this format. We can tweak these parts of the
model if you disagree with them, but it will not do anything dramatic to the
results.
However, the same cannot be said about large changes in the
draw rate. There is no reason to think that 62.5% is wildly inaccurate for the
regular World Championship, but it would not apply to the Women’s World
Championship. The draw rate for those tournaments is closer to 50%, based on
the 2011-2012 Women’s Grand Prix (lower rated players tend to draw less often).
A lower draw rate means that there will be more wins and more losses. This
means that upsets become more likely. Suppose that Player 1 is expected to
score 60% against Player 2. With an 80% draw rate, Player 1 never loses; upsets
are impossible. With a 0% draw rate, an upset would not be very surprising. As
discussed earlier, the top player has a better chance of prevailing if they
play longer events. Thus, upsets are more likely when draws are rare and less
likely when events have many games. Therefore, in order to achieve the same
level of certainty that the best woman wins, we need to have more games in the
Women’s Candidates cycle.
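The link between the draw rate and upsets follows from the same two equations used to build the model. A minimal sketch, where the function name is mine and the 60% expected score is the one from the example:

```python
def upset_probability(expected_score, draw_rate):
    """Chance that the favorite loses a single game.

    From expected = p_win + 0.5 * draw_rate and p_win + draw_rate + p_loss = 1,
    it follows that p_loss = 1 - expected - 0.5 * draw_rate.
    """
    p_win = expected_score - 0.5 * draw_rate
    p_loss = 1.0 - expected_score - 0.5 * draw_rate
    if p_win < -1e-12 or p_loss < -1e-12:
        raise ValueError("draw rate is too high for this expected score")
    # Clamp tiny negative values caused by floating-point rounding.
    return max(p_loss, 0.0)


for draw_rate in (0.0, 0.5, 0.625, 0.8):
    print(f"draw rate {draw_rate:.3f}: upset chance {upset_probability(0.6, draw_rate):.4f}")
```

At a 62.5% draw rate the favorite loses 8.75% of the time, and at 50% that rises to 15%. This is why the same format gives less certainty when draws are rarer.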
Hou Yifan recommended that FIDE use the same format for the
Women’s World Championship as the regular World Championship. This would
undoubtedly be a large improvement. But due to the different draw rates,
duplicating the formats will not duplicate the 50 Point Principle. With a 50%
draw rate in my proposed format, Player A reaches the World Championship 83.58%
of the time, compared to 89.66% earlier. Applying the 50 Point Principle to the
Women’s World Championship will require more tournaments (or longer
tournaments) for the women than for the men. But that also may strike people as
unfair. A difficult and controversial problem has arisen.
Wednesday, January 11, 2017
2017 Tata Steel Masters
The Tata Steel tournament begins soon. World Champion Magnus Carlsen will be participating along with several other top players. Will Wesley So be able to replicate his successes from the Sinquefield Cup and London Classic? Will Nepomniachtchi break into the top 10? Can Van Wely return to the 2700 club? All of these questions and more will be addressed at the tournament.
Thursday, December 8, 2016
London Chess Classic
The London Chess Classic begins soon. Despite the absence of Carlsen and Karjakin, the field is quite strong. Caruana is the top seed, and with a good result, he might even overtake Carlsen on the rating list.
Monday, November 28, 2016
World Chess Championship - Game 11
"Draw" is now the favorite in the match. Carlsen gets one last shot with the White pieces; if it is a draw, then there will be a 4-game rapid tiebreaker.