Gaspode-san's Sumo Statistics

The process for calculating Elo ratings for sumo starts with Elo's original formulae. I call this the BKQ model because it has three parameters $b$, $k>0$ and $q>0$ (discussed below). Ratings are calculated in the same way for all rikishi (i.e. irrespective of division) as follows:

Taking the last point first, the rule is that if a rikishi doesn't fight then neither they nor their scheduled opponent gains or loses any points. This may sound somewhat surprising, but Elo's thinking was that if someone isn't able (or willing) to take part in a chess match then that doesn't mean their ability to play chess is any less. When I started this project I thought the same could be said for rikishi. For example, when Ozeki Asanoyama was barred for a year he came back in July 2022 at Sd22 with his old rating of 2125. This was 700 points higher than anyone else in the division. Under the assumption he had been keeping fit and training, that difference sounds about right to me. And of course he rocketed back up the rankings.

On the other hand, it doesn't seem right to me that there is no loss of points from absence due to injury, especially in the case of yokozuna. However, if a non-yokozuna is absent due to injury and comes back before he is fully fit, he won't have lost any points during the absence but he is likely to lose points on his return until his rating is commensurate with his current level of ability. I have not decided on how to resolve this yet. TBD

Returning to the formulae of the BKQ model, although it may not be obvious, it follows from the formulae that when two rikishi fight:

Let's see the formulae in action. For reasons to be discussed later, let's set $q=400$ and $k=35$. Now, at the start of the March 2024 basho, Abi was a komusubi with an Elo rating of 2042 and Oho was an M3 at 1865. The ranks and the ratings both indicate Abi was stronger than Oho at that time (which is a minor confirmation of the Elo system). Being the favourite on ratings, Abi should therefore gain less from a win than Oho would, and lose more from a defeat. From the PRS formula: $$ p_{Abi,Oho} = \frac{1}{1 + 10^{\frac{E_s - E_r}{q}}} = \frac{1}{1 + 10^{\frac{1865 - 2042}{400}}} = \frac{1}{1 + 10^{-0.4425}} \approx \frac{1}{1 + 0.361} \approx 0.7348 $$ It is not difficult to confirm $p_{Oho,Abi}\approx 0.2652$. It follows that if Abi beat Oho he would gain $35\times(1-0.7348) \approx 9.28$ points and, since $35\times(0-0.2652) \approx -9.28$, Oho would lose the same number of points. If Abi lost then $35\times(0-0.7348) \approx -25.72$, so Abi would lose 25.72 points and Oho would gain $35\times(1-0.2652) \approx 25.72$ points.
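The Abi versus Oho arithmetic above can be reproduced with a few lines of Python. This is only a minimal sketch: the function names `expected_score` and `rating_change` are my own labels for illustration (they are not taken from the program behind this site), and the update rule is the one implied by the worked figures, namely $k$ times (result minus estimated probability).

```python
def expected_score(e_r: float, e_s: float, q: float = 400.0) -> float:
    """Estimated probability that a rikishi rated e_r beats one rated e_s."""
    return 1.0 / (1.0 + 10.0 ** ((e_s - e_r) / q))


def rating_change(e_r: float, e_s: float, won: bool,
                  k: float = 35.0, q: float = 400.0) -> float:
    """Points gained (positive) or lost (negative) by the rikishi rated e_r."""
    result = 1.0 if won else 0.0
    return k * (result - expected_score(e_r, e_s, q))


# Ratings at the start of the March 2024 basho, as quoted in the text.
abi, oho = 2042, 1865

print("p(Abi beats Oho):", round(expected_score(abi, oho), 4))
print("p(Oho beats Abi):", round(expected_score(oho, abi), 4))
print("Abi wins: Abi", round(rating_change(abi, oho, won=True), 2),
      "Oho", round(rating_change(oho, abi, won=False), 2))
print("Oho wins: Abi", round(rating_change(abi, oho, won=False), 2),
      "Oho", round(rating_change(oho, abi, won=True), 2))
```

Whatever the result, the two changes are equal in size and opposite in sign, so the points one rikishi gains are exactly the points the other loses.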

Readers who are new to the Elo system can be forgiven for wondering what on Earth is going on! Although the formulae used may be difficult to understand,

points are awarded in proportion to expectations, and
the procedure for calculating the ratings is objective: a change in rating is dependent on (a) whether a rikishi beat a given opponent, (b) the Elo ratings of the rikishi involved and (c) nothing else.

Whether or not ratings predict future performance is less clear. Crucial to this question is the PRS formula. The letter '$p$' is used to indicate that $p_{r,s}$ is intended to be an estimate of the probability of $r$ beating $s$. It can't be the probability of $r$ beating $s$ because that's undefined in the case of a match between two rikishi who have never fought each other before. Any measure of relative "true ability" is likely to be subjective as discussed earlier.

The first point to note is that it is not unreasonable to call $p_{r,s}$ "the estimated probability of $r$ beating $s$" since it satisfies the fundamental requirement of a probability viz: $p_{r,s} + p_{s,r} = 1$ (proof: exercise). This is no accident since the formula for $p_{r,s}$ is that of a logistic function. TBD link with charts
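For readers who would rather skip the exercise, here is one way to see it. The shorthand $x$ is introduced only for this derivation: write $x = 10^{(E_s - E_r)/q}$, so that swapping $r$ and $s$ replaces $x$ with $1/x$. Then:

$$ p_{r,s} + p_{s,r} = \frac{1}{1+x} + \frac{1}{1+\frac{1}{x}} = \frac{1}{1+x} + \frac{x}{x+1} = \frac{1+x}{1+x} = 1 $$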

The second point is that it follows from the maths that the estimated probability of $r$ beating $s$ is:

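determined solely by the difference in the ratings, and larger the bigger that difference is in $r$'s favour (though not in strict proportion, since the curve is logistic);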
equal to 0.5 if the rikishi have the same rating and
never equal to 0 or 1 even if the difference in ratings is huge.
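These properties are easy to check numerically. The sketch below assumes $q=400$ and uses `expected_score` as a hypothetical name for the PRS formula; the ratings of 1177, 1000 and 1500 are made up purely to vary the inputs.

```python
def expected_score(e_r: float, e_s: float, q: float = 400.0) -> float:
    """Estimated probability that a rikishi rated e_r beats one rated e_s."""
    return 1.0 / (1.0 + 10.0 ** ((e_s - e_r) / q))


# Only the difference matters: both pairs below are 177 points apart.
print(expected_score(2042, 1865), expected_score(1177, 1000))

# Equal ratings give an even contest.
print(expected_score(1500, 1500))

# The estimate grows with the gap but never reaches 1,
# and the underdog's estimate never reaches 0.
for gap in (100, 400, 1000, 2000):
    favourite = expected_score(1000 + gap, 1000)
    underdog = expected_score(1000, 1000 + gap)
    print(gap, favourite, underdog)
```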

It seems to me that these are all desirable characteristics of a rating system, although it remains to be seen if $p_{r,s}$ is useful; i.e. to what extent is it a good predictor of results? We have also not yet considered the extent to which an Elo rating "records" a rikishi's past performance. These issues are discussed in much more detail in Analysis. TBD

For now, let's look at the one concrete example we have just seen. Both Abi and Oho started with 1,000 points when they appeared in Grand Sumo. By the end of the January 2024 basho their ratings had risen to 2042 and 1865 respectively. A minor confirmation of the Elo system is that Abi's rank of komusubi was higher than Oho's M3, and the same was true of their Elo ratings. However, was 0.7348 a reasonable estimate of the probability of Abi beating Oho in March 2024? Without thinking about it too much, my guess is that it's not an unreasonable estimate of success for a komusubi against an M3.

Model Parameters

We will delve into the mathematics of Elo ratings in a later section. However, whilst the PRS formula is on the page it seems appropriate to say more about the roles played by $b$, $k$ and $q$.

First, $b$ determines where the ratings start and has no other effect. I chose $b=1000$ because when I started this project the values I chose for the other parameters led to a yokozuna having a rating of around 2500. This made sumo Elo ratings familiar to people who know their chess Elo ratings, where 2500+ is the rating of a Grandmaster. Unfortunately, after refining the code, 2200 is currently the norm for yokozuna. (It is only Hakuho who has got anywhere near 2500.) The choice of $b$ really is arbitrary. For example, if you think $b'$ is a better initial value then an Elo rating $E'$ computed from that starting value is easily obtained from $E$, the rating computed from $b=1000$: $E'=E-1000+b'$.
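To illustrate why $b$ only shifts the ratings, here is a small sketch that replays the same results from two different starting values. Everything in it is an assumption made for illustration: the bout list and the names A, B and C are invented, `run_ratings` is a hypothetical helper rather than the code behind this site, and $k=35$, $q=400$ are the values used elsewhere on this page.

```python
def expected_score(e_r: float, e_s: float, q: float = 400.0) -> float:
    """Estimated probability that a rikishi rated e_r beats one rated e_s."""
    return 1.0 / (1.0 + 10.0 ** ((e_s - e_r) / q))


def run_ratings(bouts, b, k=35.0, q=400.0):
    """Replay (winner, loser) pairs in order; everyone starts at b."""
    ratings = {}
    for winner, loser in bouts:
        e_w = ratings.setdefault(winner, b)
        e_l = ratings.setdefault(loser, b)
        gain = k * (1.0 - expected_score(e_w, e_l, q))
        ratings[winner] = e_w + gain   # the winner gains ...
        ratings[loser] = e_l - gain    # ... exactly what the loser drops
    return ratings


# Invented results, purely for illustration.
bouts = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "A"), ("B", "A")]

base = run_ratings(bouts, b=1000)
shifted = run_ratings(bouts, b=1500)
for name in sorted(base):
    print(name, round(base[name], 2), round(shifted[name], 2),
          round(shifted[name] - base[name], 6))
```

Because the PRS formula only ever sees the difference between two ratings, every rating in the second run sits 500 points above its counterpart in the first (up to floating-point rounding), which is the relationship $E'=E-1000+b'$ with $b'=1500$.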

The parameter $k$ is the maximum possible number of points that can be won or lost in any one fight. It therefore determines the significance of the most recent results. If $k$ is too small then the ratings hardly move at all and if it is too big the ratings jump about wildly. I experimented and found that $k=35$ works for me.
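The bound imposed by $k$ can be seen by tabulating the points at stake for a range of rating gaps. This is a sketch only, assuming $k=35$ and $q=400$; the opponent rating of 1500 and the gaps are arbitrary choices, and `expected_score` is a hypothetical name for the PRS formula.

```python
def expected_score(e_r: float, e_s: float, q: float = 400.0) -> float:
    """Estimated probability that a rikishi rated e_r beats one rated e_s."""
    return 1.0 / (1.0 + 10.0 ** ((e_s - e_r) / q))


K = 35.0
table = []
for gap in (-800, -400, 0, 400, 800):
    p = expected_score(1500 + gap, 1500)
    win = K * (1.0 - p)    # points gained if the rikishi wins
    loss = K * (0.0 - p)   # points "gained" (negative) if the rikishi loses
    table.append((gap, win, loss))
    print(f"gap {gap:+5d}: win {win:6.2f}, loss {loss:7.2f}")
```

A heavy underdog who wins takes close to the full $k$ points and a heavy favourite who loses drops close to $k$, but neither figure ever reaches $k$ itself.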

Finally, the role played by $q$ is to control the extent to which past performance influences change in ratings. This is because if $q$ is (very) large then $p_{r,s}$ is around 0.5 irrespective of the values of $E_s$ and $E_r$. Conversely, when $q$ is small, the values of $p_{r,s}$ will be sensitive to $E_s-E_r$.
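The effect of $q$ on $p_{r,s}$ is easy to see with the Abi and Oho ratings from earlier. In this sketch the values of $q$ other than 400 are arbitrary choices for illustration, and `expected_score` is again a hypothetical name for the PRS formula.

```python
def expected_score(e_r: float, e_s: float, q: float = 400.0) -> float:
    """Estimated probability that a rikishi rated e_r beats one rated e_s."""
    return 1.0 / (1.0 + 10.0 ** ((e_s - e_r) / q))


abi, oho = 2042, 1865
by_q = {}
for q in (100, 400, 1600, 6400):
    by_q[q] = expected_score(abi, oho, q)
    print(f"q = {q:5d}: p(Abi beats Oho) = {by_q[q]:.4f}")
```

With a small $q$ the 177-point gap makes Abi an overwhelming favourite; as $q$ grows the same gap counts for less and the estimate drifts towards 0.5.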

However, $q$ also controls the volatility of adjustments to ratings: as $q$ grows the ratings change more slowly. This means that volatility is dependent on $k$ and $q$. I found that $q=400$ is used by others and in experiments it seemed to work well with $k=35$.

The parameters $k$ and $q$ are explored more fully in Analysis. TBD

There is in fact a fourth parameter: $d$, the date from which I started calculating the ratings. I am mostly interested in the makuuchi and juryo divisions, and the fine people at sumodb have full results going back to 1909. However, back then there were only two basho each year and I thought Elo ratings would change too slowly to be of interest. The number of basho increased over time until we got to the current norm of six in 1958. But I typed "January 1957" into my program, so that's when my Elo ratings start. When there were only five basho. And I forgot to add "D" to "BKQ". Ahem.
