
Small elo adjustment#27

Draft
jameshi16 wants to merge 2 commits into main from improve-elo

Conversation

@jameshi16
Collaborator

@jameshi16 jameshi16 commented Aug 27, 2024

Changes

  • Remove cap for text-based scoring; percentile-based scoring will balance this for us.

Checklist

  • My code compiles
  • I have committed all the files needed to build the project (check if your file is found in .gitignore)
  • If I'm introducing a new step in the build process, I have documented / automated it
  • I have tested my changes (minimally with one Twitch VOD)

Changes are deployed

Related cards

Neuro Chat Elo Development

@jameshi16 jameshi16 added bug Something isn't working enhancement New feature or request labels Aug 27, 2024
@jameshi16 jameshi16 added this to the Beta milestone Aug 27, 2024
@jameshi16 jameshi16 self-assigned this Aug 27, 2024
@jameshi16
Collaborator Author

@owobred @Gaijutsu

I present you an interesting problem we're facing: static default elo.

The way elo is calculated allows for negative elo, which is theoretically allowed as far as elo goes. In our code base, it is almost necessary, because:

  • Percentile-based elo scoring relies on how the best and the worst users are performing to score everyone. If a minimum elo is set, the percentile scoring will be heavily skewed. (Try setting a minimum elo for the Discord leaderboards: only Heir will get points, because they're first.)
  • How negative or positive someone's elo is encodes the user's history, which is useful when awarding elo. The larger the difference, the larger the improvement. A fixed minimum elo diminishes this signal.

Now, this also means that the median elo constantly changes. There are cases where the entire leaderboard goes through a lull period, which presents an interesting situation where the default elo is higher than everybody on the leaderboard. This, obviously, skews calculations.

One way to fix this is to inject all new users into the median of the leaderboard.
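A quick sketch of what I mean, in Python for illustration (the `leaderboard` mapping and the `default_elo` helper are hypothetical names, not anything in the codebase):

```python
from statistics import median

def default_elo(leaderboard: dict[str, float], fallback: float = 1200.0) -> float:
    """Elo assigned to a user seen for the first time."""
    if not leaderboard:
        return fallback  # nobody to take a median over: use the static default
    return median(leaderboard.values())
```

This keeps new users neutral relative to the current population instead of pinning them to 1200 regardless of where everyone else sits.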

Do you folks have any thoughts?

@jameshi16
Collaborator Author

(fyi bilibili is broken for roughly the same reason)

@owobred
Collaborator

owobred commented Aug 28, 2024

This is the current elo distribution for the overall leaderboard, which feels pretty messed up.
[image: rank distribution of Overall leaderboard]
(red line is 1200 elo, yellow is mean, orange is median)
This is very different from the distributions on sites like lichess.org, which cluster in the middle.
[image: lichess.org rating distribution]

I feel like there is something else going on beyond just the starting elo causing issues. Not sure where it's coming from, though. My only guess would be people who only chat once, or something similar.

In terms of actually answering your question, I feel like injecting new users into the median (or somewhere else) would just cause elo to plummet to increasingly negative numbers every stream. I think it might be worth re-evaluating how elo gets calculated, as something seems amiss here 🤷

@jameshi16
Collaborator Author

jameshi16 commented Aug 28, 2024

Thanks for the data @owobred. I think this has given me enough visibility to say that we're not using elo correctly.

To understand this, let's break down the ideal scoring process and define some scoring requirements, and then talk about why our current implementation of elo won't work at all.

Our scoring process:

  1. All users have an overall number (this is what we see as elo) per leaderboard. If they don't already have a number, they will be assigned 1200 (this is the static default in the current behaviour).
  2. In a period of 2 hours, users compete to obtain scores; this comes from chatting, subs, bits, etc.
  3. These scores are accumulated using some arbitrary formula, producing one number for each person per leaderboard.
  4. These numbers are used to determine a user's position relative to another user. Let's call the result the ephemeral ranking, because they only represent a user's position for that particular stream.
  5. Based on the ephemeral rankings, we want to update the overall number.
  6. The sorted list of overall numbers will then produce the chatter's final ranking.
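The ranking steps above (3, 4, and 6) can be sketched as follows; `ephemeral_ranking` and `final_ranking` are hypothetical names for illustration, not functions in our codebase:

```python
DEFAULT_ELO = 1200.0  # the static default from step 1

def ephemeral_ranking(scores: dict[str, float]) -> list[str]:
    # step 4: per-stream positions, highest accumulated score first
    return sorted(scores, key=scores.get, reverse=True)

def final_ranking(overall: dict[str, float]) -> list[str]:
    # step 6: the sorted overall numbers give the chatter's final ranking
    return sorted(overall, key=overall.get, reverse=True)
```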

Our scoring requirements:

  • A consistent user should always outrank a new user
  • Bursts of high amounts of activity should average out to moderate consistent activity (definition of activity = per stream)
  • A consistent user turned inconsistent / inactive should fall off
  • The numbers should remain manageable (we shouldn't get people with 30k score for example)

Even though we've implemented the core scoring algorithm practically the same way as chess, the way we're using it is not the same, because each user can be battling people way outside their league, very unlike chess.

  • Usually, you're only matched with people around your elo. If A places above B despite having lower elo, A is rewarded and B is penalised, in proportion to the difference in their elo.
  • If A has 1200 elo, and everyone else on the leaderboard has < 1200 elo, then under the current algorithm A will always win every battle; even if they only challenge 100 sampled users, they'll gain on the order of K * 100 * difference elo. This is a lot of elo.
    • Suppose B is second to A. B will lose to A, and then win K * 99 * difference elo. However, it is possible that K * difference between A and B > K * 99 * difference, which can happen if we introduce min-cap elo, or if everyone happens to have the same elo (same median as seen above).
    • etc.
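To make the runaway concrete, here is the standard Elo expected-score and update rule applied to a 1200-elo user who beats 100 sub-1200 opponents. The usual chess constants (K = 32, 400-point scale) are assumptions for illustration, not necessarily what our code uses:

```python
def expected(r_a: float, r_b: float) -> float:
    # probability that A wins, under the standard logistic Elo model
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))

def update(r_a: float, r_b: float, score_a: float, k: float = 32.0) -> float:
    # winner gains more when the win was less expected
    return r_a + k * (score_a - expected(r_a, r_b))

r = 1200.0
for _ in range(100):  # A wins every battle against 1000-elo opponents
    r = update(r, 1000.0, 1.0)
# r has climbed by several hundred points in a single round of battles
```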

So, this causes the median to become more and more negative, which would violate our scoring requirements because it'll be hard not to be negative.

To fix this, we need to work under the same assumptions chess elo was designed for. That is to say: for each user, find maybe the 10 closest users to them, and battle against those elo scores, without committing the elo on each iteration. This would pit each user against others in the same league, conforming more closely to how elo is meant to be used. It would also allow us to implement a minimum elo, since battling is done within a user's rank locality.

@jameshi16
Collaborator Author

pinging @Gaijutsu for visibility

@jameshi16
Collaborator Author

(Time-dependent) may try to make some changes and test them on the stream tonight

@AlsoGaijutsu
Collaborator

AlsoGaijutsu commented Aug 28, 2024

That's a well reasoned analysis, and I completely agree with the idea of localised elo matches.

The proposed scoring requirements are good, but they remove a dynamic element from the leaderboard. The changing of positions in the current scoring system is an issue; however, we might benefit from having some sort of per-stream leaderboard based purely on that stream's ephemeral rank, or maybe a 'Top X of Today's Stream'. Implementing a local method will make the leaderboard largely settle, assuming users' interaction rates average out across streams. This is based on how VOD elo does a single recalculation at the end of a stream, so only one round of local elo matches occurs. This may not be the case depending on how live elo does its recalculations.

Localised elo matches would also cut the elo recalculation work by roughly 100x (for the closest 10 users, at least), which is awesome for live elo.

@AlsoGaijutsu
Collaborator

AlsoGaijutsu commented Aug 29, 2024

I've done a quick test run with a version of localised elo matches and compared it with our current implementation. The results look very promising already.

My implementation sorts the users into groups of 9 by elo, and has all users in each group battle each other to update elo scores. The tests below are the result of running a backfill on the last 5 streams.

Current Implementation

[image: OldMethod]

Localised Elo

[image: NewMethod]

It's interesting that elos cluster around specific peaks for localised elo, though this is certainly a side effect of my implementation.

I also created a version that creates a range around each user instead of set groups.
[image: AdjNewMethod]

Overall, I think localised elo is definitely the way to go. It just needs a proper implementation.

@jameshi16
Collaborator Author

jameshi16 commented Aug 30, 2024

The proposed scoring requirements are good, but removes a dynamic element to the leaderboard.

These aren't "proposed" scoring requirements; they've always been that way. Those are the whole reason why the elo scoring method is used in the first place. Seems like it has truly failed its purpose if the dev who ported it over to Rust didn't know 😅

Also, I'm not certain I understand why meeting the scoring requirements would remove the dynamic element from the leaderboard.

The changing of positions in the current scoring system is an issue, however we might benefit from having some sort of leaderboard for the stream based purely on that stream's ephemeral rank, or maybe a 'Top X of Today's Stream'.

This is a good feature request, but it feels largely irrelevant to this discussion.

Implementing a local method will make the leaderboard largely settle, assuming users interaction rate averages across all streams. This is based on how VOD elo does a single recalculation at the end of a stream, so only one round of local elo matches occurs.

Really? As I see it, it'll actually vary more compared to the previous method. In fact, in the graphs you've posted, it does indeed vary more; fewer people are at the median.

This may not be the case depending on how live elo does its elo recalculations.

Having worked a little on Live Elo (to try and fix a bug), live elo does two kinds of calculations:

  • All the performance thus far, every 30 seconds
  • At the end of the program

Between t and t + 30, the scores can change drastically, and hence the elo can change drastically.

I've done a quick test run with a version of localised elo matches and compared it with our current implementation. The results look very promising already.

Thanks for this! ❤️ I'm implementing something myself, so it might be good to commit your changes into a branch somewhere for inspection.

My implementation sorts the users into groups of 9 by elo, and had all users in each group battle each other to update elo scores. The below tests are the result of running a backfill on the last 5 streams.

I need slightly more details here. How are the groups formed? Are they overlapping windows, or separate chunks of 9?

i.e. suppose I am a user in the middle of the list. Do I battle 8 * 8 times = 64 times, or only 8 times ever?

In my own implementation, I have a "partial window" of users centered around the user I want to force a battle with:

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10], say I want to take a window of 5. Then:

(0, [1, 2, 3]) -> 1 battles with 2 and 3
(1, [1, 2, 3, 4]) -> 2 battles with 1, 3, and 4
(2, [1, 2, 3, 4, 5]) -> 3 battles with 1, 2, 4, and 5
(2, [2, 3, 4, 5, 6])
(2, [3, 4, 5, 6, 7])
...
(2, [6, 7, 8, 9, 10]) -> 8 battles with 6, 7, 9 and 10
(2, [7, 8, 9, 10]) -> 9 battles with 7, 8, and 10
(2, [8, 9, 10]) -> 10 battles with 8 and 9

where the tuple represents (center index in slice, slice)

(few edits here because I got confused about my own algorithm)

So in each slice, only the center user actually gets their elo updated in the battle. In effect, each user will only have at most window_size - 1 battles.
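A rough sketch of that pairing rule (hypothetical helper name; users are assumed pre-sorted by elo, and only the centre user's opponents are returned):

```python
def window_opponents(ranked: list[int], i: int, window: int = 5) -> list[int]:
    # take up to `window // 2` neighbours on each side of user i,
    # clamping at the ends of the list as in the example above
    half = window // 2
    lo = max(0, i - half)
    hi = min(len(ranked), i + half + 1)
    return [ranked[j] for j in range(lo, hi) if j != i]
```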

It's interesting that elos cluster around specific peaks for localised elo, though this is certainly a side effect of my implementation.

Yep this seems strange to me, although the general shape seems right.

I also created a version that creates a range around each user instead of set groups.

So a range of at most 9 users? (i.e. the same as the partial window above?)

If so, the graph looks weird to me. I expect a more distributed graph.

I have some changes stashed, so I'll probably compare it shortly

@AlsoGaijutsu
Collaborator

AlsoGaijutsu commented Aug 30, 2024

Sorry, said the wrong thing 😅; my brain grabbed the first word it thought of. I shouldn't have used 'proposed'; I just meant something more along the lines of 'the requirements as written'.

In any case, my point regarding the leaderboard being dynamic was more that someone could rise or fall a significant number of places in a short period. This very much isn't intended behaviour on our leaderboards as they are now, but it was fun to suddenly see new names at the top from time to time. This is a separate discussion though, my bad 🙏

As for the implementation details, the grouped implementation uses non-overlapping groups of 9, and each user fights 8 battles, one with each other member of their group. I'm dropping this in favour of the partial window approach.

My sliding partial approach is very similar to yours, except it maintains a fixed window size.

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10], say I want to take a window of 5. Then:

(0, [1, 2, 3, 4, 5]) -> 1 battles with 2, 3, 4, and 5
(1, [1, 2, 3, 4, 5]) -> 2 battles with 1, 3, 4, and 5
(2, [1, 2, 3, 4, 5]) -> 3 battles with 1, 2, 4, and 5
(2, [2, 3, 4, 5, 6])
(2, [3, 4, 5, 6, 7])
...
(2, [6, 7, 8, 9, 10]) -> 8 battles with 6, 7, 9 and 10
(3, [6, 7, 8, 9, 10]) -> 9 battles with 6, 7, 8, and 10
(4, [6, 7, 8, 9, 10]) -> 10 battles with 6, 7, 8 and 9

where the tuple represents (center index in slice, slice)

As in yours, only the center user gets updated. These updates are stored in a HashMap to be applied only after all battles have been fought.
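A sketch of this fixed-size variant (hypothetical name; the window start is clamped so the slice always holds exactly `window` users, matching the example above):

```python
def fixed_window(ranked: list[int], i: int, window: int = 5) -> tuple[int, list[int]]:
    # clamp the start so the slice never runs off either end of the list
    start = min(max(0, i - window // 2), len(ranked) - window)
    return i - start, ranked[start:start + window]
```

The deferred-update part is then just a matter of collecting each centre user's new elo into a map and writing them all back after every battle has been fought.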

The group size was 9 for both test runs. The vast majority of users ended up on integers for some reason ¯\_(ツ)_/¯. I'll keep looking into this.

I did my testing in a fork. It is VERY scuffed, so not sure how helpful it can be for now. The per-user-range is the only branch vaguely worth looking at

@jameshi16
Collaborator Author

jameshi16 commented Aug 30, 2024

Here are my graphs for a single backfill:
[image]

As you can see, it has a very nice distribution (File 2 (chat only), 4 (non-vips), and 5 (overall)).

As a side note: I personally think Copypasta Leaders should no longer be a leaderboard. (oblivious)

Will post 5 backfills and my fork with my implementation of partial window soon ™️

@jameshi16
Collaborator Author

5 backfills look like this:

[image]

I think this looks pretty good!

@jameshi16
Collaborator Author

Alright, based on today's stream, I think we need a slight adjustment to how we're selecting people in the window. It looks like we might do better treating people with the same elo as one person when adding them to the window, otherwise we might experience a cold-start issue.
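One hedged sketch of that tweak (hypothetical name, not committed anywhere): deduplicate equal elos before filling the window, so a cold start where everyone sits at the default doesn't fill the whole window with ties:

```python
def window_candidates(elos: list[float], centre: float, window: int = 5) -> list[float]:
    # collapse users with identical elo into a single window slot,
    # then take the `window` distinct values nearest the centre
    distinct = sorted(set(elos))                  # deterministic base order
    distinct.sort(key=lambda e: abs(e - centre))  # stable: ties broken by value
    return distinct[:window]
```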

@owobred
Collaborator

owobred commented Aug 31, 2024

I have a potential idea on how to approach this (though I haven't tested it at all, might be a terrible idea). Essentially, each player has a "matchmaking budget" which dictates how many players they can match against, based on how far apart they are.

I wrote some python (on my phone, idk if it's valid 🥺) that kinda conveys my idea. It also includes the budget -= 1 line to put an upper bound on the number of opponents.

class Player:
  elo: float
  score: float

current_player: Player # not in `players` list
players: list[Player]
players.sort(key=lambda p: abs(p.elo - current_player.elo))

budget = 100
opponents: list[Player] = []

for opponent in players:
  if budget <= 0:
    break

  budget -= 1
  budget -= abs(opponent.elo - current_player.elo) # could be scaled in some way to favour closer players
  opponents.append(opponent)

# ... calculate new elo against opponents

@jameshi16
Collaborator Author

I have a potential idea on how to approach this (though I haven't tested it at all, might be a terrible idea). Essentially, each player has a "matchmaking budget" which dictates how many players they can match against, based on how far apart they are.

I wrote some python (on my phone, idk if it's valid 🥺) that kinda conveys my idea. It also includes the budget -= 1 line to put an upper bound on the number of opponents.


This looks pretty interesting. Might implement a test along these lines just to check it out

@jameshi16
Collaborator Author

New idea. Instead of using Elo, which can take up to 3 minutes to calculate, we want to update a distribution based on observed data.

This is useful especially in live-elo, since we now have the element of time to play with.

Here's the idea:

  • Our scoring systems will still use the current implementation
  • We will still rely on Vec<TimestampedPerformances>. This gives us a velocity for performances, which is useful for our current use case.
  • In the elo calculation, we will update a prior distribution based on the observed performance velocity.
  • We will store the velocity. We will then take the weighted average of the previous velocity and the current velocity to compute the effective velocity.
  • As a score, we will report to live-elo-websocket-proxy the percentile of observing x (either likelihood or Cumulative Density Function). Alternatively, if we can get some measure of mean and standard deviation, we can math out the equivalent when mean = 800, standard deviation = 200 (chess elo)
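For the last bullet, the percentile-to-chess-scale mapping has a direct closed form if the fitted distribution is Normal. A sketch under that assumption (function name hypothetical):

```python
from statistics import NormalDist

def chess_like_elo(velocity: float, mu: float, sigma: float) -> float:
    pct = NormalDist(mu, sigma).cdf(velocity)      # CDF: percentile of observing x
    return NormalDist(800.0, 200.0).inv_cdf(pct)   # same percentile at mean 800, sd 200
```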

Ideally this solves two problems:

  • Elo isn't a strictly monotonic function of performance (we have achieved monotonicity here, but it requires tweaking a hyperparameter, which can inflate the largest elo)
  • It will no longer take 3 minutes to calculate elo sometimes

Writing this up was the easy part; actually writing the code / doing the math will be difficult :^)
[If you can find a way for me to update a Gaussian prior distribution and get a Gaussian posterior, please let me know, I am desperate]
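On the bracketed plea: a Gaussian prior is conjugate to a Gaussian likelihood with known observation variance, so the posterior is again Gaussian, in closed form:

```python
def gaussian_update(mu0: float, var0: float, obs: float, var_obs: float) -> tuple[float, float]:
    # posterior precision = prior precision + observation precision;
    # posterior mean = precision-weighted average of prior mean and observation
    var_post = 1.0 / (1.0 / var0 + 1.0 / var_obs)
    mu_post = var_post * (mu0 / var0 + obs / var_obs)
    return mu_post, var_post
```

The caveat is that this only holds when the observation noise variance is treated as known; with unknown variance the conjugate posterior is Normal-inverse-gamma instead.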

