Feat: ban IPs not sending a valid connection ID #1124

josecelano · 2024-12-09T12:03:19Z

This PR uses a Counting Bloom Filter to count IPs sending UDP requests with wrong connection IDs.

The IP is banned when the tracker receives more than 10 requests from a given IP with a bad connection ID. Bad connection IDs are cookie values that have expired or are from the future.
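As an illustration of what "expired or from the future" can mean, here is a minimal sketch of a time-window check on a cookie's issue time. The names `CookieError` and `check_issue_time`, and the `lifetime` and `tolerance` parameters, are hypothetical and are not taken from the tracker's code.

```rust
/// Why a connection ID was rejected (hypothetical names, for illustration).
#[derive(Debug, PartialEq)]
pub enum CookieError {
    Expired,
    FromTheFuture,
}

/// Checks that a cookie's issue time lies inside the accepted window.
///
/// All times are seconds since an arbitrary epoch. `lifetime` is how long a
/// cookie stays valid and `tolerance` allows for a small clock skew.
pub fn check_issue_time(
    issued_at: u64,
    now: u64,
    lifetime: u64,
    tolerance: u64,
) -> Result<(), CookieError> {
    // Issued later than "now" (plus skew): the cookie claims to be from the future.
    if issued_at > now.saturating_add(tolerance) {
        return Err(CookieError::FromTheFuture);
    }
    // Older than the allowed lifetime: the cookie has expired.
    if now.saturating_sub(issued_at) > lifetime {
        return Err(CookieError::Expired);
    }
    Ok(())
}
```

Each `Err` returned by a check like this would count as one error against the sender's IP.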

With the current CountingBloomFilter configuration (0.01 rate), we would expect roughly one false positive for every 100 IPs. A false positive happens when two IPs collide: if one of them is misbehaving, the other one would also be banned.
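To make the 0.01 rate concrete, the sketch below applies the textbook Bloom filter sizing formulas. It does not call the `bloom` crate, and the crate's internal sizing may differ. The function names are hypothetical, and the inputs assume that the arguments of `with_rate(4, 0.01, 100)` mean bits per counter, target false-positive rate and expected number of items.

```rust
/// Textbook number of slots for `n` items at false-positive rate `p`:
/// m = ceil(-n * ln(p) / (ln 2)^2)
pub fn optimal_slots(n: u32, p: f64) -> u64 {
    let ln2 = std::f64::consts::LN_2;
    (-(f64::from(n)) * p.ln() / (ln2 * ln2)).ceil() as u64
}

/// Textbook number of hash functions: k = ceil((m / n) * ln 2)
pub fn optimal_hashes(m: u64, n: u32) -> u32 {
    ((m as f64 / f64::from(n)) * std::f64::consts::LN_2).ceil() as u32
}

/// Expected number of false positives among `lookups` queries for keys
/// that were never inserted, at false-positive rate `p`.
pub fn expected_false_positives(lookups: u64, p: f64) -> f64 {
    lookups as f64 * p
}
```

With `n = 100` and `p = 0.01` these formulas give 959 slots and 7 hash functions. The target rate only holds up to the expected number of items; once many more distinct IPs are inserted, collisions become more frequent.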

To avoid false positives, we introduced a second counter with a HashMap. This consumes more memory, but it's reset every 120 seconds. The HashMap is only used when the CBF detects a potential bad client.
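The two-level idea can be sketched as follows. This is not the PR's code: `ToyCountingFilter` is a tiny stand-in written with the standard library only, so the example does not depend on the `bloom` crate, and `TwoLevelCounter` and its method names are hypothetical.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::net::IpAddr;

/// Toy counting filter standing in for a real Counting Bloom Filter.
pub struct ToyCountingFilter {
    counters: Vec<u32>,
    num_hashes: u64,
}

impl ToyCountingFilter {
    pub fn new(num_counters: usize, num_hashes: u64) -> Self {
        assert!(num_counters > 0 && num_hashes > 0);
        Self { counters: vec![0; num_counters], num_hashes }
    }

    fn slot(&self, ip: &IpAddr, seed: u64) -> usize {
        let mut hasher = DefaultHasher::new();
        seed.hash(&mut hasher);
        ip.hash(&mut hasher);
        (hasher.finish() % self.counters.len() as u64) as usize
    }

    /// Adds one to every slot that `ip` maps to.
    pub fn increment(&mut self, ip: &IpAddr) {
        for seed in 0..self.num_hashes {
            let i = self.slot(ip, seed);
            self.counters[i] = self.counters[i].saturating_add(1);
        }
    }

    /// Minimum over the slots of `ip`. Collisions can only inflate it,
    /// so it may overcount but does not undercount.
    pub fn estimate(&self, ip: &IpAddr) -> u32 {
        (0..self.num_hashes)
            .map(|seed| self.counters[self.slot(ip, seed)])
            .min()
            .unwrap_or(0)
    }
}

/// Fuzzy first-level counter plus an exact second-level counter.
pub struct TwoLevelCounter {
    max_errors: u32,
    fuzzy: ToyCountingFilter,
    exact: HashMap<IpAddr, u32>,
}

impl TwoLevelCounter {
    pub fn new(max_errors: u32) -> Self {
        Self { max_errors, fuzzy: ToyCountingFilter::new(1024, 4), exact: HashMap::new() }
    }

    /// Records one bad connection ID for `ip` in both counters.
    pub fn record_error(&mut self, ip: IpAddr) {
        self.fuzzy.increment(&ip);
        *self.exact.entry(ip).or_insert(0) += 1;
    }

    /// The fuzzy counter answers first; the exact map is consulted only
    /// when the fuzzy counter says the IP is over the limit.
    pub fn is_banned(&self, ip: &IpAddr) -> bool {
        if self.fuzzy.estimate(ip) <= self.max_errors {
            return false;
        }
        self.exact.get(ip).is_some_and(|count| *count > self.max_errors)
    }

    /// Clears both counters (done periodically, e.g. every 120 seconds).
    pub fn reset(&mut self) {
        self.fuzzy.counters.fill(0);
        self.exact.clear();
    }
}
```

In this sketch a well-behaved IP that collides with a misbehaving one in the fuzzy counter is still not banned, because it has no entry over the limit in the exact map.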

TODO

Straightforward implementation
Benchmarking (how much this new feature affects performance)
Add an E2E test
Remove IPs from the banned list every hour
Review filter settings CountingBloomFilter::with_rate(4, 0.01, 100)
Refactor: extract the IP ban service from the main loop
Benchmarking after extracting BanService

Questions

Should we add a configuration option for the maximum number of errors allowed?

Future PR

Add a metric to tracker stats for the number of banned IPs.
Ban subnets

codecov · 2024-12-09T12:24:23Z

Codecov Report

Attention: Patch coverage is 96.70330% with 3 lines in your changes missing coverage. Please review.

Project coverage is 76.20%. Comparing base (a7e20df) to head (29e506d).
Report is 4 commits behind head on develop.

Files with missing lines	Patch %	Lines
src/servers/udp/server/banning.rs	97.64%	1 Missing and 1 partial ⚠️
src/servers/udp/server/launcher.rs	83.33%	1 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff             @@
##           develop    #1124      +/-   ##
===========================================
+ Coverage    75.96%   76.20%   +0.24%     
===========================================
  Files          168      169       +1     
  Lines        11437    11528      +91     
  Branches     11437    11528      +91     
===========================================
+ Hits          8688     8785      +97     
+ Misses        2585     2580       -5     
+ Partials       164      163       -1

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

josecelano · 2024-12-09T15:02:18Z

Benchmarking results (from current implementation to new one):

Best case: requests out from 469963 to 445998 (-5.09%)
Worst case: requests out from 432509 to 422968 (-2.20%)
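For reference, the percentages in these summaries are relative changes against the PR base. A minimal sketch of the arithmetic (the helper name `percent_change` is hypothetical):

```rust
/// Relative change from `base` to `new`, in percent.
/// Negative values mean throughput went down.
pub fn percent_change(base: f64, new: f64) -> f64 {
    (new - base) / base * 100.0
}
```

Applied to the best-case and worst-case "Requests out" figures above, this gives approximately -5.1% and -2.2%.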

Counting Bloom filter values: CountingBloomFilter::with_rate(4, 0.01, 100)

PR base (current implementation)

Best case

Requests out: 469963.10/second
Responses in: 422699.98/second

Connect responses: 209285.78
Announce responses: 209256.51
Scrape responses: 4157.68
Error responses: 0.00
Peers per announce response: 0.00
Announce responses per info hash:
p10: 1
p25: 1
p50: 1
p75: 1
p90: 2
p95: 3
p99: 104
p99.9: 337
p100: 425

Worst case

Requests out: 432509.73/second
Responses in: 389143.07/second

Connect responses: 192802.08
Announce responses: 192473.11
Scrape responses: 3867.88
Error responses: 0.00
Peers per announce response: 0.00
Announce responses per info hash:
p10: 1
p25: 1
p50: 1
p75: 1
p90: 2
p95: 3
p99: 106
p99.9: 313
p100: 401

PR (new implementation)

Best case:

Requests out: 445998.57/second
Responses in: 401399.07/second

Connect responses: 198792.26
Announce responses: 198630.85
Scrape responses: 3975.96
Error responses: 0.00
Peers per announce response: 0.00
Announce responses per info hash:
p10: 1
p25: 1
p50: 1
p75: 1
p90: 2
p95: 3
p99: 104
p99.9: 321
p100: 405

Worst case:

Requests out: 422968.03/second
Responses in: 380671.43/second

Connect responses: 188407.04
Announce responses: 188497.77
Scrape responses: 3766.61
Error responses: 0.00
Peers per announce response: 0.00
Announce responses per info hash:
p10: 1
p25: 1
p50: 1
p75: 1
p90: 2
p95: 3
p99: 104
p99.9: 307
p100: 383

josecelano · 2024-12-10T09:47:44Z

Benchmarking results after extracting BanService:

Best case: requests out from 469963 to 364465 (-22.44%)
Worst case: requests out from 432509 to 337984 (-21.85%)

The problem is the way I clean bans. I'm going to move the check to the service to avoid executing the check to clean bans on each loop iteration.
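One way to keep the cleanup off the request loop is to run it on its own timer. The sketch below uses `std::thread` and a stop channel so that it is self-contained; that is an assumption made for illustration, and the tracker may do this differently (for example with an async task). `BanList` and `spawn_ban_cleaner` are hypothetical names.

```rust
use std::collections::HashSet;
use std::net::IpAddr;
use std::sync::mpsc::{self, RecvTimeoutError, Sender};
use std::sync::{Arc, Mutex};
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// Hypothetical shared ban list. The request loop only bans and looks up.
#[derive(Default)]
pub struct BanList {
    banned: HashSet<IpAddr>,
}

impl BanList {
    pub fn ban(&mut self, ip: IpAddr) {
        self.banned.insert(ip);
    }

    pub fn is_banned(&self, ip: &IpAddr) -> bool {
        self.banned.contains(ip)
    }

    pub fn reset(&mut self) {
        self.banned.clear();
    }
}

/// Spawns a thread that clears the ban list every `interval`.
/// Sending on (or dropping) the returned `Sender` stops the thread.
pub fn spawn_ban_cleaner(
    bans: Arc<Mutex<BanList>>,
    interval: Duration,
) -> (Sender<()>, JoinHandle<()>) {
    let (stop_tx, stop_rx) = mpsc::channel::<()>();
    let handle = thread::spawn(move || loop {
        match stop_rx.recv_timeout(interval) {
            // The interval elapsed without a stop signal: do the cleanup.
            Err(RecvTimeoutError::Timeout) => bans.lock().unwrap().reset(),
            // Stop requested, or the sender was dropped.
            Ok(()) | Err(RecvTimeoutError::Disconnected) => break,
        }
    });
    (stop_tx, handle)
}
```

The request loop then only takes the lock to ban or look up an IP, and never has to decide whether a cleanup is due.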

Counting Bloom filter values: CountingBloomFilter::with_rate(4, 0.01, 100)

PR (new implementation after extracting `BanService`)

Best case:

Requests out: 364465.51/second
Responses in: 328018.44/second

Connect responses: 162198.50
Announce responses: 162554.64
Scrape responses: 3265.30
Error responses: 0.00
Peers per announce response: 0.00
Announce responses per info hash:
p10: 1
p25: 1
p50: 1
p75: 1
p90: 2
p95: 3
p99: 104
p99.9: 267
p100: 331

Worst case:

Requests out: 337984.39/second
Responses in: 304185.64/second

Connect responses: 150492.27
Announce responses: 150661.45
Scrape responses: 3031.91
Error responses: 0.00
Peers per announce response: 0.00
Announce responses per info hash:
p10: 1
p25: 1
p50: 1
p75: 1
p90: 2
p95: 3
p99: 104
p99.9: 250

josecelano · 2024-12-10T10:18:09Z

Benchmarking results after extracting BanService:

Best case: requests out from 469963 to 440593 (-6.24%)
Worst case: requests out from 432509 to 416496 (-3.70%)

Counting Bloom filter values: CountingBloomFilter::with_rate(4, 0.01, 100)

PR (new implementation running ban cleaner in a new thread)

Best case:

Requests out: 440593.07/second
Responses in: 393355.73/second

Connect responses: 195003.12
Announce responses: 194468.01
Scrape responses: 3884.59
Error responses: 0.00
Peers per announce response: 0.00
Announce responses per info hash:
p10: 1
p25: 1
p50: 1
p75: 1
p90: 2
p95: 3
p99: 105
p99.9: 315
p100: 401

Worst case:

Requests out: 416496.57/second
Responses in: 374961.47/second

Connect responses: 185585.72
Announce responses: 185699.20
Scrape responses: 3676.56
Error responses: 0.00
Peers per announce response: 0.00
Announce responses per info hash:
p10: 1
p25: 1
p50: 1
p75: 1
p90: 2
p95: 3
p99: 104
p99.9: 303
p100: 375

josecelano · 2024-12-12T17:11:03Z

We are planning to make some big changes to this implementation to avoid False Positives:

6ca82e9 feat: [#1128] add new metric UDP total requests aborted (Jose Celano)
9499fd8 feat: [#1128] add new metric UDP total responses (Jose Celano)
286fe02 feat: [#1128] add new metric UDP total requests (Jose Celano)

Pull request description:

Add more metrics to the UDP tracker stats. The new values are:

- `udp4_requests`: total number of requests received from IPv4 clients.
- `udp6_requests`: total number of requests received from IPv6 clients.
- `udp4_responses`: total number of responses sent to IPv4 clients.
- `udp6_responses`: total number of responses sent to IPv6 clients.
- `udp_requests_aborted`: total number of requests aborted to make room in the active requests buffer.

### Notes

- Responses sent might differ from requests received because of aborted requests.
- When we [merge the IP ban service](#1124), we can add a new metric for the total number of IPs banned.
- I want to add these new metrics to the [live demo Grafana dashboard](torrust/torrust-demo#20).

### Subtasks

- [x] `udp4_requests`
- [x] `udp6_requests`
- [x] `udp4_responses`
- [x] `udp6_responses`
- [x] `udp_requests_aborted`
- [x] Benchmarking to check how it affects performance before merging it.

ACKs for top commit:
josecelano: ACK 6ca82e9

Tree-SHA512: 7fbf75b264b191f5c58fcecde8d5e783bbe54ee1c1799acdddc04a9ef64b7196d8b95d1bcad420b1df269bc7929e44417a1d164c6953b00804b0d1e5f0b36e7d

…limit

The live demo tracker is receiving many UDP requests with wrong connection IDs. Errors are logged (written to disk), and that decreases the tracker's performance. This counts errors and bans IPs for 2 minutes after 10 errors.

We use two levels of counters:

1. First level: a Counting Bloom Filter. Fast and low memory consumption, but inaccurate (False Positives).
2. Second level: a HashMap. An exact counter for IPs.

CBFs are fast and use little memory, but they are also inaccurate. They have False Positives, meaning some IPs would be banned only because of bucket collisions (IPs sharing the same counter). To avoid banning IPs incorrectly, we decided to introduce a second counter, a HashMap that counts errors precisely. IPs are only banned when this counter reaches the limit (over 10 errors). We keep the CBF as a first-level filter. It's a fast IP check that doesn't affect the tracker's performance. When the IP is banned according to the first filter, we double-check in the HashMap. Checking the CBF first is faster than always checking for banned IPs against the HashMap.

This solution should be good if the number of IPs is low. We have to find another solution anyway for IPv6, where it is cheaper to own a range of IPs.

Because we are using aquatic_udp_load_test with this configuration:

```
Starting client with config: Config {
    server_address: 127.0.0.1:3000,
    log_level: Error,
    workers: 1,
    duration: 0,
    summarize_last: 0,
    extra_statistics: true,
    network: NetworkConfig {
        multiple_client_ipv4s: true,
        sockets_per_worker: 4,
        recv_buffer: 8000000,
    },
    requests: RequestConfig {
        number_of_torrents: 1000000,
        number_of_peers: 2000000,
        scrape_max_torrents: 10,
        announce_peers_wanted: 30,
        weight_connect: 50,
        weight_announce: 50,
        weight_scrape: 1,
        peer_seeder_probability: 0.75,
    },
}
```

josecelano · 2024-12-16T16:48:11Z

Benchmarking results after adding the HashMap:

Best case: requests out from 417682 to 413429 (-1.01%)
Worst case: requests out from 404915 to 396124 (-2.17%)

Counting Bloom filter values: CountingBloomFilter::with_rate(4, 0.01, 100)

PR (new implementation running ban cleaner in a new thread)

Best case:

Requests out: 413429.53/second
Responses in: 371125.89/second

Connect responses: 184013.88
Announce responses: 183437.46
Scrape responses: 3674.55
Error responses: 0.00
Peers per announce response: 0.00
Announce responses per info hash:
p10: 1
p25: 1
p50: 1
p75: 1
p90: 2
p95: 3
p99: 105
p99.9: 299
p100: 373

Worst case:

Requests out: 396124.13/second
Responses in: 356491.42/second

Connect responses: 176396.93
Announce responses: 176532.52
Scrape responses: 3561.97
Error responses: 0.00
Peers per announce response: 0.00
Announce responses per info hash:
p10: 1
p25: 1
p50: 1
p75: 1
p90: 2
p95: 3
p99: 105
p99.9: 287
p100: 373

Develop Branch

Best case:

Requests out: 417682.64/second
Responses in: 375161.75/second

Connect responses: 185776.19
Announce responses: 185634.17
Scrape responses: 3751.40
Error responses: 0.00
Peers per announce response: 0.00
Announce responses per info hash:
p10: 1
p25: 1
p50: 1
p75: 1
p90: 2
p95: 3
p99: 105
p99.9: 305
p100: 393

Worst case:

Requests out: 404915.33/second
Responses in: 371497.74/second

Connect responses: 183964.50
Announce responses: 183821.18
Scrape responses: 3712.05
Error responses: 0.00
Peers per announce response: 0.00
Announce responses per info hash:
p10: 1
p25: 1
p50: 1
p75: 1
p90: 2
p95: 3
p99: 106
p99.9: 299
p100: 371

josecelano · 2024-12-16T16:48:31Z

ACK 29e506d

josecelano · 2024-12-16T17:09:32Z

Hi @da2ce7. Tomorrow (09:30 UTC) I will merge this, deploy it to the live demo, run it for some hours and compare the Grafana dashboard before and after the deployment. I will also compare CPU and memory consumption. I expect to see again the problem we had, where the tracker started consuming more and more memory until the docker container was restarted. I also expect the number of errors to decrease drastically because of the IP banning.

Current data (2024-12-17 09:29 UTC)

josecelano · 2024-12-18T09:35:22Z

Hi @da2ce7, after running the new IP ban filter for 24h, errors have decreased when comparing the ratio between announce requests and error responses.

Without ban service: Announces 450-650 -> Errors 70-110
With ban service: Announces 450-700 -> Errors 110-225

I don't have the exact value because I should have created a graph for the previous version with that rate.

They have not decreased much, possibly for two reasons:

The ban duration is too short, only two minutes, which is the common announce interval. I will increase it to 1 hour as in the first draft of the PR.
I think there are many "bad" clients, so even if you ban some, there are always more. It's not a subset of "attackers".

As you can see in the "UDP4 requests and responses (per second)" graph, we are not sending a response (because the IP was banned) to about 100 requests per second on average.

NOTES:

It's weird that memory consumption has not increased like some months or weeks ago. The number of torrents, seeders and leechers seems to be pretty stable. I wonder why that has changed.
Error responses are not only wrong connection IDs responses. They include any type of error.

Current data (2024-12-18 09:29 UTC)

josecelano changed the title ~~Feat: socket addresses not sending a valid connection ID~~ Feat: ban IP not sending a valid connection ID Dec 9, 2024

josecelano temporarily deployed to coverage December 9, 2024 12:03 — with GitHub Actions Inactive

josecelano changed the title ~~Feat: ban IP not sending a valid connection ID~~ Feat: ban IPs not sending a valid connection ID Dec 9, 2024

josecelano requested a review from da2ce7 December 9, 2024 12:03

josecelano force-pushed the 1096-ban-socket-addresses-not-sending-a-valid-connection-id branch from 60a9f29 to 3bb718d Compare December 9, 2024 12:06

josecelano temporarily deployed to coverage December 9, 2024 12:07 — with GitHub Actions Inactive

josecelano force-pushed the 1096-ban-socket-addresses-not-sending-a-valid-connection-id branch from 3bb718d to 30cae9b Compare December 9, 2024 12:09

josecelano linked an issue Dec 9, 2024 that may be closed by this pull request

Ban socket addresses not sending a valid connection ID #1096

Closed

josecelano temporarily deployed to coverage December 9, 2024 12:10 — with GitHub Actions Inactive

josecelano mentioned this pull request Dec 9, 2024

Ban socket addresses not sending a valid connection ID #1096

Closed

josecelano temporarily deployed to coverage December 9, 2024 15:15 — with GitHub Actions Inactive

josecelano had a problem deploying to coverage December 9, 2024 15:37 — with GitHub Actions Failure

josecelano force-pushed the 1096-ban-socket-addresses-not-sending-a-valid-connection-id branch from b94359f to 446a906 Compare December 9, 2024 16:02

josecelano temporarily deployed to coverage December 9, 2024 16:02 — with GitHub Actions Inactive

josecelano temporarily deployed to coverage December 9, 2024 16:05 — with GitHub Actions Inactive

josecelano mentioned this pull request Dec 9, 2024

Sdterror output from tracing running tests #1069

Closed

josecelano temporarily deployed to coverage December 9, 2024 16:36 — with GitHub Actions Inactive

josecelano force-pushed the 1096-ban-socket-addresses-not-sending-a-valid-connection-id branch from 2690bcd to 4801edf Compare December 9, 2024 16:37

josecelano temporarily deployed to coverage December 9, 2024 16:37 — with GitHub Actions Inactive

josecelano temporarily deployed to coverage December 9, 2024 18:44 — with GitHub Actions Inactive

da2ce7 mentioned this pull request Dec 10, 2024

Adjust Licence to GPL2+ nicklan/bloom-rs#11

Open

josecelano temporarily deployed to coverage December 10, 2024 10:15 — with GitHub Actions Inactive

josecelano force-pushed the 1096-ban-socket-addresses-not-sending-a-valid-connection-id branch from 010a2e5 to 77cf089 Compare December 10, 2024 10:29

josecelano temporarily deployed to coverage December 10, 2024 10:29 — with GitHub Actions Inactive

josecelano temporarily deployed to coverage December 10, 2024 10:47 — with GitHub Actions Inactive

josecelano temporarily deployed to coverage December 10, 2024 10:52 — with GitHub Actions Inactive

josecelano temporarily deployed to coverage December 10, 2024 11:01 — with GitHub Actions Inactive

josecelano marked this pull request as ready for review December 10, 2024 11:01

josecelano force-pushed the 1096-ban-socket-addresses-not-sending-a-valid-connection-id branch from d539959 to 7b4ec75 Compare December 10, 2024 11:03

josecelano temporarily deployed to coverage December 10, 2024 11:04 — with GitHub Actions Inactive

josecelano force-pushed the 1096-ban-socket-addresses-not-sending-a-valid-connection-id branch from 7b4ec75 to 88b447f Compare December 11, 2024 10:27

josecelano temporarily deployed to coverage December 11, 2024 10:28 — with GitHub Actions Inactive

josecelano mentioned this pull request Dec 13, 2024

Add more metrics to the UDP tracker stats #1130

Merged

6 tasks

chore(deps): add dependency bloom

87401e8

josecelano force-pushed the 1096-ban-socket-addresses-not-sending-a-valid-connection-id branch from 88b447f to 26c05e5 Compare December 16, 2024 11:15

josecelano temporarily deployed to coverage December 16, 2024 11:15 — with GitHub Actions Inactive

josecelano temporarily deployed to coverage December 16, 2024 16:16 — with GitHub Actions Inactive

josecelano force-pushed the 1096-ban-socket-addresses-not-sending-a-valid-connection-id branch from 6af01ff to e0f54d3 Compare December 16, 2024 16:25

josecelano temporarily deployed to coverage December 16, 2024 16:25 — with GitHub Actions Inactive

josecelano added 2 commits December 16, 2024 16:36

josecelano force-pushed the 1096-ban-socket-addresses-not-sending-a-valid-connection-id branch from e0f54d3 to 29e506d Compare December 16, 2024 16:39

josecelano temporarily deployed to coverage December 16, 2024 16:40 — with GitHub Actions Inactive

josecelano merged commit 208694f into torrust:develop Dec 17, 2024
23 checks passed

josecelano mentioned this pull request Dec 17, 2024

Investigate how often clients in the wild regenerate UDP connection IDs/decide what to do chihaya/chihaya#417

Open

This was referenced Dec 18, 2024

Increase IP ban duration to 1 hour #1139

Closed

Add more metrics #1145

Closed

Check UDP errors after increasing IP ban duration from 2 minutes to 1 hour torrust/torrust-demo#28

Closed
