@josecelano
Member

@josecelano josecelano commented Dec 9, 2024

This PR uses a Counting Bloom Filter (CBF) to count the IPs that send UDP requests with wrong connection IDs.

The IP is banned when the tracker receives more than 10 requests from a given IP with a bad connection ID. Bad connection IDs are cookie values that have expired or are from the future.
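A minimal sketch of the "expired or from the future" check, assuming the issue time can be recovered from the cookie as a Unix-style timestamp. The names `check_cookie`, `CookieCheck` and `COOKIE_LIFETIME`, and the 120-second lifetime, are illustrative assumptions and are not taken from the tracker's code.

```rust
use std::time::Duration;

/// Assumed cookie lifetime. The PR does not state the value the tracker uses.
pub const COOKIE_LIFETIME: Duration = Duration::from_secs(120);

/// Outcome of validating the issue time recovered from a connection ID.
#[derive(Debug, PartialEq, Eq)]
pub enum CookieCheck {
    Valid,
    Expired,
    FromTheFuture,
}

/// Classifies a cookie by its issue time.
/// `issued_at` and `now` are seconds since the same (arbitrary) epoch.
pub fn check_cookie(issued_at: u64, now: u64, lifetime: Duration) -> CookieCheck {
    if issued_at > now {
        // The cookie claims to have been issued after the current time.
        CookieCheck::FromTheFuture
    } else if now - issued_at > lifetime.as_secs() {
        // The cookie is older than the allowed lifetime.
        CookieCheck::Expired
    } else {
        CookieCheck::Valid
    }
}
```

Under these assumptions, either non-`Valid` result would count as one error against the sender's IP.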

With the current CountingBloomFilter configuration (0.01 false-positive rate), we would expect roughly one false positive for every 100 IPs. When two IPs collide and one of them is misbehaving, the other one would also be banned.

To avoid false positives, we introduced a second counter with a HashMap. This consumes more memory, but it's reset every 120 seconds. The HashMap is only used when the CBF detects a potential bad client.
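The two-level idea can be sketched as follows. This is a simplified illustration using only the standard library, not the PR's code: `CountingBloom` is a toy stand-in for the real `CountingBloomFilter`, and the names `BanService`, `record_error`, `is_banned` and `reset`, as well as the filter size, are assumptions. Here both counters are updated on every error and the HashMap is only read once the filter flags the IP; the PR may split the work differently.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::net::IpAddr;

/// Toy counting Bloom filter: each item maps to `hashes` counters.
/// `size` must be greater than zero.
pub struct CountingBloom {
    counters: Vec<u8>,
    hashes: u64,
}

impl CountingBloom {
    pub fn new(size: usize, hashes: u64) -> Self {
        Self { counters: vec![0; size], hashes }
    }

    /// Position of `item` for the hash function identified by `seed`.
    fn index(&self, item: &IpAddr, seed: u64) -> usize {
        let mut hasher = DefaultHasher::new();
        seed.hash(&mut hasher);
        item.hash(&mut hasher);
        (hasher.finish() % self.counters.len() as u64) as usize
    }

    pub fn insert(&mut self, item: &IpAddr) {
        for seed in 0..self.hashes {
            let i = self.index(item, seed);
            self.counters[i] = self.counters[i].saturating_add(1);
        }
    }

    /// Estimated count: never below the true count (until a counter
    /// saturates at 255), but collisions can inflate it.
    pub fn estimate(&self, item: &IpAddr) -> u32 {
        (0..self.hashes)
            .map(|seed| u32::from(self.counters[self.index(item, seed)]))
            .min()
            .unwrap_or(0)
    }

    pub fn clear(&mut self) {
        self.counters.fill(0);
    }
}

/// Hypothetical ban service combining the fuzzy and the exact counter.
pub struct BanService {
    max_errors: u32,
    fuzzy: CountingBloom,
    exact: HashMap<IpAddr, u32>,
}

impl BanService {
    pub fn new(max_errors: u32) -> Self {
        Self { max_errors, fuzzy: CountingBloom::new(1024, 4), exact: HashMap::new() }
    }

    /// Records one bad connection ID for `ip` in both counters.
    pub fn record_error(&mut self, ip: IpAddr) {
        self.fuzzy.insert(&ip);
        *self.exact.entry(ip).or_insert(0) += 1;
    }

    /// Cheap filter check first; the HashMap is consulted only when the
    /// filter already reports more than `max_errors`.
    pub fn is_banned(&self, ip: &IpAddr) -> bool {
        self.fuzzy.estimate(ip) > self.max_errors
            && self.exact.get(ip).copied().unwrap_or(0) > self.max_errors
    }

    /// Clears both counters (the description mentions a 120-second reset).
    pub fn reset(&mut self) {
        self.fuzzy.clear();
        self.exact.clear();
    }
}
```

With a limit of 10, an IP is reported as banned from its 11th error onwards, and an IP that merely shares filter counters with it is not, because its exact count stays at zero.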

TODO

  • Straightforward implementation
  • Benchmarking (how much this new feature affects performance)
  • Add an E2E test
  • Remove IPs from the banned list every hour
  • Review filter settings CountingBloomFilter::with_rate(4, 0.01, 100)
  • Refactor: extract the IP ban service from the main loop
  • Benchmarking after extracting BanService

Questions

  • Should we add a configuration option for the maximum number of errors allowed?

Future PR

  • Add a metric to tracker stats for the number of banned IPs.
  • Ban subnets

@josecelano josecelano changed the title from "Feat: socket addresses not sending a valid connection ID" to "Feat: ban IP not sending a valid connection ID" Dec 9, 2024
@josecelano josecelano changed the title from "Feat: ban IP not sending a valid connection ID" to "Feat: ban IPs not sending a valid connection ID" Dec 9, 2024
@josecelano josecelano requested a review from da2ce7 December 9, 2024 12:03
@josecelano josecelano force-pushed the 1096-ban-socket-addresses-not-sending-a-valid-connection-id branch from 60a9f29 to 3bb718d Compare December 9, 2024 12:06
@josecelano josecelano force-pushed the 1096-ban-socket-addresses-not-sending-a-valid-connection-id branch from 3bb718d to 30cae9b Compare December 9, 2024 12:09
@josecelano josecelano linked an issue Dec 9, 2024 that may be closed by this pull request
@codecov

codecov bot commented Dec 9, 2024

Codecov Report

Attention: Patch coverage is 96.70330% with 3 lines in your changes missing coverage. Please review.

Project coverage is 76.20%. Comparing base (a7e20df) to head (29e506d).
Report is 4 commits behind head on develop.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/servers/udp/server/banning.rs | 97.64% | 1 Missing and 1 partial ⚠️ |
| src/servers/udp/server/launcher.rs | 83.33% | 1 Missing ⚠️ |
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #1124      +/-   ##
===========================================
+ Coverage    75.96%   76.20%   +0.24%     
===========================================
  Files          168      169       +1     
  Lines        11437    11528      +91     
  Branches     11437    11528      +91     
===========================================
+ Hits          8688     8785      +97     
+ Misses        2585     2580       -5     
+ Partials       164      163       -1     


@josecelano
Member Author

josecelano commented Dec 9, 2024

Benchmarking results (from current implementation to new one):

  • Best case: requests out went from 469963 to 445998 (-5.09%)
  • Worst case: requests out went from 432509 to 422968 (-2.20%)
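For reference, the percentages can be recomputed from the raw "Requests out" figures with a small helper. The function name `percent_change` is only illustrative.

```rust
/// Relative change from `base` to `new`, expressed in percent.
/// Negative values mean the new figure is lower than the base.
pub fn percent_change(base: f64, new: f64) -> f64 {
    (new - base) / base * 100.0
}
```

For example, `percent_change(469963.0, 445998.0)` gives the best-case difference between the PR base and the new implementation.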

Counting Bloom filter values: CountingBloomFilter::with_rate(4, 0.01, 100)

PR base (current implementation)

Best case

Requests out: 469963.10/second
Responses in: 422699.98/second

  • Connect responses: 209285.78
  • Announce responses: 209256.51
  • Scrape responses: 4157.68
  • Error responses: 0.00
Peers per announce response: 0.00
Announce responses per info hash:
  • p10: 1
  • p25: 1
  • p50: 1
  • p75: 1
  • p90: 2
  • p95: 3
  • p99: 104
  • p99.9: 337
  • p100: 425

Worst case

Requests out: 432509.73/second
Responses in: 389143.07/second

  • Connect responses: 192802.08
  • Announce responses: 192473.11
  • Scrape responses: 3867.88
  • Error responses: 0.00
Peers per announce response: 0.00
Announce responses per info hash:
  • p10: 1
  • p25: 1
  • p50: 1
  • p75: 1
  • p90: 2
  • p95: 3
  • p99: 106
  • p99.9: 313
  • p100: 401

PR (new implementation)

Best case:

Requests out: 445998.57/second
Responses in: 401399.07/second

  • Connect responses: 198792.26
  • Announce responses: 198630.85
  • Scrape responses: 3975.96
  • Error responses: 0.00
Peers per announce response: 0.00
Announce responses per info hash:
  • p10: 1
  • p25: 1
  • p50: 1
  • p75: 1
  • p90: 2
  • p95: 3
  • p99: 104
  • p99.9: 321
  • p100: 405

Worst case:

Requests out: 422968.03/second
Responses in: 380671.43/second

  • Connect responses: 188407.04
  • Announce responses: 188497.77
  • Scrape responses: 3766.61
  • Error responses: 0.00
Peers per announce response: 0.00
Announce responses per info hash:
  • p10: 1
  • p25: 1
  • p50: 1
  • p75: 1
  • p90: 2
  • p95: 3
  • p99: 104
  • p99.9: 307
  • p100: 383

@josecelano
Member Author

Benchmarking results after extracting BanService:

  • Best case: requests out went from 469963 to 364465 (-22.44%)
  • Worst case: requests out went from 432509 to 337984 (-21.85%)

The problem is the way I clean bans. I'm going to move the check to the service to avoid executing the check to clean bans on each loop iteration.
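One way to keep the cleanup out of the request loop is to run it on its own timer. The sketch below uses a plain `std::thread` and a shared `HashMap` for brevity; the names `ErrorCounts` and `spawn_ban_cleaner`, the stop flag, and the choice of a standard-library thread instead of an async task are all assumptions, not the PR's actual code.

```rust
use std::collections::HashMap;
use std::net::IpAddr;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Arc, Mutex};
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// Hypothetical shared state: exact error counts per IP.
pub type ErrorCounts = Arc<Mutex<HashMap<IpAddr, u32>>>;

/// Spawns a background thread that clears the counters every `interval`,
/// so the request loop never has to check whether a reset is due.
/// Setting `stop` to `true` makes the thread exit after its next wake-up.
pub fn spawn_ban_cleaner(
    counts: ErrorCounts,
    interval: Duration,
    stop: Arc<AtomicBool>,
) -> JoinHandle<()> {
    thread::spawn(move || {
        while !stop.load(Ordering::Relaxed) {
            thread::sleep(interval);
            // The lock is held only for the duration of the clear.
            counts.lock().unwrap().clear();
        }
    })
}
```

The request loop then only touches the counters when it records an error or checks a ban, and pays no per-iteration cost for the timer.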

Counting Bloom filter values: CountingBloomFilter::with_rate(4, 0.01, 100)

PR (new implementation after extracting BanService)

Best case:

Requests out: 364465.51/second
Responses in: 328018.44/second

  • Connect responses: 162198.50
  • Announce responses: 162554.64
  • Scrape responses: 3265.30
  • Error responses: 0.00
Peers per announce response: 0.00
Announce responses per info hash:
  • p10: 1
  • p25: 1
  • p50: 1
  • p75: 1
  • p90: 2
  • p95: 3
  • p99: 104
  • p99.9: 267
  • p100: 331

Worst case:

Requests out: 337984.39/second
Responses in: 304185.64/second

  • Connect responses: 150492.27
  • Announce responses: 150661.45
  • Scrape responses: 3031.91
  • Error responses: 0.00
Peers per announce response: 0.00
Announce responses per info hash:
  • p10: 1
  • p25: 1
  • p50: 1
  • p75: 1
  • p90: 2
  • p95: 3
  • p99: 104
  • p99.9: 250

@josecelano
Member Author

Benchmarking results after extracting BanService and running the ban cleaner in a new thread:

  • Best case: requests out went from 469963 to 440593 (-6.24%)
  • Worst case: requests out went from 432509 to 416496 (-3.70%)

Counting Bloom filter values: CountingBloomFilter::with_rate(4, 0.01, 100)

PR (new implementation running ban cleaner in a new thread)

Best case:

Requests out: 440593.07/second
Responses in: 393355.73/second

  • Connect responses: 195003.12
  • Announce responses: 194468.01
  • Scrape responses: 3884.59
  • Error responses: 0.00
Peers per announce response: 0.00
Announce responses per info hash:
  • p10: 1
  • p25: 1
  • p50: 1
  • p75: 1
  • p90: 2
  • p95: 3
  • p99: 105
  • p99.9: 315
  • p100: 401

Worst case:

Requests out: 416496.57/second
Responses in: 374961.47/second

  • Connect responses: 185585.72
  • Announce responses: 185699.20
  • Scrape responses: 3676.56
  • Error responses: 0.00
Peers per announce response: 0.00
Announce responses per info hash:
  • p10: 1
  • p25: 1
  • p50: 1
  • p75: 1
  • p90: 2
  • p95: 3
  • p99: 104
  • p99.9: 303
  • p100: 375

@josecelano josecelano force-pushed the 1096-ban-socket-addresses-not-sending-a-valid-connection-id branch from 010a2e5 to 77cf089 Compare December 10, 2024 10:29
@josecelano josecelano marked this pull request as ready for review December 10, 2024 11:01
@josecelano josecelano force-pushed the 1096-ban-socket-addresses-not-sending-a-valid-connection-id branch from d539959 to 7b4ec75 Compare December 10, 2024 11:03
@josecelano josecelano force-pushed the 1096-ban-socket-addresses-not-sending-a-valid-connection-id branch from 7b4ec75 to 88b447f Compare December 11, 2024 10:27
@josecelano
Member Author

We are planning to make some big changes to this implementation to avoid False Positives:

josecelano added a commit that referenced this pull request Dec 16, 2024
6ca82e9 feat: [#1128] add new metric UDP total requests aborted (Jose Celano)
9499fd8 feat: [#1128] add new metric UDP total responses (Jose Celano)
286fe02 feat: [#1128] add new metric UDP total requests (Jose Celano)

Pull request description:

  Add more metrics to the UDP tracker stats. The new values are:

  - `udp4_requests`: total number of requests received from IPv4 clients.
  - `udp6_requests`: total number of requests received from IPv6 clients.
  - `udp4_responses`: total number of responses sent to IPv4 clients.
  - `udp6_responses`: total number of responses sent to IPv6 clients.
  - `udp_requests_aborted`: total number of requests aborted to make room in the active requests buffer.

  ### Notes

  - Responses sent might differ from requests received because of aborted requests.
  - When we [merge the IP ban service](#1124), we can add a new metric for the total number of IPs banned.
  - I want to add these new metrics to the [live demo Grafana dashboard](torrust/torrust-demo#20).

  ### Subtasks

  - [x] `udp4_requests`
  - [x] `udp6_requests`
  - [x] `udp4_responses`
  - [x] `udp6_responses`
  - [x] `udp_requests_aborted`
  - [x] Benchmarking to check how it affects performance before merging it.

ACKs for top commit:
  josecelano:
    ACK 6ca82e9

Tree-SHA512: 7fbf75b264b191f5c58fcecde8d5e783bbe54ee1c1799acdddc04a9ef64b7196d8b95d1bcad420b1df269bc7929e44417a1d164c6953b00804b0d1e5f0b36e7d
@josecelano josecelano force-pushed the 1096-ban-socket-addresses-not-sending-a-valid-connection-id branch from 88b447f to 26c05e5 Compare December 16, 2024 11:15
@josecelano josecelano force-pushed the 1096-ban-socket-addresses-not-sending-a-valid-connection-id branch from 6af01ff to e0f54d3 Compare December 16, 2024 16:25
…limit

The live demo tracker is receiving many UDP requests with wrong connection IDs. Errors are logged (written to disk) and that
decreases the tracker's performance.

This counts errors and bans IPs for 2 minutes after 10 errors.

We use two levels of counters.

1. First level: a Counting Bloom Filter: fast and low memory consumption
   but inaccurate (False Positives).
2. Second level: a HashMap: exact counter for IPs.

CBFs are fast and use little memory but they are also inaccurate. They
have False Positives, meaning some IPs would be banned only because there
are bucket collisions (IPs sharing the same counter).

To avoid banning IPs incorrectly we decided to introduce a second
counter, which is a HashMap that counts errors precisely. IPs are only
banned when this counter reaches the limit (over 10 errors).

We keep the CBF as a first-level filter. It's a fast IP check that
does not affect the tracker's performance. When the IP is banned according
to the first filter we double-check in the HashMap.

Checking the CBF is faster than always checking for banned IPs against the HashMap.

This solution should be good if the number of IPs is low. We have to
find another solution anyway for IPv6, where it is cheaper to own a range of
IPs.
Because we are using aquatic_udp_load_test with this configuration

```
Starting client with config: Config {
    server_address: 127.0.0.1:3000,
    log_level: Error,
    workers: 1,
    duration: 0,
    summarize_last: 0,
    extra_statistics: true,
    network: NetworkConfig {
        multiple_client_ipv4s: true,
        sockets_per_worker: 4,
        recv_buffer: 8000000,
    },
    requests: RequestConfig {
        number_of_torrents: 1000000,
        number_of_peers: 2000000,
        scrape_max_torrents: 10,
        announce_peers_wanted: 30,
        weight_connect: 50,
        weight_announce: 50,
        weight_scrape: 1,
        peer_seeder_probability: 0.75,
    },
}
```
@josecelano josecelano force-pushed the 1096-ban-socket-addresses-not-sending-a-valid-connection-id branch from e0f54d3 to 29e506d Compare December 16, 2024 16:39
@josecelano
Member Author

josecelano commented Dec 16, 2024

Benchmarking results after adding the HashMap:

  • Best case: requests out went from 417682 to 413429 (-1.01%)
  • Worst case: requests out went from 404915 to 396124 (-2.17%)

Counting Bloom filter values: CountingBloomFilter::with_rate(4, 0.01, 100)

PR (new implementation running ban cleaner in a new thread)

Best case:

Requests out: 413429.53/second
Responses in: 371125.89/second

  • Connect responses: 184013.88
  • Announce responses: 183437.46
  • Scrape responses: 3674.55
  • Error responses: 0.00
Peers per announce response: 0.00
Announce responses per info hash:
  • p10: 1
  • p25: 1
  • p50: 1
  • p75: 1
  • p90: 2
  • p95: 3
  • p99: 105
  • p99.9: 299
  • p100: 373

Worst case:

Requests out: 396124.13/second
Responses in: 356491.42/second

  • Connect responses: 176396.93
  • Announce responses: 176532.52
  • Scrape responses: 3561.97
  • Error responses: 0.00
Peers per announce response: 0.00
Announce responses per info hash:
  • p10: 1
  • p25: 1
  • p50: 1
  • p75: 1
  • p90: 2
  • p95: 3
  • p99: 105
  • p99.9: 287
  • p100: 373

Develop Branch

Best case:

Requests out: 417682.64/second
Responses in: 375161.75/second

  • Connect responses: 185776.19
  • Announce responses: 185634.17
  • Scrape responses: 3751.40
  • Error responses: 0.00
    Peers per announce response: 0.00
    Announce responses per info hash:
  • p10: 1
  • p25: 1
  • p50: 1
  • p75: 1
  • p90: 2
  • p95: 3
  • p99: 105
  • p99.9: 305
  • p100: 393

Worst case:

Requests out: 404915.33/second
Responses in: 371497.74/second

  • Connect responses: 183964.50
  • Announce responses: 183821.18
  • Scrape responses: 3712.05
  • Error responses: 0.00
Peers per announce response: 0.00
Announce responses per info hash:
  • p10: 1
  • p25: 1
  • p50: 1
  • p75: 1
  • p90: 2
  • p95: 3
  • p99: 106
  • p99.9: 299
  • p100: 371

@josecelano
Member Author

ACK 29e506d

@josecelano
Member Author

josecelano commented Dec 16, 2024

Hi @da2ce7. Tomorrow (09:30 UTC) I will merge this, deploy it to the live demo, run it for some hours and compare the Grafana dashboard before and after the deployment. I will also compare CPU and memory consumption. I expect to see again the problem we had, where the tracker started consuming more and more memory until the Docker container was restarted. I also expect the number of errors to decrease drastically because of the IP banning.

Current data (2024-12-17 09:29 UTC)

image

image

@josecelano josecelano merged commit 208694f into torrust:develop Dec 17, 2024
23 checks passed
@josecelano
Member Author

josecelano commented Dec 18, 2024

Hi @da2ce7, I have been running the new IP ban filter for 24h. Errors have decreased when comparing the ratio between announce requests and error responses.

Without ban service: Announces 450-650 -> Errors 70-110
With ban service: Announces 450-700 -> Errors 110-225

I don't have the exact value because I did not create a graph of that ratio for the previous version.

Errors have not decreased much, maybe for two reasons:

  1. The ban duration is too short, only two minutes, which is the common announce interval. I will increase it to 1 hour as in the first draft of the PR.
  2. I think there are many "bad" clients, so even if you ban some, there are always more. It's not a subset of "attackers".
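To illustrate how the ban duration interacts with the announce interval, here is a sketch of a ban list with per-IP expiry. This is one possible design and not what the PR implements (the PR resets its counters on a timer); `BanList` and its methods are hypothetical names.

```rust
use std::collections::HashMap;
use std::net::IpAddr;
use std::time::{Duration, Instant};

/// Hypothetical ban list where each ban expires after a fixed duration.
pub struct BanList {
    duration: Duration,
    expires_at: HashMap<IpAddr, Instant>,
}

impl BanList {
    pub fn new(duration: Duration) -> Self {
        Self { duration, expires_at: HashMap::new() }
    }

    /// Bans `ip` from `now` until `now + duration`.
    pub fn ban(&mut self, ip: IpAddr, now: Instant) {
        self.expires_at.insert(ip, now + self.duration);
    }

    /// An IP is banned while its expiry time is still in the future.
    pub fn is_banned(&self, ip: &IpAddr, now: Instant) -> bool {
        self.expires_at.get(ip).map_or(false, |expiry| now < *expiry)
    }

    /// Drops entries whose ban has already expired.
    pub fn purge_expired(&mut self, now: Instant) {
        self.expires_at.retain(|_, expiry| now < *expiry);
    }
}
```

With a 120-second duration, a client that retries once per announce interval finds its ban already lifted on the next attempt, whereas a 3600-second duration keeps it blocked across many intervals.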

As you can see in the "UDP4 requests and responses (per second)" graph, we are not sending a response (because the IP was banned) for about 100 requests per second on average.

NOTES:

  • It's weird that memory consumption has not increased like some months or weeks ago. The number of torrents, seeders and leechers seems to be pretty stable. I wonder why that has changed.
  • Error responses are not only wrong connection ID responses. They include any type of error.

image

Current data (2024-12-18 09:29 UTC)

image

image


Development

Successfully merging this pull request may close these issues.

Ban socket addresses not sending a valid connection ID

1 participant