| 1 | +# Advanced Corruption Testing for libsql |
| 2 | + |
| 3 | +Hey there! I'm hamisi and I've been working on some pretty intense testing for libsql. After diving deep into the codebase, I noticed there were some areas where we could really stress test the system to make sure it's rock solid. |
| 4 | + |
| 5 | +## What I Built |
| 6 | + |
| 7 | +I've created a bunch of test suites that really push libsql to its limits. The idea is to simulate all kinds of crazy scenarios that might happen in production - network failures, memory pressure, concurrent operations going wild, you name it. |
| 8 | + |
| 9 | +### The Test Files |
| 10 | + |
| 11 | +**Basic Corruption Tests** (`data_corruption_simulation.rs`) |
| 12 | +This is where I started. I wanted to see what happens when transactions get interrupted by network issues, or when the WAL gets compacted while other operations are still in progress. Pretty basic but important stuff. |
| 13 | + |
| 14 | +**Advanced Scenarios** (`advanced_corruption_scenarios.rs`) |
| 15 | +This one gets more interesting. I'm testing things like what happens when replicas get out of sync, or when checkpoints happen at the worst possible time. The kind of edge cases that keep you up at night. |
| 16 | + |
| 17 | +**Extreme Stress Tests** (`extreme_corruption_tests.rs`) |
| 18 | +Okay, this is where I went a bit crazy. I'm running 20+ workers all hitting the database at once with tiny memory limits and terrible network conditions. If there's a race condition hiding somewhere, this should find it. |
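
To give a feel for the "many workers, one shared resource" shape, here is a minimal sketch using only the standard library. It is not the code from `extreme_corruption_tests.rs`: the `run_workers` helper is hypothetical, and a mutex-guarded counter stands in for the database. The point is just the check at the end, which would catch a lost update.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical stand-in for the database: one counter guarded by a lock.
/// Each worker performs `ops_per_worker` increments ("writes").
fn run_workers(workers: usize, ops_per_worker: usize) -> u64 {
    let counter = Arc::new(Mutex::new(0u64));

    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let counter = Arc::clone(&counter);
            thread::spawn(move || {
                for _ in 0..ops_per_worker {
                    *counter.lock().unwrap() += 1;
                }
            })
        })
        .collect();

    // Wait for every worker; a panic in any worker fails the whole run.
    for handle in handles {
        handle.join().expect("worker panicked");
    }

    let total = *counter.lock().unwrap();
    total
}

fn main() {
    let workers = 20;
    let ops = 1_000;
    let total = run_workers(workers, ops);
    // If any increment were lost to a race, this would not hold.
    assert_eq!(total, (workers * ops) as u64);
    println!("all {} writes accounted for", total);
}
```

The real tests also add memory limits and bad network conditions on top of the concurrency, which this sketch does not attempt to model.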
| 19 | + |
| 20 | +**Edge Cases** (`edge_case_corruption_tests.rs`) |
| 21 | +All the weird stuff - what happens with massive integers, crazy Unicode characters, NULL values where they shouldn't be. The kind of data that breaks things in unexpected ways. |
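
As a rough illustration of the kind of inputs this covers, here is a small self-contained sketch. Everything in it is an assumption made for the example: the `Value` enum, the `edge_case_values` corpus, and the toy `encode`/`decode` pair that stands in for a write-then-read round trip. The actual tests go through libsql rather than a hand-rolled encoding.

```rust
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Null,
    Integer(i64),
    Text(String),
}

/// Hypothetical corpus of awkward values.
fn edge_case_values() -> Vec<Value> {
    vec![
        Value::Null,
        Value::Integer(0),
        Value::Integer(i64::MIN),
        Value::Integer(i64::MAX),
        Value::Text(String::new()),
        Value::Text("e\u{301}".to_string()),    // 'e' plus a combining acute accent
        Value::Text("\u{1F600}".to_string()),   // emoji outside the Basic Multilingual Plane
        Value::Text("\u{202E}abc".to_string()), // right-to-left override
        Value::Text("a\0b".to_string()),        // embedded NUL character
    ]
}

/// Toy encoding: one tag byte followed by the payload.
fn encode(v: &Value) -> Vec<u8> {
    match v {
        Value::Null => vec![0],
        Value::Integer(i) => {
            let mut out = vec![1];
            out.extend_from_slice(&i.to_be_bytes());
            out
        }
        Value::Text(s) => {
            let mut out = vec![2];
            out.extend_from_slice(s.as_bytes());
            out
        }
    }
}

/// Returns `None` for anything malformed instead of guessing.
fn decode(bytes: &[u8]) -> Option<Value> {
    let (tag, payload) = bytes.split_first()?;
    match *tag {
        0 if payload.is_empty() => Some(Value::Null),
        1 if payload.len() == 8 => {
            let mut buf = [0u8; 8];
            buf.copy_from_slice(payload);
            Some(Value::Integer(i64::from_be_bytes(buf)))
        }
        2 => String::from_utf8(payload.to_vec()).ok().map(Value::Text),
        _ => None,
    }
}

fn main() {
    for v in edge_case_values() {
        // Every value must survive the round trip unchanged.
        assert_eq!(decode(&encode(&v)), Some(v.clone()));
    }
    // Invalid UTF-8 must be rejected, not silently "repaired".
    assert_eq!(decode(&[2, 0xFF]), None);
    println!("all edge-case values round-tripped");
}
```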
| 22 | + |
| 23 | +**The Big One** (`comprehensive_bug_hunter.rs`) |
| 24 | +This runs everything at once. Multiple scenarios, different types of operations, maximum chaos. It's like throwing everything at the wall and seeing what sticks. |
| 25 | + |
| 26 | +## How It Works |
| 27 | + |
| 28 | +I'm using the Turmoil simulation framework because it lets me create reproducible network failures. No more "works on my machine" - if there's a bug, I can reproduce it every time. |
| 29 | + |
| 30 | +The tests are pretty aggressive: |
| 31 | +- Super small log sizes to force constant compaction |
| 32 | +- Limited bandwidth to create realistic lag |
| 33 | +- Strategic timing of network failures |
| 34 | +- Lots of concurrent operations |
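
The reason seeded simulation makes failures reproducible can be shown without any framework at all. The sketch below is not Turmoil's API: `XorShift64` and `fault_schedule` are hypothetical names, and the generator is a plain xorshift so the example needs no external crates. The same seed always produces the same sequence of "drop this message" decisions, so a failing run can be replayed exactly.

```rust
/// Tiny xorshift64 generator, used only so the sketch has no dependencies.
struct XorShift64(u64);

impl XorShift64 {
    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// Hypothetical helper: for each simulated step, decide whether the
/// network fails. `fail_percent` is the chance (0..=100) of a failure.
fn fault_schedule(seed: u64, steps: usize, fail_percent: u64) -> Vec<bool> {
    // xorshift must not start from a zero state.
    let mut rng = XorShift64(seed.max(1));
    (0..steps)
        .map(|_| rng.next_u64() % 100 < fail_percent)
        .collect()
}

fn main() {
    let first = fault_schedule(42, 20, 30);
    let replay = fault_schedule(42, 20, 30);
    // Same seed, same failures, in the same order.
    assert_eq!(first, replay);
    println!("{:?}", first);
}
```

When a test fails, logging the seed is enough to get the identical failure pattern back on the next run.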
| 35 | + |
| 36 | +## What I'm Looking For |
| 37 | + |
| 38 | +Basically any kind of data corruption or correctness failure: |
| 39 | +- Transactions that should succeed but fail |
| 40 | +- Data that gets lost or corrupted |
| 41 | +- Constraints that get violated |
| 42 | +- Inconsistencies between replicas |
| 43 | +- Memory corruption issues |
| 44 | +- Unicode handling problems |
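
For the replica-inconsistency item in particular, the check boils down to comparing two snapshots row by row. Here is a minimal sketch of that idea, assuming each snapshot has already been read into memory. The `Snapshot` alias, the `Divergence` enum, and `diff_snapshots` are all hypothetical names for this example, not helpers from the test suite.

```rust
use std::collections::BTreeMap;

/// A snapshot maps a row id to its value; `None` models SQL NULL.
type Snapshot = BTreeMap<i64, Option<String>>;

#[derive(Debug, PartialEq)]
enum Divergence {
    MissingOnReplica(i64),
    ValueMismatch(i64),
    ExtraOnReplica(i64),
}

/// Report every row where the replica disagrees with the primary.
fn diff_snapshots(primary: &Snapshot, replica: &Snapshot) -> Vec<Divergence> {
    let mut out = Vec::new();
    for (id, value) in primary {
        match replica.get(id) {
            None => out.push(Divergence::MissingOnReplica(*id)),
            Some(v) if v != value => out.push(Divergence::ValueMismatch(*id)),
            Some(_) => {}
        }
    }
    for id in replica.keys() {
        if !primary.contains_key(id) {
            out.push(Divergence::ExtraOnReplica(*id));
        }
    }
    out
}

fn main() {
    let mut primary = Snapshot::new();
    primary.insert(1, Some("a".to_string()));
    primary.insert(2, None);
    primary.insert(3, Some("c".to_string()));

    let mut replica = primary.clone();
    replica.insert(2, Some("oops".to_string())); // NULL became a value
    replica.remove(&3);                          // lost row
    replica.insert(4, Some("ghost".to_string())); // row the primary never had

    let report = diff_snapshots(&primary, &replica);
    assert_eq!(
        report,
        vec![
            Divergence::ValueMismatch(2),
            Divergence::MissingOnReplica(3),
            Divergence::ExtraOnReplica(4),
        ]
    );
    println!("{:?}", report);
}
```

Returning the full list of divergences, rather than a single pass/fail flag, is what makes a failure report say what went wrong and where.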
| 45 | + |
| 46 | +## Running the Tests |
| 47 | + |
| 48 | +```bash |
| 49 | +# Run every test whose name contains "corruption" (warning: takes a while) |
| 50 | +cargo test corruption |
| 51 | + |
| 52 | +# Run specific tests |
| 53 | +cargo test extreme_concurrent_stress_test |
| 54 | +cargo test unicode_corruption_test |
| 55 | +cargo test memory_pressure_corruption_test |
| 56 | + |
| 57 | +# Show test output (println! and friends) instead of capturing it |
| 58 | +cargo test -- --nocapture |
| 59 | +``` |
| 60 | + |
| 61 | +## My Testing Philosophy |
| 62 | + |
| 63 | +I believe in breaking things before users do. These tests are designed to find the bugs that only show up under extreme conditions - the ones that are really hard to debug in production. |
| 64 | + |
| 65 | +Every test has comprehensive verification built in. If something goes wrong, you'll know exactly what and where. No silent failures. |
| 66 | + |
| 67 | +## Technical Details |
| 68 | + |
| 69 | +The tests use deterministic simulation, so they're reproducible. I've also built in real-time corruption detection - the moment something goes wrong, the test will catch it and tell you exactly what happened. |
| 70 | + |
| 71 | +I'm particularly focused on: |
| 72 | +- Race conditions in transaction processing |
| 73 | +- WAL integrity during compaction |
| 74 | +- Memory management under pressure |
| 75 | +- Network partition recovery |
| 76 | +- Data encoding/decoding edge cases |
| 77 | + |
| 78 | +## Why This Matters |
| 79 | + |
| 80 | +Database corruption is scary. When it happens in production, you're looking at data loss, downtime, and very unhappy users. These tests help catch those issues before they ever see the light of day. |
| 81 | + |
| 82 | +I've tried to cover all the scenarios I can think of, but I'm sure there are more. That's the nature of testing - you find what you're looking for, and sometimes you find things you weren't expecting. |
| 83 | + |
| 84 | +## Contributing |
| 85 | + |
| 86 | +If you want to add more tests or improve existing ones, go for it! The pattern is pretty straightforward: |
| 87 | +1. Set up the scenario |
| 88 | +2. Apply stress/failures |
| 89 | +3. Verify everything is still correct |
| 90 | +4. Report any issues found |
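
The four steps above can be sketched as a single function. This is only an outline under stated assumptions: `FakeTable` is a hypothetical in-memory stand-in for a database connection, and the "failure" is a flag passed to `insert` rather than a real network fault.

```rust
/// Hypothetical in-memory table standing in for a real connection.
#[derive(Default)]
struct FakeTable {
    rows: Vec<i64>,
}

impl FakeTable {
    /// Insert a value; when `fault` is true the write fails and stores nothing.
    fn insert(&mut self, value: i64, fault: bool) -> Result<(), String> {
        if fault {
            return Err(format!("injected failure on {}", value));
        }
        self.rows.push(value);
        Ok(())
    }
}

fn scenario() -> Result<(), String> {
    // 1. Set up the scenario.
    let mut table = FakeTable::default();
    let mut acknowledged = Vec::new();

    // 2. Apply stress/failures: every third write is made to fail.
    for v in 0..30 {
        if table.insert(v, v % 3 == 0).is_ok() {
            acknowledged.push(v);
        }
    }

    // 3. Verify: exactly the acknowledged writes are present, in order.
    if table.rows != acknowledged {
        // 4. Report what was expected and what was found.
        return Err(format!(
            "expected {:?}, found {:?}",
            acknowledged, table.rows
        ));
    }
    Ok(())
}

fn main() {
    match scenario() {
        Ok(()) => println!("scenario passed"),
        Err(report) => panic!("corruption detected: {}", report),
    }
}
```

Tracking which writes were acknowledged, and verifying against that list, keeps the check independent of how many failures were injected.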
| 91 | + |
| 92 | +The more edge cases we can test, the more confident we can be in the system's reliability. |
| 93 | + |
| 94 | +--- |
| 95 | + |
| 96 | +*These tests represent my attempt to really push libsql to its limits and make sure it can handle whatever production throws at it. Hope they're useful!* |
| 97 | + |
| 98 | +- hamisi |