docs: add KPI generation steps #6598
base: main
Changes from 4 commits
708c8e9
a03167c
c3aa4a5
097c262
67e2ad0
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,186 @@ | ||
| # Celestia KPI Reproduction Steps | ||
|
|
||
| This document provides instructions for reproducing the core-app KPIs. These KPIs measure transaction submission performance and sync-to-tip speed. | ||
|
|
||
| ## Prerequisites | ||
|
|
||
| 1. **Verify block time configuration for 32MB/3sec blocks:** | ||
|
|
||
| Make sure the app is configured for the target throughput. Verify that `DelayedPrecommitTimeout` is set to 2800ms for a 3s block time. | ||
|
Collaborator
Where do I verify this? In a file? In a CLI command? |
||
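One way to locate the value is to search the checkout for the identifier. This is a minimal sketch that assumes the setting is defined somewhere in the repository's Go sources; the document does not say where it lives, and `find_precommit_timeout` is a hypothetical helper, not part of the repository.

```bash
# Sketch: list every place DelayedPrecommitTimeout appears under a source tree.
# Assumption: the value is set in a .go file; the search root defaults to ".".
find_precommit_timeout() {
  root="${1:-.}"
  # -r: recurse, -n: show line numbers, --include: only look at Go files.
  grep -rn --include='*.go' 'DelayedPrecommitTimeout' "$root"
}
```

Running `find_precommit_timeout .` from the repository root prints each matching file and line, which can then be checked for the 2800ms value.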
|
|
||
| 2. **Install celestia-app and dependencies:** | ||
|
|
||
| ```bash | ||
| # Build all necessary binaries (must be done after verifying DelayedPrecommitTimeout) | ||
| make build-talis-bins | ||
|
|
||
| # Install talis | ||
| go install ./tools/talis/ | ||
| ``` | ||
|
|
||
| 3. **Set up cloud provider credentials:** | ||
|
|
||
| Google Cloud is recommended for high-throughput tests. Ask the DevOps team for access to Celestia's Google Cloud fibreda workspace. | ||
|
|
||
| ```bash | ||
| # Create a .env file | ||
| talis init-env --provider googlecloud | ||
|
|
||
| # Fill in the .env file with your credentials: | ||
| GOOGLE_CLOUD_PROJECT="fibreda" | ||
| GOOGLE_CLOUD_KEY_JSON_PATH="/path/to/service-account-key.json" | ||
|
Comment on lines +26 to +31
Collaborator
@tty47 is there a different Google Cloud project that we should use for collecting KPIs? Or is it safe to use the Note: some of the KPIs aren't related to FibreDA |
||
| ``` | ||
|
|
||
| 4. **SSH key is required for running experiments:** | ||
|
|
||
| Create a new SSH key or use an existing one. For Google Cloud, talis automatically adds the SSH key to the instance metadata. | ||
|
|
||
| Configure these variables in `.env`: | ||
|
|
||
| ```bash | ||
| TALIS_SSH_KEY_PATH=your-key-path | ||
| TALIS_SSH_KEY_NAME=your-key-name | ||
| ``` | ||
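Before deploying, it can help to confirm that the `.env` file actually contains the variables named in the steps above. The following is a minimal sketch: `check_env_file` is a hypothetical helper, not a talis command, and it assumes the file uses plain `NAME=value` lines (optionally prefixed with `export`).

```bash
# Sketch: report which required variables are missing or empty in a .env file.
# check_env_file is a hypothetical helper; it does not validate the values.
check_env_file() {
  env_file="$1"
  shift
  missing=0
  for name in "$@"; do
    # Require NAME= followed by at least one character.
    # Note: a quoted empty string (NAME="") is not detected as empty.
    if ! grep -Eq "^(export[[:space:]]+)?${name}=.+" "$env_file"; then
      echo "missing or empty: ${name}"
      missing=1
    fi
  done
  return "$missing"
}
```

For example, `check_env_file .env GOOGLE_CLOUD_PROJECT GOOGLE_CLOUD_KEY_JSON_PATH TALIS_SSH_KEY_PATH TALIS_SSH_KEY_NAME` returns a non-zero status and names each variable that still needs a value.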
|
|
||
| ## Talis Network Deployment | ||
|
|
||
| 1. **Initialize Talis Network** | ||
|
|
||
| ```bash | ||
| # Initialize with observability for metrics collection | ||
| talis init -c kpi-test-chain -e tx-kpi --with-observability --provider googlecloud | ||
|
|
||
| # Add validator nodes (50-100 validators recommended for realistic network) | ||
| talis add -t validator -c 50 --provider googlecloud | ||
| ``` | ||
|
|
||
| 2. **Deploy Network** | ||
|
|
||
| ```bash | ||
| # Spin up cloud instances (specify SSH key if not using defaults) | ||
| talis up --provider googlecloud --workers 20 | ||
|
|
||
| # Create genesis with appropriate square size | ||
| # Square size 256 allows for ~32MB blocks | ||
| talis genesis -s 256 -b ./build | ||
|
|
||
| # Deploy the network (specify SSH key if needed) | ||
| talis deploy --direct-payload-upload --workers 20 | ||
|
Member
probably worth noting the s3 option for faster deployment
Member
Author
Yes I was going to add that but thought I'd keep everything to minimum. Will add. |
||
|
|
||
| # After deployment completes, talis will output the Grafana access information: | ||
| # URL, credentials. | ||
|
|
||
| # Wait for network to start and optionally confirm all validators are online | ||
| talis status | ||
| ``` | ||
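The "~32MB" figure for square size 256 can be checked with simple arithmetic. This sketch assumes a share size of 512 bytes and treats the result as raw capacity; usable blob space is somewhat lower because shares also carry namespace and metadata bytes. `square_capacity_bytes` is a hypothetical helper.

```bash
# Sketch: raw byte capacity of a data square of the given width.
# Assumption: each share is 512 bytes; overhead inside shares is ignored.
square_capacity_bytes() {
  square_size="$1"
  share_size=512
  echo $(( square_size * square_size * share_size ))
}

square_capacity_bytes 256   # prints 33554432, i.e. 32 MiB
```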
|
|
||
| ## Transaction Submission KPIs | ||
|
|
||
| **NOTE:** Reset the network between KPI experiments so that each run starts from fresh state and produces accurate results. | ||
|
|
||
| ```bash | ||
| talis reset | ||
| talis deploy -w 20 | ||
|
Collaborator
[extreme nit] not worth changing but for future reference I think when commands are in documentation (like this) we should strive to use the long format flag instead of short form flag. Motivation: as a reader, I don't know what |
||
| ``` | ||
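To avoid forgetting the reset between experiments, the two commands above can be chained in front of each experiment. This is only a sketch: `run_fresh` is a hypothetical wrapper, and it does not wait for the network to come back up, so checking `talis status` before submitting may still be needed.

```bash
# Sketch: reset and redeploy, then run the experiment command passed as arguments.
# run_fresh is a hypothetical wrapper that only chains commands from this document.
run_fresh() {
  # Stop at the first failing step so an experiment never runs on stale state.
  talis reset &&
    talis deploy -w 20 &&
    "$@"
}
```

Usage: `run_fresh talis latency-monitor -i 1 -b 8000000 -z 8000000 -s 1000ms`.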
|
|
||
| ### KPI 1: 8MB/1sec (Single Submitter) | ||
|
|
||
| **Target:** One latency monitor submitting 8MB blobs every second | ||
|
|
||
| ```bash | ||
| talis latency-monitor -i 1 -b 8000000 -z 8000000 -s 1000ms | ||
| ``` | ||
|
|
||
| **Expected Results:** | ||
|
|
||
| - Success rate: >=99.9% | ||
| - Average user latency: 6-8 seconds | ||
| - No evictions | ||
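The 99.9% threshold is easy to misjudge by eye on small samples, so a quick integer check can help. This sketch assumes the succeeded and total counts are read manually from the latency monitor output or dashboards; `meets_success_target` is a hypothetical helper.

```bash
# Sketch: check a success rate against the >=99.9% target using integers only.
# succeeded/total >= 0.999 is rewritten as succeeded*1000 >= total*999.
meets_success_target() {
  succeeded="$1"
  total="$2"
  [ "$total" -gt 0 ] && [ $(( succeeded * 1000 )) -ge $(( total * 999 )) ]
}
```

For example, `meets_success_target 999 1000` succeeds, while `meets_success_target 998 1000` fails.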
|
|
||
| ### KPI 2: Load Shedding (Two Submitters, 8MB/1sec each) | ||
|
|
||
| **Target:** Two latency monitors submitting 8MB blobs every second (total 16MB/1sec) | ||
|
|
||
| ```bash | ||
| talis latency-monitor -i 2 -b 8000000 -z 8000000 -s 1000ms | ||
| ``` | ||
|
|
||
| **Expected Observations:** | ||
|
|
||
| - Gas price increases under load | ||
| - Some broadcast failures due to full mempool | ||
| - Higher latency due to eviction timeouts | ||
| - Sequence mismatch errors from resubmission race conditions | ||
| - Network attempts load shedding by evicting low fee transactions | ||
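A rough calculation suggests why load shedding is expected here. The sketch below assumes the 32MB/3sec target from the prerequisites (taken as 33554432 bytes per block) and that each submitter sends exactly one blob per second; both helper functions are hypothetical.

```bash
# Sketch: compare offered load with chain capacity, in bytes per second.
# Assumptions: 33554432-byte blocks every 3 seconds; one blob per submitter per second.
offered_bytes_per_sec() {
  submitters="$1"
  blob_bytes="$2"
  echo $(( submitters * blob_bytes ))
}

capacity_bytes_per_sec() {
  block_bytes="$1"
  block_time_sec="$2"
  echo $(( block_bytes / block_time_sec ))
}

offered_bytes_per_sec 2 8000000     # prints 16000000
capacity_bytes_per_sec 33554432 3   # prints 11184810
```

Under these assumptions two submitters offer about 16 MB/s against roughly 11 MB/s of capacity, while the single submitter in KPI 1 stays below it.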
|
|
||
| ### Test 3: Parallel Submission (Multiple Workers) | ||
|
|
||
| **Target:** Single latency monitor with multiple parallel workers trying to fill up the throughput. | ||
|
|
||
| ```bash | ||
| # example: 15 workers submitting 2-8MB txs every 100ms | ||
| talis latency-monitor --instances 1 -w 15 -b 8000000 -z 2000000 --submission-delay 100ms | ||
| ``` | ||
|
|
||
| **Expected Results:** | ||
|
|
||
| - Consistent throughput >9MB/1sec | ||
| - Good mempool distribution | ||
|
|
||
| ### Test 4: No Eviction (Optimal Conditions) | ||
|
|
||
| This can already be measured in the first experiment (KPI 1), but if you need to re-run it: | ||
|
|
||
| ```bash | ||
| talis latency-monitor -i 1 -b 8000000 -z 8000000 -s 1000ms | ||
| ``` | ||
|
|
||
| **Expected Results:** | ||
|
|
||
| - Transactions included with zero evictions | ||
|
|
||
| ## Collect Metrics and Results | ||
|
|
||
| ### From Grafana | ||
|
|
||
| At `http://<observability-instance-ip>:3000` as displayed during `talis deploy`: | ||
|
|
||
| - Access the Celestia Grafana dashboards, which display network data | ||
| - Access the latency monitor dashboards, which display submission statistics and latency monitor logs | ||
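Before opening the dashboards, it may be worth confirming that Grafana is reachable. This sketch builds the URL of Grafana's health endpoint (`/api/health`) on port 3000; `grafana_health_url` is a hypothetical helper and the IP address in the usage comment is a placeholder.

```bash
# Sketch: build the Grafana health-check URL for an observability instance.
# The IP must be the one printed by talis deploy; 203.0.113.10 is a placeholder.
grafana_health_url() {
  echo "http://$1:3000/api/health"
}

# Usage (needs network access to the instance):
#   curl -fsS "$(grafana_health_url 203.0.113.10)"
```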
|
|
||
| ## Cleanup | ||
|
|
||
| ```bash | ||
| # Destroy cloud instances | ||
| talis down --workers 20 | ||
| ``` | ||
|
|
||
| ## Sync to Tip KPIs | ||
|
|
||
| These KPIs measure how quickly a new node can sync to the network tip using state sync and block sync. | ||
|
|
||
| **Target:** Total sync time <10 minutes (state sync + block sync) | ||
|
|
||
| ### Running Sync Tests | ||
|
|
||
| #### Option 1: Local node (Mocha Testnet) | ||
|
|
||
| This script runs multiple iterations and provides statistical analysis: | ||
|
|
||
| ```bash | ||
| # Single iteration | ||
| ./scripts/mocha-measure-tip-sync.sh | ||
|
|
||
| # Multiple iterations (20 iterations with 30s cooldown) | ||
| ./scripts/mocha-measure-tip-sync.sh --iterations 20 --cooldown 30 | ||
| ``` | ||
|
|
||
| #### Option 2: Manual Testing on DigitalOcean | ||
|
|
||
| For production-like testing on cloud infrastructure: | ||
|
|
||
| TBD | ||
|
Member
Author
I'll add the section once #6599 merges |
||
|
|
||
| ### Analyzing Sync Results | ||
|
|
||
| The combined sync (state + block sync) must take less than 10 minutes. | ||
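The pass/fail rule can be written as a small check. This sketch assumes the state sync and block sync durations are available in whole seconds, for example read manually from the script output; the output format of the script is not described here, and `sync_within_target` is a hypothetical helper.

```bash
# Sketch: check combined sync time against the 10-minute (600 s) target.
# Inputs are whole seconds for state sync and block sync respectively.
sync_within_target() {
  state_sync_sec="$1"
  block_sync_sec="$2"
  total=$(( state_sync_sec + block_sync_sec ))
  echo "total sync: ${total}s"
  # Strictly less than 600 s, matching the "<10 minutes" target.
  [ "$total" -lt 600 ]
}
```

For example, `sync_within_target 120 300` reports a 420s total and succeeds, while a combined 600s or more fails.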