|
| 1 | +# Evaluation of the hax Frontend |
| 2 | + |
| 3 | +This section provides an evaluation of the hax frontend, consisting of two parts: |
| 4 | + |
| 5 | +- A **quantitative evaluation**, which measures how effectively the frontend processes Rust code. |
| 6 | +- A **qualitative analysis**, which explores the frontend’s capabilities in real-world scenarios. |
| 7 | + |
| 8 | +Together, these evaluations document the current strengths and limitations of the hax frontend. |
| 9 | + |
| 10 | +## Quantitative Evaluation |
| 11 | + |
| 12 | +The quantitative evaluation aims to estimate how much Rust code the frontend can handle. It does **not** assess the correctness of the frontend's output. |
| 13 | + |
| 14 | +### Overview |
| 15 | + |
| 16 | +The hax toolchain is composed of several components (see [high-level architecture](./index.md#high-level-arch)): |
| 17 | + |
| 18 | +- **Frontend**: hooks into the Rust compiler to export rich Abstract Syntax Trees (ASTs) for specified crates. |
| 19 | +- **Engine** and **Backends**: consume those ASTs to produce code. |
| 20 | +- **Libraries**: `hax-lib` provides supporting functionality, and reference models for existing Rust libraries (e.g. the [Core library](https://doc.rust-lang.org/stable/core/) of Rust). |
| 21 | + |
| 22 | +In this quantitative evaluation, we focus on the **frontend**: the process of generating JSON-encoded ASTs from Rust code. We aim to assess: |
| 23 | + |
| 24 | +1. **Successful Extraction**: The success rate of producing ASTs. |
| 25 | +2. **Performance**: Ensuring the extraction process remains efficient enough for real-world usage. |
| 26 | + |
| 27 | +### Methodology |
| 28 | + |
| 29 | +For each Rust crate in our test set, we follow these steps: |
| 30 | + |
| 31 | +1. Clone the crate's source code. |
| 32 | +2. Run `cargo fetch` to download its dependencies. |
| 33 | +3. Execute `cargo hax json --use-ids`, recording any errors and the time taken. |
| 34 | +4. Clean Cargo's cache with `cargo clean`. |
| 35 | +5. Run `cargo check`, again recording any errors and time. Since `cargo hax json` is effectively `cargo check` with extra work, this serves as our performance baseline. |
| 36 | + |
| 37 | +We implemented this protocol in an internal Cryspen tool, which also evaluates other parts of the hax toolchain. |
| 38 | + |
| 39 | +### Crate Selection |
| 40 | + |
| 41 | +To ensure we capture a diverse set of crates: |
| 42 | + |
| 43 | +- We include the **5,000 most downloaded** crates from crates.io. |
| 44 | +- We also include the **top 1,500 crates** in the **cryptography** category on crates.io, reflecting hax's relevance for verifying critical software like cryptographic libraries. |
| 45 | + |
| 46 | +> **TODO:** the numbers are wrong: I want to get more numbers. Currently we have top 1000k crates only. This is because the tool ran for 10 hours, and handled only 1k crates. The top 1k crates in the corpus `5k top crate \union top 1k5 crypto crates` is actually top 1k crates. (that makes sense) |
| 47 | +
|
| 48 | +### Success Rate |
| 49 | + |
| 50 | +Each crate falls into one of three categories: |
| 51 | + |
| 52 | +1. **Successful**: hax produced a valid AST. |
| 53 | +2. **Failed**: hax could not produce an AST (despite `cargo check` succeeding). |
| 54 | +3. **Both Failed**: Both `cargo check` and `cargo hax` failed. |
| 55 | + |
| 56 | +```mermaid |
| 57 | +%%{init: {'theme': 'base', 'themeVariables': { 'pie1': '#27ae60', 'pie2': '#f1c40f', 'pie3': '#e74c3c'}}}%% |
| 58 | +pie showData |
| 59 | + "`cargo check` failure" : 41 |
| 60 | + "`cargo hax` failure" : 24 |
| 61 | + "Success" : 935 |
| 62 | +``` |
| 63 | + |
| 64 | +Out of 1000 crates, our tool failed to run `cargo check` on 41 of them due to |
| 65 | +setup issues. These problems typically involve missing system packages that |
| 66 | +Cargo cannot automatically install or unusual Cargo configurations that require |
| 67 | +manual intervention. We therefore exclude these 41 crates from further analysis. |
| 68 | + |
| 69 | +Of the remaining 959 crates, the hax frontend successfully processed a **vast |
| 70 | +majority (97.5%)**. The remaining failures fall into four distinct categories, |
| 71 | +as illustrated in the pie chart below. |
| 72 | + |
| 73 | +```mermaid |
| 74 | +%%{init: {'theme': 'base', 'themeVariables': { 'pie1': '#c0392b', 'pie2': '#3498db', 'pie3': '#2980b9', 'pie4': '#e74c3c'}}}%% |
| 75 | +pie showData title Frontend failures |
| 76 | + "Unsupported Rust toolchain" : 4 |
| 77 | + "Rust setup issue" : 6 |
| 78 | + "Binder panic" : 10 |
| 79 | + "Stack overflow in Rustc" : 4 |
| 80 | +``` |
| 81 | + |
| 82 | +The errors marked in **blue** on the chart indicate situations where the Rust |
| 83 | +toolchain used by the tested crate or its dependencies is incompatible with the |
| 84 | +specific version hax is pinned to, or where the crate and hax are sensitive to |
| 85 | +toolchain variations. Rust edition 2024 was updated very recently, which |
| 86 | +explains roughly half of these issues. |
| 87 | + |
| 88 | +The errors shown in **red**, however, are directly related to hax. The |
| 89 | +binder-related panics are a [known |
| 90 | +bug](https://github.com/cryspen/hax/issues/1046). Additionally, the stack |
| 91 | +overflow errors occur due to specific code paths in the Rust compiler being |
| 92 | +incorrectly triggered by hax. Ultimately, only **1.6%** of crates encounter such |
| 93 | +hax-specific bugs. |
| 94 | + |
| 95 | +### Performance Analysis |
| 96 | + |
| 97 | +For the crates that successfully generated ASTs, we compared the time taken by `cargo hax json` against `cargo check`. Because crate size and complexity vary greatly, we normalized[^1] the times to allow fair comparisons. |
| 98 | + |
| 99 | +<div class="center-table" markdown> |
| 100 | + |
| 101 | +| Statistic | Cargo Check | Cargo Hax | |
| 102 | +|-----------------|------------:|----------:| |
| 103 | +| **Median** | 0.147 | 0.780 | |
| 104 | +| **Mean** | 0.215 | 0.771 | |
| 105 | +| **10th Decile** | 0.425 | 0.953 | |
| 106 | + |
| 107 | +</div> |
| 108 | + |
| 109 | +<!-- We break down the results into **cryptography crates** and **general crates**: --> |
| 110 | + |
| 111 | +<!-- #### Cryptography Crates |
| 112 | +
|
| 113 | +| Statistic | Cargo Check | Cargo Hax | |
| 114 | +|-----------------|------------:|----------:| |
| 115 | +| **Median** | 0.148 | 0.796 | |
| 116 | +| **Mean** | 0.199 | 0.777 | |
| 117 | +| **10th Decile** | 0.411 | 0.948 | --> |
| 118 | + |
| 119 | +- On average, `cargo hax` is about 4–5 times slower than `cargo check`. |
| 120 | +- At the **10th decile**, the slowdown is only about 2×, indicating better scaling for larger crates. |
| 121 | + |
| 122 | +## Conclusion |
| 123 | + |
| 124 | +Our quantitative evaluation shows that the hax frontend successfully extracts ASTs for a large portion of the Rust ecosystem. Nevertheless, a small portion of crates reveal performance bottlenecks or outright failures that require further investigation and optimization. |
| 125 | + |
| 126 | +These results also highlight a few **limitations** of this initial study: |
| 127 | + |
| 128 | +- We only evaluated the **frontend** process. Other stages of the toolchain, such as the engine, backends or libraries, require separate assessments. |
| 129 | +- We did not assess the **correctness or completeness** of the generated JSON, highlighting the need for a qualitative analysis to verify that the extracted ASTs meet the required specifications. |
| 130 | + |
| 131 | +Overall, the hax frontend demonstrates capabilities for large-scale Rust code verification, but continued refinement is needed to handle edge cases and improve performance. |
| 132 | + |
| 133 | +[^1]: For a given crate, we normalize the times by dividing them by the total time. |
0 commit comments