Commit a0414e1

cursoragent and script3r committed
Refactor: Implement AST-based crypto detection
Replaces regex-based pattern matching with AST parsing for more accurate cryptographic library and algorithm detection. Outputs findings in JSONL format.

Co-authored-by: script3r <[email protected]>
1 parent 49430e3 commit a0414e1

9 files changed: +926 -230 lines changed

Cargo.lock

Lines changed: 91 additions & 18 deletions
Some generated files are not rendered by default.

Cargo.toml

Lines changed: 11 additions & 0 deletions
@@ -33,3 +33,14 @@ x509-parser = "0.15"
 chrono = { version = "0.4", features = ["serde"] }
 tempfile = "3"

+# AST parsing dependencies
+tree-sitter = "0.22"
+tree-sitter-c = "0.21"
+tree-sitter-cpp = "0.22"
+tree-sitter-rust = "0.21"
+tree-sitter-python = "0.21"
+tree-sitter-javascript = "0.21"
+tree-sitter-java = "0.21"
+tree-sitter-go = "0.21"
+syn = { version = "2.0", features = ["full", "parsing", "visit"] }
+
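
Review note: with one grammar crate per language, a scanner has to route each file to the right grammar before parsing. The following is a minimal sketch of that routing using only the standard library. The `Language` enum, the `language_for_path` helper, and the extension table are all hypothetical and not taken from this commit. JavaScript appears only because its grammar crate is listed in the dependencies, even though the README's language list does not mention it.

```rust
use std::path::Path;

/// Languages whose tree-sitter grammar crates appear in the Cargo.toml hunk.
/// (Hypothetical type; the commit's own representation is not shown here.)
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Language {
    C,
    Cpp,
    Rust,
    Python,
    JavaScript,
    Java,
    Go,
}

/// Hypothetical helper: choose a language from the file extension.
/// The extension table below is an assumption made for illustration.
pub fn language_for_path(path: &Path) -> Option<Language> {
    // Files without an extension (or with a non-UTF-8 one) are skipped.
    let ext = path.extension()?.to_str()?.to_ascii_lowercase();
    match ext.as_str() {
        "c" | "h" => Some(Language::C),
        "cc" | "cpp" | "cxx" | "hpp" | "hh" | "hxx" => Some(Language::Cpp),
        "rs" => Some(Language::Rust),
        "py" => Some(Language::Python),
        "js" | "mjs" | "cjs" => Some(Language::JavaScript),
        "java" => Some(Language::Java),
        "go" => Some(Language::Go),
        _ => None,
    }
}
```

Calling `language_for_path(Path::new("src/main.c"))` yields `Some(Language::C)`, while unsupported files such as `README.md` yield `None` and can be skipped before any parser is constructed. Treating `.h` as C rather than C++ is a simplification; a real detector may need to inspect the content.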

README.md

Lines changed: 23 additions & 54 deletions
@@ -4,90 +4,59 @@
 <img src="cipherscope.png" alt="CipherScope Logo" width="350" height="350">
 </div>

-Fast cryptographic inventory generator that creates Minimal Viable Cryptographic Bill of Materials (MV-CBOM) documents. Scans codebases to identify cryptographic algorithms, certificates, and assess post-quantum cryptography readiness.
+Fast AST-based cryptographic library and algorithm detection tool. Uses Abstract Syntax Tree parsing to precisely identify cryptographic usage in source code and outputs findings in JSONL format.

 ## Quick Start

 ```bash
 cargo build --release
-./target/release/cipherscope --patterns patterns.toml --progress /path/to/scan [... paths]
+./target/release/cipherscope --progress /path/to/scan [... paths]
 ```

 ## What It Does

-- **Detects** cryptographic usage across 11 languages
-- **Identifies** many cryptographic algorithms (AES, SHA, RSA, ECDSA, ChaCha20, etc.)
-- **Outputs** JSON inventory with NIST quantum security levels
-- **Runs fast** - GiB/s throughput with parallel scanning
+- **AST-based detection** - Uses tree-sitter parsers for precise source code analysis
+- **Library detection** - Identifies crypto libraries via import/include/using statements
+- **Algorithm detection** - Finds algorithm usage via method names, function calls, and type definitions
+- **Multi-language support** - C, C++, Rust, Python, Java, Go
+- **JSONL output** - Simple one-JSON-object-per-line format for easy processing
+- **Fast parallel scanning** - Efficient processing of large codebases

 ## Example Output

-```json
-{
-  "bomFormat": "MV-CBOM",
-  "specVersion": "1.0",
-  "cryptoAssets": [{
-    "name": "RSA",
-    "assetProperties": {
-      "primitive": "signature",
-      "parameterSet": {"keySize": 2048},
-      "nistQuantumSecurityLevel": 0
-    }
-  }]
-}
+```jsonl
+{"language":"C","library":"OpenSSL","symbol":"<openssl/evp.h>","file":"src/main.c","line":1,"column":10,"snippet":"<openssl/evp.h>","detector":"ast-detector-c"}
+{"language":"Python","library":"cryptography","symbol":"cryptography.hazmat.primitives.ciphers","file":"app.py","line":1,"column":6,"snippet":"cryptography.hazmat.primitives.ciphers","detector":"ast-detector-python"}
+{"language":"Rust","library":"ring","symbol":"ring::aead","file":"main.rs","line":1,"column":5,"snippet":"ring::aead","detector":"ast-detector-rust"}
 ```

 ## Options

 ### Core Options
-- `--patterns PATH` - Custom patterns file (default: `patterns.toml`)
 - `--progress` - Show progress bar during scanning
-- `--deterministic` - Reproducible output for testing/ground-truth generation
-- `--output FILE` - Output file for single-project CBOM (default: stdout)
-- `--recursive` - Generate MV-CBOMs for all discovered projects
-- `--output-dir DIR` - Output directory for recursive CBOMs
+- `--deterministic` - Reproducible output for testing
+- `--output FILE` - Output file for JSONL results (default: stdout)

 ### Filtering & Performance
 - `--threads N` - Number of processing threads
 - `--max-file-size MB` - Maximum file size to scan (default: 2MB)
 - `--include-glob GLOB` - Include files matching glob pattern(s)
 - `--exclude-glob GLOB` - Exclude files matching glob pattern(s)

-### Certificate Scanning
-- `--skip-certificates` - Skip certificate scanning during CBOM generation
-
-### Configuration
-- `--print-config` - Print merged patterns/config and exit
-
 ## Languages Supported

-C, C++, Go, Java, Kotlin, Python, Rust, Swift, Objective-C, PHP, Erlang
-
-## Configuration
-
-Edit `patterns.toml` to add new libraries or algorithms. No code changes needed.
+C, C++, Go, Java, Python, Rust (AST-based detection)

 ## How It Works (High-Level)

-1. Workspace discovery and prefilter
-   - Walks files respecting .gitignore
-   - Cheap Aho-Corasick prefilter using language-specific substrings derived from patterns
-2. Language detection and comment stripping
-   - Detects language by extension; strips comments once for fast regex matching
-3. Library identification (anchors)
-   - Per-language detector loads compiled patterns for that language (from `patterns.toml`)
-   - Looks for include/import/namespace/API anchors to confirm a library is present in a file
-4. Algorithm matching
-   - For each identified library, matches algorithm `symbol_patterns` (regex) against the file
-   - Extracts parameters via `parameter_patterns` (e.g., key size, curve) with defaults when absent
-   - Emits findings with file, line/column, library, algorithm, primitive, and NIST quantum level
-5. Deep static analysis (fallback/enrichment)
-   - For small scans, analyzes files directly with the registry to find additional algorithms even if no library finding was produced
-6. CBOM generation
-   - Findings are deduplicated and merged
-   - Final MV-CBOM JSON is printed or written per CLI options
-
-All behavior is driven by `patterns.toml` — adding new libraries/algorithms is a data-only change.
+1. **File Discovery** - Walks files respecting .gitignore and language detection
+2. **AST Parsing** - Uses tree-sitter parsers to build Abstract Syntax Trees for each supported language
+3. **Pattern Matching** - Executes tree-sitter queries to find:
+   - **Library imports** - `#include`, `import`, `use` statements for crypto libraries
+   - **Algorithm usage** - Function calls, method invocations, type references
+4. **Result Emission** - Outputs findings as JSONL with precise location information
+
+The AST-based approach provides more accurate detection than regex patterns by understanding the actual structure of the code.

 ## Testing
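
Review note: the new README describes the output as one JSON object per line, with the fields `language`, `library`, `symbol`, `file`, `line`, `column`, `snippet`, and `detector`. The sketch below shows one way such a record could be rendered using only the standard library. The `Finding` struct, `json_escape`, and `to_jsonl_line` are hypothetical names, and the hand-written escaping is a stand-in; the actual implementation in this commit is not shown here and more plausibly relies on a serialization crate.

```rust
use std::fmt::Write;

/// One finding, using the field names from the README's example output.
/// (Hypothetical type; the commit's real struct is not shown in this diff.)
pub struct Finding {
    pub language: String,
    pub library: String,
    pub symbol: String,
    pub file: String,
    pub line: u32,
    pub column: u32,
    pub snippet: String,
    pub detector: String,
}

/// Escape a string so it can be embedded inside a JSON string literal.
pub fn json_escape(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for ch in input.chars() {
        match ch {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => {
                // Remaining control characters use the \uXXXX form.
                write!(out, "\\u{:04x}", c as u32).unwrap();
            }
            c => out.push(c),
        }
    }
    out
}

impl Finding {
    /// Render the finding as a single JSON object without a trailing newline,
    /// so the caller can write exactly one record per output line.
    pub fn to_jsonl_line(&self) -> String {
        format!(
            "{{\"language\":\"{}\",\"library\":\"{}\",\"symbol\":\"{}\",\"file\":\"{}\",\"line\":{},\"column\":{},\"snippet\":\"{}\",\"detector\":\"{}\"}}",
            json_escape(&self.language),
            json_escape(&self.library),
            json_escape(&self.symbol),
            json_escape(&self.file),
            self.line,
            self.column,
            json_escape(&self.snippet),
            json_escape(&self.detector),
        )
    }
}
```

Because newlines inside `snippet` are escaped, a multi-line match still occupies a single output line, which is what lets consumers process the file with line-oriented tools.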
