This tool indexes and searches the GPT-2 output dataset using ZeroEntropy's semantic search capabilities.
ZeroEntropy API Key - Set your API key:

```powershell
$env:ZEROENTROPY_API_KEY = "your-api-key-here"
```
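If you're on Linux or macOS rather than PowerShell, the equivalent export is:

```shell
# Unix (bash/zsh) equivalent of the PowerShell command above
export ZEROENTROPY_API_KEY="your-api-key-here"
```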
GPT-2 Dataset - The dataset should be in `../gpt-2-output-dataset/data/` with these files:
- `webtext.valid.jsonl` (human text)
- `small-117M.valid.jsonl` (GPT-2 small)
- `medium-345M.valid.jsonl` (GPT-2 medium)
- `large-762M.valid.jsonl` (GPT-2 large)
- `xl-1542M.valid.jsonl` (GPT-2 XL)
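Before indexing, it can be useful to confirm the expected files are actually in place. A minimal sketch (the path and filenames come from this README; nothing here touches the ZeroEntropy SDK):

```rust
use std::path::Path;

// The five dataset files the tool expects, per this README.
fn expected_files() -> [&'static str; 5] {
    [
        "webtext.valid.jsonl",
        "small-117M.valid.jsonl",
        "medium-345M.valid.jsonl",
        "large-762M.valid.jsonl",
        "xl-1542M.valid.jsonl",
    ]
}

fn main() {
    let data_dir = Path::new("../gpt-2-output-dataset/data");
    for f in expected_files() {
        let status = if data_dir.join(f).exists() { "found" } else { "missing" };
        println!("{}: {}", f, status);
    }
}
```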
```shell
cargo run --release --example search_gpt2_dataset
```

When you run it, you'll see three options:
- Index dataset - Required on first run. Indexes the first 100 samples from each dataset (takes ~10-20 minutes)
- Search existing collections - Use this if you've already indexed
- Index and search - Do both
Indexing does the following:
- Creates 5 ZeroEntropy collections (one for each dataset)
- Indexes first 100 samples from each dataset
- Adds metadata: source and index number
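The per-document metadata described above can be sketched as a simple key/value map; this is an illustrative shape only, not the SDK's actual metadata type:

```rust
use std::collections::HashMap;

// Hypothetical metadata attached to each indexed document:
// a source label plus the sample's index within its file.
fn metadata(source: &str, index: usize) -> HashMap<String, String> {
    let mut m = HashMap::new();
    m.insert("source".to_string(), source.to_string());
    m.insert("index".to_string(), index.to_string());
    m
}

fn main() {
    let m = metadata("webtext", 42);
    println!("{:?}", m);
}
```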
Automated Code Search: Searches for common code patterns:
- Function definitions
- Import statements
- Class methods
- Loops and conditionals
- API/HTTP requests
- Database queries
Interactive Search: Enter custom queries to search all collections
Try searching for:
- "function definition programming code"
- "import module python java javascript"
- "class method implementation"
- "for loop while loop iteration"
- "API endpoint HTTP request"
For each query, you'll see:
- Which collection (webtext, gpt2_small, etc.)
- Number of matches
- Top results with relevance scores
- Preview of matching content
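The query loop has roughly this shape. Note the `search` function below is a hypothetical stand-in so the sketch runs without the SDK; the real `zeroentropy-rust` API will differ:

```rust
// Hypothetical stand-in for the SDK's search call, returning
// (relevance score, content preview) pairs. Placeholder data only.
fn search(collection: &str, query: &str) -> Vec<(f32, String)> {
    vec![(0.87, format!("[{}] sample matching '{}'", collection, query))]
}

fn main() {
    let collections = ["webtext", "gpt2_small", "gpt2_medium", "gpt2_large", "gpt2_xl"];
    let query = "function definition programming code";
    for c in &collections {
        let results = search(c, query);
        println!("{}: {} match(es)", c, results.len());
        for (score, preview) in &results {
            println!("  {:.2}  {}", score, preview);
        }
    }
}
```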
This tool helps answer: "Is there code in the GPT-2 training/output data?"
By comparing search results across:
- webtext (human-written, WebText corpus)
- gpt2_small/medium/large/xl (AI-generated)
You can see:
- If code exists in the corpus
- Whether GPT-2 generates more/less code than humans
- Quality differences in generated code
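One way to make the comparison concrete is a per-collection code-density ratio. A minimal sketch; the sample counts below are hypothetical placeholders, not measured results:

```rust
// Fraction of sampled documents that look code-like.
fn code_density(code_like: usize, total: usize) -> f64 {
    code_like as f64 / total as f64
}

fn main() {
    let total: usize = 100; // samples indexed per collection
    // Placeholder counts of code-like samples per collection.
    let counts: [(&str, usize); 3] = [("webtext", 12), ("gpt2_small", 9), ("gpt2_xl", 15)];
    for (name, code_like) in &counts {
        let pct = 100.0 * code_density(*code_like, total);
        println!("{}: {:.0}% of sampled documents look code-like", name, pct);
    }
}
```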
- Indexing: ~100 documents × 5 collections = ~10-20 minutes
- Search: <1 second per query across all collections
To index more samples, edit line 73 in `search_gpt2_dataset.rs`:

```rust
if idx >= 100 { // Change to 1000, 5000, etc.
    break;
}
```

The full dataset has 5,000 validation samples per collection.
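The same "first N samples" limit can also be expressed with `Iterator::take` when reading a JSONL file line by line. A self-contained sketch (it writes a tiny stand-in file so it runs anywhere; this is not the tool's actual code):

```rust
use std::fs::{self, File};
use std::io::{BufRead, BufReader};

// Read at most `n` lines (samples) from a JSONL file.
fn first_n_lines(path: &str, n: usize) -> std::io::Result<Vec<String>> {
    let reader = BufReader::new(File::open(path)?);
    reader.lines().take(n).collect()
}

fn main() -> std::io::Result<()> {
    // Tiny stand-in JSONL file so the sketch is self-contained.
    let path = "sample.valid.jsonl";
    fs::write(path, "{\"text\": \"one\"}\n{\"text\": \"two\"}\n{\"text\": \"three\"}\n")?;
    let samples = first_n_lines(path, 2)?;
    println!("read {} samples", samples.len()); // prints: read 2 samples
    fs::remove_file(path)?;
    Ok(())
}
```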
After running this, you can:
- Analyze code density - Count how many code-like samples exist
- Compare patterns - See if GPT-2 has different code "signatures" than humans
- Feed into watermark detector - Use results to improve the detector in `gpt2-watermark-detector`
- Build a corpus map - Understand what's in the training data
- zeroentropy-rust - The ZeroEntropy Rust SDK
- gpt-2 - GPT-2 text generation
- gpt2-watermark-detector - Semantic watermark detection