A Large Language Model-powered batch processing tool for species taxonomy information, designed to convert common species names to Latin scientific names and retrieve NCBI Taxonomy IDs.
- 🤖 Intelligent Name Conversion: Uses Zhipu GLM-4.5 model for bidirectional conversion between common and Latin names
- 📊 Batch Data Processing: Supports efficient batch processing of CSV files with large species datasets
- 🔄 Dual Workflow System: Primary workflow + fallback workflow ensuring high success rates
- 📚 NCBI Database Integration: Queries authoritative taxonomy IDs through ETE3 library
- 🛡️ Robust Error Handling: Comprehensive retry mechanisms and exception handling
- Primary Workflow: Common name → Latin name → NCBI TaxID
- Fallback Workflow: Latin name → Common name → Latin name → NCBI TaxID
The project employs a rule-based decision system rather than relying on an LLM for decision-making:
- Prioritizes common name conversion (primary workflow)
- Enables Latin name-based fallback mechanism when common names are invalid or conversion fails
- Ensures data accuracy and completeness through multi-layer validation
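The dual-workflow logic above can be sketched as a single resolver function. This is a hypothetical illustration, not the script's actual code: `resolve_species` and the three converter callbacks (`to_latin`, `to_common`, `to_taxid`) are assumed names standing in for the real LLM- and ETE3-backed functions.

```python
def resolve_species(common_name, latin_name, to_latin, to_common, to_taxid):
    """Try the primary workflow first, then the Latin-name fallback.

    Returns a (latin_name, taxid) pair, or (None, None) if both fail.
    """
    # Primary workflow: common name -> Latin name -> NCBI TaxID
    if common_name:
        latin = to_latin(common_name)
        if latin:
            taxid = to_taxid(latin)
            if taxid:
                return latin, taxid

    # Fallback workflow: Latin name -> common name -> Latin name -> TaxID
    # (the round-trip acts as a validation step for the supplied Latin name)
    if latin_name:
        common = to_common(latin_name)
        if common:
            latin = to_latin(common)
            if latin:
                taxid = to_taxid(latin)
                if taxid:
                    return latin, taxid

    return None, None
```

The fallback only fires when the common name is missing or its conversion fails, matching the priority order described above.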
```bash
# Install dependencies using uv (recommended)
uv sync
```

Create a `.env` file and add your Zhipu API key:

```
ZHIPU_API_KEY=your_zhipu_api_key_here
```

```bash
# Use default file paths
python taxnomy_agent.py

# Use custom file paths
python taxnomy_agent.py -i input.csv -o output.csv -d /path/to/taxdump.tar.gz

# View all options
python taxnomy_agent.py --help
```

- `-i, --input`: Input CSV file path (default: `./Host_Range_output.csv`)
- `-o, --output`: Output CSV file path (default: `./Host_Range_output_update.csv`)
- `-d, --cachedir`: ETE3 cache file path (optional, default: `./NCBI_taxnomy_db_dir/taxdump.tar.gz`)
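The command-line interface above could be reconstructed with `argparse` roughly as follows. This is an assumed sketch based on the documented flags and defaults; the real script's parser may differ in naming and structure.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Build the CLI parser matching the documented options."""
    parser = argparse.ArgumentParser(
        description="Batch species taxonomy lookup")
    parser.add_argument("-i", "--input",
                        default="./Host_Range_output.csv",
                        help="Input CSV file path")
    parser.add_argument("-o", "--output",
                        default="./Host_Range_output_update.csv",
                        help="Output CSV file path")
    parser.add_argument("-d", "--cachedir",
                        default="./NCBI_taxnomy_db_dir/taxdump.tar.gz",
                        help="ETE3 cache file path (optional)")
    return parser
```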
CSV files should contain the following columns (the script automatically creates any that are missing):

- `Common Name`: Species common names
- `Latin name`: Latin scientific names
- `Taxonomy ID`: NCBI Taxonomy IDs
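Ensuring the three columns exist can be done with the standard library alone. This is a minimal sketch under the assumption that rows are handled as dictionaries; `load_rows` is a hypothetical helper, not the script's own function.

```python
import csv
import io

# The three columns the tool expects in every input CSV
REQUIRED_COLUMNS = ["Common Name", "Latin name", "Taxonomy ID"]

def load_rows(csv_text: str) -> list[dict]:
    """Read CSV text into dicts, adding any missing required columns as empty."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    for row in rows:
        for col in REQUIRED_COLUMNS:
            row.setdefault(col, "")
    return rows
```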
The system uses the NCBI Taxonomy database for authoritative species information. You can manually download the database from:
Download URL: https://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz
The first run of ETE3 automatically downloads ~100 MB of taxonomy data to `~/.etetoolkit/`, or to the custom cache path specified with the `-d` parameter.
- `common_name_to_latin()`: Convert common names to Latin names
- `latin_to_common_name()`: Convert Latin names to common names
- `batch_common_to_latin()`: Batch common name conversion
- `batch_latin_to_taxid_ete3()`: Batch TaxID querying
- `process_taxonomy_csv()`: Main CSV processing workflow
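A batch function like `batch_common_to_latin()` presumably wraps the single-name converter over a list while tolerating per-item failures. The sketch below shows that general shape only; the actual implementation (with its LLM calls, retries, and progress display) is not reproduced here.

```python
def batch_convert(names, convert):
    """Apply a single-name converter to each name, recording None on failure."""
    results = {}
    for name in names:
        try:
            results[name] = convert(name)
        except Exception:
            # A failed item should not abort the whole batch
            results[name] = None
    return results
```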
```
taxnomy_agent/
├── taxnomy_agent.py          # Main script file
├── final.ipynb               # Development and debugging notebook
├── data/                     # Data folder
│   ├── sample_Host_Range_output.csv
│   └── output_test_taxonomy_updated.csv
├── NCBI_taxnomy_db_dir/      # NCBI taxonomy database cache
├── pyproject.toml            # Project configuration
├── uv.lock                   # Dependency lock file
└── .env                      # Environment variables (create manually)
```
- Smart Retry Mechanism: Automatic retry on API call failures (up to 3 attempts)
- Data Cleaning: Automatic handling of special characters and null values
- Encoding Compatibility: Support for UTF-8 and latin1 encodings
- Progress Display: Real-time progress tracking with tqdm
- Cache Optimization: Local NCBI database caching after first run
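The retry mechanism described above (up to 3 attempts on API failure) can be sketched with a small generic wrapper. This is an illustrative assumption about the approach, not the script's actual code; `call_with_retry` is a hypothetical name.

```python
import time

def call_with_retry(fn, *args, max_attempts=3, delay=1.0):
    """Call fn(*args), retrying on exception up to max_attempts times."""
    last_err = None
    for attempt in range(1, max_attempts + 1):
        try:
            return fn(*args)
        except Exception as err:  # e.g. transient API/network errors
            last_err = err
            if attempt < max_attempts:
                time.sleep(delay * attempt)  # simple linear backoff
    raise last_err
```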
- First ETE3 run downloads the ~100 MB NCBI taxonomy database to `~/.etetoolkit/`
- Ensure a stable internet connection for Zhipu API and NCBI database access
- Recommend testing with small datasets first before processing large-scale data
This project is open source. Please refer to the project configuration files for license details.