
tw-code-qa

Traditional Chinese Code-QA Dataset Conversion System using Multi-Agent Architecture

Description

This project is a system for converting and processing Traditional Chinese Code-QA datasets using a multi-agent architecture built with LangChain and LangGraph.

Installation

  1. Clone the repository:

    git clone https://github.com/ai-twinkle/tw-code-qa.git
    cd tw-code-qa
  2. Install dependencies using uv:

    uv sync

    For development dependencies:

    uv sync --extra dev
  3. Set up environment variables:

    cp .env.example .env

    Then edit .env file with your actual API keys:

    OPENAI_API_KEY=your_openai_api_key_here
    ANTHROPIC_API_KEY=your_anthropic_api_key_here
    GOOGLE_API_KEY=your_google_api_key_here

    Important: API keys are required in production mode (the default). In development mode (--environment development) you can run without API keys; see the configuration sketch below.
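The repository's configuration code is not reproduced here; the following is a minimal sketch of how a loader might pick these keys up, assuming python-dotenv is used (the require_key helper is hypothetical, not this project's actual API):

    import os
    from dotenv import load_dotenv

    load_dotenv()  # copy values from .env into the process environment

    def require_key(name: str, environment: str = "production") -> str | None:
        """Return an API key, failing fast when production mode demands it."""
        value = os.environ.get(name)
        if value is None and environment == "production":
            raise RuntimeError(f"{name} must be set in production mode")
        return value

    openai_key = require_key("OPENAI_API_KEY")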

Usage

First, download the dataset:

uv run python scripts/download_dataset.py
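The script's internals are not shown here, but a rough Python equivalent might look like the sketch below, assuming the data comes from the OpenCoder collection on Hugging Face (the repository ID and subset name are assumptions, not taken from the script):

    # Hypothetical equivalent of scripts/download_dataset.py for one subset.
    from datasets import load_dataset

    dataset = load_dataset("OpenCoder-LLM/opc-sft-stage2", "educational_instruct")
    dataset.save_to_disk("data/opencoder_dataset_educational_instruct")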

Then run the main script with the dataset path. Always specify an output directory so results from different datasets are kept separate:

Processing Different Datasets

Important: Always specify --output-dir for each dataset to prevent mixing results from different datasets.

  • Educational Instruct Dataset:

    uv run python main.py --dataset-path data/opencoder_dataset_educational_instruct --dataset-type opencoder --output-dir output/educational_instruct
  • Evol Instruct Dataset:

    uv run python main.py --dataset-path data/opencoder_dataset_evol_instruct --dataset-type opencoder --output-dir output/evol_instruct
  • McEval Instruct Dataset:

    uv run python main.py --dataset-path data/opencoder_dataset_mceval_instruct --dataset-type opencoder --output-dir output/mceval_instruct
  • Package Instruct Dataset:

    uv run python main.py --dataset-path data/opencoder_dataset_package_instruct --dataset-type opencoder --output-dir output/package_instruct

Other Usage Examples

  • Test mode (process only the first 10 records):

    uv run python main.py --dataset-path data/opencoder_dataset_educational_instruct --output-dir output/test_run --max-records 10 --environment development
  • Production mode with full processing:

    uv run python main.py --dataset-path data/opencoder_dataset_package_instruct --output-dir output/package_instruct --environment production
  • Resume a previous run (continues from the last checkpoint and re-runs failed or missing records; a sketch of this mechanism follows the list):

    uv run python main.py --dataset-path data/opencoder_dataset_educational_instruct --output-dir output/educational_instruct --resume
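The checkpoint behaviour can be pictured as append-only per-record saves. This is an illustrative sketch only, assuming results are appended to a JSONL file with an "id" field (both the file layout and the field name are assumptions, not the repository's actual format):

    import json
    from pathlib import Path

    def load_completed_ids(output_dir: str) -> set[str]:
        """Collect IDs already processed so --resume can skip them."""
        path = Path(output_dir) / "results.jsonl"
        if not path.exists():
            return set()
        with path.open(encoding="utf-8") as f:
            return {json.loads(line)["id"] for line in f if line.strip()}

    def save_record(output_dir: str, record: dict) -> None:
        """Append one finished record immediately, keeping a usable checkpoint."""
        path = Path(output_dir) / "results.jsonl"
        with path.open("a", encoding="utf-8") as f:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")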

Features

  • Dataset conversion for Traditional Chinese Code-QA
  • Multi-agent architecture using LangChain and LangGraph
  • Support for various LLM providers (OpenAI, Anthropic, Google)
  • Real-time processing with immediate save after each record
  • Automatic failure recovery and checkpoint system
  • Environment-specific configurations (development/production)
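To make the multi-agent wiring concrete, here is a minimal LangGraph sketch in the spirit of this project; the agent names, state fields, and node logic are hypothetical and do not reflect the repository's actual graph:

    from typing import TypedDict
    from langgraph.graph import StateGraph, END

    class RecordState(TypedDict):
        question: str        # original English question
        question_zh_tw: str  # Traditional Chinese translation
        verified: bool       # semantic-fidelity check result

    def translate(state: RecordState) -> dict:
        # In the real system an LLM agent would produce the translation.
        return {"question_zh_tw": f"(zh-TW) {state['question']}"}

    def verify(state: RecordState) -> dict:
        # A second agent would check that the translation preserves meaning.
        return {"verified": bool(state["question_zh_tw"])}

    graph = StateGraph(RecordState)
    graph.add_node("translator", translate)
    graph.add_node("verifier", verify)
    graph.set_entry_point("translator")
    graph.add_edge("translator", "verifier")
    graph.add_edge("verifier", END)
    app = graph.compile()

    result = app.invoke({"question": "How do I reverse a list in Python?",
                         "question_zh_tw": "", "verified": False})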

Development

Run tests:

uv run pytest

Format code:

uv run black src/
uv run isort src/

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.
