MLX Fine-tune Lab

A comprehensive toolkit for fine-tuning language models using MLX on Apple Silicon Macs. This project converts Next.js documentation from Markdown format into a structured JSONL dataset and trains LoRA adapters locally. You can adapt it for your own needs.

🎯 Project Purpose

This project is designed to:

  • Convert Next.js documentation from .mdx format into a structured JSONL dataset with question, reasoning, and answer fields
  • Run LoRA fine-tuning on MLX-compatible models locally on Apple Silicon Macs
  • Note: Training typically takes 5+ hours, depending on your system's performance and available memory
  • Note: This code only works on macOS, as MLX is designed exclusively for Apple Silicon

πŸš€ Quick Start

  1. Install dependencies:

    pip install -r requirements.txt
  2. Configure environment:

    cp .env-example .env
    # Edit .env to set your preferred model

The WANDB_API_KEY environment variable is an API key for the https://wandb.ai service, which lets you send and monitor all logs related to the training process. It is optional; QLoRA training will work without a wandb key.

  3. Run training:

    python run_training.py
  4. Test your model:

    python run_inference.py

πŸ“‹ Step-by-Step Fine-tuning Process

Step 0: Setup

Install dependencies and configure your environment:

pip install -r requirements.txt
cp .env-example .env

Edit .env to configure the model you want to fine-tune.

Step 1: Prepare Documentation Data

Place your Next.js documentation files in the dataset/next-js-docs directory. Keep all files in a single folder without nested subdirectories for convenience.

Step 2: Create JSONL Files

Convert each .mdx file into JSONL format with question, reasoning, and answer fields. You have two approaches:

Option A: Simple & Cost-effective

  • Use a script to split the text by headers (a sketch follows this list)
  • Use headers as question, code blocks as answer, and the surrounding documentation text as reasoning
  • Pros: Fast and cheap
  • Cons: The resulting examples may not be very logical or readable
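
A minimal sketch of the header-splitting approach, assuming the docs use standard level-2 Markdown headers and fenced code blocks (the file name and splitting rules are illustrative; this is not a script shipped with this repo):

# split_by_headers.py -- illustrative sketch, not part of this repository
import json
import re
import sys

def md_to_examples(md_text):
    # Split into sections at level-2 headers; each header becomes the question.
    sections = re.split(r"\n(?=## )", md_text)
    for section in sections:
        lines = section.strip().splitlines()
        if not lines or not lines[0].startswith("## "):
            continue
        question = lines[0].lstrip("# ").strip()
        body = "\n".join(lines[1:])
        # Fenced code blocks become the answer, the remaining prose the reasoning.
        code_blocks = re.findall(r"```.*?\n(.*?)```", body, flags=re.DOTALL)
        reasoning = re.sub(r"```.*?```", "", body, flags=re.DOTALL).strip()
        answer = "\n\n".join(block.strip() for block in code_blocks)
        if question and answer:
            yield {"question": question, "reasoning": reasoning, "answer": answer}

if __name__ == "__main__":
    src, dst = sys.argv[1], sys.argv[2]  # e.g. dataset/next-js-docs/page.mdx dataset/jsonl/page.jsonl
    with open(src, encoding="utf-8") as f, open(dst, "w", encoding="utf-8") as out:
        for example in md_to_examples(f.read()):
            out.write(json.dumps(example, ensure_ascii=False) + "\n")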

Option B: High-quality but Expensive

  • Use Cursor or another AI tool to create the JSONL files
  • Add the files to context and use the system prompt provided below (see "System Prompt for AI-Assisted Dataset Creation")
  • The AI will create a readable and logical dataset
  • Note: Repeat this process several times, since all the files won't fit into a single context window

Result: You'll have a dataset/jsonl/ folder with many files corresponding to your .mdx files.

Step 3: Merge and Shuffle Data

Combine all JSONL files into one and shuffle them so the training data is randomly ordered rather than grouped by source file:

python utils/merge_and_shuffle_jsonl.py dataset/jsonl merged_dataset.jsonl

Result: A new merged_dataset.jsonl file with all combined data.

Step 4: Format for MLX Training

Combine the question, reasoning, and answer fields into a single text field, as required by MLX, using the merged dataset from Step 3 as input (the example below uses the repository's dataset/next-js-dataset.jsonl):

python utils/preprocess_data.py dataset/next-js-dataset.jsonl output.jsonl

Result: An output.jsonl file with a text field containing the combined fields.

Step 5: Split Training and Validation Data

  • Take 10% of the rows from output.jsonl and save them to temp/valid.jsonl (used for validation during training)
  • Save the remaining 90% to temp/train.jsonl (used for model training)
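
The documented utilities don't include a split script, so here is a minimal sketch of one way to do the 90/10 split (the file paths match the steps above; the random seed is an assumption for reproducibility):

# split_dataset.py -- illustrative sketch of the 90/10 train/validation split
import random

random.seed(42)  # assumed seed, just to make the split reproducible

with open("output.jsonl", encoding="utf-8") as f:
    rows = [line for line in f if line.strip()]

random.shuffle(rows)
n_valid = max(1, int(len(rows) * 0.10))  # 10% for validation

with open("temp/valid.jsonl", "w", encoding="utf-8") as f:
    f.writelines(rows[:n_valid])
with open("temp/train.jsonl", "w", encoding="utf-8") as f:
    f.writelines(rows[n_valid:])

print(f"Wrote {n_valid} validation and {len(rows) - n_valid} training examples")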

Step 6: Configure Training Parameters

Edit config.yaml to adjust training parameters based on your dataset size and training goals. Important: Check the length of your examples and set max_seq_length to cover the longest example (in tokens) in your training JSONL file.
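
You can get that number from utils/count_tokens.py (described below) run against temp/train.jsonl, or with a quick check like this sketch (it assumes the base model's tokenizer can be loaded with Hugging Face transformers; swap in the model you set in .env):

# max_len_check.py -- illustrative sketch for choosing max_seq_length
import json
from transformers import AutoTokenizer  # assumes the transformers package is installed

# Assumed model name -- use the same base model you configured in .env.
tokenizer = AutoTokenizer.from_pretrained("mlx-community/Devstral-Small-2507-4bit")

longest = 0
with open("temp/train.jsonl", encoding="utf-8") as f:
    for line in f:
        if line.strip():
            tokens = tokenizer.encode(json.loads(line)["text"])
            longest = max(longest, len(tokens))

print(f"Longest example: {longest} tokens -- set max_seq_length to at least this")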

Step 7: Start Fine-tuning

Run the training process:

python run_training.py
  • Training duration: ~5 hours or more depending on your Mac's power and VRAM
  • Result: adapters/[DD-MM-YYYY]/adapters.safetensors - your trained adapter
  • Note: Adapters are model-specific (Mistral adapters won't work with Llama)

Step 8: Use Your Model

Connect your adapter in run_inference.py:

ADAPTER_PATH = "./adapters/31-08-2025"  # Update with your date

Then run:

python run_inference.py
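
run_inference.py is the supported entry point; for reference, a minimal sketch of loading an adapter with the mlx_lm Python API looks roughly like this (the prompt is illustrative, and the exact generate signature can differ between mlx_lm versions):

# inference_sketch.py -- illustrative; use run_inference.py for the real workflow
from mlx_lm import load, generate  # assumes the mlx-lm package is installed

# Load the base model together with the trained LoRA adapter (paths are examples).
model, tokenizer = load(
    "mlx-community/Devstral-Small-2507-4bit",
    adapter_path="./adapters/31-08-2025",
)

# Prompt in the same format the training data was preprocessed into.
prompt = "### Question:\nHow do I create a dynamic route in Next.js?\n\n### Reasoning:\n"
print(generate(model, tokenizer, prompt=prompt, max_tokens=512))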

πŸ› οΈ Utility Scripts

merge_and_shuffle_jsonl.py

Merges all .jsonl files in a folder into a single file and shuffles the data for random distribution before training.

Usage:

python utils/merge_and_shuffle_jsonl.py dataset/jsonl merged_dataset.jsonl
python utils/merge_and_shuffle_jsonl.py temp/ output/combined.jsonl
python utils/merge_and_shuffle_jsonl.py --help
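
A simplified sketch of the merge-and-shuffle idea (not the script's actual code):

# merge_shuffle_sketch.py -- simplified illustration, not the utility itself
import glob
import os
import random
import sys

input_dir, output_path = sys.argv[1], sys.argv[2]

lines = []
for path in sorted(glob.glob(os.path.join(input_dir, "*.jsonl"))):
    with open(path, encoding="utf-8") as f:
        lines.extend(line for line in f if line.strip())

random.shuffle(lines)  # randomize ordering so examples are not grouped by source file

with open(output_path, "w", encoding="utf-8") as out:
    for line in lines:
        out.write(line.rstrip("\n") + "\n")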

preprocess_data.py

Converts structured data into the single text message format required for model training.

Output format:

{
  "text": "### Question:\n{question}\n\n### Reasoning:\n{reasoning}\n\n### Answer:\n{answer}"
}
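
A sketch of this transformation for a single record (not the script's actual code):

# preprocess_sketch.py -- illustration of turning one record into the MLX "text" format
import json

record = {
    "question": "How do I create a layout in the App Router?",
    "reasoning": "Layouts wrap child routes and preserve state across navigations.",
    "answer": "export default function Layout({ children }) { return <section>{children}</section> }",
}

formatted = {
    "text": (
        f"### Question:\n{record['question']}\n\n"
        f"### Reasoning:\n{record['reasoning']}\n\n"
        f"### Answer:\n{record['answer']}"
    )
}
print(json.dumps(formatted, ensure_ascii=False))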

Usage:

# Basic usage
python utils/preprocess_data.py input.jsonl output.jsonl

# Examples
python utils/preprocess_data.py dataset/next-js-dataset.jsonl temp/train_formatted.jsonl
python utils/preprocess_data.py dataset/jsonl/sample.jsonl processed/sample_formatted.jsonl

# Get help
python utils/preprocess_data.py --help

count_tokens.py

Analyzes .jsonl files by counting tokens in the "text" field. Generates detailed statistics including total, min, max, average, and median counts, plus distribution ranges.

Note: Requires each line to be a JSON object with a "text" field.

Example output:

==================================================
TOKEN ANALYSIS RESULTS
==================================================
Total lines in file: 930
Processed lines with 'text' field: 930
Minimum token count: 78
Maximum token count: 3,679
Average token count: 494.91
Median token count: 271.50
Total token count: 460,270

Additional statistics:
Standard deviation: 538.86

Token distribution by ranges:
0-1K: 808 lines (86.9%)
1K-5K: 122 lines (13.1%)
5K-10K: 0 lines (0.0%)
10K-50K: 0 lines (0.0%)
50K+: 0 lines (0.0%)

Usage:

python utils/count_tokens.py path/to/file.jsonl

# Examples
python utils/count_tokens.py temp/train.jsonl

# Get help
python utils/count_tokens.py --help

compare_files.py

Compares two directories and reports differences.

Usage:

# Basic usage
python utils/compare_files.py dir1 dir2

# Examples
python utils/compare_files.py dataset/jsonl dataset/next-js-docs
python utils/compare_files.py temp/ folder1/

# Save results to file
python utils/compare_files.py dataset/jsonl dataset/next-js-docs -o results.txt

# Get help
python utils/compare_files.py --help
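
Given the example above (comparing the docs folder with the JSONL folder), one likely use is spotting documentation files that don't yet have a converted counterpart. A rough sketch that compares directories by base filename (an assumption about what the script reports):

# compare_sketch.py -- illustrative; compares two directories by base filename only
from pathlib import Path
import sys

dir1, dir2 = Path(sys.argv[1]), Path(sys.argv[2])
stems1 = {p.stem for p in dir1.iterdir() if p.is_file()}
stems2 = {p.stem for p in dir2.iterdir() if p.is_file()}

print(f"Only in {dir1}: {sorted(stems1 - stems2)}")
print(f"Only in {dir2}: {sorted(stems2 - stems1)}")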

System Prompt for AI-Assisted Dataset Creation

Use this prompt with Cursor or other AI tools to convert .mdx files to JSONL format:

Analyze all .mdx files in the dataset/next-js-docs directory. For each one, create a corresponding .jsonl file in dataset/jsonl.

Instructions:
- If a .jsonl output file already exists, skip it. Do not overwrite.
- For each new file, generate 3-5 training examples in JSONL format (one JSON object per line).
- Each JSON object must contain three keys: question, reasoning, and answer.
- All generated content must be in English.
- The answer field must be a code snippet.
- CRITICAL: Any content from a <PagesOnly> block is DEPRECATED.
    a. In the reasoning field, state that the Pages Router is outdated.
    b. In the answer code, add the comment: // DEPRECATED: Uses legacy Pages Router.
- Use Tailwind CSS for all styling examples.
- Do not add empty lines in the .jsonl files.

Perform this task directly. **Do not write a Python script**.

πŸ“ Project Structure

mlx-finetune-lab/
β”œβ”€β”€ adapters/                 # Trained LoRA adapters (organized by date)
β”œβ”€β”€ dataset/
β”‚   β”œβ”€β”€ next-js-docs/        # Source documentation files
β”‚   β”œβ”€β”€ jsonl/               # Converted JSONL files
β”‚   └── next-js-dataset.jsonl # Merged dataset
β”œβ”€β”€ temp/                    # Training/validation split files
β”œβ”€β”€ utils/                   # Utility scripts
β”œβ”€β”€ config.yaml             # LoRA training configuration
β”œβ”€β”€ run_training.py         # Training script
β”œβ”€β”€ run_inference.py        # Inference script
└── requirements.txt        # Python dependencies

βš™οΈ Configuration

Environment Variables

  • MODEL_NAME: Base model for fine-tuning (default: "mlx-community/Devstral-Small-2507-4bit")
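
For reference, a typical way for run_training.py and run_inference.py to pick this value up from .env looks like the sketch below (assuming python-dotenv; the repo's actual loading code may differ):

# env_sketch.py -- illustrative; the repo's scripts may read .env differently
import os
from dotenv import load_dotenv  # assumes the python-dotenv package

load_dotenv()  # reads variables from a .env file in the current directory
model_name = os.getenv("MODEL_NAME", "mlx-community/Devstral-Small-2507-4bit")
print(f"Base model for fine-tuning: {model_name}")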

Training Configuration

Edit config.yaml to adjust:

  • LoRA parameters (layers, rank, alpha)
  • Learning rate and batch size
  • Training iterations and validation frequency
  • Memory optimization settings

🚨 Important Notes

  • macOS Only: This project requires macOS with Apple Silicon
  • Memory Requirements: Training requires significant VRAM (8GB+ recommended)
  • Model Compatibility: Adapters are model-specific and cannot be transferred between different base models
  • Training Time: Expect 5+ hours for complete training depending on your hardware

🀝 Contributing

Feel free to submit issues and enhancement requests!
