MLX Fine-tune Lab

A comprehensive toolkit for fine-tuning language models using MLX on Apple Silicon Macs. This project converts Next.js documentation from Markdown format into a structured JSONL dataset and trains LoRA adapters locally. You can adapt it for your own needs.

🎯 Project Purpose

This project is designed to:

  • Convert Next.js documentation from .mdx format into a structured JSONL dataset with question, reasoning, and answer fields
  • Run LoRA fine-tuning on MLX-compatible models locally on Apple Silicon Macs
  • Note: Training typically takes 5+ hours, depending on your system's performance and available memory
  • Note: This code only works on macOS, as MLX is designed exclusively for Apple Silicon

πŸš€ Quick Start

  1. Install dependencies:

    pip install -r requirements.txt
  2. Configure environment:

    cp .env-example .env
    # Edit .env to set your preferred model

The WANDB_API_KEY environment variable is an API key for the https://wandb.ai service, which lets you send and monitor all logs related to the training process. It is optional; QLoRA training will work without a wandb key.

  3. Run training:

    python run_training.py
  4. Test your model:

    python run_inference.py

πŸ“‹ Step-by-Step Fine-tuning Process

Step 0: Setup

Install dependencies and configure your environment:

pip install -r requirements.txt
cp .env-example .env

Edit .env to configure the model you want to fine-tune.

Step 1: Prepare Documentation Data

Place your Next.js documentation files in the dataset/next-js-docs directory. Keep all files in a single folder without nested subdirectories for convenience.

Step 2: Create JSONL Files

Convert each .mdx file into JSONL format with question, reasoning, and answer fields. You have two approaches:

Option A: Simple & Cost-effective

  • Use a script to split the text by headers (a sketch follows this list)
  • Use headers as question, code blocks as answer, and the surrounding documentation text as reasoning
  • Pros: Fast and cheap
  • Cons: The resulting examples may not be very logical or readable
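
A minimal sketch of the header-splitting approach, assuming the docs use standard level-2 Markdown headers and fenced code blocks (the file name and splitting rules are illustrative; this is not a script shipped with this repo):

# split_by_headers.py -- illustrative sketch, not part of this repository
import json
import re
import sys

def md_to_examples(md_text):
    # Split into sections at level-2 headers; each header becomes the question.
    sections = re.split(r"\n(?=## )", md_text)
    for section in sections:
        lines = section.strip().splitlines()
        if not lines or not lines[0].startswith("## "):
            continue
        question = lines[0].lstrip("# ").strip()
        body = "\n".join(lines[1:])
        # Fenced code blocks become the answer, the remaining prose the reasoning.
        code_blocks = re.findall(r"```.*?\n(.*?)```", body, flags=re.DOTALL)
        reasoning = re.sub(r"```.*?```", "", body, flags=re.DOTALL).strip()
        answer = "\n\n".join(block.strip() for block in code_blocks)
        if question and answer:
            yield {"question": question, "reasoning": reasoning, "answer": answer}

if __name__ == "__main__":
    src, dst = sys.argv[1], sys.argv[2]  # e.g. dataset/next-js-docs/page.mdx dataset/jsonl/page.jsonl
    with open(src, encoding="utf-8") as f, open(dst, "w", encoding="utf-8") as out:
        for example in md_to_examples(f.read()):
            out.write(json.dumps(example, ensure_ascii=False) + "\n")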

Option B: High-quality but Expensive

  • Use Cursor or another AI tool to create the JSONL files
  • Add the files to context and use the system prompt provided below (see "System Prompt for AI-Assisted Dataset Creation")
  • The AI will create a readable and logical dataset
  • Note: Repeat this process several times, since all the files won't fit into a single context window

Result: You'll have a dataset/jsonl/ folder with many files corresponding to your .mdx files.

Step 3: Merge and Shuffle Data

Combine all JSONL files into one and shuffle them so the training data is randomly ordered rather than grouped by source file:

python utils/merge_and_shuffle_jsonl.py dataset/jsonl merged_dataset.jsonl

Result: A new merged_dataset.jsonl file with all combined data.

Step 4: Format for MLX Training

Combine the question, reasoning, and answer fields into a single text field, as required by MLX, using the merged dataset from Step 3 as input (the example below uses the repository's dataset/next-js-dataset.jsonl):

python utils/preprocess_data.py dataset/next-js-dataset.jsonl output.jsonl

Result: An output.jsonl file with a text field containing the combined fields.

Step 5: Split Training and Validation Data

  • Take 10% of the rows from output.jsonl and save them to temp/valid.jsonl (used for validation during training)
  • Save the remaining 90% to temp/train.jsonl (used for model training)
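
The documented utilities don't include a split script, so here is a minimal sketch of one way to do the 90/10 split (the file paths match the steps above; the random seed is an assumption for reproducibility):

# split_dataset.py -- illustrative sketch of the 90/10 train/validation split
import random

random.seed(42)  # assumed seed, just to make the split reproducible

with open("output.jsonl", encoding="utf-8") as f:
    rows = [line for line in f if line.strip()]

random.shuffle(rows)
n_valid = max(1, int(len(rows) * 0.10))  # 10% for validation

with open("temp/valid.jsonl", "w", encoding="utf-8") as f:
    f.writelines(rows[:n_valid])
with open("temp/train.jsonl", "w", encoding="utf-8") as f:
    f.writelines(rows[n_valid:])

print(f"Wrote {n_valid} validation and {len(rows) - n_valid} training examples")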

Step 6: Configure Training Parameters

Edit config.yaml to adjust training parameters based on your dataset size and training goals. Important: Check the length of your examples and set max_seq_length to cover the longest example (in tokens) in your training JSONL file.
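
You can get that number from utils/count_tokens.py (described below) run against temp/train.jsonl, or with a quick check like this sketch (it assumes the base model's tokenizer can be loaded with Hugging Face transformers; swap in the model you set in .env):

# max_len_check.py -- illustrative sketch for choosing max_seq_length
import json
from transformers import AutoTokenizer  # assumes the transformers package is installed

# Assumed model name -- use the same base model you configured in .env.
tokenizer = AutoTokenizer.from_pretrained("mlx-community/Devstral-Small-2507-4bit")

longest = 0
with open("temp/train.jsonl", encoding="utf-8") as f:
    for line in f:
        if line.strip():
            tokens = tokenizer.encode(json.loads(line)["text"])
            longest = max(longest, len(tokens))

print(f"Longest example: {longest} tokens -- set max_seq_length to at least this")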

Step 7: Start Fine-tuning

Run the training process:

python run_training.py
  • Training duration: ~5 hours or more depending on your Mac's power and VRAM
  • Result: adapters/[DD-MM-YYYY]/adapters.safetensors - your trained adapter
  • Note: Adapters are model-specific (Mistral adapters won't work with Llama)

Step 8: Use Your Model

Connect your adapter in run_inference.py:

ADAPTER_PATH = "./adapters/31-08-2025"  # Update with your date

Then run:

python run_inference.py
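
run_inference.py is the supported entry point; for reference, a minimal sketch of loading an adapter with the mlx_lm Python API looks roughly like this (the prompt is illustrative, and the exact generate signature can differ between mlx_lm versions):

# inference_sketch.py -- illustrative; use run_inference.py for the real workflow
from mlx_lm import load, generate  # assumes the mlx-lm package is installed

# Load the base model together with the trained LoRA adapter (paths are examples).
model, tokenizer = load(
    "mlx-community/Devstral-Small-2507-4bit",
    adapter_path="./adapters/31-08-2025",
)

# Prompt in the same format the training data was preprocessed into.
prompt = "### Question:\nHow do I create a dynamic route in Next.js?\n\n### Reasoning:\n"
print(generate(model, tokenizer, prompt=prompt, max_tokens=512))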

πŸ› οΈ Utility Scripts

merge_and_shuffle_jsonl.py

Merges all .jsonl files in a folder into a single file and shuffles the data for random distribution before training.

Usage:

python utils/merge_and_shuffle_jsonl.py dataset/jsonl merged_dataset.jsonl
python utils/merge_and_shuffle_jsonl.py temp/ output/combined.jsonl
python utils/merge_and_shuffle_jsonl.py --help
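
A simplified sketch of the merge-and-shuffle idea (not the script's actual code):

# merge_shuffle_sketch.py -- simplified illustration, not the utility itself
import glob
import os
import random
import sys

input_dir, output_path = sys.argv[1], sys.argv[2]

lines = []
for path in sorted(glob.glob(os.path.join(input_dir, "*.jsonl"))):
    with open(path, encoding="utf-8") as f:
        lines.extend(line for line in f if line.strip())

random.shuffle(lines)  # randomize ordering so examples are not grouped by source file

with open(output_path, "w", encoding="utf-8") as out:
    for line in lines:
        out.write(line.rstrip("\n") + "\n")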

preprocess_data.py

Converts structured data into the single text message format required for model training.

Output format:

{
  "text": "### Question:\n{question}\n\n### Reasoning:\n{reasoning}\n\n### Answer:\n{answer}"
}
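
A sketch of this transformation for a single record (not the script's actual code):

# preprocess_sketch.py -- illustration of turning one record into the MLX "text" format
import json

record = {
    "question": "How do I create a layout in the App Router?",
    "reasoning": "Layouts wrap child routes and preserve state across navigations.",
    "answer": "export default function Layout({ children }) { return <section>{children}</section> }",
}

formatted = {
    "text": (
        f"### Question:\n{record['question']}\n\n"
        f"### Reasoning:\n{record['reasoning']}\n\n"
        f"### Answer:\n{record['answer']}"
    )
}
print(json.dumps(formatted, ensure_ascii=False))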

Usage:

# Basic usage
python utils/preprocess_data.py input.jsonl output.jsonl

# Examples
python utils/preprocess_data.py dataset/next-js-dataset.jsonl temp/train_formatted.jsonl
python utils/preprocess_data.py dataset/jsonl/sample.jsonl processed/sample_formatted.jsonl

# Get help
python utils/preprocess_data.py --help

count_tokens.py

Analyzes .jsonl files by counting tokens in the "text" field. Generates detailed statistics including total, min, max, average, and median counts, plus distribution ranges.

Note: Requires each line to be a JSON object with a "text" field.

Example output:

==================================================
TOKEN ANALYSIS RESULTS
==================================================
Total lines in file: 930
Processed lines with 'text' field: 930
Minimum token count: 78
Maximum token count: 3,679
Average token count: 494.91
Median token count: 271.50
Total token count: 460,270

Additional statistics:
Standard deviation: 538.86

Token distribution by ranges:
0-1K: 808 lines (86.9%)
1K-5K: 122 lines (13.1%)
5K-10K: 0 lines (0.0%)
10K-50K: 0 lines (0.0%)
50K+: 0 lines (0.0%)

Usage:

python utils/count_tokens.py path/to/file.jsonl

# Examples
python utils/count_tokens.py temp/train.jsonl

# Get help
python utils/count_tokens.py --help

compare_files.py

Compares two directories and reports differences.

Usage:

# Basic usage
python utils/compare_files.py dir1 dir2

# Examples
python utils/compare_files.py dataset/jsonl dataset/next-js-docs
python utils/compare_files.py temp/ folder1/

# Save results to file
python utils/compare_files.py dataset/jsonl dataset/next-js-docs -o results.txt

# Get help
python utils/compare_files.py --help
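
Given the example above (comparing the docs folder with the JSONL folder), one likely use is spotting documentation files that don't yet have a converted counterpart. A rough sketch that compares directories by base filename (an assumption about what the script reports):

# compare_sketch.py -- illustrative; compares two directories by base filename only
from pathlib import Path
import sys

dir1, dir2 = Path(sys.argv[1]), Path(sys.argv[2])
stems1 = {p.stem for p in dir1.iterdir() if p.is_file()}
stems2 = {p.stem for p in dir2.iterdir() if p.is_file()}

print(f"Only in {dir1}: {sorted(stems1 - stems2)}")
print(f"Only in {dir2}: {sorted(stems2 - stems1)}")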

System Prompt for AI-Assisted Dataset Creation

Use this prompt with Cursor or other AI tools to convert .mdx files to JSONL format:

Analyze all .mdx files in the dataset/next-js-docs directory. For each one, create a corresponding .jsonl file in dataset/jsonl.

Instructions:
- If a .jsonl output file already exists, skip it. Do not overwrite.
- For each new file, generate 3-5 training examples in JSONL format (one JSON object per line).
- Each JSON object must contain three keys: question, reasoning, and answer.
- All generated content must be in English.
- The answer field must be a code snippet.
- CRITICAL: Any content from a <PagesOnly> block is DEPRECATED.
    a. In the reasoning field, state that the Pages Router is outdated.
    b. In the answer code, add the comment: // DEPRECATED: Uses legacy Pages Router.
- Use Tailwind CSS for all styling examples.
- Do not add empty lines in the .jsonl files.

Perform this task directly. **Do not write a Python script**.

πŸ“ Project Structure

mlx-finetune-lab/
β”œβ”€β”€ adapters/                 # Trained LoRA adapters (organized by date)
β”œβ”€β”€ dataset/
β”‚   β”œβ”€β”€ next-js-docs/        # Source documentation files
β”‚   β”œβ”€β”€ jsonl/               # Converted JSONL files
β”‚   └── next-js-dataset.jsonl # Merged dataset
β”œβ”€β”€ temp/                    # Training/validation split files
β”œβ”€β”€ utils/                   # Utility scripts
β”œβ”€β”€ config.yaml             # LoRA training configuration
β”œβ”€β”€ run_training.py         # Training script
β”œβ”€β”€ run_inference.py        # Inference script
└── requirements.txt        # Python dependencies

βš™οΈ Configuration

Environment Variables

  • MODEL_NAME: Base model for fine-tuning (default: "mlx-community/Devstral-Small-2507-4bit")
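
For reference, a typical way for run_training.py and run_inference.py to pick this value up from .env looks like the sketch below (assuming python-dotenv; the repo's actual loading code may differ):

# env_sketch.py -- illustrative; the repo's scripts may read .env differently
import os
from dotenv import load_dotenv  # assumes the python-dotenv package

load_dotenv()  # reads variables from a .env file in the current directory
model_name = os.getenv("MODEL_NAME", "mlx-community/Devstral-Small-2507-4bit")
print(f"Base model for fine-tuning: {model_name}")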

Training Configuration

Edit config.yaml to adjust:

  • LoRA parameters (layers, rank, alpha)
  • Learning rate and batch size
  • Training iterations and validation frequency
  • Memory optimization settings

🚨 Important Notes

  • macOS Only: This project requires macOS with Apple Silicon
  • Memory Requirements: Training requires significant VRAM (8GB+ recommended)
  • Model Compatibility: Adapters are model-specific and cannot be transferred between different base models
  • Training Time: Expect 5+ hours for complete training depending on your hardware

🀝 Contributing

Feel free to submit issues and enhancement requests!
