A comprehensive toolkit for fine-tuning language models using MLX on Apple Silicon Macs. This project converts Next.js documentation from Markdown format into a structured JSONL dataset and trains LoRA adapters locally. You can adapt it for your own needs.
This project is designed to:
- Convert Next.js documentation from `.md` format into a structured JSONL dataset with `question`, `reasoning`, and `answer` fields
- Run fine-tuning training on MLX-compatible models locally on Apple Silicon Macs
- Training typically takes 5+ hours depending on your system's VRAM capacity
- Note: This code only works on macOS as MLX is exclusively designed for Apple Silicon
- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Configure environment:

  ```bash
  cp .env-example .env  # Edit .env to set your preferred model
  ```

  The `WANDB_API_KEY` environment variable is a key for the https://wandb.ai service, which lets you send and monitor all logs from the training process. It is optional; QLoRA training will work without a wandb key.

- Run training:

  ```bash
  python run_training.py
  ```

- Test your model:

  ```bash
  python run_inference.py
  ```
Install dependencies and configure your environment:

```bash
pip install -r requirements.txt
cp .env-example .env
```

Edit `.env` to configure the model you want to fine-tune.
Place your Next.js documentation files in the `dataset/next-js-docs` directory. Keep all files in a single folder without nested subdirectories for convenience.
Convert each `.mdx` file into JSONL format with `question`, `reasoning`, and `answer` fields. You have two approaches:
Option A: Simple & Cost-effective

- Use a script to split text by headers (a sketch follows this list)
- Use headers as `question`, code blocks as `answer`, and documentation text as `reasoning`
- Pros: Fast and cheap
- Cons: May not be very logical or readable
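A minimal sketch of Option A, splitting each Markdown file on its headers. The heuristics here are illustrative (first code block wins, prose becomes reasoning), not this project's actual parser:

```python
import json
import re
from pathlib import Path

def doc_to_examples(md_text: str):
    """Split a Markdown document on headers and map each section to
    question / reasoning / answer fields (heuristic, not exact)."""
    # Split on header lines, keeping the header text as a delimiter.
    sections = re.split(r"^#{1,6}\s+(.+)$", md_text, flags=re.MULTILINE)
    for header, body in zip(sections[1::2], sections[2::2]):
        code_blocks = re.findall(r"```[\w-]*\n(.*?)```", body, flags=re.DOTALL)
        prose = re.sub(r"```.*?```", "", body, flags=re.DOTALL).strip()
        if code_blocks and prose:
            yield {
                "question": header.strip(),        # header becomes the question
                "reasoning": prose,                # surrounding text becomes the reasoning
                "answer": code_blocks[0].strip(),  # first code block becomes the answer
            }

for md_file in Path("dataset/next-js-docs").glob("*.mdx"):
    out_path = Path("dataset/jsonl") / (md_file.stem + ".jsonl")
    with open(out_path, "w", encoding="utf-8") as out:
        for example in doc_to_examples(md_file.read_text(encoding="utf-8")):
            out.write(json.dumps(example) + "\n")
```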
Option B: High-quality but Expensive

- Use Cursor or another AI tool to create the JSONL files
- Add files to context and use the system prompt provided below (see "System Prompt for parsing md files")
- The AI will create a readable and logical dataset
- Note: Repeat this process multiple times, as all the files won't fit in context at once
Result: You'll have a `dataset/jsonl/` folder with many files corresponding to your `.mdx` files.
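Each line in these files is one standalone JSON object. A hypothetical example (the field names are the project's; the content is illustrative):

```json
{"question": "How do I read route params in an App Router page?", "reasoning": "In the App Router, dynamic segments such as [slug] are passed to the page component through the params prop, so no extra data-fetching hook is needed.", "answer": "// app/blog/[slug]/page.tsx\nexport default function Page({ params }: { params: { slug: string } }) {\n  return <h1 className=\"text-2xl font-bold\">{params.slug}</h1>;\n}"}
```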
Combine all JSONL files into one and shuffle to prevent overfitting:

```bash
python utils/merge_and_shuffle_jsonl.py dataset/jsonl merged_dataset.jsonl
```

Result: A new `merged_dataset.jsonl` file with all combined data.
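The utility handles this for you, but the operation itself is simple; a minimal sketch of the merge-and-shuffle step (not the script's actual implementation):

```python
import random
from pathlib import Path

# Collect every line from every .jsonl file in the input folder.
lines = []
for path in sorted(Path("dataset/jsonl").glob("*.jsonl")):
    with open(path, encoding="utf-8") as f:
        lines.extend(line.rstrip("\n") for line in f if line.strip())

# Shuffle so examples from one source file are not adjacent; this is
# what helps prevent overfitting to file order during training.
random.shuffle(lines)

with open("merged_dataset.jsonl", "w", encoding="utf-8") as out:
    out.write("\n".join(lines) + "\n")
```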
Combine the `question`, `reasoning`, and `answer` fields into a single `text` field, as required by MLX:

```bash
python utils/preprocess_data.py dataset/next-js-dataset.jsonl output.jsonl
```

Result: An `output.jsonl` file with a `text` field containing the combined fields.
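Under the hood this flattens each record into the prompt template shown in the Utilities section below; a minimal sketch of that transformation (not the script's actual code):

```python
import json

TEMPLATE = "### Question:\n{question}\n\n### Reasoning:\n{reasoning}\n\n### Answer:\n{answer}"

with open("dataset/next-js-dataset.jsonl", encoding="utf-8") as src, \
     open("output.jsonl", "w", encoding="utf-8") as dst:
    for line in src:
        record = json.loads(line)
        # Collapse the three structured fields into the single "text"
        # field that the MLX trainer expects.
        dst.write(json.dumps({"text": TEMPLATE.format(**record)}) + "\n")
```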
- Take 10% of rows from `output.jsonl` and save them to `temp/valid.jsonl` (for validation during training)
- Save the remaining 90% to `temp/train.jsonl` (for model training); a minimal split sketch follows this list
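The split is a few lines of Python; a hedged sketch with a 90/10 split and a fixed seed for reproducibility (the earlier merge step has already randomized order, so the shuffle here is defensive):

```python
import random

# Read the preprocessed examples.
with open("output.jsonl", encoding="utf-8") as f:
    lines = [line if line.endswith("\n") else line + "\n" for line in f if line.strip()]

random.seed(42)        # fixed seed so the split is reproducible
random.shuffle(lines)  # defensive shuffle before splitting

split = int(len(lines) * 0.9)  # 90% train / 10% validation

with open("temp/train.jsonl", "w", encoding="utf-8") as f:
    f.writelines(lines[:split])
with open("temp/valid.jsonl", "w", encoding="utf-8") as f:
    f.writelines(lines[split:])
```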
Edit `config.yaml` to adjust training parameters based on your dataset size and training goals. Important: Check the maximum token length of your examples (see `utils/count_tokens.py` below) and update `max_seq_length` to match the longest entry in your training JSONL file.
Run the training process:

```bash
python run_training.py
```

- Training duration: ~5 hours or more depending on your Mac's power and VRAM
- Result: `adapters/[DD-MM-YYYY]/adapters.safetensors` - your trained adapter
- Note: Adapters are model-specific (Mistral adapters won't work with Llama)
Connect your adapter in `run_inference.py`:

```python
ADAPTER_PATH = "./adapters/31-08-2025"  # Update with your date
```

Then run:

```bash
python run_inference.py
```
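If you want to script inference yourself instead of using `run_inference.py`, `mlx_lm` can load the base model with an adapter applied; a minimal sketch (the prompt is illustrative and mirrors the training template):

```python
from mlx_lm import load, generate

# Load the base model with the trained LoRA adapter applied.
model, tokenizer = load(
    "mlx-community/Devstral-Small-2507-4bit",  # must match MODEL_NAME used for training
    adapter_path="./adapters/31-08-2025",      # your adapter directory
)

prompt = "### Question:\nHow do I create a layout in the App Router?\n\n### Reasoning:\n"
print(generate(model, tokenizer, prompt=prompt, max_tokens=256))
```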
Merges all `.jsonl` files in a folder into a single file and shuffles the data for random distribution before training.

Usage:

```bash
python utils/merge_and_shuffle_jsonl.py dataset/jsonl merged_dataset.jsonl
python utils/merge_and_shuffle_jsonl.py temp/ output/combined.jsonl
python utils/merge_and_shuffle_jsonl.py --help
```
Converts structured data into the single text message format required for model training.

Output format:

```json
{
  "text": "### Question:\n{question}\n\n### Reasoning:\n{reasoning}\n\n### Answer:\n{answer}"
}
```

Usage:

```bash
# Basic usage
python utils/preprocess_data.py input.jsonl output.jsonl

# Examples
python utils/preprocess_data.py dataset/next-js-dataset.jsonl temp/train_formatted.jsonl
python utils/preprocess_data.py dataset/jsonl/sample.jsonl processed/sample_formatted.jsonl

# Get help
python utils/preprocess_data.py --help
```
Analyzes `.jsonl` files by counting tokens in the `text` field. Generates detailed statistics including total, minimum, maximum, average, and median counts, plus distribution ranges.

Note: Requires each line to be a JSON object with a `text` field.
Example output:

```
==================================================
TOKEN ANALYSIS RESULTS
==================================================
Total lines in file: 930
Processed lines with 'text' field: 930

Minimum token count: 78
Maximum token count: 3,679
Average token count: 494.91
Median token count: 271.50
Total token count: 460,270

Additional statistics:
Standard deviation: 538.86

Token distribution by ranges:
0-1K: 808 lines (86.9%)
1K-5K: 122 lines (13.1%)
5K-10K: 0 lines (0.0%)
10K-50K: 0 lines (0.0%)
50K+: 0 lines (0.0%)
```
Usage:

```bash
python utils/count_tokens.py path/to/file.jsonl

# Examples
python utils/count_tokens.py temp/train.jsonl

# Get help
python utils/count_tokens.py --help
```
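To reproduce these statistics by hand, here is a sketch assuming a Hugging Face tokenizer (the tokenizer name below is a placeholder; use whichever matches your base model):

```python
import json
import statistics
from transformers import AutoTokenizer  # assumption: an HF tokenizer approximates the counts

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")  # hypothetical choice

counts = []
with open("temp/train.jsonl", encoding="utf-8") as f:
    for line in f:
        counts.append(len(tokenizer.encode(json.loads(line)["text"])))

print(f"Lines: {len(counts)}")
print(f"Min/Max: {min(counts)} / {max(counts)}")
print(f"Average: {sum(counts) / len(counts):.2f}")
print(f"Median: {statistics.median(counts):.2f}")
```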
Compares two directories and reports differences.
Usage:

```bash
# Basic usage
python utils/compare_files.py dir1 dir2

# Examples
python utils/compare_files.py dataset/jsonl dataset/next-js-docs
python utils/compare_files.py temp/ folder1/

# Save results to file
python utils/compare_files.py dataset/jsonl dataset/next-js-docs -o results.txt

# Get help
python utils/compare_files.py --help
```
Use this prompt with Cursor or other AI tools to convert `.mdx` files to JSONL format:

```
Analyze all .mdx files in the dataset/next-js-docs directory. For each one, create a corresponding .jsonl file in dataset/jsonl.

Instructions:
- If a .jsonl output file already exists, skip it. Do not overwrite.
- For each new file, generate 3-5 training examples in JSONL format (one JSON object per line).
- Each JSON object must contain three keys: question, reasoning, and answer.
- All generated content must be in English.
- The answer field must be a code snippet.
- CRITICAL: Any content from a <PagesOnly> block is DEPRECATED.
  a. In the reasoning field, state that the Pages Router is outdated.
  b. In the answer code, add the comment: // DEPRECATED: Uses legacy Pages Router.
- Use Tailwind CSS for all styling examples.
- Do not add empty lines in the .jsonl files.

Perform this task directly. **Do not write a Python script**.
```
```
mlx-finetune-lab/
├── adapters/                  # Trained LoRA adapters (organized by date)
├── dataset/
│   ├── next-js-docs/          # Source documentation files
│   ├── jsonl/                 # Converted JSONL files
│   └── next-js-dataset.jsonl  # Merged dataset
├── temp/                      # Training/validation split files
├── utils/                     # Utility scripts
├── config.yaml                # LoRA training configuration
├── run_training.py            # Training script
├── run_inference.py           # Inference script
└── requirements.txt           # Python dependencies
```
`MODEL_NAME`: Base model for fine-tuning (default: `"mlx-community/Devstral-Small-2507-4bit"`)
Edit `config.yaml` to adjust:

- LoRA parameters (layers, rank, alpha)
- Learning rate and batch size
- Training iterations and validation frequency
- Memory optimization settings
- macOS Only: This project requires macOS with Apple Silicon
- Memory Requirements: Training requires significant VRAM (8GB+ recommended)
- Model Compatibility: Adapters are model-specific and cannot be transferred between different base models
- Training Time: Expect 5+ hours for complete training depending on your hardware
Feel free to submit issues and enhancement requests!