From b07aaf747f7a3c9a5701e77fb073031ad0119440 Mon Sep 17 00:00:00 2001 From: heejingithub Date: Tue, 26 Aug 2025 12:52:26 -0400 Subject: [PATCH 1/4] Add KR bilingual fine-tuning notebook --- ...l_clientServerArch_CORPUS_SYNCED_TOC.ipynb | 1198 +++++++++++++++++ 1 file changed, 1198 insertions(+) create mode 100644 gpt_oss_ft_kr_bilingual_clientServerArch_CORPUS_SYNCED_TOC.ipynb diff --git a/gpt_oss_ft_kr_bilingual_clientServerArch_CORPUS_SYNCED_TOC.ipynb b/gpt_oss_ft_kr_bilingual_clientServerArch_CORPUS_SYNCED_TOC.ipynb new file mode 100644 index 0000000000..b90e73cfbe --- /dev/null +++ b/gpt_oss_ft_kr_bilingual_clientServerArch_CORPUS_SYNCED_TOC.ipynb @@ -0,0 +1,1198 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "538f25ce", + "metadata": {}, + "source": [ + "# ๐Ÿ‡ฐ๐Ÿ‡ท๐Ÿ‡บ๐Ÿ‡ธ Fine-tune gpt-oss for better Korean language performance โ€” **Bilingual (KR ยท EN)**\n", + "August, 2025\n", + "\n", + "์ด ๋…ธํŠธ๋ถ์€ OpenAI์˜ **gpt-oss (openโ€‘weight)** ๋ชจ๋ธ์„ **ํ•œ๊ตญ ๋‰ด์Šค ๋ฌธ์ฒด + ์ตœ์‹  ๋Œ€ํ™”์ฒด**๋กœ ์„ธ๋ฐ€ ํŠœ๋‹ํ•˜๋Š” ๋ฐฉ๋ฒ•์„\n", + "ํ•œ๊ตญ์–ด/์˜์–ด **์ด์ค‘ ์–ธ์–ด**๋กœ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค. \n", + "This notebook shows how to fineโ€‘tune OpenAI's **gpt-oss (openโ€‘weight)** models for **Korean news style + modern chat tone**, in **Korean & English**.\n", + "\n", + "---\n", + "\n", + "### MXFP4 workflow clarifications ยท MXFP4 ์›Œํฌํ”Œ๋กœ ์ •๋ฆฌ\n", + "\n", + "**EN:** \n", + "- Training or fine-tuning **directly in MXFP4 is not supported** by public frameworks today. \n", + "- Recommended path: train in **BF16** (or **QLoRA 4โ€‘bit nf4**) โ†’ **merge LoRA** โ†’ **postโ€‘training quantize to MXFP4** โ†’ `save_pretrained()` for deployment. \n", + "- If you need an MXFP4 artifact, you must **reโ€‘quantize from BF16** after merging adapters. (Export utilities are evolving; if your toolchain already supports MXFP4 serialization, thatโ€™s ideal.)\n", + "\n", + "**KR:** \n", + "- ํ˜„์žฌ ๊ณต๊ฐœ ํ”„๋ ˆ์ž„์›Œํฌ์—์„œ๋Š” **MXFP4๋กœ ์ง์ ‘ ํ•™์Šต/ํŒŒ์ธํŠœ๋‹**์ด ์ง€์›๋˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค. \n", + "- ๊ถŒ์žฅ ๊ฒฝ๋กœ: **BF16**(๋˜๋Š” **QLoRA 4โ€‘bit nf4**)๋กœ ํ•™์Šต โ†’ **LoRA ๋ณ‘ํ•ฉ** โ†’ **์‚ฌํ›„(MXFP4) ์–‘์žํ™”** โ†’ ๋ฐฐํฌ์šฉ์œผ๋กœ `save_pretrained()` ์ €์žฅ. \n", + "- MXFP4 ์•„ํ‹ฐํŒฉํŠธ๊ฐ€ ํ•„์š”ํ•˜๋ฉด, ์–ด๋Œ‘ํ„ฐ ๋ณ‘ํ•ฉ ํ›„ **BF16 โ†’ MXFP4 ์žฌ์–‘์žํ™”**๊ฐ€ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค. (์ง๋ ฌํ™” ์œ ํ‹ธ์€ ์ง„ํ™” ์ค‘์ด๋ฉฐ, ํˆด์ฒด์ธ์—์„œ MXFP4 ์ €์žฅ์„ ์ง€์›ํ•˜๋ฉด ๊ฐ€์žฅ ์ข‹์Šต๋‹ˆ๋‹ค.)\n", + "\n", + "---\n", + "\n", + "### LoRA targets (MoE) ยท LoRA ํƒ€๊นƒ(MoE ํฌํ•จ)\n", + "\n", + "**EN:** \n", + "- Minimal config (fast, low VRAM): target attention only, e.g. `[\"q_proj\",\"v_proj\"]`. \n", + "- MoEโ€‘aware config (better domain adaptation, more VRAM/time): include **expert projection layers** in addition to attention. 
\n", + "\n", + "```python\n", + "from peft import LoraConfig\n", + "\n", + "TARGET_MODULES = [\"q_proj\", \"v_proj\"] # baseline\n", + "MOE_TARGET_PARAMETERS = [\n", + " # example expert layers; adjust indices to your model depth\n", + " \"mlp.experts.gate_up_proj\",\n", + " \"mlp.experts.down_proj\",\n", + "]\n", + "\n", + "lora_cfg = LoraConfig(\n", + " r=16, lora_alpha=32, lora_dropout=0.05,\n", + " target_modules=\"all-linear\", # cover all linear layers\n", + " target_parameters=MOE_TARGET_PARAMETERS, # add expert projections\n", + " bias=\"none\", task_type=\"CAUSAL_LM\",\n", + ")\n", + "```\n", + "\n", + "- Start with attentionโ€‘only; if KR domain fit is insufficient, enable MoE targets and reโ€‘eval.\n", + "\n", + "**KR:** \n", + "- ์ตœ์†Œ ๊ตฌ์„ฑ(๋น ๋ฅด๊ณ  VRAM ์ ˆ์•ฝ): `[\"q_proj\",\"v_proj\"]` ๋“ฑ **์–ดํ…์…˜๋งŒ** ์ ์šฉ. \n", + "- **MoE ์ธ์ง€ ๊ตฌ์„ฑ**(๋„๋ฉ”์ธ ์ ํ•ฉ์„ฑโ†‘, ์ž์› ์†Œ๋ชจโ†‘): ์–ดํ…์…˜์— **์ „๋ฌธ๊ฐ€(Expert) ํˆฌ์˜ ๋ ˆ์ด์–ด**๋ฅผ ์ถ”๊ฐ€๋กœ ํฌํ•จ. \n", + "- ๋จผ์ € ์–ดํ…์…˜๋งŒ์œผ๋กœ ์‹œ๋„ํ•œ ๋’ค, ํ•œ๊ตญ์–ด ๋„๋ฉ”์ธ ์ ํ•ฉ์„ฑ์ด ๋ถ€์กฑํ•˜๋ฉด MoE ํƒ€๊นƒ์„ ์ผœ๊ณ  ์žฌํ‰๊ฐ€ํ•˜์„ธ์š”." + ] + }, + { + "cell_type": "markdown", + "id": "bd7c12ff", + "metadata": {}, + "source": [ + "## Contents ยท ๋ชฉ์ฐจ\n", + "0) Goals & Scope ยท ๋ชฉํ‘œ & ๋ฒ”์œ„ \n", + "1) Environment check ยท ํ™˜๊ฒฝ ์ ๊ฒ€ \n", + "2) ์„ค์ •๊ฐ’ ยท Config \n", + "3) ํŒจํ‚ค์ง€ ์„ค์น˜ ยท Install Deps \n", + "4) ๋ฐ์ดํ„ฐ ์†Œ์‹ฑ(ํ•œ๊ตญํ˜•) ยท KRโ€‘Context Data Sourcing \n", + "5) ์ƒ˜ํ”Œ ๋ฐ์ดํ„ฐ ์ƒ์„ฑ ยท Create Sample Data \n", + "6) ์ „์ฒ˜๋ฆฌ(PIPA) & ์Šคํƒ€์ผ ๋ผ๋ฒจ ยท PII Scrubbing & Style Tags \n", + "7) ๋ฐ์ดํ„ฐ ๋กœ๋”ฉ/ํฌ๋งทํŒ… ยท Load & Format \n", + "8) ๋ชจ๋ธ/ํ† ํฌ๋‚˜์ด์ € ๋กœ๋“œ ยท Load Model & Tokenizer \n", + "9) Fineโ€‘Tuning (LoRA/QLoRA) ยท ์„ธ๋ฐ€ ํŠœ๋‹ \n", + " 9a) Data curation & splits \n", + " 9b) Hyperparameters (r/alpha/dropout) \n", + " 9c) Merge adapters (BF16) \n", + " 9d) Save merged BF16 (`save_pretrained`) \n", + " 9e) Export & Quantize (BF16 โ†’ MXFP4) ยท ๋‚ด๋ณด๋‚ด๊ธฐ & ์–‘์žํ™” \n", + "10) ํ‰๊ฐ€(๋‰ด์Šค/๋Œ€ํ™”) ยท Evaluation (News/Chat) \n", + "11) Inference Prompt Templates ยท ์ถ”๋ก  ํ”„๋กฌํ”„ํŠธ ํ…œํ”Œ๋ฆฟ \n", + "12) ์ตœ์‹ ์„ฑ ์œ ์ง€ ยท Freshness Strategy \n", + "13) ์•ˆ์ „/์ปดํ”Œ๋ผ์ด์–ธ์Šค ยท Safety & Compliance \n", + "14) ๋ฌธ์ œํ•ด๊ฒฐ & ๋‹ค์Œ ๋‹จ๊ณ„ ยท Troubleshooting & Next Steps\n" + ] + }, + { + "cell_type": "markdown", + "id": "bb8655d2", + "metadata": {}, + "source": [ + "### โš™๏ธ Training vs Quantization โ€” Whatโ€™s supported\n", + "- **Do:** Train with BF16/FP16 or QLoRA; export merged weights.\n", + "- **Then:** Quantize to **MXFP4** for inference using provided conversion scripts/utilities.\n", + "- **Donโ€™t:** Attempt to run an endโ€‘toโ€‘end โ€œtrain in MXFP4โ€ pipeline โ€” not supported today." + ] + }, + { + "cell_type": "markdown", + "id": "bb24a3d9", + "metadata": {}, + "source": [ + "> **PII & Compliance Reminder:** For KR data, follow your enterprise policy (mask RRN/phone/account IDs, remove emails) **before** training & logging. Keep train/val/test splits stratified by source and style tags." + ] + }, + { + "cell_type": "markdown", + "id": "e1e883f5", + "metadata": {}, + "source": [ + "### ๐Ÿงช MoE adapters (optional)\n", + "You can target MoE layers with adapters, but treat this as **advanced/experimental**. Start with attention projections first and validate KR benchmarks before expanding scope." 
+ ] + }, + { + "cell_type": "markdown", + "id": "179543e6", + "metadata": {}, + "source": [ + "> **Note:** Keep `transformers`, `peft`, `accelerate`, and `trl` at versions known to support BF16/4โ€‘bit LoRA. \n", + "If you pin `safetensors`, remember that **native MXFP4 serialization is not yet standardized**; loaders may upcast internally." + ] + }, + { + "cell_type": "markdown", + "id": "f8e743f0", + "metadata": {}, + "source": [ + "### ๐Ÿ”Ž Support Matrix โ€” At a glance\n", + "- **Fineโ€‘tuning precision:** BF16/FP16 โœ… ยท QLoRA 4โ€‘bit โœ… ยท **MXFP4 FT โŒ**\n", + "- **Quantization target:** MXFP4 โœ… (postโ€‘training)\n", + "- **API FT (hosted) for OSS models:** โŒ\n", + "- **Openโ€‘source FT (Transformers/TRL/PEFT):** โœ…\n", + "- **LoRA targets:** `q_proj`, `k_proj`, `v_proj`, `o_proj` โœ…; MoE expert adapters **experimental** โš ๏ธ" + ] + }, + { + "cell_type": "markdown", + "id": "f4dec1f6", + "metadata": {}, + "source": [ + "---" + ] + }, + { + "cell_type": "markdown", + "id": "e3d489c2", + "metadata": {}, + "source": [ + "## 0) Goals & Scope ยท ๋ชฉํ‘œ & ๋ฒ”์œ„\n", + "- **KR**: ํ•œ๊ตญ์–ด ์ผ๋ฐ˜ ๋‰ด์Šค + ์ผ์ƒ/์ƒ๋‹ด ๋Œ€ํ™”์ฒด์— ์ตœ์ ํ™”. `style=news_headline|news_lead|news_body|kakao_casual|kakao_formal` ์ œ์–ด.\n", + "- **EN**: Optimize for Korean news writing and modern chat tone; control output via style tags above.\n", + "- **Stack**: `transformers`, `trl(SFTTrainer)`, `peft(LoRA/QLoRA)`, `datasets`.\n", + "- **Hardware**: Single/few GPUs (BF16 preferred). CPU/Mac for lightweight tests." + ] + }, + { + "cell_type": "markdown", + "id": "db97218d", + "metadata": {}, + "source": [ + "## 1) Environment check ยท ํ™˜๊ฒฝ ์ ๊ฒ€" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "5babb2c3", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Python: 3.10.12 (main, May 27 2025, 17:12:29) [GCC 11.4.0]\n", + "OS/Platform: Linux-6.8.0-60-generic-x86_64-with-glibc2.35\n", + "CUDA_VISIBLE_DEVICES: \n", + "Torch: 2.7.1+cu126 CUDA: True\n", + "GPU: NVIDIA H100 80GB HBM3\n" + ] + } + ], + "source": [ + "import os, sys, platform\n", + "print(\"Python:\", sys.version)\n", + "print(\"OS/Platform:\", platform.platform())\n", + "print(\"CUDA_VISIBLE_DEVICES:\", os.environ.get(\"CUDA_VISIBLE_DEVICES\", \"\"))\n", + "\n", + "try:\n", + " import torch\n", + " print(\"Torch:\", torch.__version__, \"CUDA:\", torch.cuda.is_available())\n", + " if torch.cuda.is_available():\n", + " print(\"GPU:\", torch.cuda.get_device_name(0))\n", + "except Exception as e:\n", + " print(\"Torch not installed or GPU not detected:\", e)" + ] + }, + { + "cell_type": "markdown", + "id": "25688688", + "metadata": {}, + "source": [ + "## 2) ์„ค์ •๊ฐ’ ยท Config" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "c15817f7", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Config ready.\n" + ] + } + ], + "source": [ + "from pathlib import Path\n", + "import os\n", + "\n", + "# === Model & Training Params ===\n", + "BASE_URL = \"http://localhost:8000/v1\" # vLLM OpenAI-compatible endpoint\n", + "API_KEY = \"dummy-key\" # vLLM ignores; SDK requires a value\n", + "MODEL = \"openai/gpt-oss-120b\" # must match the model vLLM loaded\n", + "OUTPUT_DIR = \"ft-oss-kr-news-chat-bilingual\"\n", + "\n", + "# Data mix (news : chat)\n", + "MIX_NEWS = 0.6\n", + "MIX_CHAT = 0.4\n", + "\n", + "# LoRA\n", + "LORA_R = 8\n", + "LORA_ALPHA = 16\n", + "LORA_DROPOUT = 0.05\n", + "TARGET_MODULES = [\"q_proj\", 
\"v_proj\"] # adjust per model\n", + "\n", + "# Training\n", + "EPOCHS = 1\n", + "PER_DEVICE_BS = 2\n", + "GRAD_ACCUM = 8\n", + "LEARNING_RATE = 2e-4\n", + "BF16 = True\n", + "LOG_STEPS = 20\n", + "SAVE_STEPS = 200\n", + "SAVE_TOTAL_LIMIT = 2\n", + "\n", + "print(\"Config ready.\")" + ] + }, + { + "cell_type": "markdown", + "id": "85f258eb", + "metadata": {}, + "source": [ + "## 3) ํŒจํ‚ค์ง€ ์„ค์น˜ ยท Install Deps" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "b1b75968", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "transformers: 4.55.3\n", + "accelerate: 1.10.0\n", + "datasets: 4.0.0\n", + "peft: not installed\n", + "trl: 0.21.0\n", + "bitsandbytes: not installed\n", + "sentencepiece: 0.2.1\n", + "vllm: 0.10.1\n", + "llama_cpp: 0.3.16\n", + "pip: 25.2\n", + "Install cells are commented. Un-comment in your environment.\n" + ] + } + ], + "source": [ + "# %pip install --upgrade pip\n", + "# %pip install transformers accelerate datasets peft trl bitsandbytes sentencepiece\n", + "# (optional) serving/runtimes\n", + "# %pip install vllm\n", + "# %pip install llama-cpp-python\n", + "\n", + "import importlib, pip\n", + "\n", + "for dep in [\"transformers\",\"accelerate\",\"datasets\",\"peft\",\"trl\",\n", + " \"bitsandbytes\",\"sentencepiece\",\"vllm\",\"llama_cpp\"]:\n", + " try:\n", + " print(f\"{dep}: {importlib.import_module(dep).__version__}\")\n", + " except Exception:\n", + " print(f\"{dep}: not installed\")\n", + "\n", + "print(f\"pip: {pip.__version__}\")\n", + "\n", + "print(\"Install cells are commented. Un-comment in your environment.\")" + ] + }, + { + "cell_type": "markdown", + "id": "de8647fd", + "metadata": {}, + "source": [ + "## 4) ๋ฐ์ดํ„ฐ ์†Œ์‹ฑ(ํ•œ๊ตญํ˜•) ยท KRโ€‘Context Data Sourcing" + ] + }, + { + "cell_type": "markdown", + "id": "da22cbd6", + "metadata": {}, + "source": [ + "**KR** \n", + "- ๊ณต๊ฐœ ๋ฒค์น˜๋งˆํฌ(์ฃผ์ œ ๋ถ„๋ฅ˜/์š”์•ฝ/QA) + **ํ—ˆ์šฉ๋œ ๋‰ด์Šค API์˜ ๋ฉ”ํƒ€๋ฐ์ดํ„ฐ(์ œ๋ชฉ/์š”์•ฝ/์„น์…˜)** ์ค‘์‹ฌ์œผ๋กœ ์Šคํƒ€์ผ ๋ณด์ •.\n", + "- ๊ธฐ์‚ฌ **์›๋ฌธ ๋Œ€๋Ÿ‰ ์žฌํ•™์Šต์€ ์ €์ž‘๊ถŒ/์•ฝ๊ด€ ์ด์Šˆ** โ†’ ๋ฉ”ํƒ€๋ฐ์ดํ„ฐยท๊ณต๊ฐœ ์ฝ”ํผ์Šค ์œ„์ฃผ.\n", + "- ๋Œ€ํ™”์ฒด๋Š” ํ•ฉ๋ฒ• ๊ณต๊ฐœ ์ฝ”ํผ์Šค(๋ฐ˜๋ง/์กด๋Œ“๋ง/์ด๋ชจํ‹ฐ์ฝ˜/์ถ•์•ฝ์–ด ๋ผ๋ฒจ ํฌํ•จ) ์šฐ์„ .\n", + "- PIPA: ์ฃผ๋ฏผ๋ฒˆํ˜ธ/์—ฐ๋ฝ์ฒ˜/์ด๋ฉ”์ผ/๊ณ„์ขŒ ๋“ฑ ๊ฐœ์ธ์ •๋ณด๋Š” **ํ›ˆ๋ จ ์ „/๋กœ๊ทธ ์ „** ์Šคํฌ๋Ÿฌ๋น™.\n", + "\n", + "**EN** \n", + "- Prefer public KR benchmarks (topic classification / summarization / QA) and **allowed news API metadata** for style calibration.\n", + "- Avoid mass training on news full texts due to license/ToS constraints; use metadata + open corpora.\n", + "- For chat, use lawful open corpora with tone/emoji/informalโ€‘formal annotations.\n", + "- Scrub PII (phone, RRNs, emails, accounts) before training/logging." 
+ ] + }, + { + "cell_type": "markdown", + "id": "9b918411", + "metadata": {}, + "source": [ + "## 5) ์ƒ˜ํ”Œ ๋ฐ์ดํ„ฐ ์ƒ์„ฑ ยท Create Sample Data" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "id": "18db10a6", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Created: data/news.jsonl, data/chat.jsonl\n" + ] + } + ], + "source": [ + "import json, pathlib\n", + "pathlib.Path(\"data\").mkdir(exist_ok=True)\n", + "\n", + "news_samples = [\n", + " {\"style\":\"news_lead\",\"topic\":\"๊ฒฝ์ œ\",\"title\":\"๋ฐ˜๋„์ฒด ์ˆ˜์ถœ ํ˜ธ์กฐโ€ฆ 7์›” ์ˆ˜์ถœ์•ก 20% ์ฆ๊ฐ€\",\"summary\":\"์ˆ˜์ถœ ๊ฐœ์„ ์„ธ๊ฐ€ ์ด์–ด์ง€๋ฉฐ ๊ฒฝ๊ธฐ ํšŒ๋ณต ๊ธฐ๋Œ€๊ฐ€ ์ปค์กŒ๋‹ค.\"},\n", + " {\"style\":\"news_headline\",\"topic\":\"์ •์น˜\",\"title\":\"๊ตญํšŒ, ๋ฐ์ดํ„ฐ ์‚ฐ์—… ์œก์„ฑ๋ฒ• ๋ณธํšŒ์˜ ํ†ต๊ณผ\",\"summary\":\"๋ฐ์ดํ„ฐ ํ™œ์šฉ ์ด‰์ง„๊ณผ ๊ฐœ์ธ์ •๋ณด ๋ณดํ˜ธ๋ฅผ ๊ฐ•ํ™”ํ•˜๋Š” ๋‚ด์šฉ.\"},\n", + " {\n", + " \"style\": \"news_lead\",\n", + " \"topic\": \"๊ฒฝ์ œ\",\n", + " \"title\": \"์นด์นด์˜คํŽ˜์ด ๋ณด์•ˆ ์ ๊ฒ€โ€ฆ ๊ณ ๊ฐ๋ฌธ์˜: help+vip@corp.co.kr\",\n", + " \"summary\": \"๊ณ ๊ฐ์„ผํ„ฐ 010-1234-5678๋กœ ๋ฌธ์˜ ํญ์ฃผ. ๊ณ„์ขŒ 110-123-456789 ๊ด€๋ จ ๊ฒฐ์ œ ์˜ค๋ฅ˜ ๋…ผ๋ž€.\"\n", + " },\n", + " {\n", + " \"style\": \"news_headline\",\n", + " \"topic\": \"์‚ฌํšŒ\",\n", + " \"title\": \"๊ฐœ์ธ์ •๋ณด ์œ ์ถœ ์˜ํ˜นโ€ฆ ์ฃผ๋ฏผ๋ฒˆํ˜ธ 901010-1234567 ์œ ํ†ต ์ฃผ์žฅ\",\n", + " \"summary\": \"์„œ์šธํŠน๋ณ„์‹œ ๊ฐ•๋‚จ๊ตฌ ํ…Œํ—ค๋ž€๋กœ 123์—์„œ ์ž๋ฃŒ ํ™•๋ณดโ€ฆ ๋‹ด๋‹น์ž john.doe+news@example.com\"\n", + " }\n", + "]\n", + "\n", + "chat_samples = [\n", + " {\"style\":\"kakao_casual\",\"dialog\":[\"์ฃผ๋ง์— ๋น„ ์˜จ๋Œ€?\",\"์‘ ์ผ์š”์ผ์— ๊ฝค ์˜จ๋‹ค๋”๋ผ โ˜”\",\"ํ— ์šฐ์‚ฐ ์ฑ™๊ฒจ์•ผ๊ฒ ๋‹ค\"]},\n", + " {\"style\":\"kakao_formal\",\"dialog\":[\"์•ˆ๋…•ํ•˜์„ธ์š”. ๋ฐฐ์†ก ์ผ์ • ํ™•์ธ ๋ถ€ํƒ๋“œ๋ฆฝ๋‹ˆ๋‹ค.\",\"๋‚ด์ผ ์ค‘ ๋„์ฐฉ ์˜ˆ์ •์ž…๋‹ˆ๋‹ค.\",\"์•ˆ๋‚ด ๊ฐ์‚ฌํ•ฉ๋‹ˆ๋‹ค.\"]},\n", + " {\n", + " \"style\": \"kakao_formal\",\n", + " \"dialog\": [\n", + " \"๋ฐฐ์†ก ํ™•์ธ ๋ถ€ํƒ๋“œ๋ฆฝ๋‹ˆ๋‹ค. ์ฃผ๋ฌธ๋ฒˆํ˜ธ ORD-2025-0001 ์ž…๋‹ˆ๋‹ค.\",\n", + " \"์—ฐ๋ฝ์ฒ˜๋Š” 010-2222-3333 ์ž…๋‹ˆ๋‹ค. 
(์œ ๋‹ˆ์ฝ”๋“œ ํ•˜์ดํ”ˆ)\",\n", + " \"์ฃผ๋ฏผ๋“ฑ๋ก๋ฒˆํ˜ธ๋Š” ์ œ๊ณตํ•  ์ˆ˜ ์—†์Šต๋‹ˆ๋‹ค.\"\n", + " ]\n", + " }\n", + "]\n", + "\n", + "with open(\"data/news.jsonl\",\"w\",encoding=\"utf-8\") as f:\n", + " for ex in news_samples: f.write(json.dumps(ex, ensure_ascii=False)+\"\\n\")\n", + "with open(\"data/chat.jsonl\",\"w\",encoding=\"utf-8\") as f:\n", + " for ex in chat_samples: f.write(json.dumps(ex, ensure_ascii=False)+\"\\n\")\n", + "\n", + "print(\"Created: data/news.jsonl, data/chat.jsonl\")" + ] + }, + { + "cell_type": "markdown", + "id": "4f1eaa27", + "metadata": {}, + "source": [ + "## 6) ์ „์ฒ˜๋ฆฌ(PIPA) & ์Šคํƒ€์ผ ๋ผ๋ฒจ ยท PII Scrubbing & Style Tags" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "id": "430c1b68", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "data/news.jsonl -> data/news_clean.jsonl | rows: 4, redacted_rows: 2, hits: {'[EMAIL]': 2, '[ACCOUNT]': 1, '[RRN]': 1, '[CITY]': 1}\n", + "data/chat.jsonl -> data/chat_clean.jsonl | rows: 3, redacted_rows: 1, hits: {'[PHONE]': 1}\n" + ] + } + ], + "source": [ + "# Step 6 โ€” PII scrubbing + style tags (no Harmony here)\n", + "import json, re, unicodedata\n", + "from pathlib import Path\n", + "\n", + "# --- Normalization helpers ---\n", + "HYPHENS = dict.fromkeys(map(ord, \"โ€-โ€’โ€“โ€”โ€•๏น˜๏นฃ๏ผ\"), ord(\"-\")) # map unicode hyphens โ†’ ASCII\n", + "def normalize(s: str) -> str:\n", + " if not isinstance(s, str): return s\n", + " s = unicodedata.normalize(\"NFKC\", s)\n", + " s = s.translate(HYPHENS)\n", + " return s\n", + "\n", + "# --- PII patterns (illustrative; tune for production) ---\n", + "RE_EMAIL = re.compile(r\"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}\")\n", + "# KR mobile numbers with spaces/hyphens: 010-1234-5678, 010 1234 5678, etc.\n", + "RE_PHONE = re.compile(r\"\\b01[016789][-\\s]?\\d{3,4}[-\\s]?\\d{4}\\b\")\n", + "# Korean RRN (์ฃผ๋ฏผ๋“ฑ๋ก๋ฒˆํ˜ธ) basic pattern\n", + "RE_RRN = re.compile(r\"\\b\\d{6}-\\d{7}\\b\")\n", + "# Bank-ish account numbers: strictly digits in groups (avoid codes with letters)\n", + "RE_ACCOUNT = re.compile(r\"\\b\\d{2,3}-\\d{2,4}-\\d{3,6}\\b\")\n", + "# Very simple postal address cue (city names) โ€“ conservative, just redact the token (optional)\n", + "RE_CITY = re.compile(r\"(์„œ์šธํŠน๋ณ„์‹œ|๋ถ€์‚ฐ๊ด‘์—ญ์‹œ|๋Œ€๊ตฌ๊ด‘์—ญ์‹œ|์ธ์ฒœ๊ด‘์—ญ์‹œ|๊ด‘์ฃผ๊ด‘์—ญ์‹œ|๋Œ€์ „๊ด‘์—ญ์‹œ|์šธ์‚ฐ๊ด‘์—ญ์‹œ|์„ธ์ข…ํŠน๋ณ„์ž์น˜์‹œ|๊ฒฝ๊ธฐ๋„|๊ฐ•์›๋„|์ถฉ์ฒญ๋ถ๋„|์ถฉ์ฒญ๋‚จ๋„|์ „๋ผ๋ถ๋„|์ „๋ผ๋‚จ๋„|๊ฒฝ์ƒ๋ถ๋„|๊ฒฝ์ƒ๋‚จ๋„|์ œ์ฃผํŠน๋ณ„์ž์น˜๋„)\")\n", + "\n", + "# Allowlist: things that look like PII but arenโ€™t (e.g., bill/order codes w/ letters)\n", + "def looks_like_code(s: str) -> bool:\n", + " return bool(re.search(r\"[A-Za-z]\", s)) # if letters present, treat as code, not account/phone\n", + "\n", + "# Order of application matters (longest/most specific first sometimes helps)\n", + "SCRUBBERS = [\n", + " (\"[RRN]\", RE_RRN),\n", + " (\"[EMAIL]\", RE_EMAIL),\n", + " (\"[PHONE]\", RE_PHONE),\n", + " (\"[ACCOUNT]\", RE_ACCOUNT),\n", + " (\"[CITY]\", RE_CITY), # optional; comment out if you don't want to redact city tokens\n", + "]\n", + "\n", + "def scrub_text(text: str) -> tuple[str, dict]:\n", + " \"\"\"Return (scrubbed_text, hits_dict). 
Avoid false positives with basic allowlisting.\"\"\"\n",
+    "    if not isinstance(text, str) or not text:\n",
+    "        return text, {}\n",
+    "    text = normalize(text)\n",
+    "    hits = {}\n",
+    "\n",
+    "    # Apply scrubs; skip matches that look like letter-bearing codes.\n",
+    "    # (The digit-only PHONE/ACCOUNT patterns can never contain letters, so codes\n",
+    "    #  such as ORD-2025-0001 are already excluded by the regexes; the check below\n",
+    "    #  is a cheap defensive guard in case you loosen those patterns later.)\n",
+    "    for label, pattern in SCRUBBERS:\n",
+    "        out = []\n",
+    "        last = 0\n",
+    "        count = 0\n",
+    "        for m in pattern.finditer(text):\n",
+    "            if pattern in (RE_ACCOUNT, RE_PHONE) and looks_like_code(m.group(0)):\n",
+    "                continue\n",
+    "            span = m.span()\n",
+    "            out.append(text[last:span[0]])\n",
+    "            out.append(label)\n",
+    "            last = span[1]\n",
+    "            count += 1\n",
+    "        out.append(text[last:])\n",
+    "        text = \"\".join(out)\n",
+    "        if count:\n",
+    "            hits[label] = hits.get(label, 0) + count\n",
+    "\n",
+    "    return text, hits\n",
+    "\n",
+    "def scrub_record(rec: dict, kind: str) -> tuple[dict, dict]:\n",
+    "    \"\"\"Scrub fields in a news/chat record; return (new_rec, hits).\"\"\"\n",
+    "    rec = dict(rec)  # shallow copy\n",
+    "    total_hits = {}\n",
+    "\n",
+    "    def scrub_field(key):\n",
+    "        val = rec.get(key)\n",
+    "        new, hits = scrub_text(val) if isinstance(val, str) else (val, {})\n",
+    "        rec[key] = new\n",
+    "        for k, v in hits.items():\n",
+    "            total_hits[k] = total_hits.get(k, 0) + v\n",
+    "\n",
+    "    if kind == \"news\":\n",
+    "        for key in (\"title\", \"summary\", \"topic\"):\n",
+    "            scrub_field(key)\n",
+    "    elif kind == \"chat\":\n",
+    "        scrub_field(\"style\")\n",
+    "        if isinstance(rec.get(\"dialog\"), list):\n",
+    "            cleaned_dialog = []\n",
+    "            for turn in rec[\"dialog\"]:\n",
+    "                new, hits = scrub_text(turn) if isinstance(turn, str) else (turn, {})\n",
+    "                cleaned_dialog.append(new)\n",
+    "                for k, v in hits.items():\n",
+    "                    total_hits[k] = total_hits.get(k, 0) + v\n",
+    "            rec[\"dialog\"] = cleaned_dialog\n",
+    "\n",
+    "    return rec, total_hits\n",
+    "\n",
+    "# --- Style tagger (lightweight labels for later routing/metrics) ---\n",
+    "def build_style_tags(rec: dict, kind: str) -> list[str]:\n",
+    "    tags = []\n",
+    "    if kind == \"news\":\n",
+    "        tags.append(\"domain:\" + (rec.get(\"topic\") or \"unknown\"))\n",
+    "        tags.append(\"style:\" + (rec.get(\"style\") or \"news\"))\n",
+    "        tags.append(\"tone:formal\")\n",
+    "        tags.append(\"medium:news\")\n",
+    "    elif kind == \"chat\":\n",
+    "        style = (rec.get(\"style\") or \"\").lower()\n",
+    "        tags.append(\"style:\" + (style or \"chat\"))\n",
+    "        tags.append(\"tone:\" + (\"formal\" if \"formal\" in style else \"casual\"))\n",
+    "        tags.append(\"medium:kakao\")\n",
+    "    return [t.replace(\" \", \"_\") for t in tags]\n",
+    "\n",
+    "# --- Process files ---\n",
+    "def process_file(src: str, dst: str, kind: str):\n",
+    "    total = 0\n",
+    "    redacted = 0\n",
+    "    counters = {}\n",
+    "    with open(src, encoding=\"utf-8\") as fin, open(dst, \"w\", encoding=\"utf-8\") as fout:\n",
+    "        for line in fin:\n",
+    "            if not line.strip(): continue\n",
+    "            rec = json.loads(line)\n",
+    "            total += 1\n",
+    "            cleaned, hits = scrub_record(rec, kind)\n",
+    "            cleaned[\"style_tags\"] = build_style_tags(cleaned, kind)\n",
+    "            cleaned[\"_pii_hits\"] = hits  # keep for inspection; drop later if you want\n",
+    "            if hits: redacted += 1\n",
+    "            for k, v in hits.items():\n",
+    "                counters[k] = counters.get(k, 0) + v\n",
+    "            
fout.write(json.dumps(cleaned, ensure_ascii=False) + \"\\n\")\n", + " print(f\"{src} -> {dst} | rows: {total}, redacted_rows: {redacted}, hits: {counters}\")\n", + "\n", + "process_file(\"data/news.jsonl\", \"data/news_clean.jsonl\", kind=\"news\")\n", + "process_file(\"data/chat.jsonl\", \"data/chat_clean.jsonl\", kind=\"chat\")" + ] + }, + { + "cell_type": "markdown", + "id": "6ac01dca", + "metadata": {}, + "source": [ + "## 7) ๋ฐ์ดํ„ฐ ๋กœ๋”ฉ/ํฌ๋งทํŒ… ยท Load & Format" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "id": "9cd825e3", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Created: data/news_harmony.jsonl data/chat_harmony.jsonl\n" + ] + }, + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "6f769d524f424ed5a11781a157cfa796", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "Generating news split: 0 examples [00:00, ? examples/s]" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "af2e4dc971884747a719d500caf52722", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "Generating chat split: 0 examples [00:00, ? examples/s]" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{'train': 3, 'validation': 4}\n" + ] + } + ], + "source": [ + "# Step 7 โ€” Harmony conversion + dataset loading & tokenization\n", + "import json, math\n", + "from pathlib import Path\n", + "from datasets import load_dataset, Dataset, concatenate_datasets\n", + "from transformers import AutoTokenizer\n", + "\n", + "DATA = Path(\"data\")\n", + "assert (DATA / \"news_clean.jsonl\").exists(), \"Run Step 6 first\"\n", + "assert (DATA / \"chat_clean.jsonl\").exists(), \"Run Step 6 first\"\n", + "\n", + "# ---------- 7A) Convert cleaned โ†’ Harmony messages ----------\n", + "\n", + "def news_to_messages(rec):\n", + " # system style from Step 6 tags; default to KR news tone\n", + " system = \"ํ•œ๊ตญ ๋‰ด์Šค ๋ฌธ์ฒด๋กœ ๊ฐ„๊ฒฐํ•˜๊ณ  ์‚ฌ์‹ค ์œ„์ฃผ๋กœ ์ž‘์„ฑ.\"\n", + " # user asks for a headline+lead from topic; assistant is the expected formatted answer\n", + " user = f\"์ฃผ์ œ: {rec.get('topic','์•Œ์ˆ˜์—†์Œ')}. ๊ธฐ์‚ฌ ์ œ๋ชฉ๊ณผ ์š”์•ฝ์„ ์ƒ์„ฑํ•ด์ค˜.\"\n", + " assistant = f\"{rec.get('title','')} โ€” {rec.get('summary','')}\"\n", + " return [{\"role\":\"system\",\"content\":system},\n", + " {\"role\":\"user\",\"content\":user},\n", + " {\"role\":\"assistant\",\"content\":assistant}]\n", + "\n", + "def chat_to_messages(rec):\n", + " # Keep style hint (casual/formal) in system\n", + " style = (rec.get(\"style\") or \"\").lower()\n", + " system = f\"์นด์นด์˜คํ†ก ๋Œ€ํ™” ์Šคํƒ€์ผ. 
style={style or 'chat'}\"\n", + " dialog = rec.get(\"dialog\") or []\n", + " msgs = [{\"role\":\"system\",\"content\":system}]\n", + " # Alternate user/assistant turns; if odd length, last user stays without assistant label\n", + " roles = [\"user\",\"assistant\"]\n", + " for i, turn in enumerate(dialog[:6]): # cap tiny demos to avoid runaway\n", + " msgs.append({\"role\": roles[i % 2], \"content\": str(turn)})\n", + " # Ensure there is at least one assistant turn for SFT\n", + " if not any(m[\"role\"]==\"assistant\" for m in msgs):\n", + " msgs.append({\"role\":\"assistant\",\"content\":\"๋„ค, ํ™•์ธํ–ˆ์Šต๋‹ˆ๋‹ค.\"})\n", + " return msgs\n", + "\n", + "def write_harmony(src, dst, kind):\n", + " convert = news_to_messages if kind==\"news\" else chat_to_messages\n", + " with open(src, encoding=\"utf-8\") as fin, open(dst, \"w\", encoding=\"utf-8\") as fout:\n", + " for line in fin:\n", + " if not line.strip(): continue\n", + " rec = json.loads(line)\n", + " msgs = convert(rec)\n", + " fout.write(json.dumps({\"messages\": msgs}, ensure_ascii=False) + \"\\n\")\n", + "\n", + "write_harmony(DATA/\"news_clean.jsonl\", DATA/\"news_harmony.jsonl\", \"news\")\n", + "write_harmony(DATA/\"chat_clean.jsonl\", DATA/\"chat_harmony.jsonl\", \"chat\")\n", + "print(\"Created:\", DATA/\"news_harmony.jsonl\", DATA/\"chat_harmony.jsonl\")\n", + "\n", + "# ---------- 7B) Load Harmony JSONL with ๐Ÿค— Datasets ----------\n", + "raw = load_dataset(\n", + " \"json\",\n", + " data_files={\"news\": str(DATA/\"news_harmony.jsonl\"),\n", + " \"chat\": str(DATA/\"chat_harmony.jsonl\")}\n", + ")\n", + "\n", + "# Mix train split using your Step-2 mix ratios\n", + "news = raw[\"news\"]\n", + "chat = raw[\"chat\"]\n", + "\n", + "def take_portion(ds, frac):\n", + " n = max(1, int(round(len(ds) * frac)))\n", + " return ds.select(range(n)) if n < len(ds) else ds\n", + "\n", + "news_part = take_portion(news, MIX_NEWS if 'MIX_NEWS' in globals() else 0.5)\n", + "chat_part = take_portion(chat, MIX_CHAT if 'MIX_CHAT' in globals() else 0.5)\n", + "train_ds = concatenate_datasets([news_part, chat_part]).shuffle(seed=42)\n", + "\n", + "# Tiny validation built from remaining examples (if any)\n", + "remaining_news = news.select(range(len(news_part), len(news))) if len(news) > len(news_part) else news_part\n", + "remaining_chat = chat.select(range(len(chat_part), len(chat))) if len(chat) > len(chat_part) else chat_part\n", + "val_candidates = concatenate_datasets([remaining_news, remaining_chat])\n", + "val_ds = val_candidates.shuffle(seed=43).select(range(min(64, len(val_candidates)))) if len(val_candidates) else train_ds.select(range(min(32, len(train_ds))))\n", + "\n", + "dataset = {\"train\": train_ds, \"validation\": val_ds}\n", + "print({k: len(v) for k, v in dataset.items()})\n" + ] + }, + { + "cell_type": "markdown", + "id": "c95c9122", + "metadata": {}, + "source": [ + "## 8) ๋ชจ๋ธ/ํ† ํฌ๋‚˜์ด์ € ๋กœ๋“œ ยท Load Model & Tokenizer" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "id": "db67b6b3", + "metadata": {}, + "outputs": [ + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "1cfc411479e145e4b5b161df311d4b13", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "tokenizer_config.json: 0.00B [00:00, ?B/s]" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "ebea3ddd62e340cc83e2a484a04e3e89", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + 
"tokenizer_config.json: 0.00B [00:00, ?B/s]" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "330fd60c5e1248998f0f5bc8c394b2ce", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "tokenizer.json: 0%| | 0.00/27.9M [00:00\n", + "{{ m['content'] }}<|end|>\n", + "{%- elif m['role'] == 'user' -%}<|user|>\n", + "{{ m['content'] }}<|end|>\n", + "{%- elif m['role'] == 'assistant' -%}<|assistant|>\n", + "{{ m['content'] }}<|end|>\n", + "{%- endif -%}\n", + "{%- endfor -%}\"\"\"\n", + "\n", + "# Ensure pad/eos are sane\n", + "tokenizer.pad_token = tokenizer.eos_token or tokenizer.pad_token\n", + "\n", + "# ---------- 7D) Tokenize with assistant-only labels ----------\n", + "ASST_TOKEN = None\n", + "END_TOKEN = None\n", + "try:\n", + " ASST_TOKEN = tokenizer.convert_tokens_to_ids(\"<|assistant|>\")\n", + " END_TOKEN = tokenizer.convert_tokens_to_ids(\"<|end|>\")\n", + "except Exception:\n", + " # If the base vocab lacks these tokens, it's okay; masking fallback below will still work heuristically\n", + " pass\n", + "\n", + "MAX_LEN = 2048 # you can raise this if you have room\n", + "\n", + "def tokenize_with_labels(example):\n", + " # 1) Render with chat template (includes assistant answer)\n", + " text = tokenizer.apply_chat_template(example[\"messages\"], tokenize=False, add_generation_prompt=False)\n", + " # 2) Tokenize\n", + " enc = tokenizer(text, truncation=True, max_length=MAX_LEN)\n", + " input_ids = enc[\"input_ids\"]\n", + " labels = [-100] * len(input_ids)\n", + "\n", + " # 3) Label only assistant content\n", + " if ASST_TOKEN is not None and END_TOKEN is not None:\n", + " start = None\n", + " for i, tid in enumerate(input_ids):\n", + " if tid == ASST_TOKEN:\n", + " start = i + 1 # learn after the tag\n", + " elif start is not None and tid == END_TOKEN:\n", + " start = None\n", + " elif start is not None:\n", + " labels[i] = input_ids[i]\n", + " else:\n", + " # Heuristic fallback: learn on the last third of tokens (crude but avoids total silence)\n", + " start = int(len(input_ids) * 0.66)\n", + " for i in range(start, len(input_ids)):\n", + " labels[i] = input_ids[i]\n", + "\n", + " return {\"input_ids\": input_ids, \"attention_mask\": enc[\"attention_mask\"], \"labels\": labels}\n", + "\n", + "tokenized_train = dataset[\"train\"].map(tokenize_with_labels, remove_columns=[\"messages\"])\n", + "tokenized_val = dataset[\"validation\"].map(tokenize_with_labels, remove_columns=[\"messages\"])\n", + "\n", + "print(\"Tokenization done.\",\n", + " \"train:\", len(tokenized_train),\n", + " \"val:\", len(tokenized_val),\n", + " \"example lens:\", tokenized_train[0][\"input_ids\"][:12], \"...\")" + ] + }, + { + "cell_type": "markdown", + "id": "f67dd4ef", + "metadata": {}, + "source": [ + "## 9) Fineโ€‘Tuning (LoRA/QLoRA) ยท ์„ธ๋ฐ€ ํŠœ๋‹\n", + "### 9a) Data curation & splits\n", + "_(See Section 7/8 for dataset prep; move relevant snippets here if needed.)_\n", + "### 9b) Hyperparameters (r/alpha/dropout)\n", + "```python\n", + "# Example LoRA hyperparameters\n", + "LORA_R = 8\n", + "LORA_ALPHA = 16\n", + "LORA_DROPOUT = 0.05\n", + "```\n", + "\n", + "### 9c) Merge adapters (BF16)\n", + "```python\n", + "# Example merge step (after training)\n", + "# model = PeftModel.from_pretrained(base_model, adapter_path)\n", + "# merged_model = model.merge_and_unload()\n", + "```\n", + "\n", + "### 9d) Save merged BF16 (`save_pretrained`)\n", + "```python\n", + "# 
merged_model.save_pretrained(OUTPUT_DIR)\n", + "```\n" + ] + }, + { + "cell_type": "markdown", + "id": "c9157315", + "metadata": {}, + "source": [ + "### 9e) Export & Quantize (BF16 โ†’ MXFP4) ยท ๋‚ด๋ณด๋‚ด๊ธฐ & ์–‘์žํ™”\n", + "\n", + "**EN (neutral, framework-agnostic):** \n", + "Public libraries currently do **not** support training/fineโ€‘tuning *directly* in MXFP4. The common pipeline is:\n", + "1) **Train/SFT** in **BF16** (or **QLoRA 4โ€‘bit nf4**). \n", + "2) **Merge LoRA adapters** into the base model (BF16). \n", + "3) **Save** the merged BF16 checkpoint with `save_pretrained()`. \n", + "4) **Postโ€‘training quantize** the merged BF16 tensors to **MXFP4** using a **vendor/toolchainโ€‘provided packer**. \n", + "5) **Save/export** the MXFP4 artifact (same shape as Hugging Face `save_pretrained()` output) for deployment/serving.\n", + "\n", + "> Notes: \n", + "> - If your serving stack supports **LoRA at inference**, you may skip merging and quantization and ship: **base (MXFP4 or BF16) + LoRA adapters**. \n", + "> - If your runtime requires **merged MXFP4**, you must run a **BF16 โ†’ MXFP4** quantization step after merging adapters. \n", + "> - Keep **tokenizer/config** files aligned across BF16 and MXFP4 exports.\n", + "\n", + "**KR (์ค‘๋ฆฝ์ , ๋„๊ตฌ ๋น„์˜์กด):** \n", + "ํ˜„์žฌ ๊ณต๊ฐœ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋Š” MXFP4์—์„œ **์ง์ ‘ ํ•™์Šต/ํŒŒ์ธํŠœ๋‹์„ ์ง€์›ํ•˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค**. ์ผ๋ฐ˜์ ์ธ ํŒŒ์ดํ”„๋ผ์ธ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค: \n", + "1) **BF16**(๋˜๋Š” **QLoRA 4โ€‘bit nf4**)๋กœ **ํ•™์Šต/ํŒŒ์ธํŠœ๋‹** \n", + "2) **LoRA ์–ด๋Œ‘ํ„ฐ ๋ณ‘ํ•ฉ**(BF16 ๊ธฐ์ค€) \n", + "3) `save_pretrained()`๋กœ **๋ณ‘ํ•ฉ๋œ BF16 ์ฒดํฌํฌ์ธํŠธ ์ €์žฅ** \n", + "4) ๋ฒค๋”/ํˆด์ฒด์ธ์—์„œ ์ œ๊ณตํ•˜๋Š” **์–‘์žํ™” ๋„๊ตฌ**๋กœ **BF16 โ†’ MXFP4 ์‚ฌํ›„ ์–‘์žํ™”** \n", + "5) ๋ฐฐํฌ/์„œ๋น™์šฉ **MXFP4 ์•„ํ‹ฐํŒฉํŠธ ์ €์žฅ/๋‚ด๋ณด๋‚ด๊ธฐ** (Hugging Face `save_pretrained()` ๊ตฌ์กฐ์™€ ๋™์ผ)\n", + "\n", + "> ์ฐธ๊ณ : \n", + "> - **์„œ๋น™์—์„œ LoRA๋ฅผ ์ง€์›**ํ•œ๋‹ค๋ฉด, ๋ณ‘ํ•ฉยท์–‘์žํ™”๋ฅผ ์ƒ๋žตํ•˜๊ณ  **๊ธฐ์ €( MXFP4 ๋˜๋Š” BF16 ) + LoRA ์–ด๋Œ‘ํ„ฐ**๋กœ ์ œ๊ณตํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. \n", + "> - **๋ณ‘ํ•ฉ๋œ MXFP4**๊ฐ€ ํ•„์š”ํ•œ ๋Ÿฐํƒ€์ž„์˜ ๊ฒฝ์šฐ, ์–ด๋Œ‘ํ„ฐ ๋ณ‘ํ•ฉ ํ›„ **BF16 โ†’ MXFP4 ์žฌ์–‘์žํ™”** ๋‹จ๊ณ„๊ฐ€ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค. \n", + "> - **tokenizer/config** ํŒŒ์ผ์€ BF16๊ณผ MXFP4 ์•„ํ‹ฐํŒฉํŠธ ๊ฐ„์— ์ผ๊ด€๋˜๊ฒŒ ์œ ์ง€ํ•˜์„ธ์š”.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "id": "48a5cbc9", + "metadata": { + "scrolled": true + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Fineโ€‘tuning skeleton ready. 
Un-comment on your machine.\n"
+     ]
+    }
+   ],
+   "source": [
+    "from trl import SFTTrainer, SFTConfig\n",
+    "from peft import LoraConfig, get_peft_model\n",
+    "\n",
+    "lora_cfg = LoraConfig(\n",
+    "    task_type=\"CAUSAL_LM\",\n",
+    "    r=LORA_R, lora_alpha=LORA_ALPHA, lora_dropout=LORA_DROPOUT,\n",
+    "    target_modules=TARGET_MODULES\n",
+    ")\n",
+    "\n",
+    "# base_model = get_peft_model(base_model, lora_cfg)\n",
+    "\n",
+    "sft_args = SFTConfig(\n",
+    "    output_dir=OUTPUT_DIR,\n",
+    "    num_train_epochs=EPOCHS,\n",
+    "    per_device_train_batch_size=PER_DEVICE_BS,\n",
+    "    gradient_accumulation_steps=GRAD_ACCUM,\n",
+    "    learning_rate=LEARNING_RATE,\n",
+    "    lr_scheduler_type=\"cosine\",\n",
+    "    bf16=BF16,\n",
+    "    logging_steps=LOG_STEPS,\n",
+    "    save_steps=SAVE_STEPS,\n",
+    "    save_total_limit=SAVE_TOTAL_LIMIT\n",
+    ")\n",
+    "\n",
+    "# trainer = SFTTrainer(model=base_model, args=sft_args,\n",
+    "#                      train_dataset=tokenized_train, eval_dataset=tokenized_val,\n",
+    "#                      processing_class=tokenizer)  # older trl versions use tokenizer= instead\n",
+    "# trainer.train()\n",
+    "# trainer.save_model(OUTPUT_DIR)\n",
+    "print(\"Fine-tuning skeleton ready. Un-comment on your machine.\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "490798f2",
+   "metadata": {},
+   "source": [
+    "## 10) 평가(뉴스/대화) · Evaluation (News/Chat)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "d1bdafe4",
+   "metadata": {},
+   "source": [
+    "**KR 지표 · KR Metrics** \n",
+    "- 뉴스성: 주제 분류 적합도(F1), 요약 품질(ROUGE-1/2/L), 독해 QA(EM/F1). \n",
+    "- 대화성: 자연성/맥락 유지, 경어/반말 전환 정확도, 이모티콘/축약어 적절성.\n",
+    "\n",
+    "**EN Notes** \n",
+    "- Use public KR benchmarks (e.g., topic classification, KorQuAD-like QA) where licenses permit.\n",
+    "- Mix automatic metrics (F1/ROUGE) with human eval for tone & politeness."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 20,
+   "id": "971b8dbd",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Eval stubs ready.\n"
+     ]
+    }
+   ],
+   "source": [
+    "# Example helpers (stub)\n",
+    "def simple_accuracy(preds, labels):\n",
+    "    return sum(int(p==g) for p,g in zip(preds, labels)) / max(1, len(labels))\n",
+    "\n",
+    "# For ROUGE:\n",
+    "# import evaluate\n",
+    "# rouge = evaluate.load(\"rouge\")\n",
+    "# result = rouge.compute(predictions=pred_texts, references=ref_texts)\n",
+    "# print(result)\n",
+    "\n",
+    "print(\"Eval stubs ready.\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "e0b5594e",
+   "metadata": {},
+   "source": [
+    "## 11) Inference Prompt Templates · 추론 프롬프트 템플릿"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 25,
+   "id": "1f690452",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.\n",
+      "Knowledge cutoff: 2024-06\n",
+      "Current date: 2025-08-21\n",
+      "\n",
+      "Reasoning: medium\n",
+      "\n",
+      "# Valid channels: analysis, commentary, final. 
Channel must be included for every message.<|end|><|start|>developer<|message|># Instructions\n",
+      "\n",
+      "너는 한국 고객을 돕는 유능한 AI 어시스턴트다.\n",
+      "\n",
+      "<|end|><|start|>user<|message|>국내 PIPA 규정을 준수하면서 사내 문서 요약기를 구성하려면 어떤 아키텍처가 좋을까?<|end|><|start|>assistant\n"
+     ]
+    }
+   ],
+   "source": [
+    "from openai_harmony import (\n",
+    "    Conversation,\n",
+    "    DeveloperContent,\n",
+    "    HarmonyEncodingName,\n",
+    "    Message,\n",
+    "    Role,\n",
+    "    SystemContent,\n",
+    "    load_harmony_encoding,\n",
+    ")\n",
+    "\n",
+    "# Example prompt construction using Harmony\n",
+    "encoding = load_harmony_encoding(HarmonyEncodingName.HARMONY_GPT_OSS)\n",
+    "\n",
+    "convo = Conversation.from_messages([\n",
+    "    Message.from_role_and_content(Role.SYSTEM, SystemContent.new()),\n",
+    "    Message.from_role_and_content(\n",
+    "        Role.DEVELOPER,\n",
+    "        DeveloperContent.new().with_instructions(\"너는 한국 고객을 돕는 유능한 AI 어시스턴트다.\"),\n",
+    "    ),\n",
+    "    Message.from_role_and_content(Role.USER, \"국내 PIPA 규정을 준수하면서 사내 문서 요약기를 구성하려면 어떤 아키텍처가 좋을까?\"),\n",
+    "])\n",
+    "\n",
+    "tokens = encoding.render_conversation_for_completion(convo, Role.ASSISTANT)\n",
+    "print(encoding.decode(tokens))  # For preview; pass `tokens` to the model when running inference\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "5216d049",
+   "metadata": {},
+   "source": [
+    "## 12) 최신성 유지 · Freshness Strategy"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "452decd1",
+   "metadata": {},
+   "source": [
+    "- **주간 보정 SFT**: 허용된 뉴스 API **메타데이터(제목/요약/섹션)** 샘플링 → 스타일 보정. \n",
+    "- **대화체 업데이트**: 최신 축약어/신조어/이모티콘 사전 반영(예: ㄱㄱ, ㅇㅋ, ㅋㅋ, ㄹㅇ). \n",
+    "- **회귀 평가**: 동일 지표로 before/after 비교 → 혼합비/온도/패널티 튜닝.\n",
+    "\n",
+    "- Weekly calibration SFT using **allowed news API metadata** for style; \n",
+    "- Update slang/emoji lexicons; \n",
+    "- Regression evals to track drift and adjust data mix/decoding."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "718b9f2a",
+   "metadata": {},
+   "source": [
+    "## 13) 안전/컴플라이언스 · Safety & Compliance"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "61ad24ef",
+   "metadata": {},
+   "source": [
+    "- 데이터 출처/라이선스 확인(벤치마크, API, 내부 데이터) · Verify dataset/API licenses.\n",
+    "- 개인정보 스크러빙(훈련/로그/평가 전) · Scrub PII before training/logging/eval.\n",
+    "- 저작권/약관 준수(기사 **원문 대량 재학습 금지**) · Avoid mass training on full news articles.\n",
+    "- 출력 검증(스키마/금칙어/민감도 규칙) · Output validation & forbidden-term filters.\n",
+    "- 버전/평가 리포트 관리 · Version datasets/models and keep eval reports."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "5cb8464b",
+   "metadata": {},
+   "source": [
+    "## 14) 문제해결 & 다음 단계 · Troubleshooting & Next Steps"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "8ee17077",
+   "metadata": {},
+   "source": [
+    "- 혼합 비율 튜닝: (뉴스:대화) 6:4 → 7:3 또는 5:5로 조정 \n",
+    "- LoRA 하이퍼파라미터: r=8~16, α=16~32, dropout=0.05~0.1 \n",
+    "- 서비스화: vLLM/llama.cpp 서빙 + 토픽/스타일 라우팅 \n",
+    "- RAG 결합: 최신 사실성 보강을 위해 뉴스/문서 인덱스 결합 \n",
+    "- A/B 테스트: 톤/길이/이모티콘 사용량 등 사용자 만족도 측정\n",
+    "\n",
+    "- Tune mix ratios, run A/B tests, consider vLLM serving, and pair with RAG for factuality (see the smoke-test sketch below).",
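+    "\n",
+    "\n",
+    "Once serving, a quick smoke test against the vLLM OpenAI-compatible endpoint from Section 2 (a sketch; it assumes the server is running your merged checkpoint):\n",
+    "\n",
+    "```python\n",
+    "# Query the endpoint configured in Section 2 with a style-tagged prompt\n",
+    "from openai import OpenAI\n",
+    "\n",
+    "client = OpenAI(base_url=BASE_URL, api_key=API_KEY)\n",
+    "resp = client.chat.completions.create(\n",
+    "    model=MODEL,\n",
+    "    messages=[\n",
+    "        {\"role\": \"system\", \"content\": \"style=news_lead\"},  # style tag from Section 0\n",
+    "        {\"role\": \"user\", \"content\": \"반도체 수출 동향을 한 문장 뉴스 리드로 써줘.\"},\n",
+    "    ],\n",
+    "    temperature=0.7,\n",
+    ")\n",
+    "print(resp.choices[0].message.content)\n",
+    "```"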
+ ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.12" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} From 78013e5798ea70d8f89f5f176a399a3719db3aa0 Mon Sep 17 00:00:00 2001 From: heejingithub Date: Tue, 26 Aug 2025 13:08:33 -0400 Subject: [PATCH 2/4] Update and rename gpt_oss_ft_kr_bilingual_clientServerArch_CORPUS_SYNCED_TOC.ipynb to articles/gpt-oss/fine-tune-korean.ipynb --- .../gpt-oss/fine-tune-korean.ipynb | 2 -- 1 file changed, 2 deletions(-) rename gpt_oss_ft_kr_bilingual_clientServerArch_CORPUS_SYNCED_TOC.ipynb => articles/gpt-oss/fine-tune-korean.ipynb (99%) diff --git a/gpt_oss_ft_kr_bilingual_clientServerArch_CORPUS_SYNCED_TOC.ipynb b/articles/gpt-oss/fine-tune-korean.ipynb similarity index 99% rename from gpt_oss_ft_kr_bilingual_clientServerArch_CORPUS_SYNCED_TOC.ipynb rename to articles/gpt-oss/fine-tune-korean.ipynb index b90e73cfbe..14d485d56b 100644 --- a/gpt_oss_ft_kr_bilingual_clientServerArch_CORPUS_SYNCED_TOC.ipynb +++ b/articles/gpt-oss/fine-tune-korean.ipynb @@ -5,8 +5,6 @@ "id": "538f25ce", "metadata": {}, "source": [ - "# ๐Ÿ‡ฐ๐Ÿ‡ท๐Ÿ‡บ๐Ÿ‡ธ Fine-tune gpt-oss for better Korean language performance โ€” **Bilingual (KR ยท EN)**\n", - "August, 2025\n", "\n", "์ด ๋…ธํŠธ๋ถ์€ OpenAI์˜ **gpt-oss (openโ€‘weight)** ๋ชจ๋ธ์„ **ํ•œ๊ตญ ๋‰ด์Šค ๋ฌธ์ฒด + ์ตœ์‹  ๋Œ€ํ™”์ฒด**๋กœ ์„ธ๋ฐ€ ํŠœ๋‹ํ•˜๋Š” ๋ฐฉ๋ฒ•์„\n", "ํ•œ๊ตญ์–ด/์˜์–ด **์ด์ค‘ ์–ธ์–ด**๋กœ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค. \n", From 650b255e0c338e4a5f37793e03bb33a4802fb5bd Mon Sep 17 00:00:00 2001 From: heejingithub Date: Tue, 26 Aug 2025 13:19:14 -0400 Subject: [PATCH 3/4] Update registry.yaml --- registry.yaml | 13 +++++++++++++ 1 file changed, 13 insertions(+) diff --git a/registry.yaml b/registry.yaml index aafb4a938c..6a98a538a7 100644 --- a/registry.yaml +++ b/registry.yaml @@ -4,6 +4,19 @@ # should build pages for, and indicates metadata such as tags, creation date and # authors for each page. +- title: "Fine-tune gpt-oss for better Korean language performance" + path: articles/gpt-oss/fine-tune-korean.ipynb + description: "Guide to fine-tuning an open-weight model for Korean and workflow tips." + authors: + - heejingithub + - danial-openai + - joanneshin-openai + tags: + - gpt-oss + - fine-tuning + - korean + - open-models + - title: Verifying gpt-oss implementations path: articles/gpt-oss/verifying-implementations.md date: 2025-08-11 From 0bd72e3339a82707c6f05fbf4dfad7873e26fffe Mon Sep 17 00:00:00 2001 From: heejingithub Date: Tue, 26 Aug 2025 13:34:51 -0400 Subject: [PATCH 4/4] Update registry.yaml moved invalid tags --- registry.yaml | 2 -- 1 file changed, 2 deletions(-) diff --git a/registry.yaml b/registry.yaml index 6a98a538a7..b462dbd069 100644 --- a/registry.yaml +++ b/registry.yaml @@ -13,8 +13,6 @@ - joanneshin-openai tags: - gpt-oss - - fine-tuning - - korean - open-models - title: Verifying gpt-oss implementations