codefuse-ai · luanzhiwow · Nov 25, 2025
diff --git a/F2LLM/MULTI_MODEL_GUIDE.md b/F2LLM/MULTI_MODEL_GUIDE.md
@@ -0,0 +1,199 @@
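+# F2LLM Multi-Model Support Guide
+
+## Overview
+
+The modified F2LLM supports multiple decoder-only model families, including Qwen, LLaMA, Baichuan, and ChatGLM.
+
+## Supported Models
+
+### Tested Models
+- **Qwen family**: Qwen-7B, Qwen-14B, Qwen3-4B, and others
+- **LLaMA family**: LLaMA-7B, LLaMA2-13B, and others
+- **Baichuan family**: Baichuan-13B, Baichuan2-13B, and others
+- **ChatGLM family**: ChatGLM-6B, ChatGLM2-6B, and others
+
+### Models Expected to Work
+Any decoder-only model that loads through the transformers library should work, including:
+- GPT family
+- CodeT5+ (note: an encoder-decoder architecture, so it may not load as a decoder-only model without changes)
+- CodeGen
+- StarCoder
+- Other custom decoder-only models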
+
+## Usage
+
+### 1. Model Configuration
+
+Set the following fields in `configs/config.json` (other fields omitted):
+
+```json
+{
+  "model_path": "your-model-path",
+  "model_type": "auto",
+  "attn_implementation": "flash_attention_2",
+  "use_flash_attention": true
+}
+```
+
+#### Configuration Reference
+
+- **model_path**: local model path or HuggingFace model name
+- **model_type**: model type, intended for selecting model-specific handling
+  - Accepted values: `auto`, `qwen`, `llama`, `baichuan`, and so on
+- **attn_implementation**: attention implementation
+  - `"flash_attention_2"`: Flash Attention 2 (fastest, but requires hardware and library support)
+  - `"sdpa"`: PyTorch Scaled Dot Product Attention
+  - `null`: do not request a specific attention implementation
+- **use_flash_attention**: whether to apply the `attn_implementation` setting; when `false`, the model is loaded without it
+
+### 2. Get the Training Data
+#### Option 1: Use huggingface-cli
+
+To download with the huggingface-cli command:
+
+```bash
+# Install huggingface-hub
+pip install huggingface-hub
+
+# Download the training data from HuggingFace; if you hit network problems, consider using a mirror
+export HF_ENDPOINT=https://hf-mirror.com
+huggingface-cli download codefuse-ai/F2LLM --repo-type dataset --local-dir training_data --include "*.parquet"
+```
+
+#### Option 2: Manual Download
+
+1. Visit https://huggingface.co/datasets/codefuse-ai/F2LLM
+2. Download the .parquet files manually
+3. Save them to the `training_data/` directory
+
+### 3. Data Preprocessing
+
+Tokenize the data with the general-purpose tokenization script:
+
+```bash
+# Basic usage
+python tokenize_data.py --model_path "meta-llama/Llama-2-7b-hf" --max_seq_length 1023
+
+# All parameters
+python tokenize_data.py \
+    --model_path "baichuan-inc/Baichuan2-13B-Base" \
+    --max_seq_length 1023 \
+    --data_dir "training_data" \
+    --output_dir "data_tokenized" \
+    --num_processes 16
+```
+
+### 4. Training
+
+```bash
+# Single-GPU training
+accelerate launch --config_file configs/accelerate_config.yaml run.py --config configs/config.json
+
+# Multi-GPU training
+accelerate launch --config_file configs/accelerate_config.yaml --num_processes 8 run.py --config configs/config.json
+```
+
+## Model-Specific Configuration
+
+### LLaMA Models
+```json
+{
+  "model_path": "meta-llama/Llama-2-7b-hf",
+  "model_type": "llama",
+  "attn_implementation": "sdpa",
+  "use_flash_attention": true,
+  "max_seq_length": 2048
+}
+```
+
+### Baichuan Models
+```json
+{
+  "model_path": "baichuan-inc/Baichuan2-13B-Base",
+  "model_type": "baichuan", 
+  "attn_implementation": "flash_attention_2",
+  "use_flash_attention": true,
+  "max_seq_length": 2048
+}
+```
+
+### ChatGLM Models
+```json
+{
+  "model_path": "THUDM/chatglm3-6b-base",
+  "model_type": "chatglm",
+  "attn_implementation": null,
+  "use_flash_attention": false,
+  "max_seq_length": 2048
+}
+```
+
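+## Troubleshooting
+
+### Common Issues
+
+1. **Flash Attention is not supported**
+   - Error message: `FlashAttention only supports Ampere GPUs or newer.`
+   - Fix: set `"use_flash_attention": false` or `"attn_implementation": "sdpa"`
+
+2. **Out of memory**
+   - Reduce `train_batch_size`
+   - Reduce `max_seq_length`
+   - Use gradient accumulation
+
+3. **Model fails to load**
+   - Make sure the model path is correct
+   - Check the network connection (for HuggingFace models)
+   - Read the specific error message and adjust the attention configuration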
+
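+### Debugging Tips
+
+1. **Test step by step**
+   ```bash
+   # First test model loading (same loader class as model.py)
+   python -c "from transformers import AutoModelForCausalLM; model = AutoModelForCausalLM.from_pretrained('your-model', trust_remote_code=True)"
+
+   # Then test tokenization
+   python tokenize_data.py --model_path "your-model" --num_processes 1
+   ```
+
+2. **Check the logs**
+   - The modified code emits a warning when a load attempt fails and when a fallback succeeds
+   - Pay attention to these warnings; they usually contain useful fallback information
+
+3. **Performance tuning**
+   - Prefer Flash Attention 2 (if the hardware supports it)
+   - Use SDPA as the second choice
+   - Disable the special attention implementation only as a last resort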
+
+## Performance Comparison
+
+| Model | Attention implementation | Training speed | Memory usage | Compatibility |
+|------|------------|----------|----------|---------|
+| Qwen3-4B | flash_attention_2 | ★★★★★ | ★★★★★ | ★★★★☆ |
+| LLaMA2-7B | sdpa | ★★★★☆ | ★★★★☆ | ★★★★★ |
+| Baichuan2-13B | flash_attention_2 | ★★★★★ | ★★★★☆ | ★★★☆☆ |
+| ChatGLM3-6B | default | ★★★☆☆ | ★★★☆☆ | ★★★★★ |
+
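+## Extending Support
+
+To support a new model type:
+
+1. Add model-specific handling in `model.py`
+2. Add the corresponding model type identifier to the config file
+3. Test and verify compatibility
+
+## Notes
+
+1. **Model license**: make sure you have the right to use the chosen model
+2. **Hardware requirements**: larger models need more GPU memory
+3. **Data format**: make sure the training data format matches what the model expects
+4. **Tokenizer compatibility**: different models may use different tokenizers
+
+## Technical Support
+
+If you run into problems, please provide:
+- Model name and version
+- Full error log
+- Hardware configuration (GPU model, memory, etc.)
+- Config file contents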
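The Flash Attention error listed under troubleshooting can be anticipated before the model is loaded. The sketch below is not part of this PR: it picks an `attn_implementation` value from the GPU compute capability. `pick_attn_implementation` and `detect_attn_implementation` are hypothetical helper names, and the rule assumes that Flash Attention 2 needs compute capability 8.0 (Ampere) or newer plus an installed `flash_attn` package.

```python
import importlib.util


def pick_attn_implementation(capability, flash_attn_installed):
    """Return an attn_implementation value for a (major, minor) compute capability."""
    major, _minor = capability
    # Assumption: Ampere corresponds to compute capability 8.0 and newer.
    if flash_attn_installed and major >= 8:
        return "flash_attention_2"
    return "sdpa"


def detect_attn_implementation():
    """Probe the local machine; torch is imported lazily so the helper above stays testable."""
    import torch

    if not torch.cuda.is_available():
        return "sdpa"
    installed = importlib.util.find_spec("flash_attn") is not None
    return pick_attn_implementation(torch.cuda.get_device_capability(), installed)
```

The result of `detect_attn_implementation()` could be written into the `attn_implementation` field of the config before launching training.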
diff --git a/F2LLM/arguments.py b/F2LLM/arguments.py
@@ -27,6 +27,10 @@ class Args:
     log_interval: int = 20
     checkpointing_steps: int = 100
     validation_steps: int = 100
+    # model configuration
+    model_type: str = "auto"  # auto, qwen, llama, baichuan, etc.
+    attn_implementation: str = "flash_attention_2"  # flash_attention_2, sdpa, None
+    use_flash_attention: bool = True
     # just placeholder, for logging purpose
     num_processes: int=0
 

diff --git a/F2LLM/configs/config.json b/F2LLM/configs/config.json
@@ -1,5 +1,6 @@
 {
   "model_path": "models/qwen3-4b",
+  "model_type": "qwen", 
   "experiment_id": "4b+lr.8e-6+bs.16x32+context.1024+2epochs",
   "train_data_path": "training_data/data_tokenized_qwen",
   "output_dir": "output",
@@ -15,5 +16,7 @@
   "warmup_steps": 500,
   "train_epochs": 2,
   "log_interval": 100,
-  "num_hard_neg": 7
+  "num_hard_neg": 7,
+  "attn_implementation": "flash_attention_2",
+  "use_flash_attention": true
 }
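JSON `null`, which the ChatGLM example and the GPT demo config use for `attn_implementation`, arrives in Python as `None`, a value the plain `str` annotation on the new `Args` field does not describe. A minimal sketch, assuming the config is read with `json` and mapped onto a dataclass; `AttnArgs` and `attn_args_from_json` are illustrative names, not the repository's actual loader.

```python
import json
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class AttnArgs:
    # Illustrative subset of the fields added to Args in this PR.
    model_type: str = "auto"
    attn_implementation: Optional[str] = "flash_attention_2"  # None when JSON says null
    use_flash_attention: bool = True


def attn_args_from_json(text):
    """Parse a JSON config string and keep only the keys AttnArgs knows about."""
    raw = json.loads(text)
    known = {f.name for f in fields(AttnArgs)}
    return AttnArgs(**{key: value for key, value in raw.items() if key in known})


demo_text = '{"model_type": "gpt2", "attn_implementation": null, "use_flash_attention": false, "max_seq_length": 128}'
print(attn_args_from_json(demo_text))
```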
diff --git a/F2LLM/configs/config_gpt_demo.json b/F2LLM/configs/config_gpt_demo.json
@@ -0,0 +1,22 @@
+{
+  "model_path": "microsoft/DialoGPT-medium",
+  "model_type": "gpt2",
+  "experiment_id": "gpt-final-fix",
+  "train_data_path": "data_tokenized/data_tokenized_DialoGPT-medium",
+  "output_dir": "output",
+  "tb_dir": "output/tb",
+  "cache_dir": "cache",
+  "train_batch_size": 1,
+  "checkpointing_steps": 10,
+  "validation_steps": 10,
+  "max_seq_length": 128,
+  "learning_rate": 1e-4,
+  "min_lr": 1e-6,
+  "weight_decay": 0.01,
+  "warmup_steps": 5,
+  "train_epochs": 1,
+  "log_interval": 1,
+  "num_hard_neg": 1,
+  "attn_implementation": null,
+  "use_flash_attention": false
+}
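How the two attention keys combine is easy to misread. The sketch below mirrors the condition used in `F2LLM.__init__` in this PR (`use_flash_attention and attn_implementation`) on a plain dict. `effective_attn_implementation` is a hypothetical name, and the exception-driven fallback to `sdpa` is deliberately left out.

```python
def effective_attn_implementation(cfg):
    """Return the value that would be passed as attn_implementation, or None for the default."""
    # Same defaults as the getattr() calls in model.py.
    attn = cfg.get("attn_implementation", "flash_attention_2")
    use_flash = cfg.get("use_flash_attention", True)
    return attn if (use_flash and attn) else None


gpt_demo = {"attn_implementation": None, "use_flash_attention": False}
llama_example = {"attn_implementation": "sdpa", "use_flash_attention": True}
sdpa_but_disabled = {"attn_implementation": "sdpa", "use_flash_attention": False}

for name, cfg in [("gpt_demo", gpt_demo), ("llama", llama_example), ("disabled", sdpa_but_disabled)]:
    print(name, effective_attn_implementation(cfg))
```

Under this reading, `"use_flash_attention": false` also suppresses an explicit `"sdpa"` setting, so the flag acts as a switch for any configured attention implementation rather than for Flash Attention alone.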
diff --git a/F2LLM/model.py b/F2LLM/model.py
@@ -1,5 +1,6 @@
 import torch
-from transformers import AutoModel, AutoTokenizer
+from transformers import AutoModel, AutoTokenizer, GPT2LMHeadModel, AutoModelForCausalLM
+import warnings
 
 
 class F2LLM:
@@ -12,9 +13,80 @@ def __init__(self,
         self.args = args
         self.dtype = torch.bfloat16
         self.device = None # set after accelerator.prepare
-        self.lm = AutoModel.from_pretrained(model_path, trust_remote_code=True, torch_dtype=self.dtype, attn_implementation='flash_attention_2')
+
+        # Select the attention implementation from the config
+        attn_implementation = getattr(args, 'attn_implementation', 'flash_attention_2') if args else 'flash_attention_2'
+        use_flash_attention = getattr(args, 'use_flash_attention', True) if args else True
+
+        # Try to load the model; multiple decoder-only architectures are supported
+        try:
+            if use_flash_attention and attn_implementation:
+                # Use the configured attention implementation
+                self.lm = AutoModelForCausalLM.from_pretrained(
+                    model_path, 
+                    trust_remote_code=True, 
+                    torch_dtype=self.dtype, 
+                    attn_implementation=attn_implementation
+                )
+            else:
+                # Do not request a specific attention implementation
+                self.lm = AutoModelForCausalLM.from_pretrained(
+                    model_path, 
+                    trust_remote_code=True, 
+                    torch_dtype=self.dtype
+                )
+        except Exception as e:
+            if use_flash_attention and attn_implementation:
+                warnings.warn(f"Failed to load model with {attn_implementation}: {e}. Trying fallback options...")
+
+            # Fallback strategy
+            fallback_options = ['sdpa', None]  # try sdpa first, then no explicit attention implementation
+            loaded = False
+
+            for fallback_attn in fallback_options:
+                try:
+                    if fallback_attn:
+                        self.lm = AutoModelForCausalLM.from_pretrained(
+                            model_path, 
+                            trust_remote_code=True, 
+                            torch_dtype=self.dtype,
+                            attn_implementation=fallback_attn
+                        )
+                    else:
+                        self.lm = AutoModelForCausalLM.from_pretrained(
+                            model_path, 
+                            trust_remote_code=True, 
+                            torch_dtype=self.dtype
+                        )
+                    warnings.warn(f"Successfully loaded model with {fallback_attn or 'default'} attention")
+                    loaded = True
+                    break
+                except Exception as e2:
+                    warnings.warn(f"Failed to load model with {fallback_attn or 'default'} attention: {e2}")
+                    continue
+
+            if not loaded:
+                raise RuntimeError(f"Failed to load model {model_path} with any attention implementation")
+
         self.lm.config.use_cache = False
-        self.tokenizer = AutoTokenizer.from_pretrained(model_path)
+
+        # Load the tokenizer; trust_remote_code lets models with custom tokenizer code load
+        self.tokenizer = AutoTokenizer.from_pretrained(
+            model_path,
+            trust_remote_code=True,
+            padding_side='right'  # right padding, so index seq_len-1 in forward() is the last real token
+        )
+
+        # Make sure the tokenizer has a pad_token
+        if self.tokenizer.pad_token is None:
+            if self.tokenizer.eos_token is not None:
+                self.tokenizer.pad_token = self.tokenizer.eos_token
+            else:
+                # Add a new pad_token
+                self.tokenizer.add_special_tokens({'pad_token': '[PAD]'})
+                # The model embedding matrix must be resized to match
+                self.lm.resize_token_embeddings(len(self.tokenizer))
+
         self.max_seq_length = max_seq_length
 
     def set_device(self):
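The four near-identical `from_pretrained` calls in `__init__` follow one pattern: try the preferred attention implementation, then `sdpa`, then the library default. A minimal sketch of that pattern as a helper, assuming the loader is passed in as a callable. `load_with_fallback` and `fake_loader` are hypothetical names, and unlike the PR the sketch skips duplicate candidates and reports which option succeeded.

```python
import warnings


def load_with_fallback(load_fn, preferred, fallbacks=("sdpa", None)):
    """Call load_fn with each attention option in turn; return (model, option_used)."""
    candidates = []
    for attn in (preferred, *fallbacks):
        if attn not in candidates:  # skip duplicates such as preferred == "sdpa"
            candidates.append(attn)

    errors = []
    for attn in candidates:
        # None means: do not pass attn_implementation at all.
        kwargs = {} if attn is None else {"attn_implementation": attn}
        try:
            return load_fn(**kwargs), attn
        except Exception as exc:  # broad on purpose, as in the PR
            warnings.warn(f"Loading with {attn or 'default'} attention failed: {exc}")
            errors.append(f"{attn or 'default'}: {exc}")
    raise RuntimeError("All attention implementations failed: " + "; ".join(errors))


def fake_loader(**kwargs):
    """Stand-in for a real loader; pretends flash-attn is not installed."""
    if kwargs.get("attn_implementation") == "flash_attention_2":
        raise ImportError("flash_attn is not installed")
    return {"loaded_with": kwargs}


model, used = load_with_fallback(fake_loader, "flash_attention_2")
print(used)
```

With transformers, `load_fn` could be `functools.partial(AutoModelForCausalLM.from_pretrained, model_path, trust_remote_code=True, torch_dtype=torch.bfloat16)`.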
@@ -24,11 +96,23 @@ def forward(self, batch):
         bs = batch['bs']
         num_hard_neg = int((len(batch['input_ids']) - 2*bs) / bs)
 
-        outputs = self.lm(batch['input_ids'],
-                        batch['attention_mask'],
-                        )
+        outputs = self.lm(
+            input_ids=batch['input_ids'],
+            attention_mask=batch['attention_mask'],
+            return_dict=True,
+            output_hidden_states=True
+        )
+
+        # For CausalLM models, take the hidden states of the last layer
+        if hasattr(outputs, 'hidden_states') and outputs.hidden_states is not None:
+            # hidden_states is a tuple holding the hidden states of every layer
+            passage_features_all_tokens = outputs.hidden_states[-1]
+        elif hasattr(outputs, 'last_hidden_state'):
+            passage_features_all_tokens = outputs.last_hidden_state
+        else:
+            # Fall back to the first element of the model output
+            passage_features_all_tokens = outputs[0]
 
-        passage_features_all_tokens = outputs.last_hidden_state
         return {
             'query_passage_features': torch.stack([passage_features_all_tokens[i, [batch['seq_lens'][i]-1]] for i in range(bs)]),
             'passage_passage_features': torch.stack([passage_features_all_tokens[i, [batch['seq_lens'][i]-1]] for i in range(bs, 2*bs)]),
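`forward` pools each sequence at index `seq_lens[i]-1`, which is the last real token only when padding sits on the right, as set by `padding_side='right'` in the tokenizer call. A small pure-Python illustration, assuming `seq_lens` counts the non-padding tokens (the tokenization script is not shown in this diff). The helper names are hypothetical, and plain lists stand in for tensors.

```python
def pooled_index(attention_mask_row):
    """Index used by forward(): sequence length minus one."""
    seq_len = sum(attention_mask_row)  # assumed meaning of seq_lens[i]
    return seq_len - 1


def last_real_token_index(attention_mask_row):
    """Position of the last token whose mask value is 1."""
    return max(i for i, m in enumerate(attention_mask_row) if m == 1)


right_padded = [1, 1, 1, 0, 0]
left_padded = [0, 0, 1, 1, 1]

# With right padding both helpers agree; with left padding they diverge.
print(pooled_index(right_padded), last_real_token_index(right_padded))
print(pooled_index(left_padded), last_real_token_index(left_padded))
```

With left padding the pooled position would land on an earlier token instead of the final one, so switching `padding_side` would also require changing the index computation.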