SpikingBrain-7B Model Produces Gibberish Output in PyTorch/Transformers
Summary
Both the V1-7B-base and V1-7B-sft-s3-reasoning models produce nonsensical Chinese characters instead of coherent responses when using standard PyTorch/Transformers inference, despite following all setup instructions.
Environment
- OS: Windows 11 with WSL2 Ubuntu 22.04
- GPU: NVIDIA RTX 3080 Ti (12GB VRAM)
- CUDA: 12.4
- PyTorch: 2.6.0+cu124
- Transformers: 4.45.0 (as specified in model config)
- flash-attention: 2.7.3 (compiled successfully)
- Model: V1-7B-sft-s3-reasoning (all 15 shards downloaded via ModelScope)
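For reference, the versions listed above can be re-checked from inside WSL2 with a quick snippet like the following (a generic sanity check, nothing SpikingBrain-specific):

```python
# Environment sanity check (sketch): prints the PyTorch, CUDA, and
# transformers versions plus the detected GPU.
import torch
import transformers

print("torch       :", torch.__version__)
print("cuda        :", torch.version.cuda, "| available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("gpu         :", torch.cuda.get_device_name(0))
print("transformers:", transformers.__version__)
```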
Installation Steps Completed
- Installed PyTorch 2.6.0 with CUDA 12.4 support
- Compiled flash-attention 2.7.3 in WSL2 Ubuntu (191MB wheel, successful)
- Installed nvidia-cuda-toolkit for nvcc compiler
- Downloaded V1-7B-sft-s3-reasoning model via ModelScope snapshot_download (see the sketch after this list)
- Verified all 15 pytorch_model shards present (~14GB total)
- Used transformers==4.45.0 (downgraded from 4.55.2 due to API incompatibility)
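For the download step, a minimal sketch using ModelScope's snapshot_download; the model id 'Panyuqi/V1-7B-sft-s3-reasoning' and the cache directory are assumptions inferred from the local path used in the test script below:

```python
# Sketch of the ModelScope download step; model id and cache_dir are
# assumptions based on the local path used in the test script.
from modelscope import snapshot_download

model_dir = snapshot_download(
    'Panyuqi/V1-7B-sft-s3-reasoning',             # assumed ModelScope model id
    cache_dir='/mnt/c/SpikingBrain/repo/models',  # local target directory
)
print("model downloaded to:", model_dir)
```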
Issues Encountered
Issue 1: Chinese Gibberish Output
Expected: Coherent English responses to simple questions
Actual: Random Chinese characters with no semantic meaning
Example:
Question: "What is 2 + 2? Answer briefly."
Output: "院学院学院学院学院学院学院学院学院学院学院学院城市"
The string is just the characters for "academy" (学院) repeated, ending in "city" (城市); it doesn't form meaningful text even in Chinese.
Issue 2: Extremely Slow Inference
- Model loading: 10+ minutes initially (improved to 2 seconds after tokenizer fix)
- First inference: 80+ seconds for 50 tokens
- Subsequent inferences: 60+ seconds per response
- Expected: 1-3 seconds per inference on RTX 3080 Ti
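For context, 50 tokens in 80 seconds works out to roughly 0.6 tokens/s. A minimal timing sketch that can be used to measure this, reusing the model, tokenizer, and inputs defined in the test script in the "Code Used" section below:

```python
# Timing sketch: measures wall-clock generation speed. Reuses `model`,
# `tokenizer`, and `inputs` from the test script in the "Code Used" section.
import time
import torch

torch.cuda.synchronize()
start = time.time()
with torch.no_grad():
    out = model.generate(
        inputs.input_ids,
        attention_mask=inputs.attention_mask,
        max_new_tokens=50,
        do_sample=False,
        pad_token_id=tokenizer.pad_token_id,
    )
torch.cuda.synchronize()
elapsed = time.time() - start
new_tokens = out.shape[1] - inputs.input_ids.shape[1]
print(f"{new_tokens} tokens in {elapsed:.1f}s ({new_tokens / elapsed:.2f} tok/s)")
```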
Issue 3: Tokenizer Configuration Mismatch
Found and fixed a mismatch between model config and tokenizer config:
- Model config.json: eos_token_id: 151643 (which is <|endoftext|>)
- Original tokenizer_config.json: eos_token: "<|im_end|>" (ID 151645)
- Fix: changed to "eos_token": "<|endoftext|>" to match model expectations
This fixed the slow model loading but did not fix the gibberish output.
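A small check like the following (same local path as in the test script) makes the mismatch visible by printing the ids each file resolves to:

```python
# Sketch: compare the tokenizer's special-token ids with the ids declared
# in config.json; MODEL_PATH is the same local path used in the test script.
from transformers import AutoConfig, AutoTokenizer

MODEL_PATH = '/mnt/c/SpikingBrain/repo/models/Panyuqi/V1-7B-sft-s3-reasoning'

tok = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
cfg = AutoConfig.from_pretrained(MODEL_PATH, trust_remote_code=True)

print("tokenizer eos_token   :", tok.eos_token, tok.eos_token_id)
print("config eos_token_id   :", cfg.eos_token_id)
print("id of <|endoftext|>   :", tok.convert_tokens_to_ids("<|endoftext|>"))
print("id of <|im_end|>      :", tok.convert_tokens_to_ids("<|im_end|>"))
```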
Code Used
Test Script
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
MODEL_PATH = '/mnt/c/SpikingBrain/repo/models/Panyuqi/V1-7B-sft-s3-reasoning'
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
if tokenizer.pad_token is None:
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(
MODEL_PATH,
trust_remote_code=True,
torch_dtype=torch.bfloat16,
).to('cuda')
model.eval()
# Using apply_chat_template for proper ChatML format
messages = [{"role": "user", "content": "What is 2 + 2? Answer briefly."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors='pt').to('cuda')
with torch.no_grad():
outputs = model.generate(
inputs.input_ids,
attention_mask=inputs.attention_mask,
max_new_tokens=50,
temperature=0.7,
top_p=0.9,
do_sample=True,
pad_token_id=tokenizer.pad_token_id,
)
full_response = tokenizer.decode(outputs[0], skip_special_tokens=True)
response = full_response[len(prompt):].strip()
print(f"Answer: {response}")Generated Prompt Format
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
What is 2 + 2? Answer briefly.<|im_end|>
<|im_start|>assistant
This follows the ChatML format specified in tokenizer_config.json.
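As an additional diagnostic (not part of the original script), the same inputs can be decoded greedily to rule out the sampling settings (temperature/top_p) as the source of the gibberish; a minimal sketch reusing the names defined above:

```python
# Diagnostic sketch: greedy decoding with the same `model`, `tokenizer`,
# and `inputs` as in the test script, so sampling is ruled out as the cause.
with torch.no_grad():
    greedy = model.generate(
        inputs.input_ids,
        attention_mask=inputs.attention_mask,
        max_new_tokens=50,
        do_sample=False,
        pad_token_id=tokenizer.pad_token_id,
    )
print(tokenizer.decode(greedy[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```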
What We've Verified
✓ Correct model variant (V1-7B-sft-s3-reasoning, not base)
✓ All 15 model shards downloaded and present
✓ tokenizer_config.json present and complete
✓ Proper ChatML format using apply_chat_template()
✓ Correct transformers version (4.45.0)
✓ flash-attention compiled successfully
✓ CUDA working properly (GPU detected and used)
✓ Both .to('cuda') and device_map='auto' attempted (see the sketch after this list)
✓ Model config auto_map preserved (required for GLAswaForCausalLM)
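For completeness, a minimal sketch of the device_map='auto' loading path mentioned above (it assumes the accelerate package is installed):

```python
# Sketch of the alternative loading path with device_map='auto'
# (requires the `accelerate` package); MODEL_PATH as in the test script.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map='auto',  # let accelerate place the layers instead of .to('cuda')
)
```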
Files Checked
tokenizer_config.json
- Qwen2Tokenizer class
- BOS: <|im_start|> (ID 151644)
- EOS: <|endoftext|> (ID 151643) - fixed to match model
- PAD: <|endoftext|> (ID 151643)
- Chat template present and correct
- Model max length: 131072
config.json
- Architecture: GLAswaForCausalLM
- bos_token_id: 151643
- eos_token_id: 151643
- auto_map preserved (required)
- transformers_version: "4.49.0" in chat model, "4.45.0" in base model
generation_config.json
- bos_token_id: 151643
- eos_token_id: 151643
- max_new_tokens: 2048
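The fields listed above can be confirmed directly from the files in the model directory, e.g. with a small sketch like:

```python
# Sketch: print the relevant fields of config.json and
# generation_config.json straight from the model directory.
import json
from pathlib import Path

model_dir = Path('/mnt/c/SpikingBrain/repo/models/Panyuqi/V1-7B-sft-s3-reasoning')

cfg = json.loads((model_dir / "config.json").read_text())
gen = json.loads((model_dir / "generation_config.json").read_text())

for key in ("architectures", "bos_token_id", "eos_token_id",
            "auto_map", "transformers_version"):
    print("config.json:", key, "=", cfg.get(key))
for key in ("bos_token_id", "eos_token_id", "max_new_tokens"):
    print("generation_config.json:", key, "=", gen.get(key))
```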
Questions
- Is the GLAswaForCausalLM architecture fully compatible with standard transformers?
  - The custom architecture may have special requirements not documented in the README
- Are there additional dependencies or setup steps for the SFT chat model?
  - A W8ASpike variant was mentioned in the docs - does the chat model require something similar?
- Has anyone successfully run V1-7B-sft-s3-reasoning with standard PyTorch/Transformers?
  - Would appreciate working example code if available
- Why is inference so slow on RTX 3080 Ti?
  - 60-80 seconds per response seems excessive for a 7B model on this hardware
- Is there a known issue with tokenizer configuration in the released model?
  - The eos_token mismatch we found suggests possible configuration issues in the release
Additional Context
- Both V1-7B-base and V1-7B-sft-s3-reasoning exhibit the same gibberish output
- On the same machine, ordinary models work via LMStudio in moments
- This suggests the issue is specific to SpikingBrain-7B, not the environment
Request
Could the maintainers provide:
- A complete working example for V1-7B-sft-s3-reasoning inference
- Verification that the released model weights are correct
- Any additional setup steps or dependencies not mentioned in the README
- Expected inference performance benchmarks for RTX 3080 Ti
Thank you for developing this interesting spiking neural network model! We're eager to get it working properly.
Files Generated During Testing
Available if helpful for debugging:
- test_fixed_tokenizer.py - our test script
- chat_gradio.py - Gradio interface implementation
- flash_attn_install.log - flash-attention compilation log
- Modified tokenizer_config.json with eos_token fix