SpikingBrain-7B Model Produces Gibberish Output in PyTorch/Transformers
Summary
Both the V1-7B-base and V1-7B-sft-s3-reasoning models produce nonsensical Chinese characters instead of coherent responses when using standard PyTorch/Transformers inference, despite following all setup instructions.
Environment
- OS: Windows 11 with WSL2 Ubuntu 22.04
- GPU: NVIDIA RTX 3080 Ti (12GB VRAM)
- CUDA: 12.4
- PyTorch: 2.6.0+cu124
- Transformers: 4.45.0 (as specified in model config)
- flash-attention: 2.7.3 (compiled successfully)
- Model: V1-7B-sft-s3-reasoning (all 15 shards downloaded via ModelScope)
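For reference, the versions listed above can be re-checked from inside WSL2 with a quick snippet like the following (a generic sanity check, nothing SpikingBrain-specific):

```python
# Environment sanity check (sketch): prints the PyTorch, CUDA, and
# transformers versions plus the detected GPU.
import torch
import transformers

print("torch       :", torch.__version__)
print("cuda        :", torch.version.cuda, "| available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("gpu         :", torch.cuda.get_device_name(0))
print("transformers:", transformers.__version__)
```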
Installation Steps Completed
- Installed PyTorch 2.6.0 with CUDA 12.4 support
- Compiled flash-attention 2.7.3 in WSL2 Ubuntu (191MB wheel, successful)
- Installed nvidia-cuda-toolkit for nvcc compiler
- Downloaded V1-7B-sft-s3-reasoning model via ModelScope snapshot_download (see the sketch after this list)
- Verified all 15 pytorch_model shards present (~14GB total)
- Used transformers==4.45.0 (downgraded from 4.55.2 due to API incompatibility)
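For the download step, a minimal sketch using ModelScope's snapshot_download; the model id 'Panyuqi/V1-7B-sft-s3-reasoning' and the cache directory are assumptions inferred from the local path used in the test script below:

```python
# Sketch of the ModelScope download step; model id and cache_dir are
# assumptions based on the local path used in the test script.
from modelscope import snapshot_download

model_dir = snapshot_download(
    'Panyuqi/V1-7B-sft-s3-reasoning',             # assumed ModelScope model id
    cache_dir='/mnt/c/SpikingBrain/repo/models',  # local target directory
)
print("model downloaded to:", model_dir)
```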
Issues Encountered
Issue 1: Chinese Gibberish Output
Expected: Coherent English responses to simple questions
Actual: Random Chinese characters with no semantic meaning
Example:
Question: "What is 2 + 2? Answer briefly."
Output: "院学院学院学院学院学院学院学院学院学院学院学院城市"
The string is just the characters for "academy" (学院) repeated, ending in "city" (城市); it doesn't form meaningful text even in Chinese.
Issue 2: Extremely Slow Inference
- Model loading: 10+ minutes initially (improved to 2 seconds after tokenizer fix)
- First inference: 80+ seconds for 50 tokens
- Subsequent inferences: 60+ seconds per response
- Expected: 1-3 seconds per inference on RTX 3080 Ti
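For context, 50 tokens in 80 seconds works out to roughly 0.6 tokens/s. A minimal timing sketch that can be used to measure this, reusing the model, tokenizer, and inputs defined in the test script in the "Code Used" section below:

```python
# Timing sketch: measures wall-clock generation speed. Reuses `model`,
# `tokenizer`, and `inputs` from the test script in the "Code Used" section.
import time
import torch

torch.cuda.synchronize()
start = time.time()
with torch.no_grad():
    out = model.generate(
        inputs.input_ids,
        attention_mask=inputs.attention_mask,
        max_new_tokens=50,
        do_sample=False,
        pad_token_id=tokenizer.pad_token_id,
    )
torch.cuda.synchronize()
elapsed = time.time() - start
new_tokens = out.shape[1] - inputs.input_ids.shape[1]
print(f"{new_tokens} tokens in {elapsed:.1f}s ({new_tokens / elapsed:.2f} tok/s)")
```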
Issue 3: Tokenizer Configuration Mismatch
Found and fixed a mismatch between model config and tokenizer config:
- Model config.json: eos_token_id: 151643 (which is <|endoftext|>)
- Original tokenizer_config.json: eos_token: "<|im_end|>" (ID 151645)
- Fix: changed to "eos_token": "<|endoftext|>" to match model expectations
This fixed the slow model loading but did not fix the gibberish output.
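A small check like the following (same local path as in the test script) makes the mismatch visible by printing the ids each file resolves to:

```python
# Sketch: compare the tokenizer's special-token ids with the ids declared
# in config.json; MODEL_PATH is the same local path used in the test script.
from transformers import AutoConfig, AutoTokenizer

MODEL_PATH = '/mnt/c/SpikingBrain/repo/models/Panyuqi/V1-7B-sft-s3-reasoning'

tok = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
cfg = AutoConfig.from_pretrained(MODEL_PATH, trust_remote_code=True)

print("tokenizer eos_token   :", tok.eos_token, tok.eos_token_id)
print("config eos_token_id   :", cfg.eos_token_id)
print("id of <|endoftext|>   :", tok.convert_tokens_to_ids("<|endoftext|>"))
print("id of <|im_end|>      :", tok.convert_tokens_to_ids("<|im_end|>"))
```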
Code Used
Test Script
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
MODEL_PATH = '/mnt/c/SpikingBrain/repo/models/Panyuqi/V1-7B-sft-s3-reasoning'
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
if tokenizer.pad_token is None:
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(
MODEL_PATH,
trust_remote_code=True,
torch_dtype=torch.bfloat16,
).to('cuda')
model.eval()
# Using apply_chat_template for proper ChatML format
messages = [{"role": "user", "content": "What is 2 + 2? Answer briefly."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors='pt').to('cuda')
with torch.no_grad():
outputs = model.generate(
inputs.input_ids,
attention_mask=inputs.attention_mask,
max_new_tokens=50,
temperature=0.7,
top_p=0.9,
do_sample=True,
pad_token_id=tokenizer.pad_token_id,
)
full_response = tokenizer.decode(outputs[0], skip_special_tokens=True)
response = full_response[len(prompt):].strip()
print(f"Answer: {response}")Generated Prompt Format
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
What is 2 + 2? Answer briefly.<|im_end|>
<|im_start|>assistant
This follows the ChatML format specified in tokenizer_config.json.
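As an additional diagnostic (not part of the original script), the same inputs can be decoded greedily to rule out the sampling settings (temperature/top_p) as the source of the gibberish; a minimal sketch reusing the names defined above:

```python
# Diagnostic sketch: greedy decoding with the same `model`, `tokenizer`,
# and `inputs` as in the test script, so sampling is ruled out as the cause.
with torch.no_grad():
    greedy = model.generate(
        inputs.input_ids,
        attention_mask=inputs.attention_mask,
        max_new_tokens=50,
        do_sample=False,
        pad_token_id=tokenizer.pad_token_id,
    )
print(tokenizer.decode(greedy[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```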
What We've Verified
✓ Correct model variant (V1-7B-sft-s3-reasoning, not base)
✓ All 15 model shards downloaded and present
✓ tokenizer_config.json present and complete
✓ Proper ChatML format using apply_chat_template()
✓ Correct transformers version (4.45.0)
✓ flash-attention compiled successfully
✓ CUDA working properly (GPU detected and used)
✓ Both .to('cuda') and device_map='auto' attempted (see the sketch after this list)
✓ Model config auto_map preserved (required for GLAswaForCausalLM)
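For completeness, a minimal sketch of the device_map='auto' loading path mentioned above (it assumes the accelerate package is installed):

```python
# Sketch of the alternative loading path with device_map='auto'
# (requires the `accelerate` package); MODEL_PATH as in the test script.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map='auto',  # let accelerate place the layers instead of .to('cuda')
)
```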
Files Checked
tokenizer_config.json
- Qwen2Tokenizer class
- BOS: <|im_start|> (ID 151644)
- EOS: <|endoftext|> (ID 151643) - fixed to match model
- PAD: <|endoftext|> (ID 151643)
- Chat template present and correct
- Model max length: 131072
config.json
- Architecture: GLAswaForCausalLM
- bos_token_id: 151643
- eos_token_id: 151643
- auto_map preserved (required)
- transformers_version: "4.49.0" in chat model, "4.45.0" in base model
generation_config.json
- bos_token_id: 151643
- eos_token_id: 151643
- max_new_tokens: 2048
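The fields listed above can be confirmed directly from the files in the model directory, e.g. with a small sketch like:

```python
# Sketch: print the relevant fields of config.json and
# generation_config.json straight from the model directory.
import json
from pathlib import Path

model_dir = Path('/mnt/c/SpikingBrain/repo/models/Panyuqi/V1-7B-sft-s3-reasoning')

cfg = json.loads((model_dir / "config.json").read_text())
gen = json.loads((model_dir / "generation_config.json").read_text())

for key in ("architectures", "bos_token_id", "eos_token_id",
            "auto_map", "transformers_version"):
    print("config.json:", key, "=", cfg.get(key))
for key in ("bos_token_id", "eos_token_id", "max_new_tokens"):
    print("generation_config.json:", key, "=", gen.get(key))
```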
Questions
- Is the GLAswaForCausalLM architecture fully compatible with standard transformers?
  - The custom architecture may have special requirements not documented in the README
- Are there additional dependencies or setup steps for the SFT chat model?
  - A W8ASpike variant was mentioned in the docs - does the chat model require something similar?
- Has anyone successfully run V1-7B-sft-s3-reasoning with standard PyTorch/Transformers?
  - Would appreciate working example code if available
- Why is inference so slow on RTX 3080 Ti?
  - 60-80 seconds per response seems excessive for a 7B model on this hardware
- Is there a known issue with tokenizer configuration in the released model?
  - The eos_token mismatch we found suggests possible configuration issues in the release
Additional Context
- Both V1-7B-base and V1-7B-sft-s3-reasoning exhibit the same gibberish output
- On the same machine, ordinary models work via LMStudio in moments
- This suggests the issue is specific to SpikingBrain-7B, not the environment
Request
Could the maintainers provide:
- A complete working example for V1-7B-sft-s3-reasoning inference
- Verification that the released model weights are correct
- Any additional setup steps or dependencies not mentioned in the README
- Expected inference performance benchmarks for RTX 3080 Ti
Thank you for developing this interesting spiking neural network model! We're eager to get it working properly.
Files Generated During Testing
Available if helpful for debugging:
- test_fixed_tokenizer.py - our test script
- chat_gradio.py - Gradio interface implementation
- flash_attn_install.log - flash-attention compilation log
- Modified tokenizer_config.json with eos_token fix