SpikingBrain-7B Model Produces Gibberish Output in PyTorch/Transformers #31

@SoulFireMage

Description

Summary

Both the V1-7B-base and V1-7B-sft-s3-reasoning models produce nonsensical Chinese characters instead of coherent responses when using standard PyTorch/Transformers inference, despite following all setup instructions.

Environment

  • OS: Windows 11 with WSL2 Ubuntu 22.04
  • GPU: NVIDIA RTX 3080 Ti (12GB VRAM)
  • CUDA: 12.4
  • PyTorch: 2.6.0+cu124
  • Transformers: 4.45.0 (as specified in model config)
  • flash-attention: 2.7.3 (compiled successfully)
  • Model: V1-7B-sft-s3-reasoning (all 15 shards downloaded via ModelScope)

Installation Steps Completed

  1. Installed PyTorch 2.6.0 with CUDA 12.4 support
  2. Compiled flash-attention 2.7.3 in WSL2 Ubuntu (191MB wheel, successful)
  3. Installed nvidia-cuda-toolkit for nvcc compiler
  4. Downloaded V1-7B-sft-s3-reasoning model via ModelScope snapshot_download (see the sketch after this list)
  5. Verified all 15 pytorch_model shards present (~14GB total)
  6. Used transformers==4.45.0 (downgraded from 4.55.2 due to API incompatibility)
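
For reference, step 4 used ModelScope's snapshot_download; a minimal sketch (assuming the ModelScope model ID matches the local path used in the test script below) looks like this:

from modelscope import snapshot_download

# Download all shards of the chat model (model ID assumed to be
# 'Panyuqi/V1-7B-sft-s3-reasoning'; adjust cache_dir as needed)
model_dir = snapshot_download(
    'Panyuqi/V1-7B-sft-s3-reasoning',
    cache_dir='/mnt/c/SpikingBrain/repo/models',
)
print(model_dir)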

Issues Encountered

Issue 1: Chinese Gibberish Output

Expected: Coherent English responses to simple questions
Actual: Random Chinese characters with no semantic meaning

Example:

Question: "What is 2 + 2? Answer briefly."
Output: "院学院学院学院学院学院学院学院学院学院学院学院城市"

This output doesn't translate to meaningful text even in Chinese.

Issue 2: Extremely Slow Inference

  • Model loading: 10+ minutes initially (improved to 2 seconds after tokenizer fix)
  • First inference: 80+ seconds for 50 tokens
  • Subsequent inferences: 60+ seconds per response
  • Expected: 1-3 seconds per inference on RTX 3080 Ti
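
The numbers above are rough wall-clock measurements; a minimal timing harness along these lines (a sketch that reuses the model and tokenizer from the test script below) can be used to reproduce them:

import time
import torch

def timed_generate(model, tokenizer, prompt, max_new_tokens=50):
    inputs = tokenizer(prompt, return_tensors='pt').to('cuda')
    torch.cuda.synchronize()          # finish any pending GPU work before timing
    start = time.perf_counter()
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    torch.cuda.synchronize()          # wait for generation to complete
    elapsed = time.perf_counter() - start
    new_tokens = out.shape[-1] - inputs.input_ids.shape[-1]
    print(f"{new_tokens} tokens in {elapsed:.1f}s ({new_tokens / elapsed:.2f} tok/s)")
    return out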

Issue 3: Tokenizer Configuration Mismatch

Found and fixed a mismatch between model config and tokenizer config:

  • Model config.json: eos_token_id: 151643 (which is <|endoftext|>)
  • Original tokenizer_config.json: eos_token: "<|im_end|>" (ID 151645)
  • Fix: Changed to "eos_token": "<|endoftext|>" to match model expectations

This fixed the slow model loading but did not fix the gibberish output.
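
The mismatch is easy to see by comparing what transformers loads from config.json against the tokenizer itself; a quick sanity check along these lines (a sketch, reusing MODEL_PATH from the test script below) prints the conflicting IDs:

from transformers import AutoConfig, AutoTokenizer

cfg = AutoConfig.from_pretrained(MODEL_PATH, trust_remote_code=True)
tok = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)

print("config eos_token_id :", cfg.eos_token_id)                     # 151643 per config.json
print("tokenizer eos_token :", tok.eos_token, "->", tok.eos_token_id)
print("id 151643 decodes to:", tok.convert_ids_to_tokens(151643))    # <|endoftext|>
print("id 151645 decodes to:", tok.convert_ids_to_tokens(151645))    # <|im_end|>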

Code Used

Test Script

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = '/mnt/c/SpikingBrain/repo/models/Panyuqi/V1-7B-sft-s3-reasoning'

# Load tokenizer and make sure a pad token is defined
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Load the model in bfloat16 and move it to the GPU
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
).to('cuda')
model.eval()

# Using apply_chat_template for proper ChatML format
messages = [{"role": "user", "content": "What is 2 + 2? Answer briefly."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

inputs = tokenizer(prompt, return_tensors='pt').to('cuda')

# Sampled generation with a modest token budget
with torch.no_grad():
    outputs = model.generate(
        inputs.input_ids,
        attention_mask=inputs.attention_mask,
        max_new_tokens=50,
        temperature=0.7,
        top_p=0.9,
        do_sample=True,
        pad_token_id=tokenizer.pad_token_id,
    )

# Decode the full sequence and strip the prompt text from the front
full_response = tokenizer.decode(outputs[0], skip_special_tokens=True)
response = full_response[len(prompt):].strip()
print(f"Answer: {response}")

Generated Prompt Format

<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
What is 2 + 2? Answer briefly.<|im_end|>
<|im_start|>assistant

This follows the ChatML format specified in tokenizer_config.json.

What We've Verified

✓ Correct model variant (V1-7B-sft-s3-reasoning, not base)
✓ All 15 model shards downloaded and present
✓ tokenizer_config.json present and complete
✓ Proper ChatML format using apply_chat_template()
✓ Correct transformers version (4.45.0)
✓ flash-attention compiled successfully
✓ CUDA working properly (GPU detected and used)
✓ Both .to('cuda') and device_map='auto' attempted (see the loading sketch after this list)
✓ Model config auto_map preserved (required for GLAswaForCausalLM)
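
For completeness, the device_map='auto' loading variant referenced above was along these lines (a sketch; only the weight placement differs from the .to('cuda') version in the test script):

# Alternative loading path: let accelerate place the weights automatically
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map='auto',
)
model.eval()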

Files Checked

tokenizer_config.json

  • Qwen2Tokenizer class
  • BOS: <|im_start|> (ID 151644)
  • EOS: <|endoftext|> (ID 151643) - fixed to match model
  • PAD: <|endoftext|> (ID 151643)
  • Chat template present and correct
  • Model max length: 131072

config.json

  • Architecture: GLAswaForCausalLM
  • bos_token_id: 151643
  • eos_token_id: 151643
  • auto_map preserved (required)
  • transformers_version: "4.49.0" in chat model, "4.45.0" in base model

generation_config.json

  • bos_token_id: 151643
  • eos_token_id: 151643
  • max_new_tokens: 2048
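
The values above were read directly from the JSON files; a small helper like the following (a sketch, again assuming MODEL_PATH from the test script) dumps the relevant fields from all three files side by side:

import json
from pathlib import Path

FIELDS = ("bos_token_id", "eos_token_id", "bos_token", "eos_token", "pad_token")
for name in ("config.json", "generation_config.json", "tokenizer_config.json"):
    data = json.loads((Path(MODEL_PATH) / name).read_text())
    print(name, {k: data[k] for k in FIELDS if k in data})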

Questions

  1. Is the GLAswaForCausalLM architecture fully compatible with standard transformers?
    • The custom architecture may have special requirements not documented in the README

  2. Are there additional dependencies or setup steps for the SFT chat model?
    • The W8ASpike variant is mentioned in the docs - does the chat model require something similar?

  3. Has anyone successfully run V1-7B-sft-s3-reasoning with standard PyTorch/Transformers?
    • We would appreciate working example code if available

  4. Why is inference so slow on an RTX 3080 Ti?
    • 60-80 seconds per response seems excessive for a 7B model on this hardware

  5. Is there a known issue with the tokenizer configuration in the released model?
    • The eos_token mismatch we found suggests possible configuration issues in the release

Additional Context

  • Both V1-7B-base and V1-7B-sft-s3-reasoning exhibit the same gibberish output
  • On the same machine, ordinary models run via LMStudio in moments
  • This suggests the issue is specific to SpikingBrain-7B, not the environment

Request

Could the maintainers provide:

  1. A complete working example for V1-7B-sft-s3-reasoning inference
  2. Verification that the released model weights are correct
  3. Any additional setup steps or dependencies not mentioned in the README
  4. Expected inference performance benchmarks for RTX 3080 Ti

Thank you for developing this interesting spiking neural network model! We're eager to get it working properly.


Files Generated During Testing

Available if helpful for debugging:

  • test_fixed_tokenizer.py - Our test script
  • chat_gradio.py - Gradio interface implementation
  • flash_attn_install.log - Flash-attention compilation log
  • Modified tokenizer_config.json with eos_token fix
