TongGu is a classical Chinese large language model developed by the Deep Learning and Visual Computing Laboratory (SCUT-DLVCLab) at South China University of Technology, with strong capabilities in multimodal Classical Chinese Studies (CCS).
ACCN-INS: 358,000 multimodal fine-tuning data samples from ancient texts, covering tasks such as ancient text recognition, reading comprehension, and classical Chinese translation.
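The exact schema of ACCN-INS is not published here; a hypothetical record for an ancient-text-recognition task might look like the sketch below. The field names (`image`, `instruction`, `response`) and the helper `to_chat_messages` are illustrative assumptions, not the dataset's actual keys:

```python
# A hypothetical ACCN-INS-style record for an ancient-text-recognition task.
# Field names are illustrative; the real dataset schema may differ.
sample = {
    "image": "images/stele_rubbing_0001.png",          # path to an ancient-text image
    "instruction": "Identify the text in the image.",  # task prompt
    "response": "國破山河在，城春草木深。",                # ground-truth transcription
}

def to_chat_messages(record):
    """Convert a record into the chat format typically used for instruction tuning."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": record["image"]},
                {"type": "text", "text": record["instruction"]},
            ],
        },
        {
            "role": "assistant",
            "content": [{"type": "text", "text": record["response"]}],
        },
    ]
```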
The CCS358K dataset may be used for non-commercial research purposes only. Scholars or organizations who wish to use the dataset should first fill in the Application Form and email it to us. When submitting the form, please list or attach 1-2 of your publications from the last 6 years to show that you (or your team) conduct research in fields related to classical Chinese. We will send you the download link and the decompression password once your application has been received and approved. All users must comply with all conditions of use; otherwise, the authorization will be revoked.
TongGu-VL-2B-Instruct: A 2B-parameter multimodal model for classical Chinese literature, instruction-tuned on 358K multimodal classical-document samples. It supports tasks such as text recognition and calligraphy appreciation.
- 2025/07/06 The TongGu paper was accepted at ACM MM 2025.
```python
# transformers == 4.48.2
import torch
from transformers import AutoModelForCausalLM, AutoProcessor
from qwen_vl_utils import process_vision_info

model_id = "SCUT-DLVCLab/TongGu-VL-2B-Instruct"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)


def use_model(input_image, input_prompt):
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": input_image},
                {"type": "text", "text": input_prompt},
            ],
        }
    ]

    # Prepare the chat-formatted prompt and the vision inputs
    text = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    image_inputs, video_inputs = process_vision_info(messages)
    inputs = processor(
        text=[text],
        images=image_inputs,
        videos=video_inputs,
        padding=True,
        return_tensors="pt",
    )

    # Build the auxiliary OCR-guided prompt: the raw instruction followed by
    # the vision placeholder tokens, tokenized separately for the custom model
    guided_text = input_prompt + "<|vision_start|><|image_pad|><|vision_end|>"
    inputs_ocr = processor(
        text=[guided_text],
        images=image_inputs,
        videos=video_inputs,
        padding=False,
        return_tensors="pt",
    )
    inputs["input_ids_ocr"] = inputs_ocr["input_ids"]
    inputs["attention_mask_ocr"] = inputs_ocr["attention_mask"]
    inputs = inputs.to("cuda")

    # Inference: generate the output (do_sample=True so that
    # temperature/top_p/top_k actually take effect)
    generated_ids = model.generate(
        **inputs,
        max_new_tokens=2048,
        do_sample=True,
        temperature=0.8,
        top_p=0.95,
        top_k=50,
    )
    # Strip the prompt tokens so only the newly generated tokens are decoded
    generated_ids_trimmed = [
        out_ids[len(in_ids):]
        for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
    ]
    output_text = processor.batch_decode(
        generated_ids_trimmed,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=False,
    )
    return output_text[0]


image = "your image here"  # path to your image
prompt = "Identify the text in the image."
print(use_model(image, prompt))
```

If you find TongGu-VL helpful, please cite:

```bibtex
@inproceedings{cao2025tonggu,
  title={TongGu-VL: Advancing Visual-Language Understanding in Chinese Classical Studies through Parameter Sensitivity-Guided Instruction Tuning},
  author={Cao, Jiahuan and Liu, Yang and Zhang, Peirong and Shi, Yongxin and Ding, Kai and Jin, Lianwen},
  booktitle={Proceedings of the 33rd ACM International Conference on Multimedia},
  pages={11111--11120},
  year={2025}
}
```
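The prompt-stripping step in the inference code above (slicing each generated sequence past the prompt length before decoding) can be illustrated with plain lists, independent of the model; the token ids below are toy values:

```python
# model.generate returns prompt tokens followed by new tokens;
# only the new part should be decoded.
input_ids = [[101, 102, 103]]               # toy prompt token ids (batch of 1)
generated_ids = [[101, 102, 103, 7, 8, 9]]  # toy generate() output for that batch

trimmed = [out[len(inp):] for inp, out in zip(input_ids, generated_ids)]
print(trimmed)  # [[7, 8, 9]]
```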
After extensive instruction tuning on large-scale data, TongGu-VL has developed strong multimodal understanding of classical Chinese literature, including text recognition and calligraphy appreciation. However, due to its limited model scale, the autoregressive generation paradigm, and other factors, TongGu-VL may still produce misleading responses containing factual errors, or harmful content exhibiting bias or discrimination. Please use it with caution and exercise critical judgment. Do not disseminate harmful content generated by TongGu-VL on the internet; any adverse consequences of such dissemination are the sole responsibility of the disseminator.
