Internalizing ASR with Implicit Chain of Thought for Efficient Speech-to-Speech Conversational LLM

Abstract

Current speech-based LLMs are predominantly trained on extensive ASR and TTS datasets, excelling in tasks related to these domains. However, their ability to handle direct speech-to-speech conversations remains notably constrained. These models often rely on an ASR-to-TTS chain-of-thought pipeline, converting speech into text for processing before generating audio responses, which introduces latency and loses audio features. We propose a method that implicitly internalizes ASR chain of thought into a speech LLM, enhancing its native speech understanding capabilities. Our approach reduces latency and improves the model's native understanding of speech, paving the way for more efficient and natural real-time audio interactions. We also release a large-scale synthetic conversational dataset to facilitate further research.
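As a rough illustration of what "internalizing" an ASR chain of thought could look like during training, the sketch below builds a training target in which the ASR transcript is gradually dropped until only the response remains. This is a minimal sketch under stated assumptions, not the code in this repository: the function name `build_target`, the linear schedule, the choice to drop tokens from the front of the transcript, and the use of plain strings in place of real text and audio tokens are all hypothetical.

```python
from typing import List


def build_target(
    asr_tokens: List[str],
    response_tokens: List[str],
    step: int,
    total_steps: int,
) -> List[str]:
    """Return the training target for one example at a given training step.

    Assumption: early in training the target is the full ASR transcript
    followed by the response, and leading transcript tokens are removed
    linearly over `total_steps` until only the response is left.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Fraction of the removal schedule completed, clamped to [0, 1].
    progress = min(max(step / total_steps, 0.0), 1.0)
    # Number of leading ASR tokens dropped so far (rounded down).
    n_removed = int(progress * len(asr_tokens))
    return asr_tokens[n_removed:] + response_tokens


if __name__ == "__main__":
    asr = ["how", "are", "you", "today"]
    reply = ["<audio_1>", "<audio_2>"]
    for s in (0, 5, 10):
        print(s, build_target(asr, reply, s, total_steps=10))
```

Once the schedule completes, the model is trained to produce the response directly from speech, with no intermediate transcript in the target.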

Results

Our model achieves a competitive average win rate of 42.3% against ground-truth labels. This suggests that internalizing the ASR chain of thought (CoT) introduces only minor quality degradation.
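This README does not spell out how the win rate is computed. For readers unfamiliar with the metric, the sketch below shows one common way to turn pairwise judgments into a win-rate percentage. The function name `win_rate`, the verdict labels, and the convention of counting a tie as half a win are assumptions for illustration, not the evaluation code used here.

```python
from typing import Iterable


def win_rate(verdicts: Iterable[str]) -> float:
    """Return the percentage of pairwise comparisons won by the model.

    Each verdict is "model", "reference", or "tie". Counting a tie as half
    a win is an assumed convention; other evaluations drop ties instead.
    """
    score = 0.0
    total = 0
    for verdict in verdicts:
        if verdict == "model":
            score += 1.0
        elif verdict == "tie":
            score += 0.5
        elif verdict != "reference":
            raise ValueError(f"unknown verdict: {verdict!r}")
        total += 1
    if total == 0:
        raise ValueError("at least one verdict is required")
    return 100.0 * score / total


if __name__ == "__main__":
    print(win_rate(["model", "reference", "tie", "reference"]))
```

Under this convention, a win rate of 50% would mean the model's responses are judged on par with the reference responses.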

Setup

For CUDA machines

make

For HPU machines

make install-hpu

Run

pdm start

Run linters

pdm run lint

Run formatters

pdm run lint-format

