Description
When running speech recognition with ReazonSpeech, the model outputs only a single word, regardless of the length or content of the input audio. This happens even with clear audio files containing multiple words or full sentences.
Code:

```python
import io
import tempfile

import librosa
import numpy as np
import soundfile as sf

# from reazonspeech.nemo.asr import load_model, transcribe, audio_from_path
from reazonspeech.k2.asr import load_model, transcribe, audio_from_path

# === Load ReazonSpeech model from Hugging Face ===
# model = load_model("reazon-research/reazonspeech-k2-v2-ja-en")
model = load_model(device="cpu", precision="fp32", language="ja")  # or language="ja-en" for the bilingual model

# === Step 1: Load and resample audio to 16,000 Hz ===
audio_path = r'D:\Image_Based_searchengine\product_images\audio (3).wav'
y, sr = librosa.load(audio_path, sr=16000, mono=True)

# === Step 2: Amplify the audio by 1.5x and clip to avoid distortion ===
amplified_y = np.clip(y * 1.5, -1.0, 1.0)

# === Step 3: Write amplified audio to an in-memory buffer ===
buffer = io.BytesIO()
sf.write(buffer, amplified_y, 16000, format='WAV', subtype='PCM_16')
buffer.seek(0)

# === Step 4: Save buffer to a temp WAV file for the ASR model ===
with tempfile.NamedTemporaryFile(delete=False, suffix=".wav") as tmp:
    tmp.write(buffer.read())
    temp_wav_path = tmp.name

# === Step 5: Transcribe ===
audio = audio_from_path(temp_wav_path)
print("audio.samplerate:", audio.samplerate)
ret = transcribe(model, audio)
print("Transcribed Text:", ret.text)
```
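One thing worth noting about step 2 above: `np.clip` keeps the samples in the valid `[-1.0, 1.0]` float range, but any sample whose original magnitude exceeded ~0.67 gets flattened to ±1.0 (hard clipping), which distorts the waveform the model sees. A minimal sketch of that behavior, with made-up sample values:

```python
import numpy as np

# Hypothetical sample values, standing in for a real audio frame.
y = np.array([0.2, 0.5, 0.8, -0.9])

# Scale by 1.5x, then cap at [-1, 1]: the first two samples scale
# cleanly, but the last two exceed the cap and are flattened to +/-1.0.
amplified = np.clip(y * 1.5, -1.0, 1.0)
print(amplified)
```

If the recording already has strong peaks, this clipping could itself be degrading recognition, so it may be worth testing without the amplification step.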
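As a debugging step (not part of the original report), it may help to verify that the temp WAV written in step 4 actually contains the full-length audio before suspecting the model. A self-contained sketch using only the stdlib `wave` module, with a 2-second 16 kHz sine wave standing in for the real recording:

```python
import io
import math
import struct
import wave

def write_sine_wav(fobj, seconds=2.0, sr=16000, freq=440.0):
    """Write a mono 16-bit PCM sine wave to a file-like object."""
    n_frames = int(seconds * sr)
    with wave.open(fobj, "wb") as w:
        w.setnchannels(1)   # mono, matching librosa.load(mono=True)
        w.setsampwidth(2)   # 16-bit PCM, matching subtype='PCM_16'
        w.setframerate(sr)
        w.writeframes(b"".join(
            struct.pack("<h", int(32767 * 0.5 * math.sin(2 * math.pi * freq * i / sr)))
            for i in range(n_frames)
        ))

def wav_duration_seconds(fobj):
    """Return the duration of a WAV stream from its header: frames / rate."""
    with wave.open(fobj, "rb") as w:
        return w.getnframes() / w.getframerate()

buf = io.BytesIO()
write_sine_wav(buf, seconds=2.0)
buf.seek(0)
print(wav_duration_seconds(buf))  # expect 2.0
```

Running `wav_duration_seconds` on `temp_wav_path` in the script above would confirm whether the whole recording survives the buffer-to-tempfile round trip; if the duration checks out, the truncation is happening inside the model pipeline, not in the audio preparation.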