
Commit 004dc88

Merge pull request #100 from danielferr85/main
(BETA) STT with OmniASR CTC models for the OpenAI Transcriptions endpoint, and restore the default max context size for easier use by beginner users
2 parents 5a13339 + a151fb8 commit 004dc88

File tree

7 files changed: +503 −6 lines changed


README.md

Lines changed: 26 additions & 1 deletion
@@ -1,6 +1,6 @@
 # RKLLama: LLM Server and Client for Rockchip 3588/3576
 
-### [Version: 0.0.53](#New-Version)
+### [Version: 0.0.54](#New-Version)
 
 Video demo ( version 0.0.1 ):
@@ -52,6 +52,7 @@ A server to run and interact with LLM models optimized for Rockchip RK3588(S) an
 * `/v1/embeddings`
 * `/v1/images/generations`
 * `/v1/audio/speech`
+* `/v1/audio/transcriptions`
 - **Tool/Function Calling** - Complete support for tool calls with multiple LLM formats (Qwen, Llama 3.2+, others).
 - **Pull models directly from Huggingface.**
 - **Includes a REST API with documentation.**
@@ -70,6 +71,7 @@ A server to run and interact with LLM models optimized for Rockchip RK3588(S) an
 - **Multimodal Support** - Use Qwen2VL/Qwen2.5VL/Qwen3VL/MiniCPMV4/MiniCPMV4.5/InternVL3.5 vision models to ask questions about images (base64, local file, or URL image address). More than one image in the same request is allowed.
 - **Image Generation** - Generate images with the OpenAI Image generation endpoint using LCM Stable Diffusion 1.5 RKNN models.
 - **Text to Speech (TTS)** - Generate speech with the OpenAI Audio Speech endpoint using Piper TTS models, running the encoder with ONNX and the decoder with RKNN.
+- **Speech to Text (STT)** - Generate transcriptions with the OpenAI Audio Transcriptions endpoint using omniASR-CTC models running with RKNN.
 
 
 ## Documentation
@@ -408,6 +410,29 @@ Example directory structure for multimodal:
 5. Execute the script export_encoder_decoder.py to export the encoder and decoder in ONNX format.
 6. Execute the script export_rknn.py to export the decoder in RKNN format (you must have installed rknn-toolkit version 2.3.2).
 
+
+### **For Transcriptions Generation (STT) Installation**
+1. Download a model from https://huggingface.co/danielferr85/omniASR-ctc-rknn on Hugging Face.
+2. Create a folder for the model inside the models directory in RKLLAMA, for example: **omniasr-ctc:300m**
+3. Copy the model (.rknn) and vocabulary (.txt) files from the chosen model into the newly created model directory in RKLLAMA.
+4. The structure of the model **MUST** be like this:
+
+```
+~/RKLLAMA/models/
+└── omniasr-ctc:300m
+    ├── model.rknn
+    └── vocab.txt
+```
+
+5. Done! You are ready to test the OpenAI endpoint /v1/audio/transcriptions to generate transcriptions. You can add it to OpenWebUI in the Audio section for STT.
+
+**IMPORTANT**
+- The model file can have any name but must end with the extension .rknn
+- The vocabulary file can have any name but must end with the extension .txt
+- You must use rknn-toolkit 2.3.2 for the RKNN conversion because it is the version used by RKLLAMA
+
 ## Configuration
 
 RKLLAMA uses a flexible configuration system that loads settings from multiple sources in a priority order:

pyproject.toml

Lines changed: 3 additions & 1 deletion
@@ -1,6 +1,6 @@
 [project]
 name = "rkllama"
-version = "0.0.53"
+version = "0.0.54"
 authors = [
   { name="NotPunchnox", email="punchnoxpro@gmail.com" },
   { name="TomJacobsUK", email="tom@tomjacobs.co.uk" },
@@ -26,6 +26,8 @@ dependencies = [
     "piper-tts==1.3.0",
     "pydub",
     "ffmpeg",
+    "soxr",
+    "soundfile",
     "rknn-toolkit-lite2 @ file:./src/rkllama/lib/rknn_toolkit_lite2-2.3.2-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl ; python_version == '3.12'",
     "rknn-toolkit-lite2 @ file:./src/rkllama/lib/rknn_toolkit_lite2-2.3.2-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl ; python_version == '3.11'",
     "rknn-toolkit-lite2 @ file:./src/rkllama/lib/rknn_toolkit_lite2-2.3.2-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl ; python_version == '3.10'",

src/rkllama/api/server_utils.py

Lines changed: 59 additions & 0 deletions
@@ -958,3 +958,62 @@ def handle_complete(cls, model_name,input,voice,response_format,stream_format,vo
     # Return the audio
     return audio
 
+
+
+class GenerateTranscriptionsEndpointHandler(EndpointHandler):
+    """Handler for v1/audio/transcriptions endpoint requests"""
+
+    @staticmethod
+    def format_complete_response(text, response_format):
+        """Format a complete non-streaming response for the transcriptions endpoint"""
+
+        response = {
+            "text": text,
+            "usage": {
+                "type": "tokens",
+                "input_tokens": 0,
+                "input_token_details": {
+                    "text_tokens": 0,
+                    "audio_tokens": 0
+                },
+                "output_tokens": 0,
+                "total_tokens": 0
+            }
+        }
+
+        return response
+
+    @classmethod
+    def handle_request(cls, model_name, file, language, response_format, stream):
+        """Process a transcription request with proper format handling"""
+
+        if DEBUG_MODE:
+            logger.debug(f"GenerateTranscriptionsEndpointHandler: processing request for {model_name}")
+
+        # Check if streaming or not
+        if stream:
+            # Streaming is not supported yet for audio transcription
+            return Response("Streaming not supported yet for audio transcription", status=400)
+        else:
+            # Transcription output
+            transcription_text = cls.handle_complete(model_name, file, language, response_format)
+
+            # Return response
+            return cls.format_complete_response(transcription_text, response_format)
+
+    @classmethod
+    def handle_complete(cls, model_name, file, language, response_format):
+        """Handle a complete transcription response"""
+
+        # Use config for models path
+        model_dir = os.path.join(rkllama.config.get_path("models"), model_name)
+
+        # Send the transcription task to the model worker
+        transcription_text = variables.worker_manager_rkllm.generate_transcription(model_name, model_dir, file, language, response_format)
+
+        # Return the transcription text
+        return transcription_text
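The non-streaming response built by format_complete_response above is a plain OpenAI-style transcription object: a `text` field plus usage counters that this handler currently hard-codes to zero. A minimal sketch of the shape a client can rely on — the body here is constructed by hand for illustration rather than fetched from a running server:

```python
import json

# Example response body in the shape produced by format_complete_response.
# Usage counters are currently hard-coded to zero by the handler.
body = json.dumps({
    "text": "hello world",
    "usage": {
        "type": "tokens",
        "input_tokens": 0,
        "input_token_details": {"text_tokens": 0, "audio_tokens": 0},
        "output_tokens": 0,
        "total_tokens": 0,
    },
})

# A client only needs the "text" field for the transcription itself.
resp = json.loads(body)
print(resp["text"])  # hello world
```

Because the `text` and `usage` keys match the OpenAI transcriptions schema, off-the-shelf clients such as OpenWebUI's STT integration can consume the response unchanged.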

0 commit comments
