
Commit 004dc88

Merge pull request #100 from danielferr85/main
(BETA) STT with OmniASR CTC models for the OpenAI Transcriptions endpoint, and restore the default max context size for easier use by beginner users
2 parents 5a13339 + a151fb8 commit 004dc88

File tree

7 files changed: +503 −6 lines changed


README.md

Lines changed: 26 additions & 1 deletion
@@ -1,6 +1,6 @@
 # RKLLama: LLM Server and Client for Rockchip 3588/3576
 
-### [Version: 0.0.53](#New-Version)
+### [Version: 0.0.54](#New-Version)
 
 Video demo ( version 0.0.1 ):
@@ -52,6 +52,7 @@ A server to run and interact with LLM models optimized for Rockchip RK3588(S) an
 * `/v1/embeddings`
 * `/v1/images/generations`
 * `/v1/audio/speech`
+* `/v1/audio/transcriptions`
 - **Tool/Function Calling** - Complete support for tool calls with multiple LLM formats (Qwen, Llama 3.2+, others).
 - **Pull models directly from Huggingface.**
 - **Includes a REST API with documentation.**
@@ -70,6 +71,7 @@ A server to run and interact with LLM models optimized for Rockchip RK3588(S) an
 - **Multimodal Support** - Use Qwen2VL/Qwen2.5VL/Qwen3VL/MiniCPMV4/MiniCPMV4.5/InternVL3.5 vision models to ask questions about images (base64, local file, or URL image address). More than one image in the same request is allowed.
 - **Image Generation** - Generate images with the OpenAI Image generation endpoint using LCM Stable Diffusion 1.5 RKNN models.
 - **Text to Speech (TTS)** - Generate speech with the OpenAI Audio Speech endpoint using Piper TTS models, running the encoder with ONNX and the decoder with RKNN.
+- **Speech to Text (STT)** - Generate transcriptions with the OpenAI Audio Transcriptions endpoint using omniASR-CTC models running with RKNN.
 
 
 ## Documentation
@@ -408,6 +410,29 @@ Example directory structure for multimodal:
 5. Execute the script export_encoder_decoder.py to export the encoder and decoder in ONNX format.
 6. Execute the script export_rknn.py to export the decoder in RKNN format (you must have installed rknn-toolkit version 2.3.2).
 
+
+### **For Transcriptions Generation (STT) Installation**
+1. Download a model from https://huggingface.co/danielferr85/omniASR-ctc-rknn on Hugging Face.
+2. Create a folder for the model inside the models directory in RKLLAMA, for example: **omniasr-ctc:300m**
+3. Copy the model (.rknn) and vocabulary (.txt) files from the chosen model into the newly created model directory in RKLLAMA.
+4. The structure of the model **MUST** be like this:
+
+```
+~/RKLLAMA/models/
+└── omniasr-ctc:300m
+    ├── model.rknn
+    └── vocab.txt
+```
+
+5. Done! You are ready to test the OpenAI endpoint /v1/audio/transcriptions to generate transcriptions. You can add it to OpenWebUI in the Audio section for STT.
+
+**IMPORTANT**
+- The model file can have any name but must end with the extension .rknn
+- The vocabulary file can have any name but must end with the extension .txt
+- You must use rknn-toolkit 2.3.2 for the RKNN conversion because it is the version used by RKLLAMA
+
 ## Configuration
 
 RKLLAMA uses a flexible configuration system that loads settings from multiple sources in a priority order:

pyproject.toml

Lines changed: 3 additions & 1 deletion
@@ -1,6 +1,6 @@
 [project]
 name = "rkllama"
-version = "0.0.53"
+version = "0.0.54"
 authors = [
   { name="NotPunchnox", email="punchnoxpro@gmail.com" },
   { name="TomJacobsUK", email="tom@tomjacobs.co.uk" },
@@ -26,6 +26,8 @@ dependencies = [
     "piper-tts==1.3.0",
     "pydub",
     "ffmpeg",
+    "soxr",
+    "soundfile",
     "rknn-toolkit-lite2 @ file:./src/rkllama/lib/rknn_toolkit_lite2-2.3.2-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl ; python_version == '3.12'",
     "rknn-toolkit-lite2 @ file:./src/rkllama/lib/rknn_toolkit_lite2-2.3.2-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl ; python_version == '3.11'",
     "rknn-toolkit-lite2 @ file:./src/rkllama/lib/rknn_toolkit_lite2-2.3.2-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl ; python_version == '3.10'",

src/rkllama/api/server_utils.py

Lines changed: 59 additions & 0 deletions
@@ -958,3 +958,62 @@ def handle_complete(cls, model_name,input,voice,response_format,stream_format,vo
     # Return the audio
     return audio
 
+
+
+class GenerateTranscriptionsEndpointHandler(EndpointHandler):
+    """Handler for v1/audio/transcriptions endpoint requests"""
+
+    @staticmethod
+    def format_complete_response(text, response_format):
+        """Format a complete non-streaming response for the transcriptions endpoint"""
+
+        response = {
+            "text": text,
+            "usage": {
+                "type": "tokens",
+                "input_tokens": 0,
+                "input_token_details": {
+                    "text_tokens": 0,
+                    "audio_tokens": 0
+                },
+                "output_tokens": 0,
+                "total_tokens": 0
+            }
+        }
+
+        return response
+
+    @classmethod
+    def handle_request(cls, model_name, file, language, response_format, stream):
+        """Process a transcription request with proper format handling"""
+
+        if DEBUG_MODE:
+            logger.debug(f"GenerateTranscriptionsEndpointHandler: processing request for {model_name}")
+
+        # Check if streaming or not
+        if stream:
+            # Streaming is not supported yet for audio transcription
+            return Response("Streaming not supported yet for audio transcription", status=400)
+        else:
+            # Transcription output
+            transcription_text = cls.handle_complete(model_name, file, language, response_format)
+
+            # Return response
+            return cls.format_complete_response(transcription_text, response_format)
+
+    @classmethod
+    def handle_complete(cls, model_name, file, language, response_format):
+        """Handle a complete transcription response"""
+
+        # Use config for models path
+        model_dir = os.path.join(rkllama.config.get_path("models"), model_name)
+
+        # Send the transcription task to the model worker
+        transcription_text = variables.worker_manager_rkllm.generate_transcription(model_name, model_dir, file, language, response_format)
+
+        # Return the transcription text
+        return transcription_text
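The non-streaming response built by format_complete_response above is a plain OpenAI-style transcription object: a `text` field plus usage counters that this handler currently hard-codes to zero. A minimal sketch of the shape a client can rely on — the body here is constructed by hand for illustration rather than fetched from a running server:

```python
import json

# Example response body in the shape produced by format_complete_response.
# Usage counters are currently hard-coded to zero by the handler.
body = json.dumps({
    "text": "hello world",
    "usage": {
        "type": "tokens",
        "input_tokens": 0,
        "input_token_details": {"text_tokens": 0, "audio_tokens": 0},
        "output_tokens": 0,
        "total_tokens": 0,
    },
})

# A client only needs the "text" field for the transcription itself.
resp = json.loads(body)
print(resp["text"])  # hello world
```

Because the `text` and `usage` keys match the OpenAI transcriptions schema, off-the-shelf clients such as OpenWebUI's STT integration can consume the response unchanged.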

0 commit comments
