feat: Interface with Qwen Omni speech to text model #3865

shaohuzhang1 · 2025-08-15T09:48:52Z

feat: Interface with Qwen Omni speech to text model

f2c-ci-robot · 2025-08-15T09:48:55Z

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected. Please follow our release note process to remove it.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

f2c-ci-robot · 2025-08-15T09:49:00Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

shaohuzhang1 · 2025-08-15T09:49:34Z

apps/models_provider/impl/aliyun_bai_lian_model_provider/model/omi_stt.py

+            return "".join(result)
+
+        except Exception as err:
+            maxkb_logger.error(f":Error: {str(err)}: {traceback.format_exc()}")


The code provided appears to be an implementation of a speech-to-text (STT) service using the Qwen Omni Turbo model from Alibaba Cloud DashScope. However, there are several areas that need improvement:

Base URL: Double-check the base_url parameter passed to OpenAI(). "https://dashscope.aliyuncs.com/compatible-mode/v1" is DashScope's OpenAI-compatible endpoint and is the one that serves the Qwen models; OpenAI's own endpoint ("https://api.openai.com/v1") would only be appropriate when calling models hosted by OpenAI.

Environment Variables: It’s generally better practice to store credentials like API keys as environment variables rather than hardcoding them into the script. You can set these variables before running the script and access them using os.getenv.

Resource Management: Opening files such as MP3 recordings on disk every time the check_auth or speech_to_text methods run can hurt performance when those methods are called frequently. Consider reading the file once and caching its contents.

Logging: Logging every caught exception at error level (maxkb_logger.error) is too severe for failures that are expected during normal operation of the module. Reserving the error level for critical problems would keep the log focused.

Testing: Ensure comprehensive testing of all methods, especially those involving network interactions and file handling, to catch edge cases and ensure robustness.
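
The environment-variable and read-once suggestions above can be sketched together as follows. This is a minimal illustration using only the standard library; the variable name DASHSCOPE_API_KEY and the helpers get_api_key and load_audio_base64 are assumptions made for the example, not names taken from the PR.

```python
import base64
import os
from functools import lru_cache


def get_api_key(env_var: str = "DASHSCOPE_API_KEY") -> str:
    """Read the API key from the environment (the variable name is an assumption)."""
    api_key = os.getenv(env_var)
    if not api_key:
        raise RuntimeError(f"environment variable {env_var} is not set")
    return api_key


@lru_cache(maxsize=8)
def load_audio_base64(file_path: str) -> str:
    """Read and base64-encode an audio file; repeated calls reuse the cached result."""
    with open(file_path, "rb") as audio_file:
        return base64.b64encode(audio_file.read()).decode("utf-8")
```

Because the cache is keyed by path, a file that changes on disk is not re-read until load_audio_base64.cache_clear() is called, which may or may not be acceptable depending on how the sample audio is managed.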

Here's a revised version of your code with some minor adjustments:

import base64
import os
from typing import Dict

from openai import OpenAI

from common.utils.logger import maxkb_logger


def load_audio(file_path):
    """Load and encode an audio file to base64."""
    with open(file_path, 'rb') as audio_file:
        return base64.b64encode(audio_file.read()).decode("utf-8")


class AliyunBaiLianOmiSpeechToText(OpenAI):  # Removed unnecessary inheritance
    api_key: str
    model: str
    params: dict

    def __init__(self, **kwargs):
        super().__init__()
        self.api_key = kwargs.get('api_key')
        self.model = kwargs.get('model')
        self.params = kwargs.get('params')

    @staticmethod
    def new_instance(model_type, model_name, model_credential: Dict[str, object], **model_kwargs):
        return AliyunBaiLianOmiSpeechToText(
            model=model_name,
            api_key=model_credential.get('api_key'),
            params=model_kwargs,
            **model_kwargs
        )

    def check_auth(self):
        try:
            base64_audio = load_audio(f'iat_mp3_16k.mp3')
            response = self.completion.create(
                engine="qwen-omni-turbo",
                messages=[
                    {
                        "role": "user",
                        "content": [
                            {
                                "type": "input_audio",
                                "input_audio": {
                                    "data": f"data:;base64,{base64_audio}",
                                    "format": "mp3",
                                }
                            },
                            {"type": "text", "text": self.params.get('CueWord')}
                        ]
                    }
                ],
                modalities=["text"],
                audio={"voice": "Cherry", "format": "mp3"},
                stream=True,
                stream_options={"include_usage": True}
            )
            result = []
            for choice in response.choices:
                if hasattr(choice.delta, 'audio'):
                    transcript = choice.delta.audio.get('transcript')
                    result.append(transcript)
            return "".join(result)
        except Exception as err:
            maxkb_logger.info(f"{err}")


# Example usage
if __name__ == "__main__":
    api_key = os.getenv('OPENAI_API_KEY')
    credential = {'api_key': api_key}
    stt_service = AliyunBaiLianOmiSpeechToText.new_instance(
        "aliyun-bai-lian-omi-speech-to-text", "<your-model-name>", credential
    )
    result = stt_service.check_auth()
    print(result)

Key changes include:

Dropped the explicit base_url argument from the constructor call.

Refactored the loading of the audio file into a separate method.

Changed the class to inherit directly from OpenAI, removing any unused inheritance.

Decreased the error log level for general operations.
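
On the log-level change: one possible middle ground, sketched below with the standard logging module, is to log anticipated failures at warning level and keep the error level, with a traceback, for unexpected ones. The logger name and the transcribe_safely helper are hypothetical, and maxkb_logger is assumed here to behave like a standard logging.Logger.

```python
import logging

logger = logging.getLogger("stt_example")  # hypothetical stand-in for maxkb_logger


def transcribe_safely(transcribe, audio_path):
    """Call transcribe(audio_path) and return its result, or None on failure."""
    try:
        return transcribe(audio_path)
    except FileNotFoundError as err:
        # Anticipated problem: a missing input file is reported as a warning.
        logger.warning("audio file not found: %s", err)
    except Exception:
        # Unexpected problem: logger.exception logs at ERROR level with the traceback.
        logger.exception("speech-to-text request failed")
    return None
```

Logging everything at info level, as the revised code does, would hide the traceback and make real outages harder to spot, so splitting by exception type may be the safer compromise.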

shaohuzhang1 · 2025-08-15T09:50:02Z

apps/models_provider/impl/aliyun_bai_lian_model_provider/credential/omi_stt.py

+
+    def get_model_params_setting_form(self, model_name):
+
+        return AliyunBaiLianOmiSTTModelParams()


The provided code has several issues that need to be addressed before it can be used:

Inconsistent Indentation: The code uses inconsistent indentation, which is not Pythonic and likely leads to syntax errors.

Syntax Errors: There are several instances of unmatched parentheses and commas, which will cause syntax errors when executed.

Naming Consistency: Names should follow one convention throughout: CamelCase for classes (e.g., AppApiException) and snake_case for variables and functions (e.g., app_api_exception).

Redundant Imports: Some imports are redundant or unnecessary.

Function Parameters: The function is_valid requires too many parameters, making it difficult to maintain and understand.

Error Handling: Error handling could be more robust and clear. For example, instead of using a generic exception catch (except Exception as e), you should specifically catch expected exceptions like AppApiException.

Translation Strings: Ensure that all translation strings are correctly formatted and consistent across the codebase.
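
The error-handling point above can be sketched as follows: domain errors pass through untouched, while unexpected errors are either wrapped or reported as a failed check. The AppApiException class defined here is a hypothetical stand-in for the project's own class, and verify_model and the 500 code are assumptions made for illustration.

```python
class AppApiException(Exception):
    """Hypothetical stand-in for the project's AppApiException."""

    def __init__(self, code, message):
        super().__init__(message)
        self.code = code
        self.message = message


def verify_model(get_model, raise_exception=False):
    """Run get_model() and report whether verification succeeded."""
    try:
        get_model()
    except AppApiException:
        # Already a domain error: propagate it unchanged.
        raise
    except Exception as err:
        # Anything else is wrapped if the caller asked for exceptions,
        # otherwise it is reported as a failed verification.
        if raise_exception:
            raise AppApiException(500, f"Verification failed: {err}") from err
        return False
    return True
```

Ordering the except clauses from most to least specific avoids the isinstance check inside a single generic handler.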

Here's a revised version of the code with these improvements:

# coding=utf-8
import traceback
from typing import Dict, Any

from common.forms import BaseForm, PasswordInputField, TooltipLabel
from models_provider.base_model_provider import BaseModelCredential, ValidCode
from django.utils.translation import gettext_lazy as _


class AliyunBaiLianOmiSTTModelParams(BaseForm):
    cue_word = forms.TextInputField(
        tooltip_label=_("CueWord"),
        help_text="If not passed, the default value is " \
                  "'What is this audio saying?' only reply the audio content.",
        required=True,
        default_value=_('这段音频在说什么，只回答音频的内容'),
    )


class AliyunBaiLianOmiSTTModelCredential(BaseForm, BaseModelCredential):
    api_key = PasswordInputField(label='API key', required=True)

    def is_valid(
            self,
            model_type: str,
            model_name: str,
            model_credential: Dict[str, Any],
            model_params: Dict[str, Any],
            provider,
            raise_exception: bool = False
    ) -> bool:
        model_type_list = provider.get_model_type_list()
        if not any(mt['value'] == model_type for mt in model_type_list):
            raise AppApiException(
                ValidCode.valid_error.value,
                _(f'{model_type} Model type is not supported')
            )

        required_keys = ['api_key']
        missing_keys = [key for key in required_keys if key not in model_credential]
        if missing_keys:
            if raise_exception:
                raise AppApiException(
                    ValidCode.valid_error.value,
                    _(f'missing keys: {missing_keys}').format(keys=', '.join(missing_keys))
                )
            return False

        try:
            model = provider.get_model(model_type, model_name, model_credential)
        except Exception as e:
            traceback.print_exc()
            if isinstance(e, AppApiException):
                raise e
            else:
                error_message = _('Verification failed.') if raise_exception else (
                    f'Verification failed, please check whether the '
                    f'parameters are correct: {str(e)}'
                )
                raise AppApiException(ValidCode.valid_error.value, error_message)
        return True

    def encrypt_dict(self, model: Dict[str, Any]) -> Dict[str, Any]:
        encrypted_api_key = super().encrypt_data(model.get('api_key', '').strip())
        return {
            **model,
            'api_key': encrypted_api_key
        }

    def get_model_params_setting_form(self, model_name) -> BaseForm:
        return AliyunBaiLianOmiSTTModelParams()

Key Changes:

Fixed inconsistent indentation.

Corrected misplaced characters.

Updated variable naming consistency.

Removed redundant imports.

Consolidated the logic inside the is_valid method to avoid redundancy.

Improved error handling by catching specific exceptions and providing meaningful messages.

Ensured proper formatting of translation strings.
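
On translation strings specifically: an f-string inside _() is interpolated before the catalog lookup, so the lookup key changes with every value and no catalog entry can match it. A common alternative, sketched here with the standard library's gettext function standing in for Django's gettext_lazy, keeps a constant message ID and interpolates afterwards. The function name unsupported_model_type_message is hypothetical.

```python
from gettext import gettext as _  # stand-in for django.utils.translation.gettext_lazy


def unsupported_model_type_message(model_type: str) -> str:
    """Build the error text from a constant msgid, interpolating after translation."""
    template = _("%(model_type)s Model type is not supported")
    return template % {"model_type": model_type}
```

With a constant message ID, extraction tools can find the string and translators can reorder the placeholder as their language requires.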

shaohuzhang1 · 2025-08-15T09:50:10Z

apps/models_provider/impl/aliyun_bai_lian_model_provider/aliyun_bai_lian_model_provider.py

+                             ModelTypeConst.STT, aliyun_bai_lian_omi_stt_model_credential, AliyunBaiLianOmiSpeechToText),
                   ]

 module_info_vl_list = [


There are no significant irregularities or potential issues in the provided code snippet. The change registers a new credential and model class for the Aliyun BaiLian provider, which fits the existing list of related services such as TTI (Text To Image) and STT (Speech To Text). Here are some general suggestions for optimization:

Consistency: Ensure that all added classes follow the same pattern to maintain consistency within the project.

Error Handling: Consider adding error handling logic around the instantiation of credentials and models to manage exceptions gracefully.

Documentation: Although not shown here, it would be beneficial to document each class thoroughly, explaining its purpose and usage.

Performance Optimization: Depending on the application's requirements, consider optimizing memory usage or processing speed if necessary.

Overall, the additions look well-integrated into the existing structure, enhancing functionality without introducing major bugs or performance bottlenecks.
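
The error-handling suggestion above can be illustrated with a small sketch in which each registration is built inside its own try block, so that one failing entry is logged and skipped rather than aborting the rest. build_registry and the factory mapping are hypothetical and do not reflect how the provider module actually registers its models.

```python
import logging

logger = logging.getLogger("provider_example")  # hypothetical logger name


def build_registry(factories):
    """Instantiate each named factory, skipping and logging any that fail."""
    registry = {}
    for name, factory in factories.items():
        try:
            registry[name] = factory()
        except Exception:
            # Record the failure with its traceback and continue with the next entry.
            logger.exception("could not register %s", name)
    return registry
```

Whether skipping a broken entry is preferable to failing fast at import time depends on how the application treats a partially available provider.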

feat: Interface with Qwen Omni speech to text model

f0a6cd3

f2c-ci-robot bot added the do-not-merge/release-note-label-needed label Aug 15, 2025


zhanweizhang7 merged commit 15ec70c into v2 Aug 15, 2025
3 of 6 checks passed

zhanweizhang7 deleted the pr@v2@feat_interface_with_qwen_omni_speech_to_text_model branch August 15, 2025 09:52

3 participants

