Conversation
✅ DCO Check Passed. Thanks @BBC-Esq, all your commits are properly signed off. 🎉
Merge Protections: Your pull request matches the following merge protections and will not be merged until they are valid. 🟢 Enforce conventional commit: Wonderful, this rule succeeded. Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
Signed-off-by: BBC, Esquire <bbc@chintellalaw.com>
Force-pushed from 3f7b50a to 85b6fef
Previously it either chose MLX or fell back to vanilla Whisper. This auto-chooses WhisperS2T if the criteria are met. Signed-off-by: Chintella-Esq <bbc@chintellalaw.com>
Here goes nothing. I put a lot of time and effort into this PR and I jumped through all the freakin hoops as far as certifying the PR, correcting the syntax... Hopefully big company IBM does not equal shitty customer experience and/or online culture when simply trying to contribute to a cool idea. I don't even see a freakin button to request a review by whomever maintains this repo... Oh well, here goes nothing!
Codecov Report: ❌ Patch coverage is …
I, BBC, Esquire <bbc@chintellalaw.com>, hereby add my Signed-off-by to this commit: e6eaa55. Signed-off-by: BBC, Esquire <bbc@chintellalaw.com>
BBC-Esq
left a comment
I'm trying my best to adhere to the multiple bots and their instructions...any help would be much appreciated. Thanks. lol.
Hi @BBC-Esq, we'll review it shortly and let you know if anything needs clarification or further work. We take all contributions seriously and will keep you posted on the status of this PR. If you have any questions in the meantime, just drop a comment here or open an issue. Thanks again for your work!
Thanks for your response. I can try to change the PR to remove the unrelated scripts if need be. They were included by accident when I ran ruff on my entire fork's codebase, so... or feel free to modify the PR to the discrete topic at hand. I was taken aback by the number of bots all with different criteria, but I'll review the links you sent.
Any chance this could be reviewed?
Signed-off-by: Chintella-Esq <bbc@chintellalaw.com>
Hello, you said you'd be following up shortly? Is there anything preventing this from being merged? @ceberam
@BBC-Esq let me follow up. We were right in the middle of a big refactoring to clean up the pipelines/stages/models/etc. (see the merge conflicts). This is now getting finalized.
ceberam
left a comment
Thanks again @BBC-Esq for the contribution and for exploring improvements in the ASR backend.
I’ve reviewed the PR and I’m not comfortable with the current form for several reasons:
1. Licensing
The repository currently does not contain a LICENSE file. Without an explicit open-source license, there is no clear legal grant of rights to use, modify, or redistribute the code.
Since this project is part of the LF AI & Data, we need all dependencies to have a clear, identifiable open-source license. Even though this repository appears to be a fork of WhisperS2T (which is MIT-licensed), that license must be explicitly included in the forked repository for it to be valid for redistribution.
Until this is clarified and properly addressed upstream, we cannot safely introduce this dependency.
2. Project health
The repository currently has:
- No visible community adoption (0 stars/forks)
- Limited signals of review
- No documented benchmark results supporting the performance claims
If performance is the main motivation, it would help to include reproducible benchmarks or comparative results demonstrating the advantage over existing ASR backend models. Without that, it is difficult to justify introducing an additional dependency with unclear maintenance status.
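A reproducible benchmark of the kind requested here could be as simple as timing each backend on the same audio file and reporting a median over several runs. The sketch below is illustrative only: the `transcribe` callables and the audio path are placeholders, not APIs from this PR.

```python
import statistics
import time

def benchmark(name, transcribe, audio_path, runs=3):
    """Time a transcription callable over several runs and report the median.

    `transcribe` is a placeholder for whichever backend is under test
    (e.g. vanilla Whisper vs. WhisperS2T); only wall-clock time is measured.
    """
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        transcribe(audio_path)
        timings.append(time.perf_counter() - start)
    median = statistics.median(timings)
    print(f"{name}: median {median:.2f}s over {runs} runs")
    return median

# Hypothetical usage, one call per backend on identical input:
# benchmark("vanilla-whisper", vanilla_transcribe, "sample.wav")
# benchmark("whisper-s2t", s2t_transcribe, "sample.wav")
```

Running both backends on identical hardware and input, with warm caches, would give the comparative numbers needed to justify the dependency.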
3. PR scope and conflicts
This PR also includes styling and formatting changes across multiple files that are unrelated to the new dependency. While some of those changes may be acceptable in isolation (e.g., I also prefer str | None to Optional[str]), they are out of scope for this feature and make the review more difficult. Our CONTRIBUTING.md guidelines already describe how to run the automatic style checks locally, which should prevent unrelated formatting diffs. In addition, as pointed out by @PeterStaar-IBM, recent changes have introduced some merge conflicts that should be addressed once the refactoring has been finalized.
I would be happy to re-review again once:
- The upstream repository clearly includes an appropriate open-source license.
- There is objective evidence supporting the claimed performance benefits.
- The PR is cleaned up to only include changes strictly required for the ASR feature.
- The merge conflicts are resolved.
Thanks again for the contribution and for understanding the need to keep compliance, scope, and maintainability in mind.
@ceberam I appreciate such a thorough and professional response. I, too, care about the code base and your message makes sense.
Regarding "unclear maintenance" status, I totally get that. You don't want to modify your code base, add a dependency (even an optional one), and then have part of your code base go stale because someone abandons WhisperS2T-Reborn. At a minimum, it would be a hassle to unwind the integration, so to speak. Since I maintain WhisperS2T-Reborn, it makes sense to explain my background a little to hopefully assuage your concern.

First off, I'm a lawyer by trade but I program as a hobby. I originally became fascinated by LLMs and machine learning upon learning about vector databases and how they enable the searchability of massive amounts of data; for example, data obtained during discovery in a lawsuit. In my profession, there are frequently large amounts of data in the form of text messages, emails, audio recordings, video clips, and so on. As an offshoot of that, transcribing audio and then being able to search it semantically is a huge benefit for litigation purposes. It means I no longer have to listen to hours of audio to pinpoint the minute and second where something important was said, which ultimately saves the client billables. This is essentially what led me to Whisper models. The vanilla Whisper code, while revolutionary at the time, is painstakingly slow, which led me to faster backends like CTranslate2, upon which WhisperS2T-Reborn is based. That was 3-4 years ago now, and I've been working with Whisper, vector databases, and other related technologies ever since.

I was frequently in contact with the maintainer of the original WhisperS2T repository until he abandoned it in 2024 after being offered a high-paying job, presumably using WhisperS2T to pad his resume. As you know, this happens frequently; nothing wrong with that, but often as part of a new job you can't work on competing open-source projects and/or just don't have the time. At any rate, I decided to maintain it since it was slowly becoming incompatible with other APIs, such as transformers, as their code bases improved/changed.

I also maintain the models that WhisperS2T-Reborn relies upon by default, which you can see here on Hugging Face. I'd like to highlight that all of the models are converted from the original float32 Whisper models, NOT the float16 versions that OpenAI has decided to upload, for better or worse. See the config here to see what I'm referring to. Originally, OpenAI uploaded float32 versions of all models but then replaced them with float16 versions. I disagreed with this because, yes, float16 is compatible with more GPUs and the file size is smaller, but if you're running on CPU it just has to be upcast to float32 anyway, and you cannot regain the quality lost in the initial float32-to-float16 conversion. Moreover, more recent GPUs support bfloat16, and people wishing for the additional stability that bfloat16 offers won't get it; after all, converting a float16 model to bfloat16 cannot restore the lost precision. In short, all of my models are converted from the original float32 versions, so there is no quality loss.

The WhisperS2T-Reborn library can choose the best model based on a user's hardware. For GPUs that support bfloat16, it'll choose bfloat16; for older GPUs it falls back to float16. CPUs use float32, and a user can still choose float32 on GPU for maximum quality. This required downloading the original float32 weights of all Whisper models from the history of the various repositories, adding any updated configuration JSON files (e.g., if OpenAI found a bug in something), and then converting them to the CTranslate2 format.

My repository provides float32, float16, and bfloat16 versions of all Whisper models, and the CTranslate2 backend also supports int8 if that's desirable. Moreover, CTranslate2 has an excellent feature whereby it converts at runtime if need be. For example, if you accidentally downloaded a bfloat16 model but your GPU only supports float16, it automatically converts and runs in float16 without a hitch. Granted, there's a very small quality loss, but at least it will not throw an error. CTranslate2 also recently added ROCm support in addition to CUDA, and already supports fast execution on both Intel and AMD CPUs.

Basically, I've been working with Whisper models since approximately 2023 and don't plan to quit anytime soon, if that's any assurance regarding the "health" of the WhisperS2T-Reborn library. Regarding item 3, I'll redo the PR to only pertain to the subject matter; it's just that without those additional changes your automated checks throw errors. I'd be willing to do a separate PR to resolve those errors, however. I'll also run the style checks locally like you suggest. Cheers!
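The hardware-based precision policy described above (bfloat16 on GPUs that support it, float16 on older GPUs, float32 on CPU) can be sketched as a pure function. This is an illustration of the stated policy, not WhisperS2T-Reborn's actual code; the function name and parameters are placeholders.

```python
def pick_compute_type(cuda_available: bool, bf16_supported: bool) -> str:
    """Choose a CTranslate2 compute type for the available hardware.

    Policy (as described above): float32 on CPU to avoid a lossy downcast,
    bfloat16 on GPUs that support it, float16 on older CUDA GPUs.
    """
    if not cuda_available:
        return "float32"
    if bf16_supported:
        return "bfloat16"
    return "float16"
```

In practice, the two booleans would come from a runtime check (e.g., PyTorch's `torch.cuda.is_available()` and `torch.cuda.is_bf16_supported()`), and CTranslate2's runtime conversion would cover any mismatch between the downloaded weights and the chosen compute type.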
Signed-off-by: Chintella-Esq <bbc@chintellalaw.com>
Add Fast WhisperS2T (based on CTranslate2) ASR Backend
This PR adds support for the `whisper-s2t-reborn` library as a high-performance ASR backend option, providing significantly faster transcription speeds. Within the pipeline, it automatically chooses this backend if the criteria are met (e.g., CUDA is available). Previously, the pipeline either chose MLX or fell back to vanilla OpenAI Whisper; vanilla Whisper is now the final fallback if neither MLX nor WhisperS2T is available.

## New Files/Classes
- `InlineAsrWhisperS2TOptions`: configuration class for WhisperS2T with options for compute type, batch size, beam size, and more
- `_WhisperS2TModel`: model wrapper class implementing the WhisperS2T transcription pipeline

## Modified Files
- `pipeline_options_asr_model.py`: added `WHISPER_S2T` to the `InferenceAsrFramework` enum; added the `InlineAsrWhisperS2TOptions` configuration class
- `asr_model_specs.py`: added 12 pre-configured model specs (tiny, base, small, medium, large-v3, plus distilled variants); extended the `AsrModelType` enum
- `asr_pipeline.py`: added the `_WhisperS2TModel` class with a device-parsing fix for CTranslate2 compatibility; updated `AsrPipeline` to handle the new backend
- `cli/main.py`: added CLI support for all WhisperS2T model variants

## Notes
- Added a `_parse_device()` helper to handle device strings like `"cuda:0"` → `("cuda", 0)` for CTranslate2 compatibility.

I will give you the script I used to test this below. However, PLEASE note that I always use a `set_cuda_paths` function that relies on pip-installed versions of CUDA/cuDNN. This function sets the relevant CUDA-related paths to the pip-installed locations so I don't have to install/reinstall various versions of CUDA. If you want to test with the stereotypical system-wide installation, don't use `set_cuda_paths`.
That said, to run the test script: (1) create a venv, (2) activate it, (3) run the following command, and (4) run the script pasted below:
HERE IS THE SCRIPT: