We will implement detection models, so we need to scope and adjust the project structure inside the models folder.
This commit implements a new language detection endpoint that uses the Lingua library for highly accurate language identification. The implementation:
- Adds a new /api/v1/detect endpoint that accepts text and returns the detected language with a confidence score
- Implements LinguaDetectionModel with support for 75 languages
- Updates the factory pattern to use Lingua as the default detection model

Lingua was chosen for its superior accuracy on short texts and informal language, making it ideal for real-world applications despite supporting fewer languages than the translation models.
The /api/v1/translate endpoint can now detect the source language if:
- src_lang is set to "auto"
- src_lang is an empty string ("")
- src_lang is missing

In any of these cases, the detection model (Lingua) detects the source language and translation then proceeds. The response then includes two additional attributes:
- detected_lang: string
- detection_confidence: float
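The three trigger conditions and the extra response attributes can be illustrated with a small sketch. The names `needs_detection`, `resolve_source`, and `ResolvedSource` are hypothetical, and the `detect` callable stands in for whatever the detection model exposes; only the `src_lang`, `detected_lang`, and `detection_confidence` fields come from the description above.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


def needs_detection(src_lang: Optional[str]) -> bool:
    """True when src_lang is missing, empty, or the literal "auto"."""
    return src_lang is None or src_lang in ("", "auto")


@dataclass
class ResolvedSource:
    src_lang: str
    # Populated only when detection actually ran.
    detected_lang: Optional[str] = None
    detection_confidence: Optional[float] = None


def resolve_source(
    text: str,
    src_lang: Optional[str],
    detect: Callable[[str], Tuple[str, float]],
) -> ResolvedSource:
    """Use the caller's src_lang if given, otherwise fall back to detection."""
    if not needs_detection(src_lang):
        return ResolvedSource(src_lang=src_lang)
    lang, confidence = detect(text)
    return ResolvedSource(
        src_lang=lang, detected_lang=lang, detection_confidence=confidence
    )
```

With this shape, a handler would add `detected_lang` and `detection_confidence` to the response only when they are not `None`, so explicit-language requests keep their original response format.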
The worker runs out of memory (OOM) because of the Docker image size. Fix this so the build can run in CI.
Language Detection Feature PR
Overview
This PR introduces automatic language detection capabilities to Babeltron, allowing users to translate text without specifying the source language. The implementation uses the Lingua library, which provides highly accurate language detection even for short text snippets.
Key Features
1. Automatic Language Detection
- Set src_lang to "auto", leave it empty, or omit it entirely to trigger detection
2. Dedicated Language Detection Endpoint
- /detect endpoint for standalone language detection
3. Lingua Integration
- LinguaDetectionModel class that conforms to the DetectionModelBase interface
4. Metrics and Monitoring
- Added detection_used label to translation metrics to track when detection is used
5. Testing
Configuration
- DETECTION_MODEL_TYPE environment variable to configure the detection model
Performance Considerations
Documentation
Usage Examples
Translation with Automatic Detection
Response:
{
  "translation": "Hello, how are you?",
  "model_type": "m2m100",
  "architecture": "cpu_compiled",
  "detected_lang": "fr",
  "detection_confidence": 0.95
}
Standalone Language Detection
Response:
{
  "language": "fr",
  "confidence": 0.95
}
Future Improvements
This feature enhances Babeltron's usability by removing the need for users to know the source language of their text, making the translation service more accessible and user-friendly.