feat(api): add language detection #14

Merged
hspedro merged 7 commits into main from feat/lang-detection on Mar 14, 2025

@hspedro hspedro commented Mar 14, 2025

Language Detection Feature PR

Overview

This PR introduces automatic language detection capabilities to Babeltron, allowing users to translate text without specifying the source language. The implementation uses the Lingua library, which provides highly accurate language detection even for short text snippets.

Key Features

1. Automatic Language Detection

  • Added support for automatic source language detection in the translation API
  • Users can set src_lang to "auto", leave it empty, or omit it entirely to trigger detection
  • The detected language is returned in the response along with a confidence score
  • Implemented proper error handling for detection failures
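
The trigger rule in the list above can be sketched as a small predicate. This is a minimal sketch rather than Babeltron's actual code: the name `needs_detection` is hypothetical, and ignoring case and surrounding whitespace is an assumption.

```python
from typing import Optional


def needs_detection(src_lang: Optional[str]) -> bool:
    """Return True when the source language should be auto-detected.

    Hypothetical helper covering the three documented cases:
    src_lang omitted, empty, or set to "auto".
    """
    if src_lang is None:  # field omitted from the request
        return True
    # Assumption: treat " AUTO " the same as "auto".
    normalized = src_lang.strip().lower()
    return normalized in ("", "auto")
```

An explicit code such as "fr" returns False, so the caller would skip detection and translate directly.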

2. Dedicated Language Detection Endpoint

  • Added a new /detect endpoint for standalone language detection
  • Returns the detected language code (ISO 639-1) and confidence score
  • Useful for applications that need language identification without translation

3. Lingua Integration

  • Integrated the Lingua library, which supports 75 languages
  • Created a mapping between Lingua's language enum and ISO 639-1 codes
  • Implemented the LinguaDetectionModel class that conforms to the DetectionModelBase interface
  • Added factory pattern for detection model creation and configuration
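
One way the interface, the ISO 639-1 mapping, and the factory could fit together is sketched below. Only the name `DetectionModelBase` comes from this PR. The `detect` signature, the registry, `create_detection_model`, and the toy `KeywordDetectionModel` (used so the sketch runs without Lingua installed) are all assumptions.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, Tuple


class DetectionModelBase(ABC):
    """Interface name taken from the PR; the method signature is assumed."""

    @abstractmethod
    def detect(self, text: str) -> Tuple[str, float]:
        """Return (ISO 639-1 code, confidence in [0, 1])."""


class KeywordDetectionModel(DetectionModelBase):
    """Toy stand-in for a real detector such as Lingua."""

    # Hypothetical mapping from an internal language name to ISO 639-1.
    ISO_CODES: Dict[str, str] = {"FRENCH": "fr", "ENGLISH": "en"}
    KEYWORDS: Dict[str, str] = {"bonjour": "FRENCH", "hello": "ENGLISH"}

    def detect(self, text: str) -> Tuple[str, float]:
        for word in text.lower().split():
            name = self.KEYWORDS.get(word.strip(",.!?"))
            if name is not None:
                return self.ISO_CODES[name], 1.0
        raise ValueError("could not detect language")


# Registry of constructors keyed by model type (hypothetical).
_REGISTRY: Dict[str, Callable[[], DetectionModelBase]] = {
    "keyword": KeywordDetectionModel,
}


def create_detection_model(model_type: str) -> DetectionModelBase:
    """Build the detection model registered under model_type."""
    try:
        return _REGISTRY[model_type.lower()]()
    except KeyError:
        raise ValueError(f"unknown detection model type: {model_type!r}") from None
```

Keeping the mapping inside the model class means callers only ever see ISO 639-1 codes, whichever detector is behind the interface.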

4. Metrics and Monitoring

  • Added a new detection_used label to translation metrics to track when detection is used
  • Enhanced OpenTelemetry tracing with detection-specific spans and attributes
  • Added detection time tracking and logging
  • Metrics can be used to analyze detection usage patterns and performance impact
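
The timing and labeling described above might look roughly like this. It is a sketch under stated assumptions: a `collections.Counter` stands in for the real metrics backend, and `timed_detect` and `record_translation` are hypothetical names. Only the `detection_used` label comes from this PR.

```python
import time
from collections import Counter
from typing import Callable, Tuple

# Stand-in for a real metrics backend: counts requests per label value.
translation_requests: Counter = Counter()


def timed_detect(
    detect: Callable[[str], Tuple[str, float]], text: str
) -> Tuple[str, float, float]:
    """Run detection and return (lang, confidence, elapsed_seconds)."""
    start = time.perf_counter()
    lang, confidence = detect(text)
    elapsed = time.perf_counter() - start
    return lang, confidence, elapsed


def record_translation(detection_used: bool) -> None:
    """Count one translation under the detection_used label ("true"/"false")."""
    translation_requests[("detection_used", str(detection_used).lower())] += 1
```

Comparing the "true" and "false" counts gives the share of requests that relied on detection, and the elapsed time can be logged alongside the result.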

5. Testing

  • Added comprehensive unit tests for the detection model
  • Added tests for the detection endpoint
  • Added tests for automatic detection in the translation endpoint
  • Updated existing tests to work with the new detection features

Configuration

  • Added DETECTION_MODEL_TYPE environment variable to configure the detection model
  • Default model is set to "lingua"
  • Detection model is loaded at application startup
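
Reading the setting at startup could be sketched as follows. The variable name `DETECTION_MODEL_TYPE` and the "lingua" default come from this PR. The function name, the case-insensitive matching, and the list of supported values are assumptions.

```python
import os
from typing import Mapping

# Assumption: "lingua" is the only registered model type.
SUPPORTED_DETECTION_MODELS = ("lingua",)


def get_detection_model_type(env: Mapping[str, str] = os.environ) -> str:
    """Return the configured detection model type, defaulting to "lingua"."""
    value = env.get("DETECTION_MODEL_TYPE", "").strip().lower() or "lingua"
    if value not in SUPPORTED_DETECTION_MODELS:
        raise ValueError(f"unsupported DETECTION_MODEL_TYPE: {value!r}")
    return value
```

Failing fast on an unknown value at startup surfaces a misconfiguration before the first request arrives.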

Performance Considerations

  • Lingua is optimized for speed and accuracy
  • Detection adds minimal overhead to translation requests
  • Detection results are logged with timing information for performance monitoring

Documentation

  • Updated API documentation with information about automatic language detection
  • Added examples showing how to use the detection features
  • Added descriptions of detection confidence scores

Usage Examples

Translation with Automatic Detection

POST /api/v1/translate
{
  "text": "Bonjour, comment ça va?",
  "tgt_lang": "en"
}

Response:

{
  "translation": "Hello, how are you?",
  "model_type": "m2m100",
  "architecture": "cpu_compiled",
  "detected_lang": "fr",
  "detection_confidence": 0.95
}

Standalone Language Detection

POST /api/v1/detect
{
  "text": "Bonjour, comment ça va?"
}

Response:

{
  "language": "fr",
  "confidence": 0.95
}
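
From a client, the detection call above could be built with the standard library alone. This is a minimal sketch: `build_detect_request` is a hypothetical helper, and the base URL is supplied by the caller because the PR does not state a host or port.

```python
import json
import urllib.request


def build_detect_request(base_url: str, text: str) -> urllib.request.Request:
    """Build a POST request for the /api/v1/detect endpoint."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/v1/detect",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Usage, assuming a server is reachable at the given address:
#   req = build_detect_request("http://localhost:8000", "Bonjour, comment ça va?")
#   with urllib.request.urlopen(req) as resp:
#       body = json.load(resp)  # expected keys: "language", "confidence"
```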

Future Improvements

  • Add caching for detection results to improve performance for repeated texts
  • Explore additional detection models for comparison
  • Add language-specific optimizations for common languages
  • Consider adding language detection confidence thresholds for fallback to user-specified languages
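
The confidence-threshold idea in the last item could take roughly this shape. It is a sketch of a possible design, not something this PR implements: the function name, the 0.5 default threshold, and raising an error when no fallback exists are all assumptions.

```python
from typing import Optional


def choose_source_lang(
    detected_lang: str,
    confidence: float,
    user_lang: Optional[str] = None,
    threshold: float = 0.5,  # assumption: arbitrary illustrative cutoff
) -> str:
    """Prefer the detected language unless its confidence is below threshold."""
    if confidence >= threshold:
        return detected_lang
    if user_lang:
        # Low-confidence detection: fall back to the user-specified language.
        return user_lang
    raise ValueError(
        f"detection confidence {confidence:.2f} is below {threshold:.2f} "
        "and no fallback language was provided"
    )
```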

This feature enhances Babeltron's usability by removing the need for users to know the source language of their text, making the translation service more accessible and user-friendly.

hspedro added 4 commits March 14, 2025 10:36
We will implement detection models, so we need to scope and
adjust the project structure inside the models folder
This commit implements a new language detection endpoint that uses the Lingua
library for highly accurate language identification. The implementation:

- Adds a new /api/v1/detect endpoint that accepts text and returns detected
  language with confidence score
- Implements LinguaDetectionModel with support for 75 languages
- Updates factory pattern to use Lingua as the default detection model

Lingua was chosen for its superior accuracy on short texts and informal
language, making it ideal for real-world applications despite supporting
fewer languages than the translation models.
The /api/v1/translate endpoint can now detect the source language if:

- src_lang is set to "auto"
- src_lang is empty ("")
- src_lang is missing

In any of these cases, we use the detection model (Lingua) to detect the
source language and proceed with translation. The response then includes
two additional attributes:

- detected_lang: string
- detection_confidence: float
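
A response model carrying the two optional attributes could be sketched like this. The field names match the example response in the PR description. The class name, the use of a dataclass, and omitting the fields (rather than returning null) when detection was not used are assumptions.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class TranslateResponse:
    translation: str
    model_type: str
    architecture: str
    # Set only when the source language was auto-detected.
    detected_lang: Optional[str] = None
    detection_confidence: Optional[float] = None

    def to_dict(self) -> dict:
        """Serialize, dropping fields that are None."""
        return {k: v for k, v in asdict(self).items() if v is not None}
```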
@hspedro hspedro self-assigned this Mar 14, 2025
hspedro added 3 commits March 14, 2025 13:55
The worker runs out of memory (OOM) because of the Docker
image size. Fix this to enable the build in CI
@hspedro hspedro merged commit 183659e into main Mar 14, 2025
2 checks passed
@hspedro hspedro deleted the feat/lang-detection branch March 14, 2025 17:36