feat(api): add language detection by hspedro · Pull Request #14 · hspedro/babeltron

hspedro · 2025-03-14T16:51:05Z

Language Detection Feature PR

Overview

This PR introduces automatic language detection capabilities to Babeltron, allowing users to translate text without specifying the source language. The implementation uses the Lingua library, which provides highly accurate language detection even for short text snippets.

Key Features

1. Automatic Language Detection

Added support for automatic source language detection in the translation API
Users can set src_lang to "auto", leave it empty, or omit it entirely to trigger detection
The detected language is returned in the response along with a confidence score
Implemented proper error handling for detection failures
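The trigger rule above can be sketched as a small helper. This is only an illustration, not the code from this PR: `needs_detection` and `AUTO` are hypothetical names, and the case-insensitive match and whitespace stripping are assumptions.

```python
from typing import Optional

AUTO = "auto"  # sentinel value described in the PR (constant name is hypothetical)


def needs_detection(src_lang: Optional[str]) -> bool:
    """Return True when the source language should be detected automatically.

    Covers the three cases from the PR description: src_lang omitted (None),
    empty (""), or set to "auto". Case-insensitive matching and whitespace
    stripping are assumptions of this sketch.
    """
    if src_lang is None:
        return True
    normalized = src_lang.strip().lower()
    return normalized == "" or normalized == AUTO
```

With this helper, `needs_detection("fr")` is false, so an explicitly supplied source language skips detection.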

2. Dedicated Language Detection Endpoint

Added a new /detect endpoint for standalone language detection
Returns the detected language code (ISO 639-1) and confidence score
Useful for applications that need language identification without translation

3. Lingua Integration

Integrated the Lingua library, which supports 75 languages
Created a mapping between Lingua's language enum and ISO 639-1 codes
Implemented the LinguaDetectionModel class that conforms to the DetectionModelBase interface
Added factory pattern for detection model creation and configuration
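A rough sketch of how a base interface and a factory might fit together follows. Only the names `DetectionModelBase` and `LinguaDetectionModel` come from this PR; the `detect` signature, the registry, `get_detection_model`, and the toy `KeywordDetectionModel` are assumptions made for illustration. The toy model stands in for the Lingua-backed class so the example runs without extra dependencies.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, Tuple


class DetectionModelBase(ABC):
    """Interface every detection model implements (method shape assumed)."""

    @abstractmethod
    def detect(self, text: str) -> Tuple[str, float]:
        """Return (ISO 639-1 code, confidence between 0 and 1)."""


class KeywordDetectionModel(DetectionModelBase):
    """Toy stand-in for LinguaDetectionModel; NOT a real detector."""

    _HINTS = {"bonjour": "fr", "hello": "en", "hola": "es"}

    def detect(self, text: str) -> Tuple[str, float]:
        # Look for the first word that appears in the hint table.
        for word in text.lower().replace(",", " ").split():
            if word in self._HINTS:
                return self._HINTS[word], 1.0
        raise ValueError("could not detect language")


# Registry mapping a model type string to a constructor (hypothetical).
_REGISTRY: Dict[str, Callable[[], DetectionModelBase]] = {
    "keyword": KeywordDetectionModel,
}


def get_detection_model(model_type: str) -> DetectionModelBase:
    """Create the detection model registered under model_type."""
    try:
        return _REGISTRY[model_type.lower()]()
    except KeyError:
        raise ValueError(f"unsupported detection model type: {model_type!r}") from None
```

In a layout like this, supporting another detector means adding one registry entry, and callers depend only on the base interface.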

4. Metrics and Monitoring

Added a new detection_used label to translation metrics to track when detection is used
Enhanced OpenTelemetry tracing with detection-specific spans and attributes
Added detection time tracking and logging
Metrics can be used to analyze detection usage patterns and performance impact
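The timing and labelling idea can be illustrated with the standard library alone. This is a sketch under assumptions: a `collections.Counter` stands in for the real metrics backend, and the metric name, the "true"/"false" label values, and the helper names are hypothetical.

```python
import time
from collections import Counter
from typing import Callable, Tuple

# Stand-in for a labelled metrics counter; keys are (metric, detection_used).
translation_requests: Counter = Counter()


def timed_detect(detect: Callable[[str], Tuple[str, float]], text: str):
    """Run detect(text) and return (language, confidence, elapsed_seconds)."""
    start = time.perf_counter()
    language, confidence = detect(text)
    elapsed = time.perf_counter() - start
    return language, confidence, elapsed


def record_translation(detection_used: bool) -> None:
    """Count one translation request under the detection_used label."""
    label = "true" if detection_used else "false"
    translation_requests[("translation_requests", label)] += 1
```

Comparing the "true" and "false" counts shows how often callers rely on detection, and the elapsed time can be logged next to each result.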

5. Testing

Added comprehensive unit tests for the detection model
Added tests for the detection endpoint
Added tests for automatic detection in the translation endpoint
Updated existing tests to work with the new detection features

Configuration

Added DETECTION_MODEL_TYPE environment variable to configure the detection model
Default model is set to "lingua"
Detection model is loaded at application startup
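Reading the setting could look roughly like the sketch below. The variable name and the "lingua" default come from this PR; the helper `detection_model_type`, the lowercasing, and the treatment of blank values as unset are assumptions.

```python
import os

DEFAULT_DETECTION_MODEL_TYPE = "lingua"  # default stated in the PR


def detection_model_type(env=None) -> str:
    """Read DETECTION_MODEL_TYPE, falling back to the default when unset or blank."""
    # Accept an explicit mapping so the function can be tested without
    # touching the real process environment.
    env = os.environ if env is None else env
    value = env.get("DETECTION_MODEL_TYPE", "").strip().lower()
    return value or DEFAULT_DETECTION_MODEL_TYPE
```

Calling this once at startup and passing the result to the model factory would match the "loaded at application startup" behaviour described above.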

Performance Considerations

Lingua is optimized for speed and accuracy
Detection adds minimal overhead to translation requests
Detection results are logged with timing information for performance monitoring

Documentation

Updated API documentation with information about automatic language detection
Added examples showing how to use the detection features
Added descriptions of detection confidence scores

Usage Examples

Translation with Automatic Detection

POST /api/v1/translate
{
  "text": "Bonjour, comment ça va?",
  "tgt_lang": "en"
}

Response:

{
  "translation": "Hello, how are you?",
  "model_type": "m2m100",
  "architecture": "cpu_compiled",
  "detected_lang": "fr",
  "detection_confidence": 0.95
}

Standalone Language Detection

POST /api/v1/detect
{
  "text": "Bonjour, comment ça va?"
}

Response:

{
  "language": "fr",
  "confidence": 0.95
}
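For callers using Python, the two requests above could be built as in the following sketch. The endpoint paths and field names come from the examples; `BASE_URL`, the helper names, and the absence of authentication are assumptions to adjust for a real deployment.

```python
import json
import urllib.request
from typing import Optional

BASE_URL = "http://localhost:8000"  # assumption: adjust to your deployment


def build_request(path: str, payload: dict) -> urllib.request.Request:
    """Build a JSON POST request for the given API path."""
    body = json.dumps(payload, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def translate_request(text: str, tgt_lang: str, src_lang: Optional[str] = None):
    """Build a translate request; omit src_lang to let the server detect it."""
    payload = {"text": text, "tgt_lang": tgt_lang}
    if src_lang is not None:
        payload["src_lang"] = src_lang
    return build_request("/api/v1/translate", payload)


def detect_request(text: str):
    """Build a standalone detection request."""
    return build_request("/api/v1/detect", {"text": text})


def send(request: urllib.request.Request) -> dict:
    """Send the request and decode the JSON response (needs a running server)."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

For example, `send(translate_request("Bonjour, comment ça va?", "en"))` would issue the first request shown above against a running instance.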

Future Improvements

Add caching for detection results to improve performance for repeated texts
Explore additional detection models for comparison
Add language-specific optimizations for common languages
Consider adding language detection confidence thresholds for fallback to user-specified languages
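Two of these ideas, caching and a confidence threshold, might look like the sketch below. None of this exists in the PR: the helper names, the cache size, and the 0.8 threshold are placeholders chosen for illustration.

```python
from functools import lru_cache
from typing import Callable, Optional, Tuple

Detector = Callable[[str], Tuple[str, float]]


def cached(detector: Detector, maxsize: int = 1024) -> Detector:
    """Wrap a detector so repeated texts reuse the earlier result."""
    return lru_cache(maxsize=maxsize)(detector)


def choose_src_lang(
    detected: Tuple[str, float],
    user_lang: Optional[str],
    threshold: float = 0.8,  # placeholder value, not taken from the PR
) -> str:
    """Prefer the detected language unless its confidence is below the
    threshold and the user supplied a language to fall back to."""
    language, confidence = detected
    if confidence < threshold and user_lang:
        return user_lang
    return language
```

An in-process `lru_cache` only helps within a single worker; a shared cache would be a separate design decision.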

This feature enhances Babeltron's usability by removing the need for users to know the source language of their text, making the translation service more accessible and user-friendly.


This commit implements a new language detection endpoint that uses the Lingua library for highly accurate language identification. The implementation:
- Adds a new /api/v1/detect endpoint that accepts text and returns detected language with confidence score
- Implements LinguaDetectionModel with support for 75 languages
- Updates factory pattern to use Lingua as the default detection model

Lingua was chosen for its superior accuracy on short texts and informal language, making it ideal for real-world applications despite supporting fewer languages than the translation models.

The /api/v1/translate endpoint can now detect the source language if:
- src_lang is set to "auto"
- src_lang is empty ("")
- src_lang is missing

In any of these cases, we use the detection model (Lingua) to detect the source language and proceed with translation. The response then includes two additional attributes:
- detected_lang: string
- detection_confidence: float
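One way to model a response with the two extra attributes is sketched here. The field names match the usage example in this PR; the `TranslateResponse` class, and the choice to omit the detection fields (rather than return null) when no detection ran, are assumptions.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class TranslateResponse:
    translation: str
    model_type: str
    architecture: str
    # Populated only when the source language was detected automatically.
    detected_lang: Optional[str] = None
    detection_confidence: Optional[float] = None

    def to_payload(self) -> dict:
        """Serialize to a dict, dropping fields that are None."""
        return {k: v for k, v in asdict(self).items() if v is not None}
```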


hspedro added 4 commits March 14, 2025 10:36

refactor(api): translate models to its folder

ca13997

We will implement detection models, so we need to scope and adjust the project structure inside the models folder.

feat(monitoring): add detect label

62583a1

hspedro self-assigned this Mar 14, 2025

hspedro added 3 commits March 14, 2025 13:55

fix(api): model_type as model attribute

860f934

chore(ci): disable push to docker in release

bd5d9fa

We have a bug where the worker runs out of memory (OOM) because of the Docker image size. Fix this so the build can run in CI.

fix(poetry): lock and monitoring tests

2f43fae

hspedro merged commit 183659e into main Mar 14, 2025
2 checks passed

hspedro deleted the feat/lang-detection branch March 14, 2025 17:36

hspedro merged 7 commits into main from feat/lang-detection
