The speech-to-text (ASR) system employs two primary models based on Connectionist Temporal Classification (CTC):

- RNN + CTC: sequential modeling with RNNs, using CTC for alignment-free decoding.
- Conformer + CTC: combines CNNs and Transformers for better feature learning, also decoded with CTC.
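CTC decoding collapses a frame-level token path by first merging consecutive repeats and then dropping blanks. A minimal pure-Python sketch (the blank id and the example path are illustrative assumptions, not values from the models above):

```python
BLANK = 0  # assumed blank token id

def ctc_collapse(path):
    """Collapse a frame-level CTC path: merge consecutive repeats, drop blanks."""
    out = []
    prev = None
    for tok in path:
        if tok != prev and tok != BLANK:
            out.append(tok)
        prev = tok
    return out

# A made-up per-frame argmax path; repeats and blanks are removed.
path = [3, 3, 0, 1, 1, 1, 0, 0, 20, 20]
print(ctc_collapse(path))  # -> [3, 1, 20]
```

Note that a blank between two identical tokens keeps both (`[1, 0, 1]` collapses to `[1, 1]`), which is exactly why CTC can emit doubled letters.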
Performance is measured at both the word and character levels:

- WER (Word Error Rate): measures the proportion of word-level errors.
- CER (Character Error Rate): provides a fine-grained error measure at the character level.
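Both metrics reduce to an edit (Levenshtein) distance divided by the reference length, computed over words for WER and over characters for CER. A self-contained sketch (the example sentences are made up):

```python
def edit_distance(ref, hyp):
    """Classic dynamic-programming Levenshtein distance."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + 1,                          # deletion
                          d[i][j - 1] + 1,                          # insertion
                          d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]))  # substitution
    return d[m][n]

def wer(ref, hyp):
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / len(ref_words)

def cer(ref, hyp):
    return edit_distance(list(ref), list(hyp)) / len(ref)

print(wer("the cat sat", "the cat sit"))  # 1 word error out of 3
print(cer("the cat sat", "the cat sit"))  # 1 char error out of 11
```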
Various strategies are used to generate the final transcription output:

- Greedy: a fast method that picks the highest-probability token at each step.
- Beam Search: explores multiple candidate paths for better accuracy.
- LM Integration: uses an external language model to refine the output.
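The difference between greedy and beam-search decoding appears as soon as token probabilities depend on the prefix. In the toy sketch below (a hand-made conditional distribution with all numbers invented for illustration), greedy commits to the locally best first token, while a width-2 beam recovers the globally better path:

```python
import math

# Made-up conditional model: P(next token | previous token), vocab = {0, 1}.
COND = {
    None: [0.6, 0.4],   # first-step distribution
    0:    [0.5, 0.5],
    1:    [0.9, 0.1],
}

def step_log_probs(prev):
    return [math.log(p) for p in COND[prev]]

def greedy_decode(steps):
    seq, prev = [], None
    for _ in range(steps):
        row = step_log_probs(prev)
        prev = max(range(len(row)), key=row.__getitem__)
        seq.append(prev)
    return seq

def beam_decode(steps, width=2):
    beams = [([], 0.0)]                      # (token sequence, total log-prob)
    for _ in range(steps):
        cands = []
        for seq, score in beams:
            prev = seq[-1] if seq else None
            for tok, lp in enumerate(step_log_probs(prev)):
                cands.append((seq + [tok], score + lp))
        beams = sorted(cands, key=lambda b: b[1], reverse=True)[:width]
    return beams[0][0]

print(greedy_decode(2))  # -> [0, 0], total prob 0.6 * 0.5 = 0.30
print(beam_decode(2))    # -> [1, 0], total prob 0.4 * 0.9 = 0.36
```

Greedy takes token 0 first (0.6 > 0.4) and can never recover; the beam keeps both prefixes alive and finds the higher-probability sequence.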
The models were evaluated using WER and CER:
| Model | Train Loss Trend | Val Loss Trend | Final Test WER | Final Test CER |
|---|---|---|---|---|
| RNN + CTC | Steady decrease, smooth convergence | Flattens after Epoch 6 | 30.28% | 8.50% |
| Conformer + CTC | Plateaus early around 2.6 | Fluctuates, remains high | 27.21% | 8.34% |
| Conformer + CTC | RNN + CTC |
|---|---|
| *(plot not available)* | *(plot not available)* |
The text translation task focuses on translating English to German and progressively employs advanced sequence modeling techniques.
The models evolve from basic sequence-to-sequence learning to modern attention-based architectures.
- Vanilla Seq2Seq: an encoder-decoder model for translating English to German using word-level tokens.
- Seq2Seq with Attention: enhances translation by allowing the decoder to focus dynamically on relevant input words.
- Transformer: uses stacked self-attention layers to capture more complex relationships among all words in the sequence.
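The attention step the last two models share can be sketched as scaled dot-product attention: a decoder query is scored against the encoder keys, and the softmax weights mix the values into a context vector. A minimal pure-Python version for a single query (the 2-d vectors and numbers are purely illustrative):

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention for one query over a list of keys/values."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    context = [sum(w * v[i] for w, v in zip(weights, values))
               for i in range(len(values[0]))]
    return context, weights

# The query matches the first key more closely, so the first value dominates.
context, weights = attention([1.0, 0.0],
                             keys=[[1.0, 0.0], [0.0, 1.0]],
                             values=[[10.0, 0.0], [0.0, 10.0]])
print([round(w, 2) for w in weights])
```

In a real Transformer this runs for every query position in parallel, with learned projections producing the queries, keys, and values.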
Translation quality is assessed using a standard metric:
- BLEU Score: measures translation quality by comparing generated sentences to reference translations (higher is better).
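Sentence-level BLEU can be sketched as the geometric mean of clipped n-gram precisions times a brevity penalty. This is a simplified version (real BLEU is corpus-level and usually smoothed; this one returns 0 whenever any n-gram precision is 0, and the example sentences are made up):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(reference, hypothesis, max_n=4):
    """Simplified sentence BLEU: clipped n-gram precisions + brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        hyp, ref = ngrams(hypothesis, n), ngrams(reference, n)
        overlap = sum((hyp & ref).values())        # clipped n-gram matches
        precisions.append(overlap / max(sum(hyp.values()), 1))
    if min(precisions) == 0:
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / max_n
    bp = min(1.0, math.exp(1 - len(reference) / len(hypothesis)))
    return bp * math.exp(log_mean)

ref = "the cat sat on the mat".split()
print(bleu(ref, ref))                    # identical sentences -> 1.0
print(bleu(ref, "the dog ran".split()))  # no 2-gram overlap  -> 0.0
```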
Various techniques are employed during both training and inference to optimize translation output:
- Greedy Decoding: Picks the most likely word at each step; this method is fast but can miss optimal translations.
- Teacher Forcing: Used during training, it uses the correct previous word as input to the decoder for faster learning.
- Attention Mechanism: Guides the decoder to focus on relevant input words for better translation quality.
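Teacher forcing can be illustrated with a stand-in "decoder" that maps the previous token to a predicted next token (the lookup table and the German example sentence are invented for the sketch): in free-running decoding an early mistake derails every later step, while teacher forcing conditions each step on the gold prefix.

```python
# Gold target sequence and a deliberately imperfect next-token table
GOLD = ["<s>", "ich", "bin", "müde", "</s>"]

# Stand-in decoder: previous token -> predicted next token (made up).
PRED = {"<s>": "ich", "ich": "war", "bin": "müde",
        "müde": "</s>", "war": "hier", "hier": "</s>"}

def free_running(steps):
    """Inference-style decoding: each step feeds back its own prediction."""
    out, prev = [], "<s>"
    for _ in range(steps):
        prev = PRED[prev]
        out.append(prev)
    return out

def teacher_forced(gold):
    """Training-style decoding: each step sees the gold previous token."""
    return [PRED[prev] for prev in gold[:-1]]

print(free_running(4))        # -> ['ich', 'war', 'hier', '</s>']  (derailed early)
print(teacher_forced(GOLD))   # -> ['ich', 'war', 'müde', '</s>']  (3/4 correct)
```

The single wrong entry (`"ich" -> "war"`) costs free running the rest of the sentence, but teacher forcing limits it to one error, which is why training converges faster with it.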
Translation quality was measured using the BLEU score:
| Method | Test BLEU | Translation quality |
|---|---|---|
| Vanilla Seq2Seq | — | Poor translations |
| Seq2Seq with Attention | — | Predictions are short and repetitive |
| Transformer | — | Fluent, accurate, close to reference |
| Training | Validation |
|---|---|
| *(plot not available)* | *(plot not available)* |



