Commit f30212d

Migrating CT documents to Azure AI Foundry
1 parent cd2770e commit f30212d

File tree

50 files changed

+1774
-1
lines changed

Lines changed: 166 additions & 0 deletions
@@ -0,0 +1,166 @@
1+
---
2+
title: Azure AI Translator Custom Translation for beginners
3+
titleSuffix: Azure AI services
4+
description: A user guide for understanding the end-to-end customized machine translation process.
5+
author: laujan
6+
manager: nitinme
7+
ms.service: azure-ai-translator
8+
ms.author: lajanuar
9+
ms.date: 01/27/2025
10+
ms.topic: overview
11+
---
12+
13+
# Azure AI Translator Custom Translation for beginners
14+
15+
[Custom Translation](overview.md) enables you to build a translation system that reflects your business, industry, and domain-specific terminology and style. Training and deploying a custom system is easy and doesn't require any programming skills. The customized translation system seamlessly integrates into your existing applications, workflows, and websites and is available on Azure through the same cloud-based [Microsoft Text Translation API](../reference/v3-0-translate.md?tabs=curl) service that powers billions of translations every day.
16+
17+
[Custom Translation](overview.md) empowers you to build a translation system that truly captures your business's unique language, industry terminology, and domain-specific style. With an intuitive interface, training, testing, and deploying your custom model is simple and requires no programming expertise. Seamlessly integrate your tailored translation system into your existing applications, workflows, and websites—all backed by the cloud-based [Azure AI Translator Text Translation API](../reference/v3-0-translate.md?tabs=curl) service that powers billions of translations each day.
18+
19+
The platform enables users to build and publish custom translation systems to and from English. Custom Translation supports more than 100 languages that map directly to the languages available for neural machine translation (NMT). For a complete list, *see* [Translator language support](../language-support.md).
20+
21+
## Is a custom translation model the right choice for you?
22+
23+
A well-trained custom translation model excels at delivering accurate, domain-specific translations by learning from your previously translated in-domain documents. This approach ensures that your specialized terms and phrases are used in context, producing fluent, natural translations that respect the target language’s grammatical nuances.
24+
25+
Keep in mind that developing a full custom translation model requires a substantial amount of training data—typically at least 10,000 parallel sentences. If you don’t have enough data to train a comprehensive model, you might consider building a dictionary-only model to capture essential terminology, or you can rely on the high-quality, out-of-the-box translations offered by the Text Translation API.
26+
27+
Ultimately, if you need translations that reflect your industry’s specific language and you have ample training resources, a custom translation model can be the ideal choice for your organization.
28+
29+
:::image type="content" source="media/how-to/for-beginners.png" alt-text="Screenshot illustrating the difference between custom and general models.":::
30+
31+
## What does training a custom translation model involve?
32+
33+
Building a custom translation model requires:
34+
35+
* Understanding your use-case.
36+
37+
* Obtaining in-domain translated data (preferably human translated).
38+
39+
* Assessing translation quality or target language translations.
40+
41+
## How do I evaluate my use-case?
42+
43+
Having clarity on your use-case and what success looks like is the first step towards sourcing suitable training data. Here are a few considerations:
44+
45+
* Is your desired outcome specified and how is it measured?
46+
47+
* Is your business domain identified?
48+
49+
* Do you have in-domain sentences of similar terminology and style?
50+
51+
* Does your use-case involve multiple domains? If yes, should you build one translation system or multiple systems?
52+
53+
* Do you have requirements impacting regional data residency at-rest and in-transit?
54+
55+
* Are the target users in one or multiple regions?
56+
57+
## How should I source my data?
58+
59+
Finding in-domain quality data is often a challenging task that varies based on user classification. Here are some questions you can ask yourself as you evaluate what data is available to you:
60+
61+
* Does your company have previous translation data available that you can use? Enterprises often have a wealth of translation data accumulated over many years of using human translation.
62+
63+
* Do you have a vast amount of monolingual data? Monolingual data is data in only one language. If so, can you get translations for this data?
64+
65+
* Can you crawl online portals to collect source sentences and synthesize target sentences?
66+
67+
## What should I use for training material?
68+
69+
| Source | What it does | Rules to follow |
70+
|---|---|---|
71+
| Bilingual training documents | Teaches the system your terminology and style. | **Be liberal**. Any in-domain human translation is better than machine translation. Add and remove documents as you go and try to improve the [BLEU score](concepts/bleu-score.md?WT.mc_id=aiml-43548-heboelma). |
72+
| Tuning documents | Trains the Neural Machine Translation parameters. | **Be strict**. Compose them to be optimally representative of what you are going to translate in the future. |
73+
| Test documents | Calculates the [BLEU score](concepts/bleu-score.md?WT.mc_id=aiml-43548-heboelma).| **Be strict**. Compose test documents to be optimally representative of what you plan to translate in the future. |
74+
| Phrase dictionary | Forces the given translation 100% of the time. | **Be restrictive**. A phrase dictionary is case-sensitive and any word or phrase listed is translated in the way you specify. In many cases, it's better to not use a phrase dictionary and let the system learn. |
75+
| Sentence dictionary | Forces the given translation 100% of the time. | **Be strict**. A sentence dictionary is case-insensitive and good for common in-domain short sentences. For a sentence dictionary match to occur, the entire submitted sentence must match the source dictionary entry. If only a portion of the sentence matches, the entry doesn't match. |
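
To make the two dictionary matching rules concrete, here's a minimal Python sketch. It models only the matching behavior described in the table (case-sensitive phrase matches, case-insensitive whole-sentence matches). The function names are hypothetical, and the plain substring check is an assumption that ignores word boundaries; it isn't how the service applies dictionaries internally.

```python
def phrase_entry_matches(sentence: str, entry: str) -> bool:
    """Phrase dictionary rule: the entry must appear with the same casing."""
    # Assumption: a plain substring test stands in for the real phrase lookup.
    return entry in sentence


def sentence_entry_matches(sentence: str, entry: str) -> bool:
    """Sentence dictionary rule: the whole sentence must match, ignoring case."""
    return sentence.strip().casefold() == entry.strip().casefold()


if __name__ == "__main__":
    # Same casing: the phrase entry applies.
    print(phrase_entry_matches("Open the Contoso Portal", "Contoso Portal"))
    # Different casing: the phrase entry doesn't apply.
    print(phrase_entry_matches("Open the contoso portal", "Contoso Portal"))
    # Whole sentence, different casing: the sentence entry applies.
    print(sentence_entry_matches("thank you.", "Thank you."))
    # Only part of the sentence matches: the sentence entry doesn't apply.
    print(sentence_entry_matches("Thank you. Goodbye.", "Thank you."))
```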
76+
77+
## What is a BLEU score?
78+
79+
BLEU (Bilingual Evaluation Understudy) is an algorithm for evaluating the precision or accuracy of text that is machine translated from one language to another. Custom Translation uses the BLEU metric as one way of conveying translation accuracy.
80+
81+
A BLEU score is a number between zero and 100. A score of zero indicates a low-quality translation where nothing in the translation matched the reference. A score of 100 indicates a perfect translation that is identical to the reference. It's not necessary to attain a score of 100; a BLEU score between 40 and 60 indicates a high-quality translation.
82+
83+
[Read more](concepts/bleu-score.md?WT.mc_id=aiml-43548-heboelma)
84+
85+
## What happens if I don't submit tuning or testing data?
86+
87+
Tuning and test sentences should be optimally representative of what you plan to translate in the future. If you don't submit any tuning or testing data, Custom Translation automatically excludes sentences from your training documents to use as tuning and test data.
88+
89+
| System-generated | Manual-selection |
90+
|---|---|
91+
| Convenient. | Enables fine-tuning for your future needs.|
92+
| Good, if you know that your training data is representative of what you are planning to translate. | Provides more freedom to compose your training data.|
93+
| Easy to redo when you grow or shrink the domain. | Allows for more data and better domain coverage.|
94+
| Changes each training run. | Remains static over repeated training runs. |
95+
96+
## How is training material processed by Custom Translation?
97+
98+
To prepare for training, documents undergo a series of processing and filtering steps. Understanding the filtering process helps you interpret the displayed sentence count and prepare your documents for training with Custom Translation. The filtering steps are as follows:
99+
100+
* ### Sentence alignment
101+
102+
If your document isn't in `XLIFF`, `XLSX`, `TMX`, or `ALIGN` format, Custom Translation aligns the sentences of your source and target documents to each other, sentence-by-sentence. Translator doesn't perform document alignment—it follows your naming convention for the documents to find a matching document in the other language. Within the source text, Custom Translation tries to find the corresponding sentence in the target language. It uses document markup like embedded HTML tags to help with the alignment.
103+
104+
If you see a large discrepancy between the number of sentences in the source and target documents, your documents might not be parallel, or couldn't be aligned. Document pairs with a large difference (>10%) in sentence count between the two sides warrant a second look to make sure they're indeed parallel.
105+
106+
* ### Tuning and testing data extraction
107+
108+
Tuning and testing data is optional. If you don't provide it, the system removes an appropriate percentage from your training documents to use for tuning and testing. The removal happens dynamically as part of the training process. Since this step occurs as part of training, your uploaded documents aren't affected. You can see the final used sentence counts for each category of data—training, tuning, testing, and dictionary—on the Model details page after training succeeds.
109+
110+
* ### Length filter
111+
112+
* Removes sentences with only one word on either side.
113+
* Removes sentences with more than 100 words on either side. Chinese, Japanese, Korean are exempt.
114+
* Removes sentences with fewer than three characters. Chinese, Japanese, Korean are exempt.
115+
* Removes sentences with more than 2,000 characters for Chinese, Japanese, Korean.
116+
* Removes sentences with less than 1% alphanumeric characters.
117+
* Removes dictionary entries containing more than 50 words.
118+
119+
* ### White space
120+
121+
* Replaces any sequence of white-space characters including tabs and CR/LF sequences with a single space character.
122+
* Removes leading or trailing space in the sentence.
123+
124+
* ### Sentence end punctuation
125+
126+
* Replaces multiple sentence-end punctuation characters with a single instance.
127+
128+
* Japanese character normalization: converts full-width letters and digits to half-width characters.
129+
130+
* ### Unescaped XML tags
131+
132+
Transforms unescaped tags into escaped tags:
133+
134+
| Tag | Becomes |
135+
|---|---|
136+
| \< | \&lt; |
137+
| \> | \&gt; |
138+
| \& | \&amp; |
139+
140+
* ### Invalid characters
141+
142+
Custom Translation removes sentences that contain Unicode character U+FFFD. The character U+FFFD indicates a failed encoding conversion.
143+
144+
* ### Invalid HTML tags
145+
146+
Custom Translation removes valid tags during training. Invalid tags cause unpredictable results and should be manually removed.
147+
148+
## What steps should I take before uploading data?
149+
150+
* Remove sentences with invalid encoding.
151+
* Remove Unicode control characters.
152+
* Align sentences (source-to-target), if feasible.
153+
* Remove source and target sentences that don't match the source and target languages.
154+
* When source and target sentences have mixed languages, ensure that untranslated words are intentional, for example, names of organizations and products.
155+
* Avoid teaching errors to your model by making certain that grammar and typography are correct.
156+
* Have one source sentence mapped to one target sentence. Although our training process handles source and target lines containing multiple sentences, one-to-one mapping is a best practice.
157+
* Remove invalid HTML tags before uploading training data.
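
The first two checks in the list can be scripted before upload. The following is a minimal sketch, assuming your files are UTF-8 encoded. The function names are hypothetical, and keeping tabs and line breaks is a choice made here, not a documented requirement.

```python
import unicodedata
from typing import Optional


def decode_strict(raw: bytes) -> Optional[str]:
    """Return the decoded text, or None when the bytes aren't valid UTF-8."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return None


def strip_control_characters(text: str) -> str:
    """Drop Unicode control characters (category Cc), keeping tab, LF, and CR."""
    return "".join(
        ch for ch in text
        if ch in "\t\n\r" or unicodedata.category(ch) != "Cc"
    )


if __name__ == "__main__":
    print(decode_strict(b"caf\xc3\xa9"))   # valid UTF-8
    print(decode_strict(b"caf\xe9"))       # Latin-1 bytes, not valid UTF-8
    print(repr(strip_control_characters("Hello\x00 wor\x07ld\n")))
```

Sentences for which `decode_strict` returns `None` are candidates for removal or re-encoding before you upload them.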
158+
159+
## How do I evaluate the results?
160+
161+
After your model is successfully trained, you can view the model's BLEU score and baseline model BLEU score on the model details page. We use the same set of test data to generate both the model's BLEU score and the baseline BLEU score. This data helps you make an informed decision regarding which model would be better for your use-case.
162+
163+
## Next steps
164+
165+
> [!div class="nextstepaction"]
166+
> [Try creating a project](../azure-ai-foundry/how-to-custom-translation-create-project.md)
Lines changed: 47 additions & 0 deletions
@@ -0,0 +1,47 @@
1+
---
2+
title: "What is a BLEU score? - Custom Translation"
3+
titleSuffix: Azure AI services
4+
description: BLEU is a measurement of the differences between machine translation and human-created reference translations of the same source sentence.
5+
author: laujan
6+
manager: nitinme
7+
ms.service: azure-ai-translator
8+
ms.topic: conceptual
9+
ms.date: 01/27/2025
10+
ms.author: lajanuar
11+
ms.custom: cogserv-non-critical-translator
12+
#Customer intent: As a Custom Translation user, I want to understand how the BLEU score works so that I better understand system test outcomes.
13+
---
14+
15+
# What is a BLEU score?
16+
17+
[BLEU (Bilingual Evaluation Understudy)](https://en.wikipedia.org/wiki/BLEU) is a measurement of the difference between an automatic translation and human-created reference translations of the same source sentence.
18+
19+
## Scoring process
20+
21+
The BLEU algorithm compares consecutive phrases of the automatic translation
22+
with the consecutive phrases it finds in the reference translation, and counts
23+
the number of matches, in a weighted fashion. These matches are position
24+
independent. A higher match degree indicates a higher degree of similarity with
25+
the reference translation, and a higher score. Intelligibility and grammatical correctness aren't taken into account.
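
The scoring process can be illustrated with a simplified, single-reference BLEU computation. This is a sketch of the textbook algorithm (clipped n-gram matches up to 4-grams, geometric mean, brevity penalty) with whitespace tokenization, scaled to the 0-100 range. It's an assumption that this resembles what Custom Translation computes; the service's exact tokenization and weighting aren't described here.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate: str, reference: str, max_n: int = 4) -> float:
    """Simplified sentence-level BLEU on a 0-100 scale (single reference)."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        total = sum(cand_counts.values())
        # Position-independent matches, clipped to the reference counts.
        matched = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or matched == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(matched / total))
    # Brevity penalty: penalize candidates shorter than the reference.
    brevity = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return 100 * brevity * math.exp(sum(log_precisions) / max_n)


if __name__ == "__main__":
    reference = "the cat sat on the mat"
    print(simple_bleu("the cat sat on the mat", reference))
    print(simple_bleu("the cat sat on a mat", reference))
    print(simple_bleu("completely different words here", reference))
```

Because the matches are counted without regard to meaning, a fluent paraphrase of the reference can score lower than an awkward translation that happens to share more word sequences with it.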
26+
27+
## How does BLEU work?
28+
29+
The BLEU score's strength is that it correlates well with human judgment. BLEU averages out
30+
individual sentence judgment errors over a test corpus, rather than attempting
31+
to devise the exact human judgment for every sentence.
32+
33+
For a more extensive discussion of BLEU scores, see [this video](https://youtu.be/-UqDljMymMg).
34+
35+
BLEU results depend strongly on the breadth of your domain; consistency of
36+
test, training and tuning data; and how much data you have
37+
available for training. If your models are trained within a narrow domain, and
38+
your training data is consistent with your test data, you can expect a high
39+
BLEU score.
40+
41+
>[!NOTE]
42+
>A comparison between BLEU scores is only valid when the scores come from the same test set, the same language pair, and the same MT engine. A BLEU score from a different test set is bound to be different.
43+
44+
## Next steps
45+
46+
> [!div class="nextstepaction"]
47+
> [BLEU score evaluation](../how-to-custom-translation-test-model.md)
Lines changed: 39 additions & 0 deletions
@@ -0,0 +1,39 @@
1+
---
2+
title: Translation Customization - Custom Translation
3+
titleSuffix: Azure AI services
4+
description: Use the Microsoft Translator Hub to build your own machine translation system using your preferred terminology and style.
5+
#services: cognitive-services
6+
author: laujan
7+
manager: nitinme
8+
ms.service: azure-ai-translator
9+
ms.topic: conceptual
10+
ms.date: 01/28/2025
11+
ms.author: lajanuar
12+
---
13+
14+
# Customize your text translations
15+
16+
Custom Translation is a feature of the Azure AI Translator service. Custom Translation allows users to customize Azure AI Translator's advanced neural machine translation when translating text using Translator (version 3 only).
17+
18+
The feature can also be used to customize speech translation when used with [Azure AI Speech](../../../speech-service/index.yml).
19+
20+
## Custom Translation
21+
22+
With Custom Translator, you can build neural translation systems that understand the terminology used in your own business and industry. The customized translation system integrates into existing applications, workflows, and websites.
23+
24+
### How does it work?
25+
26+
Use your previously translated documents (leaflets, webpages, documentation, etc.) to build a translation system that reflects your domain-specific terminology and style, better than a standard translation system. Users can upload `TMX`, `XLIFF`, `TXT`, `DOCX`, and `XLSX` documents.
27+
28+
The system also accepts data that is parallel at the document level but isn't yet aligned at the sentence level. If users have access to versions of the same content in multiple languages but in separate documents, Custom Translator is able to automatically match sentences across documents. The system can also use monolingual data in either or both languages to complement the parallel training data to improve the translations.
29+
30+
The customized system is then available through a regular call to Translator using the category parameter.
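
As an illustration, the following Python sketch builds (but doesn't send) a Translator version 3 request that passes the `category` parameter. The key, region, and category ID values are placeholders you'd replace with your own, the helper name is hypothetical, and the global endpoint is assumed; your resource might use a regional or custom endpoint instead.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the global Translator endpoint.
ENDPOINT = "https://api.cognitive.microsofttranslator.com/translate"


def build_translate_request(text, to_lang, category_id, key, region):
    """Build a POST request that routes translation to a custom model."""
    query = urllib.parse.urlencode(
        {"api-version": "3.0", "to": to_lang, "category": category_id}
    )
    body = json.dumps([{"Text": text}]).encode("utf-8")
    headers = {
        "Ocp-Apim-Subscription-Key": key,
        "Ocp-Apim-Subscription-Region": region,
        "Content-Type": "application/json",
    }
    return urllib.request.Request(
        f"{ENDPOINT}?{query}", data=body, headers=headers, method="POST"
    )


if __name__ == "__main__":
    request = build_translate_request(
        "Hello", "de", "my-category-id", "<your-key>", "<your-region>"
    )
    print(request.full_url)
    # To send it: urllib.request.urlopen(request) with real credentials.
```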
31+
32+
Given the appropriate type and amount of training data, it isn't uncommon to see gains of 5 to 10 `BLEU` points, or even more, in translation quality by using Custom Translator.
33+
34+
More details about the various levels of customization based on available data can be found in the [Custom Translation User Guide](../overview.md).
35+
36+
## Next steps
37+
38+
> [!div class="nextstepaction"]
39+
> [Set up a customized language system using Custom Translation](../overview.md)
Lines changed: 65 additions & 0 deletions
@@ -0,0 +1,65 @@
1+
---
2+
title: "Data Filtering - Custom Translation"
3+
titleSuffix: Azure AI services
4+
description: When you submit documents to be used for training a custom system, the documents undergo a series of processing and filtering steps.
5+
author: laujan
6+
manager: nitinme
7+
ms.service: azure-ai-translator
8+
ms.date: 01/28/2025
9+
ms.author: lajanuar
10+
ms.topic: conceptual
11+
ms.custom: cogserv-non-critical-translator
12+
#Customer intent: As a Custom Translation user, I want to understand how data is filtered before training a model.
13+
---
14+
15+
# Custom Translation data filtering
16+
17+
When you submit documents to be used for training, the documents undergo a series of processing and filtering steps. These steps are explained here. Knowing how the filtering works helps you understand the sentence count displayed in Custom Translation and the steps you can take yourself to prepare the documents for training.
18+
19+
## Sentence alignment
20+
21+
If your document isn't in `XLIFF`, `TMX`, or `ALIGN` format, Custom Translation aligns the sentences of your source and target documents to each other, sentence by sentence. Custom Translation doesn't perform document alignment; it follows your naming of the documents to find the matching document in the other language. Within the document, Custom Translation tries to find the corresponding sentence in the other language. It uses document markup like embedded HTML tags to help with the alignment.
22+
23+
If you see a large discrepancy between the number of sentences in the source and target documents, your documents might not be parallel. Document pairs with a large difference (>10%) in sentence count between the two sides warrant a second look to make sure they're indeed parallel. Custom Translation shows a warning next to the document if the sentence count differs suspiciously.
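
You can run the same sanity check on your own document pairs before uploading. This sketch assumes the 10% difference is measured relative to the larger of the two sentence counts; that choice of denominator and the function names are assumptions made for illustration.

```python
def sentence_count_gap(source_count: int, target_count: int) -> float:
    """Relative difference in sentence counts, as a fraction of the larger side."""
    larger = max(source_count, target_count)
    if larger == 0:
        return 0.0
    return abs(source_count - target_count) / larger


def needs_second_look(source_count: int, target_count: int,
                      threshold: float = 0.10) -> bool:
    """Flag document pairs whose sentence counts differ by more than 10%."""
    return sentence_count_gap(source_count, target_count) > threshold


if __name__ == "__main__":
    print(needs_second_look(1000, 950))  # 5% gap
    print(needs_second_look(1000, 880))  # 12% gap
```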
24+
25+
## Deduplication
26+
27+
Custom Translation removes the sentences that are present in test and tuning documents from training data. The removal happens dynamically inside of the training run, not in the data processing step. Custom Translation reports the sentence count to you in the project overview before such removal. Deduplication doesn't apply if you choose to upload your own test and tuning documents.
28+
29+
## Length filter
30+
31+
* Remove sentences with only one word on either side.
32+
* Remove sentences with more than 100 words on either side. Chinese, Japanese, Korean are exempt.
33+
* Remove sentences with fewer than three characters. Chinese, Japanese, Korean are exempt.
34+
* Remove sentences with more than 2,000 characters for Chinese, Japanese, Korean.
35+
* Remove sentences with less than 1% alpha characters.
36+
* Remove dictionary entries containing more than 50 words.
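
The length rules above can be approximated locally to estimate how many sentence pairs survive filtering. This is a rough sketch, not the service's implementation: words are counted by splitting on white space, and the one-word rule is skipped for Chinese, Japanese, and Korean text because white-space splitting can't count words in those languages. That last choice is an assumption; the list only exempts those languages from the 100-word and three-character rules. All function names are hypothetical.

```python
def passes_length_filter(text: str, is_cjk: bool = False) -> bool:
    """Apply the per-side length rules to one sentence."""
    text = text.strip()
    if not text:
        return False
    alpha = sum(ch.isalpha() for ch in text)
    if alpha / len(text) < 0.01:          # less than 1% alpha characters
        return False
    if is_cjk:
        return len(text) <= 2000          # CJK: character limit only
    words = text.split()
    return len(text) >= 3 and 1 < len(words) <= 100


def keep_pair(source: str, target: str,
              source_is_cjk: bool = False, target_is_cjk: bool = False) -> bool:
    """A pair is kept only when both sides pass the filter."""
    return (passes_length_filter(source, source_is_cjk)
            and passes_length_filter(target, target_is_cjk))


def keep_dictionary_entry(entry: str) -> bool:
    """Dictionary entries may contain at most 50 words."""
    return len(entry.split()) <= 50


if __name__ == "__main__":
    print(keep_pair("Hello world", "Hallo Welt"))
    print(keep_pair("Hello", "Hallo"))
    print(keep_pair("word " * 101, "wort " * 101))
```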
37+
38+
## White space
39+
40+
* Replace any sequence of white-space characters including tabs and CR/LF sequences with a single space character.
41+
* Remove leading or trailing space in the sentence.
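
Both white-space rules fit in one short function. The following is a minimal sketch using Python's `re` module, where `\s` is assumed to cover the same white-space characters the service treats as white space; the function name is hypothetical.

```python
import re


def normalize_whitespace(sentence: str) -> str:
    """Collapse white-space runs to one space and trim both ends."""
    return re.sub(r"\s+", " ", sentence).strip()


if __name__ == "__main__":
    print(repr(normalize_whitespace("  Hello\t\tworld\r\nagain  ")))
```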
42+
43+
## Sentence end punctuation
44+
45+
Replace multiple sentence end punctuation characters with a single instance.
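
As a rough illustration, the sketch below collapses a run of the same punctuation character at the end of a sentence. Which characters count as sentence-end punctuation, and how mixed runs such as `?!` are handled, isn't specified here; the character set and the same-character rule in this sketch are assumptions.

```python
import re

# Assumption: these characters count as sentence-end punctuation.
_END_RUN = re.compile(r"([.!?\u3002\uff01\uff1f])\1+$")


def collapse_end_punctuation(sentence: str) -> str:
    """Replace a repeated sentence-final punctuation mark with one instance."""
    return _END_RUN.sub(r"\1", sentence)


if __name__ == "__main__":
    print(collapse_end_punctuation("Really???"))
    print(collapse_end_punctuation("Wait... what!!"))
    print(collapse_end_punctuation("Done."))
```

Only the run at the very end of the sentence is collapsed, so an ellipsis in the middle of a sentence is left alone.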
46+
47+
## Japanese character normalization
48+
49+
Convert full width letters and digits to half-width characters.
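
Full-width Latin letters and digits sit at a fixed offset from their ASCII counterparts in Unicode, so the conversion can be sketched as a translation table. This example handles only letters and digits, as the rule states; whether the service also converts other full-width symbols is not covered here, and the function name is hypothetical.

```python
# Full-width digits (U+FF10-U+FF19) and letters (U+FF21-U+FF3A, U+FF41-U+FF5A)
# are offset by 0xFEE0 from their half-width (ASCII) equivalents.
_FULLWIDTH_TO_HALFWIDTH = {
    code: code - 0xFEE0
    for start, end in ((0xFF10, 0xFF19), (0xFF21, 0xFF3A), (0xFF41, 0xFF5A))
    for code in range(start, end + 1)
}


def to_half_width(text: str) -> str:
    """Convert full-width letters and digits to half-width characters."""
    return text.translate(_FULLWIDTH_TO_HALFWIDTH)


if __name__ == "__main__":
    # Full-width "ABC123" written with escapes to avoid encoding ambiguity.
    print(to_half_width("\uff21\uff22\uff23\uff11\uff12\uff13"))
```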
50+
51+
## Unescaped XML tags
52+
53+
Filtering transforms unescaped tags into escaped tags:
54+
* `<` becomes `&lt;`
55+
* `>` becomes `&gt;`
56+
* `&` becomes `&amp;`
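
Python's standard library performs the same three replacements, which is handy for previewing what escaped text looks like. This sketch uses `xml.sax.saxutils.escape`; the wrapper name is hypothetical. Note that it escapes every `&`, so running it on text that is already escaped would escape it a second time.

```python
from xml.sax.saxutils import escape


def escape_tags(sentence: str) -> str:
    """Escape &, <, and > so they appear as literal text."""
    return escape(sentence)


if __name__ == "__main__":
    print(escape_tags("<b>Terms & conditions</b>"))
```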
57+
58+
## Invalid characters
59+
60+
Custom Translation removes sentences that contain Unicode character U+FFFD. The character U+FFFD indicates a failed encoding conversion.
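
To see where U+FFFD comes from and to find affected sentences before training, consider the following minimal sketch. Dropping the whole pair when either side contains the character is an assumption about how the rule applies to parallel data, and the function name is hypothetical.

```python
REPLACEMENT_CHARACTER = "\ufffd"


def drop_invalid_pairs(pairs):
    """Keep only sentence pairs where neither side contains U+FFFD."""
    return [
        (source, target)
        for source, target in pairs
        if REPLACEMENT_CHARACTER not in source
        and REPLACEMENT_CHARACTER not in target
    ]


if __name__ == "__main__":
    # Decoding Latin-1 bytes as UTF-8 with errors="replace" yields U+FFFD.
    damaged = b"caf\xe9".decode("utf-8", errors="replace")
    pairs = [("coffee", "Kaffee"), ("coffee shop", damaged)]
    print(drop_invalid_pairs(pairs))
```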
61+
62+
## Next steps
63+
64+
> [!div class="nextstepaction"]
65+
> [Learn how to train a model](../how-to-custom-translation-train-model.md)
