Commit 54943d5

authored
Merge pull request #4982 from MicrosoftDocs/release-build-2025-custom-translator
[DO NOT MERGE] release-build-2025-custom-translator -> main -- 05/19/2025 - TBD
2 parents 9069bec + 94607ee commit 54943d5

125 files changed: +2358 / -199 lines


articles/ai-services/translator/containers/install-run.md

Lines changed: 2 additions & 2 deletions

@@ -578,7 +578,7 @@ The container provides two endpoints for returning records regarding its usage.

The following endpoint provides a report summarizing all of the usage collected in the mounted billing record directory.

-```HTTP
+```bash
https://<service>/records/usage-logs/
```

@@ -590,7 +590,7 @@ https://<service>/records/usage-logs/

The following endpoint provides a report summarizing usage over a specific month and year:

-```HTTP
+```bash
https://<service>/records/usage-logs/{MONTH}/{YEAR}
```
articles/ai-services/translator/containers/translate-text-parameters.md

Lines changed: 6 additions & 6 deletions

@@ -19,7 +19,7 @@ ms.author: lajanuar

Send a `POST` request to:

-```HTTP
+```bash
POST http://localhost:{port}/translate?api-version=3.0&from={from}&to={to}
```

@@ -65,17 +65,17 @@ Request parameters passed on the query string are:

| Query parameter | Description |
| --- | --- |
-| textType | _Optional parameter_. <br>Defines whether the text being translated is plain text or HTML text. Any HTML needs to be a well-formed, complete element. Possible values are: `plain` (default) or `html`. |
-| includeSentenceLength | _Optional parameter_. <br>Specifies whether to include sentence boundaries for the input text and the translated text. Possible values are: `true` or `false` (default). |
+| textType | _Optional parameter_. <br>Defines whether the text being translated is plain text or HTML text. Any HTML needs to be a well-formed, complete element. Accepted values are: `plain` (default) or `html`. |
+| includeSentenceLength | _Optional parameter_. <br>Specifies whether to include sentence boundaries for the input text and the translated text. Accepted values are: `true` or `false` (default). |

### Request headers

| Headers | Description |Condition|
| --- | --- |---|
-| Authentication headers |*See* [available options for authentication](../text-translation/reference/v3/reference.md#authentication). |*Required request header*|
+| Authentication headers |*See* [available options for authentication](../text-translation/reference/authentication.md). |*Required request header*|
| Content-Type |Specifies the content type of the payload. <br>Accepted value is `application/json; charset=UTF-8`. |*Required request header*|
| Content-Length |The length of the request body. |*Optional*|
-| X-ClientTraceId | A client-generated GUID to uniquely identify the request. You can omit this header if you include the trace ID in the query string using a query parameter named `ClientTraceId`. |*Optional*|
+| X-ClientTraceId | A client-generated GUID to uniquely identify the request. You can omit this optional header if you include the trace ID in the query string using a query parameter named `ClientTraceId`. |*Optional*|

## Request body

@@ -121,7 +121,7 @@ A successful response is a JSON array with one result for each string in the inp

## Response status codes

-If an error occurs, the request returns a JSON error response. The error code is a 6-digit number combining the 3-digit HTTP status code followed by a 3-digit number to further categorize the error. Common error codes can be found on the [v3 Translator reference page](../text-translation/reference/v3/reference.md#errors).
+If an error occurs, the request returns a JSON error response. The error code is a 6-digit number combining the 3-digit HTTP status code followed by a 3-digit number to further categorize the error. Common error codes can be found on the [Translator status and error code page](../text-translation/reference/status-response-codes.md).

## Code samples: translate text
articles/ai-services/translator/containers/transliterate-text-parameters.md

Lines changed: 4 additions & 4 deletions

@@ -19,12 +19,12 @@ Convert characters or letters of a source language to the corresponding characte

`POST` request:

-```HTTP
+```bash
POST http://localhost:{port}/transliterate?api-version=3.0&language={language}&fromScript={fromScript}&toScript={toScript}
```

-*See* [**Virtual Network Support**](../text-translation/reference/v3/reference.md#virtual-network-support) for Translator service selected network and private endpoint configuration and support.
+*See* [**Virtual Network Support**](../text-translation/reference/authentication.md#virtual-network-support) for Translator service selected network and private endpoint configuration and support.

## Request parameters

@@ -44,10 +44,10 @@ Request parameters passed on the query string are:

| Headers | Description |Condition|
| --- | --- | ---|
-| Authentication headers | *See* [available options for authentication](../text-translation/reference/v3/reference.md#authentication)|*Required request header*|
+| Authentication headers | *See* [available options for authentication](../text-translation/reference/authentication.md)|*Required request header*|
| Content-Type | Specifies the content type of the payload. Possible value: `application/json` |*Required request header*|
| Content-Length |The length of the request body. |*Optional*|
-| X-ClientTraceId |A client-generated GUID to uniquely identify the request. You can omit this header if you include the trace ID in the query string using a query parameter named `ClientTraceId`. |*Optional*|
+| X-ClientTraceId |A client-generated GUID to uniquely identify the request. You can omit this optional header if you include the trace ID in the query string using a query parameter named `ClientTraceId`. |*Optional*|

## Response body
Lines changed: 166 additions & 0 deletions (new file)

---
title: Azure AI Translator custom translation for beginners
titleSuffix: Azure AI services
description: User guide for understanding the end-to-end customized machine translation process using Azure AI Foundry.
author: laujan
manager: nitinme
ms.service: azure-ai-translator
ms.author: lajanuar
ms.date: 05/19/2025
ms.topic: overview
---

# Azure AI Translator custom translation for beginners
[Custom translation](overview.md) empowers you to build a translation system that captures your business's unique language, industry terminology, and domain-specific style. With an intuitive interface, training, testing, and deploying your custom model is simple and requires no programming expertise. Seamlessly integrate your tailored translation system into your existing applications, workflows, and websites, all backed by the same cloud-based [Azure AI Translator Text Translation API](../../text-translation/reference/v3/translate.md?tabs=curl) service that powers billions of translations every day.

The platform enables users to build and publish custom translation systems to and from English. Custom translation supports more than 100 languages that map directly to the languages available for neural machine translation (NMT). For a complete list, *see* [Translator language support](../../../language-support.md).
## Is a custom translation model the right choice for you?

A well-trained custom translation model excels at delivering accurate, domain-specific translations by learning from your previously translated in-domain documents. This approach ensures that your specialized terms and phrases are used in context, producing fluent, natural translations that respect the target language's grammatical nuances.

Keep in mind that developing a full custom translation model requires a substantial amount of training data—typically at least 10,000 parallel sentences. If you don't have enough data to train a comprehensive model, you might consider building a dictionary-only model to capture essential terminology, or you can rely on the high-quality, out-of-the-box translations offered by the Text Translation API.

Ultimately, if you need translations that reflect your industry's specific language and you have ample training resources, a custom translation model can be the ideal choice for your organization.

:::image type="content" source="../media/how-to/for-beginners.png" alt-text="Screenshot illustrating the difference between custom and general models.":::

## What does training a custom translation model involve?

Building a custom translation model requires:

* Understanding your use case.
* Obtaining in-domain translated data (preferably human translated).
* Assessing translation quality or target language translations.

## How do I evaluate my use case?

Having clarity on your use case and what success looks like is the first step towards sourcing proficient training data. Here are a few considerations:

* Is your desired outcome specified, and how is it measured?
* Is your business domain identified?
* Do you have in-domain sentences of similar terminology and style?
* Does your use case involve multiple domains? If yes, should you build one translation system or multiple systems?
* Do you have requirements impacting regional data residency at rest and in transit?
* Are the target users in one or multiple regions?

## How should I source my data?

Finding in-domain quality data is often a challenging task that varies based on user classification. Here are some questions you can ask yourself as you evaluate what data is available to you:

* Does your company have previous translation data available that you can use? Enterprises often have a wealth of translation data accumulated over many years of using human translation.
* Do you have a vast amount of monolingual data? Monolingual data is data in only one language. If so, can you get translations for this data?
* Can you crawl online portals to collect source sentences and synthesize target sentences?

## What should I use for training material?

| Source | What it does | Rules to follow |
|---|---|---|
| Bilingual training documents | Teaches the system your terminology and style. | **Be liberal**. Any in-domain human translation is better than machine translation. Add and remove documents as you go and try to improve the [BLEU score](concepts/bleu-score.md?WT.mc_id=aiml-43548-heboelma). |
| Tuning documents | Trains the neural machine translation parameters. | **Be strict**. Compose them to be optimally representative of what you're going to translate in the future. |
| Test documents | Calculate the [BLEU score](concepts/bleu-score.md?WT.mc_id=aiml-43548-heboelma). | **Be strict**. Compose test documents to be optimally representative of what you plan to translate in the future. |
| Phrase dictionary | Forces the given translation 100% of the time. | **Be restrictive**. A phrase dictionary is case-sensitive, and any word or phrase listed is translated in the way you specify. In many cases, it's better to not use a phrase dictionary and let the system learn. |
| Sentence dictionary | Forces the given translation 100% of the time. | **Be strict**. A sentence dictionary is case-insensitive and good for common in-domain short sentences. For a sentence dictionary match to occur, the entire submitted sentence must match the source dictionary entry. If only a portion of the sentence matches, the entry doesn't match. |
## What is a BLEU score?

BLEU (Bilingual Evaluation Understudy) is an algorithm for evaluating the precision or accuracy of text that is machine translated from one language to another. Custom translation uses the BLEU metric as one way of conveying translation accuracy.

A BLEU score is a number between zero and 100. A score of zero indicates a low-quality translation where nothing in the translation matched the reference. A score of 100 indicates a perfect translation that is identical to the reference. It's not necessary to attain a score of 100; a BLEU score between 40 and 60 indicates a high-quality translation.

[Read more](concepts/bleu-score.md?WT.mc_id=aiml-43548-heboelma)
## What happens if I don't submit tuning or testing data?

Tuning and test sentences are optimally representative of what you plan to translate in the future. If you don't submit any tuning or testing data, custom translation automatically excludes sentences from your training documents to use as tuning and test data.

| System-generated | Manual selection |
|---|---|
| Convenient. | Enables fine-tuning for your future needs. |
| Good, if you know that your training data is representative of what you're planning to translate. | Provides more freedom to compose your training data. |
| Easy to redo when you grow or shrink the domain. | Allows for more data and better domain coverage. |
| Changes each training run. | Remains static over repeated training runs. |
## How is training material processed by custom translation?

To prepare for training, documents undergo a series of processing and filtering steps. Knowledge of the filtering process can help with understanding the sentence count displayed, as well as the steps you can take to prepare training documents for training with custom translation. The filtering steps are as follows:

* ### Sentence alignment

  If your document isn't in `XLIFF`, `XLSX`, `TMX`, or `ALIGN` format, custom translation aligns the sentences of your source and target documents to each other, sentence-by-sentence. Translator doesn't perform document alignment—it follows your naming convention for the documents to find a matching document in the other language. Within the source text, custom translation tries to find the corresponding sentence in the target language. It uses document markup like embedded HTML tags to help with the alignment.

  If you see a large discrepancy between the number of sentences in the source and target documents, your source document might not be parallel or couldn't be aligned. Document pairs with a large difference (>10%) in sentence count on each side warrant a second look to make sure they're indeed parallel.

* ### Tuning and testing data extraction

  Tuning and testing data is optional. If you don't provide it, the system removes an appropriate percentage from your training documents to use for tuning and testing. The removal happens dynamically as part of the training process. Since this step occurs as part of training, your uploaded documents aren't affected. You can see the final used sentence counts for each category of data—training, tuning, testing, and dictionary—on the Model details page after training succeeds.

* ### Length filter

  A sketch of these rules appears after this list.

  * Removes sentences with only one word on either side.
  * Removes sentences with more than 100 words on either side. Chinese, Japanese, and Korean are exempt.
  * Removes sentences with fewer than three characters. Chinese, Japanese, and Korean are exempt.
  * Removes sentences with more than 2,000 characters for Chinese, Japanese, and Korean.
  * Removes sentences with less than 1% alphanumeric characters.
  * Removes dictionary entries containing more than 50 words.

* ### White space

  * Replaces any sequence of white-space characters, including tabs and CR/LF sequences, with a single space character.
  * Removes leading or trailing space in the sentence.

* ### Sentence end punctuation

  * Replaces multiple sentence-end punctuation characters with a single instance.

* ### Japanese character normalization

  * Converts full-width letters and digits to half-width characters.

* ### Unescaped XML tags

  Transforms unescaped tags into escaped tags:

  | Tag | Becomes |
  |---|---|
  | \&lt; | \&amp;lt; |
  | \&gt; | \&amp;gt; |
  | \&amp; | \&amp;amp; |

* ### Invalid characters

  Custom translation removes sentences that contain Unicode character U+FFFD. The character U+FFFD indicates a failed encoding conversion.

* ### Invalid HTML tags

  Custom translation removes valid tags during training. Invalid tags cause unpredictable results and should be manually removed.
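As promised above, here's a rough Python rendering of the length-filter rules. It's an illustrative approximation under stated assumptions (simple whitespace tokenization, a boolean CJK flag per language pair), not the service's actual implementation.

```python
def passes_length_filter(source: str, target: str, is_cjk: bool) -> bool:
    """Approximate the length-filter rules; is_cjk marks Chinese/Japanese/Korean."""
    for sentence in (source, target):
        words = sentence.split()
        if len(words) <= 1:                      # only one word on either side
            return False
        if is_cjk:
            if len(sentence) > 2000:             # more than 2,000 characters (CJK)
                return False
        else:
            if len(words) > 100:                 # more than 100 words (non-CJK)
                return False
            if len(sentence) < 3:                # fewer than three characters (non-CJK)
                return False
        alphanumeric = sum(ch.isalnum() for ch in sentence)
        if alphanumeric < 0.01 * len(sentence):  # less than 1% alphanumeric
            return False
    return True

# Example: a short parallel pair survives; a one-word pair doesn't.
print(passes_length_filter("The cat sat.", "El gato se sentó.", is_cjk=False))  # True
print(passes_length_filter("Hello", "Hola", is_cjk=False))                      # False
```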
## What steps should I take before uploading data?

* Remove sentences with invalid encoding.
* Remove Unicode control characters.
* Align sentences (source-to-target), if feasible.
* Remove source and target sentences that don't match the source and target languages.
* When source and target sentences have mixed languages, ensure that untranslated words are intentional, for example, names of organizations and products.
* Avoid teaching errors to your model by making certain that grammar and typography are correct.
* Have one source sentence mapped to one target sentence. Although our training process handles source and target lines containing multiple sentences, one-to-one mapping is a best practice.
* Remove invalid HTML tags before uploading training data. A sketch of a few of these cleanup steps follows this list.
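The sketch below approximates a few of these cleanup steps in Python. The tag-stripping regex and the control-character test are illustrative assumptions, not an official preprocessing specification.

```python
import re
import unicodedata

HTML_TAG = re.compile(r"<[^>]+>")  # crude tag matcher, for illustration only

def clean_sentence(sentence: str) -> str | None:
    """Return a cleaned sentence, or None if it should be dropped."""
    # Drop sentences with invalid encoding; U+FFFD marks a failed conversion.
    if "\ufffd" in sentence:
        return None
    # Replace Unicode control characters (general category "Cc") with spaces.
    sentence = "".join(" " if unicodedata.category(ch) == "Cc" else ch for ch in sentence)
    # Strip HTML tags before upload.
    sentence = HTML_TAG.sub("", sentence)
    # Collapse any resulting runs of whitespace.
    sentence = re.sub(r"\s+", " ", sentence).strip()
    return sentence or None

print(clean_sentence("<b>Contoso</b>\tproducts ship worldwide."))  # Contoso products ship worldwide.
```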
## How do I evaluate the results?

After your model is successfully trained, you can view the model's BLEU score and the baseline model BLEU score on the model details page. We use the same set of test data to generate both the model's BLEU score and the baseline BLEU score. This data helps you make an informed decision regarding which model would be better for your use case.

## Next steps

> [!div class="nextstepaction"]
> [Try creating a project](../azure-ai-foundry/how-to/create-project.md)
Lines changed: 35 additions & 0 deletions (new file)

---
title: Azure AI Foundry custom translation BLEU score
titleSuffix: Azure AI services
description: The BLEU score measures the differences between machine translation and human-created reference translations of the same source sentence.
author: laujan
manager: nitinme
ms.service: azure-ai-translator
ms.topic: conceptual
ms.date: 05/19/2025
ms.author: lajanuar
ms.custom: cogserv-non-critical-translator
#Customer intent: As a custom translation user, I want to understand how the BLEU score works so that I understand system test outcomes better.
---

# Azure AI Foundry custom translation BLEU score

[BLEU (Bilingual Evaluation Understudy)](https://en.wikipedia.org/wiki/BLEU) is a measurement of the difference between an automatic translation and human-created reference translations of the same source sentence.

## Scoring process

The BLEU algorithm compares consecutive phrases of the automatic translation with the consecutive phrases it finds in the reference translation, and counts the number of matches in a weighted fashion. These matches are position independent. A higher match degree indicates a higher degree of similarity with the reference translation and a higher score. Intelligibility and grammatical correctness aren't taken into account.
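To make the scoring mechanics concrete, here's a minimal Python sketch of a BLEU-style score with uniform weights over 1-4 gram precisions and a brevity penalty. It's illustrative only; production BLEU implementations add smoothing and other refinements that this sketch omits.

```python
import math
from collections import Counter

def ngrams(tokens: list[str], n: int) -> Counter:
    """Count the n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate: str, reference: str, max_n: int = 4) -> float:
    """Toy BLEU: geometric mean of clipped n-gram precisions times a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        # Clip each candidate n-gram count by its count in the reference.
        matches = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        log_precisions.append(math.log(max(matches, 1e-9) / total))
    # Penalize candidates shorter than the reference.
    brevity = min(1.0, math.exp(1 - len(ref) / max(len(cand), 1)))
    return 100 * brevity * math.exp(sum(log_precisions) / max_n)

print(round(bleu("the cat sat on the mat", "the cat sat on the mat"), 1))  # 100.0
```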
## How does BLEU work?

The BLEU score's strength is that it correlates well with human judgment. BLEU averages out individual sentence judgment errors over a test corpus, rather than attempting to devise the exact human judgment for every sentence.

For a more extensive discussion of BLEU scores, *see* [Microsoft Translator Hub - Discussion of BLEU Score](https://youtu.be/-UqDljMymMg). BLEU results depend strongly on the breadth of your domain; the consistency of test, training, and tuning data; and how much data you have available for training. If your models are trained within a narrow domain, and your training data is consistent with your test data, you can expect a high BLEU score.

> [!NOTE]
> A comparison between BLEU scores is only justifiable when BLEU results are compared with the same test set, the same language pair, and the same MT engine. A BLEU score from a different test set is bound to be different.

## Next steps

> [!div class="nextstepaction"]
> [BLEU score evaluation](../how-to/test-model.md)
