Commit 165ec7a

Merge pull request #187216 from Ja-Dunn/speech-service-project-jd-batch2
edit pass: speech-service-project-jd-batch2
2 parents f878bae + fbdf7f2 commit 165ec7a

5 files changed

+260
-252
lines changed

articles/cognitive-services/Speech-Service/conversation-transcription.md

Lines changed: 29 additions & 26 deletions
@@ -1,7 +1,7 @@
11
---
2-
title: Conversation Transcription (Preview) - Speech service
2+
title: Conversation transcription (preview) - Speech service
33
titleSuffix: Azure Cognitive Services
4-
description: Conversation Transcription is a solution for meetings, that combines recognition, speaker ID, and diarization to provide transcription of any conversation.
4+
description: You use the conversation transcription feature for meetings. It combines recognition, speaker ID, and diarization to provide transcription of any conversation.
55
services: cognitive-services
66
author: eric-urban
77
manager: nitinme
@@ -13,70 +13,73 @@ ms.author: eur
1313
ms.custom: ignite-fall-2021
1414
---
1515

16-
# What is Conversation Transcription (Preview)?
16+
# What is conversation transcription (preview)?
1717

18-
Conversation Transcription is a [speech-to-text](speech-to-text.md) solution that provides real-time or asynchronous transcription of any conversation. Conversation Transcription combines speech recognition, speaker identification, and sentence attribution to determine who said what and when in a conversation.
18+
Conversation transcription is a [speech-to-text](speech-to-text.md) solution that provides real-time or asynchronous transcription of any conversation. This feature, which is currently in preview, combines speech recognition, speaker identification, and sentence attribution to determine who said what, and when, in a conversation.
1919

2020
## Key features
2121

22-
- **Timestamps** - Each speaker utterance has a timestamp, so that you can easily find when a phrase was said.
23-
- **Readable transcripts** - Transcripts have formatting and punctuation added automatically to ensure the text closely matches what was being said.
24-
- **User profiles** - User profiles are generated by collecting user voice samples and sending them to signature generation.
25-
- **Speaker identification** - Speakers are identified using user profiles and a _speaker identifier_ is assigned to each.
26-
- **Multi-speaker diarization** - Determine who said what by synthesizing the audio stream with each speaker identifier.
27-
- **Real-time transcription** – Provide live transcripts of who is saying what and when while the conversation is happening.
28-
- **Asynchronous transcription** – Provide transcripts with higher accuracy by using a multichannel audio stream.
22+
You might find the following features of conversation transcription useful:
23+
24+
- **Timestamps:** Each speaker utterance has a timestamp, so that you can easily find when a phrase was said.
25+
- **Readable transcripts:** Transcripts have formatting and punctuation added automatically to ensure the text closely matches what was being said.
26+
- **User profiles:** User profiles are generated by collecting user voice samples and sending them to signature generation.
27+
- **Speaker identification:** Speakers are identified by using user profiles, and a _speaker identifier_ is assigned to each.
28+
- **Multi-speaker diarization:** Determine who said what by synthesizing the audio stream with each speaker identifier.
29+
- **Real-time transcription:** Provide live transcripts of who is saying what, and when, while the conversation is happening.
30+
- **Asynchronous transcription:** Provide transcripts with higher accuracy by using a multichannel audio stream.
2931

3032
> [!NOTE]
31-
> Although Conversation Transcription does not put a limit on the number of speakers in the room, it is optimized for 2-10 speakers per session.
33+
> Although conversation transcription doesn't put a limit on the number of speakers in the room, it's optimized for 2-10 speakers per session.
3234
3335
## Get started
3436

3537
See the real-time conversation transcription [quickstart](how-to-use-conversation-transcription.md) to get started.
3638

3739
## Use cases
3840

39-
To make meetings inclusive for everyone, such as participants who are deaf and hard of hearing, it is important to have transcription in real time. Conversation Transcription in real-time mode takes meeting audio and determines who is saying what, allowing all meeting participants to follow the transcript and participate in the meeting without a delay.
40-
41-
### Improved efficiency
41+
To make meetings inclusive for everyone, such as participants who are deaf and hard of hearing, it's important to have transcription in real time. Conversation transcription in real-time mode takes meeting audio and determines who is saying what, allowing all meeting participants to follow the transcript and participate in the meeting, without a delay.
4242

43-
Meeting participants can focus on the meeting and leave note-taking to Conversation Transcription. Participants can actively engage in the meeting and quickly follow up on next steps, using the transcript instead of taking notes and potentially missing something during the meeting.
43+
Meeting participants can focus on the meeting and leave note-taking to conversation transcription. Participants can actively engage in the meeting and quickly follow up on next steps, using the transcript instead of taking notes and potentially missing something during the meeting.
4444

4545
## How it works
4646

47-
This is a high-level overview of how Conversation Transcription works.
47+
The following diagram shows a high-level overview of how the feature works.
4848

49-
![The Import Conversation Transcription Diagram](media/scenarios/conversation-transcription-service.png)
49+
![Diagram that shows the relationships among different pieces of the conversation transcription solution.](media/scenarios/conversation-transcription-service.png)
5050

5151
## Expected inputs
5252

53-
- **Multi-channel audio stream** – For specification and design details, see [Microphone array recommendations](./speech-sdk-microphone.md).
54-
- **User voice samples** – Conversation Transcription needs user profiles in advance of the conversation for speaker identification. You will need to collect audio recordings from each user, then send the recordings to the [Signature Generation Service](https://aka.ms/cts/signaturegenservice) to validate the audio and generate user profiles.
53+
Conversation transcription uses two types of inputs:
54+
55+
- **Multi-channel audio stream:** For specification and design details, see [Microphone array recommendations](./speech-sdk-microphone.md).
56+
- **User voice samples:** Conversation transcription needs user profiles in advance of the conversation for speaker identification. Collect audio recordings from each user, and then send the recordings to the [signature generation service](https://aka.ms/cts/signaturegenservice) to validate the audio and generate user profiles.
57+
58+
User voice samples for voice signatures are required for speaker identification. Speakers who don't have voice samples are recognized as *unidentified*. Unidentified speakers can still be differentiated when the `DifferentiateGuestSpeakers` property is enabled (see the following example). The transcription output then shows speakers as, for example, *Guest_0* and *Guest_1*, instead of recognizing them as pre-enrolled specific speaker names.
5559

56-
User voice samples for voice signatures are required for speaker identification. Speakers who do not have voice samples will be recognized as "Unidentified". Unidentified speakers can still be differentiated when the `DifferentiateGuestSpeakers` property is enabled (see example below). The transcription output will then show speakers as "Guest_0", "Guest_1", etc. instead of recognizing as pre-enrolled specific speaker names.
5760
```csharp
5861
config.SetProperty("DifferentiateGuestSpeakers", "true");
5962
```
6063

6164
## Real-time vs. asynchronous
6265

63-
Conversation Transcription offers three transcription modes:
66+
The following sections provide more detail about transcription modes you can choose.
6467

6568
### Real-time
6669

67-
Audio data is processed live to return speaker identifier + transcript. Select this mode if your transcription solution requirement is to provide conversation participants a live transcript view of their ongoing conversation. For example, building an application to make meetings more accessible the deaf and hard of hearing participants is an ideal use case for real-time transcription.
70+
Audio data is processed live to return the speaker identifier and transcript. Select this mode if your transcription solution requirement is to provide conversation participants a live transcript view of their ongoing conversation. For example, building an application to make meetings more accessible to participants with hearing loss or deafness is an ideal use case for real-time transcription.
6871

6972
### Asynchronous
7073

71-
Audio data is batch processed to return speaker identifier and transcript. Select this mode if your transcription solution requirement is to provide higher accuracy without live transcript view. For example, if you want to build an application to allow meeting participants to easily catch up on missed meetings, then use the asynchronous transcription mode to get high-accuracy transcription results.
74+
Audio data is batch processed to return the speaker identifier and transcript. Select this mode if your transcription solution requirement is to provide higher accuracy, without the live transcript view. For example, if you want to build an application to allow meeting participants to easily catch up on missed meetings, then use the asynchronous transcription mode to get high-accuracy transcription results.
7275

7376
### Real-time plus asynchronous
7477

75-
Audio data is processed live to return speaker identifier + transcript, and, in addition, a request is created to also get a high-accuracy transcript through asynchronous processing. Select this mode if your application has a need for real-time transcription but also requires a higher accuracy transcript for use after the conversation or meeting occurred.
78+
Audio data is processed live to return the speaker identifier and transcript, and, in addition, a high-accuracy transcript is requested through asynchronous processing. Select this mode if your application needs real-time transcription and also requires a higher-accuracy transcript for use after the conversation or meeting has occurred.
7679

7780
## Language support
7881

79-
Currently, Conversation Transcription supports [all speech-to-text languages](language-support.md#speech-to-text) in the following regions: `centralus`, `eastasia`, `eastus`, `westeurope`. If you require additional locale support, contact the [Conversation Transcription Feature Crew](mailto:[email protected]).
82+
Currently, conversation transcription supports [all speech-to-text languages](language-support.md#speech-to-text) in the following regions: `centralus`, `eastasia`, `eastus`, `westeurope`.
8083

8184
## Next steps
8285

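As a rough sketch of where the `DifferentiateGuestSpeakers` setting from the diff above might sit, the fragment below builds a `SpeechConfig` with the Speech SDK for C# and then enables the property. The article doesn't show how `config` is created, so the `SpeechConfig.FromSubscription` call, the key placeholder, and the choice of `centralus` (one of the four regions the article lists) are assumptions for illustration. It also assumes that the `Microsoft.CognitiveServices.Speech` NuGet package is referenced.

```csharp
using Microsoft.CognitiveServices.Speech;

// Assumption: placeholder key; the region is one of those listed in the article.
var config = SpeechConfig.FromSubscription("YourSubscriptionKey", "centralus");

// From the article: differentiate speakers who have no enrolled voice signature,
// so that they're labeled Guest_0, Guest_1, and so on, instead of unidentified.
config.SetProperty("DifferentiateGuestSpeakers", "true");
```

Only the `SetProperty` line comes from the article. Everything around it is scaffolding to show the property being applied to a configuration object before a transcriber is created.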