Commit cf32847

Revised custom tutorial based on testing

1 parent 50c33d0
File tree

1 file changed (+10, -10)
articles/search/tutorial-create-custom-analyzer.md

Lines changed: 10 additions & 10 deletions
@@ -13,7 +13,7 @@ ms.date: 03/07/2024

# Tutorial: Create a custom analyzer for phone numbers

16- In search solutions, strings that have complex patterns or special characters can be a challenge to work with because the [default analyzer](search-analyzers.md) strips out or misinterprets meaningful parts of a pattern, resulting in a poor search experience when users can't find the information they expected. Phone numbers are a classic example of strings that are hard to analyze. They come in a variety of formats, and they include special characters that the default analyzer ignores.
16+ In search solutions, strings that have complex patterns or special characters can be a challenge to work with because the [default analyzer](search-analyzers.md) strips out or misinterprets meaningful parts of a pattern, resulting in a poor search experience when users can't find the information they expected. Phone numbers are a classic example of strings that are hard to analyze. They come in various formats, and they include special characters that the default analyzer ignores.

With phone numbers as its subject, this tutorial takes a close look at the problems of patterned data, and shows you how to solve that problem using a [custom analyzer](index-add-custom-analyzers.md). The approach outlined here can be used as-is for phone numbers, or adapted for fields having the same characteristics (patterned, with special characters), such as URLs, emails, postal codes, and dates.

@@ -35,7 +35,7 @@ The following services and tools are required for this tutorial.

### Download files

38- Source code for this tutorial is in the [custom-analyzers](https://github.com/Azure-Samples/azure-search-postman-samples/tree/main/custom-analyzers) folder in the [Azure-Samples/azure-search-postman-samples](https://github.com/Azure-Samples/azure-search-postman-samples) GitHub repository.
38+ Source code for this tutorial is the [custom-analyzer.rest](https://github.com/Azure-Samples/azure-search-postman-samples/tree/main/custom-analyzers/custom-analyzer.rest) file in the [Azure-Samples/azure-search-postman-samples](https://github.com/Azure-Samples/azure-search-postman-samples) GitHub repository.

### Copy a key and URL

@@ -96,7 +96,7 @@ A valid API key establishes trust, on a per request basis, between the applicati

1. Select **Send request**. You should have an `HTTP/1.1 201 Created` response and the response body should include the JSON representation of the index schema.

99- 1. Load data into the index, using documents that contain a variety of phone number formats. This is your test data.
99+ 1. Load data into the index, using documents that contain various phone number formats. This is your test data.

```http
### Load documents
@@ -226,11 +226,11 @@ Analyzers consist of three components:

+ A [**Tokenizer**](#Tokenizers) that breaks the input text into tokens, which become keys in the search index.
+ [**Token filters**](#TokenFilters) that manipulate the tokens produced by the tokenizer.

229- In the diagram below, you can see how these three components work together to tokenize a sentence:
229+ In the following diagram, you can see how these three components work together to tokenize a sentence:

:::image type="content" source="media/tutorial-create-custom-analyzer/analyzers-explained.png" alt-text="Diagram of Analyzer process to tokenize a sentence":::

233- These tokens are then stored in an inverted index, which allows for fast, full-text searches. An inverted index enables full-text search by mapping all unique terms extracted during lexical analysis to the documents in which they occur. You can see an example in the diagram below:
233+ These tokens are then stored in an inverted index, which allows for fast, full-text searches. An inverted index enables full-text search by mapping all unique terms extracted during lexical analysis to the documents in which they occur. You can see an example in the next diagram:

:::image type="content" source="media/tutorial-create-custom-analyzer/inverted-index-explained.png" alt-text="Example inverted index":::
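The flow this hunk describes (char filter → tokenizer → token filters, with the resulting tokens stored in an inverted index) can be sketched in a few lines of Python. This is a toy illustration under simplified assumptions — stripping punctuation, splitting on whitespace, and lowercasing — not Azure AI Search code:

```python
import re
from collections import defaultdict

def analyze(text):
    """Toy analyzer: a char filter strips punctuation, a tokenizer splits
    on whitespace, and a token filter lowercases each token."""
    filtered = re.sub(r"[^\w\s]", "", text)  # char filter
    tokens = filtered.split()                # tokenizer
    return [t.lower() for t in tokens]       # token filter

def build_inverted_index(docs):
    """Map each unique term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index

docs = {1: "The dog barks.", 2: "A dog and a cat."}
index = build_inverted_index(docs)
print(index["dog"])  # → {1, 2}
```

Looking up a term in `index` is a single dictionary access, which is what makes full-text search over the inverted index fast.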
@@ -242,7 +242,7 @@ All of search comes down to searching for the terms stored in the inverted index

:::image type="content" source="media/tutorial-create-custom-analyzer/query-architecture-explained.png" alt-text="Diagram of Analyzer process ranking similarity":::

245- If the query terms don't match the terms in your inverted index, results won't be returned. To learn more about how queries work, see this article on [full text search](search-lucene-query-architecture.md).
245+ If the query terms don't match the terms in your inverted index, results aren't returned. To learn more about how queries work, see this article on [full text search](search-lucene-query-architecture.md).

> [!Note]
> [Partial term queries](search-query-partial-matching.md) are an important exception to this rule. Unlike regular term queries, these queries (prefix query, wildcard query, regex query) bypass the lexical analysis process. Partial terms are only lowercased before being matched against terms in the index. If an analyzer isn't configured to support these types of queries, you'll often receive unexpected results because matching terms don't exist in the index.
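The matching rule in this hunk — no matching term in the index means no result, and partial terms are only lowercased before matching — can be illustrated with a toy lookup. The index contents and function names here are illustrative assumptions, not the service's implementation:

```python
# A tiny inverted index: term -> set of document ids.
index = {"3215550199": {1}, "4255550100": {2}}

def term_query(term):
    # A regular term query matches only exact terms stored in the index.
    return index.get(term, set())

def prefix_query(prefix):
    # A partial-term (prefix) query is only lowercased, then compared
    # against the raw terms in the index; no other analysis happens.
    p = prefix.lower()
    return {doc for term, docs in index.items() if term.startswith(p) for doc in docs}

print(term_query("(321) 555-0199"))  # → set(): the punctuated form was never indexed
print(prefix_query("321"))           # → {1}
```

The first lookup fails because the analyzer stored only the stripped form of the number, which is exactly why the indexing analyzer and the query terms must agree.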
@@ -350,7 +350,7 @@ For phone numbers, we want to remove whitespace and special characters because n

]
```

353- The filter above will remove `-` `(` `)` `+` `.` and spaces from the input.
353+ The filter removes `-` `(` `)` `+` `.` and spaces from the input.

|Input|Output|
|-|-|
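The effect of a mapping character filter like the one in this hunk — mapping `-`, `(`, `)`, `+`, `.`, and spaces to nothing — can be approximated in plain Python. This is a sketch of the behavior, not how the service implements it:

```python
def strip_phone_chars(text):
    """Remove the characters the mapping char filter maps to empty string."""
    for ch in "-()+. ":
        text = text.replace(ch, "")
    return text

print(strip_phone_chars("(321) 555-0199"))     # → 3215550199
print(strip_phone_chars("+1 (425) 555-0100"))  # → 14255550100
```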
@@ -609,7 +609,7 @@ The analyzer described in the previous section is designed to maximize the flexi

The following example shows an alternative analyzer that's more efficient in tokenization, but has drawbacks.

612- Given an input of `14255550100`, the analyzer can't logically chunk the phone number. For example, it can't separate the country code, `1`, from the area code, `425`. This discrepancy would lead to the number above not being returned if a user didn't include a country code in their search.
612+ Given an input of `14255550100`, the analyzer can't logically chunk the phone number. For example, it can't separate the country code, `1`, from the area code, `425`. This discrepancy would lead to the phone number not being returned if a user didn't include a country code in their search.

```json
"analyzers": [
@@ -640,13 +640,13 @@ Given an input of `14255550100`, the analyzer can't logically chunk the phone nu

]
```

643- You can see in the example below that the phone number is split into the chunks you would normally expect a user to be searching for.
643+ You can see in the following example that the phone number is split into the chunks you would normally expect a user to be searching for.

|Input|Output|
|-|-|
|`(321) 555-0199`|`[321, 555, 0199, 321555, 5550199, 3215550199]`|

649- Depending on your requirements, this may be a more efficient approach to the problem.
649+ Depending on your requirements, this might be a more efficient approach to the problem.

## Takeaways
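The token list in the table above — each digit group plus every contiguous concatenation of groups — is the kind of expansion a shingle-style token filter produces. A hypothetical Python sketch of that chunking (the function name and grouping rule are assumptions for illustration):

```python
import re

def chunk_phone(text):
    """Split a phone number into digit groups, then emit every contiguous
    concatenation of those groups (a shingle-like expansion)."""
    groups = re.findall(r"\d+", text)
    tokens = []
    for size in range(1, len(groups) + 1):
        for start in range(len(groups) - size + 1):
            tokens.append("".join(groups[start:start + size]))
    return tokens

print(chunk_phone("(321) 555-0199"))
# → ['321', '555', '0199', '321555', '5550199', '3215550199']
```

Because every chunk a user is likely to type (area code, local number, full number) is indexed as its own token, exact term matches succeed without any query-time expansion.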
0 commit comments