Skip to content

Commit 9d0fa43

Browse files
committed
clarified analyzers used for special character preservation
1 parent ab8bed1 commit 9d0fa43

File tree

1 file changed

+23
-17
lines changed

1 file changed

+23
-17
lines changed

articles/search/query-simple-syntax.md

Lines changed: 23 additions & 17 deletions
Original file line numberDiff line numberDiff line change
@@ -8,7 +8,7 @@ author: bevloh
88
ms.author: beloh
99
ms.service: cognitive-search
1010
ms.topic: conceptual
11-
ms.date: 03/16/2022
11+
ms.date: 10/27/2022
1212
---
1313

1414
# Simple query syntax in Azure Cognitive Search
@@ -21,7 +21,7 @@ Although the simple parser is based on the [Apache Lucene Simple Query Parser](h
2121

2222
## Example (simple syntax)
2323

24-
This example shows a simple query, distinguished by `"queryType": "simple"` and valid syntax. Although query type is set below, it's the default and can be omitted unless you are reverting from an alternative type. The following example is a search over independent terms, with a requirement that all matching documents include "pool".
24+
This example shows a simple query, distinguished by `"queryType": "simple"` and valid syntax. Although query type is set below, it's the default and can be omitted unless you're reverting from an alternative type. The following example is a search over independent terms, with a requirement that all matching documents include "pool".
2525

2626
```http
2727
POST https://{{service-name}}.search.windows.net/indexes/hotel-rooms-sample/docs/search?api-version=2020-06-30
@@ -34,7 +34,7 @@ POST https://{{service-name}}.search.windows.net/indexes/hotel-rooms-sample/docs
3434

3535
The "searchMode" parameter is relevant in this example. Whenever boolean operators are on the query, you should generally set `"searchMode=all"` to ensure that *all* of the criteria is matched. Otherwise, you can use the default `"searchMode=any"` that favors recall over precision.
3636

37-
For additional examples, see [Simple query syntax examples](search-query-simple-examples.md). For details about the query request and parameters, see [Search Documents (REST API)](/rest/api/searchservice/Search-Documents).
37+
For more examples, see [Simple query syntax examples](search-query-simple-examples.md). For details about the query request and parameters, see [Search Documents (REST API)](/rest/api/searchservice/Search-Documents).
3838

3939
## Keyword search on terms and phrases
4040

@@ -46,21 +46,21 @@ Strings passed to the "search" parameter can include terms or phrases in any sup
4646

4747
Depending on your search client, you might need to escape the quotation marks in a phrase search. For example, in Postman in a POST request, a phrase search on `"Roach Motel"` in the request body would be specified as `"\"Roach Motel\""`.
4848

49-
By default, all strings passed in the "search" parameter undergo lexical analysis. Make sure you understand the tokenization behavior of the analyzer you are using. Often, when query results are unexpected, the reason can be traced to how terms are tokenized at query time. You can [test tokenization on specific strings](/rest/api/searchservice/test-analyzer) to confirm the output.
49+
By default, all strings passed in the "search" parameter undergo lexical analysis. Make sure you understand the tokenization behavior of the analyzer you're using. Often, when query results are unexpected, the reason can be traced to how terms are tokenized at query time. You can [test tokenization on specific strings](/rest/api/searchservice/test-analyzer) to confirm the output.
5050

5151
Any text input with one or more terms is considered a valid starting point for query execution. Azure Cognitive Search will match documents containing any or all of the terms, including any variations found during analysis of the text.
5252

53-
As straightforward as this sounds, there is one aspect of query execution in Azure Cognitive Search that *might* produce unexpected results, increasing rather than decreasing search results as more terms and operators are added to the input string. Whether this expansion actually occurs depends on the inclusion of a NOT operator, combined with a "searchMode" parameter setting that determines how NOT is interpreted in terms of AND or OR behaviors. For more information, see the NOT operator under [Boolean operators](#boolean-operators).
53+
As straightforward as this sounds, there's one aspect of query execution in Azure Cognitive Search that *might* produce unexpected results, increasing rather than decreasing search results as more terms and operators are added to the input string. Whether this expansion actually occurs depends on the inclusion of a NOT operator, combined with a "searchMode" parameter setting that determines how NOT is interpreted in terms of AND or OR behaviors. For more information, see the NOT operator under [Boolean operators](#boolean-operators).
5454

5555
## Boolean operators
5656

57-
You can embed Boolean operators in a query string to improve the precision of a match. In the simple syntax, boolean operators are character-based. Text operators, such as the word AND, are not supported.
57+
You can embed Boolean operators in a query string to improve the precision of a match. In the simple syntax, boolean operators are character-based. Text operators, such as the word AND, aren't supported.
5858

5959
| Character | Example | Usage |
6060
|----------- |--------|-------|
6161
| `+` | `pool + ocean` | An AND operation. For example, `pool + ocean` stipulates that a document must contain both terms.|
6262
| `|` | `pool | ocean` | An OR operation finds a match when either term is found. In the example, the query engine will return match on documents containing either `pool` or `ocean` or both. Because OR is the default conjunction operator, you could also leave it out, such that `pool ocean` is the equivalent of `pool | ocean`.|
63-
| `-` | `pool – ocean` | A NOT operation returns matches on documents that exclude the term. </p>To get the expected behavior on a NOT expression, set `"searchMode=all"` on the request. Otherwise, under the default of `"searchMode=any"`, you will get matches on `pool`, plus matches on all documents that do not contain `ocean`, which could be a lot of documents. The "searchMode" parameter on a query request controls whether a term with the NOT operator is ANDed or ORed with other terms in the query (assuming there is no `+` or `|` operator on the other terms). Using `"searchMode=all"` increases the precision of queries by including fewer results, and by default - will be interpreted as "AND NOT". </p>When deciding on a "searchMode" setting, consider the user interaction patterns for queries in various applications. Users who are searching for information are more likely to include an operator in a query, as opposed to e-commerce sites that have more built-in navigation structures. |
63+
| `-` | `pool – ocean` | A NOT operation returns matches on documents that exclude the term. </p>To get the expected behavior on a NOT expression, set `"searchMode=all"` on the request. Otherwise, under the default of `"searchMode=any"`, you'll get matches on `pool`, plus matches on all documents that don't contain `ocean`, which could be a lot of documents. The "searchMode" parameter on a query request controls whether a term with the NOT operator is ANDed or ORed with other terms in the query (assuming there's no `+` or `|` operator on the other terms). Using `"searchMode=all"` increases the precision of queries by including fewer results, and by default - will be interpreted as "AND NOT". </p>When deciding on a "searchMode" setting, consider the user interaction patterns for queries in various applications. Users who are searching for information are more likely to include an operator in a query, as opposed to e-commerce sites that have more built-in navigation structures. |
6464

6565
<a name="prefix-search"></a>
6666

@@ -72,7 +72,7 @@ For "starts with" queries, add a suffix operator (`*`) as the placeholder for th
7272
|----------- |--------|-------|
7373
| `*` | `lingui*` will match on "linguistic" or "linguini" | The asterisk (`*`) represents one or more characters of arbitrary length, ignoring case. |
7474

75-
Similar to filters, a prefix query looks for an exact match. As such, there is no relevance scoring (all results receive a search score of 1.0). Be aware that prefix queries can be slow, especially if the index is large and the prefix consists of a small number of characters. An alternative methodology, such as edge n-gram tokenization, might perform faster. Terms using prefix search can't be longer than 1000 characters.
75+
Similar to filters, a prefix query looks for an exact match. As such, there's no relevance scoring (all results receive a search score of 1.0). Be aware that prefix queries can be slow, especially if the index is large and the prefix consists of a small number of characters. An alternative methodology, such as edge n-gram tokenization, might perform faster. Terms using prefix search can't be longer than 1000 characters.
7676

7777
Simple syntax supports prefix matching only. For suffix or infix matching against the end or middle of a term, use the [full Lucene syntax for wildcard search](query-lucene-syntax.md#bkmk_wildcard).
7878

@@ -82,7 +82,7 @@ In the simple syntax, search operators include these characters: `+ | " ( ) ' \`
8282

8383
If any of these characters are part of a token in the index, escape it by prefixing it with a single backslash (`\`) in the query. For example, suppose you used a custom analyzer for whole term tokenization, and your index contains the string "Luxury+Hotel". To get an exact match on this token, insert an escape character: `search=luxury\+hotel`.
8484

85-
To make things simple for the more typical cases, there are two exceptions to this rule where escaping is not needed:
85+
To make things simple for the more typical cases, there are two exceptions to this rule where escaping isn't needed:
8686

8787
+ The NOT operator `-` only needs to be escaped if it's the first character after a whitespace. If the `-` appears in the middle (for example, in `3352CDD0-EF30-4A2E-A512-3B30AF40F3FD`), you can skip escaping.
8888

@@ -93,39 +93,45 @@ To make things simple for the more typical cases, there are two exceptions to th
9393
9494
## Encoding unsafe and reserved characters in URLs
9595

96-
Ensure all unsafe and reserved characters are encoded in a URL. For example, '#' is an unsafe character because it is a fragment/anchor identifier in a URL. The character must be encoded to `%23` if used in a URL. '&' and '=' are examples of reserved characters as they delimit parameters and specify values in Azure Cognitive Search. For more information, see [RFC1738: Uniform Resource Locators (URL)](https://www.ietf.org/rfc/rfc1738.txt).
96+
Ensure all unsafe and reserved characters are encoded in a URL. For example, '#' is an unsafe character because it's a fragment/anchor identifier in a URL. The character must be encoded to `%23` if used in a URL. '&' and '=' are examples of reserved characters as they delimit parameters and specify values in Azure Cognitive Search. For more information, see [RFC1738: Uniform Resource Locators (URL)](https://www.ietf.org/rfc/rfc1738.txt).
9797

9898
Unsafe characters are ``" ` < > # % { } | \ ^ ~ [ ]``. Reserved characters are `; / ? : @ = + &`.
9999

100100
## Special characters
101101

102-
In some circumstances, you may want to search for a special character, like an '❤' emoji or the '€' sign. In such cases, make sure that the analyzer you use does not filter those characters out. The standard analyzer bypasses many special characters, excluding them from your index.
102+
Special characters can range from currency symbols like '$' or '€', to emojis. Many analyzers, including the default standard analyzer, will exclude special characters during indexing, which means they won't be represented in your index.
103103

104-
Analyzers that will tokenize special characters include the "whitespace" analyzer, which takes into consideration any character sequences separated by whitespaces as tokens (so the "❤" string would be considered a token). Also, a language analyzer like the Microsoft English analyzer ("en.microsoft"), would take the "€" string as a token. You can [test an analyzer](/rest/api/searchservice/test-analyzer) to see what tokens it generates for a given query.
104+
If you need special character representation, you can assign an analyzer that preserves them:
105105

106-
When using Unicode characters, make sure symbols are properly escaped in the query url (for instance for "❤" would use the escape sequence `%E2%9D%A4+`). Postman does this translation automatically.
106+
+ The "whitespace" analyzer considers any character sequence separated by white spaces as tokens (so the '❤' emoji would be considered a token).
107+
108+
+ A language analyzer, such as the Microsoft English analyzer ("en.microsoft"), would take the '$' or '€' string as a token.
109+
110+
For confirmation, you can [test an analyzer](/rest/api/searchservice/test-analyzer) to see what tokens are generated for a given string. As you might expect, you might not get full tokenization from a single analyzer. A workaround is to create multiple fields that contain the same content, but with different analyzer assignments (for example,"description_en", "description_fr", and so forth for language analyzers).
111+
112+
When using Unicode characters, make sure symbols are properly escaped in the query url (for instance for '❤' would use the escape sequence `%E2%9D%A4+`). Postman does this translation automatically.
107113

108114
## Precedence (grouping)
109115

110116
You can use parentheses to create subqueries, including operators within the parenthetical statement. For example, `motel+(wifi|luxury)` will search for documents containing the "motel" term and either "wifi" or "luxury" (or both).
111117

112118
## Query size limits
113119

114-
If your application generates search queries programmatically, we recommend designing it in such a way that it does not generate queries of unbounded size.
120+
If your application generates search queries programmatically, we recommend designing it in such a way that it doesn't generate queries of unbounded size.
115121

116-
+ For GET, the length of the URL cannot exceed 8 KB.
122+
+ For GET, the length of the URL can't exceed 8 KB.
117123

118124
+ For POST (and any other request), where the body of the request includes `search` and other parameters such as `filter` and `orderby`, the maximum size is 16 MB. Additional limits include:
119125
+ The maximum length of the search clause is 100,000 characters.
120126
+ The maximum number of clauses in `search` (expressions separated by AND or OR) is 1024.
121127
+ The maximum search term size is 1000 characters for [prefix search](#prefix-queries).
122-
+ There is also a limit of approximately 32 KB on the size of any individual term in a query.
128+
+ There's also a limit of approximately 32 KB on the size of any individual term in a query.
123129

124130
For more information on query limits, see [API request limits](search-limits-quotas-capacity.md#api-request-limits).
125131

126132
## Next steps
127133

128-
If you will be constructing queries programmatically, review [Full text search in Azure Cognitive Search](search-lucene-query-architecture.md) to understand the stages of query processing and the implications of text analysis.
134+
If you'll be constructing queries programmatically, review [Full text search in Azure Cognitive Search](search-lucene-query-architecture.md) to understand the stages of query processing and the implications of text analysis.
129135

130136
You can also review the following articles to learn more about query construction:
131137

0 commit comments

Comments
 (0)