Commit 00e6558

committed
[DOCS] Add introduction to Elasticsearch. (#43075)
* [DOCS] Add introduction to Elasticsearch.
* [DOCS] Incorporated review comments.
* [DOCS] Minor edits to add an abbreviated title and cross refs.
* [DOCS] Added sizing tips & link to quantatative sizing video.
1 parent c399c66 commit 00e6558

File tree

2 files changed

+270
-0
lines changed


docs/reference/index.asciidoc

Lines changed: 2 additions & 0 deletions
@@ -10,6 +10,8 @@
 
 include::../Versions.asciidoc[]
 
+include::intro.asciidoc[]
+
 include::getting-started.asciidoc[]
 
 include::setup.asciidoc[]

docs/reference/intro.asciidoc

Lines changed: 268 additions & 0 deletions
@@ -0,0 +1,268 @@
[[elasticsearch-intro]]
= You know, for search (and analysis)
[partintro]
--
{es} is the distributed search and analytics engine at the heart of
the {stack}. {ls} and {beats} facilitate collecting, aggregating, and
enriching your data and storing it in {es}. {kib} enables you to
interactively explore, visualize, and share insights into your data and manage
and monitor the stack. {es} is where the indexing, search, and analysis
magic happens.

{es} provides real-time search and analytics for all types of data. Whether you
have structured or unstructured text, numerical data, or geospatial data,
{es} can efficiently store and index it in a way that supports fast searches.
You can go far beyond simple data retrieval and aggregate information to discover
trends and patterns in your data. And as your data and query volume grows, the
distributed nature of {es} enables your deployment to grow seamlessly right
along with it.

While not _every_ problem is a search problem, {es} offers speed and flexibility
to handle data in a wide variety of use cases:

* Add a search box to an app or website
* Store and analyze logs, metrics, and security event data
* Use machine learning to automatically model the behavior of your data in real
time
* Automate business workflows using {es} as a storage engine
* Manage, integrate, and analyze spatial information using {es} as a geographic
information system (GIS)
* Store and process genetic data using {es} as a bioinformatics research tool

We’re continually amazed by the novel ways people use search. But whether
your use case is similar to one of these, or you're using {es} to tackle a new
problem, the way you work with your data, documents, and indices in {es} is
the same.
--

[[documents-indices]]
== Data in: documents and indices

{es} is a distributed document store. Instead of storing information as rows of
columnar data, {es} stores complex data structures that have been serialized
as JSON documents. When you have multiple {es} nodes in a cluster, stored
documents are distributed across the cluster and can be accessed immediately
from any node.

When a document is stored, it is indexed and fully searchable in near
real-time--within 1 second. {es} uses a data structure called an
inverted index that supports very fast full-text searches. An inverted index
lists every unique word that appears in any document and identifies all of the
documents each word occurs in.
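The idea behind an inverted index can be pictured in a few lines. The following is a toy sketch in plain Python, not how {es} (or Lucene) actually implements its index structures: it simply maps each unique word to the set of document IDs that contain it.

```python
def build_inverted_index(docs):
    """Map each unique word to the set of document IDs containing it."""
    index = {}
    for doc_id, text in docs.items():
        for word in set(text.lower().split()):
            index.setdefault(word, set()).add(doc_id)
    return index

docs = {
    1: "Elasticsearch is a search engine",
    2: "Kibana visualizes search results",
}
index = build_inverted_index(docs)
print(sorted(index["search"]))  # → [1, 2]
```

Looking up the documents for a word is then a single dictionary access, which is what makes full-text lookups fast regardless of how many documents are stored.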

An index can be thought of as an optimized collection of documents and each
document is a collection of fields, which are the key-value pairs that contain
your data. By default, {es} indexes all data in every field and each indexed
field has a dedicated, optimized data structure. For example, text fields are
stored in inverted indices, and numeric and geo fields are stored in BKD trees.
The ability to use the per-field data structures to assemble and return search
results is what makes {es} so fast.

{es} also has the ability to be schema-less, which means that documents can be
indexed without explicitly specifying how to handle each of the different fields
that might occur in a document. When dynamic mapping is enabled, {es}
automatically detects and adds new fields to the index. This default
behavior makes it easy to index and explore your data--just start
indexing documents and {es} will detect and map booleans, floating point and
integer values, dates, and strings to the appropriate {es} datatypes.
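Dynamic type detection can be pictured with a small sketch. This is a toy model in plain Python; the real {es} mapper applies many more rules (date format detection, keyword sub-fields for strings, and so on):

```python
def detect_type(value):
    """Toy dynamic mapping: guess a field type from a JSON value."""
    if isinstance(value, bool):
        return "boolean"   # check bool before int: bool is an int subtype in Python
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "float"
    return "text"

doc = {"active": True, "age": 42, "score": 3.5, "name": "Ada"}
mapping = {field: detect_type(value) for field, value in doc.items()}
print(mapping)
```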

Ultimately, however, you know more about your data and how you want to use it
than {es} can. You can define rules to control dynamic mapping and explicitly
define mappings to take full control of how fields are stored and indexed.

Defining your own mappings enables you to:

* Distinguish between full-text string fields and exact value string fields
* Perform language-specific text analysis
* Optimize fields for partial matching
* Use custom date formats
* Use data types such as `geo_point` and `geo_shape` that cannot be automatically
detected

It’s often useful to index the same field in different ways for different
purposes. For example, you might want to index a string field as both a text
field for full-text search and as a keyword field for sorting or aggregating
your data. Or, you might choose to use more than one language analyzer to
process the contents of a string field that contains user input.

The analysis chain that is applied to a full-text field during indexing is also
used at search time. When you query a full-text field, the query text undergoes
the same analysis before the terms are looked up in the index.
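To see why index-time and search-time analysis must match, consider a minimal sketch (toy Python, not the actual {es} analysis chain) in which the same `analyze` function is applied both to documents at index time and to the query string at search time:

```python
def analyze(text):
    """A toy analysis chain: lowercase and split on whitespace."""
    return text.lower().split()

def index_doc(index, doc_id, text):
    # Index time: terms come out of the analysis chain.
    for term in analyze(text):
        index.setdefault(term, set()).add(doc_id)

def search(index, query):
    # Search time: the SAME analysis is applied before the term lookup.
    results = [index.get(term, set()) for term in analyze(query)]
    return set.intersection(*results) if results else set()

index = {}
index_doc(index, 1, "Quick Brown Fox")
index_doc(index, 2, "quick red fox")
print(search(index, "QUICK fox"))  # both documents match despite case differences
```

If the query text were looked up verbatim, "QUICK" would never match the lowercased terms stored in the index; running it through the same chain is what keeps the two sides consistent.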

[[search-analyze]]
== Information out: search and analyze

While you can use {es} as a document store and retrieve documents and their
metadata, the real power comes from being able to easily access the full suite
of search capabilities built on the Apache Lucene search engine library.

{es} provides a simple, coherent REST API for managing your cluster and indexing
and searching your data. For testing purposes, you can easily submit requests
directly from the command line or through the Developer Console in {kib}. From
your applications, you can use the
https://www.elastic.co/guide/en/elasticsearch/client/index.html[{es} client]
for your language of choice: Java, JavaScript, Go, .NET, PHP, Perl, Python,
or Ruby.

[float]
[[search-data]]
=== Searching your data

The {es} REST APIs support structured queries, full-text queries, and complex
queries that combine the two. Structured queries are
similar to the types of queries you can construct in SQL. For example, you
could search the `gender` and `age` fields in your `employee` index and sort the
matches by the `hire_date` field. Full-text queries find all documents that
match the query string and return them sorted by _relevance_--how good a
match they are for your search terms.

In addition to searching for individual terms, you can perform phrase searches,
similarity searches, and prefix searches, and get autocomplete suggestions.

Have geospatial or other numerical data that you want to search? {es} indexes
non-textual data in optimized data structures that support
high-performance geo and numerical queries.

You can access all of these search capabilities using {es}'s
comprehensive JSON-style query language (<<query-dsl, Query DSL>>). You can also
construct <<sql-overview, SQL-style queries>> to search and aggregate data
natively inside {es}, and JDBC and ODBC drivers enable a broad range of
third-party applications to interact with {es} via SQL.

[float]
[[analyze-data]]
=== Analyzing your data

{es} aggregations enable you to build complex summaries of your data and gain
insight into key metrics, patterns, and trends. Instead of just finding the
proverbial “needle in a haystack”, aggregations enable you to answer questions
like:

* How many needles are in the haystack?
* What is the average length of the needles?
* What is the median length of the needles, broken down by manufacturer?
* How many needles were added to the haystack in each of the last six months?

You can also use aggregations to answer more subtle questions, such as:

* What are your most popular needle manufacturers?
* Are there any unusual or anomalous clumps of needles?

Because aggregations leverage the same data structures used for search, they are
also very fast. This enables you to analyze and visualize your data in real time.
Your reports and dashboards update as your data changes so you can take action
based on the latest information.

What’s more, aggregations operate alongside search requests. You can search
documents, filter results, and perform analytics at the same time, on the same
data, in a single request. And because aggregations are calculated in the
context of a particular search, you’re not just displaying a count of all
size 7 needles, you’re displaying a count of the size 7 needles
that match your users' search criteria--for example, all size 7 _non-stick
embroidery_ needles.
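The "aggregations in the context of a search" idea can be sketched with plain Python (a toy model; real {es} aggregations run distributed across shards): filter documents by the search criteria first, then compute summaries over just the matching set.

```python
from statistics import mean

needles = [
    {"size": 7, "type": "embroidery", "length_mm": 40},
    {"size": 7, "type": "sewing", "length_mm": 35},
    {"size": 7, "type": "embroidery", "length_mm": 42},
    {"size": 5, "type": "embroidery", "length_mm": 38},
]

# "Search": filter to the user's criteria -- size 7 embroidery needles.
hits = [n for n in needles if n["size"] == 7 and n["type"] == "embroidery"]

# "Aggregations": computed over the matching documents only.
count = len(hits)
avg_length = mean(n["length_mm"] for n in hits)
print(count, avg_length)
```

The count and average here describe only the size 7 embroidery needles, not the whole haystack, which mirrors how a single request can return both hits and analytics scoped to the same query.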

[float]
[[more-features]]
==== But wait, there’s more

Want to automate the analysis of your time-series data? You can use
{stack-ov}/ml-overview.html[machine learning] features to create accurate
baselines of normal behavior in your data and identify anomalous patterns. With
machine learning, you can detect:

* Anomalies related to temporal deviations in values, counts, or frequencies
* Statistical rarity
* Unusual behaviors for a member of a population

And the best part? You can do this without having to specify algorithms, models,
or other data science-related configurations.

[[scalability]]
== Scalability and resilience: clusters, nodes, and shards
++++
<titleabbrev>Scalability and resilience</titleabbrev>
++++

{es} is built to be always available and to scale with your needs. It does this
by being distributed by nature. You can add servers (nodes) to a cluster to
increase capacity, and {es} automatically distributes your data and query load
across all of the available nodes. No need to overhaul your application: {es}
knows how to balance multi-node clusters to provide scale and high availability.
The more nodes, the merrier.

How does this work? Under the covers, an {es} index is really just a logical
grouping of one or more physical shards, where each shard is actually a
self-contained index. By distributing the documents in an index across multiple
shards, and distributing those shards across multiple nodes, {es} can ensure
redundancy, which both protects against hardware failures and increases
query capacity as nodes are added to a cluster. As the cluster grows (or shrinks),
{es} automatically migrates shards to rebalance the cluster.

There are two types of shards: primaries and replicas. Each document in an index
belongs to one primary shard. A replica shard is a copy of a primary shard.
Replicas provide redundant copies of your data to protect against hardware
failure and increase capacity to serve read requests
like searching or retrieving a document.

The number of primary shards in an index is fixed at the time that an index is
created, but the number of replica shards can be changed at any time, without
interrupting indexing or query operations.
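The fixed primary shard count follows from how documents are routed: {es} picks a document's primary shard with a formula along the lines of `hash(routing) % number_of_primary_shards`, where the routing value defaults to the document ID. A toy Python sketch (using `zlib.crc32` as a deterministic stand-in for the murmur3 hash {es} actually uses):

```python
import zlib

def route(doc_id, num_primary_shards):
    """Pick the primary shard for a document.

    Toy stand-in for {es}'s routing formula:
    shard = hash(_routing) % number_of_primary_shards
    """
    return zlib.crc32(doc_id.encode()) % num_primary_shards

# With 3 primary shards, every document maps to exactly one shard...
shards_of = {doc: route(doc, 3) for doc in ["doc-1", "doc-2", "doc-3"]}

# ...but recomputing with a different shard count would re-route
# existing documents, which is why the number of primaries is fixed
# at index creation time.
moved = [d for d in shards_of if route(d, 5) != shards_of[d]]
```

Changing the divisor changes where most documents land, so growing an index's primary count would require rewriting the data rather than a simple setting change.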

[float]
[[it-depends]]
=== It depends...

There are a number of performance considerations and trade-offs with respect
to shard size and the number of primary shards configured for an index. The more
shards, the more overhead there is simply in maintaining those indices. The
larger the shard size, the longer it takes to move shards around when {es}
needs to rebalance a cluster.

Querying lots of small shards makes the processing per shard faster, but more
queries mean more overhead, so querying a smaller
number of larger shards might be faster. In short...it depends.

As a starting point:

* Aim to keep the average shard size between a few GB and a few tens of GB. For
use cases with time-based data, it is common to see shards in the 20GB to 40GB
range.

* Avoid the gazillion shards problem. The number of shards a node can hold is
proportional to the available heap space. As a general rule, the number of
shards per GB of heap space should be less than 20.
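The two rules of thumb above are easy to turn into a quick sanity check. A back-of-the-envelope helper (the numbers come straight from the tips above; the function name and the 30GB midpoint are illustrative assumptions, and real limits depend on your workload):

```python
def shard_budget(heap_gb, data_gb, target_shard_gb=30):
    """Back-of-the-envelope check using the rules of thumb above:
    fewer than 20 shards per GB of heap, and roughly 20GB-40GB per
    shard (30GB used here as a midpoint)."""
    max_shards = heap_gb * 20
    needed_shards = max(1, round(data_gb / target_shard_gb))
    return needed_shards, max_shards

# A node with 8GB of heap holding 600GB of time-based data:
needed, ceiling = shard_budget(8, 600)
print(needed, ceiling)  # → 20 160
```

Here 600GB of data wants about 20 shards at ~30GB each, comfortably under the ~160-shard ceiling that 8GB of heap suggests.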

The best way to determine the optimal configuration for your use case is
through https://www.elastic.co/elasticon/conf/2016/sf/quantitative-cluster-sizing[
testing with your own data and queries].

[float]
[[disaster-ccr]]
=== In case of disaster

For performance reasons, the nodes within a cluster need to be on the same
network. Balancing shards in a cluster across nodes in different data centers
simply takes too long. But high-availability architectures demand that you avoid
putting all of your eggs in one basket. In the event of a major outage in one
location, servers in another location need to be able to take over. Seamlessly.
The answer? {ccr-cap} (CCR).

CCR provides a way to automatically synchronize indices from your primary cluster
to a secondary remote cluster that can serve as a hot backup. If the primary
cluster fails, the secondary cluster can take over. You can also use CCR to
create secondary clusters to serve read requests in geo-proximity to your users.

{ccr-cap} is active-passive. The index on the primary cluster is
the active leader index and handles all write requests. Indices replicated to
secondary clusters are read-only followers.

[float]
[[admin]]
=== Care and feeding

As with any enterprise system, you need tools to secure, manage, and
monitor your {es} clusters. Security, monitoring, and administrative features
that are integrated into {es} enable you to use {kibana-ref}/introduction.html[{kib}]
as a control center for managing a cluster. Features like <<rollup-overview,
data rollups>> and <<index-lifecycle-management, index lifecycle management>>
help you intelligently manage your data over time.
