[[elasticsearch-intro]]
= You know, for search (and analysis)
[partintro]
--
{es} is the distributed search and analytics engine at the heart of
the {stack}. {ls} and {beats} facilitate collecting, aggregating, and
enriching your data and storing it in {es}. {kib} enables you to
interactively explore, visualize, and share insights into your data and manage
and monitor the stack. {es} is where the indexing, search, and analysis
magic happens.

{es} provides real-time search and analytics for all types of data. Whether you
have structured or unstructured text, numerical data, or geospatial data,
{es} can efficiently store and index it in a way that supports fast searches.
You can go far beyond simple data retrieval and aggregate information to discover
trends and patterns in your data. And as your data and query volume grows, the
distributed nature of {es} enables your deployment to grow seamlessly right
along with it.

While not _every_ problem is a search problem, {es} offers speed and flexibility
to handle data in a wide variety of use cases:

* Add a search box to an app or website
* Store and analyze logs, metrics, and security event data
* Use machine learning to automatically model the behavior of your data in real
  time
* Automate business workflows using {es} as a storage engine
* Manage, integrate, and analyze spatial information using {es} as a geographic
  information system (GIS)
* Store and process genetic data using {es} as a bioinformatics research tool

We’re continually amazed by the novel ways people use search. But whether
your use case is similar to one of these, or you're using {es} to tackle a new
problem, the way you work with your data, documents, and indices in {es} is
the same.
--

[[documents-indices]]
== Data in: documents and indices

{es} is a distributed document store. Instead of storing information as rows of
columnar data, {es} stores complex data structures that have been serialized
as JSON documents. When you have multiple {es} nodes in a cluster, stored
documents are distributed across the cluster and can be accessed immediately
from any node.

When a document is stored, it is indexed and fully searchable in near
real-time--within 1 second. {es} uses a data structure called an
inverted index that supports very fast full-text searches. An inverted index
lists every unique word that appears in any document and identifies all of the
documents each word occurs in.
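As a rough sketch in plain Python (a toy illustration, not how Lucene stores data on disk), an inverted index maps each term to the set of documents that contain it:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each unique term to the IDs of the documents it occurs in."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # lowercase-and-split stands in for a real analysis chain
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

docs = {
    1: "the quick brown fox",
    2: "the lazy dog",
    3: "the quick dog",
}
index = build_inverted_index(docs)
index["quick"]  # every document in which "quick" occurs: {1, 3}
```

Looking up a term is then a single dictionary access, which is why full-text search over an inverted index stays fast regardless of how many documents are stored.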

An index can be thought of as an optimized collection of documents and each
document is a collection of fields, which are the key-value pairs that contain
your data. By default, {es} indexes all data in every field and each indexed
field has a dedicated, optimized data structure. For example, text fields are
stored in inverted indices, and numeric and geo fields are stored in BKD trees.
The ability to use the per-field data structures to assemble and return search
results is what makes {es} so fast.

{es} also has the ability to be schema-less, which means that documents can be
indexed without explicitly specifying how to handle each of the different fields
that might occur in a document. When dynamic mapping is enabled, {es}
automatically detects and adds new fields to the index. This default
behavior makes it easy to index and explore your data--just start
indexing documents and {es} will detect and map booleans, floating point and
integer values, dates, and strings to the appropriate {es} data types.
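A toy sketch of that detection logic (the real dynamic mapping rules are more involved, and the field names here are made up):

```python
from datetime import datetime

def detect_type(value):
    """Guess an Elasticsearch-style field type for a JSON value (toy rules)."""
    if isinstance(value, bool):  # check bool before int: bool subclasses int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "float"
    if isinstance(value, str):
        try:
            datetime.strptime(value, "%Y-%m-%d")
            return "date"
        except ValueError:
            return "text"
    return "object"

doc = {"active": True, "age": 42, "joined": "2020-01-01", "name": "Ada"}
mapping = {field: detect_type(value) for field, value in doc.items()}
# mapping: {"active": "boolean", "age": "long", "joined": "date", "name": "text"}
```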

Ultimately, however, you know more about your data and how you want to use it
than {es} can. You can define rules to control dynamic mapping and explicitly
define mappings to take full control of how fields are stored and indexed.

Defining your own mappings enables you to:

* Distinguish between full-text string fields and exact value string fields
* Perform language-specific text analysis
* Optimize fields for partial matching
* Use custom date formats
* Use data types such as `geo_point` and `geo_shape` that cannot be automatically
detected

It’s often useful to index the same field in different ways for different
purposes. For example, you might want to index a string field as both a text
field for full-text search and as a keyword field for sorting or aggregating
your data. Or, you might choose to use more than one language analyzer to
process the contents of a string field that contains user input.

The analysis chain that is applied to a full-text field during indexing is also
used at search time. When you query a full-text field, the query text undergoes
the same analysis before the terms are looked up in the index.
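The symmetry matters: because the same chain runs at both index and search time, analyzed query terms line up with the analyzed terms in the index. A minimal sketch, with a lowercase-and-split function standing in for a real analysis chain:

```python
def analyze(text):
    # toy analysis chain: lowercase token filter + whitespace tokenizer
    return text.lower().split()

# index time: the field value is analyzed into terms
indexed_terms = set(analyze("Quick Brown Fox"))

# search time: the query text goes through the SAME analyzer
query_terms = analyze("QUICK fox")

# terms match despite differences in case
matches = [t for t in query_terms if t in indexed_terms]
# matches == ["quick", "fox"]
```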

[[search-analyze]]
== Information out: search and analyze

While you can use {es} as a document store and retrieve documents and their
metadata, the real power comes from being able to easily access the full suite
of search capabilities built on the Apache Lucene search engine library.

{es} provides a simple, coherent REST API for managing your cluster and indexing
and searching your data. For testing purposes, you can easily submit requests
directly from the command line or through the Developer Console in {kib}. From
your applications, you can use the
https://www.elastic.co/guide/en/elasticsearch/client/index.html[{es} client]
for your language of choice: Java, JavaScript, Go, .NET, PHP, Perl, Python,
or Ruby.

[float]
[[search-data]]
=== Searching your data

The {es} REST APIs support structured queries, full text queries, and complex
queries that combine the two. Structured queries are
similar to the types of queries you can construct in SQL. For example, you
could search the `gender` and `age` fields in your `employee` index and sort the
matches by the `hire_date` field. Full-text queries find all documents that
match the query string and return them sorted by _relevance_—how good a
match they are for your search terms.

In addition to searching for individual terms, you can perform phrase searches,
similarity searches, and prefix searches, and get autocomplete suggestions.

Have geospatial or other numerical data you want to search? {es} indexes
non-textual data in optimized data structures that support
high-performance geo and numerical queries.

You can access all of these search capabilities using {es}'s
comprehensive JSON-style query language (<<query-dsl, Query DSL>>). You can also
construct <<sql-overview, SQL-style queries>> to search and aggregate data
natively inside {es}, and JDBC and ODBC drivers enable a broad range of
third-party applications to interact with {es} via SQL.
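For example, a single search body can combine a full-text `match` clause with a structured `range` filter inside a `bool` query. Shown here as a Python dict for readability; `age` and `hire_date` echo the employee example above, while the `about` field is hypothetical:

```python
# A combined query: "must" clauses contribute to the relevance score,
# while "filter" clauses constrain results without affecting the score.
search_body = {
    "query": {
        "bool": {
            "must": [
                {"match": {"about": "search engineer"}}
            ],
            "filter": [
                {"range": {"age": {"gte": 30, "lte": 40}}}
            ],
        }
    },
    "sort": [{"hire_date": {"order": "desc"}}],
}
```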

[float]
[[analyze-data]]
=== Analyzing your data

{es} aggregations enable you to build complex summaries of your data and gain
insight into key metrics, patterns, and trends. Instead of just finding the
proverbial “needle in a haystack”, aggregations enable you to answer questions
like:

* How many needles are in the haystack?
* What is the average length of the needles?
* What is the median length of the needles, broken down by manufacturer?
* How many needles were added to the haystack in each of the last six months?
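Conceptually, the first three questions reduce to a count, an average, and a grouped median. A plain-Python sketch over made-up sample data (in {es} these would be metric and `terms` aggregations computed inside the cluster):

```python
from collections import defaultdict
from statistics import mean, median

needles = [  # made-up sample data
    {"length_mm": 30, "manufacturer": "acme"},
    {"length_mm": 50, "manufacturer": "acme"},
    {"length_mm": 40, "manufacturer": "apex"},
]

count = len(needles)                                # how many needles?
avg_length = mean(n["length_mm"] for n in needles)  # average length?

# median length broken down by manufacturer (a terms aggregation
# with a median sub-aggregation, computed by hand)
lengths_by_maker = defaultdict(list)
for n in needles:
    lengths_by_maker[n["manufacturer"]].append(n["length_mm"])
median_by_maker = {m: median(ls) for m, ls in lengths_by_maker.items()}
# count == 3, avg_length == 40, median_by_maker == {"acme": 40, "apex": 40}
```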

You can also use aggregations to answer more subtle questions, such as:

* What are your most popular needle manufacturers?
* Are there any unusual or anomalous clumps of needles?

Because aggregations leverage the same data structures used for search, they are
also very fast. This enables you to analyze and visualize your data in real time.
Your reports and dashboards update as your data changes so you can take action
based on the latest information.

What’s more, aggregations operate alongside search requests. You can search
documents, filter results, and perform analytics at the same time, on the same
data, in a single request. And because aggregations are calculated in the
context of a particular search, you’re not just displaying a count of all
size 7 needles, you’re displaying a count of the size 7 needles
that match your users' search criteria--for example, all size 7 _non-stick
embroidery_ needles.

[float]
[[more-features]]
==== But wait, there’s more

Want to automate the analysis of your time-series data? You can use
{stack-ov}/ml-overview.html[machine learning] features to create accurate
baselines of normal behavior in your data and identify anomalous patterns. With
machine learning, you can detect:

* Anomalies related to temporal deviations in values, counts, or frequencies
* Statistical rarity
* Unusual behaviors for a member of a population

And the best part? You can do this without having to specify algorithms, models,
or other data science-related configurations.

[[scalability]]
== Scalability and resilience: clusters, nodes, and shards
++++
<titleabbrev>Scalability and resilience</titleabbrev>
++++

{es} is built to be always available and to scale with your needs. It does this
by being distributed by nature. You can add servers (nodes) to a cluster to
increase capacity and {es} automatically distributes your data and query load
across all of the available nodes. No need to overhaul your application: {es}
knows how to balance multi-node clusters to provide scale and high availability.
The more nodes, the merrier.

How does this work? Under the covers, an {es} index is really just a logical
grouping of one or more physical shards, where each shard is actually a
self-contained index. By distributing the documents in an index across multiple
shards, and distributing those shards across multiple nodes, {es} can ensure
redundancy, which both protects against hardware failures and increases
query capacity as nodes are added to a cluster. As the cluster grows (or shrinks),
{es} automatically migrates shards to rebalance the cluster.

There are two types of shards: primaries and replicas. Each document in an index
belongs to one primary shard. A replica shard is a copy of a primary shard.
Replicas provide redundant copies of your data to protect against hardware
failure and increase capacity to serve read requests
like searching or retrieving a document.

The number of primary shards in an index is fixed at the time that an index is
created, but the number of replica shards can be changed at any time, without
interrupting indexing or query operations.
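The fixed primary count follows from how documents are routed to shards: the routing value (the document ID by default) is hashed modulo the number of primary shards. A sketch of the idea ({es} actually uses a murmur3 hash; `crc32` stands in here):

```python
import zlib

def shard_for(doc_id, num_primary_shards):
    # essentially: shard_num = hash(_routing) % num_primary_shards
    return zlib.crc32(doc_id.encode()) % num_primary_shards

# the same document always routes to the same shard...
assert shard_for("doc-1", 5) == shard_for("doc-1", 5)
# ...but changing the primary count would change where existing
# documents live, which is why it is fixed at index creation time
```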

[float]
[[it-depends]]
=== It depends...

There are a number of performance considerations and trade-offs with respect
to shard size and the number of primary shards configured for an index. The more
shards, the more overhead there is simply in maintaining those indices. The
larger the shard size, the longer it takes to move shards around when {es}
needs to rebalance a cluster.

Querying lots of small shards makes the processing per shard faster, but more
queries means more overhead, so querying a smaller
number of larger shards might be faster. In short...it depends.

As a starting point:

* Aim to keep the average shard size between a few GB and a few tens of GB. For
  use cases with time-based data, it is common to see shards in the 20GB to 40GB
  range.

* Avoid the gazillion shards problem. The number of shards a node can hold is
  proportional to the available heap space. As a general rule, the number of
  shards per GB of heap space should be less than 20.
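Those two rules of thumb make back-of-the-envelope estimates easy (the numbers below are illustrative, not measured limits):

```python
def shards_needed(total_data_gb, target_shard_gb=30):
    """Shards required to hold a dataset at the target shard size."""
    return -(-total_data_gb // target_shard_gb)  # ceiling division

def max_shards_per_node(heap_gb, shards_per_gb=20):
    """Rule-of-thumb ceiling: fewer than 20 shards per GB of heap."""
    return heap_gb * shards_per_gb

shards_needed(1000)      # ~34 shards at roughly 30GB each
max_shards_per_node(30)  # a node with a 30GB heap should stay under 600 shards
```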

The best way to determine the optimal configuration for your use case is
through https://www.elastic.co/elasticon/conf/2016/sf/quantitative-cluster-sizing[testing with your own data and queries].

[float]
[[disaster-ccr]]
=== In case of disaster

For performance reasons, the nodes within a cluster need to be on the same
network. Balancing shards in a cluster across nodes in different data centers
simply takes too long. But high-availability architectures demand that you avoid
putting all of your eggs in one basket. In the event of a major outage in one
location, servers in another location need to be able to take over. Seamlessly.
The answer? {ccr-cap} (CCR).

CCR provides a way to automatically synchronize indices from your primary cluster
to a secondary remote cluster that can serve as a hot backup. If the primary
cluster fails, the secondary cluster can take over. You can also use CCR to
create secondary clusters to serve read requests in geo-proximity to your users.

{ccr-cap} is active-passive. The index on the primary cluster is
the active leader index and handles all write requests. Indices replicated to
secondary clusters are read-only followers.

[float]
[[admin]]
=== Care and feeding

As with any enterprise system, you need tools to secure, manage, and
monitor your {es} clusters. Security, monitoring, and administrative features
that are integrated into {es} enable you to use {kibana-ref}/introduction.html[{kib}]
as a control center for managing a cluster. Features like <<rollup-overview,
data rollups>> and <<index-lifecycle-management, index lifecycle management>>
help you intelligently manage your data over time.