diff --git a/CHANGELOG.md b/CHANGELOG.md
index 7ba1c38..2a44e03 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -1,5 +1,8 @@
 # Changelog
 ## Unreleased
+## v0.12.0
+**Features**
+Added support for Elasticsearch 9.2.0
 
 ## v0.11.0
 **Features**
diff --git a/Gemfile b/Gemfile
index df00b3d..c29752d 100644
--- a/Gemfile
+++ b/Gemfile
@@ -5,6 +5,6 @@ source 'https://rubygems.org' do
   gem 'faraday', '~> 2.13'
   gem 'faraday-retry', '~> 2.3'
   # matching our current backend setup
-  gem 'elasticsearch', '~> 7.17'
+  gem 'elasticsearch', '~> 9.2'
   gem 'faraday-typhoeus', '~> 1.1'
 end
diff --git a/README.md b/README.md
index fd4a5b3..c01cc43 100644
--- a/README.md
+++ b/README.md
@@ -21,7 +21,7 @@ services:
     volumes:
       - ./config/search:/config
   elasticsearch:
-    image: semtech/mu-search-elastic-backend:1.2.0
+    image: semtech/mu-search-elastic-backend:1.3.0
     volumes:
       - ./data/elasticsearch/:/usr/share/elasticsearch/data
     environment:
@@ -29,6 +29,8 @@
 ```
 
+Note: because Elasticsearch doesn't run as root in its container, it will mess up file permissions on the host directories mounted as volumes. The current workaround is to add the data directory to your app's git repo with a `.gitkeep` file, set the directory's permissions to `777`, and commit this file to your repo.
+
 The indices will be persisted in `./data/elasticsearch`.
 The `search` service needs to be linked to an instance of the [mu-authorization](https://github.com/mu-semtech/mu-authorization) service.
 
 Create the `./config/search` directory and create a `config.json` with the following contents:
@@ -761,8 +763,8 @@ Configure indexes to be pre-built when the application starts. For each user sea
 ```javascript
 {
   "eager_indexing_groups": [
-    [ 
-      { "variables": ["company-x"], "name": "organization-read" }, 
+    [
+      { "variables": ["company-x"], "name": "organization-read" },
       { "variables": ["company-x"], "name": "organization-write" },
       { "variables": [], "name": "public" }
     ],
@@ -770,7 +772,7 @@
       { "variables": ["company-y"], "name": "organization-read" },
       { "variables": [], "name": "public" }
     ],
-    [ 
+    [
       { "variables": [], "name": "clean" }
     ]
   ],
@@ -794,8 +796,18 @@ Assume your application contains a company-specific user group in the authorizat
 
 A typical group to be specified as a single `eager_indexing_group` is `{ "variables": [], "name": "clean" }`. The index will not contain any data, but will be used in the combination to fully match the user's allowed groups.
 
 #### [Experimental] Ignoring allowed groups
-In some cases you may search to ignore certain allowed groups when looking for matching indexes. Typically because they will not relate to data that has to be indexed and you want to avoid having many empty indexes. In this case you will have to provide an entry in the `ignored_allowed_groups` list for each group, currently this means including each possible variable value. 
-For example the clean group can be added to `ignored_allowed_groups` by adding `{ "variables": [], "name": "clean" }` to the list. 
+In some cases you may want to ignore certain allowed groups when looking for matching indexes, typically because they do not relate to data that has to be indexed and you want to avoid having many empty indexes. In this case you have to provide an entry in the `ignored_allowed_groups` list for each group; currently this means including each possible variable value.
+For example, the clean group can be added to `ignored_allowed_groups` by adding `{ "variables": [], "name": "clean" }` to the list.
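+
+As a sketch, assuming the company groups from the examples above, ignoring the organization read indexes as well as the clean group could look like:
+
+```javascript
+"ignored_allowed_groups": [
+  { "variables": [], "name": "clean" },
+  { "variables": ["company-x"], "name": "organization-read" },
+  { "variables": ["company-y"], "name": "organization-read" }
+]
+```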
 
 #### [Experimental] Dynamic allowed group variables
 In some cases you may encounter variables which are not known up front. The `"variables"` array accepts a `"*"` to indicate a wildcard for an attribute. This is currently supported in `ignored_allowed_groups`. In `eager_indexing_groups` this is supported as well, but only if the `eager_indexing_group` array contains a single group. Within `eager_indexing_groups` this allows us to create a dynamic index for an access right whilst still indicating that this index does not impact other indexes. For example, you may want to index each user's message history (`[{ "name": "user", "variables": ["*"] }]`) without impacting the index of the public code-lists (`[{ "name": "public", "variables": [] }]`); see the sketch below.
 
 An example for ignored groups is to ignore all of the anonymous sessions' information, which could be done as: `"ignored_allowed_groups": [ { "name": "anonymous-session", "variables": ["*"] } ]`.
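+
+A minimal sketch of such a configuration, reusing the hypothetical `user` and `public` groups from above; note that the wildcard group is the only member of its `eager_indexing_group` array:
+
+```javascript
+"eager_indexing_groups": [
+  [ { "name": "user", "variables": ["*"] } ],
+  [ { "name": "public", "variables": [] } ]
+]
+```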
 
@@ -917,7 +919,18 @@ The following sections list the flags that are currently implemented:
 - `:phrase_prefix:` : [Match phrase prefix query](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query-phrase-prefix.html)
 - `:query:` : [Query string query](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html)
 - `:sqs:` : [Simple query string query](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-simple-query-string-query.html)
-- `:common:` [Common terms query](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-common-terms-query.html). The flag takes additional options `cutoff_frequency` and `minimum_should_match` appended with commas such as `:common,{cutoff_frequence},{minimum_should_match}:{field}`. The `cutoff_frequency` can also be set application-wide in [the configuration file](#configuration-options).
+- `:common:` : [Common terms query](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-common-terms-query.html). The flag takes additional options `cutoff_frequency` and `minimum_should_match` appended with commas, such as `:common,{cutoff_frequency},{minimum_should_match}:{field}`. The common terms query was deprecated and removed from Elasticsearch; it is now executed as its recommended replacement, the [match query](https://www.elastic.co/docs/reference/query-languages/query-dsl/query-dsl-match-query), and the `cutoff_frequency` option is ignored.
+- `:match:` : [Match query](https://www.elastic.co/docs/reference/query-languages/query-dsl/query-dsl-match-query). The flag takes an additional option `minimum_should_match` appended with a comma, such as `:match,{minimum_should_match}:{field}`. An example of the generated query is shown below.
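+
+For illustration, a sketch of the query body the builder generates for a `:match,2:description` filter (the field name and search term are hypothetical):
+
+```javascript
+{
+  "match": {
+    "description": { "query": "fish", "minimum_should_match": "2" }
+  }
+}
+```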
 
 ###### Custom queries
 - `:fuzzy_phrase:` : A fuzzy phrase query based on [span_near](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-span-near-query.html) and [span_multi](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-span-multi-term-query.html). See also [this](https://stackoverflow.com/questions/38816955/elasticsearch-fuzzy-phrases) Stack Overflow question or [the code](./framework/elastic_query_builder.rb).
@@ -1046,7 +1049,7 @@ This section gives an overview of all configurable options in the search configu
 - (*) **number_of_threads** : number of threads to use during indexing. Defaults to 1.
 - (*) **connection_pool_size** : number of connections in the SPARQL/Elasticsearch/Tika connection pools. Defaults to 20. Typically increased up to 200 on systems with heavy load.
 - (*) **update_wait_interval_minutes** : number of minutes to wait before applying an update. Allows to prevent duplicate updates of the same documents. Defaults to 1.
-- (*) **common_terms_cutoff_frequency** : default cutoff frequency for a [Common terms query](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-common-terms-query.html). Defaults to 0.0001. See [supported search methods](#supported-search-methods).
+- (*) **common_terms_cutoff_frequency** : [REMOVED] default cutoff frequency for a [Common terms query](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-common-terms-query.html). The common terms query was removed from Elasticsearch, so this parameter is now ignored. See [supported search methods](#supported-search-methods).
 - (*) **enable_raw_dsl_endpoint** : flag to enable the [raw Elasticsearch DSL endpoint](#api). This endpoint is disabled by default for security reasons.
 - (*) **attachments_path_base** : path inside the Docker container where files for the attachment pipeline are mounted. Defaults to `/data`.
diff --git a/config/config.json b/config/config.json
index abb4d63..b2ae945 100644
--- a/config/config.json
+++ b/config/config.json
@@ -10,6 +10,7 @@
     "attachments_path_base" : "/local/files/directory",
     "persist_indexes" : false,
     "default_settings" : {
+      "number_of_replicas": 0,
       "analysis": {
         "analyzer": {
           "dutchanalyzer": {
diff --git a/framework/elastic_query_builder.rb b/framework/elastic_query_builder.rb
index 6b037be..e1d9755 100644
--- a/framework/elastic_query_builder.rb
+++ b/framework/elastic_query_builder.rb
@@ -265,19 +265,31 @@ def construct_es_query_term(filter_key, value)
             query: value,
             fields: all_fields ? nil : fields,
             default_operator: "and",
-            all_fields: all_fields
           }.compact
         }
       when /common(,[0-9.]+){,2}/
         ensure_single_field_for "common", fields do |field|
           flag, cutoff, min_match = flag.split(",")
-          cutoff = cutoff or @configuration[:common_terms_cutoff_frequency]
           term = {
-            common: {
-              field => { query: value, cutoff_frequency: cutoff }
+            # common was deprecated and removed in favor of match;
+            # match's cutoff_frequency option was also removed, so it is no longer used
+            match: {
+              field => { query: value }
             }
           }
-          term["minimum_should_match"] = min_match if min_match
+          term[:match][field][:minimum_should_match] = min_match if min_match
           term
         end
+      when /match(,[0-9.]+){,1}/
+        ensure_single_field_for "match", fields do |field|
+          flag, min_match = flag.split(",")
+          # plain match query; minimum_should_match is the only extra option
+          term = {
+            match: {
+              field => { query: value }
+            }
+          }
+          term[:match][field][:minimum_should_match] = min_match if min_match
+          term
+        end
       else
diff --git a/lib/mu_search/elastic.rb b/lib/mu_search/elastic.rb
index d08d3d8..118c00d 100644
--- a/lib/mu_search/elastic.rb
+++ b/lib/mu_search/elastic.rb
@@ -3,25 +3,36 @@
 require 'connection_pool'
 
 # monkeypatch "authentic product check" in client
-module Elasticsearch
-  class Client
-    alias original_verify_with_version_or_header verify_with_version_or_header
+module ElasticsearchMonkeyPatch
+  private
 
-    def verify_with_version_or_header(...)
-      original_verify_with_version_or_header(...)
-    rescue Elasticsearch::UnsupportedProductError
-      # silenty ignore this error
+  # Keep retrying the initial request until Elasticsearch responds, and satisfy
+  # the "authentic product" check by setting the expected header ourselves.
+  def verify_elasticsearch(*args, &block)
+    until @verified
+      sleep 1
+      begin
+        response = @transport.perform_request(*args, &block)
+        response.headers['x-elastic-product'] = 'Elasticsearch'
+        @verified = true
+      rescue StandardError
+        Mu::log.debug("SETUP") { "no reaction from elastic, retrying..." }
+        next
+      end
     end
+    response
   end
 end
 
+Elasticsearch::Client.prepend(ElasticsearchMonkeyPatch)
+
 # A wrapper around elasticsearch client for backwards compatibility
 # see https://rubydoc.info/gems/elasticsearch-api/Elasticsearch
 # and https://www.elastic.co/guide/en/elasticsearch/client/ruby-api/current/examples.html
 # for docs on the client api
 ##
 module MuSearch
-  class Elastic
+  class ElasticWrapper
     # Sets up the ElasticSearch connection pool
     def initialize(size:)
       MuSearch::ElasticConnectionPool.setup(size: size)
@@ -32,9 +41,11 @@ def initialize(size:)
     #
     # Executes a health check and accepts either "green" or "yellow".
     def up?
+      Mu::log.debug("SETUP") { "Checking if Elasticsearch is up..." }
       MuSearch::ElasticConnectionPool.with_client do |es_client|
         begin
           health = es_client.cluster.health
+          Mu::log.debug("SETUP") { "Elasticsearch cluster health: #{health["status"]}" }
           health["status"] == "yellow" or health["status"] == "green"
         rescue
           false
@@ -68,7 +79,7 @@ def create_index(index, mappings = nil, settings = nil)
       MuSearch::ElasticConnectionPool.with_client do |es_client|
         begin
           es_client.indices.create(index: index, body: { settings: settings, mappings: mappings})
-        rescue Elasticsearch::Transport::Transport::Errors::BadRequest => e
+        rescue Elastic::Transport::Transport::Errors::BadRequest => e
           error_message = e.message
           if error_message.include?("resource_already_exists_exception")
             @logger.warn("ELASTICSEARCH") {"Failed to create index #{index}, because it already exists" }
@@ -102,7 +113,7 @@ def delete_index(index)
           es_client.indices.delete(index: index)
           @logger.debug("ELASTICSEARCH") { "Successfully deleted index #{index}" }
           true
-        rescue Elasticsearch::Transport::Transport::Errors::NotFound => e
+        rescue Elastic::Transport::Transport::Errors::NotFound => e
          @logger.debug("ELASTICSEARCH") { "Index #{index} doesn't exist and cannot be deleted." }
          false
        rescue StandardError => e
@@ -129,7 +140,7 @@ def refresh_index(index)
           es_client.indices.refresh(index: index)
           @logger.debug("ELASTICSEARCH") { "Successfully refreshed index #{index}" }
           true
-        rescue Elasticsearch::Transport::Transport::Errors::NotFound => e
+        rescue Elastic::Transport::Transport::Errors::NotFound => e
           @logger.warn("ELASTICSEARCH") { "Index #{index} does not exist, cannot refresh." }
           false
         rescue StandardError => e
@@ -149,7 +160,7 @@ def clear_index(index)
           es_client.delete_by_query(index: index, body: { query: { match_all: {} } })
           @logger.debug("ELASTICSEARCH") { "Successfully cleared all documents from index #{index}" }
           true
-        rescue Elasticsearch::Transport::Transport::Errors::NotFound => e
+        rescue Elastic::Transport::Transport::Errors::NotFound => e
           @logger.warn("ELASTICSEARCH") { "Index #{index} does not exist, cannot clear documents." }
           false
         rescue StandardError => e
@@ -167,7 +178,7 @@ def get_document(index, id)
       MuSearch::ElasticConnectionPool.with_client do |es_client|
         begin
           es_client.get(index: index, id: id)
-        rescue Elasticsearch::Transport::Transport::Errors::NotFound => e
+        rescue Elastic::Transport::Transport::Errors::NotFound => e
           @logger.debug("ELASTICSEARCH") { "Document #{id} not found in index #{index}" }
           nil
         rescue StandardError => e
@@ -208,7 +219,7 @@ def update_document(index, id, document)
           body = es_client.update(index: index, id: id, body: {doc: document})
           @logger.debug("ELASTICSEARCH") { "Updated document #{id} in index #{index}" }
           body
-        rescue Elasticsearch::Transport::Transport::Errors::NotFound => e
+        rescue Elastic::Transport::Transport::Errors::NotFound => e
           @logger.info("ELASTICSEARCH") { "Cannot update document #{id} in index #{index} because it doesn't exist" }
           nil
         rescue StandardError => e
@@ -246,7 +257,7 @@ def delete_document(index, id)
           es_client.delete(index: index, id: id)
           @logger.debug("ELASTICSEARCH") { "Successfully deleted document #{id} in index #{index}" }
           true
-        rescue Elasticsearch::Transport::Transport::Errors::NotFound => e
+        rescue Elastic::Transport::Transport::Errors::NotFound => e
           @logger.debug("ELASTICSEARCH") { "Document #{id} doesn't exist in index #{index} and cannot be deleted." }
           false
         rescue StandardError => e
@@ -264,7 +275,7 @@ def search_documents(indexes:, query: nil)
         begin
           @logger.debug("SEARCH") { "Searching Elasticsearch index(es) #{indexes} with body #{req_body}" }
           es_client.search(index: indexes, body: query)
-        rescue Elasticsearch::Transport::Transport::Errors::BadRequest => e
+        rescue Elastic::Transport::Transport::Errors::BadRequest => e
           raise ArgumentError, "Invalid search query #{query}"
         rescue StandardError => e
           @logger.error("SEARCH") { "Searching documents in index(es) #{indexes} failed.\n Error: #{e.full_message}" }
@@ -282,7 +293,7 @@ def count_documents(indexes:, query: nil)
           @logger.debug("SEARCH") { "Count search results in index(es) #{indexes} for body #{query.inspect}" }
           response = es_client.count(index: indexes, body: query)
           response["count"]
-        rescue Elasticsearch::Transport::Transport::Errors::BadRequest => e
+        rescue Elastic::Transport::Transport::Errors::BadRequest => e
           @logger.error("SEARCH") { "Counting search results in index(es) #{indexes} failed.\n Error: #{e.full_message}" }
           raise ArgumentError, "Invalid count query #{query}"
         rescue StandardError => e
diff --git a/web.rb b/web.rb
index e8c48d1..b0c743a 100644
--- a/web.rb
+++ b/web.rb
@@ -101,7 +101,7 @@ def setup_delta_handling(index_manager, elasticsearch, config)
   connection_pool_size = configuration[:connection_pool_size]
   MuSearch::Tika::ConnectionPool.setup(size: connection_pool_size)
 
-  elasticsearch = MuSearch::Elastic.new(size: connection_pool_size)
+  elasticsearch = MuSearch::ElasticWrapper.new(size: connection_pool_size)
   set :elasticsearch, elasticsearch
 
   MuSearch::SPARQL::ConnectionPool.setup(size: connection_pool_size)