3 changes: 3 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,8 @@
# Changelog
## Unreleased
## v0.1.3
**Features**
Added support for ES 9.2.0

## v0.11.0
**Features**
2 changes: 1 addition & 1 deletion Gemfile
@@ -5,6 +5,6 @@ source 'https://rubygems.org' do
gem 'faraday', '~> 2.13'
gem 'faraday-retry', '~> 2.3'
# matching our current backend setup
gem 'elasticsearch', '~> 7.17'
gem 'elasticsearch', '~> 9.2'
gem 'faraday-typhoeus', '~> 1.1'
end
19 changes: 11 additions & 8 deletions README.md
@@ -21,14 +21,16 @@ services:
volumes:
- ./config/search:/config
elasticsearch:
image: semtech/mu-search-elastic-backend:1.2.0
image: semtech/mu-search-elastic-backend:1.3.0
volumes:
- ./data/elasticsearch/:/usr/share/elasticsearch/data
environment:
- discovery.type=single-node

```

Note: because Elasticsearch doesn't run as root in its container, it can mess up file permissions on volumes mounted from the host system. The current workaround is to add the directory to your app's git repo with a `.gitkeep` file, set the permissions of the directory to 777, and commit this file to your repo.

The indices will be persisted in `./data/elasticsearch`. The `search` service needs to be linked to an instance of the [mu-authorization](https://github.com/mu-semtech/mu-authorization) service.

Create the `./config/search` directory and create a `config.json` with the following contents:
@@ -761,16 +763,16 @@ Configure indexes to be pre-built when the application starts. For each user sea
```javascript
{
"eager_indexing_groups": [
[
{ "variables": ["company-x"], "name": "organization-read" },
[
{ "variables": ["company-x"], "name": "organization-read" },
{ "variables": ["company-x"], "name": "organization-write" },
{ "variables": [], "name": "public" }
],
[
{ "variables": ["company-y"], "name": "organization-read" },
{ "variables": [], "name": "public" }
],
[
[
{ "variables": [], "name": "clean" }
]
],
@@ -794,8 +796,8 @@ Assume your application contains a company-specific user group in the authorizat
A typical group to be specified as a single `eager_indexing_group` is `{ "variables": [], "name": "clean" }`. The index will not contain any data, but will be used in the combination to fully match the user's allowed groups.

#### [Experimental] Ignoring allowed groups
In some cases you may want to ignore certain allowed groups when looking for matching indexes, typically because they will not relate to data that has to be indexed and you want to avoid having many empty indexes. In this case you will have to provide an entry in the `ignored_allowed_groups` list for each such group; currently this means including each possible variable value.
For example, the clean group can be added to `ignored_allowed_groups` by adding `{ "variables": [], "name": "clean" }` to the list.
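As a minimal sketch, assuming `ignored_allowed_groups` sits at the same level as `eager_indexing_groups` in the search configuration, ignoring the clean group would look like:

```javascript
{
  "ignored_allowed_groups": [
    { "variables": [], "name": "clean" }
  ]
}
```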

#### [Experimental] Dynamic allowed group variables
In some cases you may encounter variables which are not known up front. The `"variables"` array accepts a `"*"` to indicate a wildcard for an attribute. This is currently supported in `ignored_allowed_groups`. In `eager_indexing_groups` it is supported as well, but only if the `eager_indexing_group` array contains a single group. Within `eager_indexing_groups` this allows us to create a dynamic index for an access right whilst still indicating that this index does not impact other indexes. For example, you may want to index the user's message history (`[{ "name": "user", "variables": ["*"] }]`), which does not impact the index of the code-lists in public (`[{ "name": "public", "variables": [] }]`). An example for ignored groups may be to ignore all of the anonymous sessions' information, which could be done as: `"ignored_allowed_groups": [ { "name": "anonymous-session", "variables": ["*"] } ]`.
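The sketch below combines the examples from this paragraph into one configuration fragment; it only illustrates the shape of the options, with group names taken from the examples above:

```javascript
{
  "eager_indexing_groups": [
    [ { "name": "user", "variables": ["*"] } ],
    [ { "name": "public", "variables": [] } ]
  ],
  "ignored_allowed_groups": [
    { "name": "anonymous-session", "variables": ["*"] }
  ]
}
```

Note that the wildcard group is kept in a combination of its own, since `eager_indexing_groups` only supports `"*"` when the combination contains a single group.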
@@ -917,7 +919,8 @@ The following sections list the flags that are currently implemented:
- `:phrase_prefix:` : [Match phrase prefix query](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query-phrase-prefix.html)
- `:query:` : [Query string query](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html)
- `:sqs:` : [Simple query string query](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-simple-query-string-query.html)
- `:common:` : [Common terms query](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-common-terms-query.html). The flag still accepts the additional options `cutoff_frequency` and `minimum_should_match` appended with commas, such as `:common,{cutoff_frequency},{minimum_should_match}:{field}`, but the common terms query was deprecated and removed from Elasticsearch: it is now executed as its recommended replacement, the [match query](https://www.elastic.co/docs/reference/query-languages/query-dsl/query-dsl-match-query), and the `cutoff_frequency` option (including the application-wide setting in [the configuration file](#configuration-options)) is ignored.
- `:match:` : [Match query](https://www.elastic.co/docs/reference/query-languages/query-dsl/query-dsl-match-query). The flag takes an additional option `minimum_should_match` appended with a comma, such as `:match,{minimum_should_match}:{field}` (see the request example after this list).

###### Custom queries
- `:fuzzy_phrase:` : A fuzzy phrase query based on [span_near](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-span-near-query.html) and [span_multi](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-span-multi-term-query.html). See also [this](https://stackoverflow.com/questions/38816955/elasticsearch-fuzzy-phrases) Stack Overflow issue or [the code](./framework/elastic_query_builder.rb).
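
For instance, following the flag syntax above, a `:match` query with a `minimum_should_match` of 2 could be requested as follows. The `documents` path and `description` field are illustrative; substitute the type path and field from your own configuration:

```
GET /documents/search?filter[:match,2:description]=big data
```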
@@ -1046,7 +1049,7 @@ This section gives an overview of all configurable options in the search configu
- (*) **number_of_threads** : number of threads to use during indexing. Defaults to 1.
- (*) **connection_pool_size** : number of connections in the SPARQL/Elasticsearch/Tika connection pools. Defaults to 20. Typically increased up to 200 on systems with heavy load.
- (*) **update_wait_interval_minutes** : number of minutes to wait before applying an update. Allows to prevent duplicate updates of the same documents. Defaults to 1.
- (*) **common_terms_cutoff_frequency** : [REMOVED] default cutoff frequency for a [Common terms query](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-common-terms-query.html). This parameter was removed from Elasticsearch and is now ignored. See [supported search methods](#supported-search-methods).
- (*) **enable_raw_dsl_endpoint** : flag to enable the [raw Elasticsearch DSL endpoint](#api). This endpoint is disabled by default for security reasons.
- (*) **attachments_path_base** : path inside the Docker container where files for the attachment pipeline are mounted. Defaults to `/data`.

1 change: 1 addition & 0 deletions config/config.json
@@ -10,6 +10,7 @@
"attachments_path_base" : "/local/files/directory",
"persist_indexes" : false,
"default_settings" : {
"number_of_replicas": 0,
Author comment: ES puts the cluster status to yellow if we don't specify that we really only want a single node.
"analysis": {
"analyzer": {
"dutchanalyzer": {
23 changes: 18 additions & 5 deletions framework/elastic_query_builder.rb
@@ -265,19 +265,32 @@ def construct_es_query_term(filter_key, value)
query: value,
fields: all_fields ? nil : fields,
default_operator: "and",
all_fields: all_fields
Author comment: deprecated in 6.0.0 and since removed.
}.compact
}
when /common(,[0-9.]+){,2}/
ensure_single_field_for "common", fields do |field|
flag, cutoff, min_match = flag.split(",")
cutoff = cutoff or @configuration[:common_terms_cutoff_frequency]
term = {
common: {
field => { query: value, cutoff_frequency: cutoff }
# common was deprecated and removed in favor of match
# cutoff_frequency for match was also removed and is no longer used
match: {
field => { query: value }
}
}
term["minimum_should_match"] = min_match if min_match
term[:match][field]["minimum_should_match"] = min_match if min_match
term
end
when /match(,[0-9.]+){,1}/
ensure_single_field_for "match", fields do |field|
flag, min_match = flag.split(",")
term = {
# common was deprecated and removed in favor of match
# cutoff_frequency for match was also removed and is no longer used
match: {
field => { query: value }
}
}
term[:match][field]["minimum_should_match"] = min_match if min_match
term
end
else
Expand Down
45 changes: 28 additions & 17 deletions lib/mu_search/elastic.rb
@@ -3,25 +3,34 @@
require 'connection_pool'

# monkeypatch "authentic product check" in client
module Elasticsearch
class Client
alias original_verify_with_version_or_header verify_with_version_or_header
module ElasticsearchMonkeyPatch
Author comment: new version of the monkeypatch was needed.
private

def verify_with_version_or_header(...)
original_verify_with_version_or_header(...)
rescue Elasticsearch::UnsupportedProductError
# silently ignore this error
def verify_elasticsearch(*args, &block)
while not @verified do
sleep 1
begin
response = @transport.perform_request(*args, &block)
response.headers['x-elastic-product'] = 'Elasticsearch'
@verified = true
rescue StandardError => e
Mu::log.debug("SETUP") { "no reaction from elastic, retrying..." }
Author comment: when creating the connection pool, the library already performs this check, but elastic isn't there yet; this causes an error and if we don't rescue, the service crashes.
Author comment (Rahien, Nov 12, 2025): this is before our check on whether elastic is up (which is still needed because the cluster may not be healthy yet even if the response gives a 200).
next
end
end
response
end
end

Elasticsearch::Client.prepend(ElasticsearchMonkeyPatch)

# A wrapper around the elasticsearch client for backwards compatibility
# see https://rubydoc.info/gems/elasticsearch-api/Elasticsearch
# and https://www.elastic.co/guide/en/elasticsearch/client/ruby-api/current/examples.html
# for docs on the client api
##
module MuSearch
class Elastic
class ElasticWrapper
Author comment: to avoid conflict with the Elastic::Transport that replaces Elasticsearch::Transport.
# Sets up the ElasticSearch connection pool
def initialize(size:)
MuSearch::ElasticConnectionPool.setup(size: size)
@@ -32,9 +41,11 @@ def initialize(size:)
#
# Executes a health check and accepts either "green" or "yellow".
def up?
Mu::log.debug("SETUP") { "Checking if Elasticsearch is up..." }
MuSearch::ElasticConnectionPool.with_client do |es_client|
begin
health = es_client.cluster.health
Mu::log.debug("SETUP") { "Elasticsearch cluster health: #{health["status"]}" }
health["status"] == "yellow" or health["status"] == "green"
rescue
false
@@ -68,7 +79,7 @@ def create_index(index, mappings = nil, settings = nil)
MuSearch::ElasticConnectionPool.with_client do |es_client|
begin
es_client.indices.create(index: index, body: { settings: settings, mappings: mappings})
rescue Elasticsearch::Transport::Transport::Errors::BadRequest => e
rescue Elastic::Transport::Transport::Errors::BadRequest => e
error_message = e.message
if error_message.include?("resource_already_exists_exception")
@logger.warn("ELASTICSEARCH") {"Failed to create index #{index}, because it already exists" }
@@ -102,7 +113,7 @@ def delete_index(index)
es_client.indices.delete(index: index)
@logger.debug("ELASTICSEARCH") { "Successfully deleted index #{index}" }
true
rescue Elasticsearch::Transport::Transport::Errors::NotFound => e
rescue Elastic::Transport::Transport::Errors::NotFound => e
@logger.debug("ELASTICSEARCH") { "Index #{index} doesn't exist and cannot be deleted." }
false
rescue StandardError => e
@@ -129,7 +140,7 @@ def refresh_index(index)
es_client.indices.refresh(index: index)
@logger.debug("ELASTICSEARCH") { "Successfully refreshed index #{index}" }
true
rescue Elasticsearch::Transport::Transport::Errors::NotFound => e
rescue Elastic::Transport::Transport::Errors::NotFound => e
@logger.warn("ELASTICSEARCH") { "Index #{index} does not exist, cannot refresh." }
false
rescue StandardError => e
@@ -149,7 +160,7 @@ def clear_index(index)
es_client.delete_by_query(index: index, body: { query: { match_all: {} } })
@logger.debug("ELASTICSEARCH") { "Successfully cleared all documents from index #{index}" }
true
rescue Elasticsearch::Transport::Transport::Errors::NotFound => e
rescue Elastic::Transport::Transport::Errors::NotFound => e
@logger.warn("ELASTICSEARCH") { "Index #{index} does not exist, cannot clear documents." }
false
rescue StandardError => e
@@ -167,7 +178,7 @@ def get_document(index, id)
MuSearch::ElasticConnectionPool.with_client do |es_client|
begin
es_client.get(index: index, id: id)
rescue Elasticsearch::Transport::Transport::Errors::NotFound => e
rescue Elastic::Transport::Transport::Errors::NotFound => e
@logger.debug("ELASTICSEARCH") { "Document #{id} not found in index #{index}" }
nil
rescue StandardError => e
@@ -208,7 +219,7 @@ def update_document(index, id, document)
body = es_client.update(index: index, id: id, body: {doc: document})
@logger.debug("ELASTICSEARCH") { "Updated document #{id} in index #{index}" }
body
rescue Elasticsearch::Transport::Transport::Errors::NotFound => e
rescue Elastic::Transport::Transport::Errors::NotFound => e
@logger.info("ELASTICSEARCH") { "Cannot update document #{id} in index #{index} because it doesn't exist" }
nil
rescue StandardError => e
@@ -246,7 +257,7 @@ def delete_document(index, id)
es_client.delete(index: index, id: id)
@logger.debug("ELASTICSEARCH") { "Successfully deleted document #{id} in index #{index}" }
true
rescue Elasticsearch::Transport::Transport::Errors::NotFound => e
rescue Elastic::Transport::Transport::Errors::NotFound => e
@logger.debug("ELASTICSEARCH") { "Document #{id} doesn't exist in index #{index} and cannot be deleted." }
false
rescue StandardError => e
@@ -264,7 +275,7 @@ def search_documents(indexes:, query: nil)
begin
@logger.debug("SEARCH") { "Searching Elasticsearch index(es) #{indexes} with body #{req_body}" }
es_client.search(index: indexes, body: query)
rescue Elasticsearch::Transport::Transport::Errors::BadRequest => e
rescue Elastic::Transport::Transport::Errors::BadRequest => e
raise ArgumentError, "Invalid search query #{query}"
rescue StandardError => e
@logger.error("SEARCH") { "Searching documents in index(es) #{indexes} failed.\n Error: #{e.full_message}" }
@@ -282,7 +293,7 @@ def count_documents(indexes:, query: nil)
@logger.debug("SEARCH") { "Count search results in index(es) #{indexes} for body #{query.inspect}" }
response = es_client.count(index: indexes, body: query)
response["count"]
rescue Elasticsearch::Transport::Transport::Errors::BadRequest => e
rescue Elastic::Transport::Transport::Errors::BadRequest => e
@logger.error("SEARCH") { "Counting search results in index(es) #{indexes} failed.\n Error: #{e.full_message}" }
raise ArgumentError, "Invalid count query #{query}"
rescue StandardError => e
2 changes: 1 addition & 1 deletion web.rb
@@ -101,7 +101,7 @@ def setup_delta_handling(index_manager, elasticsearch, config)
connection_pool_size = configuration[:connection_pool_size]
MuSearch::Tika::ConnectionPool.setup(size: connection_pool_size)

elasticsearch = MuSearch::Elastic.new(size: connection_pool_size)
elasticsearch = MuSearch::ElasticWrapper.new(size: connection_pool_size)
set :elasticsearch, elasticsearch

MuSearch::SPARQL::ConnectionPool.setup(size: connection_pool_size)