| 1 | +# Elastic Search |
| 2 | + |
| 3 | +Elastic Search indexes the information in the database for quick retrieval, facilitating endpoints |
| 4 | +that can search through assets with loosely matching queries and give relevancy-ranked suggestions. |
| 5 | + |
| 6 | +## Indexing the Database using Logstash |
| 7 | +Elastic Search keeps independent indices for various assets in the database, and achieves this by: |
| 8 | + |
| 9 | + * Creating an initial index |
| 10 | + * Populating it based on the information already in the database |
| 11 | + * Updating the index with new information periodically, removing old entries if necessary |
| 12 | + |
| 13 | +Because the logic for these steps is very similar across assets, we generate the scripts that |
| 14 | +create and maintain the Elastic Search indices. We use [Logstash](https://www.elastic.co/logstash) |
| 15 | +to process the data from our database and export it to Elastic Search. |
| 16 | + |
| 17 | +This happens through two container services: |
| 18 | + |
| 19 | + * `es_logstash_setup`: Generates the common scripts for use by Logstash, and creates the Elastic Search indices if necessary. |
| 20 | + This is a short-running service that only runs on startup and exits when it's done. |
| 21 | + * `logstash`: Continually monitors the database and updates the Elastic Search indices. |
| 22 | + |
| 23 | +### Logstash Setup |
| 24 | + |
| 25 | +The `es_logstash_setup` service has two responsibilities: generating the Logstash files and creating the Elastic Search indices. |
| 26 | + |
| 27 | +The `src/logstash_setup/generate_logstash_config_files.py` script generates the Logstash files from the |
| 28 | +templates in the `src/logstash_setup/templates` directory. The generated files are placed |
| 29 | +into subdirectories of the `logstash/config` directory, alongside the predefined files. |
| 30 | + |
| 31 | +For syncing the Elastic Search index, logstash requires SQL files that extract the necessary data from the database. |
| 32 | +These are generated based on the `src/logstash_setup/templates/sql_{init|sync|rm}.py` files: |
| 33 | + |
| 34 | + * The `sql_init.py` file defines the query template that finds the data that should be included in the index if it is populated from scratch. |
| 35 | + * The `sql_sync.py` file defines the query template that finds the data that has been updated since the last creation or synchronization, so that the ES index can be updated efficiently. |
| 36 | + * The `sql_rm.py` file defines the query template that finds the data that should be removed from the index. |
| 37 | + |
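To make the three roles concrete, here is a minimal sketch of how such query templates could be rendered per asset. It is not the project's actual template code: the column names `identifier`, `date_modified` and `date_deleted` and the helper `render_queries` are hypothetical, and it assumes the queries rely on the `:sql_last_value` placeholder that the Logstash JDBC input plugin provides for incremental runs.

```python
from string import Template

# Hypothetical templates; the real ones live in src/logstash_setup/templates.
# Column names (identifier, date_modified, date_deleted) are assumptions.
SQL_INIT = Template("SELECT $fields FROM $table WHERE date_deleted IS NULL")
SQL_SYNC = Template(
    "SELECT $fields FROM $table "
    "WHERE date_deleted IS NULL AND date_modified > :sql_last_value"
)
SQL_RM = Template(
    "SELECT identifier FROM $table "
    "WHERE date_deleted IS NOT NULL AND date_deleted > :sql_last_value"
)


def render_queries(table: str, fields: list[str]) -> dict[str, str]:
    """Render the init, sync and rm queries for a single asset table."""
    columns = ", ".join(fields)
    return {
        "init": SQL_INIT.substitute(table=table, fields=columns),
        "sync": SQL_SYNC.substitute(table=table, fields=columns),
        "rm": SQL_RM.substitute(table=table),
    }


queries = render_queries("case_study", ["identifier", "name", "headline"])
for kind, sql in queries.items():
    print(f"{kind}: {sql}")
```

The init query selects everything that is currently valid, while the sync and rm queries only look at rows that changed since the previous run, which keeps periodic updates cheap.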
| 38 | +The script also generates the configuration files that Logstash needs to run the sync scripts: |
| 39 | + |
| 40 | + * `config.py`: used to generate `logstash.yml`, the general configuration. |
| 41 | + * `init_table.py`: contains the configuration needed to run the queries from `sql_init.py`, defined for each asset that needs to be indexed. |
| 42 | + * `sync_table.py`: contains the configuration needed to run the queries from `sql_sync.py` and `sql_rm.py`, defined for each asset that needs to be synced. |
| 43 | + |
| 44 | +All generated files contain the preamble defined in `file_header.py`. |
| 45 | +The `logstash/config/config` directory also contains further Logstash configuration files, such as the JVM options. |
| 46 | + |
| 47 | +### Creating a New Index |
| 48 | +To create a new index for an asset supported in the metadata catalogue REST API, you only need to create the corresponding search router, as described in "Creating a New Search". |
| 49 | + |
| 50 | +## Elastic Search in the Metadata Catalogue |
| 51 | +The metadata catalogue provides REST API endpoints for querying Elastic Search in a uniform manner. |
| 52 | +While Elastic Search can be exposed directly in production, this unified endpoint allows us to provide more structure and better automated documentation. |
| 53 | +It also avoids requiring the user to learn the Elastic Search query format. |
| 54 | + |
| 55 | +### Creating a New Search |
| 56 | +To extend Elastic Search to a new asset type, create a search router, similar to those in `src/routers/search_routers/`. |
| 57 | +Simply inherit from the base `SearchRouter` class defined in `src/routers/search_router.py` and define a few properties: |
| 58 | + |
| 59 | +```python |
| 60 | +    @property |
| 61 | +    def es_index(self) -> str: |
| 62 | +        return "case_study" |
| 63 | +``` |
| 64 | +The `es_index` property defines the name of the index. This is the name by which Elastic Search knows the index, and it should match the name of the table in the database. |
| 65 | + |
| 66 | +```python |
| 67 | +    @property |
| 68 | +    def resource_name_plural(self) -> str: |
| 69 | +        return "case_studies" |
| 70 | +``` |
| 71 | + |
| 72 | +The `resource_name_plural` property defines the path of the REST API endpoint, e.g. `api.aiod.eu/search/case_studies`. |
| 73 | + |
| 74 | +```python |
| 75 | +    @property |
| 76 | +    def resource_class(self): |
| 77 | +        return CaseStudy |
| 78 | +``` |
| 79 | + |
| 80 | +The `resource_class` property contains a direct reference to the object it indexes, which is used when returning expanded responses from the ES query ("get all"). |
| 81 | + |
| 82 | +```python |
| 83 | +    @property |
| 84 | +    def extra_indexed_fields(self) -> set[str]: |
| 85 | +        return {"headline", "alternative_headline"} |
| 86 | +``` |
| 87 | + |
| 88 | +The `extra_indexed_fields` property contains the fields of the entity that should be included in the index in addition to the `global_indexed_fields` defined in the `SearchRouter` class. |
| 89 | + |
| 90 | +```python |
| 91 | +    @property |
| 92 | +    def linked_fields(self) -> set[str]: |
| 93 | +        return { |
| 94 | +            "alternate_name", |
| 95 | +            "application_area", |
| 96 | +            "industrial_sector", |
| 97 | +            "research_area", |
| 98 | +            "scientific_domain", |
| 99 | +        } |
| 100 | +``` |
| 101 | +The `linked_fields` property contains the fields of the entity which refer to external tables and should be included in the index. |
| 102 | + |
| 103 | +By creating a new `SearchRouter` (and adding it to the router list), the script which generates the logstash files will automatically include it. |
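Put together, the properties above could form a router along the following lines. This is a sketch, not the project's code: the `SearchRouter` and `CaseStudy` classes here are simplified stand-ins so the example runs on its own, the contents of `global_indexed_fields` are assumed, and the `indexed_fields` helper and the class name `SearchRouterCaseStudy` are hypothetical.

```python
from abc import ABC, abstractmethod


class CaseStudy:
    """Stand-in for the real resource class."""


class SearchRouter(ABC):
    """Simplified stand-in for the base class in src/routers/search_router.py."""

    # Assumed contents; the real set is defined on the actual SearchRouter.
    global_indexed_fields: set[str] = {"name", "description"}

    @property
    @abstractmethod
    def es_index(self) -> str: ...

    @property
    @abstractmethod
    def resource_name_plural(self) -> str: ...

    @property
    @abstractmethod
    def resource_class(self) -> type: ...

    @property
    @abstractmethod
    def extra_indexed_fields(self) -> set[str]: ...

    @property
    @abstractmethod
    def linked_fields(self) -> set[str]: ...

    @property
    def indexed_fields(self) -> set[str]:
        # Hypothetical helper: every directly indexed field of the asset.
        return self.global_indexed_fields | self.extra_indexed_fields


class SearchRouterCaseStudy(SearchRouter):
    @property
    def es_index(self) -> str:
        return "case_study"

    @property
    def resource_name_plural(self) -> str:
        return "case_studies"

    @property
    def resource_class(self) -> type:
        return CaseStudy

    @property
    def extra_indexed_fields(self) -> set[str]:
        return {"headline", "alternative_headline"}

    @property
    def linked_fields(self) -> set[str]:
        return {"alternate_name", "application_area", "industrial_sector"}


router = SearchRouterCaseStudy()
print(router.es_index, sorted(router.indexed_fields))
```

Declaring the properties as abstract in the base class means a router that forgets one of them fails at instantiation rather than when the first query arrives.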
| 104 | + |
| 105 | +## Configuration |
| 106 | +Besides the configuration files described above, Elastic Search itself is configured in `es/elasticsearch.yml`, which should rarely need changes. |
| 107 | +Some aspects of both Logstash and Elastic Search are configured through environment variables set in the `override.env` file (defaults are in `.env`). |
| 108 | +The most notable of these are the Elastic Search password and the JVM resource options. |
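The relationship between `.env` and `override.env` can be illustrated with a small sketch. It assumes that values in `override.env` take precedence over the defaults in `.env`; the variable names `ES_PASSWORD` and `ES_JAVA_OPTS` and their values are hypothetical examples, not necessarily the names used in this project.

```python
def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and comments."""
    values: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


# Hypothetical file contents, inlined so the example is self-contained.
defaults = parse_env("ES_PASSWORD=changeme\nES_JAVA_OPTS=-Xmx256m -Xms256m\n")
overrides = parse_env("# local settings\nES_PASSWORD=a-stronger-password\n")

# Later entries win, so overrides replace defaults with the same key.
effective = {**defaults, **overrides}
print(effective)
```

Only the variables that differ from the defaults need to appear in `override.env`; everything else falls back to `.env`.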