Commit 036d65b

Update index document models to fix data problems
This is a significant refactoring of the index mappings for all document types. It fixes problems with missing fields, adjusts how data is indexed for usability, corrects data types, resolves field name conflicts, and updates various fields to use proper tokenizers and/or analyzers so that their data is indexed in a more natural or usable way.

The driving factor for this work was that no result-data templates were being created, and what we had in the code base was completely wrong. We replaced the existing but unused result-data mapping with one that matches how documents are generated as of this commit. We add code to ensure that all templates are constructed and pushed.

We add `_meta` version fields to each mapping in the templates. This version is used in two ways:

 * We now query to see if the currently installed version is older, that is, has a version number less than the current version, and if so update it accordingly
 * As the version number component of index names

As a result of adding the version, the index names have been changed to include a version number that matches the mapping in the template. This lets us easily re-index data using the new mapping, while allowing consumers of the index to use the new and old version indices at the same time opportunistically.

We now enforce that the index prefix setting does not contain the period character, since the period separates the parts of the index name, <prefix>.<version>.<name>.<YYYY-MM-DD>.

While doing this work, we found a bug in discovering the tool data files in the tar ball. Sometimes the full hostname is used instead of the short hostname, so we now look for tool data by both short and long hostname.

We turn off the use of the _all field for all document types, and add a `default_field` setting for each type so that unspecified searches don't touch every field. This required us to add a "settings" infrastructure per index name.
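The index naming and template version check described above can be sketched roughly as follows. This is only an illustration: the function names are hypothetical, the `v<N>` rendering of the version is an assumption (the message only gives the `<prefix>.<version>.<name>.<YYYY-MM-DD>` layout), and treating a missing template as out of date is also an assumption.

```python
from datetime import date
from typing import Optional


def validate_prefix(prefix: str) -> str:
    """Reject prefixes containing '.', since '.' separates index name parts."""
    if not prefix or "." in prefix:
        raise ValueError(f"invalid index prefix: {prefix!r}")
    return prefix


def index_name(prefix: str, version: int, name: str, day: date) -> str:
    """Build <prefix>.<version>.<name>.<YYYY-MM-DD>.

    Rendering the version as 'v<N>' is an assumption for illustration.
    """
    return f"{validate_prefix(prefix)}.v{version}.{name}.{day:%Y-%m-%d}"


def needs_update(installed: Optional[int], current: int) -> bool:
    """True when the installed template version is older than the current one.

    Treating a missing template (None) as out of date is an assumption.
    """
    return installed is None or installed < current
```

Because the version is part of the name, indices built from the old and new mappings never collide, which is what allows both to be queried side by side during a re-index.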
We fix the indexing for mpstat to emit floats, prevent duplicate mpstat data, have prometheus-metrics emit fields in its own namespace, have proc-vmstat index in a way similar to how proc-vmstat-postprocess works (although we index both the original value and the computed rate), add proper indexing of the proc-interrupts data, and generally correct many data types.

To verify and test all these changes, we re-worked the unit test mock infrastructure to work with less specific interactions in our normal code, moving it to its own module, and adding the ability to compare generated JSON documents against the mappings to flag problems. We also moved more Elasticsearch infrastructure into the pbench library module. We add a unit test, 7.17, to verify vmstat tool data indexing works.

Since we are re-working toc-entries in this effort, we take the opportunity to add missing fields for directories (time stamp and mode), ensure that all empty directories are indexed as well, and add the mtime field for files. This is in preparation for viewing tar ball contents via the JSON documents only.

When we get BadDate exceptions from processing the pbench tar ball we will exit with that status. But when we encounter these errors while dealing with result and tool data, we have already indexed the run document and other tool data, so ending early is not very useful. We now continue to index so that we can report all the errors, indexing what we can.

We add an "@idx" field to every document to record the "index", or offset, of that document in the original source so that we can go back and look up the original document more easily. For result data, that is the array offset of a particular object in the original JSON document, while for tool data that is the row in a .csv file, or which timestamp in a text file.

The uid calculations for result data have been fixed to replace "controller_host" references with the controller field of the run document.
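As a rough illustration of the "@idx" field and the continue-on-error behavior for tool data, the sketch below turns .csv rows into documents. The function name, the float conversion of every column, and counting rows from zero after the header are all assumptions made for the example; the message does not specify them.

```python
import csv
import io


def rows_to_docs(csv_text: str):
    """Return (docs, errors) for the rows of a tool data .csv file.

    Each document records its source row offset in "@idx". A bad row is
    reported and skipped instead of stopping the whole run.
    """
    docs, errors = [], []
    reader = csv.DictReader(io.StringIO(csv_text))
    for idx, row in enumerate(reader):
        try:
            # Assumption: every column holds a numeric sample.
            doc = {key: float(value) for key, value in row.items()}
        except (TypeError, ValueError) as exc:
            errors.append((idx, str(exc)))
            continue
        doc["@idx"] = idx
        docs.append(doc)
    return docs, errors
```

Keeping the offset of skipped rows in the error list means the reported errors can be traced back to the same source position that "@idx" would have recorded.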
To save some space, result-data documents have had all but their identifying fields stripped and placed into a parent "sample" document. This significantly reduces the size of the JSON documents for each result value. The parent sample document contains all of the metadata from the benchmark.parameters section of the original result.json file.

We have dropped the use of @metadata fields for all but the run documents, where that namespace is reserved for metadata about the tar ball which is rarely used. We then add three namespaces for result-data and tool-data documents: run, iteration, and sample. These are used consistently across all document types.

We now capture the directory path name of the controller in the @metadata field of the run document. This enables the UI to reconstruct the proper URL to the incoming tar ball when the controller name does not match the directory name of the controller. Typically this happens when a satellite tar ball is ingested to the main server. E.g. when the satellite name is EC2, the controller name on disk will be "EC2::controller" while the metadata.log file will have the controller property of the "run" section set to "controller.example.com". The new @metadata.controller_dir will now contain "EC2::controller". In addition, we add the "EC2" value in the @metadata.satellite field as well.

We sort the order of the tool data files to be processed so that we don't get unexpected changes in the unit test output.
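The controller_dir and satellite example above can be sketched as follows. The helper name is hypothetical, and treating "::" as the satellite separator is inferred from the "EC2::controller" example rather than taken from the indexing code.

```python
def controller_metadata(controller_dir: str) -> dict:
    """Derive the @metadata entries from the controller directory name.

    Assumption: a satellite prefix, when present, is separated from the
    controller name by "::", as in "EC2::controller".
    """
    meta = {"controller_dir": controller_dir}
    satellite, sep, _ = controller_dir.partition("::")
    if sep:
        meta["satellite"] = satellite
    return meta
```

For a directory name without a satellite prefix, such as "controller", only the controller_dir entry is produced.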
1 parent b75a4df commit 036d65b

61 files changed: 69345 additions & 31455 deletions

server/bin/gold/test-7.0.txt renamed to server/bin/gold/test-7.0.0.txt

Lines changed: 48 additions & 50 deletions
Large diffs are not rendered by default.

server/bin/gold/test-7.0.1.txt

Lines changed: 3156 additions & 0 deletions
Large diffs are not rendered by default.

server/bin/gold/test-7.1.txt

Lines changed: 2 additions & 2 deletions
@@ -96,7 +96,7 @@ drwxrwxr-x - logs/pbench-audit-server
 -rw-rw-r-- 0 logs/pbench-audit-server/pbench-audit-server.log
 drwxrwxr-x - logs/pbench-index
 -rw-rw-r-- 0 logs/pbench-index/pbench-index.error
--rw-rw-r-- 527 logs/pbench-index/pbench-index.log
+-rw-rw-r-- 594 logs/pbench-index/pbench-index.log
 drwxrwxr-x - pbench-move-results-receive
 drwxrwxr-x - pbench-move-results-receive/fs-version-001
 drwxrwxr-x - pbench-move-results-receive/fs-version-002
@@ -160,7 +160,7 @@ drwxrwxr-x - tmp
 +++++ pbench-index/pbench-index.log
 run-1900-01-01T00:00:00-UTC: starting at 1900-01-01T00:00:00-UTC
 1900-01-01T00:00:00-UTC: Starting /var/tmp/pbench-test-server/test-7.1/pbench/archive/fs-version-001/controller/TO-INDEX/test-7.1.tar.xz (size 1232)
-1970-01-01T00:00:00.000000 ERROR pbench-index.index-pbench main -- The metadata.log file is curdled in tarball: /var/tmp/pbench-test-server/test-7.1/pbench/archive/fs-version-001/controller/test-7.1.tar.xz
+1970-01-01T00:00:00.000000 ERROR pbench-index.index-pbench main -- The metadata.log file is curdled in tarball: /var/tmp/pbench-test-server/test-7.1/pbench/archive/fs-version-001/controller/test-7.1.tar.xz - error fetching required metadata.log fields, "No section: 'run'"
 run-1900-01-01T00:00:00-UTC: ending at 1900-01-01T00:00:00-UTC, indexed 0 (skipped 1) results, 0 errors
 ----- pbench-index/pbench-index.log
 ---- pbench-local/logs

server/bin/gold/test-7.10.txt

Lines changed: 8493 additions & 4239 deletions
Large diffs are not rendered by default.

server/bin/gold/test-7.11.txt

Lines changed: 4316 additions & 2297 deletions
Large diffs are not rendered by default.

server/bin/gold/test-7.12.txt

Lines changed: 2621 additions & 1678 deletions
Large diffs are not rendered by default.

server/bin/gold/test-7.13.txt

Lines changed: 8142 additions & 3455 deletions
Large diffs are not rendered by default.

server/bin/gold/test-7.14.txt

Lines changed: 4551 additions & 2164 deletions
Large diffs are not rendered by default.

server/bin/gold/test-7.15.txt

Lines changed: 6775 additions & 3792 deletions
Large diffs are not rendered by default.

server/bin/gold/test-7.16.txt

Lines changed: 8790 additions & 4216 deletions
Large diffs are not rendered by default.
