Commit 036d65b
Update index document models to fix data problems
This is a significant refactoring of the index mappings for all document
types. It fixes problems with missing fields, adjusts how data is
indexed for usability, corrects data types, resolves field name
conflicts, and updates various fields to use proper tokenizers and/or
analyzers so that their data is indexed in a more natural or usable
way.
The driving factor for this work was the fact that there were no
result-data templates being created, and what we had in the code base
was completely wrong. We replaced the existing but unused result-data
mapping with one which matches how documents are generated as of this
commit. We add code to ensure that all templates are constructed and
pushed.
We add `_meta` version fields to each mapping in the templates. This
version is used in two ways:
* We now query the currently installed template and update it if its
version number is less than the current version.
* The version number is used as a component of index names.
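
A minimal sketch of the version check described in the first bullet. It assumes the installed template is available as a dict shaped like an Elasticsearch template body, with the version stored under `_meta.version` in each mapping. The function name and dict layout are illustrative assumptions, not the code in this commit.

```python
def needs_update(installed_template, current_version):
    """Return True when the installed template is missing or older.

    `installed_template` is assumed to look like
    {"mappings": {<doc_type>: {"_meta": {"version": N}, ...}}}.
    """
    if installed_template is None:
        # Nothing installed yet, so the template must be pushed.
        return True
    mappings = installed_template.get("mappings", {})
    for mapping in mappings.values():
        installed = mapping.get("_meta", {}).get("version")
        if installed is None or int(installed) < int(current_version):
            return True
    # A template with no mappings at all is treated as out of date.
    return not mappings
```

A template whose mappings all carry a version equal to or greater than the current one is left alone.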
As a result of adding the version, the index names have been changed to
add a version number that matches the mapping in the template. This
lets us easily re-index data using the new mapping, while allowing the
consumers of the index to use the new and old version indices at the
same time opportunistically. We now enforce the index prefix setting to
not contain the period character, since that separates the various
parts of the index name, <prefix>.<version>.<name>.<YYYY-MM-DD>.
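
As an illustration of the naming scheme, the sketch below builds an index name from its four parts and rejects a prefix containing a period. The function name and the example values are hypothetical; only the `<prefix>.<version>.<name>.<YYYY-MM-DD>` layout comes from this commit message.

```python
import datetime


def index_name(prefix, version, name, day):
    """Build '<prefix>.<version>.<name>.<YYYY-MM-DD>'.

    The prefix must not contain '.', since the period separates the
    parts of the index name.
    """
    if "." in prefix:
        raise ValueError("index prefix must not contain '.': %r" % (prefix,))
    return "{}.{}.{}.{}".format(prefix, version, name, day.strftime("%Y-%m-%d"))
```

With this layout a consumer can split an index name on periods and always find the version in the second position.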
While doing this work, we found a bug in discovering the tool data
files in the tar ball. Sometimes the full hostname is being used
instead of the short hostname. So now we look for tool data by short
and long hostname.
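
The lookup can be sketched as follows, assuming tool data lives under a per-host directory inside the tar ball. The directory layout, function names, and example paths are assumptions made for illustration.

```python
def hostname_candidates(hostname):
    """Return the short hostname, plus the full one when it differs."""
    short = hostname.split(".", 1)[0]
    return [short] if short == hostname else [short, hostname]


def find_tool_data(member_names, tools_dir, hostname):
    """Select tar ball members stored under either hostname form."""
    prefixes = tuple(
        "{}/{}/".format(tools_dir, host) for host in hostname_candidates(hostname)
    )
    return [name for name in member_names if name.startswith(prefixes)]
```

The trailing slash in each prefix keeps `host1` from matching a directory named `host10`.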
We turn off the use of the _all field for all document types, and add a
`default_field` setting for each type so that unspecified searches don't
touch every field. This required us to add a "settings" infrastructure
per index name.
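
A rough sketch of what a template body with these settings could look like, written as a Python dict. It assumes the legacy Elasticsearch template format with a `template` pattern key and per-type mappings; the function and its parameters are hypothetical, and the actual templates in this commit may be laid out differently.

```python
def build_template(index_pattern, doc_type, properties, default_field, version):
    """Assemble a template body with _all disabled and a default field."""
    return {
        "template": index_pattern,
        "settings": {
            # Unspecified searches go to this one field instead of _all.
            "index": {"query": {"default_field": default_field}}
        },
        "mappings": {
            doc_type: {
                "_meta": {"version": version},
                "_all": {"enabled": False},
                "properties": properties,
            }
        },
    }
```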
We fix the indexing for mpstat to emit floats, prevent duplicate mpstat
data, have prometheus-metrics emit fields in its namespace, have
proc-vmstat index similar to the way the proc-vmstat-postprocess works
(although we index both the original value and the computed rate), add
proper indexing of the proc-interrupts data, and generally correct many
data types.
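
To illustrate indexing both the original value and the computed rate, here is a small sketch. It assumes the rate is the change between consecutive counter samples divided by the elapsed time; the field names and the function are hypothetical and may not match how proc-vmstat-postprocess computes its output.

```python
def vmstat_docs(samples):
    """Turn (timestamp_seconds, counter_value) pairs into documents.

    The first sample has no predecessor, so it only seeds the rate
    calculation and produces no document.
    """
    docs = []
    prev = None
    for ts, value in samples:
        if prev is not None:
            prev_ts, prev_value = prev
            elapsed = ts - prev_ts
            if elapsed > 0:
                docs.append(
                    {
                        "@timestamp": ts,
                        "value": value,
                        "rate": (value - prev_value) / elapsed,
                    }
                )
        prev = (ts, value)
    return docs
```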
To verify all of these changes, we re-worked the unit test mock
infrastructure so that it depends on less specific interactions in our
normal code, moved it to its own module, and added the ability to
compare generated JSON documents against the mappings to flag
problems. We also moved more Elasticsearch infrastructure into the
pbench library module.
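
One way to compare a generated document against a mapping is to walk both and report fields the mapping does not declare. This is a simplified sketch of the idea, not the check implemented in this commit; the function name is hypothetical and it looks only at field presence, not at data types.

```python
def unmapped_fields(doc, mapping, path=""):
    """Return dotted names of document fields missing from the mapping."""
    problems = []
    properties = mapping.get("properties", {})
    for key, value in doc.items():
        field = path + key
        if key not in properties:
            problems.append(field)
        elif isinstance(value, dict):
            # Descend into object fields and check their sub-fields too.
            problems.extend(unmapped_fields(value, properties[key], field + "."))
    return problems
```

An empty result means every field in the document has a corresponding entry in the mapping.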
We add a unit test, 7.17, to verify vmstat tool data indexing works.
Since we are re-working toc-entries in this effort, we take the
opportunity to add missing fields for directories (time stamp and mode),
ensure that all empty directories are indexed as well, and add the mtime
field for files. This is in preparation for viewing tar ball contents
via the JSON documents only.
When we get BadDate exceptions from processing the pbench tar ball we
will exit with that status. But when we encounter these errors while
dealing with result and tool data, we have already indexed the run
document, and other tool data, so ending early is not too useful. We
now continue to index so that we can report all the errors, indexing
what we can.
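
The continue-on-error behavior can be sketched like this. The `BadDate` class below is a stand-in for the exception named above, and `index_all` and `make_docs` are hypothetical names used only for illustration.

```python
class BadDate(Exception):
    """Stand-in for the BadDate exception named in this commit message."""


def index_all(sources, make_docs):
    """Generate documents for every source, collecting BadDate errors.

    A failure in one source is recorded and processing continues with
    the next, so all errors can be reported at the end.
    """
    docs, errors = [], []
    for source in sources:
        try:
            docs.extend(make_docs(source))
        except BadDate as exc:
            errors.append((source, str(exc)))
    return docs, errors
```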
We add an "@idx" field to every document to record the "index" or the
offset of that document in the original source so that we can go back
and look up the original document a bit easier. For result data, that
is the array offset for a particular object in the original JSON document,
while for tool data that is the row in a .csv file, or which timestamp
in a text file.
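
For .csv tool data, recording the offset could look like the sketch below. Whether the real code counts from zero or includes the header row is not stated here, so the zero-based data-row offset is an assumption, as is the function name.

```python
import csv
import io


def csv_docs(text):
    """Yield one document per .csv data row, tagged with its row offset."""
    reader = csv.DictReader(io.StringIO(text))
    for idx, row in enumerate(reader):
        doc = dict(row)
        doc["@idx"] = idx  # offset of this row in the original source
        yield doc
```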
The uid calculations for result data have been fixed to replace
"controller_host" references with the controller field of the run
document.
In order to reduce a bit of space, result-data documents have had all
but their identifying fields stripped and placed into a parent "sample"
document. This significantly reduces the size of the JSON documents
for each result value. The parent sample document contains all of the
metadata from the benchmark.parameters section of the original
result.json file.
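
A simplified sketch of the parent/child split, assuming each result value becomes a small child document that refers to a shared parent holding the benchmark parameters. The field names (`_id`, `_parent`, `value`) and the function are hypothetical; the actual document layout is defined by the mappings in this commit.

```python
def split_sample(sample_id, parameters, values):
    """Build one parent sample document and slim per-value children."""
    parent = {"_id": sample_id, "sample": dict(parameters)}
    children = [
        {"_parent": sample_id, "@idx": idx, "value": value}
        for idx, value in enumerate(values)
    ]
    return parent, children
```

The benchmark parameters are stored once in the parent rather than being repeated in every result value document.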
We have dropped the use of @metadata fields for all but the run
documents, where that namespace is reserved for metadata about the tar
ball which is rarely used. We then add three namespaces for result-data
and tool-data documents: run, iteration, and sample. These are used
consistently across all document types.
We now capture the directory path name of the controller in the
@metadata field of the run document. This enables the UI to reconstruct
the proper URL to the incoming tar ball when the controller name does
not match the directory name of the controller. Typically this happens
when a satellite tar ball is ingested to the main server. E.g. when
the satellite name is EC2, the controller name on disk will be
"EC2::controller" while the metadata.log file will have the controller
property of the "run" section set to "controller.example.com". The
new @metadata.controller_dir will now contain "EC2::controller". We
also record the "EC2" value in the @metadata.satellite field.
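
Deriving the two fields from the on-disk directory name can be sketched as follows, assuming the satellite name is whatever precedes the "::" separator. The function name is hypothetical.

```python
def controller_metadata(controller_dir):
    """Return the @metadata entries derived from the controller directory."""
    meta = {"controller_dir": controller_dir}
    satellite, sep, _ = controller_dir.partition("::")
    if sep:
        # Only satellite tar balls carry the '<satellite>::' prefix.
        meta["satellite"] = satellite
    return meta
```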
We sort the order of the tool data files to be processed so that we
don't get unexpected changes in the unit test output.
1 parent b75a4df commit 036d65b
File tree
61 files changed, +69345 -31455 lines changed
- server
- bin
- gold
- state
- config
- test-5.1.config
- lib
- config
- mappings
- pbench
- settings