A random-random test for time-series data #132556
base: main
Pinging @elastic/es-storage-engine (Team:StorageEngine)
I'd like to get some confirmation from @kkrik-es that this is doing what he wants, but I think it's pretty good. I left some feedback, none of which is critical but I'd like to get it addressed.
@@ -78,6 +79,7 @@ public FieldDataGenerator generator(String fieldName, DataSource dataSource) {
     case IP -> new IpFieldDataGenerator(dataSource);
     case CONSTANT_KEYWORD -> new ConstantKeywordFieldDataGenerator();
     case WILDCARD -> new WildcardFieldDataGenerator(dataSource);
+    default -> null;
I would prefer not to have a default in this switch. When someone adds a new `FieldType` here, I want the compiler to tell them they must also update this switch, but the default will hide that.
In fact, this probably shouldn't be a switch. It should probably be an abstract method or a function member on the enum itself. Switches on enums are a bit of a smell, and switches on enums from within the enum itself are so smelly as to almost be an error, IMHO. I realize you didn't add this switch, but this is a good opportunity to refactor it and leave it better than we found it.
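For illustration, a minimal sketch of the "method on the enum itself" alternative suggested above. `SketchFieldType` and `SketchGenerator` are hypothetical stand-ins, not the PR's `FieldType` or `FieldDataGenerator`; the point is only that a constant added without its own implementation fails to compile, so no switch has to be kept in sync.

```java
// Hypothetical stand-in for the real generator interface in the PR.
interface SketchGenerator {
    String describe();
}

// Hypothetical stand-in for FieldType: each constant implements the abstract method.
enum SketchFieldType {
    IP {
        @Override
        SketchGenerator generator(String fieldName) {
            return () -> "ip:" + fieldName;
        }
    },
    WILDCARD {
        @Override
        SketchGenerator generator(String fieldName) {
            return () -> "wildcard:" + fieldName;
        }
    };

    // A new constant that does not override this method is a compile error.
    abstract SketchGenerator generator(String fieldName);
}

final class EnumMethodSketch {
    public static void main(String[] args) {
        for (SketchFieldType type : SketchFieldType.values()) {
            System.out.println(type.generator("host").describe());
        }
    }
}
```

Dropping the `default` branch from the existing switch expression gives a similar compile-time check with a smaller change, since a switch expression over an enum must cover every constant.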
maybe we can compromise on removing the default? 😅
@@ -64,7 +64,7 @@ public Mapping generate(Template template) {

         rawMapping.put("_doc", topLevelMappingParameters);

-        if (specification.fullyDynamicMapping()) {
+        if (specification.fullyDynamicMapping() == false) {
Why are we inverting this check? Especially while leaving a comment that says the thing we just checked was false must be true?
I fixed the comment. Note how the code path is inverted and the actual setup for dynamic mapping is below.
private List<XContentBuilder> documents = null;
private DataGenerationHelper dataGenerationHelper;

private static final class DataGenerationHelper {
I wonder if this should be a top level class. Seems like we'll want to build multiple test classes using this framework.
I've moved this class to its own file! TY.
private static Object randomDimensionValue(String dimensionName) {
    // We use dimensionName to determine the type of the value.
    var isNumeric = dimensionName.hashCode() % 5 == 0;
    if (isNumeric) {
What about IP dimensions?
Added: 20% of dimensions are now IP-like.
As a follow-up, I'll add dynamic mapping to parse these as IP. Thoughts?
TY @not-napoleon - ptal!
);
return new DataSourceResponse.FieldTypeGenerator(() -> {
    // All field types minus the excluded ones.
    var fieldTypes = Arrays.stream(FieldType.values())
Nit: store this in a static variable.
import static org.hamcrest.Matchers.closeTo;
import static org.hamcrest.Matchers.equalTo;

public class GenerativeTSIT extends AbstractEsqlIntegTestCase {
Nit: let's expand to `GenerativeTimeSeriesIT`. Also, maybe skip `Generative` for now, since this is different than the other generative tests.
)
.filter(val -> val.v2().isEmpty() == false) // Filter out empty values
.map(tup -> tup.v1() + ":" + tup.v2())
.toList();
Nit: maybe it's just me, but the above is somewhat hard to parse. I'd move this to a helper function.

static Long windowStart(Object timestampCell, int secondsInWindow) {
    // The timestamp is in the 4th column (index 3)
    return Instant.parse((String) timestampCell).toEpochMilli() / 1000 / secondsInWindow * secondsInWindow;
Why do we have `/ secondsInWindow * secondsInWindow`?
Let's also use parentheses for readability.
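One plausible reading of the expression, sketched here under assumptions: integer division drops the remainder, so dividing by the window length and multiplying back rounds the timestamp down to the start of its window. `WindowStartSketch` is a hypothetical class that takes the timestamp as a `String` rather than a result cell, with the parentheses made explicit.

```java
import java.time.Instant;

// Hypothetical standalone version of the helper under review.
final class WindowStartSketch {
    static long windowStart(String timestamp, int secondsInWindow) {
        long epochSeconds = Instant.parse(timestamp).toEpochMilli() / 1000;
        // long division truncates, so (x / w) * w is the largest multiple of w
        // that is <= x for non-negative x, i.e. the start of the window.
        return (epochSeconds / secondsInWindow) * secondsInWindow;
    }

    public static void main(String[] args) {
        // Both timestamps fall in the same 60-second window.
        System.out.println(windowStart("2025-01-01T00:01:37Z", 60));
        System.out.println(windowStart("2025-01-01T00:01:00Z", 60));
    }
}
```

For timestamps before the epoch, truncation rounds toward zero rather than down; `Math.floorDiv` would be the safer choice if such values can occur.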
Settings.Builder settingsBuilder = Settings.builder();
// Ensure it will be a TSDB data stream
settingsBuilder.put(IndexSettings.MODE.getKey(), IndexMode.TIME_SERIES);
settingsBuilder.putList("index.routing_path", List.of("attributes.*"));
No need to define this for data streams; dimensions are automatically added to the routing path.
settingsBuilder.put(IndexSettings.MODE.getKey(), IndexMode.TIME_SERIES);
settingsBuilder.putList("index.routing_path", List.of("attributes.*"));
CompressedXContent mappings = mappingString == null ? null : CompressedXContent.fromJSON(mappingString);
// print the mapping
Print? Sorry I missed this, what does this mean?
.map(doc -> ((Map<String, Integer>) doc.get("metrics")).get("gauge_hdd.bytes.used"))
.toList();
// Verify that the first column is the max value (the query gets max, avg, min in that order)
docValues.stream().max(Integer::compareTo).ifPresentOrElse(maxValue -> {
Same, let's move calculation of the expected values to separate functions so that they can be generalized later.
 * there is only one metric group per time bucket.
 */
public void testGroupByNothing() {
    try (EsqlQueryResponse resp = run(String.format(Locale.ROOT, """
There's a lot of duplication between these tests. Let's try to refactor them and move shared parts to utility functions.
var docValues = windowDataPoints.stream()
    .map(doc -> ((Map<String, Integer>) doc.get("metrics")).get("gauge_hdd.bytes.used"))
    .toList();
// Verify that the first column is the max value (the query gets max, avg, min in that order)
In a follow-up, you can avoid hard-coding by defining an enum for each function, with corresponding validation logic.
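A rough sketch of what such an enum could look like, assuming each aggregation only needs to compute its expected value from the raw integer samples. `AggSketch` and its `expected` method are hypothetical names, not part of the PR.

```java
import java.util.List;
import java.util.function.ToDoubleFunction;

// Hypothetical enum: one constant per aggregation, each carrying its own expected-value logic.
enum AggSketch {
    MAX(values -> values.stream().mapToDouble(Integer::doubleValue).max().orElseThrow()),
    AVG(values -> values.stream().mapToDouble(Integer::doubleValue).average().orElseThrow()),
    MIN(values -> values.stream().mapToDouble(Integer::doubleValue).min().orElseThrow());

    private final ToDoubleFunction<List<Integer>> expectedValue;

    AggSketch(ToDoubleFunction<List<Integer>> expectedValue) {
        this.expectedValue = expectedValue;
    }

    // Expected result of this aggregation over the given samples.
    double expected(List<Integer> values) {
        return expectedValue.applyAsDouble(values);
    }
}

final class AggSketchDemo {
    public static void main(String[] args) {
        List<Integer> samples = List.of(3, 9, 6);
        for (AggSketch agg : AggSketch.values()) {
            System.out.println(agg + " -> " + agg.expected(samples));
        }
    }
}
```

A test could then iterate over `AggSketch.values()` in query-column order instead of repeating one hard-coded block per column.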
// Verify that the second column is the avg value (thus why row.get(2))
docValues.stream().mapToDouble(Integer::doubleValue).average().ifPresentOrElse(avgValue -> {
    var res = (Double) row.get(2);
    assertThat(res, closeTo(avgValue, res * 0.5));
Why do we need the `0.5` factor?
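To make the size of that tolerance concrete, here is a small sketch. It does not use Hamcrest; the hypothetical `closeTo` helper only mirrors the rule Hamcrest documents for `closeTo(operand, error)`, namely that a value matches when its absolute difference from the operand is at most `error`. The numbers are made up.

```java
// Hypothetical helper mirroring the documented closeTo rule: |actual - operand| <= error.
final class ToleranceSketch {
    static boolean closeTo(double actual, double operand, double error) {
        return Math.abs(actual - operand) <= error;
    }

    public static void main(String[] args) {
        double res = 100.0;      // value returned by the query (made-up number)
        double expected = 60.0;  // value computed from the documents (made-up number)
        // An error of res * 0.5 accepts anything within 50% of the returned value.
        System.out.println(closeTo(res, expected, res * 0.5));
        // A 1% error rejects the same pair.
        System.out.println(closeTo(res, expected, res * 0.01));
    }
}
```

Because the error is derived from `res`, a negative result would produce a negative error and the check could never pass.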

private static Object randomDimensionValue(String dimensionName) {
    // We use dimensionName to determine the type of the value.
    var isNumeric = dimensionName.hashCode() % 5 == 0;
Nit: let's just use randomDouble() < 0.2. Same below.
I see, this needs to be consistent per field. Why not rely on the data generation framework to provide these values? You can start simple with all dimensions being keyword fields - not dynamic, no pass-through subfields.
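If the hash-based approach is kept, one detail is worth a sketch: Java's `%` keeps the sign of the dividend and `String.hashCode()` can be negative, so a check like `hashCode() % 5 == 1` only matches names with a non-negative hash and covers fewer than one in five names. Assuming an even 20% share per bucket is the intent, `Math.floorMod` gives a bucket in `[0, 5)` that is still stable per field name. `DimensionBucketSketch` and `Kind` are hypothetical names.

```java
// Hypothetical sketch: stable per-name bucketing that also handles negative hash codes.
final class DimensionBucketSketch {
    enum Kind { NUMERIC, IP, KEYWORD }

    static Kind kindFor(String dimensionName) {
        // floorMod always returns a value in [0, 5), unlike % which can be negative.
        int bucket = Math.floorMod(dimensionName.hashCode(), 5);
        if (bucket == 0) {
            return Kind.NUMERIC;
        }
        if (bucket == 1) {
            return Kind.IP;
        }
        return Kind.KEYWORD;
    }

    public static void main(String[] args) {
        System.out.println(-7 % 5);               // remainder keeps the sign of the dividend
        System.out.println(Math.floorMod(-7, 5)); // always non-negative for a positive divisor
        System.out.println(kindFor("host.name"));
    }
}
```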
var isNumeric = dimensionName.hashCode() % 5 == 0;
var isIP = dimensionName.hashCode() % 5 == 1;
if (isNumeric) {
    // Numeric values are sometimes passed as integers and sometimes as strings.
We don't need to test parsing here so let's just use one of the formats.
this.numDocs = numDocs;
attributesForMetrics = List.copyOf(Set.copyOf(ESTestCase.randomList(1, 300, () -> ESTestCase.randomAlphaOfLengthBetween(2, 30))));
numTimeSeries = ESTestCase.randomIntBetween(10, (int) Math.sqrt(numDocs));
// System.out.println("Total of time series: " + numTimeSeries);
Remove.
Map.of(
    "gauge_double",
    Map.of("path_match", "metrics.gauge_*", "mapping", Map.of("type", "double", "time_series_metric", "gauge"))
)
The last two have the same path as the above, I'm surprised that this is allowed. Let's skip them for now, since you only use one long gauge.
Thanks Pablo, this is a good step. It's nice that you tried to include the pass-through field on the first take, though that complicates things somewhat. I'd start with statically defined dimension and metric fields to get the validation logic in place first, then introduce dynamic fields on top of that.
Let's try to refactor the logic slightly so that it can be further extended in follow-up PRs.
Follow-up items after this PR:
- `rate` function and counters in general