Conversation

@kfaraz (Contributor) commented Jul 4, 2025

Summary

  • This patch migrates several indexing and compaction integration tests (several of which have been flaky recently) to the embedded-tests framework.
  • Overall, this migration will save about 1 hr 30 mins of GitHub runner time.
  • The MiddleManager variants of these integration tests have not been ported yet, since they launch child processes and are therefore slower and less suitable for embedded tests.
  • MM-based embedded tests can easily be enabled at any point in the future.

Changes

  • The new tests use SQL queries to verify results instead of native queries, since SQL is much more concise and makes for more readable tests. Native queries can be added later, or in future tests migrated to this framework. (See the sketch after this list.)
  • Add a TaskBuilder utility to create Task objects with a fluent syntax.
  • Add JSON data resource files to embedded-tests.
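
For illustration, here is a rough sketch of what a migrated test body might look like, stitched together from the builder and helper calls quoted in the review threads below (TaskBuilder chaining, runTask, verifyScanResult). The class scaffolding, the JUnit 5 annotation, the static builder entry point, the dataSource field, and the expected values are assumptions, not the actual code in this PR.

// Illustrative sketch only: builder/helper calls are taken from snippets quoted in the
// review threads below; the surrounding scaffolding and data values are assumptions.
// (Imports for the PR's own classes are omitted.)
import org.junit.jupiter.api.Test;

public class IndexParallelTaskTestSketch extends CompactionTestBase
{
  @Test
  public void test_runIndexParallelTask_andVerifyWithSql()
  {
    // Build an 'index_parallel' task with the fluent builder instead of a JSON spec.
    final TaskBuilder<?, ?, ?> task = TaskBuilder
        .ofTypeIndexParallel()                                       // assumed static entry point
        .jsonInputFormat()
        .inlineInputSourceWithData(Resources.InlineData.JSON_2_ROWS)
        .isoTimestampColumn("timestamp");

    // Submit the task to the embedded cluster and wait for it to succeed.
    runTask(task, dataSource);

    // Verify results with a SQL scan rather than a native query.
    // Expected values are newline-separated; the numbers here are placeholders.
    verifyScanResult("added", "57\n143");
  }
}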

Test migrations

Old test                               New test
ITPerfectRollupParallelIndexTest       IndexParallelTaskTest
ITBestEffortRollupParallelIndexTest    IndexParallelTaskTest, added as a new test parameter which uses dynamic partitioning
ITAutoCompactionTest                   AutoCompactionTest
ITAutoCompactionUpgradeTest            AutoCompactionUpgradeTest
ITAutoCompactionLockContentionTest     KafkaClusterMetricsTest, method test_ingestClusterMetrics_compactionSkipsLockedIntervals()
ITCompactionTaskTest                   CompactionTaskTest
ITCompactionSparseColumnTest           CompactionSparseColumnTest
ITOverlordResourceTest                 Already verified in OverlordClientTest
ITOverlordResourceNotFoundTest         Already verified in OverlordClientTest

New nested tests

  • EmbeddedCentralizedSchemaPublishFailureTest for the group cds-task-schema-publish-disabled
  • EmbeddedCentralizedSchemaMetadataQueryDisabledTest for the group cds-coordinator-metadata-query-disabled

Test run times

Test                                                Actual test time    Total time including setup
IndexParallelTaskTest (Indexer)                     38 s                38 s
AutoCompactionTest (Indexer)                        1 min 20 s          1 min 20 s
CompactionSparseColumnTest (Indexer)                12 s                12 s
Standard ITPerfectRollupParallelIndexTest           10 min              15 min
Standard ITPerfectRollupParallelIndexTest           10 min              15 min
  (Indexer, shuffle deep store test, only 1 config changed)
Standard ITPerfectRollupParallelIndexTest           10 min              15 min
  (MM, shuffle deep store test, only 1 config changed)
Standard ITBestEffortRollupParallelIndexTest        2 min               NA (setup includes other tests too)
Revised ITBestEffortRollupParallelIndexTest         4 min 10 s          NA (setup includes other tests too)
Standard ITAutoCompactionTest (MiddleManager)       25 min              NA (setup includes other tests too)
Standard ITAutoCompactionTest (Indexer)             15 min              NA (setup includes other tests too)
Standard ITCompactionSparseColumnTest (Indexer)     2 min 10 s          NA (setup includes other tests too)

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever it would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@kfaraz changed the title from "Add EmbeddedIndexParallelTaskTest to migrate ITPerfectRollupParallelI…" to "Migrate ITPerfectRollupParallelIndexTest and ITBestEffortRollupParallelIndexTest to embedded-tests" on Jul 4, 2025
@github-actions bot added the GHA label on Jul 4, 2025
@kfaraz requested a review from gianm on Jul 4, 2025 15:53
@kfaraz changed the title from "Migrate ITPerfectRollupParallelIndexTest and ITBestEffortRollupParallelIndexTest to embedded-tests" to "Migrate several indexing and compaction integration tests to embedded-tests" on Jul 8, 2025
kfaraz added a commit that referenced this pull request Jul 8, 2025
Bug:
Concurrent append uses a lock of type APPEND, which always uses a lock version equal to the epoch (1970-01-01).

This can cause data loss in the following flow:
- Ingest data into an empty interval using an APPEND task
- Mark all the segments as unused
- Re-run the APPEND task
- The data is not visible, since the old segment IDs (now unused) are allocated again

Fix:
In segment allocation, do not reuse an old segment ID, whether used or unused.
This fix was already done for some cases back in #16380.
An embedded test for this has been included in #18207.
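
For context, a rough sketch of the shape such an embedded regression test could take. The helper methods below are hypothetical placeholders, and the datasource, interval, and row counts are made up; this is not the actual test added in #18207.

// Illustrative sketch only: helper methods are hypothetical placeholders, not real APIs.
public class ConcurrentAppendSegmentReuseSketch
{
  public void test_reingestAfterMarkingUnused_doesNotReuseSegmentIds()
  {
    // 1. Ingest data into an empty interval using an APPEND task.
    submitAppendTask("wiki", "2013-08-31/2013-09-01");
    verifyRowCount("wiki", 3);

    // 2. Mark all segments of the datasource as unused; the data is no longer queryable.
    markAllSegmentsUnused("wiki");
    verifyRowCount("wiki", 0);

    // 3. Re-run the same APPEND task. Before the fix, segment allocation could hand back
    //    the old (now unused) segment IDs, leaving the newly ingested data invisible.
    submitAppendTask("wiki", "2013-08-31/2013-09-01");
    verifyRowCount("wiki", 3);
  }

  // Hypothetical stand-ins for the embedded cluster APIs.
  private void submitAppendTask(String dataSource, String interval) { /* submit task and await success */ }
  private void markAllSegmentsUnused(String dataSource) { /* call the segment-management API */ }
  private void verifyRowCount(String dataSource, int expectedRows) { /* run a SQL count query */ }
}
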
@@ -0,0 +1,3 @@
{"timestamp": "2013-08-31T01:02:33Z", "page": "Gypsy Danger", "language" : "en", "tags": ["t1", "t2"], "user" : "nuclear", "unpatrolled" : "true", "newPage" : "true", "robot": "false", "anonymous": "false", "namespace":"article", "continent":"North America", "country":"United States", "region":"Bay Area", "city":"San Francisco", "added": 57, "deleted": 200, "delta": -143}
Member:

Drive-by comment/nit: I know this isn't new, but IMO we should fix the problem of referring to this dataset as 'wikipedia', because it is confusing with the quickstart wikipedia data, which is also going to be used in some tests, and this data only has a vaguely similar schema. Maybe tiny-wikipedia or something, to indicate that it's a very small dataset, would help clear things up?

Contributor Author:

Thanks for the suggestion! I will rename these datasets accordingly.

@Akshat-Jain reopened this on Jul 23, 2025
@Akshat-Jain reopened this on Jul 23, 2025
@Akshat-Jain reopened this on Jul 25, 2025
@clintropolis (Member) left a comment:

This LGTM. Thanks for the builder stuff; it feels to me like this should be a lot more ergonomic and less error-prone than either the old IT templates or trying to hand-craft JSON specs.

Seems fine to merge after fixing up the TODO comment one way or another.

One thing I was thinking about while reviewing: I believe we are perhaps losing some minor coverage of auth stuff during this transition, since I think the base ITs had basic-auth setup, though AFAICT not much in the way of roles and stuff in most tests (so I think only authentication would have been tested, except for those that extend AbstractAuthConfigurationTest). That is probably fine as long as we migrate the auth tests over to run on this framework, though they probably don't cast quite as wide a net in terms of APIs being called, since those tests are more focused on authorization. I am not suggesting we just bake basic auth into random tests or anything, and really it's a bit of a negative of the old frameworks that you have to hunt across several files to determine what configuration is actually active for a given test, but it is maybe something we should watch out for as we move tests over.

import java.util.TreeSet;
import java.util.stream.Collectors;

public abstract class CompactionTestBase extends EmbeddedClusterTestBase
Member:

Other than maybe the cluster config, most of the utility methods on this class don't seem too related to compaction; should they live somewhere more common? It's fine to change this later, I just want to avoid copy-pasting these methods down the line, or awkward extension of the compaction test base for things that aren't doing any compaction.

Contributor Author:

Fair enough, I will move the common utility methods to a utility class similar to EmbeddedMsqApis, as suggested by @gianm here.

);

// Wait for scheduler to pick up the compaction job
// TODO: make this latch-based
Member:

We should either do this or turn the TODO into a comment about how someone could improve this in the future.

Contributor:

Yes, please don't commit todo comments.

Contributor Author:

Sorry, must have missed this in the cleanup.

.ofTypeIndexParallel()
.jsonInputFormat()
.inlineInputSourceWithData(Resources.InlineData.JSON_2_ROWS)
.isoTimestampColumn("timestamp")
Contributor:

I wonder if it would be nicer to have .timestampColumn("timestamp", "iso"). Makes it easier to sub in "auto" or whatever other format.

Contributor Author:

I did have that initially, but realized that most tests just use an ISO timestamp, so I used this syntactic sugar instead.

For a fully custom timestamp, there is also the option to use:

.dataSchema(d -> d.withTimestamp(new TimestampSpec(...)))

Please let me know which one you prefer.

Contributor:

I'm ok with either one.

* Verifies the result of a SELECT query
*
* @param field Field to select
* @param result CSV result with special strings {@code ||} to represent
Contributor:

why do this rather than accept the CSV as-is?

@kfaraz (Contributor Author) commented Jul 29, 2025:

Yeah, I didn't want to do this initially, but it made for more readable tests as it avoided all the escaping of empty strings and newlines.

verifyScanResult("added", "...||31||...||62");

vs

verifyScanResult("added", "\"\"\n31\"\"\n62");

Please let me know if this seems hacky and if you feel that it is cleaner to just use the original.

Contributor:

I see. To me the first form is hard to get used to. I keep wanting to read the || as field separators rather than newlines. The second is also weird looking though. I can't think of a good solution immediately. I'm ok with what you think is best.

Contributor Author:

I tried a bunch of different symbols, but nothing really conveyed a "newline" well enough.
The empty strings are probably the worse of the two, though.
Using an ellipsis for now, but keeping the newline \n as-is.

So, we would have something like this:

verifyScanResult("added", "...\n31\n...\n62");

*
* @return ID of the task.
*/
protected String runTask(TaskBuilder<?, ?, ?> taskBuilder, String dataSource)
Contributor:

IMO, protected methods on a base class aren't the best way to provide utility APIs. Other test cases might want some of these utility methods without having to be "compaction tests". Also, some tests might want to extend multiple base classes, and this approach makes that impossible.

With the MSQ tests we have EmbeddedMSQApis, which collects utility APIs together without using a base-class approach. Something similar might work here?

Contributor Author:

Makes sense, thanks for the suggestion!

Contributor Author:

Moved most methods to EmbeddedClusterApis itself so that all tests can benefit from them.
I have still kept some of the protected methods in CompactionTestBase, but these mostly just act as syntactic sugar over the methods in EmbeddedClusterApis, to keep the diff in the compaction test classes small.
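
As a rough illustration of the resulting shape (only the class names EmbeddedClusterApis and CompactionTestBase come from the discussion; the method names and bodies below are hypothetical): shared helpers live on a utility object any test can use, while the test base keeps thin delegating wrappers.

// Illustrative sketch of the pattern discussed above; names and bodies are hypothetical.
class EmbeddedClusterApisSketch
{
  /** Shared helper usable by any embedded test, not just compaction tests. */
  String runTask(Object taskPayload, String dataSource)
  {
    // submit the task, wait for it to finish, return the task ID
    return "task-id";
  }
}

abstract class CompactionTestBaseSketch
{
  protected final EmbeddedClusterApisSketch cluster = new EmbeddedClusterApisSketch();
  protected String dataSource = "wiki";

  /** Thin syntactic-sugar wrapper so existing compaction tests need minimal diff. */
  protected String runTask(Object taskPayload)
  {
    return cluster.runTask(taskPayload, dataSource);
  }
}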

/**
* Constants and utility methods used in embedded cluster tests.
*/
public class Resources
Contributor:

Moving this to embedded-tests means that it's now tougher for extensions to write their own tests that pull in Resources. Can you split this up? Such as having some Resources in services, and MoreResources in embedded-tests.

Contributor Author:

Updated.

@kfaraz (Contributor Author) commented Jul 29, 2025:

One thing I was thinking about while reviewing: I believe we are perhaps losing some minor coverage of auth stuff during this transition, since I think the base ITs had basic-auth setup, though AFAICT not much in the way of roles and stuff in most tests (so I think only authentication would have been tested, except for those that extend AbstractAuthConfigurationTest). That is probably fine as long as we migrate the auth tests over to run on this framework, though they probably don't cast quite as wide a net in terms of APIs being called, since those tests are more focused on authorization. I am not suggesting we just bake basic auth into random tests or anything, and really it's a bit of a negative of the old frameworks that you have to hunt across several files to determine what configuration is actually active for a given test, but it is maybe something we should watch out for as we move tests over.

Thanks for calling this out, @clintropolis!

I do plan to migrate the auth tests as well to the embedded framework.
For now, we will migrate the non-auth tests without basic auth enabled.
Once all tests are migrated, we can add test variants which have auth enabled to increase API coverage.

@kfaraz (Contributor Author) commented Jul 29, 2025:

@gianm , @clintropolis , thanks for the feedback!
I have updated the PR based on your suggestions.
Please let me know if further changes are required.

@gianm (Contributor) commented Jul 29, 2025:

@gianm , @clintropolis , thanks for the feedback! I have updated the PR based on your suggestions. Please let me know if further changes are required.

Approved, since the only remaining comment (about verifyScanResult) is nonblocking.

@Akshat-Jain reopened this on Jul 29, 2025
@kfaraz merged commit 3b7dd53 into apache:master on Jul 29, 2025
72 checks passed
@kfaraz deleted the add_embedded_perfect_rollup_test branch on July 29, 2025 10:52
@cecemei added this to the 35.0.0 milestone on Oct 21, 2025