
Improve queries tool #624

Merged

elshize merged 55 commits into pisa-engine:main from gustingonzalez:improve/queries-stats on Jan 30, 2026
Conversation

@gustingonzalez (Collaborator) commented Nov 19, 2025

Key changes in this pull request:

  1. Replaces the --extract option with --output, which now requires an explicit output file. Since more than one algorithm (query type) can be specified, the algorithm is now also printed in the TSV.
  2. The original op_perftest() behavior remains available when --output is not specified.
  3. Adds a new --runs option that specifies how many times the query set is measured (default: 3). Note that this count excludes the warmup run.
  4. Modifies the --algorithm parameter to accept multiple -a/--algorithm flags instead of a colon-separated list of algorithms (see the example invocation after this list).
  5. The summary now includes per-run query timing aggregation using none, min, mean, median and max as aggregation types.
  6. Measures each query once per run as with the former op_perftest(), so the set of queries is evaluated independently in each run.
  7. Refactors the code to improve clarity.
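
For illustration, an invocation under the new interface might look like this (the index path, query file, and algorithm names are placeholders):

./build/bin/queries --encoding block_simdbp --index /path/to/index -k 10 --scorer bm25 -q /path/to/queries.txt -a block_max_wand -a maxscore --runs 5 --output times.tsv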

@elshize (Member) left a review comment

This is not entirely what I had in mind. I think the aggregate function should either be applied to everything or we should just not aggregate at all. Otherwise, it's just too complex to keep track of what is what.

I was also thinking we should either summarize or extract. But I think it's fine to just use different streams; in that case, I would remove the option to summarize only and just always summarize.

Let's discuss this a little more.

@JMMackenzie do you think it makes sense to extract results after aggregation? Say, return min for each query instead of R results where R is num of runs?

If not, then maybe it's best to just always extract everything and always print out summary to stderr and maybe that summary will always be (a) no aggregate, (b) min aggregate, and (c) mean aggregate? I frankly see no use for max or median. What are your thoughts?

If it makes sense to aggregate the actual output data, then I think there should always be 1 aggregate applied to both data and summary, or no aggregate at all.

@JMMackenzie (Member):

@JMMackenzie do you think it makes sense to extract results after aggregation? Say, return min for each query instead of R results where R is num of runs?

I think this could be reasonable, and I do think this is what I had initially envisaged. But I can see the benefit of extracting everything to stderr, and then also allowing a separate "aggregated" stream to either a file or stdout. I agree that if an aggregated stream is being output, then the summary should use that same aggregation. Maybe we can make a new "results" format where we dump both file.agg.summary and file.agg.stream?

@elshize (Member) commented Nov 20, 2025

I would actually try to avoid introducing new formats; I'd like to keep this as simple, and as unsurprising, as possible.

I think the most important is to give the user the raw data, which they can process however they want. I think we all agree on this part.

Then, I would lean towards simplicity:

  • I think having a single type of output (data vs summary) is much less ambiguous, so I lean towards having either one or the other. I'm thinking of the summary as a convenience during prototyping rather than actual serious data gathering.
  • Having the ability to aggregate for a query may be crucial for summary, and maybe useful for extracting data.
  • I think we should think hard about which agg functions make sense, and not include those that don't. I'm struggling to find a use for max or median. I think min and mean make sense.

We will not support all types of output from this tool, but that's OK. This is why we give the user the raw output.

Given the above, I would suggest the following algorithm:

results = run_benchmark(...)
if (aggregate) {
  results = aggregate_per_query(results, agg_function)
}
if (summarize) print_summary(results)
else print_data(results)

You have a few choices for implementing this.

One is having results as "rows", i.e., a vector of structs describing everything you need, including the query ID, and then aggregate_per_query can internally do group-by (could just use a hash map) and aggregate.

The other would be to have results as nested vectors and then after aggregation you still get nested vectors, only each inner vector has one element.

agg_function can take multiple results and output multiple results as well; this way you can treat no aggregate as a type of aggregate (identity) -- maybe "transform" is a better name? I would prefer creating that transform function first (from the CLI options) and then passing it in and applying it to each query group, as opposed to passing a name around and nesting all the if-else conditions. It would be much cleaner that way, but it requires some design consideration.

print_data and print_summary don't need to know anything about how results were produced or transformed, only how to print them (or calculate stats).
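
To make the transform idea concrete, here is a minimal sketch (hypothetical names, not actual PISA code) of resolving the CLI option into a function once and then applying it per query group:

#include <algorithm>
#include <functional>
#include <numeric>
#include <stdexcept>
#include <string>
#include <vector>

// Maps one query's measurements (across runs) to the transformed measurements.
using Transform = std::function<std::vector<double>(std::vector<double> const&)>;

// Build the transform once from the CLI option; the caller then applies it
// uniformly to each query group, with no nested if-else per query.
Transform make_transform(std::string const& name) {
    if (name == "none") {
        return [](std::vector<double> const& times) { return times; };  // identity
    }
    if (name == "min") {
        return [](std::vector<double> const& times) {
            return std::vector<double>{*std::min_element(times.begin(), times.end())};
        };
    }
    if (name == "mean") {
        return [](std::vector<double> const& times) {
            auto sum = std::accumulate(times.begin(), times.end(), 0.0);
            return std::vector<double>{sum / static_cast<double>(times.size())};
        };
    }
    throw std::invalid_argument("unknown transform: " + name);
}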

@gustingonzalez (Collaborator, Author) commented Nov 20, 2025

I think that, in this case, the median could provide more robustness than the mean; for example, by suppressing atypical cases or noise (mostly related to the maximum values across the runs).

As for the max aggregation, I don't know if it is really useful (maybe to capture the worst cases?), but it may ultimately not be representative; I just included it because its implementation required no additional effort. If it has no real usefulness, I think it should be removed.

Regarding the methodology for printing data or a summary, I think it is useful to show the summary when extracting. In that case, if some of the metrics satisfy the user's needs, there is no need to run an external script (for that reason, I think it is useful to print all defined metrics when no aggregation/transformation is specified).

Also, given that the query times are printed to the output, they can simply be exported using redirection (>). If that is confusing, a file path could be used instead, but I think redirection keeps it simple. In the case of printing just the summary (for example, for a quick check of a prototype), I agree that it makes no sense to print the query times.

gustingonzalez marked this pull request as draft on November 20, 2025.
@elshize (Member) commented Nov 21, 2025

I think that, for the case, the median could provide more robustness than mean; for example, by suppressing atypical cases or noise (mostly related to the maximum values across the runs).

That's fair.


My main concern is making it too complex. How about we always extract all queries (user can process that data themselves) and always print all summaries?

I really don't want to go the route of defining aggregate function that will only apply to one or the other.

Just note that if you use stderr for summaries, we can't pipe it to another tool for transformation, because the redirect will also capture the rest of the logs, so it will be purely informative.
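
For example, with summaries on stderr, a redirect like this (hypothetical invocation) would capture the "warming up" and "performing queries" log lines together with the summaries:

./queries ... -q queries.txt 2> summaries.jsonl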

@gustingonzalez (Collaborator, Author):

The behavior in which all runs (together with all summaries) are extracted occurs when aggregation is set to none (or not specified): in this case, the tool shows the summary with the runs aggregated by min, mean, etc.

However, another experiment could be: "I just want to know what happens when all values are the minimum". Although I can obtain this value from the summary output, if I specify aggregation by min, it makes sense that the summary and the output adapt to that scenario, so I don't need to implement a specific script to reprocess the output data (even though I understand such a script would be simple). This can be useful for quick experiments: if I want to understand the causes behind a value, I can quickly analyze the file that already contains all the minimum values.

In any case, I understand that this may introduce unnecessary complexity from an SRP perspective, and an intermediate option would be to remove the --aggregate-by parameter and always show all the data with the summary for the different values, as mentioned.

What do you think @elshize, @JMMackenzie?

@elshize (Member) commented Nov 22, 2025

I personally think it's unnecessary, but if you really want to aggregate only the summary, this needs to be named explicitly, in a way that leaves no doubt as to what it does. --aggregate-by is not good enough. --summarize-by, maybe? Not sure, but it must be explicit about what we're aggregating.

@JMMackenzie (Member) commented Nov 25, 2025

As for the max aggregation, I don't know if it is really useful (maybe to capture the worst cases?), but it may ultimately not be representative; I just included it because its implementation required no additional effort. If it has no real usefulness, I think it should be removed.

Maybe we should actually run a bunch of experiments and see if it actually matters? 😁

By the way, if we want the summary to aggregate, why not show them all?

Per-Query Minimums
--------------------------------
> Summary data here

Per-Query Medians
--------------------------------
> Summary data here

...

I guess the output would be verbose... I am usually happy to report something sensible as the default though, so we could have a median or mean summary by default, and then a --verbose-summary flag that prints them all?

@elshize (Member) commented Nov 27, 2025

By the way, if we want the summary to aggregate, why not show them all?

My understanding is that this is what @gustingonzalez suggests doing -- by default. I would say that if this is the default, then I see very little value in an additional filter to show only one.

I think we can probably come up with a succinct way of printing them out; per aggregate function, we only need mean and quantiles:

Encoding:           block_simdbp
Algorithm:          block_wand_max
Corrective reruns:  7
All runs:           [mean=1000, q50=1000 q90=2000, q99=3000]
Per query medians:  [mean=1000, q50=1000 q90=2000, q99=3000]
Per query means:    [mean=1000, q50=1000 q90=2000, q99=3000]
Per query minimums: [mean=1000, q50=1000 q90=2000, q99=3000]
...

Though this is compounded by the fact that we can define multiple algorithms. I believe this was the reason we initially printed JSON lines for the statistics, so that you could capture and parse them later. But this was before we had the --extract option, and we never really thought about it much after.

I believe that anything going to stderr shouldn't really be for parsing. This is because we also print other unstructured logs, such as "warming up" or "performing queries", etc. These should be logs that tell user what is happening, and the data should go to stdout.

If we want to print summaries to stderr, I think we should either always print them, or have two options: (1) print them, (2) don't print them (with one of them being the default and the other enabled with a flag). Having an additional filter for median, mean, minimum, etc., is just distracting. All it does is save a few lines of logs, but it adds significantly more complexity.

@gustingonzalez (Collaborator, Author) commented Dec 19, 2025

Hi guys, sorry, I've been a bit absent.

If we don't want to use stderr, then we should bring back the --extract parameter, to be used as follows: --extract extracted.csv. In this case, the JSON will always be printed to stdout, so it can easily be redirected to a file, while other outputs (for example, logs) would continue to be printed to stderr as they are now. What do you think?

On the other hand, I agree that the filters are unnecessary. We can simply output everything together and let users filter using an external tool. This keeps the tool simple, as @elshize mentioned.

@elshize (Member) commented Dec 25, 2025

@gustingonzalez What do you suggest extracting with the --extract option? Is it the data or the summaries?

@gustingonzalez (Collaborator, Author):

@elshize the idea of --extract is for it to be used for the data (the per-query timing details). In fact, this option is currently implemented in the main branch but was removed in this MR. The intention is to restore it, but making an output file mandatory (e.g., --extract extracted.csv).

Note that regardless of whether the --extract option is provided, the query execution logic will be the same: each query will be measured once per run. In the main branch, this behavior differs depending on whether the option is enabled.

@elshize (Member) commented Dec 25, 2025

Ideally, I would prefer to have the data go to stdout, but I'd be fine with a file, especially because this is how it works now. Maybe I would rename it --output (-o for short)?

Then, we print the summaries (in JSON) to stdout, and logging to stderr.

The full measurements use no aggregation: just the full results, multiple lines per query. The JSON summaries are printed for all supported aggregates (including none?), and can easily be extracted with something like ./queries ... | jq 'select(.agg == "min")'.

I think this would keep it reasonably simple and flexible at the same time.

Side note: I think --output can be optional. It makes no sense to define it if someone is only interested in the summaries for experimenting and whatnot.

Does the above sound good?

@gustingonzalez (Collaborator, Author):

I agree with the idea. I'll work on the changes.

Just one additional comment: I think using just --output might be a bit vague. Maybe a better option could be something like --query-times-output (I also considered --query-times-file, but it doesn't align well with the short option -o).

@elshize (Member) commented Dec 26, 2025

I personally think using --output is fine because it's the main output and because it's all going to be documented under --help. I think --query-times-output is overly verbose. If we want to be more precise, perhaps --data-output/-o?

@gustingonzalez (Collaborator, Author):

@elshize, got it!
One more question: do you think it makes sense to keep the aggregation concept, or should we switch to summary?

@elshize (Member) commented Dec 26, 2025

Because we want to print multiple summaries (for different agg functions), we'll need to name it somehow to print in JSON:

{"agg_per_query": "none", ...}
{"agg_per_query": "min", ...}

Not sure if there's maybe a better name for that, I'm certainly open for suggestions.

Summary is a different thing for me: the printed statistics are the summary, and if we always print it, then we don't need to label it, but we may need to use that term in code, docs, or CLI help.

@gustingonzalez (Collaborator, Author):

Hi guys, ready with the changes.

One thing that hadn't been taken into account is that more than one algorithm (query type) can be specified. Therefore, the changes now support specifying more than one output file (one for each query type specified).

The following is an example execution and its output:

./build/bin/queries --encoding block_interpolative --index /path/to/index.block_interpolative -k 10 --algorithm or:and --scorer bm25 -q /path/to/queries.txt -o data-or.csv:data-and.csv
[2025-12-26 18:28:03.714] [stderr] [info] Warming up posting lists...
[2025-12-26 18:28:03.809] [stderr] [info] Per-run query output will be saved to 'data-or.csv'.
[2025-12-26 18:28:03.809] [stderr] [info] Performing 3 runs for 'or' queries...
{
  "encoding": "block_interpolative",
  "algorithm": "or",
  "runs": 3,
  "k": 10,
  "safe": false,
  "corrective_reruns": 0,
  "query_aggregation": {
    "none": {"mean": 17183.4, "q50": 8942, "q90": 41413, "q95": 46643, "q99": 62398},
    "min": {"mean": 14845.6, "q50": 7693, "q90": 39044, "q95": 45186, "q99": 58988},
    "mean": {"mean": 17183.1, "q50": 9266, "q90": 39809, "q95": 46401, "q99": 63660},
    "median": {"mean": 18283.7, "q50": 9471, "q90": 42155, "q95": 46711, "q99": 63830},
    "max": {"mean": 18421, "q50": 9565, "q90": 42421, "q95": 46960, "q99": 63889}
  }
}
[2025-12-26 18:29:13.680] [stderr] [info] Per-run query output will be saved to 'data-and.csv'.
[2025-12-26 18:29:13.680] [stderr] [info] Performing 3 runs for 'and' queries...
{
  "encoding": "block_interpolative",
  "algorithm": "and",
  "runs": 3,
  "k": 10,
  "safe": false,
  "corrective_reruns": 0,
  "query_aggregation": {
    "none": {"mean": 6499.02, "q50": 1630, "q90": 20539, "q95": 31282, "q99": 46491},
    "min": {"mean": 4257.8, "q50": 1111, "q90": 14786, "q95": 18986, "q99": 28139},
    "mean": {"mean": 6498.68, "q50": 1703, "q90": 21923, "q95": 30309, "q99": 40328},
    "median": {"mean": 6898.12, "q50": 1768, "q90": 22582, "q95": 33420, "q99": 46509},
    "max": {"mean": 8341.13, "q50": 2181, "q90": 27297, "q95": 39730, "q99": 51367}
  }
}

Let me know if this is OK or if any changes are needed.

@elshize (Member) commented Dec 26, 2025

One thing that hadn't been taken into account is that more than one algorithm (query type) can be specified. Therefore, the changes now support specifying more than one output file (one for each query type specified).

Why not just have an "algorithm" column in the output file? I would rather avoid multiple output files. First, I would say it's no more convenient, if not less convenient, than having one; it's so easy to filter with your dataframe framework of choice, or whatever one uses for crunching data. Furthermore, now you have to worry about ensuring that the number of algorithms matches the number of output files, which is just a headache.

I would simply print a column header (are we printing one now or not?) and then the values. We can keep it TSV.
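
For illustration, the merged file could look something like this (column names are hypothetical; values would be tab-separated):

algorithm    query    run    time_us
or           q1       1      8942
and          q1       1      1630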

Regarding summaries, I think it's better to have something like this:

  "times": [
    {"query_aggregation": "none", "mean": 6499.02, "q50": 1630, "q90": 20539, "q95": 31282, "q99": 46491},
    {"query_aggregation": "min", "mean": 4257.8, "q50": 1111, "q90": 14786, "q95": 18986, "q99": 28139},
    {"query_aggregation": "mean", "mean": 6498.68, "q50": 1703, "q90": 21923, "q95": 30309, "q99": 40328},
    {"query_aggregation": "median", "mean": 6898.12, "q50": 1768, "q90": 22582, "q95": 33420, "q99": 46509},
    {"query_aggregation": "max", "mean": 8341.13, "q50": 2181, "q90": 27297, "q95": 39730, "q99": 51367}
  ]

I think the mapping query_aggregation->min->{} is not clear on what these objects actually contain, while times->min->{} is not clear about what "min" is.

@elshize (Member) left a review comment on this excerpt:

} else {
    std::sort(query_times.begin(), query_times.end());
    double avg =
// Print JSON summary

We should avoid formatting JSON by hand; we already have a library for that in our deps (#include <nlohmann/json.hpp>), which allows you to define JSON much like a map and then print it.
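
For reference, a minimal sketch of building and printing a summary with nlohmann::json (the field values here are made up):

#include <iostream>

#include <nlohmann/json.hpp>

int main() {
    nlohmann::json summary;
    summary["encoding"] = "block_simdbp";
    summary["algorithm"] = "block_max_wand";
    summary["runs"] = 3;
    summary["times"] = nlohmann::json::array();
    // Each initializer list of {key, value} pairs becomes a JSON object.
    summary["times"].push_back(
        {{"query_aggregation", "min"}, {"mean", 4257.8}, {"q50", 1111.0}});
    std::cout << summary.dump(2) << '\n';  // pretty-print with 2-space indent
}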

@elshize (Member) commented Dec 26, 2025

I left another comment on the code, but I'll need to come back to this later; just letting you know I have not gone through all of the code yet.

@gustingonzalez (Collaborator, Author):

@elshize, ready with the changes.

One observation is that the JSON is now printed in an unordered way. Although newer versions of <nlohmann/json.hpp> include an ordered_json type, I can't "easily" update the library because it is included via the https://github.com/pisa-engine/warcpp submodule. If you think it's necessary, we could consider updating it.

Below is an example of the current JSON output:

{
  "algorithm": "or",
  "corrective_reruns": 0,
  "encoding": "block_interpolative",
  "k": 10,
  "runs": 1,
  "safe": false,
  "times": [
    {
      "mean": 8665.901,
      "q50": 4529.0,
      "q90": 19855.0,
      "q95": 22181.0,
      "q99": 30635.0,
      "query_aggregation": "none"
    },
    {
      "mean": 8665.901,
      "q50": 4529.0,
      "q90": 19855.0,
      "q95": 22181.0,
      "q99": 30635.0,
      "query_aggregation": "min"
    },
    {
      "mean": 8665.901,
      "q50": 4529.0,
      "q90": 19855.0,
      "q95": 22181.0,
      "q99": 30635.0,
      "query_aggregation": "mean"
    },
    {
      "mean": 8665.901,
      "q50": 4529.0,
      "q90": 19855.0,
      "q95": 22181.0,
      "q99": 30635.0,
      "query_aggregation": "median"
    },
    {
      "mean": 8665.901,
      "q50": 4529.0,
      "q90": 19855.0,
      "q95": 22181.0,
      "q99": 30635.0,
      "query_aggregation": "max"
    }
  ]
}

@elshize (Member) commented Dec 27, 2025

It's a little unfortunate that we can't control the order, but on the other hand, I don't think it's that crucial, especially with pretty-printing. Also, I'm not sure if the API guarantees it, but it looks like it's not so much unordered as lexicographically ordered.

We might want to work on upgrading the dependency anyway, but I don't think it's necessary as part of this work. I think the JSON output you provided above is fine.


One other thing we have to consider is that we are now printing potentially multiple JSON objects, each spanning multiple lines, so the output is no longer in JSONL format. This may limit which out-of-the-box tools can parse it. I typically use jq for parsing my JSON outputs, which supports multiple multi-line JSON objects, but I'm not sure if, say, pandas.read_json does (I didn't check, but the docs make it sound like it needs to be line-by-line).

Ultimately, I don't think this is a big issue. You can always use queries ... | jq -c > summaries.jsonl, and I would really expect people to use the raw results if they intend to do any non-trivial processing.

That said, we could address this as well. I see two options off the top of my head.

One is to have a flag --pretty-print (or --no-pretty-print), but that's complicated by the fact that we have two outputs, and I would like to avoid very long, complex names if possible. Plus, as I mention above, this can easily be achieved by piping to jq -c, so it feels unnecessary. I know jq is a third-party tool, but it's ubiquitous: available in every Linux package manager I know of, and on brew for Mac.

The other approach would be to print a single JSON and put all summaries in an array:

{
  "summaries": [
    {
      "algorithm": "or",
      ...
    },
    {
      "algorithm": "and",
      ...
    }
  ]
}

But to be clear, I'm ok with leaving the output as is now.

@elshize (Member) left a review comment

Leaving some more comments, but still haven't gone through the entire PR.

@elshize (Member) left a review comment

Ok, finished going through it, left some comments.


A general note:

I would discourage the nesting-doll style, where you keep passing slightly modified parameters down and each next function moves the entire logic slightly forward. It's usually much clearer if we break our programs down into sub-programs and stick to separation of concerns.

For example, extracting times should typically have nothing to do with printing or summarizing them, so the extracting function should not get the output stream at all.

If we break things down, the pieces are typically easier to reason about. There are many reasons for that, including: the functions take fewer parameters, we are forced to return meaningful types, and understanding a function in isolation is much easier than understanding it globally. One particular thing we should strive for is to keep mutable state contained and, as much as we can, to have pure functions doing the complex logic, so we can deterministically predict what happens based on the parameters. Of course, benchmarks are not deterministic in the values they produce, but I'm talking about all the rest.

Note that this nested style of function is quite common in this code base, especially in legacy code, but we should fight it and break away from it as much as possible.
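
As a toy illustration of what I mean (all names hypothetical): the benchmark produces plain data, the summarizer is a pure function over that data, and only the top level touches the output streams:

#include <iostream>
#include <numeric>
#include <string>
#include <vector>

struct QueryTiming {
    std::string query_id;
    std::vector<double> run_times_us;
};

// Pure: timings in, statistic out; knows nothing about printing or streams.
double mean_of_means(std::vector<QueryTiming> const& results) {
    double total = 0.0;
    for (auto const& timing: results) {
        total += std::accumulate(timing.run_times_us.begin(), timing.run_times_us.end(), 0.0)
            / static_cast<double>(timing.run_times_us.size());
    }
    return total / static_cast<double>(results.size());
}

int main() {
    std::vector<QueryTiming> results{{"q1", {10.0, 12.0}}, {"q2", {20.0, 18.0}}};
    std::cerr << "benchmark finished\n";          // logs go to stderr
    std::cout << mean_of_means(results) << '\n';  // data/summary goes to stdout
}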

@gustingonzalez (Collaborator, Author):

Thank you for your review @elshize, I'll work on that.

@gustingonzalez (Collaborator, Author):

This is great work! Thanks so much for working on this.

One last thing I forgot to mention before (my bad), we also need to re-generate the docs as described here: https://github.com/pisa-engine/pisa/tree/main/docs

@elshize, done! However, Codacy is now reporting some issues regarding the autogenerated documentation.

@elshize (Member) commented Jan 30, 2026

Thanks, we can ignore those. I'll merge it.

@elshize elshize merged commit 2f09757 into pisa-engine:main Jan 30, 2026
6 of 7 checks passed