
Conversation

@garlick
Member

@garlick garlick commented Nov 6, 2025

Problem: we have a need in production to make job data persistent for longer than the 2 week purge window we typically use now. Short term, we'll try increasing the job purge window, but the KVS and the memory footprint of job-list are potential limits to scalability. Some way to persist job data beyond the KVS purge window could be useful, and a solution that allows current flux jobs queries to work seamlessly on historical and current data is preferred.

This is #5847 re-submitted with a newer vendored libsqlite3.

To recap, a job-sql module is added that consumes the job manager journal and dumps job data into a sqlite database. In this WIP PR, the database is in-memory, but a configuration option to make it persistent could be trivially added, along with the logic needed to pick up from the journal where it left off when the system is restarted.

As discussed in #5847, exposing SQL directly to guests is a non-starter, but either reworking job-list to use (this?) sql back end, or adding a job-list compatible RPC method within this module is one possible path forward.

This was posted primarily to get our feet wet with job queries using the JSON1 syntax, and some shortcomings with that approach were identified in #5847: for example, JSON1 doesn't know about hostlists, idsets, or eventlogs. However, maybe a combination of SQL query post-processing and expansion of the schema to include some broken-out fields could be used to get around those limitations.

Anyway posting this mainly for discussion. Since this initial work was already done, it's a potential fast path to...something.

Problem: the JSON1 extension to sqlite3 is not available
in RHEL 8 based distros.

JSON1 is enabled by default in version 3.38.0.
(or in earlier versions as an opt-in build option).

Ubuntu 22.04 ships 3.37.2.
RHEL 8 ships 3.26.0.

Pull in the sqlite3 amalgamated source for 3.51.0 from here:
https://sqlite.org/download.html

License: "public domain".  See https://sqlite.org/copyright.html
Problem: content-sqlite and job-archive use external libsqlite3
but we now have an internal copy of the library.

Drop configure requirement and change the build rules for those
broker modules to use the internal copy of libsqlite3.
Problem: the debian control file and scripts that install
dependencies pull in libsqlite3-dev packages.

Drop sqlite from those scripts.
Problem: docker test images pull in libsqlite3 but this
is no longer a build requirement for flux-core.

Update Dockerfiles.
Problem: libsqlite3 is included in coverage.

Add it to CODE_COVERAGE_IGNORE_PATTERN.
Problem: libsqlite3 is spell checked in CI

Add it to the .typos.toml extend-include list.
Problem: Flux doesn't have a raw SQL interface to job data that
can utilize the sqlite JSON1 extensions.

Add a service that consumes the job manager journal and populates
an in-memory sqlite database with all jobs (active and inactive).
The schema simply stores the jobid, eventlog, jobspec, and R.
The last three are kept in JSON format so sqlite JSON1 extensions
can be used to construct queries.

https://www.sqlite.org/json1.html
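
A minimal sketch of what such a schema might look like (column names and types are illustrative only, not taken from the patch):

CREATE TABLE IF NOT EXISTS jobs(
    id CHAR(16) PRIMARY KEY,  -- jobid
    eventlog JSON,            -- job eventlog, kept as JSON so JSON1 functions apply
    jobspec JSON,             -- submitted jobspec
    R JSON                    -- resource set
);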
Problem: the job-sql service has no command line client.

Add one.
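
For illustration, the client is invoked later in this thread as flux sql; a trivial (arbitrary) query would look like:

flux sql 'SELECT id FROM jobs'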
@codecov

codecov bot commented Nov 6, 2025

Codecov Report

❌ Patch coverage is 5.12821% with 37 lines in your changes missing coverage. Please review.
✅ Project coverage is 83.67%. Comparing base (9a718dd) to head (1e6cd11).
⚠️ Report is 1 commit behind head on master.

Files with missing lines Patch % Lines
src/cmd/builtin/sql.c 5.12% 37 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7191      +/-   ##
==========================================
- Coverage   83.71%   83.67%   -0.04%     
==========================================
  Files         553      554       +1     
  Lines       92270    92309      +39     
==========================================
+ Hits        77240    77244       +4     
- Misses      15030    15065      +35     
Files with missing lines Coverage Δ
src/modules/content-sqlite/content-sqlite.c 69.57% <ø> (ø)
src/cmd/builtin/sql.c 5.12% <5.12%> (ø)

... and 8 files with indirect coverage changes


@chu11
Member

chu11 commented Nov 6, 2025

Just a reminder, I did have an alternate solution, which was job-list putting all inactive jobs in a job-db .. and job-list looking up inactive entries from that db.

#4336

This work was never pushed hard b/c @ryanday36 eventually no longer needed it for his reporting purposes. I've been maintaining the PR for some time under the assumption of "we'll need this some day" ...

I should say the PR above did assume no access to sqlite json queries (done before consideration of vendoring sqlite, b/c older versions of sqlite don't have the json support).

There are downsides to that approach and positives to this approach. And the devil is in the details with the json queries.

@garlick
Member Author

garlick commented Nov 21, 2025

for example JSON1 doesn't know about hostlists, idsets, or eventlogs

See also:

https://sqlite.org/loadext.html
https://sqlpkg.org/

These are plugin extensions to sqlite3 and are how JSON1 was implemented.

(I think @trws may have been referring to this in a meeting a few weeks ago)
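
For reference, loadable extensions are pulled in at runtime, either from the sqlite3 shell or from SQL when extension loading is enabled; a hypothetical hostlist extension might be loaded like:

.load ./hostlist
-- or, equivalently, from SQL:
SELECT load_extension('./hostlist');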

@chu11
Member

chu11 commented Dec 5, 2025

I was playing around with the flux sql command just to learn the json query syntax. There was a lot of trial and error with minor subtleties that were not the most obvious (the sqlite documentation is quite thin on this; some stackoverflow and AI help was needed).

This is what I came up with to get the job id of every job with a specific user:

flux sql 'SELECT T1.id FROM jobs as T1, json_each(T1.eventlog) as T2 WHERE json_extract(T2.value, "$.context.userid") = 8556'

Although the devil is in the details, I think doing a giant AND of json_extract(T2.value, "$.context.userid") = 8556 style checks in the WHERE clause should give us something we can work with that's similar to my work in #4336.

What is very different is the FROM clause, FROM jobs as T1, json_each(T1.eventlog) as T2, as we effectively do a JOIN of the entire jobs table with every entry in (I believe) every eventlog. I'm no sql expert, but that seems like a very expensive operation.

Thought 1)

What if we did a database "mix" similar to what I did in job-db?

const char *sql_create_table = "CREATE TABLE if not exists jobs("
                               "  id CHAR(16) PRIMARY KEY,"
                               "  userid INT,"
                               "  name TEXT,"
                               "  queue TEXT,"
                               "  state INT,"
                               "  result INT,"
                               "  nodelist TEXT,"
                               "  t_submit REAL,"
                               "  t_depend REAL,"
                               "  t_run REAL,"
                               "  t_cleanup REAL,"
                               "  t_inactive REAL,"
                               "  jobdata JSON,"
                               "  eventlog TEXT,"
                               "  jobspec JSON,"
                               "  R JSON"
                               ");";

If we added columns for the most common things (I think t_inactive is critical), we could eliminate all but the most egregiously inefficient searches.
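
With columns like that, the userid query above would presumably reduce to something like (assuming the userid column is populated from the submit event):

flux sql 'SELECT id FROM jobs WHERE userid = 8556'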

Thought 2)

Could we lay out the eventlog smarter? One idea:

{
  events: [json array lists all events like now]
  submit: {object of submit event}
  release: {object of release event}
  ... etc
}

Instead of an array of all events, have an object that holds the array of events, but have keys that store a handful of the most important events, so lookups can be easy and we avoid the JOIN.
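
For example, with a keyed layout like that, the userid lookup could skip json_each() entirely (key names follow the sketch above; untested):

flux sql 'SELECT id FROM jobs WHERE json_extract(eventlog, "$.submit.context.userid") = 8556'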

Edit: Oh yeah, this may be critical. Some events can be duplicated (priority comes to mind), so we need the last entry in the array. This could make things hairy (the internet says to use ORDER BY and LIMIT clauses, but that may get hard/tricky if we're combining multiple json_extract statements). Some of the solutions above would resolve this by only having the latest entry be the one we check against.
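
For what it's worth, the "last entry" of a given event type can be selected using json_each()'s array index (its key column) with ORDER BY/LIMIT, along these lines (untested sketch); combining several of these in one query is where it gets hairy:

SELECT json_extract(T2.value, '$.context.priority')
FROM jobs AS T1, json_each(T1.eventlog) AS T2
WHERE T1.id = :jobid AND json_extract(T2.value, '$.name') = 'priority'
ORDER BY T2.key DESC
LIMIT 1;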

Just some initial thoughts. I haven't prototyped anything deeply.

@chu11
Member

chu11 commented Dec 5, 2025

Ok, I just realized something. The job-sql db as currently constructed is probably a deal breaker b/c the eventlog is in an array. Using my example from above:

flux sql 'SELECT T1.id FROM jobs as T1, json_each(T1.eventlog) as T2 WHERE json_extract(T2.value, "$.context.userid") = 8556'

The join (jobs as T1, json_each(T1.eventlog) as T2) means that only a single eventlog entry is compared per row. That means if we wanted to do (in pseudocode) T2.value.name = submit AND T2.value.context.userid = 8000 AND T2.value.name = finish AND T2.value.context.status = 0 (i.e. the job belongs to this user and finished without error), this cannot be done.

I'm obviously not the most expert sqlite person out there, but AFAICT, there is no way to "simply" do this with any of the sqlite json functions available. The most logical way to do this (using my example above) would be to do a JOIN between jobs, "submit events", and "finish events", which I think would involve doing nested SQL queries within the main SQL query.

Add in my comment above about ensuring we get the last entry within an eventlog array (which may also involve nested sql queries), and I'm not sure this is a path we want to go down.

Having a slightly duplicated eventlog format or a database format may be a necessity.

Edit: I suppose I did not include the possibility of a sqlite extension. We could hypothetically add an eventlog_query kind of sql function. But I'm not sure that's the path we want to go down. We want extensions to improve performance or add some preferred feature. Writing an extension to "make this workable" probably isn't something we want to do.

@trws
Member

trws commented Dec 5, 2025

Ouch, yeah, that sounds like it would be really expensive. We could probably use the json support as a way to extract things into a table or tables, virtual or otherwise, to at least make the transformation machined if that would help. Not sure what that would look like off the top of my head, but it might be a relatively low cost way to have the other view/format.

@chu11
Member

chu11 commented Dec 5, 2025

We could probably use the json support as a way to extract things into a table or tables, virtual or otherwise, to at least make the transformation machined if that would help. Not sure what that would look like off the top of my head

Yeah, that's what I was referring to with my "nested SQL queries". Like (in pseudo-query language)

SELECT *
FROM jobs, (SELECT * FROM eventlog WHERE name = "submit"), (SELECT * FROM eventlog WHERE name = "release")
WHERE submit.context.userid = 1234 AND release.status = 0
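
For concreteness, the submit + finish example from the earlier comment might be written with subqueries roughly like this (untested sketch):

SELECT j.id
FROM jobs AS j
JOIN (SELECT T1.id AS id, T2.value AS ev
        FROM jobs AS T1, json_each(T1.eventlog) AS T2
       WHERE json_extract(T2.value, '$.name') = 'submit') AS s ON s.id = j.id
JOIN (SELECT T1.id AS id, T2.value AS ev
        FROM jobs AS T1, json_each(T1.eventlog) AS T2
       WHERE json_extract(T2.value, '$.name') = 'finish') AS f ON f.id = j.id
WHERE json_extract(s.ev, '$.context.userid') = 8000
  AND json_extract(f.ev, '$.context.status') = 0;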

@trws
Member

trws commented Dec 5, 2025

Fair, I usually think in terms of views or constructed tables for things like this more than subqueries because of the performance or composability differences, but they're effectively equivalent. One thing that comes to mind: would the json_tree builtin work better for this at all? That would give us a way to directly select over the full hierarchy. That said, I like your idea 1, though I'd be curious to see if we could have an SQL function or similar that would just pull the json from a json virtual table to populate a table or tables like that.
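
As one possible shape for that extraction, a plain INSERT ... SELECT over json_each() could populate a broken-out events table without a custom SQL function (table and column names hypothetical; untested):

CREATE TABLE IF NOT EXISTS events(
    jobid CHAR(16),
    name TEXT,
    timestamp REAL,
    context JSON
);
INSERT INTO events(jobid, name, timestamp, context)
SELECT T1.id,
       json_extract(T2.value, '$.name'),
       json_extract(T2.value, '$.timestamp'),
       json_extract(T2.value, '$.context')
FROM jobs AS T1, json_each(T1.eventlog) AS T2;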

@chu11
Member

chu11 commented Dec 5, 2025

would the json_tree builtin work better for this at all? That would give us a way to directly select over the full hierarchy.

I didn't try json_tree(); I didn't see how recursively going down the eventlog objects would be useful. That doesn't mean there isn't a subtle advantage here that I'm missing.

@trws
Member

trws commented Dec 5, 2025

It depends on what the query is. For some things it would likely make it harder because you now need to aggregate things across rows; for others it could be easier because you can get all the entries where the field you care about matches, then join back on the table with the parent and ID keys to get all the meaningful entries. If the performance of the decomposition is reasonable, I could see that being faster than having to separately run json_extract too, but that's very much implementation dependent.
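
For illustration, json_tree() exposes every node in the hierarchy along with id, parent, and fullkey columns, so a match anywhere in an eventlog can be found without spelling out the full path (untested sketch):

SELECT T1.id, T2.fullkey
FROM jobs AS T1, json_tree(T1.eventlog) AS T2
WHERE T2.key = 'userid' AND T2.atom = 8556;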

@chu11
Member

chu11 commented Dec 6, 2025

A random thought I had tonight, regarding my idea 1 and idea 2 that I listed above.

Idea 1, something like:

const char *sql_create_table = "CREATE TABLE if not exists jobs("
                               "  id CHAR(16) PRIMARY KEY,"
                               "  userid INT,"
                               "  name TEXT,"
                               "  queue TEXT,"
                               "  state INT,"
                               "  result INT,"
                               "  nodelist TEXT,"
                               "  t_submit REAL,"
                               "  t_depend REAL,"
                               "  t_run REAL,"
                               "  t_cleanup REAL,"
                               "  t_inactive REAL,"
                               "  jobdata JSON,"
                               "  eventlog JSON,"
                               "  jobspec JSON,"
                               "  R JSON"
                               ");";

idea 2:

store events array like:

{
  events: [json array lists all events like now]
  submit: {object of submit event}
  release: {object of release event}
  ... etc
}

I thought, why not do both?

Idea 1 part - it will optimize the known most common queries now, and presumably most of the common ones in the future.

Idea 2 part - flexibility for the future: for example, as new event types are added to the eventlog, we have a mechanism that serves us going forward.

@trws
Member

trws commented Dec 6, 2025 via email

@grondo
Contributor

grondo commented Dec 6, 2025

Kinda orthogonal, and I’m not sure how big these normally get, but if we’re considering reworking the log, is it worth considering making it possible to incrementally parse it, or even just the events array maybe? Specifically we could do a JSONL style array so it doesn’t have to be parsed, serialized or even sent as one immense object.

Are you talking about the KVS eventlogs (see RFC 18)? They are already essentially in JSONL format IIUC: each eventlog entry is a valid JSON object, and entries are separated by newlines.
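
For reference, an RFC 18 eventlog is a sequence of newline-delimited JSON objects, roughly like the following (values and context fields illustrative):

{"timestamp":1762400000.0,"name":"submit","context":{"userid":8556,"urgency":16,"flags":0}}
{"timestamp":1762400001.2,"name":"priority","context":{"priority":16}}
{"timestamp":1762400002.5,"name":"alloc"}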

Edit: just realized you were probably talking about something specific to this implementation, just ignore me if so 🤦

@trws
Member

trws commented Dec 6, 2025

Honestly, I just didn't know that; it makes me think we might want to expand on that rather than going from that to a big object, though. We've already seen some scaling issues with large json objects in other places, and JGF is likely to get that treatment at some point, so we may as well keep it that way here. It's still a good idea to be able to get the events efficiently by type, but keeping it incremental would be good.

@chu11
Member

chu11 commented Dec 8, 2025

I like the idea. Are you thinking the events in the events array would be complete, or that they would be references out like [,]/[“submit”,5]?

I imagined them being complete. Minimally, b/c there can be multiples of some events (i.e. multiple "priority" events), the array of events is the "source of truth" for full history. Each "keyed" event is the most recent one of its type.

Edit: Sorry, I think I read your comment backwards. The "keyed" events would point to the array, not the other way around.

I initially imagined both being "complete": the array being complete for "full history", and the "keyed" events being complete so that json queries would be simpler (i.e. accessible through json_extract(), with no need to call json_each/tree()).
