Member

@chu11 chu11 commented Apr 27, 2023

This is a super early WIP, where I'm just building up the matching library to use within job-list, so it isn't used at all for actual queries yet.

The main reason I'm posting is that I wanted to highlight these functions in match.[ch] and see if anyone sees a problem with them, or if they're overly dumb for what I want to do.

/*  Identical to list_constraint_create() but only cares about
 *  "states" operation.  Effectively tests the 'states' constraint
 *  operation and conditionals with 'states', everything else is
 *  considered true all of the time.
 */
struct list_constraint *state_constraint_create (json_t *constraint,
                                                 flux_error_t *errp);

/* determines if a job in 'state' could potentially return true with
 * the given constraint.
 */
bool state_match (flux_job_state_t state, struct list_constraint *constraint);

Basically, given some constraint that is passed in, I'd like to know if it is possible / impossible for any jobs with a specific job state to match it. e.g. the constraint states:pending userid:42 queue:batch returns true if the job state is pending and it doesn't matter what the queue/userid are. There could be no jobs for userid 42 in job-list at all and it will return true.

This is so we can skip iterating and filtering on possibly long lists of pending/running/inactive jobs. (Note: I also need to add an equivalent of these for since matching, so that we can stop iterating on the inactive list early.)

To implement this, this special constraint treats everything that isn't a states constraint as true all of the time. So we are only testing just the states constraint and nothing else. Some special handling of conditionals has to be done too. e.g. userid:42 and -userid:42 both need to return true, because we don't care about userid:42.

Here's the core code as an example of what I'm doing in this function (removing some error checking to make this example simpler).

    json_object_foreach (constraint, op, values) {
        if (streq (op, "userid")
            || streq (op, "name")
            || streq (op, "queue")
            || streq (op, "results")
            || streq (op, "since"))
            return list_constraint_new (errp); // this constraint always returns true
        else if (streq (op, "states"))
            return create_states_constraint (values, errp);
        else if (streq (op, "or") || streq (op, "and") || streq (op, "not"))
            return conditional_constraint (op,
                                           match_or_checkempty,
                                           match_not_checkempty,
                                           state_constraint_create,
                                           values,
                                           errp);
    }

As a complete aside, since this might come up as a question: "instead of having a new function state_constraint_create(), could you just use the normal constraint from before?" The answer is yes. That was my round 1 prototype, but I found the code to be too complex / irritating, so I shifted to this one.

@chu11 chu11 force-pushed the issue4914_job_list_constraints branch 2 times, most recently from e9919eb to 03ac597 Compare April 27, 2023 19:36
Member Author

chu11 commented Apr 28, 2023

/*  Identical to list_constraint_create() but only cares about
 *  "states" operation.  Effectively tests the 'states' constraint
 *  operation and conditionals with 'states', everything else is
 *  considered true all of the time.
*/

It doesn't look like my simplistic approach will work, which can be illustrated by this simple example:

states=running | userid=42

In my state_match() function, this constraint should return true for pending, running, and inactive job states, b/c the userid=42 is the only thing that matters, i.e. a job with userid=42 can be on the pending, running, or inactive lists. So we have to scan all the lists.

not (states=running | userid=42)

This should return true for the pending and inactive job states, as it's impossible for a running job (regardless of userid) to pass this constraint match.

I'm going to give it the good ol' college try on another technique, which I think will basically involve a third potential state of "maybe" (a job in a certain state could pass), but I shouldn't spend forever on this. Perhaps an interim solution could be to support job constraints on all the non-job-state stuff only (and, given the discussion in #4914, maybe since too).
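
One way to picture the "maybe" idea is three-valued logic. The following is only a minimal Python sketch under stated assumptions, not the code in this PR: a constraint is a dict with a single operator key, job states are plain strings rather than bitmasks, "not" negates the AND of its operands, and the names Tri, evaluate, and state_could_match are hypothetical.

```python
from enum import Enum


class Tri(Enum):
    FALSE = 0
    MAYBE = 1   # depends on something other than the job state
    TRUE = 2


def tri_and(values):
    values = list(values)
    if any(v is Tri.FALSE for v in values):
        return Tri.FALSE
    if all(v is Tri.TRUE for v in values):
        return Tri.TRUE
    return Tri.MAYBE


def tri_or(values):
    values = list(values)
    if any(v is Tri.TRUE for v in values):
        return Tri.TRUE
    if all(v is Tri.FALSE for v in values):
        return Tri.FALSE
    return Tri.MAYBE


def tri_not(value):
    if value is Tri.TRUE:
        return Tri.FALSE
    if value is Tri.FALSE:
        return Tri.TRUE
    return Tri.MAYBE  # negating "unknown" is still "unknown"


def evaluate(constraint, state):
    """Evaluate a single-operator constraint dict for a job in `state`."""
    (op, values), = constraint.items()
    if op == "states":
        return Tri.TRUE if state in values else Tri.FALSE
    if op == "and":
        return tri_and(evaluate(c, state) for c in values)
    if op == "or":
        return tri_or(evaluate(c, state) for c in values)
    if op == "not":
        # Assumption: "not" negates the AND of its operands.
        return tri_not(tri_and(evaluate(c, state) for c in values))
    # userid, name, queue, ...: cannot be decided from the state alone.
    return Tri.MAYBE


def state_could_match(constraint, state):
    """A list must be scanned unless a match is provably impossible."""
    return evaluate(constraint, state) is not Tri.FALSE


STATES = ["pending", "running", "inactive"]
either = {"or": [{"states": ["running"]}, {"userid": [42]}]}
neither = {"not": [either]}
print([s for s in STATES if state_could_match(either, s)])
print([s for s in STATES if state_could_match(neither, s)])
```

With these rules the first constraint requires scanning all three lists, while the negated one rules out only the running list, which matches the two examples above.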

@chu11 chu11 force-pushed the issue4914_job_list_constraints branch from 03ac597 to 39fe136 Compare May 4, 2023 18:04
@chu11 chu11 changed the title WIP: job-list: support job list constraints job-list: support job list constraints May 4, 2023
Member Author

chu11 commented May 4, 2023

Re-pushed with a first round implementation of constraint based filtering in job-list. I effectively implement all the prior filter options in constraint logic. There are only minor enhancements to this logic, e.g. you can now specify multiple userids (i.e. userid:100,101,102), which was not possible before.
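
To make the multiple-userid case concrete, here is a rough sketch of what a request payload might look like when built from Python. The max_entries and attrs fields appear elsewhere in this thread; the "constraint" key name and the build_list_payload helper are assumptions for illustration only.

```python
import json


def build_list_payload(userids, max_entries=0, attrs=("all",)):
    """Build a job-list style payload with a multi-valued userid constraint.

    Hypothetical helper: the "constraint" key name is an assumption.
    """
    constraint = {"userid": list(userids)}
    return {
        "max_entries": max_entries,
        "attrs": list(attrs),
        "constraint": constraint,
    }


payload = build_list_payload([100, 101, 102], max_entries=5)
print(json.dumps(payload))
```

The point is only that a single userid operator carries a list of values, so no extra "or" wrapper is needed to ask for several users.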

In follow up work, this will be enhanced:

  $jq -j -c -n  "{max_entries:1, userid:${id}, states:0, results:0, attrs:[\"all\"]}" \
    | $RPC job-list.list | jq ".jobs[0]" > all_success.out &&

It initially shocked me that this worked, until I realized that this basically told job-list.list to return all jobs, because it didn't specify a constraint object. Alternatively, a constraint object could be required by job-list.list instead of it being optional.

A side note:

--- a/src/bindings/python/flux/job/list.py
+++ b/src/bindings/python/flux/job/list.py
@@ -56,8 +56,6 @@ def job_list(
     if since:
         constraint["and"].append({ "since": [ since ] })
     if states and results:
-        if states & flux.constants.FLUX_JOB_STATE_INACTIVE:
-            states &= (~flux.constants.FLUX_JOB_STATE_INACTIVE)
         tmp = { "or": [] }
         tmp["or"].append({ "states": [ states ] })
         tmp["or"].append({ "results": [ results ] })

@chu11 chu11 force-pushed the issue4914_job_list_constraints branch 4 times, most recently from b3a96ca to d254c88 Compare May 9, 2023 05:31
@chu11 chu11 force-pushed the issue4914_job_list_constraints branch from d254c88 to 4d4f3e8 Compare May 12, 2023 19:05
Member Author

chu11 commented May 12, 2023

re-pushed with a small refactor, a "oh, that's still programmed that way because of ... before." Edit: oops and one more anal retentive tweak

@chu11 chu11 force-pushed the issue4914_job_list_constraints branch from d5b2de0 to c1bb5f0 Compare May 13, 2023 19:57
Contributor

@grondo grondo left a comment

Just a first pass through.. I'd like to do some further testing on another pass.

One general note: I would have perhaps kept the since parameter separate in the protocol. The reason for that is it was meant as a way to stop processing the job lists before you get all the way through, and it isn't clear that the constraint based solution allows that.

Also, eventually I would think since would be implemented in the protocol as a time comparison, i.e. something like t_inactive >= timestamp. Eh, this all seems fine actually, I just thought I'd bring it up for further discussion.
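
The early-termination concern can be illustrated with a small sketch. It assumes the inactive list is ordered by t_inactive with the most recent job first, and that a job is kept when t_inactive >= since; scan_inactive and the job dict layout are hypothetical, not taken from job-list.

```python
def scan_inactive(inactive_jobs, since, matches, max_entries=0):
    """Collect matching inactive jobs, stopping at the first job older than `since`.

    Assumes `inactive_jobs` is ordered by t_inactive, most recent first.
    """
    results = []
    for job in inactive_jobs:
        if job["t_inactive"] < since:
            break  # everything after this point is older still
        if matches(job):
            results.append(job)
            if max_entries and len(results) >= max_entries:
                break
    return results


jobs = [{"id": i, "t_inactive": t} for i, t in enumerate([50.0, 40.0, 30.0, 20.0, 10.0])]
recent = scan_inactive(jobs, since=30.0, matches=lambda job: True)
print([job["id"] for job in recent])
```

A since expressed purely as a constraint would give the same answer, but the matcher would have to be applied to every job on the list; a separate parameter lets the loop break as soon as the ordering guarantees nothing else can qualify.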

{ FLUX_JOB_STATE_RUN, "RUN", "run", "R", "r" },
{ FLUX_JOB_STATE_CLEANUP, "CLEANUP", "cleanup", "C", "c" },
{ FLUX_JOB_STATE_INACTIVE, "INACTIVE", "inactive", "I", "i" },
{ FLUX_JOB_STATE_PENDING, "PENDING", "pending", "PE", "pe" },
Contributor

Probably PD is a much more common abbreviation for PENDING, or maybe there was a motive for choosing only the first two letters that isn't mentioned here?

Member Author

no particular reason, just picked it somewhat at random :-) PD is good.

{ FLUX_JOB_STATE_INACTIVE, "INACTIVE", "inactive", "I", "i" },
{ FLUX_JOB_STATE_PENDING, "PENDING", "pending", "PE", "pe" },
{ FLUX_JOB_STATE_RUNNING, "RUNNING", "running", "RU", "ru" },
{ FLUX_JOB_STATE_ACTIVE, "ACTIVE", "active", "AC", "ac" },
Contributor

Why not A for active to match I for inactive? (Perhaps the idea is that virtual states are always two letters, but this isn't mentioned in the commit message or comments so it may cause confusion why ACTIVE is abbreviated differently than INACTIVE)

Member Author

A sounds good.
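
As a rough illustration of how virtual states could sit in the same lookup table as real ones, here is a Python sketch. The bit values, the composition of the virtual states (pending = depend, priority, sched; running = run, cleanup; active = pending plus running), and the helper names state_from_str and state_to_str are all assumptions; the abbreviations follow the diff under review with PD and A as agreed above.

```python
# Illustrative bit values only (assumed, not copied from the flux-core headers).
DEPEND, PRIORITY, SCHED, RUN, CLEANUP, INACTIVE = (1 << i for i in range(1, 7))

# Virtual states are unions of real states (composition assumed).
PENDING = DEPEND | PRIORITY | SCHED
RUNNING = RUN | CLEANUP
ACTIVE = PENDING | RUNNING

# (mask, long upper, long lower, short upper, short lower)
STATE_TABLE = [
    (RUN, "RUN", "run", "R", "r"),
    (CLEANUP, "CLEANUP", "cleanup", "C", "c"),
    (INACTIVE, "INACTIVE", "inactive", "I", "i"),
    (PENDING, "PENDING", "pending", "PD", "pd"),
    (RUNNING, "RUNNING", "running", "RU", "ru"),
    (ACTIVE, "ACTIVE", "active", "A", "a"),
]

_FORMAT_COLUMN = {"L": 1, "l": 2, "S": 3, "s": 4}


def state_from_str(text):
    """Return the state mask for any of the four spellings of a state."""
    for mask, *names in STATE_TABLE:
        if text in names:
            return mask
    raise ValueError(f"unknown job state: {text!r}")


def state_to_str(mask, fmt="L"):
    """Return the name for a mask; fmt selects long/short and upper/lower."""
    column = _FORMAT_COLUMN[fmt]
    for row in STATE_TABLE:
        if row[0] == mask:
            return row[column]
    raise ValueError(f"no name for state mask: {mask:#x}")


print(state_to_str(state_from_str("pd"), "L"))
```

Keeping real and virtual states in one table means both directions of the conversion share the same spellings, so an abbreviation change like PE to PD is a one-line edit.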

idsync.c \
stats.h \
stats.c
stats.c \
Contributor

Commit message: "add an intially job matching library" -> "add an initial job matching library" ?

"results", 0,
"attrs", o))) {
"attrs", o,
"constraints", c))) {
Contributor

It looks like there's a mix of constraint vs constraints throughout this file. What is the difference? It wasn't obvious reading the diff anyway...

Member Author

@chu11 chu11 May 19, 2023

That seems to just be an inconsistency I picked somewhat at random. I'll correct.

Member Author

whew, good catch, at first I was like "how do the tests even work?" only to realize "oh yeah, we deprecated these functions", so I guess it's not really tested at all.

Member Author

chu11 commented May 19, 2023

Also, eventually I would think since would be implemented in the protocol as a time comparison, i.e. something like t_inactive >= timestamp. Eh, this all seems fine actually, I just thought I'd bring it up for further discussion.

Yeah, perhaps I should fold in the timestamp support into this PR. I had originally planned for it to be separate, but given what I've worked on with it so far, it's not as big a deal as I thought it would be.

Contributor

grondo commented May 19, 2023

Actually, might be nice to keep it separate, release-notes wise. It makes sense that you'll eventually add support for timestamps, in which case since support could be implemented in terms of that. (If that's easiest for you)

Edit: I was thinking one use case for since was to efficiently grab the jobs that changed since the last time a tool queried job-list, but I don't think this really works (nor does it seem as simple as I was thinking), so maybe a design for something like that should eventually be separately considered.

Member Author

chu11 commented May 19, 2023

ya know, maybe since should be kept separate outside of constraints. We have a specific use case for it (job archiving) that does improve performance with it. It could be kept even after timestamp support is added. Will think about it.

Contributor

grondo commented May 19, 2023

ya know, maybe since should be kept separate outside of constraints.

Yeah, that's what I was trying to suggest above :-)

@chu11 chu11 force-pushed the issue4914_job_list_constraints branch from c1bb5f0 to 51a68db Compare May 23, 2023 20:21
Member Author

chu11 commented May 23, 2023

Re-pushed with the following changes:

  • tweaked issues per comments above (notably virtual state abbreviations)
  • re-added the "since" job-list RPC parameter, we will keep that one outside of constraints
    • and removed "since" constraint support
  • added timestamp job-list matching constraints, timestamp being t_run, t_inactive, etc.
    • in hindsight, this didn't have to be added in this PR, as the re-addition of the since parameter keeps that filtering the same as before. But since I programmed it, it's there :-)

Contributor

@grondo grondo left a comment

Sorry, I keep going through to review this and end up getting interrupted. Since it is working, my inclination is just get this in and we can deal with any unforeseen issues as we encounter them (though it doesn't appear there will be any)

Before that, I have one issue with the timestamp constraints (noted inline). I'm not sure how I feel about this one so I'm on the fence, but thought maybe some discussion would be beneficial.

type,
MATCH_LESS_THAN,
errp);
else /* if no operation specified >= is default */
Contributor

I realize there isn't much point in supporting "=" for timestamps, but it feels weird that >= is the default here.
Maybe just support '=' and have that be the default, knowing that in most real use cases a bare timestamp value or = would never actually be used?

Member Author

@chu11 chu11 Jun 16, 2023

Yeah, I picked >= since I figured it was the "best" of all the choices.

Defaulting to = does seem like an interesting idea and would make things consistent with everything else in the constraints. However, I guess my concern would be that if someone wrote a constraint like {"t_inactive": [ "NUMBER" ]} (no operator), this constraint would basically never return anything, and a user could be confused. I dunno if that's better or worse.

I almost feel that just requiring ">", "<", ">=", or "<=" would be better. What do you think of just making it a requirement?

Contributor

I could go either way. I like making it a requirement better than having a default of >=. I don't personally think users are going to be constructing these constraints by hand, so I don't think allowing {"t_inactive": [ "SOMETHING" ]} would be confusing per se (but it would be difficult to test).

@chu11 chu11 force-pushed the issue4914_job_list_constraints branch 2 times, most recently from ff3de31 to 06269c9 Compare June 16, 2023 23:46
Member Author

chu11 commented Jun 16, 2023

rebased and re-pushed an update in which all of the timestamps now require a comparison operator. A minor fallout of this is that all timestamp values must be strings now, i.e. ">=1234.5", b/c we always require the comparison operator. I doubt that's a big deal (it's no big deal in at least python). I think it does make the comparison more clear. Tests updated as a result.
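
For illustration, parsing such a value might look roughly like the Python sketch below. It is an assumption-laden sketch rather than the C code in this PR: parse_timestamp_value is a hypothetical name, and the only behavior taken from the discussion is that one of ">", "<", ">=", or "<=" is required and that the value is a string.

```python
import operator

# Longest prefixes first so ">=" is not mistaken for ">".
_COMPARISONS = [
    (">=", operator.ge),
    ("<=", operator.le),
    (">", operator.gt),
    ("<", operator.lt),
]


def parse_timestamp_value(value):
    """Turn a string like ">=1234.5" into a predicate over a job timestamp."""
    if not isinstance(value, str):
        raise ValueError("timestamp value must be a string with a comparison operator")
    for prefix, compare in _COMPARISONS:
        if value.startswith(prefix):
            threshold = float(value[len(prefix):])  # raises ValueError if malformed
            return lambda timestamp: compare(timestamp, threshold)
    raise ValueError(f"missing comparison operator in {value!r}")


became_inactive_after = parse_timestamp_value(">=1234.5")
print(became_inactive_after(2000.0), became_inactive_after(1000.0))
```

Rejecting a bare number outright avoids having to pick a default operator, which was the sticking point in the review discussion.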

also we should probably merge

flux-framework/rfc#377
and
#5137

first. It's mostly a technicality, since I do something in this PR that the RFC clarifies.

@chu11 chu11 force-pushed the issue4914_job_list_constraints branch from 06269c9 to bc7ba9d Compare July 6, 2023 05:16
@chu11 chu11 modified the milestones: flux-core v0.52.0, flux-core v0.53.0 Jul 7, 2023
Member

@garlick garlick left a comment

One general question: Is there a benefit in converting the RFC 31 JSON constraint object to struct list_constraint as opposed to just leaving it as a JSON object and using jansson methods to manipulate it? A large fraction of the matching library seems to be constructors, destructors, and conversion.

Member Author

chu11 commented Jul 12, 2023

One general question: Is there a benefit in converting the RFC 31 JSON constraint object to struct list_constraint as opposed to just leaving it as a JSON object and using jansson methods to manipulate it? A large fraction of the matching library seems to be constructors, destructors, and conversion.

I was mostly following the design pattern used in librlist. Given the current matching that is supported, the benefit is perhaps a tad limited, given it's just a bunch of string compares and whatnot. However, later on there will be things like hostlist filtering, where we will need to convert possible hostlist ranges and things like that into their appropriate internal data structures.
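
The trade-off can be sketched in Python as "convert once, match many times". This is only an illustration of the general pattern, not the structure of struct list_constraint: compile_constraint is a hypothetical name, jobs are plain dicts, and "not" is assumed to negate the AND of its operands.

```python
def compile_constraint(constraint):
    """Convert a constraint dict into a reusable predicate over job dicts."""
    if not constraint:
        return lambda job: True  # assumption: an empty constraint matches everything
    (op, values), = constraint.items()
    if op in ("userid", "name", "queue"):
        allowed = frozenset(values)  # conversion happens once, not per job
        return lambda job: job.get(op) in allowed
    if op in ("and", "or", "not"):
        subs = [compile_constraint(c) for c in values]
        if op == "and":
            return lambda job: all(match(job) for match in subs)
        if op == "or":
            return lambda job: any(match(job) for match in subs)
        # Assumption: "not" negates the AND of its operands.
        return lambda job: not all(match(job) for match in subs)
    raise ValueError(f"unknown constraint operator: {op!r}")


match = compile_constraint(
    {"and": [{"userid": [42]}, {"not": [{"queue": ["debug"]}]}]}
)
print(match({"userid": 42, "queue": "batch"}))
```

For simple equality tests the gain is small, as noted above, but the same shape leaves room for operators whose values need real conversion work before they can be compared, such as expanding a host range.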

chu11 added 5 commits July 23, 2023 08:24

Problem: It would be convenient if virtual job states of "pending",
"running", and "active" could be parsed and handled by flux_job_statetostr()
and flux_job_strtostate().

Solution: Add virtual states to the table of states that can be parsed
and handled by flux_job_statetostr() and flux_job_strtostate().  Update unit
tests for coverage as well.

Problem: In the future we would like to use RFC31 constraints
to filter jobs rather than the current implementation.  It would
allow users to write far more expressive job filtering queries than
can currently be done.

Add job matching library to parse and match jobs against
constraint objects. Add unit tests.

Problem: In job-list we store jobs on three lists, pending,
running, and inactive.  If we begin to use RFC31 constraints
to filter jobs, it would be inefficient to scan all three lists
for every job list query.

Solution: Add a job state matching library.  It will determine if
jobs in the pending, running, or inactive state are possible given
the job constraint sent with the job-list query.  For example:

states=pending AND userid=42

In the above constraint, a job in a pending state can possibly be
matched to this constraint, so it is worthwhile to scan the pending
job list for jobs with userid=42.  However, it is impossible for
a job in running or inactive state to match at all, so they should
not be scanned.

This will allow job-list queries to run more efficiently if entire
lists can be skipped if it is impossible for any jobs on those lists
to be matched to a constraint.

Add unit tests.

Problem: Using RFC31 constraints to match jobs would allow us to
support many new filtering and query opportunities in job-list.

Solution: Convert job-list queries to use constraints for filtering
instead of the earlier solution.  This change breaks the old filtering
RPC protocol.  The "userid", "states", "results", "name", and "queue"
RPC fields are no longer supported.

Update callers in libjob, flux-job, flux-top, job-archive, python
JobList and in the testsuite.

Problem: Some additional job-list constraint tests would be
useful.

Add more tests in t2260-job-list.t.
Contributor

@grondo grondo left a comment

Just went through this again and did some extra testing and LGTM!

Member Author

chu11 commented Jul 25, 2023

awesome, thanks! I know it was a big PR.

@chu11 chu11 force-pushed the issue4914_job_list_constraints branch from bc7ba9d to 589a8f2 Compare July 25, 2023 16:11
Member Author

chu11 commented Jul 25, 2023

setting MWP


codecov bot commented Jul 25, 2023

Codecov Report

Merging #5126 (d5b2de0) into master (46ecc54) will decrease coverage by 0.60%.
The diff coverage is 90.84%.

❗ Current head d5b2de0 differs from pull request most recent head 589a8f2. Consider uploading reports for the commit 589a8f2 to get more accurate results

@@            Coverage Diff             @@
##           master    #5126      +/-   ##
==========================================
- Coverage   83.74%   83.15%   -0.60%     
==========================================
  Files         460      456       -4     
  Lines       77017    78246    +1229     
==========================================
+ Hits        64501    65066     +565     
- Misses      12516    13180     +664     

Files Changed                            Coverage Δ
src/cmd/top/joblist_pane.c               90.77% <ø> (-0.09%) ⬇️
src/common/libjob/state.c                100.00% <ø> (ø)
src/modules/job-archive/job-archive.c    61.64% <ø> (ø)
src/common/libjob/list.c                 53.44% <16.66%> (-9.60%) ⬇️
src/cmd/flux-job.c                       87.67% <66.66%> (-0.20%) ⬇️
src/modules/job-list/match.c             91.40% <91.40%> (ø)
src/modules/job-list/state_match.c       94.30% <94.30%> (ø)
src/bindings/python/flux/job/list.py     96.06% <95.00%> (-0.40%) ⬇️
src/modules/job-list/list.c              73.11% <95.45%> (-3.77%) ⬇️
src/modules/job-list/match_util.c        100.00% <100.00%> (ø)

... and 253 files with indirect coverage changes

@mergify mergify bot merged commit f5bfbaa into flux-framework:master Jul 25, 2023
@chu11 chu11 deleted the issue4914_job_list_constraints branch July 25, 2023 18:17

3 participants