Unusual jobs cause different code errors  #12

@sketchkey

Hi there! We've recently had problems with our installation of GreenAlgorithms v0.2.1, and after updating to the newest release the same issues remain, suggesting it's actually an issue with the way our version of Slurm (v18.08.04) does accounting on our HPC.

After doing some digging, it looks like in cases where ReqMem, Elapsed, TotalCPU, or Partition are invalid or empty, the GreenAlgo code errors out. Some examples include:
- `ValueError: could not convert string to float: ''` (seems to be caused by ReqMem values of '0n')
- `AttributeError: 'float' object has no attribute 'split'` (seems to be caused by empty Elapsed or TotalCPU values?)
- `assert x in self.cluster_info['partitions'], f"\n-!- Unknown partition: {x} -!-\n"` (caused by empty Partition values?)
- `assert x.TotalCPUtime_ <= x.CPUwallclocktime_` followed by `AssertionError` (caused by TotalCPU time being longer than Elapsed time?)
- `assert (foo.WallclockTimeX.dt.total_seconds() == 0).all() # Cancelled GPU jobs won't have any GPUs allocated if they didn't start` followed by `AssertionError` (don't know exactly what causes this one!)
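
To make the suspected causes above concrete, here is a minimal sketch of a per-row check that flags the fields that appear to trigger the first three errors. It is not part of GreenAlgorithms: `parse_slurm_time`, `parse_reqmem_mb`, and `find_row_problems` are hypothetical helper names, the row is assumed to be a plain dict keyed by sacct column name, and the accepted formats (`[DD-][HH:]MM:SS[.mmm]` for times, a number plus a `K`/`M`/`G`/`T` unit and optional `n`/`c` suffix for ReqMem) are assumptions about what our Slurm version prints. The TotalCPU versus wall-clock comparison is left out because it depends on the core count.

```python
import re
from datetime import timedelta
from typing import Optional

# Assumed sacct formats; adjust if your Slurm version prints something else.
_TIME_RE = re.compile(
    r"^(?:(?P<days>\d+)-)?(?:(?P<hours>\d+):)?"
    r"(?P<minutes>\d+):(?P<seconds>\d+(?:\.\d+)?)$"
)
_MEM_RE = re.compile(r"^(?P<value>\d+(?:\.\d+)?)(?P<unit>[KMGT])(?P<per>[nc]?)$")
_MB_PER_UNIT = {"K": 1 / 1024, "M": 1.0, "G": 1024.0, "T": 1024.0 ** 2}


def parse_slurm_time(raw) -> Optional[timedelta]:
    """Return a timedelta, or None if the value is empty, NaN, or malformed."""
    if not isinstance(raw, str):  # e.g. a float NaN from an empty cell
        return None
    match = _TIME_RE.match(raw.strip())
    if match is None:
        return None
    parts = match.groupdict(default="0")
    return timedelta(
        days=int(parts["days"]),
        hours=int(parts["hours"]),
        minutes=int(parts["minutes"]),
        seconds=float(parts["seconds"]),
    )


def parse_reqmem_mb(raw) -> Optional[float]:
    """Return requested memory in MB, or None if it cannot be parsed.

    A value with no unit letter, such as '0n', is treated as invalid
    rather than guessing a unit.
    """
    if not isinstance(raw, str):
        return None
    match = _MEM_RE.match(raw.strip())
    if match is None:
        return None
    return float(match["value"]) * _MB_PER_UNIT[match["unit"]]


def find_row_problems(row: dict, known_partitions: set) -> list:
    """List the column names whose values would likely break processing."""
    problems = []
    if parse_reqmem_mb(row.get("ReqMem")) is None:
        problems.append("ReqMem")
    if parse_slurm_time(row.get("Elapsed")) is None:
        problems.append("Elapsed")
    if parse_slurm_time(row.get("TotalCPU")) is None:
        problems.append("TotalCPU")
    if row.get("Partition") not in known_partitions:
        problems.append("Partition")
    return problems
```

A row with an empty list of problems would be safe to pass on; anything else could be logged and dropped before the main calculation runs.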

I'm attaching an example file with the worst-offending jobs (some columns anonymised) that was produced by uncommenting some of the debug lines you included. Do you have suggestions for how to deal with these jobs? I started editing the _global.py and _workloadmanager.py code to deal with them, but it's become a can of worms, so I thought I'd check if there was a better way. Ideally I'd be able to get the code to run and just skip over rows for jobs it couldn't calculate, or run it on a specific job ID without processing the entire sacct output and filtering like --filterJobIDs does. Would be great if you have a way to easily achieve either of those!
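
For illustration, the two behaviours asked for here could look roughly like the sketch below. It does not use any GreenAlgorithms internals: `select_job_rows` and `process_skipping_failures` are hypothetical names, the input is assumed to be pipe-delimited sacct text with a header row (as produced by `sacct --parsable2`), and `process_row` stands in for whatever per-job calculation is being run.

```python
import csv
import io


def select_job_rows(sacct_text: str, job_id: str, delimiter: str = "|") -> list:
    """Keep only the rows for one job ID and its steps (e.g. '100.batch')."""
    reader = csv.DictReader(io.StringIO(sacct_text), delimiter=delimiter)
    return [
        row
        for row in reader
        if row["JobID"] == job_id or row["JobID"].startswith(job_id + ".")
    ]


def process_skipping_failures(rows: list, process_row) -> tuple:
    """Apply process_row to each row, recording failures instead of stopping."""
    results, skipped = [], []
    for row in rows:
        try:
            results.append(process_row(row))
        except (ValueError, AttributeError, AssertionError, KeyError) as exc:
            # Remember which job failed and why, then carry on with the rest.
            skipped.append((row.get("JobID"), repr(exc)))
    return results, skipped
```

Returning the skipped job IDs alongside the results keeps the failures visible, so the rows that could not be calculated can still be reported at the end.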

Happy to provide additional files (e.g., the cluster_info.yaml) if it's useful! Cheers!