Bug? Last job in any list of trivial jobs always fails. #815
Replies: 5 comments
Update: I also see this failure on the command line:
./hq server start &
daniel@pebby3:~/HQ$ ./hq job list
About 5-10 seconds later:
daniel@pebby3:
Then similarly, when submitting two jobs one after the other, one succeeds and one fails.
Much more simply, directly from the command line, no c++, done on two different machines:
Then:
Quite simply, attempting to run anything that lasts a non-trivial amount of time fails. The time limit setting has no effect. I've also tried other ways of making a slow program, like my original c++ example, or a bash script with
coproc read -t 10 && wait "$!" || true
What am I missing here?
Can you please show us output of
For a single "sleep 10" job which failed:
boston@boston-SYS-540A-TR:~/HQ$ ./hq job info 1
Then, submitting two such jobs one after the other (job IDs 2 and 3): ID 2 succeeded, 3 failed. This is for 3:
boston@boston-SYS-540A-TR:~/HQ$ ./hq job info 3
Job ID 2, which succeeded:
1 tasks failed.
This was a very nasty bug that will be fixed by #823. We will probably soon release a patch version for HQ, because this affects pretty much all users of v0.21.
Hi all,
I'm trying to build a simple demonstration program in c++, like an extremely simplified version of
https://github.com/UM-Bridge/umbridge/blob/main/hpc/LoadBalancer.cpp
for my local machine.
I wish to run multiple instances of the following trivial program:
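#include <iostream>
#include <unistd.h>  // sleep()
using namespace std;

int main() {
    cout << "....." << endl;
    sleep(10);  // pretend to do 10 seconds of work
    cout << "done " << endl;
}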
Which I compile to "presim.exe".
In my program using HQ, I try to set up like so:
// launch hq
std::system("~/HQ/hq server start &");
// wait for it to start
std::system("until ~/HQ/hq server info > /dev/null 2>&1; do sleep 1; done");
cout << "server is up" << endl;
std::system("~/HQ/hq worker start &");
cout <<"worker is up" << endl;
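The readiness check above loops forever if the server never comes up. As a minimal sketch, a bounded variant might look like the following. `wait_until_ready` is a hypothetical helper written for this post, not part of HQ; it retries any shell command through `std::system` and gives up after a fixed number of attempts.

```cpp
#include <chrono>
#include <cstdlib>
#include <string>
#include <thread>

// Hypothetical helper (not part of HQ): run `probe` through the shell until it
// exits with status 0, trying at most `max_attempts` times and pausing for
// `delay` between attempts. Returns true as soon as one attempt succeeds.
bool wait_until_ready(const std::string& probe, int max_attempts,
                      std::chrono::milliseconds delay = std::chrono::milliseconds(1000)) {
    for (int attempt = 0; attempt < max_attempts; ++attempt) {
        // std::system returns 0 when the shell command succeeded.
        if (std::system(probe.c_str()) == 0) {
            return true;
        }
        // Only pause if another attempt will follow.
        if (attempt + 1 < max_attempts) {
            std::this_thread::sleep_for(delay);
        }
    }
    return false;
}
```

With this, something like `wait_until_ready("~/HQ/hq server info > /dev/null 2>&1", 30)` would stop trying after roughly 30 seconds instead of hanging.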
Then fire off N-many instances of my test program:
int N = 5;
for (int i=0; i<N; ++i)
{
cout <<"submit job..." << endl;
std::system("~/HQ/hq submit ~/HQ_tests/pretend_sim/presim.exe");
}
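The submit loop above discards the value returned by `std::system`, so a submission that is rejected outright would go unnoticed. A small sketch, assuming a POSIX system and assuming that `hq submit` follows the usual convention of a non-zero exit code on error: `exit_code_of` is a hypothetical helper, not part of HQ. It only reports whether the command itself succeeded; it says nothing about whether the submitted job later finishes or fails.

```cpp
#include <cstdlib>
#include <string>
#include <sys/wait.h>  // WIFEXITED, WEXITSTATUS (POSIX)

// Hypothetical helper (not part of HQ): run `command` through the shell and
// return its exit code, or -1 if the shell could not be started or the
// command was terminated by a signal.
int exit_code_of(const std::string& command) {
    const int status = std::system(command.c_str());
    if (status == -1) {
        return -1;  // the child process could not be created
    }
    if (WIFEXITED(status)) {
        return WEXITSTATUS(status);  // normal termination: decode the exit code
    }
    return -1;  // killed by a signal or otherwise abnormal
}
```

Inside the loop, the submit call could then be written as `if (exit_code_of("~/HQ/hq submit ~/HQ_tests/pretend_sim/presim.exe") != 0) { /* report and stop */ }`.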
For this simple example I then wait 15 seconds, outputting the job list before and after, then stop the server and end the program:
cout << "state before waiting: " << endl;
std::system("~/HQ/hq job list --all");
sleep(15);
cout << "state before finishing: " << endl;
std::system("~/HQ/hq job list --all");
std::system("~/HQ/hq server stop &");
The output when N=5:
state before waiting:
+----+------------+---------+-------+
| ID | Name | State | Tasks |
+----+------------+---------+-------+
| 1 | presim.exe | RUNNING | 1 |
| 2 | presim.exe | RUNNING | 1 |
| 3 | presim.exe | RUNNING | 1 |
| 4 | presim.exe | RUNNING | 1 |
| 5 | presim.exe | RUNNING | 1 |
+----+------------+---------+-------+
state before finishing:
+----+------------+----------+-------+
| ID | Name | State | Tasks |
+----+------------+----------+-------+
| 1 | presim.exe | FINISHED | 1 |
| 2 | presim.exe | FINISHED | 1 |
| 3 | presim.exe | FINISHED | 1 |
| 4 | presim.exe | FINISHED | 1 |
| 5 | presim.exe | FAILED | 1 |
+----+------------+----------+-------+
When N=7:
state before waiting:
+----+------------+---------+-------+
| ID | Name | State | Tasks |
+----+------------+---------+-------+
| 1 | presim.exe | RUNNING | 1 |
| 2 | presim.exe | RUNNING | 1 |
| 3 | presim.exe | RUNNING | 1 |
| 4 | presim.exe | RUNNING | 1 |
| 5 | presim.exe | RUNNING | 1 |
| 6 | presim.exe | RUNNING | 1 |
| 7 | presim.exe | RUNNING | 1 |
+----+------------+---------+-------+
state before finishing:
+----+------------+----------+-------+
| ID | Name | State | Tasks |
+----+------------+----------+-------+
| 1 | presim.exe | FINISHED | 1 |
| 2 | presim.exe | FINISHED | 1 |
| 3 | presim.exe | FINISHED | 1 |
| 4 | presim.exe | FINISHED | 1 |
| 5 | presim.exe | FINISHED | 1 |
| 6 | presim.exe | FAILED | 1 |
| 7 | presim.exe | FAILED | 1 |
+----+------------+----------+-------+
N=3:
state before waiting:
+----+------------+---------+-------+
| ID | Name | State | Tasks |
+----+------------+---------+-------+
| 1 | presim.exe | RUNNING | 1 |
| 2 | presim.exe | RUNNING | 1 |
| 3 | presim.exe | RUNNING | 1 |
+----+------------+---------+-------+
state before finishing:
+----+------------+----------+-------+
| ID | Name | State | Tasks |
+----+------------+----------+-------+
| 1 | presim.exe | FINISHED | 1 |
| 2 | presim.exe | FINISHED | 1 |
| 3 | presim.exe | FAILED | 1 |
+----+------------+----------+-------+
N=1:
state before waiting:
+----+------------+---------+-------+
| ID | Name | State | Tasks |
+----+------------+---------+-------+
| 1 | presim.exe | WAITING | 1 |
+----+------------+---------+-------+
state before finishing:
+----+------------+--------+-------+
| ID | Name | State | Tasks |
+----+------------+--------+-------+
| 1 | presim.exe | FAILED | 1 |
+----+------------+--------+-------+
So at least one task fails every time: the last one (or two) submitted. In the extreme case of N=1, I can't even get a single task to complete. There is no output in the .stderr files for the failed jobs. When a job does run, it runs fine - correct output in the .stdout file, etc.
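Rather than reading the tables by eye, the per-state counts could be tallied in code. The sketch below assumes the job list has been captured as text and has exactly the four-column layout shown in this post (ID, Name, State, Tasks). `split_row` and `count_job_states` are hypothetical helpers, not part of HQ.

```cpp
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Split one table row such as "| 3 | presim.exe | FAILED | 1 |" into trimmed,
// non-empty cells.
static std::vector<std::string> split_row(const std::string& line) {
    std::vector<std::string> cells;
    std::istringstream row(line);
    std::string cell;
    while (std::getline(row, cell, '|')) {
        const auto first = cell.find_first_not_of(" \t\r");
        if (first == std::string::npos) continue;  // nothing but whitespace
        const auto last = cell.find_last_not_of(" \t\r");
        cells.push_back(cell.substr(first, last - first + 1));
    }
    return cells;
}

// Count jobs per state in a job-list table laid out as in this post.
// Border lines ("+----+...") and the header row are skipped.
std::map<std::string, int> count_job_states(const std::string& table) {
    std::map<std::string, int> counts;
    std::istringstream lines(table);
    std::string line;
    while (std::getline(lines, line)) {
        if (line.empty() || line[0] != '|') continue;  // not a data row
        const std::vector<std::string> cells = split_row(line);
        if (cells.size() != 4 || cells[0] == "ID") continue;  // header or malformed
        ++counts[cells[2]];  // third column holds the state
    }
    return counts;
}
```

Fed the N=3 "state before finishing" table, this would report two FINISHED jobs and one FAILED job, which makes the pattern easy to check across many values of N.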
Can anyone give any insight at all? Thanks in advance.