
teuthology/suite/run.py: Improve ScheduleFail exception#1924

Open
kamoltat wants to merge 4 commits into ceph:main from kamoltat:wip-ksirivad-teuth-suite-exception

Conversation

@kamoltat
Member

@kamoltat kamoltat commented Mar 7, 2024

(Merge this first before ceph/teuthology-api#51 and ceph/pulpito-ng#23)

As per: ceph/pulpito-ng#23

  1. Improve teuthology-schedule exceptions by utilizing ScheduleFail and GitError.
  2. Make teuthology.suite.main return job_count so that teuthology-api and pulpito-ng can use it.
  3. Edit teuthology.sh & start.sh so that we `set -x` earlier, for better output when watching docker-compose do its thing.

Fixes: https://tracker.ceph.com/issues/64820
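A minimal, self-contained sketch of what item 2 is going for (schedule_jobs and its behavior are illustrative stand-ins, not the actual teuthology code):

```python
# Hypothetical sketch: teuthology.suite.main returns the number of
# scheduled jobs instead of nothing, so callers such as teuthology-api
# and pulpito-ng can consume the count from the call site.
def schedule_jobs(jobs):
    """Pretend scheduler: 'schedules' every non-None job, returns the count."""
    return len([j for j in jobs if j is not None])

def main(jobs):
    job_count = schedule_jobs(jobs)
    return job_count
```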

@kamoltat kamoltat force-pushed the wip-ksirivad-teuth-suite-exception branch 3 times, most recently from 9ca70bd to 5fe4921 Compare March 13, 2024 20:05
@kamoltat kamoltat changed the title [DNM] teuthology/suite/run.py: Added some loggings and ScheduleFail exception teuthology/suite/run.py: Improve ScheduleFail exception Mar 13, 2024
@kamoltat kamoltat force-pushed the wip-ksirivad-teuth-suite-exception branch from 5fe4921 to 3f548c6 Compare March 13, 2024 21:02
Member

@zmc zmc left a comment


Just a couple minor changes suggested; have you scheduled a run with this branch? I'd be curious to see it if so

@kamoltat kamoltat force-pushed the wip-ksirivad-teuth-suite-exception branch from 3f548c6 to 27db304 Compare May 8, 2024 14:34
@kamoltat
Member Author

kamoltat commented May 8, 2024

integration test failed:

86.80 [WARNING]: Skipping Galaxy server https://galaxy.ansible.com/. Got an unexpected
86.80 error when getting available versions of collection amazon.aws: Unknown error
86.80 when attempting to call Galaxy at 'https://galaxy.ansible.com/api/':
86.80 'CustomHTTPSConnection' object has no attribute 'cert_file'.
86.80 'CustomHTTPSConnection' object has no attribute 'cert_file'
86.80 ERROR! Unknown error when attempting to call Galaxy at 'https://galaxy.ansible.com/api/': 'CustomHTTPSConnection' object has no attribute 'cert_file'. 'CustomHTTPSConnection' object has no attribute 'cert_file'
------
Dockerfile:22
--------------------
  21 |     COPY requirements.txt requirements.yml ansible.cfg bootstrap /teuthology/
  22 | >>> RUN \
  23 | >>>     cd /teuthology && \
  24 | >>>     mkdir ../archive_dir && \
  25 | >>>     mkdir log && \
  26 | >>>     chmod +x /teuthology/bootstrap && \
  27 | >>>     PIP_INSTALL_FLAGS="-r requirements.txt" ./bootstrap
  28 |     COPY . /teuthology
--------------------
ERROR: failed to solve: process "/bin/sh -c cd /teuthology &&     mkdir ../archive_dir &&     mkdir log &&     chmod +x /teuthology/bootstrap &&     PIP_INSTALL_FLAGS=\"-r requirements.txt\" ./bootstrap" did not complete successfully: exit code: 1
Service 'teuthology' failed to build : Build failed

@VallariAg VallariAg mentioned this pull request May 8, 2024
@kamoltat
Member Author

jenkins retest this please

@kamoltat kamoltat self-assigned this May 29, 2024
@kamoltat kamoltat force-pushed the wip-ksirivad-teuth-suite-exception branch from 47ddc26 to 27db304 Compare May 29, 2024 17:42
@kamoltat kamoltat force-pushed the wip-ksirivad-teuth-suite-exception branch 2 times, most recently from fc1bcd3 to d4df210 Compare June 17, 2024 14:14
@kamoltat kamoltat force-pushed the wip-ksirivad-teuth-suite-exception branch from d4df210 to a2bc796 Compare August 18, 2024 18:04
@kamoltat kamoltat force-pushed the wip-ksirivad-teuth-suite-exception branch from a2bc796 to 340e02b Compare October 2, 2024 19:00
@kamoltat kamoltat force-pushed the wip-ksirivad-teuth-suite-exception branch from 340e02b to 13d1c6d Compare October 15, 2025 14:54
@kamoltat kamoltat requested a review from a team as a code owner October 15, 2025 14:54
@kamoltat kamoltat requested review from VallariAg and amathuria and removed request for a team October 15, 2025 14:54
do it for /teuthology/teuthlogy.sh && /start.sh

Signed-off-by: Kamoltat Sirivadhna <ksirivad@redhat.com>
@kamoltat kamoltat force-pushed the wip-ksirivad-teuth-suite-exception branch from 13d1c6d to a498fef Compare March 4, 2026 20:02
VallariAg
VallariAg previously approved these changes Mar 13, 2026
Member

@VallariAg VallariAg left a comment

            conf[key] = value
        except ValueError:
            log.error(" --{} value has incorrect type/format".format(key))
            raise ScheduleFailError("--{} value has incorrect type/format".format(key), '')
Member


Do we need to log and raise? I'd also maybe just use an f-string here.

Member Author


Agreed, changed!
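A sketch of the shape the review suggests: raise with an f-string and drop the duplicate log.error, letting whoever catches the exception do the logging. The ScheduleFailError signature here is a stand-in assumed from the quoted snippet, not the actual teuthology class.

```python
class ScheduleFailError(Exception):
    """Stand-in for teuthology's ScheduleFailError (signature assumed)."""
    def __init__(self, message, name=''):
        self.message = message
        self.name = name
        super().__init__(message)

def set_conf_value(conf, key, value, caster=int):
    """Hypothetical helper: raise once with an f-string instead of
    logging and raising the same message."""
    try:
        conf[key] = caster(value)
    except ValueError:
        raise ScheduleFailError(f"--{key} value has incorrect type/format", '')
```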

@zmc
Member

zmc commented Mar 18, 2026

Do we have any idea why the integration test is failing here? I'm not seeing a useful clue

Added more logging and utilize exceptions, e.g. ScheduleFail, GitError

Signed-off-by: Kamoltat Sirivadhna <ksirivad@redhat.com>
Signed-off-by: Kamoltat Sirivadhna <ksirivad@redhat.com>
Changes:
- Added tmp_path fixture to all four TestScheduleSuite test methods
- Create the expected suite directory structure (tmp_path/suites/suite)
- Update self.args.suite_dir to point to the temporary directory

This ensures that when schedule_suite() constructs the suite path:
  suite_path = os.path.join(suite_dir, suite_relpath, 'suites', suite_name)
the directory actually exists and passes the os.path.exists() check.

Signed-off-by: Kamoltat Sirivadhna <ksirivad@redhat.com>
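The test fix described in the commit message above can be sketched like this (function and directory names are illustrative; only the quoted os.path.join construction is from the source):

```python
import os
import tempfile

def build_suite_path(suite_dir, suite_relpath, suite_name):
    """Mirror of the path construction quoted in the commit message."""
    return os.path.join(suite_dir, suite_relpath, 'suites', suite_name)

def test_schedule_suite_path_exists(tmp_path):
    # tmp_path is pytest's built-in temporary-directory fixture.
    # Create the expected suites/<name> layout so the
    # os.path.exists() check in schedule_suite() passes.
    (tmp_path / 'suites' / 'suite').mkdir(parents=True)
    suite_path = build_suite_path(str(tmp_path), '', 'suite')
    assert os.path.exists(suite_path)
```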
@kamoltat kamoltat force-pushed the wip-ksirivad-teuth-suite-exception branch from 9ccd2e0 to 844fa7e Compare March 20, 2026 02:49
@kamoltat
Member Author

Do we have any idea why the integration test is failing here? I'm not seeing a useful clue

I'll check back here once the newly triggered CI finishes. Thanks! @zmc

@kamoltat
Member Author

@zmc yeah, the failures are not super helpful. It seems like the job got scheduled but the teuthology container exited with code 1, meaning something could have gone wrong after the job was scheduled?

teuthology-1  | 2026-03-20 02:53:07,116.116 DEBUG:teuthology.suite.merge:configuring Lua randomseed to 349
teuthology-1  | 2026-03-20 02:53:07,116.116 DEBUG:teuthology.suite.merge:merging config {clusters/single tasks/teuthology}
teuthology-1  | 2026-03-20 02:53:07,117.117 DEBUG:teuthology.suite.merge:postmerge script running:
teuthology-1  | 
teuthology-1  | 2026-03-20 02:53:07,121.121 DEBUG:teuthology.packaging:Querying https://shaman.ceph.com/api/search?status=ready&project=ceph&flavor=default&distros=ubuntu%2F22.04%2Fx86_64&sha1=b49bd22951aa85aec96fe8b7976d730a7bf0ae0b
beanstalk-1   | accept 5
beanstalk-1   | close 5
teuthology-1  | Job scheduled with name root-2026-03-20_02:52:18-teuthology:no-ceph-main-distro-default-testnode and ID 1
teuthology-1  | 2026-03-20 02:53:08,270.270 INFO:teuthology.suite.run:Scheduling teuthology:no-ceph/{clusters/single tasks/teuthology}
beanstalk-1   | accept 5
paddles-1     | 2026-03-20 02:53:08,991 INFO  [paddles.controllers.runs] Creating run: root-2026-03-20_02:52:18-teuthology:no-ceph-main-distro-default-testnode
paddles-1     | 2026-03-20 02:53:09,012 INFO  [paddles.controllers.jobs] Creating job: root-2026-03-20_02:52:18-teuthology:no-ceph-main-distro-default-testnode/2
teuthology-1  | Job scheduled with name root-2026-03-20_02:52:18-teuthology:no-ceph-main-distro-default-testnode and ID 2
beanstalk-1   | close 5
teuthology-1  | 2026-03-20 02:53:09,131.131 INFO:teuthology.suite.run:Suite teuthology:no-ceph in /root/src/github.com_ceph_ceph_b49bd22951aa85aec96fe8b7976d730a7bf0ae0b/qa/suites/teuthology/no-ceph scheduled 1 jobs.
teuthology-1  | 2026-03-20 02:53:09,132.132 INFO:teuthology.suite.run:0/1 jobs were filtered out.
teuthology-1  | 2026-03-20 02:53:09,132.132 INFO:teuthology.suite.run:Scheduled 1 jobs in total.
beanstalk-1   | accept 5
beanstalk-1   | close 5
teuthology-1  | Job scheduled with name root-2026-03-20_02:52:18-teuthology:no-ceph-main-distro-default-testnode and ID 3
teuthology-1  | 2026-03-20 02:53:09,921.921 INFO:teuthology.suite.run:Test results viewable at http://pulpito:8081/root-2026-03-20_02:52:18-teuthology:no-ceph-main-distro-default-testnode/
Aborting on container exit...

teuthology-1 exited with code 1
 Container docker-compose-teuthology-1  Stopping
 Container docker-compose-testnode-1  Stopping
 Container docker-compose-testnode-3  Stopping
 Container docker-compose-pulpito-1  Stopping
 Container docker-compose-testnode-2  Stopping
 Container docker-compose-teuthology-1  Stopped
 Container docker-compose-beanstalk-1  Stopping
 Container docker-compose-beanstalk-1  Stopped
 Container docker-compose-testnode-2  Stopped
 Container docker-compose-testnode-1  Stopped
 Container docker-compose-pulpito-1  Stopped
 Container docker-compose-testnode-3  Stopped
 Container docker-compose-paddles-1  Stopping
 Container docker-compose-paddles-1  Stopped
 Container docker-compose-postgres-1  Stopping
 Container docker-compose-postgres-1  Stopped

Error: Process completed with exit code 1.

@VallariAg
Member

I see the integration test passes on main (triggered it just now): https://github.com/ceph/teuthology/actions/runs/23331414332

So I tried the following locally on the soko04 machine:

When I check out the latest main branch locally:

(virtualenv) vallariag@soko04:~$ teuthology-suite -vv -s nvmeof -c wip-rocky10-branch-of-the-day-2026-03-18-1773820469 --ceph-repo https://github.com/ceph/ceph-ci.git --suite-repo https://github.com/ceph/ceph-ci.git --suite-branch wip-rocky10-branch-of-the-day-2026-03-18-1773820469 -p 50 ~/rocky10.yaml -t wip-ksirivad-teuth-suite-exception-test --dry-run
...
2026-03-20 06:22:41,156.156 INFO:teuthology.suite.run:Test results viewable at https://pulpito.ceph.com/vallariag-2026-03-20_06:22:29-nvmeof-wip-rocky10-branch-of-the-day-2026-03-18-1773820469-distro-default-trial/
(virtualenv) vallariag@soko04:~$ echo $?
0

But when I check out this PR branch and run the same teuthology-suite command:

(virtualenv) vallariag@soko04:~$ teuthology-suite -vv -s nvmeof -c wip-rocky10-branch-of-the-day-2026-03-18-1773820469 --ceph-repo https://github.com/ceph/ceph-ci.git --suite-repo https://github.com/ceph/ceph-ci.git --suite-branch wip-rocky10-branch-of-the-day-2026-03-18-1773820469 -p 50 ~/rocky10.yaml -t wip-ksirivad-teuth-suite-exception-test --dry-run
...
2026-03-20 06:24:57,548.548 INFO:teuthology.suite.run:Test results viewable at https://pulpito.ceph.com/vallariag-2026-03-20_06:24:46-nvmeof-wip-rocky10-branch-of-the-day-2026-03-18-1773820469-distro-default-trial/
(virtualenv) vallariag@soko04:~$ echo $?
11

Not sure what this means yet, but I don't see any errors in teuthology-suite output.

@deepssin
Contributor

I think the integration failure is caused by one behavior change in this PR:

teuthology.suite.main() now returns job_count, and scripts/suite.py forwards that return value from the teuthology-suite CLI entrypoint. For console scripts, that value becomes the process exit code (sys.exit(return_value) behavior).
So a successful run that schedules 1 job exits with code 1, one that schedules 11 jobs exits with 11, and CI treats any non-zero exit as a failure.

That matches what we're seeing in integration (the teuthology container exits with code 1 despite successful scheduling/log output).

How about we keep job_count for API/pulpito consumers, but make the CLI path return 0 on success (and non-zero only for real errors or exceptions)?
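A minimal sketch of that split (both function names here are stand-ins for teuthology.suite.main and the scripts/suite.py entrypoint, not the actual code):

```python
def main(args=None):
    """Stand-in for teuthology.suite.main: returns the scheduled job count
    so API consumers can use it."""
    job_count = 1  # pretend one job was scheduled
    return job_count

def cli_main(args=None):
    """Stand-in console-script entrypoint: exit status is 0 on success,
    non-zero only for real failures, regardless of how many jobs
    were scheduled."""
    try:
        main(args)
    except Exception:
        return 1  # real error -> non-zero exit
    return 0
```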

