
teuthology/suite/run.py: Improve ScheduleFail exception#1924

Open
kamoltat wants to merge 4 commits into ceph:main from kamoltat:wip-ksirivad-teuth-suite-exception

Conversation

@kamoltat
Member

@kamoltat kamoltat commented Mar 7, 2024

(Merge this first before ceph/teuthology-api#51 and ceph/pulpito-ng#23)

As per: ceph/pulpito-ng#23

  1. Improve teuthology-schedule exceptions by utilizing ScheduleFail and GitError.
  2. Make teuthology.suite.main return job_count so that teuthology-api and pulpito-ng can use it.
  3. Edit teuthology.sh & start.sh so that we `set -x` earlier, for better output when watching docker-compose do its thing.

Fixes: https://tracker.ceph.com/issues/64820
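A minimal, self-contained sketch of what item 2 is going for (schedule_jobs and its behavior are illustrative stand-ins, not the actual teuthology code):

```python
# Hypothetical sketch: teuthology.suite.main returns the number of
# scheduled jobs instead of nothing, so callers such as teuthology-api
# and pulpito-ng can consume the count from the call site.
def schedule_jobs(jobs):
    """Pretend scheduler: 'schedules' every non-None job, returns the count."""
    return len([j for j in jobs if j is not None])

def main(jobs):
    job_count = schedule_jobs(jobs)
    return job_count
```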

@kamoltat kamoltat force-pushed the wip-ksirivad-teuth-suite-exception branch 3 times, most recently from 9ca70bd to 5fe4921 Compare March 13, 2024 20:05
@kamoltat kamoltat changed the title [DNM] teuthology/suite/run.py: Added some loggings and ScheduleFail exception teuthology/suite/run.py: Improve ScheduleFail exception Mar 13, 2024
@kamoltat kamoltat force-pushed the wip-ksirivad-teuth-suite-exception branch from 5fe4921 to 3f548c6 Compare March 13, 2024 21:02
Member

@zmc zmc left a comment


Just a couple minor changes suggested; have you scheduled a run with this branch? I'd be curious to see it if so

@kamoltat kamoltat force-pushed the wip-ksirivad-teuth-suite-exception branch from 3f548c6 to 27db304 Compare May 8, 2024 14:34
@kamoltat
Member Author

kamoltat commented May 8, 2024

integration test failed:

86.80 [WARNING]: Skipping Galaxy server https://galaxy.ansible.com/. Got an unexpected
86.80 error when getting available versions of collection amazon.aws: Unknown error
86.80 when attempting to call Galaxy at 'https://galaxy.ansible.com/api/':
86.80 'CustomHTTPSConnection' object has no attribute 'cert_file'.
86.80 'CustomHTTPSConnection' object has no attribute 'cert_file'
86.80 ERROR! Unknown error when attempting to call Galaxy at 'https://galaxy.ansible.com/api/': 'CustomHTTPSConnection' object has no attribute 'cert_file'. 'CustomHTTPSConnection' object has no attribute 'cert_file'
------
Dockerfile:22
--------------------
  21 |     COPY requirements.txt requirements.yml ansible.cfg bootstrap /teuthology/
  22 | >>> RUN \
  23 | >>>     cd /teuthology && \
  24 | >>>     mkdir ../archive_dir && \
  25 | >>>     mkdir log && \
  26 | >>>     chmod +x /teuthology/bootstrap && \
  27 | >>>     PIP_INSTALL_FLAGS="-r requirements.txt" ./bootstrap
  28 |     COPY . /teuthology
--------------------
ERROR: failed to solve: process "/bin/sh -c cd /teuthology &&     mkdir ../archive_dir &&     mkdir log &&     chmod +x /teuthology/bootstrap &&     PIP_INSTALL_FLAGS=\"-r requirements.txt\" ./bootstrap" did not complete successfully: exit code: 1
Service 'teuthology' failed to build : Build failed

@VallariAg VallariAg mentioned this pull request May 8, 2024
@kamoltat
Member Author

jenkins retest this please

@kamoltat kamoltat self-assigned this May 29, 2024
@kamoltat kamoltat force-pushed the wip-ksirivad-teuth-suite-exception branch from 47ddc26 to 27db304 Compare May 29, 2024 17:42
@kamoltat kamoltat force-pushed the wip-ksirivad-teuth-suite-exception branch 2 times, most recently from fc1bcd3 to d4df210 Compare June 17, 2024 14:14
@kamoltat kamoltat force-pushed the wip-ksirivad-teuth-suite-exception branch from d4df210 to a2bc796 Compare August 18, 2024 18:04
@kamoltat kamoltat force-pushed the wip-ksirivad-teuth-suite-exception branch from a2bc796 to 340e02b Compare October 2, 2024 19:00
@kamoltat kamoltat force-pushed the wip-ksirivad-teuth-suite-exception branch from 340e02b to 13d1c6d Compare October 15, 2025 14:54
@kamoltat kamoltat requested a review from a team as a code owner October 15, 2025 14:54
@kamoltat kamoltat requested review from VallariAg and amathuria and removed request for a team October 15, 2025 14:54
do it for /teuthology/teuthlogy.sh && /start.sh

Signed-off-by: Kamoltat Sirivadhna <ksirivad@redhat.com>
@kamoltat kamoltat force-pushed the wip-ksirivad-teuth-suite-exception branch from 13d1c6d to a498fef Compare March 4, 2026 20:02
VallariAg
VallariAg previously approved these changes Mar 13, 2026
Member

@VallariAg VallariAg left a comment

            conf[key] = value
        except ValueError:
            log.error(" --{} value has incorrect type/format".format(key))
            raise ScheduleFailError("--{} value has incorrect type/format".format(key), '')
Member


Do we need to log and raise? I'd also maybe just use an f-string here.

Member Author


Agreed, changed!
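A sketch of the shape the review suggests: raise with an f-string and drop the duplicate log.error, letting whoever catches the exception do the logging. The ScheduleFailError signature here is a stand-in assumed from the quoted snippet, not the actual teuthology class.

```python
class ScheduleFailError(Exception):
    """Stand-in for teuthology's ScheduleFailError (signature assumed)."""
    def __init__(self, message, name=''):
        self.message = message
        self.name = name
        super().__init__(message)

def set_conf_value(conf, key, value, caster=int):
    """Hypothetical helper: raise once with an f-string instead of
    logging and raising the same message."""
    try:
        conf[key] = caster(value)
    except ValueError:
        raise ScheduleFailError(f"--{key} value has incorrect type/format", '')
```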

@zmc
Member

zmc commented Mar 18, 2026

Do we have any idea why the integration test is failing here? I'm not seeing a useful clue

Added more logging and utilize exceptions, e.g. ScheduleFail, GitError

Signed-off-by: Kamoltat Sirivadhna <ksirivad@redhat.com>
Signed-off-by: Kamoltat Sirivadhna <ksirivad@redhat.com>
Changes:
- Added tmp_path fixture to all four TestScheduleSuite test methods
- Create the expected suite directory structure (tmp_path/suites/suite)
- Update self.args.suite_dir to point to the temporary directory

This ensures that when schedule_suite() constructs the suite path:
  suite_path = os.path.join(suite_dir, suite_relpath, 'suites', suite_name)
the directory actually exists and passes the os.path.exists() check.

Signed-off-by: Kamoltat Sirivadhna <ksirivad@redhat.com>
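The test fix described in the commit message above can be sketched like this (function and directory names are illustrative; only the quoted os.path.join construction is from the source):

```python
import os
import tempfile

def build_suite_path(suite_dir, suite_relpath, suite_name):
    """Mirror of the path construction quoted in the commit message."""
    return os.path.join(suite_dir, suite_relpath, 'suites', suite_name)

def test_schedule_suite_path_exists(tmp_path):
    # tmp_path is pytest's built-in temporary-directory fixture.
    # Create the expected suites/<name> layout so the
    # os.path.exists() check in schedule_suite() passes.
    (tmp_path / 'suites' / 'suite').mkdir(parents=True)
    suite_path = build_suite_path(str(tmp_path), '', 'suite')
    assert os.path.exists(suite_path)
```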
@kamoltat kamoltat force-pushed the wip-ksirivad-teuth-suite-exception branch from 9ccd2e0 to 844fa7e Compare March 20, 2026 02:49
@kamoltat
Member Author

Do we have any idea why the integration test is failing here? I'm not seeing a useful clue

I'll check back here once the newly triggered CI finishes. Thanks! @zmc

@kamoltat
Member Author

@zmc yeah, the failures are not super helpful. It seems like the job got scheduled but the teuthology container exited with code 1, meaning something could have gone wrong after the job was scheduled?

teuthology-1  | 2026-03-20 02:53:07,116.116 DEBUG:teuthology.suite.merge:configuring Lua randomseed to 349
teuthology-1  | 2026-03-20 02:53:07,116.116 DEBUG:teuthology.suite.merge:merging config {clusters/single tasks/teuthology}
teuthology-1  | 2026-03-20 02:53:07,117.117 DEBUG:teuthology.suite.merge:postmerge script running:
teuthology-1  | 
teuthology-1  | 2026-03-20 02:53:07,121.121 DEBUG:teuthology.packaging:Querying https://shaman.ceph.com/api/search?status=ready&project=ceph&flavor=default&distros=ubuntu%2F22.04%2Fx86_64&sha1=b49bd22951aa85aec96fe8b7976d730a7bf0ae0b
beanstalk-1   | accept 5
beanstalk-1   | close 5
teuthology-1  | Job scheduled with name root-2026-03-20_02:52:18-teuthology:no-ceph-main-distro-default-testnode and ID 1
teuthology-1  | 2026-03-20 02:53:08,270.270 INFO:teuthology.suite.run:Scheduling teuthology:no-ceph/{clusters/single tasks/teuthology}
beanstalk-1   | accept 5
paddles-1     | 2026-03-20 02:53:08,991 INFO  [paddles.controllers.runs] Creating run: root-2026-03-20_02:52:18-teuthology:no-ceph-main-distro-default-testnode
paddles-1     | 2026-03-20 02:53:09,012 INFO  [paddles.controllers.jobs] Creating job: root-2026-03-20_02:52:18-teuthology:no-ceph-main-distro-default-testnode/2
teuthology-1  | Job scheduled with name root-2026-03-20_02:52:18-teuthology:no-ceph-main-distro-default-testnode and ID 2
beanstalk-1   | close 5
teuthology-1  | 2026-03-20 02:53:09,131.131 INFO:teuthology.suite.run:Suite teuthology:no-ceph in /root/src/github.com_ceph_ceph_b49bd22951aa85aec96fe8b7976d730a7bf0ae0b/qa/suites/teuthology/no-ceph scheduled 1 jobs.
teuthology-1  | 2026-03-20 02:53:09,132.132 INFO:teuthology.suite.run:0/1 jobs were filtered out.
teuthology-1  | 2026-03-20 02:53:09,132.132 INFO:teuthology.suite.run:Scheduled 1 jobs in total.
beanstalk-1   | accept 5
beanstalk-1   | close 5
teuthology-1  | Job scheduled with name root-2026-03-20_02:52:18-teuthology:no-ceph-main-distro-default-testnode and ID 3
teuthology-1  | 2026-03-20 02:53:09,921.921 INFO:teuthology.suite.run:Test results viewable at http://pulpito:8081/root-2026-03-20_02:52:18-teuthology:no-ceph-main-distro-default-testnode/
Aborting on container exit...

teuthology-1 exited with code 1
 Container docker-compose-teuthology-1  Stopping
 Container docker-compose-testnode-1  Stopping
 Container docker-compose-testnode-3  Stopping
 Container docker-compose-pulpito-1  Stopping
 Container docker-compose-testnode-2  Stopping
 Container docker-compose-teuthology-1  Stopped
 Container docker-compose-beanstalk-1  Stopping
 Container docker-compose-beanstalk-1  Stopped
 Container docker-compose-testnode-2  Stopped
 Container docker-compose-testnode-1  Stopped
 Container docker-compose-pulpito-1  Stopped
 Container docker-compose-testnode-3  Stopped
 Container docker-compose-paddles-1  Stopping
 Container docker-compose-paddles-1  Stopped
 Container docker-compose-postgres-1  Stopping
 Container docker-compose-postgres-1  Stopped

Error: Process completed with exit code 1.

@VallariAg
Member

I see the integration test passes on main (triggered it just now): https://github.com/ceph/teuthology/actions/runs/23331414332

So I tried the following locally on the soko04 machine:

When I check out the latest main branch locally:

(virtualenv) vallariag@soko04:~$ teuthology-suite -vv -s nvmeof -c wip-rocky10-branch-of-the-day-2026-03-18-1773820469 --ceph-repo https://github.com/ceph/ceph-ci.git --suite-repo https://github.com/ceph/ceph-ci.git --suite-branch wip-rocky10-branch-of-the-day-2026-03-18-1773820469 -p 50 ~/rocky10.yaml -t wip-ksirivad-teuth-suite-exception-test --dry-run
...
2026-03-20 06:22:41,156.156 INFO:teuthology.suite.run:Test results viewable at https://pulpito.ceph.com/vallariag-2026-03-20_06:22:29-nvmeof-wip-rocky10-branch-of-the-day-2026-03-18-1773820469-distro-default-trial/
(virtualenv) vallariag@soko04:~$ echo $?
0

But when I check out this PR branch and run the same teuthology-suite command:

(virtualenv) vallariag@soko04:~$ teuthology-suite -vv -s nvmeof -c wip-rocky10-branch-of-the-day-2026-03-18-1773820469 --ceph-repo https://github.com/ceph/ceph-ci.git --suite-repo https://github.com/ceph/ceph-ci.git --suite-branch wip-rocky10-branch-of-the-day-2026-03-18-1773820469 -p 50 ~/rocky10.yaml -t wip-ksirivad-teuth-suite-exception-test --dry-run
...
2026-03-20 06:24:57,548.548 INFO:teuthology.suite.run:Test results viewable at https://pulpito.ceph.com/vallariag-2026-03-20_06:24:46-nvmeof-wip-rocky10-branch-of-the-day-2026-03-18-1773820469-distro-default-trial/
(virtualenv) vallariag@soko04:~$ echo $?
11

Not sure what this means yet, but I don't see any errors in teuthology-suite output.

@deepssin
Contributor

I think the integration failure is caused by one behavior change in this PR:

teuthology.suite.main() now returns job_count, and scripts/suite.py forwards that return value from the teuthology-suite CLI entrypoint. For console scripts, that value becomes the process exit code (sys.exit(return_value) behavior).
So a successful run that schedules 1 job exits with code 1, one that schedules 11 jobs exits with 11, and CI treats any non-zero exit as a failure.

That matches what we're seeing in integration (the teuthology container exits with code 1 despite successful scheduling/log output).

How about we keep job_count for API/pulpito consumers, but make the CLI path return 0 on success (and non-zero only for real errors or exceptions)?
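A minimal sketch of that split (both function names here are stand-ins for teuthology.suite.main and the scripts/suite.py entrypoint, not the actual code):

```python
def main(args=None):
    """Stand-in for teuthology.suite.main: returns the scheduled job count
    so API consumers can use it."""
    job_count = 1  # pretend one job was scheduled
    return job_count

def cli_main(args=None):
    """Stand-in console-script entrypoint: exit status is 0 on success,
    non-zero only for real failures, regardless of how many jobs
    were scheduled."""
    try:
        main(args)
    except Exception:
        return 1  # real error -> non-zero exit
    return 0
```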

