Split Python CLI job into smaller jobs #1279
base: main
Conversation
for easier debugging
force-pushed from 57b182f to e42b585
@haixuanTao I tried multiple approaches, but the Node example keeps failing. I'm a bit out of ideas, maybe you know what's going wrong? I also tried the example locally, but I had some issues with CUDA. In my opinion, the Python node example is too complex for the main dora repo since it uses some heavy dependencies (yolo, rerun). There were various CI failures caused by it, which forced me to spend a lot of time debugging them. I still don't know if they were caused by my changes or if I was just unlucky. Perhaps we can move the example to the `node-hub` repo?
I think the fact that it works on Windows but not on the other OSes probably means that there's nothing wrong with the split of the CI jobs. I wonder if it might be linked to a cli/api version mismatch or something, but yeah, we would probably need to dive deeper beyond the changes included in this PR.
What do you think about moving the Python example to the node-hub repo?
I think we need to test Python features, especially while we develop them, so I'm not sure how moving them would make it easier to do this.
Yes, we should keep testing Python code, just not something so dependency-heavy and fragile. Right now we don't even know why splitting the CI jobs breaks the test, so this mainly introduces complexity into our development cycle.
I probably spent multiple days trying to fix the failing tests in this PR and #1278, and it is still not clear to me whether these errors are actually related to my changes or whether they are just an unrelated side effect. Reproducing these errors locally is not easy, or even possible, for me, so I have to debug through the CI, which is extremely slow. I'm fine with keeping these tests in the repo if you prefer, as long as we move them to a separate, optional job. That way we could still merge with a failing job when the error seems unrelated.
I understand the frustration. I think the thing is that the CI failing on 2 different OSes means there is definitely a bug somewhere that users will face at one point or another. And I believe moving those tests to an optional CI job would make our code less reliable and more prone to errors, as we would not be fixing errors that are lurking within our repo. As we don't have all the time in the world to fix all the issues within dora, I would prefer that we either keep this PR open and wait for someone to take it over or rebase it, or close it.
I created this PR mainly to make debugging the CI failures of #1278 easier. #1278 implements an actual improvement to Dora to reduce the version mismatch error that users are experiencing. That work is now blocked on some CI errors that I cannot reproduce locally.
I agree with that. We should try our best to find and fix this. The best way to do that, in my opinion, is to create a simpler Python test case that tells us whether the basic functionality works as expected. Right now we only have a very complex test that involves heavy dependencies, so we have no idea where to look (is this even an issue in Dora, or is some dependency hanging?).
If this issue is too complex for us to debug, it will probably not be fixed by a contributor who is less familiar with Dora either. So the bug will probably stay unfixed. And given that it affects other PRs as well, we will waste a lot of time on CI failures that aren't related to the changes we make.
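A minimal sketch of what such a simpler test case could look like as a CI step, assuming the in-repo Python node API lives at `apis/python/node` and that `dora.Node` is importable after an editable install (the step name and placement are illustrative, not taken from this repo's workflow):

```yaml
- name: "Python smoke test"
  shell: bash
  run: |
    # Install only the in-repo Python node API into the active environment.
    uv pip install -e apis/python/node
    # Dependency-free sanity check: fail fast if the binding cannot even be imported.
    python -c "from dora import Node; print('dora import OK')"
```

A check like this would localize failures to Dora itself rather than to yolo, rerun, or CUDA.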
Ok, so what you are saying is that we replace the current Python CI example with something simpler? Yeah, that's fine with me!
Ok great! I opened #1306 for this.
Rename the object detection dataflow example to `dataflow-object-detection.yml`. This was discussed in #1279 (comment).
@claude, I think there is an env mismatch that is introduced within this PR's CI/CD. Could you create a new PR on top of this one that aligns the environment? Note that at the moment uv can be confused about which env to use and can fall back on a new env if there is a pyproject in the current working directory. It could also potentially select a virtual env different from the active env if there is a venv within the current directory, or sometimes if there is one in a parent directory. To make things simpler, maybe we should either have only 1 virtual env for the whole CI/CD or make sure that each virtual env is well isolated and not conflicting with any pyproject.toml.
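One way to make such a mismatch visible would be a small diagnostic step that prints which interpreter and virtual env each job actually resolves to (a sketch; the step name and placement are assumptions):

```yaml
- name: "Show resolved Python environment"
  shell: bash
  run: |
    # Which interpreter is first on PATH, and which prefix it runs from.
    which python
    python -c "import sys; print(sys.prefix)"
    # Environment variables that influence which env uv targets.
    echo "VIRTUAL_ENV=${VIRTUAL_ENV:-<unset>}"
    echo "UV_PROJECT_ENVIRONMENT=${UV_PROJECT_ENVIRONMENT:-<unset>}"
```

Comparing this output across the split jobs would show whether they are all installing into, and running from, the same environment.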
Claude finished @haixuanTao's task in 1m 18s

**CI/CD Environment Mismatch Analysis**

**Issue Identified**

The PR splits Python CLI tests into smaller jobs, but introduces a potential environment mismatch where …

**Root Cause:** …

**Affected Test Steps:** …

**Limitation**

You can find more information in the FAQ.

**Proposed Solutions**

**Option 1: Force explicit venv path (Recommended)**

Set the `UV_PROJECT_ENVIRONMENT` path explicitly:

```yaml
- name: "Set up venv"
  shell: bash
  run: |
    uv venv --seed -p 3.12 ${{ github.workspace }}/.venv
    echo "UV_PROJECT_ENVIRONMENT=${{ github.workspace }}/.venv" >> $GITHUB_ENV
    if [ "$RUNNER_OS" != "Windows" ]; then
      source .venv/bin/activate
    else
      source .venv/Scripts/activate
    fi
    uv pip install -e apis/python/node
    uv pip install ruff pytest
```

This ensures all …

**Option 2: Use system-level installation**

Install packages at the system level (which is safe in CI runners):

```yaml
- name: "Install Python dependencies"
  shell: bash
  run: |
    uv pip install --system -e apis/python/node
    uv pip install --system ruff pytest
```

Then remove all venv activation blocks and use commands directly (or with …).

**Option 3: Isolate each test with its own venv**

Create a fresh venv for each test step to ensure complete isolation, but this would be slower and use more disk space.

**Recommendation**

Option 1 is the best approach because: …

**Next Steps**

Since I cannot modify the workflow file directly, here are your options: …

Which would you prefer?
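For completeness, a rough sketch of what Option 3 (a fresh venv per test step) could look like; the venv location and the pytest target are placeholders, not paths from this repo's workflow:

```yaml
- name: "CLI test (isolated venv)"
  shell: bash
  run: |
    # Throwaway venv under RUNNER_TEMP, away from any pyproject.toml in the checkout.
    uv venv --seed -p 3.12 "$RUNNER_TEMP/cli-test-venv"
    if [ "$RUNNER_OS" != "Windows" ]; then
      source "$RUNNER_TEMP/cli-test-venv/bin/activate"
    else
      source "$RUNNER_TEMP/cli-test-venv/Scripts/activate"
    fi
    uv pip install -e apis/python/node
    uv pip install pytest
    pytest tests/python  # placeholder test target
```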
Claude encountered an error. I'll analyze this and get back to you.
Also: Run the new example added in #1298
This same example will be tested in the `node-hub` repo as part of dora-rs/dora-hub#23. Testing with external deps will likely lead to version mismatches of the message crate, so it requires synchronous changes to two repos, which makes development difficult.
Since the issue is likely caused by the external dependencies that use pinned commits to the …