
Conversation

@phil-opp
Collaborator

for easier debugging

@phil-opp
Collaborator Author

@haixuanTao I tried multiple approaches, but the Node example keeps failing. I'm a bit out of ideas; maybe you know what's going wrong?

I also tried the example locally, but I had some issues with CUDA.

In my opinion the Python node example is too complex for the main dora repo since it uses some heavy dependencies (yolo, rerun). There were various CI failures caused by it, which forced me to spend a lot of time debugging them. I still don't know if they were caused by my changes or if I was just unlucky. Perhaps we can move the example to the node-hub repo and add a simpler example with no external dependencies? Would that be ok with you?

@haixuanTao
Collaborator

I think the fact that it works on Windows but not on the other OSes probably means there's nothing wrong with the split of the CI jobs.

I wonder if it might be linked to a CLI/API version mismatch or something, but yeah, we would probably need to dive deeper, beyond the changes included in this PR.

@phil-opp
Collaborator Author

phil-opp commented Jan 7, 2026

What do you think about moving the Python example to the node-hub repo?

@haixuanTao
Collaborator

I think we need to test Python features, especially when we develop them, so I'm not sure how moving them would make that easier.

@phil-opp
Collaborator Author

phil-opp commented Jan 9, 2026

Yes, we should keep testing Python code, just not with something so dependency-heavy and fragile. Right now we don't even know why splitting the CI jobs breaks the test, so this mainly adds complexity to our development cycle.

@phil-opp
Collaborator Author

phil-opp commented Jan 9, 2026

I probably spent multiple days trying to fix the failing tests in this PR and #1278, and it is still not clear to me whether these errors are actually related to my changes or whether they are just an unrelated side effect. Reproducing these errors locally is difficult, or even impossible, for me, so I have to debug through the CI, which is extremely slow.

I'm fine with keeping these tests in the repo if you prefer, as long as we move them to a separate, optional job. This way we could still merge with a failing job when the error seems unrelated.

@haixuanTao
Collaborator

I understand the frustration.

I think the thing is that the CI failing on two different OSes means there is definitely a bug somewhere that users will face at one point or another.

And I believe moving those tests to an optional CI job would make our code less reliable and more prone to errors, as we wouldn't be fixing errors that are lurking within our repo.

As we don't have all the time in the world to fix all the issues within dora, I would prefer that we either keep this PR open and wait for someone to take over or rebase it, or close it.

@phil-opp
Collaborator Author

I created this PR mainly to make debugging the CI failures of #1278 easier. #1278 implements an actual improvement to Dora to reduce the version mismatch errors that users are experiencing. This work is now blocked on some CI errors that I cannot reproduce locally.

I think the thing is that the CI failing on two different OSes means there is definitely a bug somewhere that users will face at one point or another.

I agree with that. We should try our best to find and fix this.

The best way to do that, in my opinion, is to create a simpler Python test case that tells us whether the basic functionality works as expected. Right now we only have a very complex test that involves heavy dependencies, so we have no idea where to look (is this even an issue in Dora, or is some dependency hanging?).
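
To sketch what I mean, something along these lines would already cover the basics. The node ids, file names, and exact keys here are only illustrative assumptions, not a finished example:

# Hypothetical minimal dataflow: one Python node emits a counter value,
# a second node receives it. Nothing beyond the dora Python API is needed.
nodes:
  - id: sender
    path: sender.py        # illustrative producer script
    outputs:
      - counter
  - id: receiver
    path: receiver.py      # illustrative consumer script
    inputs:
      counter: sender/counter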

As we don't have all the time in the world to fix all the issues within dora, I would prefer that we either keep this PR open and wait for someone to take over or rebase it, or close it.

If this issue is too complex for us to debug, it will probably not be fixed by a contributor who is less familiar with Dora either. So the bug will probably stay unfixed.

And given that it affects other PRs as well, we will waste a lot of time on CI failures that aren't related to the changes we make.

@haixuanTao
Collaborator

Ok, so what you are saying is that we replace the current Python CI example with something simpler?

Yeah that's fine with me!

@phil-opp
Collaborator Author

Ok great! I opened #1306 for this.

haixuanTao added a commit that referenced this pull request Jan 13, 2026
Rename the object detection dataflow example to
`dataflow-object-detection.yml`

This was discussed in
#1279 (comment)
@haixuanTao
Collaborator

@claude, I think there is an env mismatch introduced by this PR's CI/CD; could you create a new PR on top of this one that aligns the environments? Note that at the moment uv can be confused about which env to use and can fall back to a new env if there is a pyproject in the current working directory. It could also potentially select a virtual env different from the active env if there is a venv within the current directory, or sometimes if there is one in a parent directory.

To make things simpler, maybe we should either have only one virtual env for the whole CI/CD, or make sure that each virtual env is well isolated and not conflicting with any pyproject.toml.

@claude

claude bot commented Jan 14, 2026

Claude finished @haixuanTao's task in 1m 18s


CI/CD Environment Mismatch Analysis

Issue Identified

The PR splits Python CLI tests into smaller jobs, but introduces a potential environment mismatch where uv can get confused about which virtual environment to use.

Root Cause:

  1. A single venv is created at the repository root (.venv/)
  2. Tests navigate into subdirectories with their own pyproject.toml files (e.g., test_python_project, various examples)
  3. uv may:
    • Fall back to creating a new environment when it finds a pyproject.toml in the current directory
    • Select a different virtual environment if one exists in the current or parent directory
    • Not respect the activated environment in some cases
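
As a concrete (hypothetical) illustration of the failure mode, a step along these lines would exhibit the problem; the step name and directory are assumptions, not the exact workflow contents:

- name: "Test CLI (Python): Template Project"   # illustrative step
  shell: bash
  run: |
    cd test_python_project   # directory with its own pyproject.toml
    # Without an explicit environment setting, uv's project commands treat
    # this directory as a standalone project and may sync or create a local
    # .venv here instead of reusing the environment prepared at the repo root.
    uv run pytest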

Affected Test Steps:

  • Test CLI (Python): Template Project - creates and enters test_python_project/ directory
  • Various example tests that have their own pyproject.toml files

Limitation

⚠️ I cannot modify .github/workflows/ci.yml files due to GitHub App permissions restrictions. Workflow files require elevated permissions for security reasons.

You can find more information in the FAQ.

Proposed Solutions

Option 1: Force explicit venv path (Recommended)

Set the UV_PROJECT_ENVIRONMENT environment variable to force uv to use a specific venv regardless of directory:

- name: "Set up venv"
  shell: bash
  run: |
    uv venv --seed -p 3.12 ${{ github.workspace }}/.venv
    echo "UV_PROJECT_ENVIRONMENT=${{ github.workspace }}/.venv" >> $GITHUB_ENV
    if [ "$RUNNER_OS" != "Windows" ]; then
      source .venv/bin/activate
    else
      source .venv/Scripts/activate
    fi
    
    uv pip install -e apis/python/node
    uv pip install ruff pytest

This ensures all uv commands use the same virtual environment location, preventing confusion when working in subdirectories.
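
For example, a later step that works in a subdirectory would then keep resolving to the root venv; the step name and path below are illustrative assumptions:

- name: "Run example in subdirectory"           # illustrative step
  shell: bash
  working-directory: examples/python-dataflow   # hypothetical path
  run: |
    # UV_PROJECT_ENVIRONMENT (exported to GITHUB_ENV above) points uv's
    # project commands at the root .venv, even though this directory has
    # its own pyproject.toml.
    uv run python main.py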

Option 2: Use system-level installation

Install packages at the system level (which is safe in CI runners):

- name: "Install Python dependencies"
  shell: bash
  run: |
    uv pip install --system -e apis/python/node
    uv pip install --system ruff pytest

Then remove all venv activation blocks and use commands directly (or with uv run --system).
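
Under this option, a later test step would call the tools directly against the runner's Python; the step below is a hypothetical illustration, not part of the existing workflow:

- name: "Test CLI (Python)"                      # illustrative step
  shell: bash
  run: |
    # ruff and pytest were installed with `uv pip install --system` above,
    # so no venv activation is needed here.
    ruff check apis/python/node                  # path is illustrative
    pytest apis/python/node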

Option 3: Isolate each test with its own venv

Create a fresh venv for each test step to ensure complete isolation, but this would be slower and use more disk space.

Recommendation

Option 1 is the best approach because:

  • ✅ Maintains a single venv (faster, less disk usage)
  • ✅ Explicit control over which environment uv uses
  • ✅ No ambiguity when navigating directories
  • ✅ Minimal changes to existing workflow

Next Steps

Since I cannot modify the workflow file directly, here are your options:

  1. Apply the changes manually - Add echo "UV_PROJECT_ENVIRONMENT=${{ github.workspace }}/.venv" >> $GITHUB_ENV to the "Set up venv" step
  2. Create a patch file - I can create a detailed patch file with all the changes
  3. Share the exact diff - I can provide the precise changes needed

Which would you prefer?


@phil-opp
Collaborator Author

@claude Please create a patch file for Option 1. Could you also include a new job for running the example added in #1298?

@claude

claude bot commented Jan 14, 2026

Claude encountered an error


I'll analyze this and get back to you.

This same example will be tested in the `node-hub` repo as part of dora-rs/dora-hub#23

Testing with external deps will likely lead to version mismatches of the message crate. So it requires synchronous changes to two repos, which makes development difficult.
@phil-opp
Collaborator Author

Since the issue is likely caused by the external dependencies that use pinned commits to the node-hub repo, I just removed the object detection example from this repo and from the CI in 7f477f1. The same example already lives in https://github.com/dora-rs/dora-hub/blob/main/examples/object-detection/yolo.yml, so I opened dora-rs/dora-hub#23 to run the node hub examples on CI too. This way, we keep testing the example with the latest version of dora.

@phil-opp phil-opp enabled auto-merge (squash) January 14, 2026 13:17