@joecummings (Member) commented Sep 3, 2025

Fixing OSS install

... while committing coding atrocities that should get me locked up.

What's da problem?

  • After running the installer, conda crashed with:

    ImportError: ... libcrypto.so.3: version `OPENSSL_3.4.0' not found
    
  • Cool, I fixed that, but the fix exposed a second error when running the GRPO app:

    ImportError: libpython3.10.so.1.0: cannot open shared object file
    

What are the root causes here?

  1. OpenSSL collision
    The activation script put ${CONDA_PREFIX}/lib into the global LD_LIBRARY_PATH.
    /usr/bin/conda runs on the system Python, which then loaded conda’s libcrypto.so.3 (older symbol set) instead of the system copy. The system Python’s _hashlib module needs the OPENSSL_3.4.0 symbol version, which conda’s copy doesn’t provide, so importing _hashlib failed.

  2. Missing libpython at import time
    The Monarch Rust extension links dynamically against libpython3.10.so.1.0.
    After ${CONDA_PREFIX}/lib was removed from the global LD_LIBRARY_PATH, the dynamic loader could no longer find libpython3.10.so.1.0 when Python imported Monarch.
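
Root cause 2 can be confirmed by listing the extension's NEEDED entries with `readelf -d`. A minimal sketch, assuming GNU binutils `readelf`; the helper name `needs_libpython` and the extension path in the usage comment are hypothetical placeholders, not part of Monarch:

```bash
#!/usr/bin/env bash
# Hypothetical helper: reads `readelf -d` output on stdin and succeeds (status 0)
# if any NEEDED entry names a libpython shared library.
needs_libpython() {
  grep -qE '\(NEEDED\).*\[libpython[0-9.]*\.so'
}

# Usage (path is a placeholder for the installed Monarch extension module):
#   readelf -d /path/to/monarch_extension.so | needs_libpython && echo "links libpython"
```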

What options did I consider while trying not to break everything?

  • A. Scope LD_LIBRARY_PATH to Python invocations only (a.k.a. a shim around python)
    Wrap python/python3 to add ${CONDA_PREFIX}/lib just for that process:

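    # activate.d/python_ld_shim.sh
    # Prepend ${CONDA_PREFIX}/lib to LD_LIBRARY_PATH for the python process only.
    # `command` skips the shell function, so the real binary runs instead of recursing.
    python()  { LD_LIBRARY_PATH="${CONDA_PREFIX}/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}" command python  "$@"; }
    python3() { LD_LIBRARY_PATH="${CONDA_PREFIX}/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}" command python3 "$@"; }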

    This keeps system tools (e.g., /usr/bin/conda) clean while satisfying extensions that need libpython.

  • B. Rebuild Monarch with a stable ABI (preferred long-term)
    Build the Rust extension with PyO3’s extension-module and abi3 features (e.g., abi3-py310): extension-module keeps it from linking to libpython at all, and abi3 yields one build that works across CPython minor versions. This eliminates the need for any LD hacks.
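
The property Option A relies on is that a `VAR=value command ...` prefix exports the modified LD_LIBRARY_PATH only to the child process, and that `${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}` adds the `:` separator only when a value already exists. The sketch below checks both by wrapping `printenv` instead of `python`; `/opt/fake-conda` is a made-up stand-in for `${CONDA_PREFIX}`, `/usr/local/cuda/compat` is just an example path, and `show_child_ld` is a hypothetical helper name. Run it as a standalone script, since it overwrites CONDA_PREFIX.

```bash
#!/usr/bin/env bash
# Made-up prefix standing in for a real ${CONDA_PREFIX}.
CONDA_PREFIX=/opt/fake-conda

# Same pattern as the python()/python3() shims, but the wrapped command is
# `printenv`, so we can see exactly what the child process receives.
show_child_ld() {
  LD_LIBRARY_PATH="${CONDA_PREFIX}/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}" command printenv LD_LIBRARY_PATH
}

unset LD_LIBRARY_PATH
show_child_ld                 # prints /opt/fake-conda/lib (no trailing ':')

LD_LIBRARY_PATH=/usr/local/cuda/compat
show_child_ld                 # prints /opt/fake-conda/lib:/usr/local/cuda/compat
echo "$LD_LIBRARY_PATH"       # calling shell still has /usr/local/cuda/compat
```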

What this PR does (current fix)

  • Removed ${CONDA_PREFIX}/lib from the global LD_LIBRARY_PATH (prevents OpenSSL collisions).
  • Added Option A: shell-function shims for python/python3 in activate.d, so only Python gets ${CONDA_PREFIX}/lib.
  • Limited the global LD_LIBRARY_PATH to the CUDA compat/ directory only (driver shims), which avoids CUDA version issues without touching unrelated system libs.
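
With these changes, the global LD_LIBRARY_PATH after activation should no longer contain `${CONDA_PREFIX}/lib`. A minimal sketch of that check, matching whole colon-separated entries rather than substrings; `path_has_entry` is a hypothetical helper name:

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeeds if $2 appears as an exact entry in the
# colon-separated list $1 (so /opt/conda/lib64 does not match /opt/conda/lib).
path_has_entry() {
  case ":$1:" in
    *":$2:"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example check for the current shell (run after `conda activate`).
if [ -n "${CONDA_PREFIX:-}" ] && path_has_entry "${LD_LIBRARY_PATH:-}" "${CONDA_PREFIX}/lib"; then
  echo "WARNING: \${CONDA_PREFIX}/lib is on the global LD_LIBRARY_PATH" >&2
fi
```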

Why Option A now?

  • Zero rebuilds or binary patching; minimal, reversible change.
  • Restores conda functionality and unblocks GRPO for OSS users immediately.

Long-term plan (Option B)

  • Rebuild Monarch’s extension using PyO3 abi3 (features = ["extension-module", "abi3", "abi3-py310"]) and ship an abi3 wheel.
  • Benefits: no libpython dependency, fewer version pinning issues, simpler manylinux compliance, no LD_LIBRARY_PATH tricks.
  • Trade-off: must stay within CPython’s limited (stable) C-API; raise the minimum Python (e.g., 3.10/3.11) as needed for required stable APIs.
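
Once an abi3 wheel is built, its filename shows whether it targets the stable ABI: wheel names follow `{name}-{version}(-{build})?-{python tag}-{abi tag}-{platform tag}.whl`, and an abi3 build carries `abi3` as the ABI tag. A minimal sketch; the helper name and the example filenames are hypothetical:

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeeds if the wheel filename's ABI tag is "abi3".
# The ABI tag is the second-to-last dash-separated field before ".whl".
wheel_is_abi3() {
  local name rest abi
  name=$(basename "$1" .whl)
  rest=${name%-*}   # drop the platform tag
  abi=${rest##*-}   # last remaining field is the ABI tag
  [ "$abi" = "abi3" ]
}

# Example (made-up filename):
#   wheel_is_abi3 dist/example_ext-0.1.0-cp310-abi3-manylinux_2_28_x86_64.whl && echo "abi3 wheel"
```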

HOW DO YOU KNOW IT WORKS?!?!?!

  • conda --version works (no OpenSSL error).
  • python -c "import torch, vllm; ..." succeeds.
  • RL entrypoint runs: python -m apps.grpo.main.
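
The checks above could be bundled into a quick smoke test. A sketch only; `run_check` is a hypothetical helper. Because the python()/python3() shims are shell functions, a separate script won't see them unless it is sourced into the activated shell (or the functions are exported with `export -f`), so the example calls are meant for the activated interactive shell. The GRPO entrypoint is left as a plain command because it launches a full run.

```bash
#!/usr/bin/env bash
# Hypothetical helper: run a command quietly and report PASS/FAIL with a label.
run_check() {
  local label=$1
  shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS: $label"
  else
    echo "FAIL: $label"
    return 1
  fi
}

# Example usage in the activated shell:
#   run_check "conda runs (no OpenSSL error)" conda --version
#   run_check "torch and vllm import" python -c "import torch, vllm"
#   python -m apps.grpo.main   # RL entrypoint, runs a full job
```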

Mic-drop, Joe out

@meta-cla bot added the CLA Signed label Sep 3, 2025
@joecummings changed the title from "[Do not review] Add comment at end to remind to re-initialize the conda environment" to "Fix install script for all dependencies without breaking conda (Real!)" Sep 3, 2025
@joecummings marked this pull request as ready for review September 3, 2025 20:42
@joecummings (Member, Author) commented:

We should make it possible to source the install.sh script so we don't have to re-activate the conda environment (right now it kills ur terminal lol)
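
One way to make an install script safe to `source` is to detect how it was invoked and avoid a bare `exit`, since `exit` in a sourced file exits the calling shell. A bash-only sketch; `SCRIPT_SOURCED` is a hypothetical variable name, and this is not the repository's install.sh:

```bash
#!/usr/bin/env bash
# `return` is only legal inside a function or a sourced file, so probing it in a
# subshell tells us (in bash) whether this file was sourced or executed.
if (return 0 2>/dev/null); then
  SCRIPT_SOURCED=1
else
  SCRIPT_SOURCED=0
fi

# At the script's top level (not inside a function), a replacement for a bare
# `exit` could be:
#   return "$status" 2>/dev/null || exit "$status"
# When sourced, `return` leaves the file and the caller's shell survives;
# when executed, `return` fails and `exit` runs instead.

if [ "$SCRIPT_SOURCED" -eq 1 ]; then
  echo "sourced: environment changes persist in the calling shell"
else
  echo "executed: environment changes end with this process"
fi
```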

A contributor commented:

lol

@joecummings merged commit 63d1c35 into meta-pytorch:main Sep 3, 2025
5 checks passed
@joecummings deleted the whoops branch September 3, 2025 21:21