POC rescue console endpoint by olethanh · Pull Request #807 · aleph-im/aleph-vm

olethanh · 2025-06-03T13:41:21Z

Explain what problem this PR is resolving

Related ClickUp, GitHub or Jira tickets : ALEPH-XXX

Self proofreading checklist

The new code is clear, easy to read and well commented.
New code does not duplicate the functions of builtin or popular libraries.
An LLM was used to review the new code and look for simplifications.
New classes and functions contain docstrings explaining what they provide.
All new code is covered by relevant tests.
Documentation has been updated regarding these changes.
Dependency updates in the project.toml have been mirrored in the Debian package build script packaging/Makefile.

Changes

Explain the changes that were made. The idea is not to list exhaustively all the changes made (GitHub already provides a full diff), but to help the reviewers better understand:

which specific file changes go together, e.g: when creating a table in the front-end, there usually is a config file that goes with it
the reasoning behind some changes, e.g: deleted files because they are now redundant
the behaviour to expect, e.g: tooltip has purple background color because the client likes it so, changed a key in the API response to be consistent with other endpoints

How to test

Explain how to test your PR.
If a specific config is required explain it here (account, data entry, ...)

Print screen / video

Upload here screenshots or videos showing the changes if relevant.

Notes

Things that the reviewers should know: known bugs that are out of the scope of the PR, other trade-offs that were made.
If the PR depends on a PR in another repo, or merges into another PR (instead of main), it should also be mentioned here.

Occasionally, VM creation failed because the assigned TAP network interface already existed, likely due to improper teardown from a previous execution or a concurrency issue. Displayed error: `OSError: [Errno 16] Device or resource busy`. This caused a retry loop that blocked the process.

Solution: When assigning a VM ID, check that the network interface for that VM doesn't already exist. This acts as a double check against various issues.
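The check described above could look roughly like the following sketch. The function names, the `vmtap{id}` naming pattern, the id range and the injectable `exists` callback are assumptions made for illustration, not the code from the repository. On Linux, each network interface is visible as an entry under `/sys/class/net`.

```python
from pathlib import Path
from typing import Callable, Iterable


def interface_exists(name: str) -> bool:
    """Return True if a network interface with this name exists (Linux only)."""
    return Path("/sys/class/net", name).exists()


def pick_vm_id(
    used_ids: Iterable[int],
    start: int = 4,
    end: int = 1024,
    exists: Callable[[str], bool] = interface_exists,
) -> int:
    """Pick the first free VM id whose TAP interface does not already exist.

    The `vmtap{id}` naming pattern and the id range are hypothetical.
    """
    used = set(used_ids)
    for vm_id in range(start, end):
        if vm_id in used:
            continue
        if exists(f"vmtap{vm_id}"):
            # Leftover interface (e.g. from a failed teardown): skip this id
            continue
        return vm_id
    raise RuntimeError("No free VM id available")
```

Passing `exists` as a parameter keeps the selection logic testable without touching real network interfaces.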


Endpoint `/status/check/fastapi` was raising an error when there was no internet inside the VM. This was caused by `check_internet` raising an error instead of just returning False.

Note: Contrary to what was previously understood, the diagnostic VM doesn't return headers when the result is False.

Previous stacktrace:
```
ValueError: The server cannot connect to Internet
  File "aiohttp/web_app.py", line 537, in _handle
    resp = await handler(request)
  File "aiohttp/web_middlewares.py", line 114, in impl
    return await handler(request)
  File "aleph/vm/orchestrator/supervisor.py", line 70, in server_version_middleware
    resp: web.StreamResponse = await handler(request)
  File "aleph/vm/orchestrator/views/__init__.py", line 215, in status_check_fastapi
    "internet": await status.check_internet(session, fastapi_vm_id),
  File "aleph/vm/orchestrator/status.py", line 124, in check_internet
    raise ValueError("The server cannot connect to Internet")
```
Sentry issue: https://alephim.sentry.io/issues/5654330290/
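A minimal sketch of the "return False instead of raising" behaviour is shown below. The function name and the `result` / `headers` keys of the response body are assumptions for illustration; the real `check_internet` in `aleph/vm/orchestrator/status.py` may be structured differently.

```python
def internet_check_passed(body: dict) -> bool:
    """Interpret the diagnostic VM's answer without raising.

    `body` is assumed to be the decoded JSON response (hypothetical shape).
    """
    if not body.get("result"):
        # On failure the body carries no "headers" key, so never index it here
        return False
    # On success, the headers of the outgoing request are expected to be present
    return "headers" in body
```

With this shape, the status endpoint can report `"internet": False` as part of a normal response rather than failing with a server error.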

This was causing an error in the monitor payment task:

AttributeError: 'FunctionEnvironment' object has no attribute 'trusted_execution'
  File "aleph/vm/utils/__init__.py", line 90, in run_and_log_exception
    return await coro
  File "aleph/vm/orchestrator/tasks.py", line 154, in monitor_payments
    executions = [execution for execution in executions if execution.is_confidential]
  File "aleph/vm/orchestrator/tasks.py", line 154, in <listcomp>
    executions = [execution for execution in executions if execution.is_confidential]
  File "aleph/vm/models.py", line 113, in is_confidential
    return True if self.message.environment.trusted_execution else False
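One defensive way to avoid this kind of `AttributeError` is to read the attribute with a default. The two dataclasses below are simplified stand-ins, not the real `aleph-message` models, and this is only a sketch of the idea rather than the fix actually applied.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FunctionEnvironment:
    """Stand-in for a program environment, which has no `trusted_execution` field."""
    internet: bool = True


@dataclass
class InstanceEnvironment:
    """Stand-in for an instance environment, which may enable trusted execution."""
    trusted_execution: Optional[dict] = None


def is_confidential(environment: object) -> bool:
    # getattr with a default avoids AttributeError on environments lacking the field
    return bool(getattr(environment, "trusted_execution", None))
```

Programs then simply count as non-confidential instead of breaking the list comprehension in the payment monitoring task.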



* Fix: Solve failing test by removing it, because it is not used.
* Problem: If a user allocates a VM and later forgets the VM, the payment task fails because it cannot get the price for that execution.
  Solution: Check the message status before checking the price, and remove the execution if it is forgotten or in any status other than `processed`.
* Fix: Solve code style issues.
* Fix: Explain the reason for using a direct API call instead of the connector.
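The status check described in the list above can be sketched as follows. The `get_status` callback and the plain dictionary of executions are hypothetical simplifications; the actual task queries the message status through the API and works on execution objects.

```python
from typing import Callable, Dict, List


def drop_unprocessed(
    executions: Dict[str, object],
    get_status: Callable[[str], str],
) -> List[str]:
    """Remove executions whose message is not `processed`.

    Returns the item hashes that were removed, so the caller can stop those VMs
    before any price lookup is attempted for them.
    """
    removed = [item_hash for item_hash in executions if get_status(item_hash) != "processed"]
    for item_hash in removed:
        del executions[item_hash]
    return removed
```

Running this filter before computing prices means a forgotten message never reaches the price lookup that used to fail.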

* Fix: Solve failing test by removing it, because it is not used.
* Problem: If the service restarts, the diagnostic VM fails because of network issues.
  Solution: Load the already loaded VMs, filtering for persistent ones only.
* Fix: Replaced the interface check with interface removal and re-creation.
* Fix: Ensure the IPv6 address is deleted first, before trying to delete the interface, in case the interface deletion fails.
* Fix: Also delete the IPv4 address to prevent two interfaces from having the same IPv4.

---------

Co-authored-by: Andres D. Molins <nesitor@gmail.com>
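The ordering described in the fixes above (release the addresses first, then delete the interface) can be illustrated with the standard `ip` command. This is only a sketch: the helper name and the example values are hypothetical, and the project may well use a netlink library rather than shelling out.

```python
from typing import List, Optional


def teardown_commands(
    device: str,
    ipv6: Optional[str] = None,
    ipv4: Optional[str] = None,
) -> List[List[str]]:
    """Build the `ip` commands that release addresses, then delete the interface."""
    commands: List[List[str]] = []
    if ipv6:
        # Remove the IPv6 address first, so it is released even if the
        # interface deletion below fails
        commands.append(["ip", "-6", "addr", "del", ipv6, "dev", device])
    if ipv4:
        # Also remove the IPv4 address to avoid two interfaces sharing it
        commands.append(["ip", "addr", "del", ipv4, "dev", device])
    commands.append(["ip", "link", "delete", device])
    return commands
```

Building the commands as data keeps the ordering easy to verify before anything is executed with elevated privileges.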


* Problem: Execution tests were very slow.
  Solution: This was due to an import in the test app that is somehow very slow, but only during testing. The cause has not been identified, but a workaround delays the import so it is not hit during the tests.
* Fix: 'real' execution tests were testing the fake VM. This was due to settings contamination, which made them run the FAKE_DATA_PROGRAM instead of the real one. Also correct some things that prevented the test from running (load_update_mesage instead of get_message).
* Correct the workflow name. It was the same name as another workflow, which caused issues in GitHub.
* Execution tests were failing on Python 3.12, due to a change in behaviour of unix_socket.wait_closed.
* Symlinks don't work, so make a copy instead.
* Add vm-connector in test runner.
* Increase timeout for CI.
* Update comment src/aleph/vm/hypervisors/firecracker/microvm.py

Co-authored-by: Hugo Herter <git@hugoherter.com>

Symptoms: Could not allocate a VM on some Ubuntu servers because the wait_for_init/ping was failing.
```
2024-09-03 12:18:47,259 | DEBUG | command: ping -c 1 -W 2.0 172.16.4.2
2024-09-03 12:18:47,259 | ERROR | Command failed with error code 1:
    stdin = None
    command = ['ping', '-c', '1', '-W', '2.0', '172.16.4.2']
    stdout = b"ping: invalid value (`2.0' near `.0')\n"
2024-09-03 12:18:47,260 | ERROR | Traceback (most recent call last):
  File "/home/olivier/pycharm/aleph-vm/src/aleph/vm/utils/__init__.py", line 186, in ping
    await run_in_subprocess(["ping", "-c", str(packets), "-W", str(timeout), host], check=True)
  File "/home/olivier/pycharm/aleph-vm/src/aleph/vm/utils/__init__.py", line 121, in run_in_subprocess
    raise subprocess.CalledProcessError(process.returncode, str(command), stderr.decode())
subprocess.CalledProcessError: Command '['ping', '-c', '1', '-W', '2.0', '172.16.4.2']' returned non-zero exit status 1.
```
Causes: The root cause seems to be that the ping command from the deb package inetutils-ping 2.5-3ubuntu4 doesn't accept a float for its -W argument, while the ping command from the package 'iputils-ping', which we use on other servers, accepts it.

Solution: Convert the argument to an int, since we didn't use the fractional part. This allows compatibility with both versions of the binary.
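The conversion can be sketched as below. The helper name `build_ping_command` is hypothetical; only the argument list mirrors the command shown in the log above.

```python
def build_ping_command(host: str, packets: int = 1, timeout: float = 2.0) -> list[str]:
    """Build a ping command whose -W value is always an integer string."""
    # Per the description above, inetutils-ping rejects "2.0" while both
    # implementations accept "2", so the fractional part is dropped
    return ["ping", "-c", str(packets), "-W", str(int(timeout)), host]
```

Note that `int()` truncates, so a sub-second timeout such as `0.5` would become `0`; callers relying on fractional timeouts would need a different approach.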

INFO: the setting `PAYMENT_RPC_API` has been renamed to `RPC_AVAX`.

Problem: Base chain isn't supported.

Solutions:
* Add src/aleph/vm/orchestrator/chain.py to store the available chains
* Display available_payments in status_public_config
* Add checks that the chain sent is in STREAM_CHAINS
* Fix: use chain_info.super_token instead of settings.PAYMENT_SUPER_TOKEN
* Update dependency superfluid to aleph-superfluid==0.2.1
* Fix: wrong logic in monitor_payments for payg

Co-authored-by: nesitor <amolinsdiaz@yahoo.es>
Co-authored-by: Olivier Le Thanh Duong <olivier@lethanh.be>
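A chain registry of the kind described above might be organised as in this sketch. The `ChainInfo` fields, the `get_chain` helper, the RPC URLs and the token addresses are placeholders chosen for illustration and are not taken from `chain.py`; only the chain IDs (43114 for Avalanche C-Chain, 8453 for Base) are real values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChainInfo:
    """Per-chain payment settings (hypothetical field set)."""
    chain_id: int
    rpc: str
    super_token: str


# Placeholder RPC URLs and token addresses, for illustration only
STREAM_CHAINS = {
    "AVAX": ChainInfo(43114, "https://rpc.example.invalid/avax", "0x" + "0" * 40),
    "BASE": ChainInfo(8453, "https://rpc.example.invalid/base", "0x" + "0" * 40),
}


def get_chain(name: str) -> ChainInfo:
    """Return the settings for a supported chain, or raise ValueError."""
    try:
        return STREAM_CHAINS[name]
    except KeyError:
        raise ValueError(f"Unsupported chain: {name}") from None
```

Looking up `super_token` through the per-chain entry, rather than through a single global setting, is what allows more than one chain to be supported at once.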

Debian 11 provided Python 3.9. This branch removes the support for both Debian 11 and Python 3.9. The oldest distribution supported is now Ubuntu 22.04 with Python 3.10. Note that mentions of Debian 11 were replaced in some example files that were not maintained, and the change has not been tested. These remain to serve as examples for developers.


There was a teardown() inside a __del__ which was triggered by garbage collection. This resulted in an unclear lifecycle and strange log errors, since the teardown had already been triggered before, and it made for strange errors when running tests.

Solution: remove it.
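For illustration, an explicit lifecycle without `__del__` could look like the sketch below. The class is hypothetical and much simpler than the real VM classes; it only shows teardown being called deliberately (here through a context manager) and being safe to call twice.

```python
class ManagedResource:
    """Hypothetical resource with an explicit, idempotent teardown."""

    def __init__(self) -> None:
        self.torn_down = False
        self.teardown_calls = 0

    def teardown(self) -> None:
        if self.torn_down:
            # Already cleaned up: a second call is a no-op
            return
        self.teardown_calls += 1
        self.torn_down = True

    def __enter__(self) -> "ManagedResource":
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        # Cleanup runs at a well-defined point, not whenever the GC decides
        self.teardown()
```

Because no `__del__` is defined, cleanup never runs at an unpredictable time during garbage collection, which is what produced the confusing log errors described above.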

The execution and instance tests were failing if the runtime or disk image were not properly set up, but it was not clear to the user why. Solution: Use xfail to display a message to the user on how to set it up properly.

> tests/supervisor/test_execution.py::test_create_execution XFAIL (Test Runtime not setup. run `cd runtimes/aleph-debian-12-python && sudo ./create_disk_image.sh`)


olethanh and others added 30 commits August 16, 2024 20:54

CI: Add more debugging info

189124b

CI : Force the eth_typing depency that was causing issue

7622499

Add duration info to pytest

ae78406

Do not reuse the id of any vm in pool.executions

885ff75

Problem: Login token was not display with default conf (#673)

2f68012

Solution: Display logging token at info logging level

Problem: Websocked auth for fail user was not returning error (#675)

2414198

Due to a previous refactoring, the code wasn't reachable, and thus the code only hung if the user was not the correct one. Solution: Code correction.

Solve failing tests on main branch (#678)

0b4fbfd

Fix: Solve failing test by removing it, because it is not used.

Fix: Update new aleph-message package version. (#683)

cd6463c

Update aleph_message package on packaging steps (#684)

d66de42

Fix: Update new `aleph-message` package version on packaging steps.

Provide a template for new PRs (#667)

b74d05f

ci/fix(test-using-pytest): ensure hatch is always installed when needed

1c888e2

Fix: Automated fixes with ruff check src --fix

c518253

Fix: Ruff errored on exception raising

8f56dbe

The message should be passed as a variable. Obtained with `ruff check src --fix --unsafe-fixes`

Fix: Python < 3.10 is no supported anymore

7ecb84e

Fix: Type annotations could be improved

590daa1

Fix: Generator loop -> 'yield from'

48086a1

more ruffs fixes

1beef1e

Fix: Useless comma could be removed

de27ed0

Fix Pydantic error

acc2302

olethanh added 28 commits May 26, 2025 11:37

Adapt the tests

7550e53

add test for /about/capability

29f403f

Fix dependencies in ci

fe6f98d

Fix tests and parsing

e2fda77

mod: uniformize coding style

7865b42

wip

d24b7c1

Fix started_at not being set for program

6bc897a

DB: fix execution.save crashing when updating

44f343a

Handle user aggregate not existing gracefully

049bf3d

Fix alembic crash when generating migrations

d4d52d4

SAve port mapping to database

f10a627

firewall Do not create prerouting chain if it already exists

a82b9b0

fix protocol specification

56b3bef

Do not assign an already forward port, even if nothing is listening on it yet

09b6599

fix db migration

c2efaf4

fix mapped_port_format

7992d8f

add remove port redirection method

ce6b4e9

fixup! Do not assign an already forward port, even if nothing is listening on it yet

7671b92

Update ports

6c1103a

Add /control/update endpoint to refresh port config

16d0fd8

add host ipv4 to v2 endpoint

typing and test

536edcb

mod rework network

a61624a

mod rework network

c7c7cd4

fix tests

890b091

Add 2 tests for operate_update

0b7fc79

Fix cleanup issue

eb353cc

POC Rescue console , auth disabled. log not working

675d3e0

add sample client

1869b99

olethanh changed the title ~~Ol poc rescue console~~ POC rescue console endpoint Jun 3, 2025

aliel force-pushed the main branch from a8aa651 to 7a0bff1 Compare February 18, 2026 09:22

olethanh wants to merge 1075 commits into main from ol-poc-rescue-console


9 participants
