POC rescue console endpoint#807

Draft
olethanh wants to merge 1075 commits into main from ol-poc-rescue-console


@olethanh olethanh commented Jun 3, 2025

Explain what problem this PR is resolving

Related ClickUp, GitHub or Jira tickets : ALEPH-XXX

Self proofreading checklist

  • The new code is clear, easy to read and well commented.
  • New code does not duplicate the functions of builtin or popular libraries.
  • An LLM was used to review the new code and look for simplifications.
  • New classes and functions contain docstrings explaining what they provide.
  • All new code is covered by relevant tests.
  • Documentation has been updated regarding these changes.
  • Dependency updates in the project.toml have been mirrored in the Debian package build script packaging/Makefile.

Changes

Explain the changes that were made. The idea is not to list exhaustively all the changes made (GitHub already provides a full diff), but to help the reviewers better understand:

  • which specific file changes go together, e.g: when creating a table in the front-end, there usually is a config file that goes with it
  • the reasoning behind some changes, e.g: deleted files because they are now redundant
  • the behaviour to expect, e.g: tooltip has purple background color because the client likes it so, changed a key in the API response to be consistent with other endpoints

How to test

Explain how to test your PR.
If a specific config is required explain it here (account, data entry, ...)

Print screen / video

Upload here screenshots or videos showing the changes if relevant.

Notes

Things that the reviewers should know: known bugs that are out of the scope of the PR, other trade-offs that were made.
If the PR depends on a PR in another repo, or merges into another PR (instead of main), it should also be mentioned here.

olethanh and others added 30 commits August 16, 2024 20:54
Occasionally, VM creation failed because the assigned TAP network interface already existed, likely due to improper teardown from a previous execution or a concurrency issue.

Displayed Error:
OSError: [Errno 16] Device or resource busy

This caused a retry loop, blocking the process.

Solution:
When assigning a VM ID, check that the network interface for that VM doesn't already exist. This acts as a double check for various issues.
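
The double check described above can be sketched as follows. This is a minimal illustration, not the project's actual allocation code: the helper names and the `vmtap<ID>` naming scheme are assumptions, and it relies on Linux listing one entry per network interface under `/sys/class/net`.

```python
from pathlib import Path

# Linux exposes one entry per network interface in this directory.
SYS_CLASS_NET = Path("/sys/class/net")


def interface_exists(name: str, sys_net: Path = SYS_CLASS_NET) -> bool:
    """Return True if a network interface with this name is already present."""
    return (sys_net / name).exists()


def pick_free_vm_id(candidates, used_ids, sys_net: Path = SYS_CLASS_NET) -> int:
    """Return the first candidate ID that is unused and has no leftover TAP interface."""
    for vm_id in candidates:
        if vm_id in used_ids:
            continue
        # Assumed naming scheme "vmtap<ID>", for illustration only.
        if interface_exists(f"vmtap{vm_id}", sys_net):
            continue  # leftover interface from a previous run: skip this ID
        return vm_id
    msg = "No free VM ID available"
    raise RuntimeError(msg)
```

Skipping an ID whose interface is still present avoids the `Device or resource busy` retry loop, at the cost of leaving the stale interface for a separate cleanup step.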
Solution: Display logging token at info logging level
Endpoint `/status/check/fastapi` was raising an error when there was no
internet access inside the VM.

This was caused by `check_internet` raising an error instead of just
returning False.

Note: Contrary to what was previously understood, the diagnostic VM doesn't
return headers when the result is False.

Previous stacktrace
```
ValueError: The server cannot connect to Internet
  File "aiohttp/web_app.py", line 537, in _handle
    resp = await handler(request)
  File "aiohttp/web_middlewares.py", line 114, in impl
    return await handler(request)
  File "aleph/vm/orchestrator/supervisor.py", line 70, in server_version_middleware
    resp: web.StreamResponse = await handler(request)
  File "aleph/vm/orchestrator/views/__init__.py", line 215, in status_check_fastapi
    "internet": await status.check_internet(session, fastapi_vm_id),
  File "aleph/vm/orchestrator/status.py", line 124, in check_internet
    raise ValueError("The server cannot connect to Internet")
```

Sentry issue : https://alephim.sentry.io/issues/5654330290/
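
A minimal sketch of the corrected behaviour, assuming the check receives a JSON body shaped like `{"result": bool, "headers": {...}}` where `headers` may be missing. The `fetch` callable is a hypothetical stand-in for the HTTP call to the diagnostic VM; the real function takes a session and a VM ID, as the stack trace shows.

```python
import asyncio
from typing import Awaitable, Callable

# Hypothetical fetcher: returns the decoded JSON body of the VM's internet check.
Fetch = Callable[[], Awaitable[dict]]


async def check_internet(fetch: Fetch) -> bool:
    """Return False when the VM reports no connectivity, instead of raising."""
    try:
        result = await fetch()
    except (OSError, asyncio.TimeoutError):
        # A network failure while querying the VM counts as "no internet".
        return False
    # "headers" may be absent when the result is False, so it is not accessed here.
    return bool(result.get("result"))
```

With this shape, the status endpoint can report `"internet": false` in its response rather than failing with a `ValueError`.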
This was causing an error in the monitor payment task

AttributeError: 'FunctionEnvironment' object has no attribute 'trusted_execution'
  File "aleph/vm/utils/__init__.py", line 90, in run_and_log_exception
    return await coro
  File "aleph/vm/orchestrator/tasks.py", line 154, in monitor_payments
    executions = [execution for execution in executions if execution.is_confidential]
  File "aleph/vm/orchestrator/tasks.py", line 154, in <listcomp>
    executions = [execution for execution in executions if execution.is_confidential]
  File "aleph/vm/models.py", line 113, in is_confidential
    return True if self.message.environment.trusted_execution else False
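
One way to avoid this `AttributeError` is to read the attribute defensively. The sketch below is an assumption about a possible fix, not the change actually made; the two environment classes are simplified stand-ins with illustrative fields.

```python
from dataclasses import dataclass
from typing import Optional


# Simplified stand-ins for the message environment classes; fields are assumptions.
@dataclass
class FunctionEnvironment:
    internet: bool = True  # has no trusted_execution attribute at all


@dataclass
class InstanceEnvironment:
    internet: bool = True
    trusted_execution: Optional[dict] = None


def is_confidential(environment: object) -> bool:
    """True only if the environment carries a non-empty trusted_execution field."""
    # getattr with a default avoids AttributeError on environments lacking the field.
    return bool(getattr(environment, "trusted_execution", None))
```

Programs (function environments) are then simply treated as non-confidential when the payment monitor filters executions.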
Due to a previous refactoring, the code wasn't reachable,
and thus the code only hung if the user was not the correct one.

Solution: Code correction
Fix: Solve failing test by removing it because it is not used.
* Fix: Solve failing test by removing it because it is not used.

* Problem: If a user allocates a VM and later forgets the VM, the payment task fails because it cannot get the price for that execution.

Solution: Check the message status before checking the price, and remove the execution if it is forgotten or in a status other than `processed`.

* Fix: Solve code style issues.

* Fix: Explain the reason to use a direct API call instead of using the connector.
* Fix: Solve failing test by removing it because it is not used.

* Problem: If the service restarts, the diagnostic VM fails due to network issues.

Solution: Load the already loaded VMs, keeping only the persistent ones.

* Fix: Replaced the interface check with interface removal and re-creation.

* Fix: Delete the IPv6 address first, before trying to delete the interface, in case the interface deletion fails.

* Fix: Also delete the IPv4 address to prevent two interfaces with the same IPv4 address.

---------

Co-authored-by: Andres D. Molins <nesitor@gmail.com>
Fix: Update new `aleph-message` package version on packaging steps.
* Problem: Execution tests were very slow

Solution: This was due to an import in the test app that is somehow
very slow, but only during testing.

Haven't figured out why it is slow, but have implemented a workaround
that delays the import so it's not hit during the tests.

* Fix: 'real' execution tests were testing the fake VM

This was due to settings contamination, which made them run the FAKE_DATA_PROGRAM instead of the real one.

Also correct some things that made the tests not run (load_update_mesage
instead of get_message).

* Correct the workflow name

It had the same name as another workflow, which caused issues in GitHub.

* Execution tests were failing on Python 3.12

Due to a change in the behaviour of unix_socket.wait_closed.

* Symlinks don't work, so make a copy instead

* add vm-connector in test runner

* Increase timeout for ci

* Update comment src/aleph/vm/hypervisors/firecracker/microvm.py


Co-authored-by: Hugo Herter <git@hugoherter.com>
Symptoms:
Could not allocate a VM on some Ubuntu servers because the
wait_for_init/ping was failing.

```
2024-09-03 12:18:47,259 | DEBUG | command: ping -c 1 -W 2.0 172.16.4.2
2024-09-03 12:18:47,259 | ERROR | Command failed with error code 1:
    stdin = None
    command = ['ping', '-c', '1', '-W', '2.0', '172.16.4.2']
    stdout = b"ping: invalid value (`2.0' near `.0')\n"
2024-09-03 12:18:47,260 | ERROR |
Traceback (most recent call last):
  File "/home/olivier/pycharm/aleph-vm/src/aleph/vm/utils/__init__.py", line 186, in ping
    await run_in_subprocess(["ping", "-c", str(packets), "-W", str(timeout), host], check=True)
  File "/home/olivier/pycharm/aleph-vm/src/aleph/vm/utils/__init__.py", line 121, in run_in_subprocess
    raise subprocess.CalledProcessError(process.returncode, str(command), stderr.decode())
subprocess.CalledProcessError: Command '['ping', '-c', '1', '-W', '2.0', '172.16.4.2']' returned non-zero exit status 1.
```

Causes:
The root cause seems to be that the ping command from the deb package
inetutils-ping 2.5-3ubuntu4 doesn't accept a float for its -W argument,

while the ping command from the package 'iputils-ping', which we use on
other servers, accepts it.

Solution:
Convert the argument to an int since we didn't use the fractional part.
This allows compatibility with both versions of the binary.
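
The conversion can be illustrated with a small command builder. This is a sketch only: `build_ping_command` is a hypothetical helper, while the argument order mirrors the command shown in the log above.

```python
def build_ping_command(host: str, packets: int = 1, timeout: float = 2.0) -> list[str]:
    """Build a ping command whose -W value is a whole number of seconds."""
    # int() truncates 2.0 to 2, avoiding the "invalid value" error from inetutils-ping.
    return ["ping", "-c", str(packets), "-W", str(int(timeout)), host]
```

Note that `int()` truncates toward zero, so a timeout below one second would become `0`; a caller that needs sub-second values may prefer to round up instead.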
INFO: the settings `PAYMENT_RPC_API` has been renamed to `RPC_AVAX`

Problem:
Base chain isn't supported.

Solutions:
Add src/aleph/vm/orchestrator/chain.py to store the available chains
Display available_payments in status_public_config
Add checks that the chain sent is in STREAM_CHAINS
Fix: use chain_info.super_token instead of settings.PAYMENT_SUPER_TOKEN
Update dependency superfluid to aleph-superfluid==0.2.1
Fix: wrong logic in monitor_payments for payg
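
The chain check mentioned above could look roughly like this. The enum members, the contents of `STREAM_CHAINS` and the helper name are assumptions made for illustration; the actual definitions live in src/aleph/vm/orchestrator/chain.py and may differ.

```python
from enum import Enum


class Chain(str, Enum):
    """Hypothetical subset of chain identifiers, for illustration only."""

    AVAX = "AVAX"
    BASE = "BASE"
    ETH = "ETH"


# Assumed contents: chains on which payment streams are accepted.
STREAM_CHAINS = {Chain.AVAX, Chain.BASE}


def is_supported_stream_chain(name: str) -> bool:
    """Return True if the chain name sent by the client is a known stream chain."""
    try:
        chain = Chain(name)
    except ValueError:
        return False  # unknown chain name
    return chain in STREAM_CHAINS
```

Rejecting unknown or unsupported chains early keeps the payment monitoring code from having to handle chains it has no RPC endpoint or super token for.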

Co-authored-by: nesitor <amolinsdiaz@yahoo.es>
Co-authored-by: Olivier Le Thanh Duong <olivier@lethanh.be>
Debian 11 provided Python 3.9.

This branch removes the support for both Debian 11 and Python 3.9.
The oldest distribution supported is now Ubuntu 22.04 with Python 3.10.

Note that mentions of Debian 11 were replaced in
some example files that were not maintained, and
the change has not been tested. These remain to
serve as examples for developers.
The message should be passed as a variable.

Obtained with `ruff check src --fix --unsafe-fixes`
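
For readers unfamiliar with this style, the pattern is to bind the exception message to a variable before raising. The function below is a hypothetical example, not code from this repository.

```python
def require_positive(value: int) -> int:
    """Return value unchanged, or raise ValueError if it is not positive."""
    if value <= 0:
        # The message is bound to a variable first, then passed to the exception.
        msg = f"Expected a positive value, got {value}"
        raise ValueError(msg)
    return value
```

This keeps the traceback from repeating the message text on the `raise` line.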
There was a teardown() inside a __del__,
which was triggered by garbage collection.
This resulted in an unclear lifecycle and strange log errors, since the teardown had already been triggered before,
and caused strange errors when running tests.

Solution: remove it
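
A sketch of the resulting lifecycle, assuming cleanup is always requested explicitly by the owner. The `Resource` class is a hypothetical stand-in, and the context manager is one optional way to make the explicit call hard to forget; it is not described in the commit.

```python
class Resource:
    """Hypothetical stand-in for an object that owns something needing cleanup."""

    def __init__(self) -> None:
        self.torn_down = False
        self.teardown_calls = 0

    def teardown(self) -> None:
        """Release resources; calling it again is a no-op."""
        if self.torn_down:
            return
        self.torn_down = True
        self.teardown_calls += 1

    # No __del__: cleanup never depends on when garbage collection runs.
    def __enter__(self) -> "Resource":
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        self.teardown()
```

Because teardown only runs when asked for, log output and test failures point at the code that requested it rather than at the garbage collector.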
The execution and instance tests were failing if the runtime or disk image was not properly set up, but it was not clear to the user why.

Solution:
Use xfail to display a message to the user on how to set them up properly.

> tests/supervisor/test_execution.py::test_create_execution XFAIL (Test
Runtime not setup. run `cd runtimes/aleph-debian-12-python && sudo
./create_disk_image.sh`)
@olethanh olethanh changed the title Ol poc rescue console POC rescue console endpoint Jun 3, 2025