Problem: VM creation occasionally failed because the assigned TAP network interface already existed, likely due to improper teardown from a previous execution or a concurrency issue.
Displayed error: `OSError: [Errno 16] Device or resource busy`
This caused a retry loop that blocked the process.
Solution: When assigning a VM ID, check that the network interface for that VM doesn't already exist. This acts as a double check against several of these issues.
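A minimal sketch of such a double check, under stated assumptions: TAP interfaces are assumed to be named `vmtap<id>`, and the check relies on Linux listing every interface under `/sys/class/net`. The helper names and the `sysfs_root` parameter are hypothetical and are not taken from the project's code.

```python
from pathlib import Path
from typing import Iterable

SYSFS_NET = Path("/sys/class/net")  # Linux exposes one entry per network interface here


def tap_interface_name(vm_id: int) -> str:
    # Assumed naming scheme, for illustration only.
    return f"vmtap{vm_id}"


def interface_exists(name: str, sysfs_root: Path = SYSFS_NET) -> bool:
    # An entry under sysfs_root means the kernel already knows this interface.
    return (sysfs_root / name).exists()


def pick_vm_id(candidates: Iterable[int], sysfs_root: Path = SYSFS_NET) -> int:
    """Return the first candidate ID whose TAP interface is not already present."""
    for vm_id in candidates:
        if not interface_exists(tap_interface_name(vm_id), sysfs_root):
            return vm_id
    msg = "No VM ID available: every candidate TAP interface already exists"
    raise RuntimeError(msg)
```

Skipping an ID whose interface is still present avoids entering the retry loop at all, whatever left the stale interface behind.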
Solution: Display logging token at info logging level
Endpoint `/status/check/fastapi` was raising an error when there was no
Internet access inside the VM.
This was caused by `check_internet` raising an error instead of just
returning False.
Note: Contrary to what was previously understood, the diagnostic VM doesn't
return headers when the result is False.
Previous stacktrace
```
ValueError: The server cannot connect to Internet
File "aiohttp/web_app.py", line 537, in _handle
resp = await handler(request)
File "aiohttp/web_middlewares.py", line 114, in impl
return await handler(request)
File "aleph/vm/orchestrator/supervisor.py", line 70, in server_version_middleware
resp: web.StreamResponse = await handler(request)
File "aleph/vm/orchestrator/views/__init__.py", line 215, in status_check_fastapi
"internet": await status.check_internet(session, fastapi_vm_id),
File "aleph/vm/orchestrator/status.py", line 124, in check_internet
raise ValueError("The server cannot connect to Internet")
```
Sentry issue : https://alephim.sentry.io/issues/5654330290/
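The behaviour change can be sketched as follows. This is not the project's `check_internet`: the `fetch` callable and the payload shape (a mapping with a `result` key and optional `headers`) are assumptions made to keep the example self-contained.

```python
import asyncio
from typing import Any, Awaitable, Callable, Mapping


async def check_internet(fetch: Callable[[], Awaitable[Mapping[str, Any]]]) -> bool:
    """Return True only if the diagnostic VM reports working Internet access."""
    try:
        payload = await fetch()
    except (OSError, asyncio.TimeoutError):
        # A failed request means "no Internet", not a server error.
        return False
    # Headers are only present on success, so they must not be required here.
    return bool(payload.get("result"))
```

Returning False lets the status endpoint report `"internet": false` instead of failing with an unhandled exception.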
This was causing an error in the monitor payment task
AttributeError: 'FunctionEnvironment' object has no attribute 'trusted_execution'
File "aleph/vm/utils/__init__.py", line 90, in run_and_log_exception
return await coro
File "aleph/vm/orchestrator/tasks.py", line 154, in monitor_payments
executions = [execution for execution in executions if execution.is_confidential]
File "aleph/vm/orchestrator/tasks.py", line 154, in <listcomp>
executions = [execution for execution in executions if execution.is_confidential]
File "aleph/vm/models.py", line 113, in is_confidential
return True if self.message.environment.trusted_execution else False
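The trace shows that a `FunctionEnvironment` has no `trusted_execution` attribute. One possible guard, shown here as a sketch only (the commit does not say how the fix was implemented, and both dataclasses are simplified stand-ins for the real message models):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FunctionEnvironment:
    # Stand-in for a program environment: it has no `trusted_execution` field.
    internet: bool = True


@dataclass
class InstanceEnvironment:
    # Stand-in for an instance environment, which may carry the field.
    trusted_execution: Optional[dict] = None


def is_confidential(environment: object) -> bool:
    # getattr with a default avoids the AttributeError on environments lacking the field.
    return bool(getattr(environment, "trusted_execution", None))
```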
Due to a previous refactoring, the code wasn't reachable, so the code only hung if the user was not the correct one. Solution: Code correction.
Fix: Solve failing test by removing it because it is not used.
* Fix: Solve failing test by removing it because it is not used.
* Problem: If a user allocates a VM and later forgets the VM, the payment task fails because it cannot get the price for that execution.
  Solution: Check the message status before checking the price, and remove the execution if it is forgotten or in any status other than `processed`.
* Fix: Solve code style issues.
* Fix: Explain the reason for using a direct API call instead of the connector.
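The status check before pricing could be sketched like this. The `Execution` stand-in, the `get_status` callable and the function name are hypothetical; only the `processed` status value comes from the commit message.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Tuple


@dataclass
class Execution:
    # Minimal stand-in: only the message hash is needed for this check.
    item_hash: str


def split_by_status(
    executions: Iterable[Execution],
    get_status: Callable[[str], Optional[str]],
) -> Tuple[List[Execution], List[Execution]]:
    """Separate executions that can be priced from those that should be removed."""
    billable: List[Execution] = []
    to_remove: List[Execution] = []
    for execution in executions:
        if get_status(execution.item_hash) == "processed":
            billable.append(execution)
        else:
            # Forgotten, or any other non-processed status: do not ask for a price.
            to_remove.append(execution)
    return billable, to_remove
```

Only the `billable` list would then be passed on to the price computation.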
* Fix: Solve failing test by removing it because it is not used.
* Problem: If the service restarts, the diagnostic VM fails because of network issues.
  Solution: Load the already loaded VMs, keeping only the persistent ones.
* Fix: Replaced the interface check with interface removal and re-creation.
* Fix: Ensure the IPv6 address is deleted first, before trying to delete the interface, to guard against the interface deletion failing.
* Fix: Also delete the IPv4 address to prevent two interfaces from having the same IPv4.

---------

Co-authored-by: Andres D. Molins <nesitor@gmail.com>
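The ordering described in these fixes (addresses first, then the interface, then re-creation) can be illustrated as a list of `ip` commands. This is a sketch only: the project may manage interfaces through a different mechanism, and the function name and example addresses are hypothetical.

```python
from typing import List, Optional


def recreate_tap_commands(
    device: str,
    ipv4: Optional[str] = None,
    ipv6: Optional[str] = None,
) -> List[List[str]]:
    """Build, in order, the commands to drop a stale TAP interface and re-create it."""
    commands: List[List[str]] = []
    if ipv6:
        # Remove the IPv6 address first, so it is gone even if the link deletion fails.
        commands.append(["ip", "-6", "addr", "del", ipv6, "dev", device])
    if ipv4:
        # Also remove the IPv4 address, so two interfaces never share it.
        commands.append(["ip", "addr", "del", ipv4, "dev", device])
    commands.append(["ip", "link", "delete", device])
    commands.append(["ip", "tuntap", "add", "dev", device, "mode", "tap"])
    return commands
```

The function only builds the argument lists; running them requires root privileges and is left to the caller.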
Fix: Update new `aleph-message` package version on packaging steps.
* Problem: Execution tests were very slow.
  Solution: This was due to an import in the test app that is somehow very slow, but only during testing. We haven't figured out why it is slow, but have implemented a workaround that delays the import so it's not hit during the tests.
* Fix: 'real' execution tests were testing the fake VM. This was due to a settings contamination which made it run the FAKE_DATA_PROGRAM instead of the real one. Also correct some things that made the test not run (load_update_mesage instead of get_message).
* Correct the workflow name. It was the same name as another workflow, which caused issues in GitHub.
* Execution tests were failing on Python 3.12 due to a change in behaviour of unix_socket.wait_closed.
* Symlinks don't work, so make a copy instead.
* Add vm-connector in test runner.
* Increase timeout for CI.
* Update comment src/aleph/vm/hypervisors/firecracker/microvm.py

Co-authored-by: Hugo Herter <git@hugoherter.com>
Symptoms:
Could not allocate a VM on some Ubuntu servers because the
wait_for_init/ping was failing.
```
2024-09-03 12:18:47,259 | DEBUG | command: ping -c 1 -W 2.0 172.16.4.2
2024-09-03 12:18:47,259 | ERROR | Command failed with error code 1:
stdin = None
command = ['ping', '-c', '1', '-W', '2.0', '172.16.4.2']
stdout = b"ping: invalid value (`2.0' near `.0')\n"
2024-09-03 12:18:47,260 | ERROR |
Traceback (most recent call last):
File "/home/olivier/pycharm/aleph-vm/src/aleph/vm/utils/__init__.py", line 186, in ping
await run_in_subprocess(["ping", "-c", str(packets), "-W", str(timeout), host], check=True)
File "/home/olivier/pycharm/aleph-vm/src/aleph/vm/utils/__init__.py", line 121, in run_in_subprocess
raise subprocess.CalledProcessError(process.returncode, str(command), stderr.decode())
subprocess.CalledProcessError: Command '['ping', '-c', '1', '-W', '2.0', '172.16.4.2']' returned non-zero exit status 1.
```
Causes:
The root cause seems to be that the ping command from the deb package
inetutils-ping 2.5-3ubuntu4 doesn't accept a float for its -W argument,
while the ping command from the package 'iputils-ping', which we use on
other servers, accepts it.
Solution:
Convert the argument to an int, since we didn't use the fractional part.
This allows compatibility with both versions of the binary.
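A minimal sketch of the conversion, mirroring the argument list visible in the traceback above. The helper name `build_ping_command` is hypothetical; the project's `ping` function passes the list to `run_in_subprocess`.

```python
from typing import List


def build_ping_command(host: str, packets: int = 1, timeout: float = 2.0) -> List[str]:
    """Build a ping command line accepted by both inetutils-ping and iputils-ping."""
    # int() drops the fractional part, so "-W 2.0" becomes "-W 2".
    return ["ping", "-c", str(packets), "-W", str(int(timeout)), host]
```

Note that `int()` truncates, so a sub-second timeout such as 0.5 would become 0. That is acceptable here only because the fractional part was not used.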
INFO: The setting `PAYMENT_RPC_API` has been renamed to `RPC_AVAX`.

Problem: Base chain isn't supported.

Solutions:
* Add src/aleph/vm/orchestrator/chain.py to store the available chains
* Display available_payments in status_public_config
* Add checks that the chain sent is in STREAM_CHAINS
* Fix: use chain_info.super_token instead of settings.PAYMENT_SUPER_TOKEN
* Update dependency superfluid to aleph-superfluid==0.2.1
* Fix: wrong logic in monitor_payments for payg

Co-authored-by: nesitor <amolinsdiaz@yahoo.es>
Co-authored-by: Olivier Le Thanh Duong <olivier@lethanh.be>
Debian 11 provided Python 3.9. This branch removes support for both Debian 11 and Python 3.9. The oldest supported distribution is now Ubuntu 22.04 with Python 3.10. Note that mentions of Debian 11 were replaced in some example files that were not maintained, and that change has not been tested. These files remain to serve as examples for developers.
The message should be passed as a variable. Obtained with `ruff check src --fix --unsafe-fixes`
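The pattern looks like the one enforced by Ruff's flake8-errmsg rules (for example EM101, which flags string literals passed directly to an exception); that mapping is an inference from the commit message. The function below is hypothetical and only illustrates the before and after shape.

```python
def require_internet(connected: bool) -> None:
    """Raise if the server has no Internet access."""
    if not connected:
        # Before the fix: raise ValueError("The server cannot connect to Internet")
        msg = "The server cannot connect to Internet"
        raise ValueError(msg)
```

Assigning the message first keeps the traceback from showing the same text twice, once in the source line and once in the exception.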
There was a teardown() inside a __del__, which was triggered by garbage collection. This resulted in an unclear lifecycle and strange log errors, since the teardown had already been triggered before, and made for strange errors when running tests. Solution: remove it.
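An explicit lifecycle without `__del__` could be sketched as follows. The class is hypothetical and synchronous for brevity; the project's actual teardown may be asynchronous and differ in detail.

```python
class ManagedVm:
    """Hypothetical resource whose cleanup is explicit rather than tied to __del__."""

    def __init__(self) -> None:
        self.active = True
        self.teardown_calls = 0

    def teardown(self) -> None:
        # Idempotent: a second call is a no-op instead of a confusing error.
        if not self.active:
            return
        self.active = False
        self.teardown_calls += 1

    def __enter__(self) -> "ManagedVm":
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        # Cleanup happens at a known point, not whenever the garbage collector runs.
        self.teardown()
```

With this shape, tests can rely on `with ManagedVm() as vm:` and know exactly when the teardown ran.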
The execution and instance tests were failing if the runtime or disk image were not properly set up, but it was not clear to the user why. Solution: Use xfail to display a message to the user on how to set them up properly.
> tests/supervisor/test_execution.py::test_create_execution XFAIL (Test Runtime not setup. run `cd runtimes/aleph-debian-12-python && sudo ./create_disk_image.sh`)
add host ipv4 to v2 endpoint