
Fix: vm-connector used obsolete software#594

Open
hoh wants to merge 992 commits into main from hoh-update-vm-connector

Conversation

@hoh
Member

@hoh hoh commented Apr 10, 2024

The vm-connector service used to run with obsolete versions of the aleph-sdk and Python.

Solution: Use the latest version of the SDK and pin the versions of dependencies.

olethanh and others added 5 commits April 10, 2024 17:25
Solution: Modify the hatch configuration to have the environment properly set up using the builtin virtual environment module

https://hatch.pypa.io/1.3/plugins/environment/virtual/
This problem was surfaced by another one.

The visible problem was that the new execution tests were hanging, inside the runtime, during the import at the line
 from aleph.sdk.chains.remote import RemoteAccount

After some more investigative work, it was pinpointed to an inner import of the eth_utils module (specifically eth_utils.network).

Second problem, which made the first one visible: in the runtime, the pre-compiled bytecode created during runtime creation in create_disk_image.sh was not used, which made module imports slower. This surfaced the first problem. The cause of that second problem was that the init1.py code, which runs the user code, was not launched with the same optimization level as the pre-compiled bytecode and thus recompiled everything (the level is specified in the init1.py #! shebang on the first line).

Solution: Compile the bytecode with the same optimisation level (-o 2)
as the one used at run time.

We haven't found out yet why the eth_utils.network import hangs when it
is not precompiled, but this fixes the test hanging issue.
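
The optimisation-level mismatch described above can be reproduced with the standard library alone. This is a minimal sketch and not the project's build script: the module name `demo.py` and both helper functions are made up for illustration. It shows that bytecode compiled at level 2 is stored in a separate `.opt-2.pyc` cache file, which an interpreter running at level 0 does not look up.

```python
import importlib.util
import py_compile
import tempfile
from pathlib import Path


def compile_at_level(source: Path, level: int) -> Path:
    """Compile `source` at the given optimisation level and return the cache file."""
    return Path(py_compile.compile(str(source), optimize=level, doraise=True))


def expected_cache(source: Path, level: int) -> Path:
    """Path where an interpreter running at `level` looks for cached bytecode."""
    # Level 0 uses "<name>.<tag>.pyc"; levels 1 and 2 add an ".opt-N" suffix.
    optimization = "" if level == 0 else level
    return Path(importlib.util.cache_from_source(str(source), optimization=optimization))


with tempfile.TemporaryDirectory() as tmp:
    module = Path(tmp) / "demo.py"  # hypothetical module, for illustration only
    module.write_text("VALUE = 1\n")
    compiled = compile_at_level(module, 2)
    # An interpreter started with -OO finds this file; a default one looks elsewhere
    # and therefore recompiles the module on import.
    found_by_oo = compiled == expected_cache(module, 2)
    found_by_default = compiled == expected_cache(module, 0)
```

Here `found_by_oo` is true and `found_by_default` is false, which is consistent with the recompilation seen when the launcher and the pre-compilation step disagree on the level.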
…asyncio'

Fix: async_sessionmaker was introduced in SQLAlchemy 2.0; ensure we have at least this version,
otherwise an older system package was used.
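
A guard of this kind could look as follows. This is only a sketch under assumptions: the helper names are hypothetical and the version parsing is deliberately naive (it reads the major version only). It is not the check used in the repository, where the fix is a version constraint on the dependency.

```python
from importlib import metadata
from typing import Optional


def major_version(version: str) -> int:
    """Return the leading integer of a version string such as '2.0.29'."""
    return int(version.split(".", 1)[0])


def has_async_sessionmaker(version: str) -> bool:
    """async_sessionmaker is available from SQLAlchemy 2.0 onwards."""
    return major_version(version) >= 2


def installed_sqlalchemy_version() -> Optional[str]:
    """Version of the SQLAlchemy distribution seen by this interpreter, if any."""
    try:
        return metadata.version("sqlalchemy")
    except metadata.PackageNotFoundError:
        return None
```

Calling `installed_sqlalchemy_version()` inside the target environment shows whether an older system package is shadowing the intended one.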
A frozen copy of the requirements.txt extracted from different systems
was present in the repository but not used nor maintained.
@github-actions

Failed to retrieve llama text: POST 504:

504 Gateway Time-out


The server didn't respond in time.

@hoh hoh force-pushed the hoh-update-vm-connector branch from cc6f432 to 6b1781f Compare April 15, 2024 09:03
@codecov

codecov Bot commented Apr 15, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 64.50%. Comparing base (eff82dc) to head (f8eab86).

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #594   +/-   ##
=======================================
  Coverage   64.50%   64.50%           
=======================================
  Files          78       78           
  Lines        7088     7088           
  Branches      598      598           
=======================================
  Hits         4572     4572           
  Misses       2314     2314           
  Partials      202      202           

nesitor and others added 22 commits April 16, 2024 09:27
Internet connectivity checks by the diagnostic VM relied on a single URL. If that endpoint was down, the internet connectivity of the system was assumed to be down.

Solution: Check connectivity to multiple endpoints in parallel.
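
A minimal sketch of such a parallel check, assuming an `asyncio`-based caller. The function `any_reachable` and its `check` parameter are hypothetical names; the actual diagnostic VM code may be structured differently. The probe itself is injected, so any HTTP client can be plugged in.

```python
import asyncio
from typing import Awaitable, Callable, Iterable


async def any_reachable(
    urls: Iterable[str],
    check: Callable[[str], Awaitable[bool]],
    timeout: float = 5.0,
) -> bool:
    """Return True if at least one URL passes `check` within `timeout` seconds."""

    async def safe_check(url: str) -> bool:
        # A failing or slow endpoint counts as unreachable instead of
        # aborting the checks of the other endpoints.
        try:
            return await asyncio.wait_for(check(url), timeout)
        except Exception:
            return False

    # All endpoints are probed concurrently.
    results = await asyncio.gather(*(safe_check(url) for url in urls))
    return any(results)
```

With this shape, a single endpoint being down no longer marks the whole system as offline.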
'aleph program' now needs an 'update' argument.
Solution: Update makefile and documentation
Problem: could not start Instances from command line

The problem happened when launching with --run-fake-instance
Solution: Adapt to the new VMPool API that takes a loop
Also fix the benchmarks function
Fix: Solve the last CORS errors, caused by duplicated headers being returned.
We published multiple changes to the diagnostic VM recently but none of these was released.

This provides a new diagnostic VM, based on a new runtime [1], with fixes:

- Reading messages with the newer SDK
- Better handling of IPv6 detection errors
- Two different tests for signing messages (local and remote)
- aleph-message version was not specified
- fetching a single message was not tested
Point to it in the documentation
and remove duplicated information here
Solution: Start by adding some simple tests
We don't test the full allocation and deallocation here, just auth.
Co-authored-by: nesitor <amolinsdiaz@yahoo.es>
When executed using `bash`, the `create_disk_image` was interrupted by a Python REPL due to the `python -OO` command being surrounded by backquotes.
Problem: Running system testing on DigitalOcean for every push consumed a lot of resources and failed frequently.

We now start to have integration testing using `pytest`, which provides a better confidence that things actually work.

Solution: Only test on DO open Pull Requests and not every push. In the future, consider only running when merged on `main`.
hoh added 2 commits March 12, 2025 09:27
This bumps their versions to
- Kubo 0.23.0 -> 0.33.1

The list of changes regarding Kubo is too large to be
 mentioned here, but I mostly expect performance
 improvements as the main API has not changed much.
Due to an error reading the denylist.

The error was:
```
Error: constructing the node (see log for full detail):
error walking /home/ipfs/.config/ipfs/denylists:
lstat /home/ipfs/.config/ipfs/denylists: permission denied
```

The [documentation](https://specs.ipfs.tech/compact-denylist-format/)
mentions that:

> Implementations SHOULD look in /etc/ipfs/denylists/ and
> $XDG_CONFIG_HOME/ipfs/denylists/ (default: ~/.config/ipfs/denylists)
> for denylist files.
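
The lookup rule quoted above can be sketched as follows, to see which directories a node running as a given user would try to walk. The function names are hypothetical, and this is not Kubo's implementation; it only mirrors the two locations named in the specification.

```python
import os
from pathlib import Path
from typing import List, Mapping, Optional


def denylist_dirs(env: Optional[Mapping[str, str]] = None) -> List[Path]:
    """Directories named by the spec: /etc and $XDG_CONFIG_HOME (default ~/.config)."""
    env = os.environ if env is None else env
    if env.get("XDG_CONFIG_HOME"):
        config_home = Path(env["XDG_CONFIG_HOME"])
    elif env.get("HOME"):
        config_home = Path(env["HOME"]) / ".config"
    else:
        config_home = Path.home() / ".config"
    return [Path("/etc/ipfs/denylists"), config_home / "ipfs" / "denylists"]


def inaccessible(dirs: List[Path]) -> List[Path]:
    """Existing directories that the current user cannot read and traverse."""
    return [
        d for d in dirs
        if os.path.isdir(d) and not os.access(d, os.R_OK | os.X_OK)
    ]
```

With `HOME=/home/ipfs`, the second entry is `/home/ipfs/.config/ipfs/denylists`, the path that appears in the error message.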

I am not sure why this only failed on Ubuntu 22.04
and not Debian 12 or Ubuntu 24.04. My first assumption
would be a difference in Systemd.
@hoh hoh force-pushed the hoh-update-vm-connector branch from 493337e to df70293 Compare March 12, 2025 08:43
@hoh hoh marked this pull request as ready for review March 12, 2025 09:46
@hoh hoh force-pushed the hoh-update-vm-connector branch from df70293 to 03b780b Compare March 12, 2025 09:46
olethanh and others added 23 commits March 18, 2025 15:38
* Make CI a bit more resilient

Do not run export log if cancelled or not setup
Attempt to always do the proper clean up

Print more debug information in case the droplet ipv4 cannot be parsed

* CI: Add timeouts for runtime workflow

* CI: Document the runtime workflow

* CI: Increase timeout to bring up droplet

* CI: pytest. Prevent "Ouput modules" step failure

The deps were not always installed, depending on where the failure was.
Since this is a debug step, we don't want it to fail.

Do not run it if the workflow was cancelled

* CI: Merge Package workflow and Droplet workflow

This allows reusing the packages built in the previous workflow
for the droplet test, reducing the number of package builds from 12 to 3 in these workflows.

This:
* Fixes the issue of packages not being able to be built because of rate
limitation on other resources.
* Reduces the chances of random errors.
* Reduces the total CI time requirement.
* Does not attempt to run the droplet test if the package building phase
  fails. (Previously all the package builds were launched in parallel,
which meant they all failed unnecessarily.)
* Makes it less costly and faster to run the failed jobs.

With these changes, the number of CI failures is greatly reduced,
and the cause of a failure is clearer.

* CI: Wait until the droplet has an IPv4 before proceeding

It might be a change in the Digital Ocean API, but it returns
before the network is set up and with empty setup info.
It didn't seem to occur before, so it might be an API change on their
part.

* CI: Ensure we reuse the previously calculated IP

* CI: Ensure we use the public IP v4 of the Droplet

Digital Ocean droplets always have a private IP in addition to the public
one.
The API returns them in random order, so the CI job occasionally tried to
use the internal one and failed.
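
Selecting the address by type instead of by position could be sketched as below. The field names (`networks`, `v4`, `type`, `ip_address`) are assumed to match the droplet JSON returned by the Digital Ocean API, the sample addresses are invented, and the helper `public_ipv4` is hypothetical. The CI job itself may well do this in a shell script.

```python
from typing import Any, Mapping, Optional


def public_ipv4(droplet: Mapping[str, Any]) -> Optional[str]:
    """Return the droplet's public IPv4, or None if the network is not set up yet."""
    networks = droplet.get("networks") or {}
    for entry in networks.get("v4") or []:
        # Entries are not ordered, so filter on the type instead of taking the first.
        if entry.get("type") == "public":
            return entry.get("ip_address")
    return None


# Sample payload with made-up addresses, private entry listed first.
sample = {
    "networks": {
        "v4": [
            {"ip_address": "10.0.0.5", "type": "private"},
            {"ip_address": "203.0.113.7", "type": "public"},
        ]
    }
}
address = public_ipv4(sample)
```

A `None` result covers the case where the API answers before the network is configured, so the caller can poll again instead of proceeding with an empty address.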

* CI: Merge the runtime test workflow with the package workflow

Same operation as moving the Droplet workflow: we reuse the already
built packages.

The resilience and speed advantages are the same and add up.

* CI: Remove the deleted workflow, update doc

Rename the main workflow field
Document more

* improve doc

* CI: Do not stop if the Export aleph logs command fails

* CI: Print commands run in install aleph step for debug

* CI: Do not fetch whole repository where not needed

In the run_on_droplet job we only require the .github/scripts dir

* CI: Fix apt broken progress bar output
Previous way was preventing them from working inside tests
set up scaffolding to test reservation system
Part of Jira ALEPH-421
- Add supports_x_vga flag to QemuGPU model
- Add system check to detect if a GPU supports x-vga feature
- Only add x-vga parameter to QEMU command when supported
- Fall back gracefully when professional GPUs don't support x-vga

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Test QemuVM device args generation with and without x-vga support
- Test x-vga detection process in AlephQemuInstance
- Test error handling for GPU x-vga detection
- Test configuration flow with x-vga support detection

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Simplify x-vga support detection by using the GPU device class from lspci
- Class 0300 (VGA compatible controller) supports x-vga
- Class 0302 (3D controller) does not support x-vga
- Remove the complex subprocess-based detection method in favor of this simpler approach
- Update unit tests to test the device class-based detection

This approach is faster, more reliable, and doesn't require running test QEMU commands.
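
The class-based rule above can be sketched as follows. All function names are hypothetical and do not reflect the actual `QemuGPU` or `AlephQemuInstance` code. The line format is assumed to be that of `lspci -n` (slot, class code, vendor:device), and the device argument is reduced to the `host` and `x-vga` properties for illustration.

```python
from typing import Optional

VGA_COMPATIBLE_CONTROLLER = "0300"  # supports x-vga
CONTROLLER_3D = "0302"              # does not support x-vga


def parse_lspci_class(line: str) -> Optional[str]:
    """Extract the class code from a line such as '01:00.0 0300: 10de:2204 (rev a1)'."""
    fields = line.split()
    if len(fields) < 2:
        return None
    return fields[1].rstrip(":")


def supports_x_vga(device_class: str) -> bool:
    """Only VGA compatible controllers (class 0300) are treated as x-vga capable."""
    return device_class.strip() == VGA_COMPATIBLE_CONTROLLER


def vfio_device_arg(pci_host: str, device_class: str) -> str:
    """Build a simplified '-device' value, adding x-vga only when supported."""
    arg = f"vfio-pci,host={pci_host}"
    if supports_x_vga(device_class):
        arg += ",x-vga=on"
    return arg
```

For a class `0302` device the argument is emitted without `x-vga`, which corresponds to the graceful fallback for professional GPUs.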

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
The test was trying to set attributes on HostGPU objects, but the HostGPU class
doesn't have the supports_x_vga field. Changed to use GpuDevice which has the
required field to fix the test.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
When aleph-vm-supervisor restarts, it properly loads existing VM interfaces but
doesn't re-add their IPv6 ranges to the ndppd proxy configuration. This change
explicitly re-adds ndp_proxy rules for existing interfaces during VM recovery.

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
of vm when calculate available disk space

Jira: ALEPH-420

This will help the scheduler schedule the available resources on CRNs more accurately

Alternative approach to #780, as sparse files were not properly created.
Don't call pyaleph if there is no execution for user
Remove debug executions
The vm-connector service used to run with obsolete versions of the aleph-sdk and Python.

Solution: Use the latest version of the SDK and pin the versions of dependencies.

9 participants