Skip to content

Conversation

@sitaowang1998
Copy link
Collaborator

@sitaowang1998 sitaowang1998 commented Jan 8, 2026

Description

This PR:

  • Adds scripts to start, wait, initialize and stop MariaDB Docker container.
  • Adds storage creation task that uses the above scripts.
  • Updates storage-related test tasks to add storage-creation task and storage cleanup.
  • Updates following tests to use SPIDER_STORAGE_URL instead of hardcoded url
    • C++ unit tests
    • C++ integration tests
    • Python unit tests
  • Updates GitHub workflow to run all tests.
  • Updates spider-py README and test doc.

Note

Need to change the required workflow for PR from non-storage-unit-tests to tests.

Checklist

  • The PR satisfies the contribution guidelines.
  • This is a breaking change and that has been indicated in the PR title, OR this isn't a
    breaking change.
  • Necessary docs have been updated, OR no docs need to be updated.

Validation performed

  • All test tasks pass.
  • GitHub workflows pass.

Summary by CodeRabbit

  • New Features

    • Automated MariaDB provisioning for tests with dynamic port allocation and start/stop helpers
    • Runtime-selectable storage URL via environment variable and a utility to find free ports
  • Documentation

    • Simplified testing guide and consolidated test workflow instructions
  • Chores

    • CI test workflow renamed and reorganized to run unified unit and integration test flows

✏️ Tip: You can customize this high-level summary in your review settings.


@sitaowang1998 sitaowang1998 requested a review from a team as a code owner January 8, 2026 04:40
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Jan 8, 2026

Walkthrough

Adds MariaDB-backed test infrastructure: dynamic MariaDB provisioning (start/stop), DB init scripts, env-driven storage URL resolution in tests and helpers, CI/workflow and taskfile wiring to start MariaDB before tests, linting scope update, and removal of the old SQL init script.

Changes

Cohort / File(s) Summary
GitHub Workflow
\.github/workflows/tests.yaml
Renamed workflow/job from unit-teststests; swapped step identifiers and added test:spider-py-unit-tests.
Task Configuration
taskfiles/test.yaml
Added globals G_MARIADB_DATABASE, G_MARIADB_USERNAME, G_MARIADB_PASSWORD; added start-mariadb internal task; per-task dynamic MARIADB_PORT/container vars; wired SPIDER_STORAGE_URL and start/stop lifecycle into multiple test tasks.
Linting Configuration
taskfiles/lint.yaml
Included tools/scripts/mariadb in py lint loop (mypy/ruff).
Python Test Fixtures
python/spider-py/tests/client/test_driver.py, python/spider-py/tests/storage/test_mariadb.py
Fixtures now read SPIDER_STORAGE_URL env var (fallback to previous defaults) when constructing driver/storage.
Integration Test Client & Tests
tests/integration/client.py, tests/integration/test_*.py
Renamed g_storage_urlG_STORAGE_URL; added get_storage_url() and storage export; tests updated to call get_storage_url() for runtime URL.
C++ Storage Helper
tests/storage/StorageTestHelper.hpp
cMySqlStorageUrlconstexpr std::string_view; added get_storage_url() reading SPIDER_STORAGE_URL; create_storage_factory() uses get_storage_url().
MariaDB Container Management
tools/scripts/mariadb/start.py, tools/scripts/mariadb/stop.py
New scripts to start (pull/run, healthcheck polling, timeout) and stop MariaDB Docker containers.
MariaDB DB Init (programmatic)
tools/scripts/mariadb/wolf/init_db.py
New Python-based DB initialization executing CREATE TABLE statements for Spider schema.
Utility
tools/scripts/get_free_port.py
New helper to find an available TCP port for dynamic container binding.
Removed SQL Schema
tools/scripts/storage/init_db.sql
Deleted the previous SQL initialization script (schema now created via Python script).
Docs / README
docs/src/dev-docs/testing.md, python/spider-py/README.md
Removed manual storage setup instructions and storage-specific test guidance; updated workflow references and simplified testing docs.

Sequence Diagram(s)

sequenceDiagram
  participant TaskRunner as Task (taskfiles/test.yaml)
  participant StartScript as start-mariadb
  participant Docker as Docker Engine
  participant InitDB as init_db.py
  participant Tests as Test Runner

  TaskRunner->>StartScript: invoke start-mariadb with container name, port, creds
  StartScript->>Docker: docker pull/run container (expose hostPort->3306)
  StartScript->>Docker: docker exec healthcheck.sh (poll until ready)
  StartScript-->>TaskRunner: return success and set SPIDER_STORAGE_URL
  TaskRunner->>InitDB: invoke init_db.py with DB connection params
  InitDB-->>TaskRunner: return success after creating schema
  TaskRunner->>Tests: run tests with SPIDER_STORAGE_URL env
  TaskRunner->>StopScript: defer stop-mariadb cleanup
  StopScript->>Docker: docker stop/remove container
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested reviewers

  • LinZhihao-723
🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly summarizes the main changes: adding MariaDB Docker tasks and integrating storage-related tests into GitHub workflows, which aligns with the substantial changes across taskfiles, test files, and CI configuration.
Docstring Coverage ✅ Passed Docstring coverage is 91.30% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🤖 Fix all issues with AI agents
In @tools/scripts/storage/init_db.py:
- Around line 15-231: The CREATE TABLE statements incorrectly use ON UPDATE
CURRENT_TIMESTAMP for columns that should be immutable after creation; remove
the ON UPDATE CURRENT_TIMESTAMP clause from `creation_time` in the jobs table
(jobs -> `creation_time`), from `start_time` in the task_instances table
(task_instances -> `start_time`), and from `lease_time` in the scheduler_leases
table (scheduler_leases -> `lease_time`) while keeping DEFAULT CURRENT_TIMESTAMP
so the timestamp is set at insertion; leave the `heartbeat` column (drivers ->
`heartbeat`) as-is with ON UPDATE CURRENT_TIMESTAMP.

In @tools/scripts/storage/start.py:
- Line 12: The _MARIADB_IMAGE constant is pinned to "mariadb:latest", which is
non-reproducible; change _MARIADB_IMAGE to a specific, supported MariaDB tag
(e.g., "mariadb:10.11" or another project-approved version) or make it
configurable (read from an environment variable or script argument with a stable
default) so builds are deterministic; update any documentation or defaults to
reflect the chosen pinned version.

In @tools/scripts/storage/stop.py:
- Line 25: The argparse parser description contains a typo: change the string
passed to the ArgumentParser constructor (the description argument used when
creating parser) from "Stop MairaDB Docker container." to "Stop MariaDB Docker
container." so the CLI help shows the correct database name; update the
description in the parser instantiation where parser =
argparse.ArgumentParser(...) is defined.
- Around line 44-48: The variable name localstack_stop_cmd is misleading because
it targets a MariaDB container; rename it to mariadb_stop_cmd wherever defined
and referenced (the list currently built with "docker", "stop", args.name) so
the variable name matches its purpose—update the declaration and any subsequent
use of localstack_stop_cmd to mariadb_stop_cmd in the stop.py script.
🧹 Nitpick comments (5)
.github/workflows/tests.yaml (1)

1-1: Consider updating the workflow name to reflect broader test scope.

The workflow name "unit-tests" no longer accurately reflects its scope, as it now runs integration tests (line 60). Consider renaming to "tests" or "all-tests" for clarity.

📝 Suggested rename
-name: "unit-tests"
+name: "tests"
tools/scripts/storage/get_free_port.py (1)

24-25: Remove the unused noqa directive.

The static analysis tool correctly identifies that the noqa: T201 directive is unnecessary because the T201 rule (print statement detection) is not enabled in your Ruff configuration.

♻️ Proposed fix
-    # ruff: noqa: T201
     print(port)
tools/scripts/storage/init_db.py (3)

97-97: Consider rationalizing VARBINARY and VARCHAR size limits.

The value size limit of 999 bytes appears in multiple columns (data.value, task_outputs.type, task_outputs.value, task_inputs.type, task_inputs.value, client_kv_data.value, task_kv_data.value). This specific limit seems arbitrary.

Common choices are powers of 2 (1024, 2048) or clean decimal values (1000). Consider whether 999 was chosen for a specific reason or if it should be normalized.

Also applies to: 108-109, 123-123, 214-215, 224-224


271-284: Add error handling and success logging.

The script lacks error handling and logging for the database initialization process. If table creation fails, the error won't be clearly communicated to the user.

♻️ Proposed improvement
+    logger.info("Connecting to MariaDB at 127.0.0.1:%d...", args.port)
     with (
         mariadb.connect(
             host="127.0.0.1",
             port=args.port,
             user=args.username,
             password=args.password,
             database=args.database,
         ) as conn,
         conn.cursor() as cursor,
     ):
+        logger.info("Creating database tables...")
         for table_creator in _TABLE_CREATORS:
             cursor.execute(table_creator)
         conn.commit()
+        logger.info("Database initialization completed successfully.")
 
     return 0

273-273: Hardcoded host limits flexibility.

The host is hardcoded to "127.0.0.1". If MariaDB needs to run on a different host (e.g., Docker host networking scenarios), this won't be configurable.

Consider adding a --host argument with "127.0.0.1" as the default, similar to the other connection parameters.

📜 Review details

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a1865df and 62d9add.

📒 Files selected for processing (16)
  • .github/workflows/tests.yaml
  • python/spider-py/tests/client/test_driver.py
  • python/spider-py/tests/storage/test_mariadb.py
  • taskfiles/lint.yaml
  • taskfiles/test.yaml
  • tests/integration/client.py
  • tests/integration/test_client.py
  • tests/integration/test_scheduler_worker.py
  • tests/integration/test_signal.py
  • tests/storage/StorageTestHelper.hpp
  • tools/scripts/storage/get_free_port.py
  • tools/scripts/storage/init_db.py
  • tools/scripts/storage/init_db.sql
  • tools/scripts/storage/start.py
  • tools/scripts/storage/stop.py
  • tools/scripts/storage/wait_for_db.py
💤 Files with no reviewable changes (1)
  • tools/scripts/storage/init_db.sql
🧰 Additional context used
🧬 Code graph analysis (5)
tools/scripts/storage/start.py (2)
tools/scripts/storage/init_db.py (1)
  • main (242-285)
tools/scripts/storage/wait_for_db.py (1)
  • main (21-65)
tools/scripts/storage/wait_for_db.py (2)
tools/scripts/storage/init_db.py (1)
  • main (242-285)
tools/scripts/storage/start.py (1)
  • main (22-114)
python/spider-py/tests/client/test_driver.py (1)
python/spider-py/src/spider_py/client/driver.py (1)
  • Driver (19-60)
python/spider-py/tests/storage/test_mariadb.py (1)
python/spider-py/src/spider_py/storage/jdbc_url.py (1)
  • parse_jdbc_url (22-55)
tests/integration/test_client.py (1)
tests/integration/client.py (2)
  • get_storage_url (116-121)
  • storage (125-134)
🪛 Ruff (0.14.10)
tools/scripts/storage/start.py

61-61: subprocess call: check for execution of untrusted input

(S603)


73-73: subprocess call: check for execution of untrusted input

(S603)


104-104: subprocess call: check for execution of untrusted input

(S603)

tools/scripts/storage/wait_for_db.py

45-45: subprocess call: check for execution of untrusted input

(S603)

tools/scripts/storage/stop.py

34-34: subprocess call: check for execution of untrusted input

(S603)


50-50: subprocess call: check for execution of untrusted input

(S603)

tools/scripts/storage/get_free_port.py

24-24: Unused noqa directive (non-enabled: T201)

Remove unused noqa directive

(RUF100)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: tests (ubuntu-22.04)
  • GitHub Check: tests (ubuntu-24.04)
  • GitHub Check: lint
🔇 Additional comments (27)
.github/workflows/tests.yaml (2)

20-20: LGTM!

The job rename from non-storage-unit-tests to tests appropriately reflects the expanded test scope that now includes storage-related tests.


58-62: No changes required. The test tasks (test:cpp-unit-tests, test:cpp-integration-tests, test:spider-py-unit-tests) are self-contained and properly handle the full MariaDB lifecycle:

  • Storage setup: Each task calls the start-storage subtask, which starts the MariaDB container, waits for readiness, and initialises the database. The SPIDER_STORAGE_URL environment variable is set correctly.
  • Cleanup on failure: The defer: keyword guarantees cleanup runs even if tests fail, preventing resource leaks.
  • CI compatibility: Dynamic port allocation via get_free_port.py and unique container naming via uuidgen ensure portability across environments including GitHub Actions.
python/spider-py/tests/client/test_driver.py (1)

16-17: LGTM!

The environment-driven storage URL configuration is correctly implemented with an appropriate fallback to the default test URL.

python/spider-py/tests/storage/test_mariadb.py (1)

19-20: LGTM!

The fixture correctly retrieves the storage URL from the environment variable and parses it for MariaDB connection parameters.

tools/scripts/storage/wait_for_db.py (1)

21-66: LGTM!

The wait logic correctly polls the MariaDB container health status with appropriate timeout handling and sleep intervals between attempts.

Note: The static analysis warning (S603) about untrusted subprocess input is a false positive in this context, as both docker_executable and args.name are controlled inputs.

tests/storage/StorageTestHelper.hpp (3)

17-17: Good refactoring: constexpr string_view for constant URL.

Changing from std::string const to constexpr std::string_view is more efficient and appropriate for a compile-time constant.


24-32: LGTM!

The get_storage_url() function correctly reads the SPIDER_STORAGE_URL environment variable and falls back to the default value when not set.


36-37: LGTM!

The factory creation now uses the environment-driven storage URL, aligning with the broader refactoring to support dynamic storage configuration.

taskfiles/lint.yaml (1)

139-146: LGTM!

Extending linting to cover the newly introduced storage scripts is appropriate and maintains consistency with the existing linting configuration.

tests/integration/test_signal.py (2)

15-27: LGTM!

The import changes correctly align with the refactored public API in tests/integration/client.py, switching from direct constant access to the new get_storage_url() function.


87-87: LGTM!

The fixture correctly uses get_storage_url() to obtain the runtime-configured storage URL instead of the previous hardcoded constant.

tests/integration/test_scheduler_worker.py (2)

13-30: LGTM!

The import changes correctly reflect the refactored API, bringing in get_storage_url() and storage from the updated integration.client module.


93-93: LGTM!

The fixture correctly uses get_storage_url() for runtime storage URL configuration, consistent with the environment-driven approach introduced in this PR.

tools/scripts/storage/get_free_port.py (1)

11-18: Be aware of the inherent TOCTOU race condition.

The approach of binding to port 0 to obtain a free port has a time-of-check-time-of-use (TOCTOU) race condition: between the time this script returns the port number and when the MariaDB container actually binds to it, another process could claim the port. This is an inherent limitation of this port allocation pattern and is generally acceptable for test scenarios where the window is small.

If port conflicts occur in CI/CD environments, consider implementing retry logic in the MariaDB startup script or using Docker's dynamic port allocation features instead.

tests/integration/client.py (3)

114-114: LGTM!

Renaming to uppercase G_STORAGE_URL follows Python naming conventions for module-level constants. The hardcoded credentials in the default URL are acceptable for test scenarios.


116-121: LGTM!

The new get_storage_url() function provides a clean abstraction for environment-driven storage URL configuration, falling back to the default when SPIDER_STORAGE_URL is not set.


132-132: LGTM!

The fixture correctly uses get_storage_url() to support runtime storage configuration via the environment variable.

tools/scripts/storage/init_db.py (1)

1-14: LGTM! Clean script setup.

The shebang, dependency declaration, and imports are well-structured. Using uv run --script with inline dependencies is an excellent choice for this utility script.

tools/scripts/storage/start.py (3)

1-20: LGTM! Clean script structure.

The imports, logging configuration, and image constant are well-organized.


61-69: Clarify the return code when container already exists.

Returning 1 when the container already exists (line 69) typically indicates an error, but this is more of a no-op situation. Consider whether this should return 0 for success or use a different approach.

The current behavior might be intentional if calling scripts need to distinguish between "started fresh" and "already running". Please verify the expected behavior in the task orchestration context.


57-66: Static analysis warnings are false positives.

The Ruff S603 warnings about untrusted input in subprocess calls are false positives. The docker_executable is a hardcoded string (line 59), and all other arguments are either literals or validated by argparse.

Also applies to: 73-82, 104-114

tests/integration/test_client.py (1)

10-10: LGTM! Proper migration to environment-driven storage URL.

The changes correctly replace the hardcoded g_storage_url global with the get_storage_url() accessor function. This enables runtime configuration via the SPIDER_STORAGE_URL environment variable, which aligns with the MariaDB Docker lifecycle management introduced in this PR.

The storage fixture import on line 10 provides proper lifecycle management through pytest's fixture dependency mechanism.

Also applies to: 66-66, 87-87

taskfiles/test.yaml (5)

7-9: LGTM! Clean configuration variables.

The MariaDB configuration is properly centralized in global variables, making it easy to adjust defaults across all test tasks.


24-26: Excellent cross-platform UUID normalization.

The UUID generation and normalization logic handles the case difference between macOS (uppercase) and Linux (lowercase), ensuring consistent container naming across platforms. The spider-mariadb- prefix provides clear identification.

Also applies to: 48-50, 89-91, 145-147, 178-180


27-28: Be aware of potential port allocation race condition.

Dynamic port allocation via get_free_port.py is a good approach, but there's a time-of-check-to-time-of-use (TOCTOU) race condition: between obtaining the free port and starting the container, another process could bind to that port.

This is unlikely in typical CI environments but could occur in heavily loaded systems or when running multiple test suites concurrently.

Consider whether the test infrastructure needs additional retry logic or port conflict handling. In practice, this may be acceptable given the short time window and typical usage patterns.

Also applies to: 51-52, 92-93, 148-149, 181-182


196-220: LGTM! Well-structured storage lifecycle management.

The start-storage internal task properly orchestrates the three-step process:

  1. Start the MariaDB container
  2. Wait for readiness
  3. Initialize the database schema

The sequential execution ensures each step completes before the next begins, and the internal: true flag appropriately restricts this task to internal use only.


37-39: No action needed. The stop.py script already handles non-existent containers gracefully by checking container existence with docker inspect (line 35) and exiting with code 0 while logging a warning message if the container doesn't exist or isn't running (lines 40-42). The defer blocks will safely execute even if start-storage fails, as the stop script is designed to handle this scenario without producing confusing error output.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In @docs/src/dev-docs/testing.md:
- Around line 26-27: The sentence referencing the GitHub workflow
"[tests.yaml][gh-workflow-tests]" contains a grammatical error—remove the
redundant "the" so it reads "runs all unit tests and integration tests"; update
the sentence in the docs text where the phrase "runs the all unit tests and
integration tests" appears to "runs all unit tests and integration tests" to
correct the grammar.
🧹 Nitpick comments (4)
tests/integration/test_signal.py (1)

21-21: Remove unnecessary noqa directive.

The # noqa: F401 directive is unnecessary here. Ruff understands that storage is a pytest fixture and won't flag it as an unused import.

♻️ Proposed fix
-    storage,  # noqa: F401
+    storage,
tests/integration/test_scheduler_worker.py (1)

24-24: Remove unnecessary noqa directive.

The # noqa: F401 directive is unnecessary here. Ruff understands that storage is a pytest fixture and won't flag it as an unused import.

♻️ Proposed fix
-    storage,  # noqa: F401
+    storage,
tests/integration/test_client.py (1)

10-10: Remove unnecessary noqa directive.

The # noqa: F401 directive is unnecessary here. Ruff understands that storage is a pytest fixture and won't flag it as an unused import.

♻️ Proposed fix
-from integration.client import get_storage_url, SQLConnection, storage  # noqa: F401
+from integration.client import get_storage_url, SQLConnection, storage
tools/scripts/storage/init_db.py (1)

242-286: Consider adding explicit error handling for better diagnostics.

The connection and execution logic is correct and uses proper context management. However, there's no explicit error handling—any connection failure or SQL execution error will raise an unhandled exception. While fail-fast behaviour may be intentional for a setup script, wrapping the connection block in a try/except with structured logging would improve diagnostics.

♻️ Optional: Add error handling
def main() -> int:
    """Main."""
    parser = argparse.ArgumentParser(description="Initialize the database tables for Spider.")
    # ... argument parsing ...
    args = parser.parse_args()

    try:
        with (
            mariadb.connect(
                host="127.0.0.1",
                port=args.port,
                user=args.username,
                password=args.password,
                database=args.database,
            ) as conn,
            conn.cursor() as cursor,
        ):
            for table_creator in _TABLE_CREATORS:
                cursor.execute(table_creator)
            conn.commit()
        logger.info("Database tables initialized successfully.")
        return 0
    except mariadb.Error as e:
        logger.error("Failed to initialize database: %s", e)
        return 1
📜 Review details

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 62d9add and 55f7300.

📒 Files selected for processing (9)
  • .github/workflows/tests.yaml
  • docs/src/dev-docs/testing.md
  • python/spider-py/README.md
  • tests/integration/client.py
  • tests/integration/test_client.py
  • tests/integration/test_scheduler_worker.py
  • tests/integration/test_signal.py
  • tools/scripts/storage/init_db.py
  • tools/scripts/storage/stop.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • tests/integration/client.py
  • .github/workflows/tests.yaml
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-04-09T17:15:24.552Z
Learnt from: davidlion
Repo: y-scope/spider PR: 100
File: src/spider/worker/worker.cpp:205-230
Timestamp: 2025-04-09T17:15:24.552Z
Learning: Documentation should be added to new functions in the spider codebase, as already discussed with the user.

Applied to files:

  • python/spider-py/README.md
🧬 Code graph analysis (4)
tools/scripts/storage/init_db.py (2)
tools/scripts/storage/start.py (1)
  • main (22-114)
tools/scripts/storage/wait_for_db.py (1)
  • main (21-65)
tests/integration/test_client.py (1)
tests/integration/client.py (2)
  • get_storage_url (118-123)
  • storage (127-136)
tests/integration/test_scheduler_worker.py (1)
tests/integration/client.py (2)
  • get_storage_url (118-123)
  • storage (127-136)
tests/integration/test_signal.py (1)
tests/integration/client.py (5)
  • get_storage_url (118-123)
  • get_task_outputs (198-223)
  • get_task_state (226-240)
  • remove_job (243-253)
  • storage (127-136)
🪛 Ruff (0.14.10)
tests/integration/test_client.py

10-10: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)

tests/integration/test_scheduler_worker.py

24-24: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)

tests/integration/test_signal.py

21-21: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)

tools/scripts/storage/stop.py

34-34: subprocess call: check for execution of untrusted input

(S603)


50-50: subprocess call: check for execution of untrusted input

(S603)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: lint
  • GitHub Check: tests (ubuntu-24.04)
  • GitHub Check: tests (ubuntu-22.04)
🔇 Additional comments (10)
python/spider-py/README.md (1)

27-33: Verify that the simplified Testing section adequately documents the new storage-enabled test workflow.

The Testing section has been significantly simplified to a single task command. However, the PR introduces a new SPIDER_STORAGE_URL environment variable and MariaDB-backed storage integration. The README no longer documents:

  • How developers should configure or set up storage for local testing (e.g., whether storage-creation task runs automatically or requires manual invocation)
  • The role of the SPIDER_STORAGE_URL environment variable in the test workflow
  • Whether developers need specific prerequisites (e.g., Docker, MariaDB) installed locally
  • How the unified test:spider-py-unit-tests task interacts with the storage setup introduced in this PR

Consider whether the README should include additional context about storage configuration for developers setting up tests locally, or link to more detailed testing documentation if it exists elsewhere.

tests/integration/test_signal.py (2)

76-76: LGTM! Environment-driven storage configuration.

The fixture correctly uses get_storage_url() to retrieve the storage URL from the environment variable, enabling flexible test configuration.

Also applies to: 87-89


103-103: LGTM! Consistent storage fixture usage.

The test methods correctly use the storage fixture with appropriate noqa annotations for parameter redefinition.

Also applies to: 180-180

tests/integration/test_scheduler_worker.py (2)

82-82: LGTM! Environment-driven storage configuration.

The fixture correctly uses get_storage_url() to retrieve the storage URL from the environment variable, enabling flexible test configuration.

Also applies to: 93-93


284-284: LGTM! Consistent storage fixture usage.

The test methods correctly use the storage fixture with appropriate noqa annotations for parameter redefinition.

Also applies to: 314-314, 331-331, 351-351

tests/integration/test_client.py (2)

55-55: LGTM! Environment-driven storage configuration.

The fixture correctly uses get_storage_url() to retrieve the storage URL from the environment variable, enabling flexible test configuration.

Also applies to: 66-66


87-87: LGTM! Consistent storage URL usage.

The test correctly passes the environment-driven storage URL to the client_test program.

tools/scripts/storage/stop.py (2)

44-55: Stop logic is correct.

The container stop logic properly handles errors and logs appropriate messages. Past review comments about the variable name have been addressed (mariadb_stop_cmd is correctly used).


20-42: Use the docker_executable variable consistently.

The docker_executable variable is defined on line 23 with a comment stating it's to silence Ruff S607, but line 35 uses the hardcoded string "docker" instead of the variable, making the variable and comment misleading.

🔧 Proposed fix
     result = subprocess.run(
-        [docker_executable, "inspect", "-f", "{{.State.Running}}", args.name],
+        [docker_executable, "inspect", "-f", "{{.State.Running}}", args.name],
         capture_output=True,
         text=True,
         check=False,
     )

Note: Line 35 already shows docker_executable in the diff above, but verify line 45 also uses docker_executable instead of "docker".

Likely an incorrect or invalid review comment.

tools/scripts/storage/init_db.py (1)

15-231: Schema looks solid; past timestamp issues resolved.

The SQL schema is comprehensive and correctly configured. Past review comments regarding incorrect ON UPDATE CURRENT_TIMESTAMP usage on creation_time, start_time, and lease_time have been addressed—these columns now correctly use only DEFAULT CURRENT_TIMESTAMP. The heartbeat column appropriately retains ON UPDATE CURRENT_TIMESTAMP for auto-updates.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (3)
taskfiles/test.yaml (3)

7-9: Consider documenting that these are test-only credentials.

While the hardcoded credentials are acceptable for local testing, consider adding a comment clarifying that these are for development/testing purposes only and should not be used in production environments.


111-113: Simplify pytest invocation.

Line 112 activates the virtual environment, but line 113 uses the full path to pytest, making the activation redundant.

♻️ Simplify to use pytest directly
-      - |-
-        . ../test-venv/bin/activate
-        ../test-venv/bin/pytest tests/integration
+      - "../test-venv/bin/pytest tests/integration"

18-204: Consider reducing code duplication across storage-dependent tasks.

The MariaDB setup pattern (container name generation, port allocation, storage URL construction, start-storage invocation, and defer cleanup) is duplicated across five tasks. This creates maintenance burden—changes to the storage lifecycle require updates in multiple places.

While YAML anchors have limitations with Taskfile's variable interpolation, consider whether a wrapper task or shared configuration could reduce this duplication in a future refactor.

📜 Review details

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9b3fa91 and f03981e.

📒 Files selected for processing (1)
  • taskfiles/test.yaml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: lint
  • GitHub Check: tests (ubuntu-22.04)
  • GitHub Check: tests (ubuntu-24.04)
🔇 Additional comments (3)
taskfiles/test.yaml (3)

44-68: Clarify the test task strategy.

The presence of three tasks (cpp-unit-tests, cpp-storage-unit-tests, cpp-non-storage-unit-tests) creates ambiguity about which task developers should use. Consider documenting the intended workflow:

  • Is cpp-unit-tests the primary entry point for running all tests?
  • When should developers use the split versions?

206-230: Well-designed internal task for storage lifecycle.

The start-storage task properly encapsulates the three-phase MariaDB setup (start → wait → init) with sensible defaults. Making it internal ensures it's only invoked by other tasks.


18-42: Well-structured storage test lifecycle.

The task properly implements the MariaDB container lifecycle with dynamic port allocation, UUID-based container naming, and cleanup via defer. The cross-platform UUID normalization is a good defensive measure. All referenced scripts and variables are properly defined and in place.

Copy link
Member

@LinZhihao-723 LinZhihao-723 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed the scripts. Can we restructure the tool scripts as the following:

tool/scripts/
  get_free_port.py
  mariadb/
    start.py
    stop.py
    wolf/
      init.py
    huntsman/
      init.py
  • get_free_port.py seems to be a general helper
  • Using mariadb directly instead of storage, in case we need other storage backend in the future
  • Move init scripts into the dedicated namespace

This should make the structure cleaner.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 7

🤖 Fix all issues with AI agents
In @taskfiles/test.yaml:
- Around line 25-31: The MARIADB_PORT var (and other similar vars using
tools/scripts/get_free_port.py) uses a relative path that breaks when tasks run
from non-root directories; update the sh value for MARIADB_PORT and the other
instances (the vars/commands at the other noted blocks) to invoke the script via
the repository root variable (i.e., prefix with {{.ROOT_DIR}}) so the path
resolves regardless of task dir:, ensuring you change every occurrence that
currently is "tools/scripts/get_free_port.py".
- Around line 21-25: The SPIDER_STORAGE_URL (and the same patterns at the other
occurrences) currently interpolates G_MARIADB_USERNAME and G_MARIADB_PASSWORD
directly into the JDBC URL which will break if credentials contain characters
like @ or &; update the template to percent-encode (URL-encode)
G_MARIADB_USERNAME and G_MARIADB_PASSWORD before embedding them into
SPIDER_STORAGE_URL (or, alternatively, move user and password out of the query
string and supply them as separate JDBC connection properties), and apply the
same fix to the other identical URL templates referenced (the blocks around
lines 47–51, 90–94, 149–152, 184–187) so all credential interpolations are
safely encoded or passed as properties.

In @tools/scripts/get_free_port.py:
- Around line 21-26: The inline Ruff suppression "# ruff: noqa: T201" in the
main function is unused and should be removed; edit the main() implementation
(which calls get_free_port() and prints the port) to delete that comment so Ruff
will report T201 as intended, or alternatively keep the comment but instead add
T201 to your project's Ruff config to explicitly allow it—pick one approach and
apply it consistently.

In @tools/scripts/mariadb/start.py:
- Around line 68-77: The code only treats a running container case but ignores
when docker inspect returns exit code 0 and stdout == "false" (a stopped
container with the same name), which will cause docker run --name ... to fail
with a name conflict; update the check in the block that runs subprocess.run
(using docker_executable and args.name and the result variable) to detect
result.returncode == 0 and result.stdout.rstrip("\n") == "false", then log a
clear error like "Container <name> exists but is stopped; remove or rename it
before running" and return a non-zero exit code (or raise/exit) instead of
proceeding; apply the same explicit stopped-container handling to the other
inspect/check sequence around the 91-121 region so both places consistently
detect and fail with a helpful message when a name conflict with a stopped
container exists.

In @tools/scripts/mariadb/stop.py:
- Around line 20-55: The code incorrectly hardcodes "docker" when building
mariadb_stop_cmd and emits a misleading warning; update mariadb_stop_cmd to use
the docker_executable variable instead of the literal "docker", and change the
logger.warning in main (the inspect check) to say the container "is not running
or does not exist" (or similar) since the inspect result covers both cases; keep
the existing return behavior and error handling for subprocess.run as-is.

In @tools/scripts/mariadb/wolf/init_db.py:
- Around line 225-269: The main() function currently lets mariadb.connect and
cursor.execute raise unstructured exceptions; add explicit try/except around the
connect/execute/commit block in main() to catch mariadb.Error (or Exception),
log a clear error message including exception details, and return a non-zero
exit code on failure; also add an optional --host argparse option (default
"127.0.0.1") and use args.host when calling mariadb.connect; refer to main(),
_TABLE_CREATORS, mariadb.connect, cursor.execute, conn.commit to locate the
changes.
🧹 Nitpick comments (3)
tools/scripts/mariadb/start.py (1)

68-73: Add subprocess timeouts to avoid hung CI jobs.

docker pull / docker exec can hang indefinitely; consider timeout=... on subprocess.run(...) (and log the timeout distinctly).

Also applies to: 80-85, 111-116, 124-136

taskfiles/test.yaml (2)

206-227: Consider using {{.ROOT_DIR}} for MariaDB scripts too (defensive against cwd changes).

Even if Task runs internal tasks from a predictable cwd, using absolute-from-root paths here makes the lifecycle more robust.

Proposed diff
-        tools/scripts/mariadb/start.py \
+        {{.ROOT_DIR}}/tools/scripts/mariadb/start.py \
@@
-        tools/scripts/mariadb/wolf/init_db.py \
+        {{.ROOT_DIR}}/tools/scripts/mariadb/wolf/init_db.py \

18-43: Optional: reduce duplication by wrapping “with-mariadb” lifecycle once.

The repeated blocks (env + vars + defer + start-mariadb) across test tasks are easy to drift; consider a single internal task that takes “test command” as a var, or a shared included Taskfile.

Also applies to: 44-68, 77-114, 144-170, 179-205

📜 Review details

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f03981e and b21e361.

📒 Files selected for processing (6)
  • taskfiles/lint.yaml
  • taskfiles/test.yaml
  • tools/scripts/get_free_port.py
  • tools/scripts/mariadb/start.py
  • tools/scripts/mariadb/stop.py
  • tools/scripts/mariadb/wolf/init_db.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • taskfiles/lint.yaml
🧰 Additional context used
🧬 Code graph analysis (2)
tools/scripts/mariadb/stop.py (1)
tools/scripts/mariadb/start.py (1)
  • main (23-144)
tools/scripts/mariadb/start.py (3)
tools/scripts/get_free_port.py (1)
  • main (21-26)
tools/scripts/mariadb/stop.py (1)
  • main (20-55)
tools/scripts/mariadb/wolf/init_db.py (1)
  • main (225-268)
🪛 Ruff (0.14.10)
tools/scripts/get_free_port.py

24-24: Unused noqa directive (non-enabled: T201)

Remove unused noqa directive

(RUF100)

tools/scripts/mariadb/stop.py

34-34: subprocess call: check for execution of untrusted input

(S603)


50-50: subprocess call: check for execution of untrusted input

(S603)

tools/scripts/mariadb/start.py

68-68: subprocess call: check for execution of untrusted input

(S603)


80-80: subprocess call: check for execution of untrusted input

(S603)


111-111: subprocess call: check for execution of untrusted input

(S603)


124-124: subprocess call: check for execution of untrusted input

(S603)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: lint
  • GitHub Check: tests (ubuntu-22.04)
  • GitHub Check: tests (ubuntu-24.04)
🔇 Additional comments (3)
tools/scripts/get_free_port.py (1)

11-18: Be aware: “free port” selection is TOCTOU and can flake under concurrency.

This can race (port is freed immediately after Line 18). If you see sporadic “address already in use” from docker, consider adding a retry strategy around container start (or move to docker-assigned random host ports and then query the mapped port).

tools/scripts/mariadb/stop.py (1)

34-39: Subprocess S603: looks fine (no shell), but confirm container name input is trusted in your usage.

Since args.name is CLI-provided, confirm your CI/task usage doesn’t pass attacker-controlled values (even though this is a list-form call, not shell).

Also applies to: 50-50

tools/scripts/mariadb/start.py (1)

68-68: Subprocess S603 warnings: likely acceptable (no shell), but confirm this is intended.

If you want to silence these, add targeted # noqa: S603 on the specific calls (or document why this is safe in-repo).

Also applies to: 80-80, 111-111, 124-124

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (2)
tools/scripts/mariadb/stop.py (2)

25-31: Consider clarifying the help text.

The help text refers to "the name of the started MariaDB container", which is slightly awkward phrasing for a stop script. Consider simplifying to "the name of the MariaDB container".

♻️ Suggested refinement
     parser.add_argument(
         "--name",
         type=str,
         default="mariadb-spider-dev",
-        help="The name of the started MariaDB container (default: %(default)s)",
+        help="The name of the MariaDB container to stop (default: %(default)s)",
     )

40-42: Improve the warning message accuracy.

The warning message "Container '%s' doesn't exist" is misleading when returncode == 0 but the container is stopped (stdout != "true"). In that case, the container exists but is not running. Consider making the message more accurate to cover both scenarios.

♻️ Suggested refinement
     if result.returncode != 0 or result.stdout.rstrip("\n") != "true":
-        logger.warning("Container '%s' doesn't exist. Exit peacefully.", args.name)
+        logger.warning("Container '%s' is not running or doesn't exist. Nothing to stop.", args.name)
         return 0
📜 Review details

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b21e361 and d0b32cc.

📒 Files selected for processing (1)
  • tools/scripts/mariadb/stop.py
🧰 Additional context used
🧬 Code graph analysis (1)
tools/scripts/mariadb/stop.py (2)
tools/scripts/mariadb/wolf/init_db.py (1)
  • main (225-268)
tools/scripts/mariadb/start.py (1)
  • main (23-144)
🪛 Ruff (0.14.10)
tools/scripts/mariadb/stop.py

34-34: subprocess call: check for execution of untrusted input

(S603)


50-50: subprocess call: check for execution of untrusted input

(S603)

🔇 Additional comments (5)
tools/scripts/mariadb/stop.py (5)

1-5: LGTM!

The shebang and script metadata follow the project conventions and are consistent with the companion start.py script.


7-17: LGTM!

The imports and logging configuration are appropriate for this script's functionality.


44-55: LGTM!

The container stopping logic is well-implemented with proper error handling and clear logging.


58-59: LGTM!

The entry point follows standard Python conventions.


34-55: Static analysis warnings are false positives.

The Ruff S603 warnings about subprocess calls are false positives in this context. The docker_executable is hardcoded (line 23), and commands are constructed as lists rather than shell strings. The args.name parameter is expected user input for specifying which container to stop, which is the intended functionality of this script.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants