@tanujnay112 tanujnay112 commented Oct 8, 2025

Description of changes

Summarize the changes made by this PR.
This is a duplicate of PR 5547 as that branch got corrupted.

  • Improvements & Bug fixes
    • Added HTTP routes and client changes to add and remove tasks.
  • New functionality
    • ...

Test plan

How are these changes tested?

  • Tests pass locally with pytest for python, yarn test for js, cargo test for rust

Migration plan

Are there any migrations, or any forwards/backwards compatibility changes needed in order to make sure this change deploys reliably?

Observability plan

What is the plan to instrument and monitor this change?

Documentation Changes

Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the docs section?

github-actions bot commented Oct 8, 2025

Reviewer Checklist

Please leverage this checklist to ensure your code review is thorough before approving

Testing, Bugs, Errors, Logs, Documentation

  • Can you think of any use case in which the code does not behave as intended? Have they been tested?
  • Can you think of any inputs or external events that could break the code? Is user input validated and safe? Have they been tested?
  • If appropriate, are there adequate property based tests?
  • If appropriate, are there adequate unit tests?
  • Should any logging, debugging, tracing information be added or removed?
  • Are error messages user-friendly?
  • Have all documentation changes needed been made?
  • Have all non-obvious changes been commented?

System Compatibility

  • Are there any potential impacts on other parts of the system or backward compatibility?
  • Does this change intersect with any items on our roadmap, and if so, is there a plan for fitting them together?

Quality

  • Is this code of an unexpectedly high quality (Readability, Modularity, Intuitiveness)?

@tanujnay112 tanujnay112 changed the title from "[ENH]: Add API endpoints for create/remove tasks" to "[ENH]: Add API endpoints for Task management" Oct 8, 2025
@tanujnay112 tanujnay112 marked this pull request as ready for review October 8, 2025 19:17
propel-code-bot bot commented Oct 8, 2025

Add API Endpoints for Task Management (Create/Remove Task)

This pull request introduces new API endpoints and supporting infrastructure for registering and removing asynchronous tasks that process collections in Chroma. It defines the client, HTTP, Rust, and Go backend interfaces for task creation and task deletion, including support in the protobuf definitions and coordinator layer. The implementation makes these endpoints available in both the Rust frontend and Python/JS clients, adds validation, and ensures correct propagation of errors. Extensive integration and property-based Python tests demonstrate correct task registration, error handling (e.g., duplicate tasks, bad operators), and removal logic.

Key Changes

• Added new REST endpoints and handlers on /api/v2/.../collections/{collection_id}/tasks/{create,delete} for task creation and task removal in rust/frontend/src/server.rs.
• Introduced CreateTaskRequest, RemoveTaskRequest, CreateTaskResponse, RemoveTaskResponse, and matching Rust, Python, and TypeScript models.
• Extended the OpenAPI/TypeScript client with new task-related endpoints and type definitions.
• Updated Python client and Collection object to provide create_task and remove_task methods; documented and validated all arguments.
• Implemented sysdb gRPC, Go backend logic, and end-to-end database persistence for task metadata, parameters (as JSON/struct), and error propagation.
• Round-tripped operator parameters using protobuf Struct for operator-specific configuration, with validation and error handling.
• Added property-based and distributed integration tests to validate task creation, duplicate task prevention, operator existence checks, multiple task logic, and edge cases.
• Bumped Rust, Go, and protobuf dependencies where needed and updated codegen.
• Added a config option to control the minimum record threshold for task triggering.

Affected Areas

• idl/chromadb/proto/coordinator.proto (proto/model updates)
• rust/frontend/src/server.rs (routing, handler, OpenAPI)
• rust/frontend/src/impls/service_based_frontend.rs (endpoint logic, sysdb bridge)
• rust/sysdb/src/sysdb.rs and related code (task persistence, gRPC handling)
• chromadb/api/models/Collection.py and chromadb/api/fastapi.py, chromadb/api/__init__.py (Python client/task API)
• chromadb/test/distributed/test_task_api.py (tests)
• clients/new-js/packages/chromadb/src/api/* (JS SDK/OpenAPI updates)
• go/pkg/sysdb/coordinator/task.go and related Go code (backend logic)

This summary was automatically generated by @propel-code-bot

@blacksmith-sh blacksmith-sh bot deleted a comment from tanujnay112 Oct 9, 2025
Search, Key, K, Knn, Val
)
Contributor

Are these from your editor? Or did you run the python formatter? I'm always suspicious of whitespace changes that would imply the formatter has changed or was not run.

Contributor Author

Editor. I left them in because I thought it was good to get rid of trailing whitespace.

task_name: Unique name for this task instance
operator_name: Built-in operator name (e.g., "record_counter")
output_collection_name: Name of the collection where task output will be stored
params: Optional dictionary with operator-specific parameters
Contributor

Is this blob just passed in to the operator as e.g. a JSON value? How does the operator get these?

Contributor Author

They are stored in the task definition as a JSON string that a TaskRunner receives and passes into the operator it executes for that Task.

"""Register a recurring task on a collection."""
import json

params_str = json.dumps(params) if params is not None else None
Contributor

Why nested JSON? An alternative that seems cleaner (but may have a con you've thought of) would be to make params a JSON object (it has to be anyway to round trip through JSON).
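
A minimal sketch of the two request shapes being compared, to make the trade-off concrete. The helper names `build_payload_nested` and `build_payload_object` are hypothetical and not part of the Chroma client, and the payload is trimmed to two fields for illustration.

```python
import json
from typing import Any, Dict, Optional


def build_payload_nested(task_name: str, params: Optional[Dict[str, Any]]) -> str:
    """Shape in the diff: params is serialized first, then embedded as a string."""
    params_str = json.dumps(params) if params is not None else None
    return json.dumps({"task_name": task_name, "params": params_str})


def build_payload_object(task_name: str, params: Optional[Dict[str, Any]]) -> str:
    """Suggested alternative: params stays a JSON object inside the request body."""
    return json.dumps({"task_name": task_name, "params": params})


params = {"threshold": 10}
nested = json.loads(build_payload_nested("count", params))
flat = json.loads(build_payload_object("count", params))

# The nested form needs a second decode before the receiver can read a field.
nested_threshold = json.loads(nested["params"])["threshold"]
# The object form is readable after a single decode.
flat_threshold = flat["params"]["threshold"]
```

With the nested form, every consumer has to know that `params` is a string holding JSON; with the object form, schema tooling can see the structure directly.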

from chromadb.errors import ChromaError, NotFoundError


def test_task_create_and_remove(basic_http_client: System) -> None:
Contributor

This test would pass if I changed the impls of create task and remove task to simply return static objects, right? Can we assert anything about the task existing and then not existing?
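
One way to make such a test sensitive to real state is to assert behavior that differs before and after each call. The sketch below runs against a hypothetical in-memory stand-in, `FakeTaskStore`, not the real client; its method signatures and exception names are assumptions for illustration only. In the PR, the equivalent signals would presumably be the duplicate-task and not-found errors mentioned in the summary.

```python
from typing import Dict, Tuple


class TaskAlreadyExists(Exception):
    """Raised when a task with the same name is already registered."""


class TaskNotFound(Exception):
    """Raised when removing a task that is not registered."""


class FakeTaskStore:
    """Hypothetical in-memory stand-in for the task API."""

    def __init__(self) -> None:
        self._tasks: Dict[Tuple[str, str], str] = {}

    def create_task(self, collection: str, task_name: str, operator_name: str) -> bool:
        key = (collection, task_name)
        if key in self._tasks:
            raise TaskAlreadyExists(task_name)
        self._tasks[key] = operator_name
        return True

    def remove_task(self, collection: str, task_name: str) -> bool:
        key = (collection, task_name)
        if key not in self._tasks:
            raise TaskNotFound(task_name)
        del self._tasks[key]
        return True


def check_create_and_remove(store) -> None:
    """Fails for any implementation that only returns static objects."""
    assert store.create_task("docs", "counter", "record_counter")

    # The task now exists, so registering the same name must be rejected.
    try:
        store.create_task("docs", "counter", "record_counter")
    except TaskAlreadyExists:
        pass
    else:
        raise AssertionError("duplicate create succeeded; task was never stored")

    assert store.remove_task("docs", "counter")

    # The task is gone, so a second removal must report that it is missing.
    try:
        store.remove_task("docs", "counter")
    except TaskNotFound:
        pass
    else:
        raise AssertionError("second remove succeeded; task was never removed")


check_create_and_remove(FakeTaskStore())
```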

)


def test_task_multiple_collections(basic_http_client: System) -> None:
Contributor

What about the adjacent case: a test_task_multiple_tasks that tries to register multiple tasks on the same collection?

    }: CreateTaskRequest,
) -> Result<CreateTaskResponse, AddTaskError> {
    // TODO: Make min_records_for_task configurable
    const DEFAULT_MIN_RECORDS_FOR_TASK: u64 = 100;
Contributor

Can we do this in this PR? It's just a few lines.

        // This is a client-side parsing issue, not a creation failure.
        let task_id =
            chroma_types::TaskUuid(uuid::Uuid::parse_str(&response.task_id).map_err(|e| {
                CreateTaskError::FailedToCreateTask(tonic::Status::internal(format!(
Contributor

It feels very awkward to me that we're returning a TypedError that throws away all typing to wrap a tonic::Status that formats a string.

What about:

enum CreateTaskError {
    NaturalLanguageReasonForFailure {
        task_id: Uuid,
    }
}

As written it's a create task error (so it failed) where the reason it failed is because it failed.
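
The same idea, sketched in Python for brevity: the error type names the specific failure and carries the offending value as a field, instead of folding it into a formatted string inside a generic wrapper. `CreateTaskError`, `InvalidTaskIdReturned`, and `parse_task_id` here are illustrative names, not the PR's types.

```python
import uuid


class CreateTaskError(Exception):
    """Base class for task-creation failures (illustrative)."""


class InvalidTaskIdReturned(CreateTaskError):
    """The server answered, but the task id it sent back is not a UUID."""

    def __init__(self, raw_task_id: str) -> None:
        super().__init__(f"server returned a task id that is not a UUID: {raw_task_id!r}")
        # Keep the raw value as a field so callers can inspect it without parsing a message.
        self.raw_task_id = raw_task_id


def parse_task_id(raw: str) -> uuid.UUID:
    """Parse the id from a response, raising a specific, typed error on failure."""
    try:
        return uuid.UUID(raw)
    except ValueError as exc:
        raise InvalidTaskIdReturned(raw) from exc
```

A caller can then branch on `InvalidTaskIdReturned` and read `raw_task_id` directly, rather than matching on message text.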

pub task_name: String,
pub operator_name: String,
pub output_collection_name: String,
pub params: Option<String>,
Contributor

Following up on my other point: What if this were a serde_json::Value rather than a string for the user to parse?

Comment on lines +1682 to +1696
        let response = self.client.create_task(req).await?.into_inner();

        // Parse the returned task_id - this should always succeed since the server generated it
        // If this fails, it indicates a serious server bug or protocol corruption
        let task_id = chroma_types::TaskUuid(
            uuid::Uuid::parse_str(&response.task_id).map_err(|e| {
                tracing::error!(
                    task_id = %response.task_id,
                    error = %e,
                    "Server returned invalid task_id UUID - task was created but response is corrupt"
                );
                CreateTaskError::ServerReturnedInvalidData
            })?,
        );

        Ok(task_id)
    }
Contributor

[BestPractice]

The current error handling for create_task doesn't correctly propagate the AlreadyExists error code from the gRPC service. The ? operator on line 1682 will convert any gRPC error status, including AlreadyExists, into a generic CreateTaskError::FailedToCreateTask. This causes the ServiceBasedFrontend to misinterpret it as an internal error instead of a specific AlreadyExists condition.

This is inconsistent with how get_task_by_name and delete_task_by_name handle specific error codes like NotFound. To fix this, you should match on the result of the gRPC call and explicitly handle the AlreadyExists status code.

Suggested change

-        let response = self.client.create_task(req).await?.into_inner();
-        // Parse the returned task_id - this should always succeed since the server generated it
-        // If this fails, it indicates a serious server bug or protocol corruption
-        let task_id = chroma_types::TaskUuid(
-            uuid::Uuid::parse_str(&response.task_id).map_err(|e| {
-                tracing::error!(
-                    task_id = %response.task_id,
-                    error = %e,
-                    "Server returned invalid task_id UUID - task was created but response is corrupt"
-                );
-                CreateTaskError::ServerReturnedInvalidData
-            })?,
-        );
-        Ok(task_id)
-    }
+        let response = self.client.create_task(req).await;
+        match response {
+            Ok(resp) => {
+                let inner = resp.into_inner();
+                // Parse the returned task_id - this should always succeed since the server generated it
+                // If this fails, it indicates a serious server bug or protocol corruption
+                let task_id = chroma_types::TaskUuid(
+                    uuid::Uuid::parse_str(&inner.task_id).map_err(|e| {
+                        tracing::error!(
+                            task_id = %inner.task_id,
+                            error = %e,
+                            "Server returned invalid task_id UUID - task was created but response is corrupt"
+                        );
+                        CreateTaskError::ServerReturnedInvalidData
+                    })?,
+                );
+                Ok(task_id)
+            }
+            Err(status) => {
+                if status.code() == tonic::Code::AlreadyExists {
+                    Err(CreateTaskError::AlreadyExists)
+                } else {
+                    Err(CreateTaskError::FailedToCreateTask(status))
+                }
+            }
+        }

File: rust/sysdb/src/sysdb.rs
Line: 1698
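
The mapping described in this comment can be sketched independently of tonic. The Python below is a language-neutral illustration, assuming a hypothetical `RpcError` that carries a status code; none of these names come from the PR or from a real gRPC library.

```python
from enum import Enum
from typing import Callable


class StatusCode(Enum):
    """Tiny stand-in for a gRPC status code set."""

    ALREADY_EXISTS = "already_exists"
    INTERNAL = "internal"


class RpcError(Exception):
    """Hypothetical transport error carrying a status code."""

    def __init__(self, code: StatusCode, message: str) -> None:
        super().__init__(message)
        self.code = code


class TaskAlreadyExists(Exception):
    """Specific, caller-visible condition."""


class CreateTaskFailed(Exception):
    """Catch-all for every other transport failure."""


def create_task(rpc_call: Callable[[str], str], task_name: str) -> str:
    """Call the RPC and translate the one status code callers care about."""
    try:
        return rpc_call(task_name)
    except RpcError as err:
        if err.code is StatusCode.ALREADY_EXISTS:
            # Preserve the specific condition instead of collapsing it.
            raise TaskAlreadyExists(task_name) from err
        raise CreateTaskFailed(str(err)) from err
```

Without the explicit branch, both failures would surface as `CreateTaskFailed`, which is the collapse the comment warns about.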

Comment on lines +46 to +51
        serde_json::Value::Null => Kind::NullValue(0),
        serde_json::Value::Bool(b) => Kind::BoolValue(b),
        serde_json::Value::Number(n) => {
            if let Some(f) = n.as_f64() {
                Kind::NumberValue(f)
            } else {
Contributor

[CriticalError]

Critical Bug: Potential task creation with corrupted parameters

The JSON-to-protobuf conversion in json_to_prost_value() has a silent failure case:

serde_json::Value::Number(n) => {
    if let Some(f) = n.as_f64() {
        Kind::NumberValue(f)
    } else {
        Kind::NullValue(0)  // Silent data corruption!
    }
}

If a JSON number can't be converted to f64 (e.g., very large integers), it silently becomes null instead of failing. This could cause tasks to be created with corrupted parameters, leading to runtime failures.

Fix: Return an error or use a more robust conversion:

serde_json::Value::Number(n) => {
    Kind::NumberValue(n.as_f64().ok_or_else(|| 
        ConversionError::InvalidNumber(n.to_string()))?)
}

File: rust/sysdb/src/sysdb.rs
Line: 51
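
The fail-loudly policy proposed here can be illustrated outside Rust. The sketch below is a Python analogue, not the PR's `json_to_prost_value()`: it maps decoded JSON values to tagged tuples named after the protobuf `Value` kinds and raises instead of substituting null when a number cannot be carried by a double. `json_to_kind` and `ConversionError` are hypothetical names, and which inputs actually make `as_f64()` return `None` in the PR's build is not established by this thread.

```python
import math
from typing import Any, Tuple


class ConversionError(ValueError):
    """Raised when a JSON value cannot be represented without loss."""


def json_to_kind(value: Any) -> Tuple[str, Any]:
    """Convert a decoded JSON value to a (kind, payload) pair, or raise."""
    if value is None:
        return ("null_value", None)
    # bool must be checked before int because bool is a subclass of int.
    if isinstance(value, bool):
        return ("bool_value", value)
    if isinstance(value, int):
        try:
            as_float = float(value)
        except OverflowError as exc:
            raise ConversionError(f"integer {value} is out of range for a double") from exc
        if int(as_float) != value:
            raise ConversionError(f"integer {value} is not exactly representable as a double")
        return ("number_value", as_float)
    if isinstance(value, float):
        if not math.isfinite(value):
            raise ConversionError(f"non-finite number {value!r}")
        return ("number_value", value)
    if isinstance(value, str):
        return ("string_value", value)
    if isinstance(value, list):
        return ("list_value", [json_to_kind(item) for item in value])
    if isinstance(value, dict):
        return ("struct_value", {key: json_to_kind(item) for key, item in value.items()})
    raise ConversionError(f"unsupported type: {type(value).__name__}")
```

The point of the design is that a bad parameter is rejected at task creation time, where the caller can fix it, rather than surfacing later as a null inside a running operator.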

Comment on lines 40 to +42
  if existingTask != nil {
-     log.Info("CreateTask: task already exists, returning existing")
-     taskID = existingTask.ID
-     return nil
+     log.Error("CreateTask: task already exists", zap.String("task_name", req.Name))
+     return common.ErrTaskAlreadyExists
Contributor

[BestPractice]

Logic Bug: Incorrect parameter handling

The CreateTask function has a logic error where it changes from returning existing tasks to rejecting duplicates, but the error handling pattern suggests it should still handle the case where a task might legitimately already exist:

if existingTask != nil {
    log.Error("CreateTask: task already exists", zap.String("task_name", req.Name))
    return common.ErrTaskAlreadyExists  // Always errors now
}

This change breaks idempotency. If this is intentional, the API should be documented as non-idempotent. If not, the original logic should be preserved for idempotent task creation.

File: go/pkg/sysdb/coordinator/task.go
Line: 42
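
The behavioral difference behind this comment is easiest to see from the point of view of a client that retries. The sketch below is a hypothetical in-memory registry in Python, not the Go coordinator; `create_idempotent` mirrors the removed branch (return the existing task) and `create_strict` mirrors the new one (reject duplicates).

```python
import uuid
from typing import Dict


class TaskAlreadyExists(Exception):
    """Raised by the strict variant when the name is already taken."""


class TaskRegistry:
    """Hypothetical in-memory registry used only to contrast the two policies."""

    def __init__(self) -> None:
        self._ids: Dict[str, uuid.UUID] = {}

    def create_idempotent(self, name: str) -> uuid.UUID:
        """Old behavior in the diff: a repeated create returns the existing id."""
        if name not in self._ids:
            self._ids[name] = uuid.uuid4()
        return self._ids[name]

    def create_strict(self, name: str) -> uuid.UUID:
        """New behavior in the diff: a repeated create is an error."""
        if name in self._ids:
            raise TaskAlreadyExists(name)
        self._ids[name] = uuid.uuid4()
        return self._ids[name]
```

Under the strict policy, a client that retries after a timeout has to treat `TaskAlreadyExists` as a possible success of its earlier attempt, which is why the comment asks for the non-idempotent contract to be documented if it is intentional.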


    res, err := s.coordinator.CreateTask(ctx, req)
    if err != nil {
        log.Error("CreateTask failed", zap.Error(err))
Contributor

[BestPractice]

For improved debuggability, consider adding more context to the error logs in this file. Including the task name when an operation fails can be very helpful. This pattern could be applied to GetTaskByName and DeleteTask as well.

For example:

File: go/pkg/sysdb/grpc/task_service.go
Line: 18

@blacksmith-sh blacksmith-sh bot deleted a comment from tanujnay112 Oct 11, 2025
@tanujnay112 tanujnay112 merged commit 15f4458 into main Oct 14, 2025
60 checks passed