@tanujnay112 tanujnay112 commented Nov 14, 2025

Description of changes

Summarize the changes made by this PR.

  • Improvements & Bug fixes
    • Disabled the S3heap service and removed all references to the S3Heap outside the s3heap module.
    • Removed all references to the next_nonce or lowest_live_nonce columns in the attached_functions table.
  • New functionality
    • ...

Test plan

How are these changes tested?

  • Tests pass locally with pytest for python, yarn test for js, cargo test for rust

Migration plan

Are there any migrations, or any forwards/backwards compatibility changes needed in order to make sure this change deploys reliably?

Observability plan

What is the plan to instrument and monitor this change?

Documentation Changes

Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the docs section?

@tanujnay112 tanujnay112 marked this pull request as ready for review November 14, 2025 21:30
@github-actions

Reviewer Checklist

Please leverage this checklist to ensure your code review is thorough before approving

Testing, Bugs, Errors, Logs, Documentation

  • Can you think of any use case in which the code does not behave as intended? Have they been tested?
  • Can you think of any inputs or external events that could break the code? Is user input validated and safe? Have they been tested?
  • If appropriate, are there adequate property based tests?
  • If appropriate, are there adequate unit tests?
  • Should any logging, debugging, tracing information be added or removed?
  • Are error messages user-friendly?
  • Have all documentation changes needed been made?
  • Have all non-obvious changes been commented?

System Compatibility

  • Are there any potential impacts on other parts of the system or backward compatibility?
  • Does this change intersect with any items on our roadmap, and if so, is there a plan for fitting them together?

Quality

  • Is this code of an unexpectedly high quality (Readability, Modularity, Intuitiveness)?

tanujnay112 commented Nov 14, 2025

@github-actions

⚠️ The Helm chart was updated without a version bump. Your changes will only be published if the version field in k8s/distributed-chroma/Chart.yaml is updated.

propel-code-bot bot commented Nov 14, 2025

Retire S3heap scheduler and purge nonce sequencing from task pipeline

This PR fully decommissions the experimental S3-backed task-ordering service ("S3heap") and removes the nonce-based sequencing mechanism that once coordinated attached-function execution. The storage schema, coordinator logic, protobuf contracts, build assets, and deployment manifests are all updated to operate without S3heap and nonce fields. This significantly simplifies the task pipeline, trims an entire micro-service from the fleet, and lowers operational complexity and cost, but introduces a required DB migration and minor breaking API changes.

Key Changes

• Disabled rust/s3heap-service binary and stripped its runtime entry-point
• Removed S3heap Helm/Tilt/CI artifacts; clusters no longer build or deploy the pod
• Dropped next_nonce and lowest_live_nonce columns via new Atlas SQL migration
• Deleted nonce fields from Rust types, Go DAOs, protobuf definitions, and coordinator logic
• Refactored garbage collector, task manager, and log orchestrator to rely on commit counts / task IDs
• Removed or updated all tests, mocks, and fixtures that referenced S3heap or nonces
• Cleaned up build targets and Docker/GitHub CI configs that referenced the retired service

Affected Areas

• Coordinator (Go)
• SysDB and related Rust task types
• Metastore / DAO layer
• Protobuf & gRPC contracts
• Kubernetes Helm charts, Tiltfile, CI pipelines
• Database schema (Atlas migration)
• Integration and unit tests

This summary was automatically generated by @propel-code-bot

Comment on lines 512 to 516
/// Note: This service is currently not fully functional due to nonce removal
pub async fn entrypoint() {
    eprintln!("Heap tender service is not currently implemented");
    eprintln!("The heap scheduling functionality was removed");
    std::process::exit(1);

[**BestPractice**]

**Dead code after unconditional process exit**: The `entrypoint()` function always exits with `std::process::exit(1)`, making it impossible to return normally. This prevents proper cleanup and makes the function signature (`async fn`) misleading.

```rust
// Current - unreachable return
pub async fn entrypoint() {
    eprintln!("Heap tender service is not currently implemented");
    eprintln!("The heap scheduling functionality was removed");
    std::process::exit(1);
} // implicit return is unreachable

// Better - explicit never return or remove async
pub fn entrypoint() -> ! {
    eprintln!("Heap tender service is not currently implemented");
    eprintln!("The heap scheduling functionality was removed");
    std::process::exit(1)
}
```

File: rust/s3heap-service/src/lib.rs
Line: 516
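
A possible middle ground between the two shapes above, offered only as a sketch: have the entrypoint report the disabled state as an error and let the caller choose the exit code, so that destructors in the caller still run. `ServiceDisabled` and `run` are hypothetical names that do not appear in this PR, and the function is synchronous here only to keep the example self-contained.

```rust
use std::fmt;
use std::process::ExitCode;

/// Hypothetical error type; not part of the PR.
#[derive(Debug, PartialEq, Eq)]
pub struct ServiceDisabled(pub &'static str);

impl fmt::Display for ServiceDisabled {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "service disabled: {}", self.0)
    }
}

impl std::error::Error for ServiceDisabled {}

/// Reports that the service is disabled without terminating the process,
/// so values owned by the caller are still dropped normally.
pub fn entrypoint() -> Result<(), ServiceDisabled> {
    Err(ServiceDisabled("the heap scheduling functionality was removed"))
}

/// Maps the entrypoint result to a process exit code.
pub fn run() -> ExitCode {
    match entrypoint() {
        Ok(()) => ExitCode::SUCCESS,
        Err(e) => {
            eprintln!("{e}");
            ExitCode::FAILURE
        }
    }
}
```

Under these assumptions, a binary could use `fn main() -> ExitCode { run() }` and keep the non-zero exit status without calling `std::process::exit` directly.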

Comment on lines 248 to +249
 pub async fn tend_to_heap(&self) -> Result<(), Error> {
-    let (witness, cursor, tended) = self.read_and_coalesce_dirty_log().await?;
-    if !tended.is_empty() {
-        let collection_ids = tended.iter().map(|t| t.0).collect::<Vec<_>>();
-        let scheduled = self
-            .sysdb
-            .clone()
-            .peek_schedule_by_collection_id(&collection_ids)
-            .await?;
-        let triggerables: Vec<Option<Schedule>> = scheduled
-            .into_iter()
-            .map(|s: ScheduleEntry| -> Result<_, Error> {
-                let triggerable = Triggerable {
-                    partitioning: s3heap::UnitOfPartitioningUuid::new(s.collection_id.0),
-                    scheduling: s3heap::UnitOfSchedulingUuid::new(s.attached_function_id),
-                };
-                if let Some(next_scheduled) = s.when_to_run {
-                    let schedule = Schedule {
-                        triggerable,
-                        next_scheduled,
-                        nonce: s.attached_function_run_nonce.0,
-                    };
-                    Ok(Some(schedule))
-                } else {
-                    Ok(None)
-                }
-            })
-            .collect::<Result<Vec<_>, _>>()?;
-        let triggerables: Vec<Schedule> = triggerables.into_iter().flatten().collect();
-        if !triggerables.is_empty() {
-            self.writer.push(&triggerables).await?;
-        }
-    }
-    if let Some(witness) = witness.as_ref() {
-        self.cursor
-            .save(&HEAP_TENDER_CURSOR_NAME, &cursor, witness)
-            .await?;
-    } else {
-        self.cursor
-            .init(&HEAP_TENDER_CURSOR_NAME, cursor.clone())
-            .await?;
-    }
-    Ok(())
+    Err(Error::Internal("Not implemented".to_string()))

[**BestPractice**]

**Unimplemented function returns error instead of using proper Rust patterns**: The `tend_to_heap()` function immediately returns an error string instead of using `unimplemented!()` or `todo!()` macros, which provide better stack traces and clearer intent.

```rust
// Current - generic error message
pub async fn tend_to_heap(&self) -> Result<(), Error> {
    Err(Error::Internal("Not implemented".to_string()))
}

// Better - explicit unimplemented with context
pub async fn tend_to_heap(&self) -> Result<(), Error> {
    unimplemented!("tend_to_heap removed after nonce removal - see TODO(tanujnay112)")
}
```

File: rust/s3heap-service/src/lib.rs
Line: 249
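
The two shapes in this comment behave differently at the call site, which may matter when choosing between them: returning `Err` hands the caller an ordinary value it can log or propagate, while `unimplemented!()` panics and unwinds. The sketch below illustrates that difference with synchronous stand-ins. The `Error` enum and the names `tend_to_heap_err`, `tend_to_heap_panic`, and `panics` are assumptions made for this example and are not the service's real definitions.

```rust
use std::panic;

/// Hypothetical stand-in for the service's error type; the real `Error`
/// in s3heap-service may look different.
#[derive(Debug, PartialEq, Eq)]
pub enum Error {
    Internal(String),
}

/// Shape used in the PR: the caller receives an ordinary `Err` value.
pub fn tend_to_heap_err() -> Result<(), Error> {
    Err(Error::Internal("Not implemented".to_string()))
}

/// Shape from the suggestion: the call panics instead of returning.
pub fn tend_to_heap_panic() -> Result<(), Error> {
    unimplemented!("tend_to_heap removed after nonce removal")
}

/// Returns true if calling `f` panics, false if it returns normally.
pub fn panics<F>(f: F) -> bool
where
    F: FnOnce() -> Result<(), Error> + panic::UnwindSafe,
{
    panic::catch_unwind(f).is_err()
}
```

With these definitions, `panics(tend_to_heap_err)` is false and `panics(tend_to_heap_panic)` is true, so a long-running caller that must survive a disabled code path would likely prefer the `Err` shape.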

assert!(found_ids.contains(&id));
}
}

[**BestPractice**]

These tests are being ignored. Since the `s3heap-service` is being gutted, this is understandable. However, it would be good practice to add a comment explaining why they are ignored (e.g., `#[ignore = "Heap tender service is deprecated and will be removed"]`) to provide context for future developers.

File: rust/s3heap-service/tests/test_k8s_integration_00_heap_tender.rs
Line: 333
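
For reference, a minimal sketch of what an ignore reason looks like on a test. The helper `heap_tender_available` and the test name are hypothetical and stand in for whatever the real integration tests exercise; the actual tests in this file may use a different test attribute or an async runtime.

```rust
/// Hypothetical helper standing in for the functionality the ignored
/// tests would exercise; not part of the PR.
pub fn heap_tender_available() -> bool {
    false
}

#[cfg(test)]
mod tests {
    use super::*;

    // The reason string is attached to the test itself, so it travels with
    // the code instead of living in a separate comment. The test is skipped
    // by default and can still be run with `cargo test -- --ignored`.
    #[test]
    #[ignore = "Heap tender service is deprecated and will be removed"]
    fn test_heap_tender_round_trip() {
        assert!(heap_tender_available());
    }
}
```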

@tanujnay112 tanujnay112 changed the title from "[CHORE]: Remove nonce-related code outside s3heap" to "[CHORE]: Disable S3heap service and remove nonce-related logic" Nov 17, 2025
@tanujnay112 tanujnay112 changed the base branch from refactor_compactor to graphite-base/5866 November 17, 2025 21:39
@tanujnay112 tanujnay112 changed the base branch from graphite-base/5866 to refactor_compactor November 17, 2025 22:36

@rescrv rescrv left a comment

I think you may have left stuff.

_, err = db.Exec(`UPDATE public.tasks SET lowest_live_nonce = NULL WHERE task_id = $1`, originalTaskID)
suite.NoError(err, "Should be able to corrupt task in database")
suite.T().Logf("Made task partial by setting lowest_live_nonce = NULL")
// TODO: Uncomment after proto regeneration

TODO(who); also, maybe you meant to do before review.

attached_function_soft_delete_absolute_cutoff_time,
);
self.prune_heap_across_shards(cutoff_time).await;
// let cutoff_time = chrono::DateTime::<chrono::Utc>::from(

Cut?

@tanujnay112 tanujnay112 changed the base branch from refactor_compactor to graphite-base/5866 November 18, 2025 20:33
@tanujnay112 tanujnay112 changed the base branch from graphite-base/5866 to main November 18, 2025 22:02
@tanujnay112 tanujnay112 merged commit c0575ef into main Nov 18, 2025
69 of 76 checks passed

3 participants