Conversation

@mraszyk commented Jan 6, 2026

WORK IN PROGRESS!

github-actions bot added the feat label on Jan 6, 2026
hash_tree: None,
certified_state_hash: Some(certification.signed.content.hash.clone().get()),
certification: Some(certification),
certification_requested_at: Instant::now(),
Contributor Author:

This is a bit ugly...

Contributor Author:

We could clean it up by defining the certification field as an enum:

enum CertificationRequest {
    Requested(Instant),
    Delivered(Certification),
}

Contributor:

This field seems to be poorly named, because we set it to the time when we computed the hash tree. IMO it would make more sense to fold it into the Option of hash_tree: if you never computed a hash tree, then you never requested certification either. Previously we always computed the hash tree, and it was only an Option as a workaround for initialization.
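
One way to express that, as a rough sketch with placeholder types (HashTree, CryptoHash, and Certification are stubs here standing in for the real state-manager types):

```rust
use std::time::Instant;

// Stubs standing in for the real state-manager types.
struct HashTree;
struct CryptoHash(Vec<u8>);
struct Certification;

struct CertificationMetadata {
    // `Some` only once the hash tree has been computed, which is also the
    // point at which certification is requested; the `Instant` records when.
    hash_tree: Option<(Instant, HashTree)>,
    certified_state_hash: Option<CryptoHash>,
    certification: Option<Certification>,
}
```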

},
)
(None, Some(hash)) => Some((height, hash)),
(None, None) => None,
Contributor Author:

We always end up here for heights between the current DSM height and the latest certified height of the subnet, i.e., certifications for those heights are not available in the pool.

Contributor:

By returning None here we are filtering out all heights that don't have a hash, which means they will not be part of the state_hashes_to_certify vector. Later in the function we only validate artifacts for heights that are part of this vector.

Contributor Author:

Since we don't have a hash for those heights, we could only easily execute this part:

        let certifications = state_hashes_to_certify
            .iter()
            .flat_map(|(height, _)| self.aggregate(certification_pool, *height))
            .collect::<Vec<_>>();
    
        if !certifications.is_empty() {
            self.metrics
                .certifications_aggregated
                .inc_by(certifications.len() as u64);
            trace!(
                &self.log,
                "Aggregated {} threshold-signatures in {:?}",
                certifications.len(),
                start.elapsed()
            );
            return certifications
                .into_iter()
                .map(ChangeAction::AddToValidated)
                .collect();
        }

but not

        let change_set = self.validate(certification_pool, &state_hashes_to_certify);

Is it enough to execute the former or do we also need to execute the latter? If we also need to execute the latter, I'd refactor the functions validate_share and validate_certification to take hash: &Option<CryptoHashOfPartialState> and skip the hash check if there's no hash available, right?
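
A self-contained sketch of that refactor, with the real types and the signature check stubbed out; the only point it illustrates is that the hash comparison becomes optional:

```rust
#[derive(PartialEq, Eq, Clone)]
struct CryptoHashOfPartialState(Vec<u8>);

struct CertificationShare {
    hash: CryptoHashOfPartialState,
}

enum Verdict {
    Valid,
    Invalid(&'static str),
}

fn validate_share(
    share: &CertificationShare,
    expected_hash: &Option<CryptoHashOfPartialState>,
    verify_signature: impl Fn(&CertificationShare) -> bool,
) -> Verdict {
    // Only compare hashes when we actually have one for this height.
    if let Some(expected) = expected_hash {
        if &share.hash != expected {
            return Verdict::Invalid("share certifies a different state hash");
        }
    }
    // Without a local hash we cannot compare, so we fall through to the
    // signature check alone.
    if verify_signature(share) {
        Verdict::Valid
    } else {
        Verdict::Invalid("invalid signature")
    }
}
```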

Contributor:

It looks to me like you have to make the validate changes. Otherwise you would never validate the certification shares, so the aggregation code would also never have enough validated shares to aggregate.

Contributor:

Yes, exactly; I think we briefly touched on that in this comment.

@mraszyk commented Jan 6, 2026

StateManagerImpl::latest_certified_state must be fixed to avoid:

Jan 06 19:13:28 xlosm-dkxng-5wd4k-ypkhc-diu32-fdmiq-a7v3z-6lq4w-e2vhz-k53qh-4ae orchestrator[2650]: {"log_entry":{"level":"WARN","utc_time":"2026-01-06T19:13:28.637Z","message":"Certified state at height 798 not available.","crate_":"ic_state_manager","module":"ic_state_manager","line":1787,"node_id":"xlosm-dkxng-5wd4k-ypkhc-diu32-fdmiq-a7v3z-6lq4w-e2vhz-k53qh-4ae","subnet_id":"xwdyl-b3fzw-77ibn-xold7-ryjvs-ehnqy-kgz4p-4wrvk-66s2f-hxhrs-pae"}}

}

let tip_height = self.tip_height.load(Ordering::Relaxed);
let last_certification_height_to_keep = min(last_height_to_keep, Height::new(tip_height));
Contributor:

I think this should not be the tip height but the latest_certified_height. We always want to keep the latest height where we have everything (hash tree/certification/state), so that the height at which we answer queries doesn't go backwards. Even worse, we don't want to fall back into a situation where we have no certified states at all. Specifically, we want to protect whatever height is returned from latest_certified_state.

Furthermore, in this function I believe that self.latest_certified_height is not updated correctly. It should always be the height for latest_certified_state, but here we only check for the presence of a certification (instead of certification + hash tree).
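
Concretely, the two lines shown above would become something like the following, assuming a latest_certified_height atomic analogous to tip_height (the exact field name may differ):

```rust
let latest_certified_height = self.latest_certified_height.load(Ordering::Relaxed);
let last_certification_height_to_keep =
    min(last_height_to_keep, Height::new(latest_certified_height));
```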

.certifications_metadata
.entry(height)
.or_insert_with(|| {
Self::compute_certification_metadata(&self.metrics, &self.log, &state)
Contributor:

This is what is tripping up latest_certified_state. Here we populate metadata.hash, and then later consensus might call deliver_state_certification. We then have a hash tree and a certification, but no corresponding state in self.snapshots.

I think there are two fixes:

  1. Here, only populate metadata.certified_state_hash, but not metadata.hash_tree.
  2. Populate both, but rewrite latest_certified_state so that it doesn't assume that having a hash tree and a certification implies we also have a state (sketched below).

Either way, you then also need to fix how we update self.latest_certified_height in deliver_state_certification. It should only be updated to heights where we have hash tree+certification+state. Whether you pick (1) or (2) also affects how you need to update self.latest_certified_height in remove_states_below.
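
A self-contained sketch of the selection logic behind option (2), with the real StateManagerImpl fields reduced to plain maps/sets; a height only counts as certified if hash tree, certification, and in-memory snapshot are all present:

```rust
use std::collections::{BTreeMap, BTreeSet};

// Stand-ins for the real hash tree and certification types.
struct Metadata {
    hash_tree: Option<()>,
    certification: Option<()>,
}

/// Highest height that has a hash tree, a certification, *and* an in-memory
/// snapshot; latest_certified_state (and latest_certified_height) should only
/// ever refer to such heights.
fn latest_fully_certified_height(
    certifications_metadata: &BTreeMap<u64, Metadata>,
    snapshot_heights: &BTreeSet<u64>,
) -> Option<u64> {
    certifications_metadata
        .iter()
        .rev()
        .find_map(|(height, meta)| {
            let fully_certified = meta.hash_tree.is_some()
                && meta.certification.is_some()
                && snapshot_heights.contains(height);
            fully_certified.then_some(*height)
        })
}
```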


let latest_subnet_certified_height =
    self.latest_subnet_certified_height.load(Ordering::Relaxed);
if matches!(scope, CertificationScope::Metadata)
Contributor:

It would be nice to have some metrics about how often we skip steps due to this new logic: specifically, how often we skip both cloning and hashing, how often we do them anyway due to the is_multiple_of(10) rule, and how often we hash due to missing certifications.
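
For example, a single counter vector in the style of the existing state-manager metrics could cover all three cases (the metric name and labels here are hypothetical):

```rust
use prometheus::{IntCounterVec, Opts};

fn certification_decision_metric() -> IntCounterVec {
    IntCounterVec::new(
        Opts::new(
            "state_manager_metadata_certification_decisions_total",
            "Decisions taken for CertificationScope::Metadata heights: \
             skipped cloning+hashing, forced by the is_multiple_of(10) rule, \
             or hashed because a certification was missing.",
        ),
        &["decision"],
    )
    .expect("valid metric description")
}

// Usage at the decision points, e.g.:
//   metric.with_label_values(&["skipped"]).inc();
//   metric.with_label_values(&["forced_every_10"]).inc();
//   metric.with_label_values(&["missing_certification"]).inc();
```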

states.tip = Some((height, state));
self.tip_height.store(height.get(), Ordering::Relaxed);
return;
}
Contributor:

The is_multiple_of(10) case is not handled quite correctly:

  1. If we already have a CertificationMetadata, we shouldn't just overwrite it. Instead, we want to preserve the certification if it has one; otherwise we rely on consensus being able to serve it to us again (see the sketch after this list).

  2. We already check whether a CertificationMetadata exists and assert that the hash is the same. Until now, this could only trigger if a state sync finished around the same time as execution. But now this is how we detect divergences in the is_multiple_of(10) case, so we should strengthen it a bit and do the same as deliver_state_certification does on divergence. Mainly, there we call create_diverged_state_marker to log the divergence on disk.
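
A rough, self-contained sketch of both points, with the real metadata type reduced to the two relevant fields; the Diverged branch is where create_diverged_state_marker would be called:

```rust
use std::collections::btree_map::Entry;
use std::collections::BTreeMap;

#[derive(PartialEq, Eq)]
struct StateHash(Vec<u8>);

struct Certification;

struct CertificationMetadata {
    certified_state_hash: StateHash,
    certification: Option<Certification>,
}

enum Outcome {
    Inserted,
    // Preserves any certification already delivered by consensus.
    KeptExisting,
    // Caller should create a diverged state marker, as
    // deliver_state_certification does.
    Diverged,
}

fn insert_certification_metadata(
    metadata: &mut BTreeMap<u64, CertificationMetadata>,
    height: u64,
    fresh: CertificationMetadata,
) -> Outcome {
    match metadata.entry(height) {
        Entry::Vacant(slot) => {
            slot.insert(fresh);
            Outcome::Inserted
        }
        Entry::Occupied(entry) => {
            if entry.get().certified_state_hash != fresh.certified_state_hash {
                // The freshly computed state disagrees with what we already
                // have for this height: treat it as a divergence instead of
                // only asserting.
                Outcome::Diverged
            } else {
                // Same hash: keep the existing entry (and its certification,
                // if any) rather than overwriting it.
                Outcome::KeptExisting
            }
        }
    }
}
```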

Contributor Author:

> Until now, this could only trigger if a state sync finished around the same time as execution.

Why don't we log this divergence on disk already?

Contributor:

I think it's just an oversight. But because it was basically dead code it didn't matter.
