Memory Aware Caching in Lighthouse #7803
base: unstable
Conversation
beacon_node/store/src/state_cache.rs
Outdated
self.states.put(state_root, (state_root, state.clone()))
{
    deleted_states.push(deleted_state_root);
    self.cached_bytes = self
        .cached_bytes
        .saturating_sub(removed_state.memory_size());
@michaelsproul can comment but can you just count the size of a single state? Does memory_size only count the size of the data owned by the first copy of the Arc?
Can you add timer metrics on memory_size to measure how expensive it is in terms of time?
Will do so. And the memory_size implementation using Milhouse's MemoryTracker only counts the data owned by an Arc once, if that's what you meant? It does not overcount the shared Arc data.
The memory_size of a state is probably larger than the amount of space we save by deleting it, because some of its memory might be shared with other states.

I don't think it's feasible to store enough information to update the cached_bytes after removing a state – we need to just recompute the cache size. This is one of the challenges of this approach. You fundamentally don't know how much memory you will recover unless you recount.

Maybe I'm wrong though? Maybe we could change the order in which we count states so that the ones most likely to be pruned are tracked last. Then a state's differential size would be approximately how much memory we'd be able to recover by deleting it.
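A toy model of that structural-sharing effect (illustrative only; the tracker below is not Milhouse's actual MemoryTracker API, just a sketch of the differential-size idea): an allocation shared through an Arc is charged to whichever state is tracked first, so a later state's differential size says little about how much memory its deletion would actually free.

```rust
use std::collections::HashSet;
use std::sync::Arc;

/// Toy tracker: an allocation shared via `Arc` is only charged to the first
/// "state" that is tracked, so later states report a smaller differential size.
#[derive(Default)]
struct ToyTracker {
    seen: HashSet<usize>, // addresses of allocations already counted
}

impl ToyTracker {
    /// Differential size of `subtree`: its length if this allocation has not
    /// been counted yet, 0 otherwise.
    fn track(&mut self, subtree: &Arc<Vec<u8>>) -> usize {
        let addr = Arc::as_ptr(subtree) as usize;
        if self.seen.insert(addr) {
            subtree.len()
        } else {
            0
        }
    }
}

fn main() {
    let shared = Arc::new(vec![0u8; 1_000_000]); // shared between two "states"
    let state_a = vec![shared.clone()];
    let state_b = vec![shared.clone()];

    let mut tracker = ToyTracker::default();
    let size_a: usize = state_a.iter().map(|t| tracker.track(t)).sum();
    let size_b: usize = state_b.iter().map(|t| tracker.track(t)).sum();

    // state_b's differential size is 0: deleting it frees (almost) nothing,
    // because state_a still holds the shared allocation.
    println!("a = {size_a}, b = {size_b}, total = {}", size_a + size_b);
}
```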
fd74766 to 257938d
… iteratively track memory usage
05166bd to d758cdb
@@ -10,7 +10,7 @@ pub const DEFAULT_RE_ORG_MAX_EPOCHS_SINCE_FINALIZATION: Epoch = Epoch::new(2);
/// Default to 1/12th of the slot, which is 1 second on mainnet.
pub const DEFAULT_RE_ORG_CUTOFF_DENOMINATOR: u32 = 12;
pub const DEFAULT_FORK_CHOICE_BEFORE_PROPOSAL_TIMEOUT: u64 = 250;

pub const DEFAULT_STATE_CACHE_MAX_BYTES: usize = 536870912;
Just a stylistic thing, but we could write this constant as 512 * (1 << 20), perhaps with an intermediate constant like const MEGABYTE: usize = 1 << 20.
Or 512 * 1024 * 1024 is even better!
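For concreteness, the two suggested spellings of the same value (a sketch; MEGABYTE is a hypothetical helper constant, not existing Lighthouse code):

```rust
// Option A: intermediate constant.
const MEGABYTE: usize = 1 << 20;
pub const DEFAULT_STATE_CACHE_MAX_BYTES: usize = 512 * MEGABYTE;

// Option B: plain multiplication; both equal 536870912 bytes (512 MiB).
// pub const DEFAULT_STATE_CACHE_MAX_BYTES: usize = 512 * 1024 * 1024;
```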
This config can probably be deleted actually, as you've got the default in the store config
.long("state-cache-max-bytes") | ||
.value_name("BYTES") | ||
.help("Specifies the maximum size of the state cache in bytes") | ||
.default_value("536870912") |
This CLI param should probably be in MiB, as I doubt it's useful to set a finer granularity than 1 MiB.
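One possible shape of that change (hypothetical flag name, help text, and helper; shown only to illustrate denominating the limit in MiB and converting once when building the config):

```rust
use clap::Arg;

// Hypothetical MiB-denominated flag; the parsed value would be multiplied
// by 1024 * 1024 when populating the store config.
fn state_cache_max_mb_arg() -> Arg {
    Arg::new("state-cache-max-mb")
        .long("state-cache-max-mb")
        .value_name("MEGABYTES")
        .help("Maximum size of the state cache in MiB")
        .default_value("512")
}
```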
@@ -377,6 +377,12 @@ pub fn get_config<E: EthSpec>(
        .map_err(|_| "state-cache-size is not a valid integer".to_string())?;
}

if let Some(max_bytes) = cli_args.get_one::<String>("state-cache-max-bytes") {
parse_optional will be slightly less verbose here.
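Roughly what that would look like (a sketch; it assumes Lighthouse's clap_utils::parse_optional helper, which parses an optional flag into a typed value, and the max_state_cache_bytes field this PR adds to the store config):

```rust
// Parses the flag straight into a usize, replacing get_one::<String> plus a manual parse/map_err.
if let Some(max_bytes) = clap_utils::parse_optional::<usize>(cli_args, "state-cache-max-bytes")? {
    client_config.store.max_state_cache_bytes = max_bytes;
}
```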
}

fn subtrees(&self) -> Vec<&dyn MemorySize> {
    vec![]
You're missing all of the fields of the BeaconState, i.e. the interesting parts. See my previous draft PR:
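Presumably something along these lines is intended (a sketch against this PR's MemorySize trait; the chosen fields, the accessor names, and the assumption that their types implement MemorySize would all need checking):

```rust
fn subtrees(&self) -> Vec<&dyn MemorySize> {
    // The large, structurally shared containers are the interesting parts.
    vec![
        self.0.validators(),
        self.0.balances(),
        self.0.block_roots(),
        self.0.state_roots(),
        self.0.randao_mixes(),
    ]
}
```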
fn memory_size(&self) -> usize {
    let wrapper = BeaconStateWrapper(self);
    // Timer for MemorySize
    let timer = Instant::now();
You can do this using metrics::start_timer and drop (no need for manual Instant use).
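A sketch of that pattern (the metric constant and the measure_with_tracker call are placeholders; the point is that the timer guard returned by metrics::start_timer records the elapsed time when it is dropped):

```rust
fn memory_size(&self) -> usize {
    // Guard-based timing: no manual Instant::now()/elapsed() bookkeeping.
    let timer = metrics::start_timer(&BEACON_STATE_MEMORY_SIZE_CALCULATION_TIME);
    let wrapper = BeaconStateWrapper(self);
    let size = measure_with_tracker(&wrapper); // placeholder for the MemoryTracker walk
    drop(timer); // record now, or simply let the guard fall out of scope
    size
}
```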
self.cached_bytes = self.cached_bytes.saturating_sub(
    self.states
        .peek(state_root)
        .map_or(0, |(_, state)| state.memory_size()),
I don't think we can measure the size of a removed state like this, because memory_size as written will just calculate the size of this state in isolation, which ignores the structural sharing.

I think it's probably better just to not even try to update this on removal, because it would require a recompute. We can just wait for the next recompute (or trigger a recompute after deletion).
Yeah, looks like you're already doing multiple recalcs inside cull, so this can just be deleted.
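In other words, deletion could simply invalidate the cached total rather than adjust it (a sketch using the needs_remeasure flag this PR introduces; the method body is illustrative, not the PR's code):

```rust
fn delete_state(&mut self, state_root: &Hash256) {
    self.states.pop(state_root);
    // Shared Arcs mean we can't know how much memory this actually freed,
    // so don't subtract anything; just remeasure lazily at the next recompute.
    self.needs_remeasure = true;
}
```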
}

// Recalculate the memory size after culling.
let (new_total_bytes, elapsed_time) = self.measure_cached_memory_size();
Doing multiple recalcs in a loop might be a bit slow, but we can assess once MemorySize is implemented for real.
i.e. I expect the calculation to take a lot longer once we actually examine the fields of the BeaconState, per this issue:
WIP

Issue Addressed
Fixes #7449

Changes

Memory tracking
- Added BeaconStateMemorySize (in store/memsize.rs), which wraps a BeaconState in BeaconStateWrapper and measures it with Milhouse's MemoryTracker.
- Added a metric beacon_state_memory_size_calculation_time to time memory sizing calls.

State cache API/behavior
- StateCache now has:
  - max_cached_bytes: configurable byte ceiling for the hot state cache.
  - cached_bytes: last measured total (not incrementally maintained).
  - put_count: counter to throttle re-measurement.
  - needs_remeasure: boolean flag indicating whether memory_size needs to be remeasured.
- put_state: every RECOMPUTE_INTERVAL inserts, calls the full re-measurement routine if needs_remeasure is true.
- delete_state and related paths:
- Added measure_cached_memory_size(): runs MemoryTracker across all cached states and sums differential_size to avoid double-counting shared Arcs.
- Added recompute_cached_bytes(): while over max_cached_bytes, prune in batches (uses headroom, clamped to 5–64), then re-measure, looping until under the limit. Updates store_beacon_state_cache_memory_size and logs measurement time.

Config / CLI / metrics
- New CLI flag --state-cache-max-bytes (default 512 MiB).
- Added DEFAULT_STATE_CACHE_MAX_BYTES.
- HotColdDB updates the memory gauge with state_cache.cached_bytes().
- The store config gains max_state_cache_bytes.
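Putting the pieces above together, a condensed sketch of the bookkeeping described in this summary (field and method names follow the description; RECOMPUTE_INTERVAL's value, the batch size, and all types are simplified stand-ins, not the PR's actual code):

```rust
use std::num::NonZeroUsize;
use lru::LruCache;

/// Illustrative stand-ins for the real types.
type Hash256 = [u8; 32];
struct BeaconState;
impl BeaconState {
    /// Placeholder: the PR measures this with Milhouse's MemoryTracker.
    fn memory_size(&self) -> usize { 0 }
}

const RECOMPUTE_INTERVAL: usize = 32; // assumed throttle; the real value may differ

struct StateCache {
    states: LruCache<Hash256, BeaconState>,
    max_cached_bytes: usize, // byte ceiling for the hot state cache
    cached_bytes: usize,     // last measured total, not incrementally maintained
    put_count: usize,        // throttles full re-measurement
    needs_remeasure: bool,   // set on insert/delete; cleared after measuring
}

impl StateCache {
    fn put_state(&mut self, root: Hash256, state: BeaconState) {
        self.states.put(root, state);
        self.put_count += 1;
        self.needs_remeasure = true;
        // Only re-measure every RECOMPUTE_INTERVAL inserts, since a full
        // differential measurement walks every cached state.
        if self.put_count % RECOMPUTE_INTERVAL == 0 && self.needs_remeasure {
            self.recompute_cached_bytes();
        }
    }

    fn measure_cached_memory_size(&self) -> usize {
        // Real version: run MemoryTracker across all states and sum
        // differential_size so shared Arcs are only counted once.
        self.states.iter().map(|(_, s)| s.memory_size()).sum()
    }

    fn recompute_cached_bytes(&mut self) {
        self.cached_bytes = self.measure_cached_memory_size();
        // Prune in batches until back under the limit, re-measuring each time,
        // because evicting a state frees an unknown amount of shared memory.
        while self.cached_bytes > self.max_cached_bytes {
            let batch = 5; // real code derives this from `headroom`, clamped to 5–64
            for _ in 0..batch {
                if self.states.pop_lru().is_none() {
                    break;
                }
            }
            self.cached_bytes = self.measure_cached_memory_size();
        }
        self.needs_remeasure = false;
    }
}

fn main() {
    let mut cache = StateCache {
        states: LruCache::new(NonZeroUsize::new(128).unwrap()),
        max_cached_bytes: 512 * 1024 * 1024,
        cached_bytes: 0,
        put_count: 0,
        needs_remeasure: false,
    };
    cache.put_state([0u8; 32], BeaconState);
}
```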