Summary
PyBuffer heap-allocates the Py_buffer struct via Box<RawPyBuffer>, which adds significant overhead for short-lived buffer access patterns. In profiling a non-cryptographic hash function, PyBuffer::get accounts for ~850ns per call, with malloc/free alone contributing ~780ns. For a hash that completes in ~60ns on small inputs, this overhead dominates the total runtime.
Profiling
CodSpeed flamegraph comparing &[u8] (baseline) vs PyBuffer<u8>:
| Source | Self Time |
|---|---|
| `<pyo3::buffer::PyBuffer<u8>>::get` | 507ns |
| `malloc` | 642ns |
| `free` | 143ns |
| `new_uninit<pyo3::buffer::RawPyBuffer>` | 89ns |
Replacing PyBuffer<u8> with a raw FFI stack-allocated Py_buffer + PyBUF_SIMPLE eliminated the regression almost entirely (~9ns overhead vs ~850ns).
Current: ~850ns overhead per call

```rust
fn hash(&self, data: PyBuffer<u8>) -> u128 {
    hasher(data.as_bytes(), self.seed)
}
```

Raw FFI workaround: ~9ns overhead per call
```rust
fn hash(&self, data: Bound<'_, PyAny>) -> u128 {
    let mut view = std::mem::MaybeUninit::<pyo3::ffi::Py_buffer>::uninit();
    unsafe {
        // error handling omitted here for brevity
        pyo3::ffi::PyObject_GetBuffer(data.as_ptr(), view.as_mut_ptr(), pyo3::ffi::PyBUF_SIMPLE);
        let mut view = view.assume_init();
        let result = hasher(
            std::slice::from_raw_parts(view.buf as *const u8, view.len as usize),
            self.seed,
        );
        pyo3::ffi::PyBuffer_Release(&mut view);
        result
    }
}
```

Flamegraph after implementing the workaround
Proposal
A stack-allocated buffer type for the common "oneshot" pattern.
Scoped closure

```rust
PyBuffer::with(obj, PyBUF_SIMPLE, |buf, len| {
    // Py_buffer lives on the stack, released when the closure returns
})?;
```

Non-Send stack type
```rust
let buffer = PyBufferRef::get(obj)?; // stack-allocated + non-Send
let slice = buffer.as_bytes();
// PyBuffer_Release called on drop
```
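To make the ownership story concrete, here is a minimal stdlib-only sketch of the Drop-based release pattern the non-Send stack type implies. Everything in it is a hypothetical stand-in: `PyBufferRef`, its `get` constructor, the `RELEASES` counter (standing in for `PyBuffer_Release`), and the `&[u8]` backing (standing in for a stack `ffi::Py_buffer`) are illustrations, not the real PyO3 API.

```rust
use std::marker::PhantomData;
use std::sync::atomic::{AtomicUsize, Ordering};

// Stand-in for PyBuffer_Release: counts how many times release ran.
static RELEASES: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical shape of the proposed stack-allocated buffer guard.
/// The real type would hold an ffi::Py_buffer; a borrowed slice stands in here.
struct PyBufferRef<'a> {
    data: &'a [u8],
    // A raw-pointer marker makes the type neither Send nor Sync,
    // so the buffer is released on the thread that acquired it.
    _not_send: PhantomData<*mut ()>,
}

impl<'a> PyBufferRef<'a> {
    fn get(data: &'a [u8]) -> Self {
        // Real impl: PyObject_GetBuffer into a MaybeUninit<Py_buffer> on the stack.
        PyBufferRef { data, _not_send: PhantomData }
    }

    fn as_bytes(&self) -> &[u8] {
        self.data
    }
}

impl Drop for PyBufferRef<'_> {
    fn drop(&mut self) {
        // Real impl: pyo3::ffi::PyBuffer_Release(&mut self.view)
        RELEASES.fetch_add(1, Ordering::SeqCst);
    }
}

fn main() {
    let backing = vec![1u8, 2, 3];
    {
        let buf = PyBufferRef::get(&backing);
        assert_eq!(buf.as_bytes(), &[1, 2, 3]);
    } // guard dropped here; release runs exactly once
    println!("releases: {}", RELEASES.load(Ordering::SeqCst));
}
```

The key property is that no heap allocation occurs anywhere on the acquire/release path: the guard lives on the caller's stack and release is tied to scope exit, which is what removes the malloc/free cost measured above.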