
Conversation

@robn robn commented Aug 14, 2025

[Sponsors: Klara, Inc., Wasabi Technology, Inc.]

Motivation and Context

256 bytes of zio_t are devoted to two blkptr_ts that aren't used by many zio types. Let's make them pointers instead, and only allocate them when we need them.

This PR is intended as a kind of "polished prototype" for discussion. Maybe no one has strong opinions and it sails through, but idk, I have some mild unease about this one, so let's chat :)

Description

The short version: add a new cache, zio_bp_cache. In zio_create(), only allocate these when the IO type demands it: io_bp_copy when we need our own copy because the caller will free theirs, and io_bp_orig when we will modify the BP (i.e. most writes) and need the original around in case of a rollback.
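As a rough userspace sketch of that decision (malloc standing in for a kmem cache allocation from zio_bp_cache; the types, flags and function names here are illustrative simplifications, not the real zio_create() signature):

```c
#include <stdlib.h>
#include <string.h>
#include <stdbool.h>

/* Stand-in types: in ZFS a blkptr_t is 128 bytes. */
typedef struct { unsigned char b[128]; } blkptr_t;

typedef struct {
	blkptr_t *io_bp;       /* the BP this zio operates on */
	blkptr_t *io_bp_copy;  /* separate allocation, only when needed */
	blkptr_t *io_bp_orig;  /* pre-modification copy, only for rewrites */
} zio_t;

/* Simplified stand-in for allocating from zio_bp_cache. */
static blkptr_t *
zio_bp_alloc(void)
{
	return (malloc(sizeof (blkptr_t)));
}

/*
 * Copy the caller's BP only if they may free it before we're done
 * (need_copy), and keep the original only if we'll modify the BP and
 * might need to roll back (will_modify, i.e. most writes).
 */
static void
zio_bp_setup(zio_t *zio, blkptr_t *bp, bool need_copy, bool will_modify)
{
	if (need_copy) {
		zio->io_bp_copy = zio_bp_alloc();
		memcpy(zio->io_bp_copy, bp, sizeof (blkptr_t));
		zio->io_bp = zio->io_bp_copy;
	} else {
		zio->io_bp = bp;
	}
	if (will_modify) {
		zio->io_bp_orig = zio_bp_alloc();
		memcpy(zio->io_bp_orig, zio->io_bp, sizeof (blkptr_t));
	}
}
```

The point is just that the 128-byte blkptr_ts are only allocated for the minority of zios that actually need them; everything else carries two NULL pointers.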

The three commits are very straightforward and should be squashed in the end. The top one, without the signoff, is a dumb stats addition just to make it easier to understand what's happening while we're testing and playing around (those would probably be better in the cache stats, really, but that's another change for another day).

But now let's talk about the things I don't like :)

I don't like the separate allocation, but it's kind of what it has to be. It should be efficient enough once the cache is warm, and there are never too many out at once. The gain on the memory side seems worth it though: in my unscientific tests, some straightforward reading & writing to a 4xZ1 suggests we only allocate from zio_bp_cache for about 10% of all zio_cache allocations. A full ZTS run did about 40%, which may be more representative over time, but still seems worth a look.

This is a fairly naive conversion which took me all of a day, including testing runs, $dayjob and a big sleep. We could likely do better if we spent some time considering how we pass BPs around and where we could be a bit smarter. Could we share BPs that won't change more through the IO tree? Reach into our parent or io_logical to grab one if we're downstream? Some refcounting and/or copy-on-write arrangement?

(at least, we can probably easily fix that L2ARC thing by just passing it to arc_read_done() instead of smuggling it inside the zio).

Here's the size change. As you can see, the new one has a pretty hefty hole in the middle of it. It's easy enough to reorder some things to fill it, but it doesn't materially change anything because of the cacheline alignment; we just get extra padding at the end. That feels like something for the next PR.

pahole before (1152 bytes)
struct zio {
	zbookmark_phys_t           io_bookmark;          /*     0    32 */
	zio_prop_t                 io_prop;              /*    32    52 */
	/* --- cacheline 1 boundary (64 bytes) was 20 bytes ago --- */
	zio_type_t                 io_type;              /*    84     4 */
	enum zio_child             io_child_type;        /*    88     4 */
	enum trim_flag             io_trim_flags;        /*    92     4 */
	zio_priority_t             io_priority;          /*    96     4 */
	uint8_t                    io_post;              /*   100     1 */
	uint8_t                    io_state[2];          /*   101     2 */

	/* XXX 1 byte hole, try to pack */

	uint64_t                   io_txg;               /*   104     8 */
	spa_t *                    io_spa;               /*   112     8 */
	blkptr_t *                 io_bp;                /*   120     8 */
	/* --- cacheline 2 boundary (128 bytes) --- */
	blkptr_t *                 io_bp_override;       /*   128     8 */
	blkptr_t                   io_bp_copy;           /*   136   128 */
	/* --- cacheline 4 boundary (256 bytes) was 8 bytes ago --- */
	list_t                     io_parent_list;       /*   264    24 */
	list_t                     io_child_list;        /*   288    24 */
	zio_t *                    io_logical;           /*   312     8 */
	/* --- cacheline 5 boundary (320 bytes) --- */
	zio_transform_t *          io_transform_stack;   /*   320     8 */
	zio_done_func_t *          io_ready;             /*   328     8 */
	zio_done_func_t *          io_children_ready;    /*   336     8 */
	zio_done_func_t *          io_done;              /*   344     8 */
	void *                     io_private;           /*   352     8 */
	int64_t                    io_prev_space_delta;  /*   360     8 */
	blkptr_t                   io_bp_orig;           /*   368   128 */
	/* --- cacheline 7 boundary (448 bytes) was 48 bytes ago --- */
	uint64_t                   io_lsize;             /*   496     8 */
	struct abd *               io_abd;               /*   504     8 */
	/* --- cacheline 8 boundary (512 bytes) --- */
	struct abd *               io_orig_abd;          /*   512     8 */
	uint64_t                   io_size;              /*   520     8 */
	uint64_t                   io_orig_size;         /*   528     8 */
	vdev_t *                   io_vd;                /*   536     8 */
	void *                     io_vsd;               /*   544     8 */
	const zio_vsd_ops_t  *     io_vsd_ops;           /*   552     8 */
	metaslab_class_t *         io_metaslab_class;    /*   560     8 */
	enum zio_qstate            io_queue_state;       /*   568     4 */

	/* XXX 4 bytes hole, try to pack */

	/* --- cacheline 9 boundary (576 bytes) --- */
	union {
		list_node_t        l;                    /*   576    16 */
		avl_node_t         a;                    /*   576    24 */
	} io_queue_node __attribute__((__aligned__(64))); /*   576    24 */
	avl_node_t                 io_offset_node;       /*   600    24 */
	uint64_t                   io_offset;            /*   624     8 */
	hrtime_t                   io_timestamp;         /*   632     8 */
	/* --- cacheline 10 boundary (640 bytes) --- */
	hrtime_t                   io_queued_timestamp;  /*   640     8 */
	hrtime_t                   io_target_timestamp;  /*   648     8 */
	hrtime_t                   io_delta;             /*   656     8 */
	hrtime_t                   io_delay;             /*   664     8 */
	zio_alloc_list_t           io_alloc_list;        /*   672    32 */
	/* --- cacheline 11 boundary (704 bytes) --- */
	zio_flag_t                 io_flags;             /*   704     8 */
	enum zio_stage             io_stage;             /*   712     4 */
	enum zio_stage             io_pipeline;          /*   716     4 */
	zio_flag_t                 io_orig_flags;        /*   720     8 */
	enum zio_stage             io_orig_stage;        /*   728     4 */
	enum zio_stage             io_orig_pipeline;     /*   732     4 */
	enum zio_stage             io_pipeline_trace;    /*   736     4 */
	int                        io_error;             /*   740     4 */
	int                        io_child_error[4];    /*   744    16 */
	uint64_t                   io_children[4][2];    /*   760    64 */
	/* --- cacheline 12 boundary (768 bytes) was 56 bytes ago --- */
	uint64_t *                 io_stall;             /*   824     8 */
	/* --- cacheline 13 boundary (832 bytes) --- */
	zio_t *                    io_gang_leader;       /*   832     8 */
	zio_gang_node_t *          io_gang_tree;         /*   840     8 */
	void *                     io_executor;          /*   848     8 */
	void *                     io_waiter;            /*   856     8 */
	void *                     io_bio;               /*   864     8 */
	kmutex_t                   io_lock;              /*   872    48 */
	/* --- cacheline 14 boundary (896 bytes) was 24 bytes ago --- */
	kcondvar_t                 io_cv;                /*   920    72 */
	/* --- cacheline 15 boundary (960 bytes) was 32 bytes ago --- */
	int                        io_allocator;         /*   992     4 */

	/* XXX 4 bytes hole, try to pack */

	zio_cksum_report_t *       io_cksum_report;      /*  1000     8 */
	uint64_t                   io_ena;               /*  1008     8 */
	taskq_ent_t                io_tqent;             /*  1016   136 */

	/* size: 1152, cachelines: 18, members: 64 */
	/* sum members: 1143, holes: 3, sum holes: 9 */
	/* forced alignments: 1, forced holes: 1, sum forced holes: 4 */
} __attribute__((__aligned__(64)));
pahole after (960 bytes)
struct zio {
	zbookmark_phys_t           io_bookmark;          /*     0    32 */
	zio_prop_t                 io_prop;              /*    32    52 */
	/* --- cacheline 1 boundary (64 bytes) was 20 bytes ago --- */
	zio_type_t                 io_type;              /*    84     4 */
	enum zio_child             io_child_type;        /*    88     4 */
	enum trim_flag             io_trim_flags;        /*    92     4 */
	zio_priority_t             io_priority;          /*    96     4 */
	uint8_t                    io_post;              /*   100     1 */
	uint8_t                    io_state[2];          /*   101     2 */

	/* XXX 1 byte hole, try to pack */

	uint64_t                   io_txg;               /*   104     8 */
	spa_t *                    io_spa;               /*   112     8 */
	blkptr_t *                 io_bp;                /*   120     8 */
	/* --- cacheline 2 boundary (128 bytes) --- */
	blkptr_t *                 io_bp_override;       /*   128     8 */
	blkptr_t *                 io_bp_copy;           /*   136     8 */
	list_t                     io_parent_list;       /*   144    24 */
	list_t                     io_child_list;        /*   168    24 */
	/* --- cacheline 3 boundary (192 bytes) --- */
	zio_t *                    io_logical;           /*   192     8 */
	zio_transform_t *          io_transform_stack;   /*   200     8 */
	zio_done_func_t *          io_ready;             /*   208     8 */
	zio_done_func_t *          io_children_ready;    /*   216     8 */
	zio_done_func_t *          io_done;              /*   224     8 */
	void *                     io_private;           /*   232     8 */
	int64_t                    io_prev_space_delta;  /*   240     8 */
	blkptr_t *                 io_bp_orig;           /*   248     8 */
	/* --- cacheline 4 boundary (256 bytes) --- */
	uint64_t                   io_lsize;             /*   256     8 */
	struct abd *               io_abd;               /*   264     8 */
	struct abd *               io_orig_abd;          /*   272     8 */
	uint64_t                   io_size;              /*   280     8 */
	uint64_t                   io_orig_size;         /*   288     8 */
	vdev_t *                   io_vd;                /*   296     8 */
	void *                     io_vsd;               /*   304     8 */
	const zio_vsd_ops_t  *     io_vsd_ops;           /*   312     8 */
	/* --- cacheline 5 boundary (320 bytes) --- */
	metaslab_class_t *         io_metaslab_class;    /*   320     8 */
	enum zio_qstate            io_queue_state;       /*   328     4 */

	/* XXX 52 bytes hole, try to pack */

	/* --- cacheline 6 boundary (384 bytes) --- */
	union {
		list_node_t        l;                    /*   384    16 */
		avl_node_t         a;                    /*   384    24 */
	} io_queue_node __attribute__((__aligned__(64))); /*   384    24 */
	avl_node_t                 io_offset_node;       /*   408    24 */
	uint64_t                   io_offset;            /*   432     8 */
	hrtime_t                   io_timestamp;         /*   440     8 */
	/* --- cacheline 7 boundary (448 bytes) --- */
	hrtime_t                   io_queued_timestamp;  /*   448     8 */
	hrtime_t                   io_target_timestamp;  /*   456     8 */
	hrtime_t                   io_delta;             /*   464     8 */
	hrtime_t                   io_delay;             /*   472     8 */
	zio_alloc_list_t           io_alloc_list;        /*   480    32 */
	/* --- cacheline 8 boundary (512 bytes) --- */
	zio_flag_t                 io_flags;             /*   512     8 */
	enum zio_stage             io_stage;             /*   520     4 */
	enum zio_stage             io_pipeline;          /*   524     4 */
	zio_flag_t                 io_orig_flags;        /*   528     8 */
	enum zio_stage             io_orig_stage;        /*   536     4 */
	enum zio_stage             io_orig_pipeline;     /*   540     4 */
	enum zio_stage             io_pipeline_trace;    /*   544     4 */
	int                        io_error;             /*   548     4 */
	int                        io_child_error[4];    /*   552    16 */
	uint64_t                   io_children[4][2];    /*   568    64 */
	/* --- cacheline 9 boundary (576 bytes) was 56 bytes ago --- */
	uint64_t *                 io_stall;             /*   632     8 */
	/* --- cacheline 10 boundary (640 bytes) --- */
	zio_t *                    io_gang_leader;       /*   640     8 */
	zio_gang_node_t *          io_gang_tree;         /*   648     8 */
	void *                     io_executor;          /*   656     8 */
	void *                     io_waiter;            /*   664     8 */
	void *                     io_bio;               /*   672     8 */
	kmutex_t                   io_lock;              /*   680    48 */
	/* --- cacheline 11 boundary (704 bytes) was 24 bytes ago --- */
	kcondvar_t                 io_cv;                /*   728    72 */
	/* --- cacheline 12 boundary (768 bytes) was 32 bytes ago --- */
	int                        io_allocator;         /*   800     4 */

	/* XXX 4 bytes hole, try to pack */

	zio_cksum_report_t *       io_cksum_report;      /*   808     8 */
	uint64_t                   io_ena;               /*   816     8 */
	taskq_ent_t                io_tqent;             /*   824   136 */

	/* size: 960, cachelines: 15, members: 64 */
	/* sum members: 903, holes: 3, sum holes: 57 */
	/* forced alignments: 1, forced holes: 1, sum forced holes: 52 */
} __attribute__((__aligned__(64)));

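As a toy illustration of why filling the hole doesn't shrink the struct (this is a miniature model, not the real zio_t layout): the 52-byte hole exists because io_queue_node is forced to 64-byte alignment, and the whole struct is padded to a 64-byte multiple, so moving members into the hole just converts it into trailing padding:

```c
#include <stdint.h>

/* Hole before the 64-byte-aligned member, like the one after io_queue_state. */
struct with_hole {
	uint32_t a;                                      /*  0..4  */
	/* 60-byte hole */
	uint64_t node __attribute__((__aligned__(64)));  /* 64..72 */
	uint32_t b;                                      /* 72..76, pad to 128 */
} __attribute__((__aligned__(64)));

/* Reordered: the interior hole is gone, but the padding moves to the end. */
struct reordered {
	uint32_t a;                                      /*  0..4  */
	uint32_t b;                                      /*  4..8  */
	/* 56-byte hole */
	uint64_t node __attribute__((__aligned__(64)));  /* 64..72, pad to 128 */
} __attribute__((__aligned__(64)));
```

Both structs come out at 128 bytes; only the position of the slack changes.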
The other thing I'm unsure about is whether we generally want to move towards this "componentised" kind of model for zio_t, where it only carries the pieces it needs. The fields can be grouped into a bunch of different "uses"; e.g., there's a handful of items specific to queue operations, which aren't needed outside of read/write/trim to leaf devices. I can see arguments for and against: it does let us get the memory usage down to just what we need, but C doesn't make it easy to work with that sort of structure, and it's also got to be done carefully to avoid fragmenting memory too much. Maybe it doesn't matter if we don't see this PR as a precedent, or we just consider this a step in a direction.
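For the record, the "componentised" shape I have in mind is roughly this (entirely hypothetical names and fields, just to show the C ergonomics): queue-only state lives in its own separately allocated piece, hung off the core struct only when a zio actually hits a queue:

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical: queue-specific state split out of the core struct. */
typedef struct {
	uint64_t q_offset;
	uint64_t q_timestamp;
	/* ... the other queue-only fields ... */
} zio_queue_part_t;

typedef struct {
	int io_type;
	zio_queue_part_t *io_queue;  /* NULL unless this zio is queued */
} zio_small_t;

/* Allocate the component on first use; would be a kmem cache for real. */
static zio_queue_part_t *
zio_queue_part(zio_small_t *zio)
{
	if (zio->io_queue == NULL)
		zio->io_queue = calloc(1, sizeof (zio_queue_part_t));
	return (zio->io_queue);
}
```

The downside is visible immediately: every access goes through a pointer that might be NULL, which is exactly the "C doesn't make it easy" part.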

How Has This Been Tested?

Full ZTS run completed successfully against Linux 6.12. I wouldn't be surprised to find I've introduced some very subtle NULL deref in some niche situation, though.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Quality assurance (non-breaking change which makes the code more robust against bugs)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

@robn robn changed the title zio: use a separate allocation for io_bp_copy and io_bp_orig. zio: use a separate allocation for io_bp_copy and io_bp_orig Aug 14, 2025
robn added 4 commits August 20, 2025 11:34
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <[email protected]>
We only need our own copy of the BP if the caller provided one that they
would free before we're done with it.

Additionally, L2ARC currently abuses io_bp_copy to smuggle the original
BP back to arc_read_done() inside the L2ARC read ZIO. A small helper
function is provided to handle that allocation when necessary.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <[email protected]>
We only need a copy if we would modify the original and might need to
revert, which only applies to non-vdev writes.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <[email protected]>
Just a rough eyeball comparison to get a feel for what we're saving.
@behlendorf behlendorf added the Status: Code Review Needed Ready for review and testing label Aug 22, 2025
@tonyhutter (Contributor) left a comment:
I'm not an expert in this code, but I don't see any surface-level issues.

@behlendorf (Contributor) left a comment:

This all looks good, and I can't say I have any strong objections. Out of curiosity, have you seen cases where we build up a large number of zio allocations on the system?

That said, this does make me a little uneasy. I have to wonder how worthwhile the memory savings is, compared to the slight risk that the extra allocation will have an observable performance impact with fast devices. Mind you, I doubt an impact would be measurable; the cache will be hot and we're doing other more expensive things in the pipeline already.

@behlendorf behlendorf added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Code Review Needed Ready for review and testing labels Aug 29, 2025
@robn (Member, Author) commented Sep 3, 2025

@behlendorf great question. Mostly no, not in normal operation, which is what makes me uneasy about this change.

#16722 was one where the size of zio_t was a huge contributor to the symptom, but the problem there is arguably that we model a free as an IO when it isn't really, and perhaps that we don't have a better batching mechanism.

#17388 was more recent and the concern there was about the weight of an extra intermediate zio_t, but that wasn't really shown to be a problem, more just an acknowledgement that these things are pretty heavy and we should try to use them sparingly.

Given both of those, I think we could just drop this and tackle both cases more directly - some rework of how we handle bulk frees, and then maybe a smaller form of zio_t for when we just need something to group a bunch together without doing any direct work. What do you think?

@behlendorf (Contributor) commented:
@robn then let's put this one on the back burner for now, and tackle those other issues directly as you said. We can always revisit the need for this.
