Replies: 3 comments 2 replies
-
Things won't make it into L2ARC if they're not first in ARC; L2ARC holds records evicted from ARC. You only have 2 GB of ARC, so I think it's incredible that the L2 has filled up as much as it has. You're using a lot of your ARC just to store the headers for L2.
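If you want to put a number on that header cost, arcstats exposes it directly; a quick check (field name as in OpenZFS 2.x):

```
# Bytes of ARC currently consumed by L2ARC headers
grep '^l2_hdr_size' /proc/spl/kstat/zfs/arcstats
```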
-
More generally, L2ARC does not strictly get filled as soon as something falls out of ARC. ARC fills up, and L2ARC slowly copies things that are likely to be dropped from ARC soon, but it's not a synchronous process where things that fall out of ARC always end up in L2ARC; if it were, then freeing things from ARC would bottleneck on your L2ARC speed, among other issues.

By default, it fills at a very slow rate, whose exact value escapes me at the moment but which I believe is on the order of 10-20 MB/s at most, and it doubles that rate when it thinks the L2ARC is "warming up". So if your actual ARC is tiny, and your L2ARC is relatively huge compared to it, it would take quite some time and precisely the right patterns of churn for your L2ARC device to contain most, if not all, of the data you care about: data would often either fall out of ARC faster than L2ARC copies it over, or sit unchanged in ARC for a long period so that L2ARC has nothing new to copy. There are tunables you can adjust to change the L2ARC fill rate.

(There's also the per-record overhead of L2ARC, which I think is on the order of 70-80 bytes per record. So at 128K records, you'd be occupying 79 MB/90 MB (depending on which estimate is accurate) to keep 144 GiB in the L2ARC. If you filled the 2T device, you'd be occupying 1.09/1.25 GB.)
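For reference, here's where those fill-rate knobs live, plus a sketch of the header arithmetic; parameter names are the OpenZFS ones, and the example value is illustrative, not a recommendation:

```
# Steady-state and warm-up L2ARC fill rates (bytes/s and bytes, respectively)
cat /sys/module/zfs/parameters/l2arc_write_max
cat /sys/module/zfs/parameters/l2arc_write_boost

# Example only: raise the steady-state fill rate to 64 MiB/s
echo $((64 * 1024 * 1024)) | sudo tee /sys/module/zfs/parameters/l2arc_write_max

# Rough header-overhead check: 144 GiB of 128 KiB records at ~75 B/header
echo $(( (144 * 1024**3 / (128 * 1024)) * 75 / 1024**2 )) MiB   # ~84 MiB
```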
-
I'm working with a 262 GB dataset of uncompressible images (they already have DEFLATE compression applied). Using

cat /proc/spl/kstat/zfs/arcstats | grep l2

I see that my l2arc size is 154943145984, or 144.30 GiB. My program has already made an entire pass through the dataset, so I would expect the entire dataset to be in the l2arc, plus or minus the size of the arc, but I'm not seeing that. My question is: why?

My system has 64 GB of memory, and zfs_arc_max is 51539607552, or 48 GiB.

For details, my zpool status is:
The four main 10TB HDDs are set up in a striped mirror, and I have a 2TB NVMe drive as my l2arc cache device.
I have ZFS 2.0.6 on Ubuntu 21.10.
Some details from arc_summary:

Tunables of the l2arc are still at the defaults.
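(They can all be listed from sysfs; one I'm eyeing for a sequential-scan workload like mine is l2arc_noprefetch, which, if I understand it correctly, defaults to 1 and keeps prefetched buffers out of L2ARC:)

```
# List every L2ARC module parameter with its current value
grep . /sys/module/zfs/parameters/l2arc_*
```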
I'm effectively looking to tune my zfs drives so that, when my dataset is less than 2TB, I get the read speed of the SSD as I sequentially or randomly iterate over it (machine learning on image data). So if there are any further tips in that direction, or if I have a misconception about this being possible, I'd appreciate the advice.
But again, the main question I have is: why does my l2arc not contain the entire dataset, even though I've iterated through every image and loaded all the data from each one? From what I understand, each image's data must have been fed into arc at some point, and if it was ever evicted it should be on the l2arc. My first thought was that maybe the data was compressed, but a delta of 120GB is way too big; the data isn't very compressible. But perhaps I'm misunderstanding something.
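(One sanity check on the compression theory: arcstats reports both the logical and the allocated L2ARC size, so comparing the two shows how much compression is actually happening:)

```
# l2_size = logical bytes cached; l2_asize = bytes allocated on the device
awk '$1 == "l2_size" || $1 == "l2_asize" {print $1, $3}' /proc/spl/kstat/zfs/arcstats
```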