Commit Graph

9363 Commits

Author SHA1 Message Date
Rob Norris 5c67820265
libzstd: also build with LIBZPOOL_CPPFLAGS
libzstd now also allocates its own abd_t, and so has the same issue as
zstream did, so this applies the same workaround: compile it with
ZFS_DEBUG. See 92fca1c2d.

This looks weird, because libzstd doesn't appear to look related to the
ZFS kernel, but there is already a cross-dependency there: zstd needs
zfs_lz4_compress, and zfs needs zfs_zstd_compress (and others), so the
two can never really be separated without more work. Another job for
another time.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mmaybee@delphix.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16489
2024-09-09 14:13:27 -07:00
Rob Norris b109925820
spa_prop_get: require caller to supply output nvlist
All callers to spa_prop_get() and spa_prop_get_nvlist() supplied their
own preallocated nvlist (except ztest), so we can remove the option to
have them allocate one if none is supplied.

This sidesteps a bug in spa_prop_get(), where the error var wasn't
initialised, which could lead to the provided nvlist being freed at the
end.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16505
2024-09-06 08:45:58 -07:00
Rob Norris 17dd66deda zpool events: expand value strings for ZIO error values
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
2024-09-05 13:40:05 -07:00
Rob Norris 82ff9aafd6 value strings: pretty printers for flags and enums
This adds zfs_valstr, a collection of pretty printers for bitfields and
enums. These are useful in debugging, logging and other display contexts
where raw values are difficult for the untrained (or even trained!) eye
to decipher.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
2024-09-05 13:40:05 -07:00
Don Brady d4d79451cb Add DDT prune command
Requires the new 'flat' physical data which has the start
time for a class entry.

The amount to prune can be based on a target percentage of
the unique entries or based on the age (i.e., every entry
older than N days).

Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <don.brady@klarasystems.com>
Closes #16277
2024-09-04 14:17:02 -07:00
Rob Norris 4a4f7b019f zdb: rework dedup accounting for log, quota and prune
The simplest thing first: add the FDT and log objects to the list of
objects to be considered when checking for leaks.

The rest is based on a conceptual change in all of this patch stack: a
block on disk with a 'D' bit is not necessarily in the DDT at all
(pruned), or in the DDT ZAPs (still on the log).

As such, walking the DDT up front is difficult (for all the reasons that
walking an unflushed log is difficult) and not really useful, since it's
not a reflection of what's on disk anyway.

Instead, we rework things here to be more like the BRT checks. When we
see a dedup'd block, we look it up in the DDT, consume a refcount, and
for the second-or-later instances, count them as duplicates.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Allan Jude <allan@klarasystems.com>
Co-authored-by: Don Brady <don.brady@klarasystems.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Closes #16277
2024-09-04 14:16:42 -07:00
Seth Hoffert bf8c61f489
Remove unused sysctl node
PR #14953 removed vdev-level read cache but accidentally left this
sysctl node behind.

Reviewed-by: Rich Ercolani <rincebrain@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Seth Hoffert <seth.hoffert@gmail.com>
Closes #16493
2024-09-03 17:52:33 -07:00
Rob Norris b3b7491615 build: rename FORCEDEBUG_CPPFLAGS to LIBZPOOL_CPPFLAGS
This is just a very small attempt to make it more obvious that these
flags aren't optional for libzpool-using programs, by not making it seem
like there's an option to say "well, I don't _want_ to force debugging".

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Rich Ercolani <rincebrain@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Issue #16476
Closes #16477
2024-08-27 12:53:27 -07:00
Rob Norris 92fca1c2d0 zstream: build with debug to fix stack overruns
abd_t differs in size depending on whether or not ZFS_DEBUG is set. It
turns out that libzpool is built with FORCEDEBUG_CPPFLAGS, which sets
-DZFS_DEBUG, and so it always has a larger abd_t with extra debug
fields, regardless of whether or not --enable-debug is set.

zdb, ztest and zhack are also all built with FORCEDEBUG_CPPFLAGS, so had
the same idea of the size of abd_t, but zstream was not, and used the
"smaller" abd_t. In practice this didn't matter because it never used
abd_t directly.

This changed in b4d81b1a6, zstream was switched to use stack ABDs for
compression. When built with --enable-debug, zstream implicitly gets
ZFS_DEBUG, and everything was fine. Productions builds without that flag
ends up with the smaller abd_t, which is now mismatched with libzpool,
and causes stack overruns in zstream recompress.

The simplest fix for now is to compile zstream with FORCEDEBUG_CPPFLAGS
like the other binaries. This commit does that.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Rich Ercolani <rincebrain@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Issue #16476
Closes #16477
2024-08-27 12:52:23 -07:00
Rob Norris 50b32cb925
fm: pass io_flags through events & zed as uint64_t
In 4938d01db (#14086) zio_flag_t was converted from an enum (generally
signed 32-bit) to a uint64_t. The corresponding change wasn't made to
the error reporting subsystem, limiting the error flags being delivered
to zed to 32 bits. This bumps the whole pipeline to use uint64s.

A tiny bit of compatibility is added for newer zed working agsinst an
older kernel module, because its easy to do and misdetecting
scrub/resilver errors and taking action is potentially dangerous. Making
it work for new kernel modules against older zed seems to be far more
invasive for far less benefit, so I have not.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16469
2024-08-26 17:39:13 -07:00
Jitendra Patidar 73866cf346
Fix issig() to check signal_pending after dequeue SIGSTOP/SIGTSTP
When process got SIGSTOP/SIGTSTP, issig() dequeue them and return 0.
But process could still have another signal pending after dequeue. So,
after dequeue, check and return 1, if signal_pending.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jitendra Patidar <jitendra.patidar@nutanix.com>
Closes #16464
2024-08-26 17:36:49 -07:00
Mateusz Piotrowski 6be8bf5552
zpool: Provide GUID to zpool-reguid(8) with -g (#16239)
This commit extends the zpool-reguid(8) command with a -g flag, which
allows the user to specify the GUID to set.

This change also adds some general tests for zpool-reguid(8).

Sponsored-by: Wasabi Technology, Inc.
Sponsored-by: Klara, Inc.

Signed-off-by: Mateusz Piotrowski <0mp@FreeBSD.org>
Reviewed-by: Rob Norris <rob.norris@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
2024-08-26 09:27:24 -07:00
Rob Norris 2420ee6e12
spl-taskq: fix task counts for delayed and cancelled tasks
Dispatched delayed tasks were not added to tasks_total, and cancelled
tasks were not removed. This notably could make tasks_total go to
UNIT64_MAX, but just generally meant the count could be wrong. So lets
not!

Sponsored-by: Klara, Inc.
Sponsored-by: Syneto
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16473
2024-08-23 10:40:45 -07:00
Low-power 34118eac06
Make mount.zfs(8) calling zfs_mount_at for legacy mounts as well
Commit 329e2ffa4b has made mount.zfs(8) to
call libzfs function 'zfs_mount_at', in order to propagate dataset
properties into mount options. This fix however, is limited to a special
use case where mount.zfs(8) is used in initrd with option '-o zfsutil'.
If either initrd or the user need to use mount.zfs(8) to mount a file
system with 'mountpoint' set to 'legacy', '-o zfsutil' can't be used and
the original issue #7947 will still happen.

Since the existing code already excluded the possibility of calling
'zfs_mount_at' when it was invoked as a helper program from zfs(8), by
checking 'ZFS_MOUNT_HELPER' environment variable, it makes no sense to
avoid calling 'zfs_mount_at' without '-o zfsutil'.

An exception however, is when mount.zfs(8) was invoked with '-o remount'
to update the mount options for an existing mount point. In this case
call mount(2) directly without modifying the mount options passed from
command line.

Furthermore, don't run mount.zfs(8) helper for automounting snapshot.
The above change to make mount.zfs(8) to call 'zfs_mount_at'
apparently caused it to trigger an automount for the snapshot
directory. When the helper was invoked as a result of a snapshot
automount, an infinite recursion will occur.

Since the need of invoking user mode mount(8) for automounting was to
overcome that the 'vfs_kern_mount' being GPL-only, just run mount(8)
without the mount.zfs(8) helper by adding option '-i'.

Reviewed-by: Umer Saleem <usaleem@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: WHR <whr@rivoreo.one>
Closes #16393
2024-08-23 10:39:09 -07:00
Rob Norris cb36f4f352 zstream recompress: fix zero recompressed buffer and output
If compression happend, any garbage past the compress size was not
zeroed out.

If compression didn't happen, then the payload size was still set to
the rounded-up return from zio_compress_data(), which is dependent on
the input, which is not necessarily the logical size.

So that's all fixed too, mostly from stealing the math from zio.c.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
2024-08-22 16:22:24 -07:00
Rob Norris a537d90734 zstream decompress: fix decompress size and output
This was incorrectly using the compressed length for the size of the
decompress buffer, and quietly doing nothing if the decompressor refused
to decompress the block because there wasn't enough space.

After that, it wasn't correctly rewriting the record to indicate
"not compressed".

So that's fixed now. Sigh.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
2024-08-22 16:22:24 -07:00
Rob Norris a9c94bea9f zio_compress_data: limit dest length to ABD size
Some callers (eg `do_corrective_recv()`) pass in a dest buffer much
smaller than the wanted 87.5% of the source buffer, because the
incoming abd is larger than the source data and they "know" what the
decompressed size with be.

However, `abd_borrow_buf()` rightly asserts if we try to borrow more
than is available, so these callers fail.

Previously when all we had was a dest buffer, we didn't know how big it
was, so we couldn't do anything. Now we have a dest abd, with a size, so
we can clamp dest size to the abd size.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
2024-08-22 16:22:24 -07:00
Rob Norris f62e6e1f98 compress: change zio_compress API to use ABDs
This commit changes the frontend zio_compress_data and
zio_decompress_data APIs to take ABD points instead of buffer pointers.

All callers are updated to match. Any that already have an appropriate
ABD nearby now use it directly, while at the rest we create an one.

Internally, the ABDs are passed through to the provider directly.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
2024-08-22 16:22:24 -07:00
Rob Norris d3c12383c9 compress: change compression providers API to use ABDs
This commit changes the provider compress and decompress API to take ABD
pointers instead of buffer pointers for both data source and
destination. It then updates all providers to match.

This doesn't actually change the providers to do chunked compression,
just changes the API to allow such an update in the future. Helper
macros are added to easily adapt the ABD functions to their buffer-based
implementations.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
2024-08-22 16:22:24 -07:00
Rob Norris 522816498c compress: standardise names of compression functions
This is mostly to make searching easier.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
2024-08-22 16:22:24 -07:00
Rob Norris dd0c08f9c6 compress: remove unused abd compress prototype
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
2024-08-22 16:22:24 -07:00
Rob Norris e119483a95 compress: remove zio_decompress_data_buf
Nothing uses it anymore!

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
2024-08-22 16:22:24 -07:00
Rob Norris b4d81b1a6a zstream: use zio_compress calls for compression
This is updating zstream to use the zio_compress calls rather than using
its own dispatch. Since that was fairly entangled, some refactoring
included.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
2024-08-22 16:22:24 -07:00
Rob Norris 5eede0d5fd compress: rework callers to always use the zio_compress calls
This will make future refactoring easier.

There are two we can't change for the moment, because zio_compress_data
does hole detection & collapsing which zio_decompress_data does not
actually know how to handle.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
2024-08-22 16:22:24 -07:00
Rob Norris ba2209ec9e abd_get_from_buf_struct: wrap existing buf with ABD stored on stack
This allows a simple "wrapping" ABD for an existing linear buffer to be
allocated on the stack, avoiding an allocation.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
2024-08-22 16:22:24 -07:00
Tony Hutter 9e15877dfb
Linux 6.10 compat: META
Update the META file to reflect compatibility with the 6.10 kernel.

Reviewed-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #16466
2024-08-21 17:38:06 -07:00
Rob Norris b69bebb535 libzpool/abd_os: iovec-based scatter abd
This is intended to be a simple userspace scatter abd based on struct
iovec. It's not very sophisticated as-is, but sets a base for something
much more interesting.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16253
2024-08-21 13:37:25 -07:00
Rob Norris 5b9e695392 abd_os: break out platform-specific header parts
Removing the platform #ifdefs from shared headers in favour of
per-platform headers. Makes abd_t much leaner, among other things.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16253
2024-08-21 13:37:18 -07:00
Rob Norris 7a5b4355e2 abd_os: split userspace and Linux kernel code
The Linux abd_os.c serves double-duty as the userspace scatter abd
implementation, by carrying an emulation of kernel scatterlists. This
commit lifts common and userspace-specific parts out into a separate
abd_os.c for libzpool.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16253
2024-08-21 13:37:13 -07:00
Rob Norris 2b7d9a7863 zio: no alloc canary in userspace
Makes it harder to use memory debuggers like valgrind directly, because
they can't see canary overruns.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16253
2024-08-21 13:37:07 -07:00
Rob Norris b3f4e4e1ec abd: remove ABD_FLAG_ZEROS
Nothing ever checks it.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #16253
2024-08-21 13:36:24 -07:00
shodanshok bbe8512a93
Ignore zfs_arc_shrinker_limit in direct reclaim mode
zfs_arc_shrinker_limit (default: 10000) avoids ARC collapse
due to excessive memory reclaim. However, when the kernel is
in direct reclaim mode (ie: low on memory), limiting ARC reclaim
increases OOM risk. This is especially true on system without
(or with inadequate) swap.

This patch ignores zfs_arc_shrinker_limit when the kernel is in
direct reclaim mode, avoiding most OOM. It also restores
"echo 3 > /proc/sys/vm/drop_caches" ability to correctly drop
(almost) all ARC.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Adam Moss <c@yotes.com>
Signed-off-by: Gionatan Danti <g.danti@assyoma.it>
Closes #16313
2024-08-21 10:00:33 -07:00
Ameer Hamza a2c4e95cfd linux/zvol_os.c: cleanup limits for non-blk mq case
Rob Noris suggested that we could clean up redundant limits for the case
of non-blk mq scenario.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #16462
2024-08-20 17:16:08 -07:00
Ameer Hamza 8e6a9aabb1 linux/zvol_os.c: Fix max_discard_sectors limit for 6.8+ kernel
In kernels 6.8 and later, the zvol block device is allocated with
qlimits passed during initialization. However, the zvol driver does not
set `max_hw_discard_sectors`, which is necessary to properly
initialize `max_discard_sectors`. This causes the `zvol_misc_trim` test
to fail on 6.8+ kernels when invoking the `blkdiscard` command. Setting
`max_hw_discard_sectors` in the `HAVE_BLK_ALLOC_DISK_2ARG` case resolve
the issue.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #16462
2024-08-20 17:14:44 -07:00
Rob Norris 816d2b2bfc spl-proc: remove old taskq stats
These had minimal useful information for the admin, didn't work properly
in some places, and knew far too much about taskq internals.

With the new stats available, these should never be needed anymore.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Syneto
Closes #16171
2024-08-19 09:50:45 -07:00
Rob Norris 3f8fd3cae0 spl-taskq: summary stats for all taskqs
This adds /proc/spl/kstats/taskq/summary, which attempts to show a
useful subset of stats for all taskqs in the system.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Syneto
Closes #16171
2024-08-19 09:50:41 -07:00
Rob Norris db40fe4cf6 spl-taskq: per-taskq kstats
This exposes a variety of per-taskq stats under /proc/spl/kstat/taskq,
one file per taskq, named for the taskq name.instance.

These include a small amount of info about the taskq config, the current
state of the threads and queues, and various counters for thread and
queue activity since the taskq was created.

To assist with decrementing queue size counters, the list an entry is on
is encoded in spare bits in the entry flags.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Syneto
Closes #16171
2024-08-19 09:50:35 -07:00
Rob Norris f0ad031cd9 spl-generic: bring up kstats subsystem before taskq
For spl-taskq to use the kstats infrastructure, it has to be available
first.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Syneto
Closes #16171
2024-08-19 09:49:28 -07:00
Chunwei Chen 06a7b123ac
Skip ro check for snaps when multi-mount
Skip ro check for snapshots since they are always ro regardless if ro
flag is passed by mount or not. This allows multi-mounting snapshots
without requiring to specify ro flag.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Closes #16299
2024-08-19 09:42:17 -07:00
shodanshok 77a797a382
Enable L2 cache of all (MRU+MFU) metadata but MFU data only
`l2arc_mfuonly` was added to avoid wasting L2 ARC on read-once MRU
data and metadata. However it can be useful to cache as much
metadata as possible while, at the same time, restricting data
cache to MFU buffers only.

This patch allow for such behavior by setting `l2arc_mfuonly` to 2
(or higher). The list of possible values is the following:
0: cache both MRU and MFU for both data and metadata;
1: cache only MFU for both data and metadata;
2: cache both MRU and MFU for metadata, but only MFU for data.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Gionatan Danti <g.danti@assyoma.it>
Closes #16343 
Closes #16402
2024-08-16 13:34:07 -07:00
Allan Jude a60e15d6b9 Man page updates for dmu_ddt_copies
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Closes #15895
2024-08-16 12:04:04 -07:00
Rob Norris 0d2707815d ddt: lookup and log stats
Adds per-DDT stats counting lookups and where they were serviced from
(either log or backing zap), number of log entries in memory, and flow
rates.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Closes #15895
2024-08-16 12:03:51 -07:00
Rob Norris a1902f4950 ddt: block scan until log is flushed, and flush aggressively
The dedup log does not have a stable cursor, so its not possible to
persist our current scan location within it across pool reloads.
Beccause of this, when walking (scanning), we can't treat it like just
another source of dedup entries.

Instead, when a scan is wanted, we switch to an aggressive flushing
mode, pushing out entries older than the scan start txg as fast as we
can, before starting the scan proper.

Entries after the scan start txg will be handled via other methods; the
DDT ZAPs and logs will be written as normal, and blocks not seen yet
will be offered to the scan machinery as normal.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Closes #15895
2024-08-16 12:03:43 -07:00
Rob Norris cd69ba3d49 ddt: dedup log
Adds a log/journal to dedup. At the end of txg, instead of writing the
entry directly to the ZAP, instead its adding to an in-memory tree and
appended to an on-disk object. The on-disk object is only read at
import, to reload the in-memory tree.

Lookups first go the the log tree before going to the ZAP, so
recently-used entries will remain close by in memory. This vastly
reduces overhead from dedup IO, as it will not have to do so many
read/update/write cycles on ZAP leaf nodes.

A flushing facility is added at end of txg, to push logged entries out
to the ZAP. There's actually two separate "logs" (in-memory tree and
on-disk object), one active (recieving updated entries) and one flushing
(writing out to disk). These are swapped (ie flushing begins) based on
memory used by the in-memory log trees and time since we last flushed
something.

The flushing facility monitors the amount of entries coming in and being
flushed out, and calibrates itself to try to flush enough each txg to
keep up with the ingest rate without competing too much with other IO.
Multiple tuneables are provided to control the flushing facility.

All the histograms and stats are update to accomodate the log as a
separate entry store. zdb gains knowledge of how to count them and dump
them. Documentation included!

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Closes #15895
2024-08-16 12:03:35 -07:00
Rob Norris cbb9ef0a4c ddt: tuneable to override copies= on dedup metadata objects
All objects stored in the MOS get copies=3. For a large dedup table,
this requires significant extra IO and disk space, when its not really
necessary - the dedup table itself isn't needed to read or write data,
only to keep data usage down. Losing the dedup table does not render the
pool unusable, it just messes up the accounting somewhat.

This adds a dmu_ddt_copies tuneable. When set to 0, the existing
behaviour is used. When set higher, dedup table blocks (ZAP and log)
will have this many copies rather than the usual 3, while indirect
blocks will have one more again.

This is a tuneable for now mostly for testing. Losing a dedup table can
cause blocks to be leaked, and we currently have no facilities to repair
that.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Closes #15895
2024-08-16 12:03:27 -07:00
Rob Norris 592f38900d ddt: compare keys 64-bits at a time, trying to match ZAP order
This yields substantial performance improvements when we only write out
some small % of entries at a time, as it will cause entries that will go
into "nearby" ZAP leaf nodes to be grouped closer together in the AVL, and
so touch fewer blocks. Without this, the distribution is an even spread,
so we touch a lot more ZAP leaf nodes for any given number of entries.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Closes #15895
2024-08-16 12:03:19 -07:00
Rob Norris 27e9cb5f80 ddt: cleanup the stats & histogram code
Both the API and the code were kinda mangled and I was really struggling
to follow it. The worst offender was the old ddt_stat_add(); after
fixing it up the rest of the changes are mostly knock-on effects and
targets of opportunity.

Note that the old ddt_stat_add() was safe against overflows - it could
produce crazy numbers, but the compiler wouldn't do anything stupid. The
assertions in ddt_stat_sub() go a lot of the way to protecting against
this; getting in a position where overflows are a problem is definitely
a programming error.

Also expanding ddt_stat_add() and ddt_histogram_empty() produces less
efficient assembly. I'm not bothered about this right now though; these
should not be hot functions, and if they are we'll optimise them later.
If we have to go back to the old form, we'll comment it like crazy.

Finally, I've removed the assertion that the bucket will never be
negative, as it will soon be possible to have entries with zero
refcounts: an entry for a block that is no longer on the pool, but is on
the log waiting to be synced out. It might be better to have a separate
bucket for these, since they're still using real space on disk, but
ultimately these stats are driving UI, and for now I've chosen to keep
them matching how they've looked in the past, as well as match the
operators mental model - pool usage is managed elsewhere.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Closes #15895
2024-08-16 12:02:56 -07:00
Rob Norris f4aeb23f52 ddt: add "flat phys" feature
Traditional dedup keeps a separate ddt_phys_t "type" for each possible
count of DVAs (that is, copies=) parameter. Each of these are tracked
independently of each other, and have their own set of DVAs. This leads
to an (admittedly rare) situation where you can create as many as six
copies of the data, by changing the copies= parameter between copying.
This is both a waste of storage on disk, but also a waste of space in
the stored DDT entries, since there never needs to be more than three
DVAs to handle all possible values of copies=.

This commit adds a new FDT feature, DDT_FLAG_FLAT. When active, only the
first ddt_phys_t is used. Each time a block is written with the dedup
bit set, this single phys is checked to see if it has enough DVAs to
fulfill the request. If it does, the block is filled with the saved DVAs
as normal. If not, an adjusted write is issued to create as many extra
copies as are needed to fulfill the request, which are then saved into
the entry too.

Because a single phys is no longer an all-or-nothing, but can be
transitioning from fewer to more DVAs, the write path now has to keep a
copy of the previous "known good" DVA set so we can revert to it in case
an error occurs. zio_ddt_write() has been restructured and heavily
commented to make it much easier to see what's happening.

Backwards compatibility is maintained simply by allocating four
ddt_phys_t when the DDT_FLAG_FLAT flag is not set, and updating the phys
selection macros to check the flag. In the old arrangement, each number
of copies gets a whole phys, so it will always have either zero or all
necessary DVAs filled, with no in-between, so the old behaviour
naturally falls out of the new code.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Co-authored-by: Don Brady <don.brady@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Closes #15893
2024-08-16 12:02:39 -07:00
Rob Norris 0ba5f503c5 ddt: slim down ddt_entry_t
This slims down the in-memory entry to as small as it can be. The
IO-related parts are made into a separate entry, since they're
relatively rarely needed.

The variable allocation for dde_phys is to support the upcoming flat
format.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Closes #15893
2024-08-16 12:02:31 -07:00
Rob Norris 4d686c3da5 ddt: introduce lightweight entry
The idea here is that sometimes you need the contents of an entry with
no intent to modify it, and/or from a place where its difficult to get
hold of its originating ddt_t to know how to interpret it.

A lightweight entry contains everything you might need to "read" an
entry - its key, type and phys contents - but none of the extras for
modifying it or using it in a larger context. It also has the full
complement of phys slots, so it can represent any kind of dedup entry
without having to know the specific configuration of the table it came
from.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Closes #15893
2024-08-16 12:02:22 -07:00