Commit Graph

927 Commits

Author SHA1 Message Date
Laura Hild 228b064d1b Remove implication that child `disk`s aren't vdevs in zpoolconcepts(7)
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Laura Hild <lsh@jlab.org>
Closes #15247
2023-09-19 08:52:06 -07:00
Serapheim Dimitropoulos 63159e5bda checkstyle: fix action failures
Reviewed-by: Don Brady <dev.fs.zfs@gmail.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #15220
2023-09-01 09:33:33 -07:00
Paul Dagnelie 7eabb0af37 Try to clarify wording to reduce zpool add incidents
Try to clarify wording to reduce zpool add incidents.
Add an attach example.

Reviewed-by: Rich Ercolani <Rincebrain@gmail.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #15179
2023-08-27 08:25:42 -07:00
наб 1b696429c1 Make zoned/jailed zfsprops(7) make more sense.
- Distribute zfs-[un]jail.8 on FreeBSD and zfs-[un]zone.8 on Linux
- zfsprops.7: mirror zoned/jailed, only available on respective platforms

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes #15161
2023-08-27 08:25:42 -07:00
наб 6bdc7259d1 libzfs: sendrecv: send_progress_thread: handle SIGINFO/SIGUSR1
POSIX timers target the process, not the thread (as does SIGINFO),
so we need to block it in the main thread which will die if interrupted.

Ref: https://101010.pl/@ed1conf@bsd.network/110731819189629373
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes #15113
2023-08-25 13:33:40 -07:00
Rob N 685ae4429f metaslab: tuneable to better control force ganging
metaslab_force_ganging isn't enough to actually force ganging, because
it still only forces 3% of the time. This adds
metaslab_force_ganging_pct so we can configure how often to force
ganging.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Closes #15088
2023-07-21 16:35:12 -07:00
Alexander Motin 81be809a25 Adjust prefetch parameters.
- Reduce maximum prefetch distance for 32bit platforms to 8MB as it
was previously.  Those systems didn't grow much probably, so better
stay conservative there.
 - Retire array_rd_sz tunable, blocking prefetch for large requests.
We should not penalize applications trying to be more efficient. The
speculative prefetcher by itself has reasonable distance limits, and
1MB is not much at all these days.

Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #15072
2023-07-21 16:35:12 -07:00
Alan Somers b6f618f8ff Don't emit cksum_{actual_expected} in ereport.fs.zfs.checksum events
With anything but fletcher-4, even a tiny change in the input will cause
the checksum value to change completely.  So knowing the actual and
expected checksums doesn't provide much more information than "they
don't match".  The harm in sending them is simply that they bloat the
event.  In particular, on FreeBSD the event must fit into a 1016 byte
buffer.

Fixes #14717 for mirrored pools.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rich Ercolani <rincebrain@gmail.com>
Signed-off-by: Alan Somers <asomers@gmail.com>
Sponsored-by: Axcient
Closes #14717
Closes #15052
2023-07-21 16:35:12 -07:00
Alan Somers 51a2b59767 Don't emit checksum histograms in ereport.fs.zfs.checksum events
The checksum histograms were intended to be used with ATA and parallel
SCSI, which are obsolete.  With modern storage hardware, they will
almost always look like white noise; all bits will be wrong.  They only
serve to bloat the event.  That's a particular problem on FreeBSD, where
events must fit into a 1016 byte buffer.

This fixes issue #14717 for RAIDZ pools, but not for mirror pools.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rich Ercolani <rincebrain@gmail.com>
Signed-off-by: Alan Somers <asomers@gmail.com>
Sponsored-by: Axcient
Closes #15052
2023-07-21 16:35:12 -07:00
Rich Ercolani 2b10e32561
Pack our DDT ZAPs a bit denser.
The DDT is really inefficient on 4k and up vdevs, because it always
allocates 4k blocks, and while compression could save us somewhat
at ashift 9, that stops being true.

So let's change the default to 32 KiB, which seems like a reasonable
compromise between improved space savings and inflated write sizes
for DDT updates.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes #14654
2023-06-30 09:42:02 -07:00
Alexander Motin fa7b2390d4
Do not report bytes skipped by scan as issued.
Scan process may skip blocks based on their birth time, DVA, etc.
Traditionally those blocks were accounted as issued, that caused
reporting of hugely over-inflated numbers, having nothing to do
with actual disk I/O.  This change utilizes never used field in
struct dsl_scan_phys to account such skipped bytes, allowing to
report how much data were actually scrubbed/resilvered and what
is the actual I/O speed.  While formally it is an on-disk format
change, it should be compatible both ways, so should not need a
feature flag.

This should partially address the same issue as c85ac731a0, but
from a different perspective, complementing it.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Akash B <akash-b@hpe.com>
Signed-off-by:  Alexander Motin <mav@FreeBSD.org>
Sponsored by:   iXsystems, Inc.
Closes #15007
2023-06-30 08:47:13 -07:00
Mateusz Piotrowski 62ace21a14
zdb: Add missing poolname to -C synopsis
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Rob Norris <robn@despairlabs.com>
Signed-off-by: Mateusz Piotrowski <0mp@FreeBSD.org>
Sponsored-by: Klara Inc.
Closes #15014
2023-06-29 10:54:43 -07:00
Laevos bc9d0084ea
Remove unnecessary commas in zpool-create.8
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Laevos <5572812+Laevos@users.noreply.github.com>
Closes #15011
2023-06-27 16:58:32 -07:00
Alexander Motin 8469b5aac0
Another set of vdev queue optimizations.
Switch FIFO queues (SYNC/TRIM) and active queue of vdev queue from
time-sorted AVL-trees to simple lists.  AVL-trees are too expensive
for such a simple task.  To change I/O priority without searching
through the trees, add io_queue_state field to struct zio.

To not check number of queued I/Os for each priority add vq_cqueued
bitmap to struct vdev_queue.  Update it when adding/removing I/Os.
Make vq_cactive a separate array instead of struct vdev_queue_class
member.  Together those allow to avoid lots of cache misses when
looking for work in vdev_queue_class_to_issue().

Introduce deadline of ~0.5s for LBA-sorted queues.  Before this I
saw some I/Os waiting in a queue for up to 8 seconds and possibly
more due to starvation.  With this change I no longer see it.  I
had to slightly more complicate the comparison function, but since
it uses all the same cache lines the difference is minimal.  For a
sequential I/Os the new code in vdev_queue_io_to_issue() actually
often uses more simple avl_first(), falling back to avl_find() and
avl_nearest() only when needed.

Arrange members in struct zio to access only one cache line when
searching through vdev queues.  While there, remove io_alloc_node,
reusing the io_queue_node instead.  Those two are never used same
time.

Remove zfs_vdev_aggregate_trim parameter.  It was disabled for 4
years since implemented, while still wasted time maintaining the
offset-sorted tree of TRIM requests.  Just remove the tree.

Remove locking from txg_all_lists_empty().  It is racy by design,
while 2 pair of locks/unlocks take noticeable time under the vdev
queue lock.

With these changes in my tests with volblocksize=4KB I measure vdev
queue lock spin time reduction by 50% on read and 75% on write.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #14925
2023-06-27 09:09:48 -07:00
Rich Ercolani 35a6247c5f
Add a delay to tearing down threads.
It's been observed that in certain workloads (zvol-related being a
big one), ZFS will end up spending a large amount of time spinning
up taskqs only to tear them down again almost immediately, then
spin them up again...

I noticed this when I looked at what my mostly-idle system was doing
and wondered how on earth taskq creation/destroy was a bunch of time...

So I added a configurable delay to avoid it tearing down tasks the
first time it notices them idle, and the total number of threads at
steady state went up, but the amount of time being burned just
tearing down/turning up new ones almost vanished.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes #14938
2023-06-26 13:57:12 -07:00
Alexander Motin 70ea484e3e
Finally drop long disabled vdev cache.
It was a vdev level read cache, designed to aggregate many small
reads by speculatively issuing bigger reads instead and caching
the result.  But since it has almost no idea about what is going
on with exception of ZIO_FLAG_DONT_CACHE flag set by higher layers,
it was found to make more harm than good, for which reason it was
disabled for the past 12 years.  These days we have much better
instruments to enlarge the I/Os, such as speculative and prescient
prefetches, I/O scheduler, I/O aggregation etc.

Besides just the dead code removal this removes one extra mutex
lock/unlock per write inside vdev_cache_write(), not otherwise
disabled and trying to do some work.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #14953
2023-06-09 12:40:55 -07:00
Rob Norris 8653f1de48 zdb: add -B option to generate backup stream
This is more-or-less like `zfs send`, but specifying the snapshot by its
objset id for situations where it can't be referenced any other way.

Sponsored-By: Klara, Inc.
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: WHR <msl0000023508@gmail.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes #14642
2023-06-05 11:54:42 -07:00
Graham Perrin 6c29422e90
zfs-create(8): ZFS for swap: caution, clarity
Make the section heading more generic (the section relates to ZFS files
as well as ZFS volumes).

Swapping to a ZFS volume is prone to deadlock. Remove the related
instruction, direct readers to OpenZFS FAQ. Related, but not linked
from within the manual page:

<https://openzfs.github.io/openzfs-docs/Project%20and%20Community/FAQ.html#using-a-zvol-for-a-swap-device-on-linux>
(Using a zvol for a swap device on Linux).

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Graham Perrin <grahamperrin@freebsd.org>
Issue #7734 
Closes #14756
2023-06-02 11:25:13 -07:00
Colm d3e0138a3d
Adding new read-only compatible zpool features to compatibility.d/grub2
GRUB2 is compatible with all "read-only compatible" features,
so it is safe to add new features of this type to the grub2
compatibility list. We generally want to include all compatible
features, to minimize the differences between grub2-compatible
pools and no-compatibility pools.

Adding new properties `livelist` and `zpool_checkpoint` accordingly.

Also adding them to the man page which references this file as an
example, for consistency.

Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Colm Buckley <colm@tuatha.org>
Closes #14893
2023-05-26 10:04:19 -07:00
George Amanakis 482eeef804 Teach zpool scrub to scrub only blocks in error log
Added a flag '-e' in zpool scrub to scrub only blocks in error log. A
user can pause, resume and cancel the error scrub by passing additional
command line arguments -p -s just like a regular scrub. This involves
adding a new flag, creating new libzfs interfaces, a new ioctl, and the
actual iteration and read-issuing logic. Error scrubbing is executed in
multiple txg to make sure pool performance is not affected.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Co-authored-by: TulsiJain tulsi.jain@delphix.com
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #8995
Closes #12355
2023-05-18 11:59:42 -07:00
Brian Behlendorf e34e15ed6d
Add the ability to uninitialize
zpool initialize functions well for touching every free byte...once.
But if we want to do it again, we're currently out of luck.

So let's add zpool initialize -u to clear it.

Co-authored-by: Rich Ercolani <rincebrain@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes #12451 
Closes #14873
2023-05-18 10:02:20 -07:00
George Amanakis 6839ec6f10
Enable the head_errlog feature to remove errors
In case check_filesystem() does not error out and does not report
an error, remove that error block from error lists and logs
without requiring a scrub. This can happen when the original file and
all snapshots/clones referencing it have been removed.

Otherwise zpool status will still report that "Permanent errors have
been detected..." without actually reporting any of them.

To implement this change the functions introduced in corrective
receive were modified to take into account the head_errlog feature.

Before this change:
=============================
pool: test
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
config:

        NAME                   STATE     READ WRITE CKSUM
        test                   ONLINE       0     0     0
          /home/user/vdev_a    ONLINE       0     0     2

errors: Permanent errors have been detected in the following files:

=============================

After this change:
=============================
  pool: test
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are
unaffected.
action: Determine if the device needs to be replaced, and clear the
errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
config:

        NAME                   STATE     READ WRITE CKSUM
        test                   ONLINE       0     0     0
          /home/user/vdev_a    ONLINE       0     0     2

errors: No known data errors
=============================

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #14813
2023-05-09 08:53:27 -07:00
George Amanakis 4eca03faaf
Fixes in head_errlog feature with encryption
For the head_errlog feature use dsl_dataset_hold_obj_flags() instead of
dsl_dataset_hold_obj() in order to enable access to the encryption keys
(if loaded). This enables reporting of errors in encrypted filesystems
which are not mounted but have their keys loaded.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #14837
2023-05-08 13:35:03 -07:00
buzzingwires a46001adb9
Allow zhack label repair to restore detached devices.
This commit expands on the zhack label repair command in d04b5c9 by
adding the -u option to undetach a device by regenerating uberblocks,
in addition to the existing functionality of fixing checksums, now
represented by -c. Previous behavior is retained in the case of no
options.

The changes are heavily inspired by Jeff Bonwick's labelfix
utility, as archived at:

https://gist.github.com/jjwhitney/baaa63144da89726e482

Additionally, it is now capable of properly determining the size of
block devices and other media, as well as handling sizes which are
not divisible by 2^18. This should make it viable for use on physical
devices and partitions, in addition to files.

These changes should make it possible to import zpools that have had
their uberblocks erased, such as in the case of pools rendered
inaccessible by erroneous detach commands.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: buzzingwires <buzzingwires@outlook.com>
Closes #14773
2023-05-03 09:03:57 -07:00
Allan Jude 8eae2d214c
Add support for zpool user properties
Usage:

    zpool set org.freebsd:comment="this is my pool" poolname

Tests are based on zfs_set's user property tests.

Also stop truncating property values at MAXNAMELEN, use ZFS_MAXPROPLEN.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Mateusz Piotrowski <mateusz.piotrowski@klarasystems.com>
Sponsored-by: Beckhoff Automation GmbH & Co. KG.
Sponsored-by: Klara Inc.
Closes #11680
2023-04-21 10:20:36 -07:00
rob-wing 3e4ed4213d
Create zap for root vdev
And add it to the AVZ, this is not backwards compatible with older pools
due to an assertion in spa_sync() that verifies the number of ZAPs of
all vdevs matches the number of ZAPs in the AVZ.

Granted, the assertion only applies to #DEBUG builds - still, a feature
flag is introduced to avoid the assertion, com.klarasystems:vdev_zaps_v2

Notably, this allows to get/set properties on the root vdev:

    % zpool set user:prop=value <pool> root-0

Before this commit, it was already possible to get/set properties on
top-level vdevs with the syntax <type>-<vdev_id> (e.g. mirror-0):

    % zpool set user:prop=value <pool> mirror-0

This syntax also applies to the root vdev as it is is of type 'root'
with a vdev_id of 0, root-0. The keyword 'root' as an alias for
'root-0'.

The following tests have been added:

    - zpool get all properties from root vdev
    - zpool set a property on root vdev
    - verify root vdev ZAP is created

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Wing <rob.wing@klarasystems.com>
Sponsored-by: Seagate Technology
Submitted-by: Klara, Inc.
Closes #14405
2023-04-20 10:07:56 -07:00
наб 3d37e7e5f5
zfsprops.7: update mandlock
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=f7e33bdbd6d1bdf9c3df8bba5abcf3399f957ac3
https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/commit/?id=7e59106e9c34458540f7d382d5b49071d1b7104f

Fixes: commit fb9baa9b20 ("zfsprops.8:
 remove nbmand-not-used-on-Linux and pointer to mount(8)")

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes #14765
2023-04-19 09:03:42 -07:00
dodexahedron ac18dc77f3
Minor improvements to zpoolconcepts.7
* Fixed one typo (effects -> affects)
 * Re-worded raidz description to make it clearer that it is not
   quite the same as RAID5, though similar
 * Clarified that data is not necessarily written in a static
   stripe width
 * Minor grammar consistency improvement
 * Noted that "volumes" means zvols
 * Fixed a couple of split infinitives
 * Clarified that hot spares come from the same pool they were
   assigned to
* "we" -> ZFS
* Fixed warnings thrown by mandoc, and removed unnecessary
  wordiness in one fixed line.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Brandon Thetford <brandon@dodecatec.com>
Closes #14726
2023-04-13 09:15:34 -07:00
Rob N ff73574cd8
vdev: expose zfs_vdev_max_ms_shift as a module parameter
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Seagate Technology LLC
Closes #14719
2023-04-06 10:52:50 -07:00
Rob N ece7ab7e7d
vdev: expose zfs_vdev_def_queue_depth as a module parameter
It was previously available only to FreeBSD.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Seagate Technology LLC
Closes #14718
2023-04-06 10:31:19 -07:00
наб 3399a30ee0
contrib: dracut: fix race with root=zfs:dset when necessities required
This had always worked in my testing, but a user on hardware reported
this to happen 100%, and I reproduced it once with cold VM host caches.

dracut-zfs-generator runs as a systemd generator, i.e. at Some
Relatively Early Time; if root= is a fixed dataset, it tries to
"solve [necessities] statically at generation time".

If by that point zfs-import.target hasn't popped (because the import is
taking a non-negligible amount of time for whatever reason), it'll see
no children for the root datase, and as such generate no mounts.

This has never had any right to work. No-one caught this earlier because
it's just that much more convenient to have root=zfs:AUTO, which orders
itself properly.

To fix this, always run zfs-nonroot-necessities.service;
this additionally simplifies the implementation by:
  * making BOOTFS from zfs-env-bootfs.service be the real, canonical,
    root dataset name, not just "whatever the first bootfs is",
    and only set it if we're ZFS-booting
  * zfs-{rollback,snapshot}-bootfs.service can use this instead of
    re-implementing it
  * having zfs-env-bootfs.service also set BOOTFSFLAGS
  * this means the sysroot.mount drop-in can be fixed text
  * zfs-nonroot-necessities.service can also be constant and always
    enabled, because it's conditioned on BOOTFS being set

There is no longer any code generated at run-time
(the sysroot.mount drop-in is an unavoidable gratuitous cp).

The flow of BOOTFS{,FLAGS} from zfs-env-bootfs.service to sysroot.mount
is not noted explicitly in dracut.zfs(7), because (a) at some point it's
just visual noise and (b) it's already ordered via d-p-m.s from z-i.t.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes #14690
2023-03-31 09:47:48 -07:00
George Amanakis 431083f75b
Fixes in persistent error log
Address the following bugs in persistent error log:

1) Check nested clones, eg "fs->snap->clone->snap2->clone2".

2) When deleting files containing error blocks in those clones (from
   "clone" the example above), do not break the check chain.

3) When deleting files in the originating fs before syncing the errlog
   to disk, do not break the check chain. This happens because at the
   time of introducing the error block in the error list, we do not have
   its birth txg and the head filesystem. If the original file is
   deleted before the error list is synced to the error log (which is
   when we actually lookup the birth txg and the head filesystem), then
   we do not have access to this info anymore and break the check chain.

The most prominent change is related to achieving (3). We expand the
spa_error_entry_t structure to accommodate the newly introduced
zbookmark_err_phys_t structure (containing the birth txg of the error
block).Due to compatibility reasons we cannot remove the
zbookmark_phys_t structure and we also need to place the new structure
after se_avl, so it is not accounted for in avl_find(). Then we modify
spa_log_error() to also provide the birth txg of the error block. With
these changes in place we simplify the previously introduced function
get_head_and_birth_txg() (now named get_head_ds()).

We chose not to follow the same approach for the head filesystem (thus
completely removing get_head_ds()) to avoid introducing new lock
contentions.

The stack sizes of nested functions (as measured by checkstack.pl in the
linux kernel) are:
check_filesystem [zfs]: 272 (was 912)
check_clones [zfs]: 64

We also introduced two new tests covering the above changes.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #14633
2023-03-28 16:51:58 -07:00
Tino Reichardt 2bd0490faf Add colored output to zfs list
Use a bold header row and colorize the AVAIL column based on
the used space percentage of volume.

We define these colors:
- when > 80%, use yellow
- when > 90%, use red

Reviewed-by: WHR <msl0000023508@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ethan Coe-Renner <coerenner1@llnl.gov>
Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de>
Closes #14621
Closes #14350
2023-03-24 10:24:11 -07:00
Tino Reichardt 7bde396aa2 Colorize zpool iostat output
Use a bold header and colorize the space suffixes in iostat
by order of magnitude like this:
- K is green
- M is yellow
- G is red
- T is lightblue
- P is magenta
- E is cyan
- 0 space is colored gray

Reviewed-by: WHR <msl0000023508@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ethan Coe-Renner <coerenner1@llnl.gov>
Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de>
Closes #14621
Closes #14459
2023-03-24 10:23:52 -07:00
Rob N 5b5f518687
man: add ZIO_STAGE_BRT_FREE to zpool-events
And bump all the values after it, matching the header update in
67a1b037.

Reviewed-by:  Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #14665
2023-03-24 10:14:39 -07:00
Alek P f55d6ee818
Improve tests and update man page for healing recv
Fix the manpage. The "SYNOPSIS" section is incorrectly formatted for 
receive -c.  I also took this opportunity to reword some parts and 
fix a run-on sentence in the manpage.

Add large block testing for corrective recv. This adds a new test 
that makes sure blocks generated using zfs send -L/--large-block 
large-block send flag are able to be used for healing.

Since with unloaded key and errlog feature enabled corruption is not 
shown in zpool status #13675 is fixed the zfs_receive_corrective.ksh 
test no longer sets -o feature@head_errlog=disabled on pool creation 
so that it can also test for regressions related to head_errlog feature.

Note that the zfs_receive_compressed_corrective.ksh and
zfs_receive_large_block_corrective.ksh tests are still creating pools 
with -o feature@head_errlog=disabled.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alek Pinchuk <apinchuk@axcient.com>
Closes #14615
2023-03-15 10:34:10 -07:00
Pawel Jakub Dawidek 67a1b03791
Implementation of block cloning for ZFS
Block Cloning allows to manually clone a file (or a subset of its
blocks) into another (or the same) file by just creating additional
references to the data blocks without copying the data itself.
Those references are kept in the Block Reference Tables (BRTs).

The whole design of block cloning is documented in module/zfs/brt.c.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Christian Schwarz <christian.schwarz@nutanix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rich Ercolani <rincebrain@gmail.com>
Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net>
Closes #13392
2023-03-10 11:59:53 -08:00
Alexander Motin a8d83e2a24
More adaptive ARC eviction
Traditionally ARC adaptation was limited to MRU/MFU distribution.  But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO.  As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.

This change removes most of existing eviction logic, rewriting it from
scratch.  This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.

The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata.  To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions.  Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums.  This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.

This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.

Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do.  Instead just call
arc_prune_async() if too much metadata appear not evictable.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14359
2023-03-08 11:17:23 -08:00
Rob N 163f3d3a1f
zdb: add decryption support
The approach is straightforward: for dataset ops, if a key was offered,
find the encryption root and the various encryption parameters, derive a
wrapping key if necessary, and then unlock the encryption root. After
that all the regular dataset ops will return unencrypted data, and
that's kinda the whole thing.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #11551
Closes #12707
Closes #14503
2023-03-02 13:39:09 -08:00
Rob N 4fe9cc5437
man: note that zdb operates directly on pool storage
A frequent misunderstanding is that zdb accesses the pool through the
kernel or filesystem, leading to confusion particularly when it can't
access something that it seems like it should be able to.

I've seen this confusion recently when zdb couldn't access a pool because
the user didn't have permission to read directly from the block devices,
and when it couldn't show attributes of encrypted files even though the
dataset was unlocked.

The manpage already speaks to another symptom of this, namely that zdb
may "behave erratically" on an active pool; here I'm trying to make that
a little more explicit.

Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes #14539
2023-02-28 17:35:33 -08:00
D. Ebdrup 4b9bc6345e
Add vdevprops.7 to the Makefile
Adding vdevprops.7 to the Makefile ensures that it gets installed
properly on FreeBSD.

Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Daniel Ebdrup Jensen <debdrup@FreeBSD.org>
Closes #14527
2023-02-27 16:38:09 -08:00
Brian Behlendorf 973934b965 Increase default zfs_rebuild_vdev_limit to 64MB
When testing distributed rebuild performance with more capable
hardware it was observed than increasing the zfs_rebuild_vdev_limit
to 64M reduced the rebuild time by 17%.  Beyond 64MB there was
some improvement (~2%) but it was not significant when weighed
against the increased memory usage. Memory usage is capped at 1/4
of arc_c_max.

Additionally, vr_bytes_inflight_max has been moved so it's updated
per-metaslab to allow the size to be adjust while a rebuild is
running.

Reviewed-by: Akash B <akash-b@hpe.com>
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #14428
2023-01-27 10:02:24 -08:00
Brian Behlendorf c0aea7cf4e Increase default zfs_scan_vdev_limit to 16MB
For HDD based pools the default zfs_scan_vdev_limit of 4M
per-vdev can significantly limit the maximum scrub performance.
Increasing the default to 16M can double the scrub speed from
80 MB/s per disk to 160 MB/s per disk.

This does increase the memory footprint during scrub/resilver
but given the performance win this is a reasonable trade off.
Memory usage is capped at 1/4 of arc_c_max.  Note that number
of outstanding I/Os has not changed and is still limited by
zfs_vdev_scrub_max_active.

Reviewed-by: Akash B <akash-b@hpe.com>
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #14428
2023-01-27 10:01:13 -08:00
Brian Behlendorf c85ac731a0
Improve resilver ETAs
When resilvering the estimated time remaining is calculated using
the average issue rate over the current pass.  Where the current
pass starts when a scan was started, or restarted, if the pool
was exported/imported.

For dRAID pools in particular this can result in wildly optimistic
estimates since the issue rate will be very high while scanning
when non-degraded regions of the pool are scanned.  Once repair
I/O starts being issued performance drops to a realistic number
but the estimated performance is still significantly skewed.

To address this we redefine a pass such that it starts after a
scanning phase completes so the issue rate is more reflective of
recent performance.  Additionally, the zfs_scan_report_txgs
module option can be set to reset the pass statistics more often.

Reviewed-by: Akash B <akash-b@hpe.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #14410
2023-01-25 11:28:54 -08:00
Alexander Motin 0f740a4f1d
Introduce minimal ZIL block commit delay
Despite all optimizations, tests on actual hardware show that FreeBSD
kernel can't sleep for less then ~2us.  Similar tests on Linux show
~50us delay at least from nanosleep() (haven't tested inside kernel).
It means that on very fast log device ZIL may not be able to satisfy
zfs_commit_timeout_pct block commit timeout, increasing log latency
more than desired.

Handle that by introduction of zil_min_commit_timeout parameter,
specifying minimal timeout value where additional delays to aggregate
writes may be skipped.  Also skip delays if the LWB is more than 7/8
full, that often happens if I/O sizes are constant and match one of
LWB sizes.  Both things are applied only if there were no already
outstanding log blocks, that may indicate single-threaded workload,
that by definition can not benefit from the commit delays.

While there, add short time moving average to zl_last_lwb_latency to
make it more stable.

Tests of single-threaded 4KB writes to NVDIMM SLOG on FreeBSD show IOPS
increase by 9% instead of expected 5%.  For zfs_commit_timeout_pct of
1 there IOPS increase by 5.5% instead of expected 1%.

Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #14418
2023-01-24 09:20:32 -08:00
rob-wing 69f024a56e
Configure zed's diagnosis engine with vdev properties
Introduce four new vdev properties:
    checksum_n
    checksum_t
    io_n
    io_t

These properties can be used for configuring the thresholds of zed's
diagnosis engine and are interpeted as <N> events in T <seconds>.

When this property is set to a non-default value on a top-level vdev,
those thresholds will also apply to its leaf vdevs. This behavior can be
overridden by explicitly setting the property on the leaf vdev.

Note that, these properties do not persist across vdev replacement. For
this reason, it is advisable to set the property on the top-level vdev
instead of the leaf vdev.

The default values for zed's diagnosis engine (10 events, 600 seconds)
remains unchanged.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Rob Wing <rob.wing@klarasystems.com>
Sponsored-by: Seagate Technology LLC
Closes #13805
2023-01-23 13:14:25 -08:00
George Melikov a379083d9f
Man: fix defaults for zfs_dirty_data_max_max
It was changed in e99932f7de,
but without docs update.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: George Melikov <mail@gmelikov.ru>
Closes #14400
2023-01-17 13:12:29 -08:00
Ameer Hamza 19d3961589
Use setproctitle to report progress of zfs send
This allows parsing of zfs send progress by checking the process
title.
Doing so requires some changes to the send code in libzfs_sendrecv.c;
primarily these changes move some of the accounting around, to allow
for the code to be verbose as normal, or set the process title. Unlike
BSD, setproctitle() isn't standard in Linux; thus, borrowed it from
libbsd with slight modifications.

Authored-by: Sean Eric Fagan <sef@FreeBSD.org>
Co-authored-by: Ryan Moeller <ryan@iXsystems.com>
Co-authored-by: Ameer Hamza <ahamza@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #14376
2023-01-17 10:17:35 -08:00
Mateusz Piotrowski 926715b9fc
Turn default_bs and default_ibs into ZFS_MODULE_PARAMs
The default_bs and default_ibs tunables control the default block size
and indirect block size.

So far, default_bs and default_ibs were tunable only on FreeBSD, e.g.,

    sysctl vfs.zfs.default_ibs

Remove the FreeBSD-specific sysctl code and expose default_bs and
default_ibs as tunables on both Linux and FreeBSD using
ZFS_MODULE_PARAM.

One of the use cases for changing the values of those tunables is to
lower the indirect block size, which may improve performance of large
directories (as discussed during the OpenZFS Leadership Meeting
on 2022-08-16).

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: Mateusz Piotrowski <mateusz.piotrowski@klarasystems.com>
Sponsored-by: Wasabi Technology, Inc.
Closes #14293
2023-01-11 09:38:20 -08:00
Mateusz Piotrowski a4b21eadec
Add tunable to allow changing micro ZAP's max size
This change turns `MZAP_MAX_BLKSZ` into a `ZFS_MODULE_PARAM()` called
`zap_micro_max_size`. As a result, we can experiment with different
micro ZAP sizes to improve directory size scaling.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Mateusz Piotrowski <mateuszpiotrowski@klarasystems.com>
Co-authored-by: Toomas Soome <toomas.soome@klarasystems.com>
Signed-off-by: Mateusz Piotrowski <mateuszpiotrowski@klarasystems.com>
Sponsored-by: Wasabi Technology, Inc.
Closes #14292
2023-01-10 13:41:54 -08:00