Commit Graph

74 Commits

Author SHA1 Message Date
Ned Bass 2957f38d78 vdev_id support for device link aliases
Add a vdev_id feature to map device names based on already defined
udev device links.  To increase the odds that vdev_id will run after
the rules it depends on, increase the vdev.rules rule number from 60
to 69.  With this change, vdev_id now provides functionality analogous
to zpool_id and zpool_layout, paving the way to retire those tools.

A defined alias takes precedence over a topology-derived name, but the
two naming methods can otherwise coexist. For example, one might name
drives in a JBOD with the sas_direct topology while naming an internal
L2ARC device with an alias.

For example, the following lines in vdev_id.conf will result in the
creation of links /dev/disk/by-vdev/{d1,d2}, each pointing to the same
target as the device link specified in the third field.

  #     by-vdev
  #     name     fully qualified or base name of device link
  alias d1       /dev/disk/by-id/wwn-0x5000c5002de3b9ca
  alias d2       wwn-0x5000c5002def789e

Also perform some minor vdev_id cleanup, such as removal of the unused
-s command line option.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #981
2012-12-03 14:04:47 -08:00
Brian Behlendorf ca8b5af89d Remove autotools products
Remove all of the generated autotools products from the repository
and update the .gitignore files accordingly.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #718
2012-08-27 11:47:44 -07:00
Etienne Dechamps ee5fd0bb80 Set zvol discard_granularity to the volblocksize.
Currently, zvols have a discard granularity set to 0, which suggests to
the upper layer that discard requests of arbirarily small size and
alignment can be made efficiently.

In practice however, ZFS does not handle unaligned discard requests
efficiently: indeed, it is unable to free a part of a block. It will
write zeros to the specified range instead, which is both useless and
inefficient (see dnode_free_range).

With this patch, zvol block devices expose volblocksize as their discard
granularity, so the upper layer is aware that it's not supposed to send
discard requests smaller than volblocksize.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #862
2012-08-07 14:55:31 -07:00
Richard Yao 739a1a82e0 Linux 3.5 compat, end_writeback() changed to clear_inode()
The end_writeback() function was changed by moving the call to
inode_sync_wait() earlier in to evict().   This effecitvely changes
the ordering of the sync but it does not impact the details of
the zfs implementation.

However, as part of this change end_writeback() was renamed to
clear_inode() to reflect the new semantics.  This change does
impact us and clear_inode() now maps to end_writeback() for
kernels prior to 3.5.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #784
2012-07-23 12:29:36 -07:00
Richard Yao ea1fdf46e2 Linux 3.5 compat, iops->truncate_range() removed
The vmtruncate_range() support has been removed from the kernel in
favor of using the fallocate method in the file_operations table.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #784
2012-07-23 12:29:32 -07:00
Richard Yao 756c3e5a9c Linux 3.5 compat, eops->encode_fh() takes inodes
The export_operations member ->encode_fh() has been updated to
take both the child and parent inodes.  This interface used to
take the child dentry and a bool describing if the parent is needed.

NOTE: While updating this code I noticed that we do not currently
cleanly handle the case where we're passed a connectable parent.
This code should be audited to make sure we're doing the right thing.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #784
2012-07-23 12:29:23 -07:00
Etienne Dechamps b5a28807cd Move partition scanning from userspace to module.
Currently, zpool online -e (dynamic vdev expansion) doesn't work on
whole disks because we're invoking ioctl(BLKRRPART) from userspace
while ZFS still has a partition open on the disk, which results in
EBUSY.

This patch moves the BLKRRPART invocation from the zpool utility to the
module. Specifically, this is done just before opening the device in
vdev_disk_open() which is called inside vdev_reopen(). This requires
jumping through some hoops to get to the disk device from the partition
device, and to make sure we can still open the partition after the
BLKRRPART call.

Note that this new code path is triggered on dynamic vdev expansion
only; other actions, like creating a new pool, are unchanged and still
call BLKRRPART from userspace.

This change also depends on API changes which are available in 2.6.37
and latter kernels.  The build system has been updated to detect this,
but there is no compatibility mode for older kernels.  This means that
online expansion will NOT be available in older kernels.  However, it
will still be possible to expand the vdev offline.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #808
2012-07-17 09:17:31 -07:00
Richard Yao ba9b5428fd Relicense zfs.gentoo.in from GPLv2 to 2-clause BSD
As the Gentoo sys-fs/zfs maintainer, I receive license compatibility
questions and at times, those questions can be harassing. I feel that
the presence of the GPL in Gentoo's package metadata promotes such
questions.  zfs.gentoo.in is the only GPLv2 licensed file in ZFS, so I
have taken the liberty of contacting all contributors to this file to
request permission to relicense it.

All of the contributors to this file have agreed to relicense it under
the 2-clause BSD license. I have added their Signed-offs to this commit,
in order of first contribution. Thank you everyone for being so
understanding.

Signed-off-by: devsk <devsku@gmail.com>
Signed-off-by: Alexey Shvetsov <alexxy@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Andrew Tselischev <andrewtselischev@gmail.com>
Signed-off-by: Zachary Bedell <zac@thebedells.org>
Signed-off-by: Gunnar Beutner <gunnar@beutner.name>
Signed-off-by: Kyle Fuller <inbox@kylefuller.co.uk>
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Closes #819
2012-07-10 15:00:16 -07:00
Richard Yao 6a0936babc Linux 3.4 compat, d_make_root() replaces d_alloc_root()
torvalds/linux@adc0e91ab1 introduced
introduced d_make_root() as a replacement for d_alloc_root(). Further
commits appear to have removed d_alloc_root() from the Linux source
tree. This causes the following failure:

  error: implicit declaration of function 'd_alloc_root'
  [-Werror=implicit-function-declaration]

To correct this we update the code to use the current d_make_root()
interface for readability.  Then we introduce an autotools check
to determine if d_make_root() is available.  If it isn't then we
define some compatibility logic which used the older d_alloc_root()
interface.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #776
2012-06-11 10:04:49 -07:00
Ned A. Bass 821b683436 Add vdev_id for JBOD-friendly udev aliases
vdev_id parses the file /etc/zfs/vdev_id.conf to map a physical path
in a storage topology to a channel name.  The channel name is combined
with a disk enclosure slot number to create an alias that reflects the
physical location of the drive.  This is particularly helpful when it
comes to tasks like replacing failed drives.  Slot numbers may also be
re-mapped in case the default numbering is unsatisfactory.  The drive
aliases will be created as symbolic links in /dev/disk/by-vdev.

The only currently supported topologies are sas_direct and sas_switch:

o  sas_direct - a channel is uniquely identified by a PCI slot and a
   HBA port

o  sas_switch - a channel is uniquely identified by a SAS switch port

A multipath mode is supported in which dm-mpath devices are handled by
examining the first running component disk, as reported by 'multipath
-l'.  In multipath mode the configuration file should contain a
channel definition with the same name for each path to a given
enclosure.

vdev_id can replace the existing zpool_id script on systems where the
storage topology conforms to sas_direct or sas_switch.  The script
could be extended to support other topologies as well.  The advantage
of vdev_id is that it is driven by a single static input file that can
be shared across multiple nodes having a common storage toplogy.
zpool_id, on the other hand, requires a unique /etc/zfs/zdev.conf per
node and a separate slot-mapping file.  However, zpool_id provides the
flexibility of using any device names that show up in
/dev/disk/by-path, so it may still be needed on some systems.

vdev_id's functionality subsumes that of the sas_switch_id script, and
it is unlikely that anyone is using it, so sas_switch_id is removed.

Finally, /dev/disk/by-vdev is added to the list of directories that
'zpool import' will scan.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #713
2012-06-01 08:55:14 -07:00
Brian Behlendorf b39d3b9f7b Linux 3.3 compat, iops->create()/mkdir()/mknod()
The mode argument of iops->create()/mkdir()/mknod() was changed from
an 'int' to a 'umode_t'.  To prevent a compiler warning an autoconf
check was added to detect the API change and then correctly set a
zpl_umode_t typedef.  There is no functional change.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #701
2012-04-30 12:52:38 -07:00
Richard Yao 2ce9d0ec61 Make Gentoo initscript use modinfo
The -l parameter to modprobe has been removed from the latest upstream
code and this change has entered Gentoo. Using modinfo as a substitute
addresses this.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #636
2012-04-03 10:37:18 -07:00
Brian Behlendorf 1c5de20ae2 Add --enable-debug-dmu-tx configure option
Allow rigorous (and expensive) tx validation to be enabled/disabled
indepentantly from the standard zfs debugging.  When enabled these
checks ensure that all txs are constructed properly and that a dbuf
is never dirtied without taking the correct tx hold.

This checking is particularly helpful when adding new dmu consumers
like Lustre.  However, for established consumers such as the zpl
with no known outstanding tx construction problems this is just
overhead.

--enable-debug-dmu-tx  - Enable/disable validation of each tx as
--disable-debug-dmu-tx   it is constructed.  By default validation
                         is disabled due to performance concerns.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-03-23 12:25:17 -07:00
Brian Behlendorf ebe7e575ea Add .zfs control directory
Add support for the .zfs control directory.  This was accomplished
by leveraging as much of the existing ZFS infrastructure as posible
and updating it for Linux as required.  The bulk of the core
functionality is now all there with the following limitations.

*) The .zfs/snapshot directory automount support requires a 2.6.37
   or newer kernel.  The exception is RHEL6.2 which has backported
   the d_automount patches.

*) Creating/destroying/renaming snapshots with mkdir/rmdir/mv
   in the .zfs/snapshot directory works as expected.  However,
   this functionality is only available to root until zfs
   delegations are finished.

      * mkdir - create a snapshot
      * rmdir - destroy a snapshot
      * mv    - rename a snapshot

The following issues are known defeciences, but we expect them to
be addressed by future commits.

*) Add automount support for kernels older the 2.6.37.  This should
   be possible using follow_link() which is what Linux did before.

*) Accessing the .zfs/snapshot directory via NFS is not yet possible.
   The majority of the ground work for this is complete.  However,
   finishing this work will require resolving some lingering
   integration issues with the Linux NFS kernel server.

*) The .zfs/shares directory exists but no futher smb functionality
   has yet been implemented.

Contributions-by: Rohan Puri <rohan.puri15@gmail.com>
Contributiobs-by: Andrew Barnes <barnes333@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #173
2012-03-22 13:03:47 -07:00
Brian Behlendorf 4b787d75c8 Cleanly support debug packages
Allow a source rpm to be rebuilt with debugging enabled.  This
avoids the need to have to manually modify the spec file.  By
default debugging is still largely disabled.  To enable specific
debugging features use the following options with rpmbuild.

  '--with debug'               - Enables ASSERTs

  # For example:
  $ rpmbuild --rebuild --with debug zfs-modules-0.6.0-rc6.src.rpm

Additionally, ZFS_CONFIG has been added to zfs_config.h for
packages which build against these headers.  This is critical
to ensure both zfs and the dependant package are using the same
prototype and structure definitions.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-02-27 14:08:17 -08:00
Etienne Dechamps 30930fba21 Add support for DISCARD to ZVOLs.
DISCARD (REQ_DISCARD, BLKDISCARD) is useful for thin provisioning.
It allows ZVOL clients to discard (unmap, trim) block ranges from
a ZVOL, thus optimizing disk space usage by allowing a ZVOL to
shrink instead of just grow.

We can't use zfs_space() or zfs_freesp() here, since these functions
only work on regular files, not volumes. Fortunately we can use the
low-level function dmu_free_long_range() which does exactly what we
want.

Currently the discard operation is not added to the log. That's not
a big deal since losing discard requests cannot result in data
corruption. It would however result in disk space usage higher than
it should be. Thus adding log support to zvol_discard() is probably
a good idea for a future improvement.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-02-09 16:19:38 -08:00
Etienne Dechamps cb2d19010d Support the fallocate() file operation.
Currently only the (FALLOC_FL_PUNCH_HOLE) flag combination is
supported, since it's the only one that matches the behavior of
zfs_space(). This makes it pretty much useless in its current
form, but it's a start.

To support other flag combinations we would need to modify
zfs_space() to make it more flexible, or emulate the desired
functionality in zpl_fallocate().

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #334
2012-02-09 16:19:32 -08:00
Etienne Dechamps 34037afe24 Improve ZVOL queue behavior.
The Linux block device queue subsystem exposes a number of configurable
settings described in Linux block/blk-settings.c. The defaults for these
settings are tuned for hard drives, and are not optimized for ZVOLs. Proper
configuration of these options would allow upper layers (I/O scheduler) to
take better decisions about write merging and ordering.

Detailed rationale:

 - max_hw_sectors is set to unlimited (UINT_MAX). zvol_write() is able to
   handle writes of any size, so there's no reason to impose a limit. Let the
   upper layer decide.

 - max_segments and max_segment_size are set to unlimited. zvol_write() will
   copy the requests' contents into a dbuf anyway, so the number and size of
   the segments are irrelevant. Let the upper layer decide.

 - physical_block_size and io_opt are set to the ZVOL's block size. This
   has the potential to somewhat alleviate issue #361 for ZVOLs, by warning
   the upper layers that writes smaller than the volume's block size will be
   slow.

 - The NONROT flag is set to indicate this isn't a rotational device.
   Although the backing zpool might be composed of rotational devices, the
   resulting ZVOL often doesn't exhibit the same behavior due to the COW
   mechanisms used by ZFS. Setting this flag will prevent upper layers from
   making useless decisions (such as reordering writes) based on incorrect
   assumptions about the behavior of the ZVOL.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-02-07 16:23:06 -08:00
Etienne Dechamps b18019d2d8 Fix synchronicity for ZVOLs.
zvol_write() assumes that the write request must be written to stable storage
if rq_is_sync() is true. Unfortunately, this assumption is incorrect. Indeed,
"sync" does *not* mean what we think it means in the context of the Linux
block layer. This is well explained in linux/fs.h:

    WRITE:       A normal async write. Device will be plugged.
    WRITE_SYNC:  Synchronous write. Identical to WRITE, but passes down
                 the hint that someone will be waiting on this IO
                 shortly.
    WRITE_FLUSH: Like WRITE_SYNC but with preceding cache flush.
    WRITE_FUA:   Like WRITE_SYNC but data is guaranteed to be on
                 non-volatile media on completion.

In other words, SYNC does not *mean* that the write must be on stable storage
on completion. It just means that someone is waiting on us to complete the
write request. Thus triggering a ZIL commit for each SYNC write request on a
ZVOL is unnecessary and harmful for performance. To make matters worse, ZVOL
users have no way to express that they actually want data to be written to
stable storage, which means the ZIL is broken for ZVOLs.

The request for stable storage is expressed by the FUA flag, so we must
commit the ZIL after the write if the FUA flag is set. In addition, we must
commit the ZIL before the write if the FLUSH flag is set.

Also, we must inform the block layer that we actually support FLUSH and FUA.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-02-07 16:23:06 -08:00
Brian Behlendorf 47621f3d76 Linux 3.3 compat, sops->show_options()
The second argument of sops->show_options() was changed from a
'struct vfsmount *' to a 'struct dentry *'.  Add an autoconf check
to detect the API change and then conditionally define the expected
interface.  In either case we are only interested in the zfs_sb_t.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #549
2012-02-03 10:02:01 -08:00
Brian Behlendorf ab26409db7 Linux 3.1 compat, super_block->s_shrink
The Linux 3.1 kernel has introduced the concept of per-filesystem
shrinkers which are directly assoicated with a super block.  Prior
to this change there was one shared global shrinker.

The zfs code relied on being able to call the global shrinker when
the arc_meta_limit was exceeded.  This would cause the VFS to drop
references on a fraction of the dentries in the dcache.  The ARC
could then safely reclaim the memory used by these entries and
honor the arc_meta_limit.  Unfortunately, when per-filesystem
shrinkers were added the old interfaces were made unavailable.

This change adds support to use the new per-filesystem shrinker
interface so we can continue to honor the arc_meta_limit.  The
major benefit of the new interface is that we can now target
only the zfs filesystem for dentry and inode pruning.  Thus we
can minimize any impact on the caching of other filesystems.

In the context of making this change several other important
issues related to managing the ARC were addressed, they include:

* The dnlc_reduce_cache() function which was called by the ARC
to drop dentries for the Posix layer was replaced with a generic
zfs_prune_t callback.  The ZPL layer now registers a callback to
drop these dentries removing a layering violation which dates
back to the Solaris code.  This callback can also be used by
other ARC consumers such as Lustre.

  arc_add_prune_callback()
  arc_remove_prune_callback()

* The arc_reduce_dnlc_percent module option has been changed to
arc_meta_prune for clarity.  The dnlc functions are specific to
Solaris's VFS and have already been largely eliminated already.
The replacement tunable now represents the number of bytes the
prune callback will request when invoked.

* Less aggressively invoke the prune callback.  We used to call
this whenever we exceeded the arc_meta_limit however that's not
strictly correct since it results in over zeleous reclaim of
dentries and inodes.  It is now only called once the arc_meta_limit
is exceeded and every effort has been made to evict other data from
the ARC cache.

* More promptly manage exceeding the arc_meta_limit.  When reading
meta data in to the cache if a buffer was unable to be recycled
notify the arc_reclaim thread to invoke the required prune.

* Added arcstat_prune kstat which is incremented when the ARC
is forced to request that a consumer prune its cache.  Remember
this will only occur when the ARC has no other choice.  If it
can evict buffers safely without invoking the prune callback
it will.

* This change is also expected to resolve the unexpect collapses
of the ARC cache.  This would occur because when exceeded just the
arc_meta_limit reclaim presure would be excerted on the arc_c
value via arc_shrink().  This effectively shrunk the entire cache
when really we just needed to reclaim meta data.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #466
Closes #292
2012-01-11 11:46:02 -08:00
Darik Horn 28eb9213d8 Linux 3.2 compat: set_nlink()
Directly changing inode->i_nlink is deprecated in Linux 3.2 by commit

  SHA: bfe8684869601dacfcb2cd69ef8cfd9045f62170

Use the new set_nlink() kernel function instead.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes: #462
2011-12-16 20:02:52 -08:00
Prakash Surya 6ba3b44614 Add make rule for building Arch Linux packages
Added the necessary build infrastructure for building packages
compatible with the Arch Linux distribution. As such, one can now run:

    $ ./configure
    $ make pkg     # Alternatively, one can run 'make arch' as well

on the Arch Linux machine to create two binary packages compatible with
the pacman package manager, one for the zfs userland utilities and
another for the zfs kernel modules. The new packages can then be
installed by running:

    # pacman -U $package.pkg.tar.xz

In addition, source-only packages suitable for an Arch Linux chroot
environment or remote builder can also be build using the 'sarch' make
rule.

NOTE: Since the source dist tarball is created on the fly from the head
of the build tree, it's MD5 hash signature will be continually influx.
As a result, the md5sum variable was intentionally omitted from the
PKGBUILD files, and the '--skipinteg' makepkg option is used. This may
or may not have any serious security implications, as the source tarball
is not being downloaded from an outside source.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #491
2011-12-14 19:14:23 -08:00
Darik Horn 660cbada0f Quote variables in the zfs.lsb script.
For consistency and safety, quote all variables in the zfs.lsb script.
This protects in the unlikely case that any of the file names contain
whitespace.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #439
2011-12-05 09:51:55 -08:00
Darik Horn c2d9c41d50 Source /etc/default/zfs after setting defaults.
Let the administrator override all script variables by sourcing the
/etc/default/zfs file after the default values are set.

The spelling mistake in the old path name makes it unlikely that this
bug affected any users.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes: #371
2011-12-05 09:51:20 -08:00
Brian Behlendorf 5547c2f1bf Simplify BDI integration
Update the code to use the bdi_setup_and_register() helper to
simplify the bdi integration code.  The updated code now just
registers the bdi during mount and destroys it during unmount.

The only complication is that for 2.6.32 - 2.6.33 kernels the
helper wasn't available so in these cases the zfs code must
provide it.  Luckily the bdi_setup_and_register() function
is trivial.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #367
2011-11-08 10:19:03 -08:00
Ned Bass f021fe194f Use automatic variable in Makefile
As written, the $(init_SCRIPTS) rule in etc/init.d/Makefule.am
would not work as expected if the init_SCRIPTS variable were
to contain any elements other than zfs.  Fix this by replacing
the hard-coded 'zfs' reference with $@.

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #410
2011-09-26 09:22:30 -07:00
Brian Behlendorf 1a2e6a635f Fix incorrect zpool_cache substitution
This regression was accidentally introduced by commit aa2b489.
I was attempting to simplify the init scripts and accidentally
confused the /etc/init.d and /etc/zfs paths.  This change reverts
the init script modifications.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #370
2011-08-22 16:01:59 -07:00
Brian Behlendorf aa2b4896c9 Fix autoconf variable substitution in init scripts.
Change the variable substitution in the init script templates
according to the method described in the Autoconf manual;
Chapter 4.7.2: Installation Directory Variables.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-08-19 16:26:14 -07:00
Brian Behlendorf de0a1c099b Autogen refresh for udev changes
Run autogen.sh using the same autotools versions as upstream:

 * autoconf-2.63
 * automake-1.11.1
 * libtool-2.2.6b
2011-08-08 16:30:27 -07:00
Kyle Fuller 12d06bac9b Move udev rules from /etc/udev to /lib/udev
This change moves the default install location for the zfs udev
rules from /etc/udev/ to /lib/udev/.  The correct convention is
for rules provided by a package to be installed in /lib/udev/.
The /etc/udev/ directory is reserved for custom rules or local
overrides.

Additionally, this patch cleans up some abuse of the bindir install
location by adding a udevdir and udevruledir install directories.
This allows us to revert to the default bin install location.  The
udev install directories can be set with the following new options.

  --with-udevdir=DIR      install udev helpers [EPREFIX/lib/udev]
  --with-udevruledir=DIR  install udev rules [UDEVDIR/rules.d]

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #356
2011-08-08 16:21:10 -07:00
Brian Behlendorf 76659dc110 Add backing_device_info per-filesystem
For a long time now the kernel has been moving away from using the
pdflush daemon to write 'old' dirty pages to disk.  The primary reason
for this is because the pdflush daemon is single threaded and can be
a limiting factor for performance.  Since pdflush sequentially walks
the dirty inode list for each super block any delay in processing can
slow down dirty page writeback for all filesystems.

The replacement for pdflush is called bdi (backing device info).  The
bdi system involves creating a per-filesystem control structure each
with its own private sets of queues to manage writeback.  The advantage
is greater parallelism which improves performance and prevents a single
filesystem from slowing writeback to the others.

For a long time both systems co-existed in the kernel so it wasn't
strictly required to implement the bdi scheme.  However, as of
Linux 2.6.36 kernels the pdflush functionality has been retired.

Since ZFS already bypasses the page cache for most I/O this is only
an issue for mmap(2) writes which must go through the page cache.
Even then adding this missing support for newer kernels was overlooked
because there are other mechanisms which can trigger writeback.

However, there is one critical case where not implementing the bdi
functionality can cause problems.  If an application handles a page
fault it can enter the balance_dirty_pages() callpath.  This will
result in the application hanging until the number of dirty pages in
the system drops below the dirty ratio.

Without a registered backing_device_info for the filesystem the
dirty pages will not get written out.  Thus the application will hang.
As mentioned above this was less of an issue with older kernels because
pdflush would eventually write out the dirty pages.

This change adds a backing_device_info structure to the zfs_sb_t
which is already allocated per-super block.  It is then registered
when the filesystem mounted and unregistered on unmount.  It will
not be registered for mounted snapshots which are read-only.  This
change will result in flush-<pool> thread being dynamically created
and destroyed per-mounted filesystem for writeback.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #174
2011-08-04 13:37:38 -07:00
Brian Behlendorf bfb73f9277 Add .gitignore for zfs.<distro> init scripts
Treat the automatically generated zfs.<distro> init scripts
as build products by adding them to a directory specific
.gitignore file.
2011-08-01 10:27:54 -07:00
Kyle Fuller 5faa9c0367 Turn the init.d scripts into autoconf config files
This change ensures the paths used by the provided init scripts
always reference the prefixes provided at configure time.  The
@sbindir@ and @sysconfdir@ prefixes will be correctly replaced
at build time.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #336
2011-08-01 09:54:44 -07:00
Kyle Fuller 615ab66d18 Provide a rc.d script for archlinux
Unlike most other Linux distributions archlinux installs its
init scripts in /etc/rc.d insead of /etc/init.d.  This commit
provides an archlinux rc.d script for zfs and extends the
build infrastructure to ensure it get's installed in the
correct place.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #322
2011-07-11 14:12:23 -07:00
Fajar A. Nugraha 3af2ce4d68 Check for "udevadm settle" vs "udevsettle"
RHEL5 does not have udevadm, so fix initscript accordingly

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #315
2011-07-08 11:43:16 -07:00
Gunnar Beutner 8b0cf399ff Updated init scripts to enable automatic sharing of ZFS datasets.
The relevant init scripts were updated so as to automatically share
ZFS datasets using "zfs share -a" at boot time.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-07-06 09:20:28 -07:00
Zachary Bedell e93ced4847 Update zfs.gentoo/zfs.lsb init script
* Update paths to zpool/zfs tools,
* Log less for non-error conditions,
* Don't be fatal if umount fails at shutdown -- final init remount
  will take care of it if /usr or / are in use

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-07-06 09:20:14 -07:00
Gunnar Beutner c8082367cf Removed erroneous backticks in the zfs.lunar init script.
The backticks would cause the output of the zfs commands
to be evaluated as input for the if construct rather than
their exit status.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-07-05 11:25:48 -07:00
Gunnar Beutner 0f4524cca4 Fixed indentation in the zfs.lunar init script.
One of the blocks in the init script wasn't indented properly.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-07-05 11:25:48 -07:00
Andrew Tselischev b59322a0d8 Fix 'rc_parallel="YES"' error
If rc_parallel="YES" zfs starts before localmount, which leads
to "No such file or directory" error on systems with /usr on a
separate partition.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-07-04 13:54:59 -07:00
Brian Behlendorf 2cf7f52bc4 Linux compat 2.6.39: mount_nodev()
The .get_sb callback has been replaced by a .mount callback
in the file_system_type structure.  When using the new
interface the caller must now use the mount_nodev() helper.

Unfortunately, the new interface no longer passes the vfsmount
down to the zfs layers.  This poses a problem for the existing
implementation because we currently save this pointer in the
super block for latter use.  It provides our only entry point
in to the namespace layer for manipulating certain mount options.

This needed to be done originally to allow commands like
'zfs set atime=off tank' to work properly.  It also allowed me
to keep more of the original Solaris code unmodified.  Under
Solaris there is a 1-to-1 mapping between a mount point and a
file system so this is a fairly natural thing to do.  However,
under Linux they many be multiple entries in the namespace
which reference the same filesystem.  Thus keeping a back
reference from the filesystem to the namespace is complicated.

Rather than introduce some ugly hack to get the vfsmount and
continue as before.  I'm leveraging this API change to update
the ZFS code to do things in a more natural way for Linux.
This has the upside that is resolves the compatibility issue
for the long term and fixes several other minor bugs which
have been reported.

This commit updates the code to remove this vfsmount back
reference entirely.  All modifications to filesystem mount
options are now passed in to the kernel via a '-o remount'.
This is the expected Linux mechanism and allows the namespace
to properly handle any options which apply to it before passing
them on to the file system itself.

Aside from fixing the compatibility issue, removing the
vfsmount has had the benefit of simplifying the code.  This
change which fairly involved has turned out nicely.

Closes #246
Closes #217
Closes #187
Closes #248
Closes #231
2011-07-01 13:36:39 -07:00
Brian Behlendorf 5c03efc379 Linux compat 2.6.39: security_inode_init_security()
The security_inode_init_security() function now takes an additional
qstr argument which must be passed in from the dentry if available.
Passing a NULL is safe when no qstr is available the relevant
security checks will just be skipped.

Closes #246
Closes #217
Closes #187
2011-07-01 12:40:08 -07:00
Brian Behlendorf 2a005961a4 Ensure all block devices are available
These days most disk drivers will probe for devices asynchronously.
This means it's possible that when you zfs init script runs all the
required block devices may not yet have been discovered.  The result
is the pool may fail to cleanly import at boot time.  This is
particularly common when you have a large number of devices.

The fix is for the init script to block until udev settles and we
are no longer detecting new devices.  Once the system has settled
the zfs modules can be loaded and the pool with be automatically
imported.
2011-06-30 14:45:33 -07:00
Prasad Joshi b312979252 Tear down and flush the mmap region
The inode eviction should unmap the pages associated with the inode.
These pages should also be flushed to disk to avoid the data loss.
Therefore, use truncate_setsize() in evict_inode() to release the
pagecache.

The API truncate_setsize() was added in 2.6.35 kernel. To ensure
compatibility with the old kernel, the patch defines its own
truncate_setsize function.

Signed-off-by: Prasad Joshi <pjoshi@stec-inc.com>
Closes #255
2011-06-27 09:59:19 -07:00
Ned A. Bass 560bcf9d14 Multipath device manageability improvements
Update udev helper scripts to deal with device-mapper devices created
by multipathd.  These enhancements are targeted at a particular
storage network topology under evaluation at LLNL consisting of two
SAS switches providing redundant connectivity between multiple server
nodes and disk enclosures.

The key to making these systems manageable is to create shortnames for
each disk that conveys its physical location in a drawer.  In a
direct-attached topology we infer a disk's enclosure from the PCI bus
number and HBA port number in the by-path name provided by udev.  In a
switched topology, however, multiple drawers are accessed via a single
HBA port.  We therefore resort to assigning drawer identifiers based
on which switch port a drive's enclosure is connected to.  This
information is available from sysfs.

Add options to zpool_layout to generate an /etc/zfs/zdev.conf using
symbolic links in /dev/disk/by-id of the form
<label>-<UUID>-switch-port:<X>-slot:<Y>.  <label> is a string that
depends on the subsystem that created the link and defaults to
"dm-uuid-mpath" (this prefix is used by multipathd).  <UUID> is a
unique identifier for the disk typically obtained from the scsi_id
program, and <X> and <Y> denote the switch port and disk slot numbers,
respectively.

Add a callout script sas_switch_id for use by multipathd to help
create symlinks of the form described above.  Update zpool_id and the
udev zpool rules file to handle both multipath devices and
conventional drives.
2011-06-23 10:46:06 -07:00
Darik Horn b772aedeec Autogen refresh.
Run autogen.sh using the same autotools versions as upstream:

 * autoconf-2.63
 * automake-1.11.1
 * libtool-2.2.6b
2011-06-17 13:24:44 -07:00
Darik Horn b9f27ee765 Fix autoconf variable substitution in udev rules.
Change the variable substitution in the udev rule templates
according to the method described in the Autoconf manual;
Chapter 4.7.2: Installation Directory Variables.

The udev rules are improperly generated if the bindir parameter
overrides the prefix parameter during configure. For example:

  # ./configure --prefix=/usr/local --bindir=/opt/zfs/bin

The udev helper is installed as /opt/zfs/bin/zpool_id, but the
corresponding udev rule has a different path:

  # /usr/local/etc/udev/rules.d/60-zpool.rules
  ENV{DEVTYPE}=="disk", IMPORT{program}="/usr/local/bin/zpool_id -d %p"

The @bindir@ variable expands to "${exec_prefix}/bin", so it cannot
be used instead of @prefix@ directly.

This also applies to the zvol_id helper.

Closes #283.
2011-06-17 10:11:29 -07:00
Brian Behlendorf 2e08aedba4 Always check -Wno-unused-but-set-variable gcc support
The previous commit 8a7e1ceefa wasn't
quite right.  This check applies to both the user and kernel space
build and as such we must make sure it runs regardless of what
the --with-config option is set too.

For example, if --with-config=kernel then the autoconf test does
not run and we generate build warnings when compiling the kernel
packages.
2011-06-14 16:40:35 -07:00
Brian Behlendorf 8a7e1ceefa Check for -Wno-unused-but-set-variable gcc support
Gcc versions 4.3.2 and earlier do not support the compiler flag
-Wno-unused-but-set-variable.  This can lead to build failures
on older Linux platforms such as Debian Lenny.  Since this is
an optional build argument this changes add a new autoconf check
for the option.  If it is supported by the installed version of
gcc then it is used otherwise it is omited.

See commit's 12c1acde76 and
79713039a2 for the reason the
-Wno-unused-but-set-variable options was originally added.
2011-06-14 14:43:22 -07:00