OpenZFS

Infinite Disk's design is file system agnostic: it can be built by combining multiple file system technologies such as LVM, XFS, Bcache and mdadm. However, OpenZFS happens to have most of those technologies in one piece of software, so it is used in the first implementation of Infinite Disk.

When we use the word "ZFS" we mean the original ZFS developed and open sourced by Sun and then continued as OpenZFS, not the closed source ZFS continued by Oracle.

ZFS has been designed to be the last word in file systems and it is amazing!

Good Introductions:

  1. ZFS 101—Understanding ZFS storage and performance | Ars Technica
  2. OpenZFS Documentation — OpenZFS documentation

A good comparison of some distributed file systems is here:

Note the IBM zFS mentioned in the PDF above is NOT the same as the ZFS that OpenZFS is based on.

Making ZFS Distributed

Infinite Disk took ZFS and not only made it into a distributed file system working across thousands of machines, but also made it work over the wide area network.

Lustre and ZFS

Lustre is the dominant file system in supercomputing and large-scale data storage.

It is increasingly using OpenZFS as its backend storage:

High Performance Computing

Besides using ZFS in the backend, Infinite Disk also takes advantage of ZFS features in the frontend and in the intermediate distribution layer, giving Infinite Disk substantial advantages over traditional distributed file systems like Lustre and Spectrum Scale in many high performance computing use cases.

ZFS Send

A popular way of using Infinite Disk is as a replication target based on zfs send.

Because Infinite Disk uses OpenZFS Native Encryption, every zfs send command MUST include the -w option to enable raw send so that encryption is preserved.
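As an illustration of the -w requirement, here is a minimal sketch of a raw incremental replication pipeline. It only prints the command instead of running it. The snapshot names, the target host and the target dataset are hypothetical placeholders; the actual Infinite Disk endpoint and dataset layout are not specified here.

```shell
#!/bin/sh
# Build (but do not run) a raw incremental zfs send pipeline.
# Every name passed in is a hypothetical placeholder.
build_raw_send() {
    prev_snap=$1     # older snapshot, e.g. tank/data@monday
    curr_snap=$2     # newer snapshot, e.g. tank/data@tuesday
    target_host=$3   # replication target host (assumption)
    target_ds=$4     # dataset on the target (assumption)
    # send -w    : raw send, blocks stay encrypted as stored on disk
    # send -i    : incremental stream from prev_snap to curr_snap
    # receive -u : do not mount the received dataset on the target
    printf 'zfs send -w -i %s %s | ssh %s zfs receive -u %s\n' \
        "$prev_snap" "$curr_snap" "$target_host" "$target_ds"
}

build_raw_send tank/data@monday tank/data@tuesday backup.example.com pool/replica/data
```

Printing the pipeline first makes it easy to confirm that -w is present before anything is sent.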

OpenZFS for Geeks

OpenZFS can be tuned easily for almost every application.

1. On Disk Format

  1. Zfs_ondiskformat.pdf (495.5 KB)
  2. Chris's Wiki :: blog/solaris/ZFSBlockPointers
  3. Chris's Wiki :: blog/solaris/ZFSLogicalVsPhysicalBlockSizes

2. Reliability

  1. https://www.usenix.org/legacy/events/fast10/tech/full_papers/zhang.pdf

3. RAID

  1. draid Declustered RAID for ZFS Installation and Configuration Guide High Performance Data Division.pdf (1.8 MB)

4. Recovery

If you cannot import a zpool, the following commands might help (replace yourpool with the name of your pool):

zpool import -FfmX yourpool
zpool clear yourpool

If the above does not work, you can try the really hardcore -T option and roll back to a good txg (transaction group). This is not for the faint-hearted:

Find a good txg with

zpool history -il
or
zdb

Roll back to the good txg

zpool import -f -F -T <txg> yourpool

Mount it and dd some zeros to clear out the more recent bad txg.

There is a script called zfs_revert-0.1.py which follows a similar recovery approach. The comment at the end of that page is especially good at showing how to find a "good" txg.
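To shortlist rollback candidates, the transaction group numbers can be pulled out of the history text. The sketch below assumes internal events are tagged in the form [txg:NUMBER], which may differ between OpenZFS versions; the function name list_txgs and the sample lines are hypothetical.

```shell
#!/bin/sh
# Read `zpool history -il` style text on stdin and print the distinct
# transaction group numbers it mentions, newest (largest) first.
# Assumes events carry a tag of the form txg:NUMBER.
list_txgs() {
    grep -oE 'txg:[0-9]+' | cut -d: -f2 | sort -rn | uniq
}

# Hypothetical sample lines standing in for real history output.
printf '%s\n' \
    '2024-01-01.10:00:00 [txg:100] snapshot tank/data@a' \
    '2024-01-01.10:05:00 [txg:160] destroy tank/data@a' \
    '2024-01-01.10:05:01 [txg:160] free tank/data@a' \
  | list_txgs
```

On a real system the input would come from zpool history -il yourpool. Which of the listed txgs is actually "good" still has to be judged by hand.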

ZFS Metadata

1. Metadata Size OUTSIDE of Pool

zpool list -v yourpool

2. Metadata Size INSIDE of Pool

zdb -PLbbbs yourpool | tee ~/yourpool_metadata_output.txt 

Sum the relevant sections in the ASIZE column

(
cat ~/yourpool_metadata_output.txt \
| grep -B 9999 'L1 Total' \
| grep -A 9999 'ASIZE' \
| grep -v \
 -e 'L1 object array' -e 'L0 object array' \
 -e 'L1 bpobj' -e 'L0 bpobj' \
 -e 'L2 SPA space map' -e 'L1 SPA space map' -e 'L0 SPA space map' \
 -e 'L5 DMU dnode' -e 'L4 DMU dnode' -e 'L3 DMU dnode' -e 'L2 DMU dnode' -e 'L1 DMU dnode' -e 'L0 DMU dnode' \
 -e 'L0 ZFS plain file' -e 'ZFS plain file' \
 -e 'L2 ZFS directory' -e 'L1 ZFS directory' -e 'L0 ZFS directory' \
 -e 'L3 zvol object' -e 'L2 zvol object' -e 'L1 zvol object' -e 'L0 zvol object' \
 -e 'L1 SPA history' -e 'L0 SPA history' \
 -e 'L1 deferred free' -e 'L0 deferred free' \
| awk \
 '{sum+=$4} \
 END {printf "\nTotal Metadata\n %.0f Bytes\n" " %.2f GiB\n",sum,sum/1073741824}' \
)

Sample output:

Total Metadata
 57844416512 Bytes
 53.87 GiB
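The GiB figure is the byte total divided by 1073741824 (2^30). A small sketch that repeats this conversion for the sample value above; the helper name bytes_to_gib is hypothetical.

```shell
#!/bin/sh
# Convert a byte count to GiB with two decimals, using the same divisor
# (1073741824 = 2^30) as the awk step that prints the metadata total.
bytes_to_gib() {
    awk -v b="$1" 'BEGIN { printf "%.2f\n", b / 1073741824 }'
}

bytes_to_gib 57844416512   # prints 53.87
```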

Reference:

3. Metadata Device Tips

ZFS Management

Infinite Disk comes with its own Zabbix-based management console.

Other ZFS-related management tools:

  1. GitHub - AnalogJ/scrutiny: Hard Drive S.M.A.R.T Monitoring, Historical Trends & Real World Failure Thresholds
  2. GitHub - chadmiller/zpool-iostat-viz: "zpool iostats" for humans; find the slow parts of your ZFS pool

L2ARC

The original comment in the Sun ZFS source file usr/src/uts/common/fs/zfs/arc.c explains how the L2ARC works:

/*
 * Level 2 ARC
 *
 * The level 2 ARC (L2ARC) is a cache layer in-between main memory and disk.
 * It uses dedicated storage devices to hold cached data, which are populated
 * using large infrequent writes.  The main role of this cache is to boost
 * the performance of random read workloads.  The intended L2ARC devices
 * include short-stroked disks, solid state disks, and other media with
 * substantially faster read latency than disk.
 *
 *                 +-----------------------+
 *                 |         ARC           |
 *                 +-----------------------+
 *                    |         ^     ^
 *                    |         |     |
 *      l2arc_feed_thread()    arc_read()
 *                    |         |     |
 *                    |  l2arc read   |
 *                    V         |     |
 *               +---------------+    |
 *               |     L2ARC     |    | 
 *               +---------------+    |
 *                   |    ^           |
 *          l2arc_write() |           |
 *                   |    |           |
 *                   V    |           |
 *                 +-------+      +-------+
 *                 | vdev  |      | vdev  |
 *                 | cache |      | cache |
 *                 +-------+      +-------+
 *                 +=========+     .-----.
 *                 :  L2ARC  :    |-_____-|
 *                 : devices :    | Disks |
 *                 +=========+    `-_____-'
 *
 * Read requests are satisfied from the following sources, in order:
 *
 *      1) ARC
 *      2) vdev cache of L2ARC devices
 *      3) L2ARC devices
 *      4) vdev cache of disks
 *      5) disks
 *
 * Some L2ARC device types exhibit extremely slow write performance.
 * To accommodate for this there are some significant differences between
 * the L2ARC and traditional cache design:
 *
 * 1. There is no eviction path from the ARC to the L2ARC.  Evictions from
 * the ARC behave as usual, freeing buffers and placing headers on ghost
 * lists.  The ARC does not send buffers to the L2ARC during eviction as
 * this would add inflated write latencies for all ARC memory pressure.
 *
 * 2. The L2ARC attempts to cache data from the ARC before it is evicted.
 * It does this by periodically scanning buffers from the eviction-end of
 * the MFU and MRU ARC lists, copying them to the L2ARC devices if they are
 * not already there.  It scans until a headroom of buffers is satisfied,
 * which itself is a buffer for ARC eviction.  The thread that does this is
 * l2arc_feed_thread(), illustrated below; example sizes are included to
 * provide a better sense of ratio than this diagram:
 *
 *             head -->                        tail
 *              +---------------------+----------+
 *      ARC_mfu |:::::#:::::::::::::::|o#o###o###|-->.   # already on L2ARC
 *              +---------------------+----------+   |   o L2ARC eligible
 *      ARC_mru |:#:::::::::::::::::::|#o#ooo####|-->|   : ARC buffer
 *              +---------------------+----------+   |
 *                   15.9 Gbytes      ^ 32 Mbytes    |
 *                                 headroom          |
 *                                            l2arc_feed_thread()
 *                                                   |
 *                       l2arc write hand <--[oooo]--'
 *                               |           8 Mbyte
 *                               |          write max
 *                               V
 *                +==============================+
 *      L2ARC dev |####|#|###|###|    |####| ... |
 *                +==============================+
 *                           32 Gbytes
 *
 * 3. If an ARC buffer is copied to the L2ARC but then hit instead of
 * evicted, then the L2ARC has cached a buffer much sooner than it probably
 * needed to, potentially wasting L2ARC device bandwidth and storage.  It is
 * safe to say that this is an uncommon case, since buffers at the end of
 * the ARC lists have moved there due to inactivity.
 *
 * 4. If the ARC evicts faster than the L2ARC can maintain a headroom,
 * then the L2ARC simply misses copying some buffers.  This serves as a
 * pressure valve to prevent heavy read workloads from both stalling the ARC
 * with waits and clogging the L2ARC with writes.  This also helps prevent
 * the potential for the L2ARC to churn if it attempts to cache content too
 * quickly, such as during backups of the entire pool.
 *
 * 5. After system boot and before the ARC has filled main memory, there are
 * no evictions from the ARC and so the tails of the ARC_mfu and ARC_mru
 * lists can remain mostly static.  Instead of searching from tail of these
 * lists as pictured, the l2arc_feed_thread() will search from the list heads
 * for eligible buffers, greatly increasing its chance of finding them.
 *
 * The L2ARC device write speed is also boosted during this time so that
 * the L2ARC warms up faster.  Since there have been no ARC evictions yet,
 * there are no L2ARC reads, and no fear of degrading read performance
 * through increased writes.
 *
 * 6. Writes to the L2ARC devices are grouped and sent in-sequence, so that
 * the vdev queue can aggregate them into larger and fewer writes.  Each
 * device is written to in a rotor fashion, sweeping writes through
 * available space then repeating.
 *
 * 7. The L2ARC does not store dirty content.  It never needs to flush
 * write buffers back to disk based storage.
 *
 * 8. If an ARC buffer is written (and dirtied) which also exists in the
 * L2ARC, the now stale L2ARC buffer is immediately dropped.
 *
 * The performance of the L2ARC can be tweaked by a number of tunables, which
 * may be necessary for different workloads:
 *
 *      l2arc_write_max         max write bytes per interval
 *      l2arc_write_boost       extra write bytes during device warmup
 *      l2arc_noprefetch        skip caching prefetched buffers
 *      l2arc_headroom          number of max device writes to precache
 *      l2arc_feed_secs         seconds between L2ARC writing
 *
 * Tunables may be removed or added as future performance improvements are
 * integrated, and also may become zpool properties.
 */
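The tunables listed in the comment can be inspected from the shell. This is a minimal sketch assuming a Linux OpenZFS system, where module parameters are normally exposed under /sys/module/zfs/parameters; other platforms expose them differently. Both function names are hypothetical. The second helper applies the relationship stated in the comment, where l2arc_headroom is a multiple of l2arc_write_max, to reproduce the 32 Mbyte headroom shown in the diagram for an 8 Mbyte write max.

```shell
#!/bin/sh
# Print the L2ARC tunables named in the arc.c comment.
# The default directory is the usual Linux location (assumption);
# pass another directory as $1 to override it.
show_l2arc_tunables() {
    param_dir=${1:-/sys/module/zfs/parameters}
    for name in l2arc_write_max l2arc_write_boost l2arc_noprefetch \
                l2arc_headroom l2arc_feed_secs; do
        if [ -r "$param_dir/$name" ]; then
            printf '%s=%s\n' "$name" "$(cat "$param_dir/$name")"
        else
            printf '%s=unavailable\n' "$name"
        fi
    done
}

# Headroom in bytes: l2arc_headroom ($1) times l2arc_write_max ($2).
l2arc_headroom_bytes() {
    echo $(( $1 * $2 ))
}

l2arc_headroom_bytes 4 8388608   # prints 33554432, i.e. 32 MiB
```

Calling show_l2arc_tunables with no argument reads the live values. A tunable that this OpenZFS version does not expose is reported as unavailable rather than causing an error.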