OpenZFS

Infinite Disk's design is file system agnostic: it could be built by combining multiple file system technologies such as LVM, XFS, Bcache, mdadm, etc. However, OpenZFS happens to provide most of those technologies in one piece of software, so it is used in the first implementation of Infinite Disk.

When we use the word "ZFS", we mean the original ZFS developed and open sourced by Sun and then continued as OpenZFS, not the closed-source ZFS continued by Oracle.

ZFS has been designed to be the last word in file systems and it is amazing!

Good Introductions:

  1. ZFS 101—Understanding ZFS storage and performance | Ars Technica
  2. OpenZFS Documentation — OpenZFS documentation

A good comparison of some distributed file systems is here:

Note that the IBM zFS mentioned in the PDF above is NOT the same as the ZFS that OpenZFS is based on.

Making ZFS Distributed

Infinite Disk took ZFS and not only turned it into a distributed file system working across thousands of machines; it also made ZFS work over the wide area network.

Lustre and ZFS

Lustre is the dominant file system in supercomputing and large-scale data storage.

It is increasingly using OpenZFS as its backend storage:

Infinite Disk uses OpenZFS at a different level, but its highly distributed and parallel nature is also applicable to high-performance computing in some use cases.

ZFS Send

A popular way of using Infinite Disk is as a replication target based on zfs send.

Due to Infinite Disk's use of OpenZFS native encryption, all zfs send commands MUST use the -w option to enable raw sends so that encryption is preserved.
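
For example, a raw replicated send of a snapshot could look like this (the pool, dataset, snapshot, and host names are hypothetical):

# -w sends the blocks raw, so they stay encrypted in transit and on the target
zfs snapshot tank/data@2024-01-01
zfs send -w tank/data@2024-01-01 | ssh backup-host zfs receive -u backup/data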

OpenZFS for Geeks

OpenZFS can be tuned easily for almost every application.
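
As a hedged illustration of the kind of per-dataset tuning that is possible (the pool and dataset names are hypothetical, and sensible values depend entirely on the workload):

# Typical tunables for a database-style workload
zfs set recordsize=16K tank/db     # match the application's I/O size
zfs set compression=lz4 tank/db    # cheap compression, usually a net win
zfs set atime=off tank/db          # avoid metadata writes on every read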

1. On Disk Format

  1. Zfs_ondiskformat.pdf (495.5 KB)
  2. Chris's Wiki :: blog/solaris/ZFSBlockPointers
  3. Chris's Wiki :: blog/solaris/ZFSLogicalVsPhysicalBlockSizes

2. Reliability

  1. https://www.usenix.org/legacy/events/fast10/tech/full_papers/zhang.pdf

3. RAID

  1. draid Declustered RAID for ZFS Installation and Configuration Guide High Performance Data Division.pdf (1.8 MB)

4. Recovery

If you cannot import a zpool, the following commands might help:

zpool import -FfmX yourpool   # -F rewind, -f force, -m ignore missing log devices, -X extreme rewind
zpool clear yourpool

If the above does not work, you can try the really hardcore -T option and roll back to a good txg - this is not for the faint-hearted:

Find a good txg with:

zpool history -il
or
zdb
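
For example (a hedged sketch; the device path below is hypothetical), zdb can list a vdev's labels and uberblocks, and each uberblock carries the txg it belongs to:

# List labels and uberblocks on one of the pool's vdevs; pick the newest txg that still looks consistent
zdb -ul /dev/disk/by-id/your-vdev-part1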

Roll back to the good txg:

zpool import -f -F -T <txg> yourpool

Mount it and dd some zeros to clear out the more recent bad txgs.
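
A minimal sketch of that last step, assuming the rolled-back pool is mounted at /yourpool (path and file name are hypothetical); writing zeros into the free space makes it harder for blocks belonging to the discarded, newer txgs to be picked up again:

# WARNING: destructive; only do this on a pool you are deliberately recovering
dd if=/dev/zero of=/yourpool/zerofill bs=1M    # runs until the pool is full
rm /yourpool/zerofill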

There is a script called zfs_revert-0.1.py which follows a similar recovery approach. The comment at the end of that page is especially good at showing how to find a "good" txg.

ZFS Metadata

1. Metadata Size OUTSIDE of Pool

zpool list -v yourpool

2. Metadata Size INSIDE of Pool

zdb -PLbbbs yourpool | tee ~/yourpool_metadata_output.txt 

Sum the relevant sections in the ASIZE column

(
cat ~/yourpool_metadata_output.txt \
| grep -B 9999 'L1 Total' \
| grep -A 9999 'ASIZE' \
| grep -v \
 -e 'L1 object array' -e 'L0 object array' \
 -e 'L1 bpobj' -e 'L0 bpobj' \
 -e 'L2 SPA space map' -e 'L1 SPA space map' -e 'L0 SPA space map' \
 -e 'L5 DMU dnode' -e 'L4 DMU dnode' -e 'L3 DMU dnode' -e 'L2 DMU dnode' -e 'L1 DMU dnode' -e 'L0 DMU dnode' \
 -e 'L0 ZFS plain file' -e 'ZFS plain file' \
 -e 'L2 ZFS directory' -e 'L1 ZFS directory' -e 'L0 ZFS directory' \
 -e 'L3 zvol object' -e 'L2 zvol object' -e 'L1 zvol object' -e 'L0 zvol object' \
 -e 'L1 SPA history' -e 'L0 SPA history' \
 -e 'L1 deferred free' -e 'L0 deferred free' \
| awk \
 '{sum+=$4} \
 END {printf "\nTotal Metadata\n %.0f Bytes\n" " %.2f GiB\n",sum,sum/1073741824}' \
)

Sample output:

Total Metadata
 57844416512 Bytes
 53.87 GiB

Reference:

3. Metadata Device Tips
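
As a hedged illustration of the topic, adding a mirrored special (metadata) vdev and steering small blocks to it might look like this (device and dataset names are hypothetical):

# Add a mirrored special vdev that will hold the pool's metadata
zpool add yourpool special mirror /dev/disk/by-id/nvme-A /dev/disk/by-id/nvme-B
# Optionally store blocks up to 32K of this dataset on the special vdev as well
zfs set special_small_blocks=32K yourpool/somedataset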

ZFS Management

Infinite Disk comes with its own Zabbix-based management console.

Other ZFS related management tools:

  1. GitHub - AnalogJ/scrutiny: Hard Drive S.M.A.R.T Monitoring, Historical Trends & Real World Failure Thresholds
  2. GitHub - chadmiller/zpool-iostat-viz: "zpool iostats" for humans; find the slow parts of your ZFS pool