OK, Mastodon people, answer me this: I have a ZFS filesystem on machine A, and I zfs-send it to machine B (zfs send -R on one side, zfs receive -Fv on the other). Machine A has 4T of space *total*; machine B has 8T. When the transfer is done, machine B has only 250G of free space left. The filesystem is almost _twice_ as large?

@david my first guess is that you have a lot of small files and something is causing zfs to insert a lot of padding.

Is ashift the same on both pools (zpool get ashift, I think)? My guess is the source may be 9 (512 byte minimum block size) and the destination is 12 (4k min block).

Is the source not raidz and destination is raidz?

How are you looking at total space? The zpool and zfs commands report different things.

@mgerdts Not small files, average file size is close to 1 gig (this is a postgres database data filesystem). There are two recordsizes on it: a 'precopy' snapshot with 128K records, and then I set it to 8k to get better performance and copied everything over. I also switched compression from lz4 to zstd on the new copies. No raid on either, straight concat/stripe (the underlying hardware handles all of the redundancy).

@david 8k recordsize + compression could lead to a poor interaction with ashift=12 as well. Suppose an 8k block would compress to 4200 bytes. With ashift=12, the compressed 8k block will consume 2 x 4k sectors (8k total). With ashift=9, the compressed 8k block will consume 9 x 512b sectors (4.5k total).
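To make that rounding concrete, here's a small Python sketch of the sector math (the function name is mine for illustration, not a ZFS API):

```python
import math

def allocated_bytes(compressed_size, ashift):
    """Bytes actually consumed on disk: the compressed record is
    rounded up to a whole number of 2**ashift-byte sectors."""
    sector = 2 ** ashift
    return math.ceil(compressed_size / sector) * sector

# An 8k record that compresses to 4200 bytes:
print(allocated_bytes(4200, 12))  # 8192 -> two 4k sectors, no space saved
print(allocated_bytes(4200, 9))   # 4608 -> nine 512b sectors, ~45% saved
```

Same data, same compressor, but the ashift=12 pool stores it at full size while the ashift=9 pool gets real savings.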

With raidz the overhead varies by number of drives in a raidz vdev. See my explanation here:

github.com/openzfs/zfs/blob/ma

@david while we have concluded raidz is not to blame here, I figured it may be worth mentioning that I did a talk on this work while at #Joyent.

Slides: us-east.manta.joyent.com/Joyen
Video: youtu.be/sTvVIF5v2dw

Contrary to what I predicted back then, today’s NVMe SSDs pretty much all present as 512n, not as 4Kn.

#zfs

@mgerdts @david

(Probably not, since it doesn't show up in usage like zfs list -o space, only at the zpool level, but:)

Is there any chance the source has (or had) dedup and that's not been carried across?


@uep @mgerdts Definitely no dedup anywhere. The culprit was ashift.


@david @mgerdts

ashift and less-optimal compression packing, yeah. Surprised it was that much, but still all too plausible.

@uep @mgerdts I think the key here is that it's a postgres database store, so the recordsize is 8k to align with the postgres page size. With an ashift of 12 (4k sectors), the best possible compression is 2x, and anything short of 2x rounds up to 1x; realized compression therefore has to land in the 1.0 to 2.0 range, whereas on the original pool I was getting 3.x to 4.x. Math checks out.
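That 1.0-to-2.0 bound can be sketched in a few lines of Python (a simplification that ignores metadata and embedded-block special cases; the function name is mine):

```python
import math

def effective_ratio(recordsize, logical_ratio, ashift):
    """Realized compression ratio after rounding the compressed
    record up to whole 2**ashift-byte sectors."""
    sector = 2 ** ashift
    compressed = recordsize / logical_ratio
    allocated = math.ceil(compressed / sector) * sector
    return recordsize / allocated

# 8k records on ashift=12: only two outcomes are possible
print(effective_ratio(8192, 4.0, 12))  # 2.0 (fits in one 4k sector)
print(effective_ratio(8192, 1.9, 12))  # 1.0 (needs two sectors, no saving)
```

With 128K records on the same pool there are 32 sectors to play with, so a 4x logical ratio can actually be realized, which is why the original filesystem fit in half the space.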

@david @mgerdts Yeah, the lower bound on useful compression is a common issue, the upper bound in this case is less obvious but nasty
