**David Cross** @david@mastodon.crossfamilyweb.com · 2023-03-07T01:32:08Z

OK, #freebsd and #zfs Mastodon people, answer me this. I have a ZFS filesystem on machine A, and I zfs-send it to machine B (zfs send -R on A, zfs receive -Fv on B). Machine A has 4T of space *total*. Machine B has 8T of space. When done, machine B has only 250G of free space available, so the filesystem is almost _twice_ as large?

**Mike Gerdts** @mgerdts@fosstodon.org · Mar 07, 2023, 02:00

@david my first guess is that you have a lot of small files and something is causing zfs to insert a lot of padding.

Is ashift the same on both pools (zpool get ashift, I think)? My guess is the source may be 9 (512 byte minimum block size) and the destination is 12 (4k min block).

Is the source not raidz and destination is raidz?

How are you looking at total space? The zpool and zfs commands look at different things.

**David Cross** @david@mastodon.crossfamilyweb.com · Mar 07, 2023, 02:18

@mgerdts Not small files; average file size is close to 1 gig (this is a Postgres database data filesystem). There are 2 recordsizes on it: a 'precopy' snapshot with 128K records, and then I set it to 8k to get better performance and copied everything over. I also changed compression from lz4 to zstd on the new copies. No raid on either, straight concat/stripe (the underlying hardware does all of the redundancy).

**David Cross** @david@mastodon.crossfamilyweb.com · Mar 07, 2023, 02:19

@mgerdts I am looking at it via 'zfs list' and 'df'; both show compatible information. The main difference seems to be in the refer (I am redoing the receive right now, so I am going from memory): it appears that the receive has multiple full copies.

And the zfs-receive output seems to corroborate that by saying it has multiple 'full' streams ... maybe? In ~7 more hours the receive will be finished.

**Mike Gerdts** @mgerdts@fosstodon.org · Mar 07, 2023, 02:24

@david maybe the copies or compressratio properties on each dataset will offer clues.

**David Cross** @david@mastodon.crossfamilyweb.com · Mar 07, 2023, 18:42

@mgerdts ok, so compression ratios are different, by about a factor of 2x.. which explains it. But why? I looked at ashift (zpool property) on both and they are zero on both. Both are zstd (which I additionally forced with a -o on zfs-receive, since ONE of the original ones was lz4.. but even if that was a degenerate compression case in converting lz4 to zstd, it doesn't explain nearly enough of the difference)

**David Cross** @david@mastodon.crossfamilyweb.com · Mar 07, 2023, 18:53

@mgerdts I did see that checksums are "on" on the source and "skein" on the destination, but on 4T of 8k pages that's just 16G or 32G of additional space in total (depending on 256 bits or 512 bits of hash size, and that's worst case since it doesn't account for fletcher4 already being 128 bits).
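
The estimate in this post can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: it takes the post's premise at face value (one extra hash per 8k record), reads "4T" as 4 TiB, and does not model how ZFS actually stores checksums. The function name `checksum_overhead_bytes` is made up for illustration.

```python
TIB = 1 << 40
GIB = 1 << 30

def checksum_overhead_bytes(data_bytes: int, record_bytes: int, checksum_bits: int) -> int:
    """Extra bytes if every record carried one additional checksum of the given width."""
    blocks = data_bytes // record_bytes      # number of records in the dataset
    return blocks * (checksum_bits // 8)     # one hash per record

# Assumption: 4 TiB of data in 8 KiB records, as described in the thread.
for bits in (256, 512):
    overhead = checksum_overhead_bytes(4 * TIB, 8192, bits)
    print(f"{bits}-bit hash: {overhead / GIB:.0f} GiB")
```

Either way the result is tens of gigabytes, far too small to account for terabytes of extra usage.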

**Mike Gerdts** @mgerdts@fosstodon.org · Mar 07, 2023, 19:33

@david I'm not sure what to make of ashift=0: that's surely not the real value of ashift. Based on https://openzfs.github.io/openzfs-docs/man/7/zpoolprops.7.html?highlight=ashift saying that ashift can be changed, there have been changes in this area since I last used zfs a lot.

If ashift is the same between the two pools, that points us back to the question of whether you are using raidz or draid and if so, do both pools have the same number of disks per raidz vdev?

**David Cross** @david@mastodon.crossfamilyweb.com · Mar 07, 2023, 19:34

@mgerdts no raidz at all, simple stripe/concat. 4x1T on machine A, 2x4T on machine B.

**David Cross** @david@mastodon.crossfamilyweb.com · Mar 07, 2023, 19:39

@mgerdts AH-HAH... googling indicates that I need to use zdb vs zpool to get ashift values... and.. there we are. ashift of 12 on the new devices and 9 on the old. I think we have the smoking gun... once I was actually looking in the right place. Thanks!

**David Cross** @david@mastodon.crossfamilyweb.com · Mar 07, 2023, 21:30

@mgerdts Now all I have to do is kill the pool... and restore.. again.. the ... 5th? time is the charm?

**Mike Gerdts** @mgerdts@fosstodon.org · Mar 07, 2023, 02:58

@david 8k recordsize + compression could lead to a poor interaction with ashift=12 as well. Suppose an 8k block would compress to 4200 bytes. With ashift=12, the compressed 8k block will consume 2 x 4k sectors (8k total). With ashift=9, the compressed 8k block will consume 9 x 512b sectors (4.5k total).
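
The sector rounding in this example can be sketched in Python. This is a simplified model, assuming a compressed block is rounded up to a whole number of 2^ashift-byte sectors and ignoring metadata and any other allocator details; `allocated_bytes` is a hypothetical helper, not a ZFS function.

```python
def allocated_bytes(compressed_bytes: int, ashift: int) -> int:
    """Round a compressed block up to whole sectors of size 2**ashift."""
    sector = 1 << ashift
    sectors = -(-compressed_bytes // sector)  # ceiling division
    return sectors * sector

# The 4200-byte example from the post, under both ashift values.
for ashift in (9, 12):
    size = allocated_bytes(4200, ashift)
    print(f"ashift={ashift}: {size} bytes ({size / 1024:.1f}k)")
```

With these inputs the model gives 4608 bytes (4.5k) for ashift=9 and 8192 bytes (8k) for ashift=12, matching the figures above.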

With raidz the overhead varies by number of drives in a raidz vdev. See my explanation here:

https://github.com/openzfs/zfs/blob/master/lib/libzfs/libzfs_dataset.c#L5340-L5426

**Mike Gerdts** @mgerdts@fosstodon.org · Mar 07, 2023, 03:08

@david re-reading that comment I see that it was updated with draid information, which was integrated after I added this comment with a fix. So things are more complicated with raidz *and* draid. Glad to see the comment update wasn't missed!

**Mike Gerdts** @mgerdts@fosstodon.org · Mar 08, 2023, 05:56

@david while we have concluded raidz is not to blame here, I figured it may be worth mentioning that I did a talk on this work while at #Joyent.

Slides: https://us-east.manta.joyent.com/Joyent_Dev/public/docs/2019-06-RAIDZ_on_small_blocks.pdf
Video: https://youtu.be/sTvVIF5v2dw

Contrary to what I predicted back then, today’s NVMe SSDs pretty much all present as 512n, not as 4Kn.

#zfs

**Daniel Carosone** @uep@infosec.exchange · Mar 08, 2023, 07:58

@mgerdts @david

(probably not because it doesn't show up in usage like zfs list -o space, only at the zpool level, but:)

Is there any chance the source has (or had) dedup and that's not been carried across?

**David Cross** @david@mastodon.crossfamilyweb.com · Mar 08, 2023, 21:53

@uep @mgerdts definitely not on dedupe. the culprit was ashift

**Daniel Carosone** @uep@infosec.exchange · Mar 08, 2023, 21:56

@david @mgerdts

ashift and less-optimal compression packing, yeah. Surprised it was that much but still all too plausible

**David Cross** @david@mastodon.crossfamilyweb.com · Mar 08, 2023, 22:04

@uep @mgerdts I think the key here is that it is a Postgres database store, so the recordsize is 8k to align with the Postgres page size. With ashift=12 (4k sectors), the BEST possible compression is 2x, and anything less than 2x is effectively 1x. That means realized compression has to be in the 1.0 to 2.0 range, whereas on the original I was in the 3.x to 4.x range. Math checks out.
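
A rough way to see this ceiling is to compare the raw compression ratio of an 8k record with the ratio left after rounding up to whole sectors. The sketch below is an illustration under stated assumptions, not ZFS's actual accounting: it assumes a block that does not shrink below the record size after rounding is simply stored at full size, and `effective_ratio` is a hypothetical helper.

```python
RECORD = 8192  # 8k recordsize, matching the Postgres page size in the thread

def effective_ratio(compressed_bytes: int, ashift: int, record: int = RECORD) -> float:
    """Compression ratio after rounding the compressed block up to whole sectors."""
    sector = 1 << ashift
    allocated = -(-compressed_bytes // sector) * sector  # ceiling to sector size
    allocated = min(allocated, record)                   # assumed: never worse than uncompressed
    return record / allocated

# Sweep a few raw compression ratios and compare both sector sizes.
for raw in (1.5, 2.0, 3.0, 4.0):
    compressed = int(RECORD / raw)
    r9 = effective_ratio(compressed, 9)
    r12 = effective_ratio(compressed, 12)
    print(f"raw {raw:.1f}x -> ashift=9: {r9:.2f}x, ashift=12: {r12:.2f}x")
```

In this model an 8k record on ashift=12 can only land on 1.0x or 2.0x, while ashift=9 tracks the raw ratio in 512-byte steps.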

**Daniel Carosone** @uep@infosec.exchange · Mar 09, 2023, 02:45

@david @mgerdts Yeah, the lower bound on useful compression is a common issue, the upper bound in this case is less obvious but nasty

**jade** @leftpaddotpy@hachyderm.io · Mar 08, 2023, 08:15

@mgerdts @david wtf, why??

**bsmaalders** @bsmaalders@mas.to · Mar 07, 2023, 02:10

@david Are the dedup settings the same for both pools?

**David Cross** @david@mastodon.crossfamilyweb.com · Mar 07, 2023, 02:11

@bsmaalders No dedupe on either

**Javier Henderson 🇦🇷** @javierk4jh@social.afront.org · Mar 07, 2023, 03:06

@david is the record size the same? If your source uses larger than 128K records and you didn't use -L it may be using 128K records on the target (they incur larger overhead than larger records do). I saw this when moving a zvol with 1M records to one with 128K records.

**David Cross** @david@mastodon.crossfamilyweb.com · Mar 07, 2023, 18:39

@javierk4jh My understanding from reading zfs-send and zfs-receive and online searches is that you actually cannot change recordsize that way, as the stream itself is made of deltas.

That is, if the incremental says to "set block 15 to 0xfeedface", then the receiver doesn't have the context of the rest of the block to fill in.

Granted, this is a solvable problem (just read the original and write out the whole), but they opted to not have that complexity.

I did check anyway, and recordsizes look good

**Antranig Vartanian** @antranigv@sigin.fo · Mar 07, 2023, 07:43

@david what about snapshots? Can you check them? `zfs list -t all`, maybe they are taking that much space :)
