XFS: the filesystem of the future?

Posted Jan 23, 2012 12:44 UTC (Mon) by wdaniels (guest, #80192)
Parent article: XFS: the filesystem of the future?

"It just is not practical to take a petabyte-scale filesystem offline to run a filesystem check and repair tool; that work really needs to be done online in the future."

It's also not practical sometimes to backup/format/restore just to shrink some much smaller volumes. It takes time to copy 6TB of data even over a gigabit link. The last time I tried to use XFS I got caught out because I didn't know that you can't just shrink an XFS filesystem the way you can with ext4.

Has this changed? If not, is it ever likely to?

"So, he asked: why do we still need ext4?"

The problem for me is that choice of filesystem is not _usually_ significant _enough_ for whatever I'm doing to research the differences in very much depth. I used to be more inclined to experiment, but got caught out too many times and ended up losing time on projects because of it.

It's not often very important to be able to easily shrink a filesystem, so long as you know you're not going to be able to do it in advance.

I pretty much know where I am with ext so that is always my first choice until I next encounter some particular requirement for best performance in some way or another.

So that really is the point of ext4 as far as I'm concerned...fewer surprises for people with other priorities. And I think that reasoning holds up well for choosing a default filesystem in distros.

to post comments

XFS: the filesystem of the future?

Posted Jan 23, 2012 17:35 UTC (Mon) by sandeen (guest, #42852) [Link]

IIRC Dave addressed a similar question about shrink in the talk, with something like "datasets generally gets bigger, not smaller" and when pressed, suggested that one should (could?) instead use thin provisioning to manage dynamically changing space requirements.

As you point out, there is value to familiarity, but it's also worth poking at the familiar now and then, to see if that familiarity is enough to warrant automatic selection...

XFS: the filesystem of the future?

Posted Jan 23, 2012 22:41 UTC (Mon) by dgc (subscriber, #6611) [Link] (12 responses)

[stuff about shrinking filesystems]

There is no reason why we can't shrink an XFS filesystem - it's not rocket science but there's quite a bit of fiddly work to do and validate:

http://xfs.org/index.php/Shrinking_Support

If you really want shrink support, there's nothing stopping you from doing the work - we'll certainly help you as needed and test and review the changes if you really want to do it. That invitation is extended to anyone who wants to help implement it and write all the tests needed to validate the implementation. I'd estimate about a man-year of work is needed to get it to production ready.

However...

The reason it hasn't been done is that there is basically no demand for shrinking large filesystems. Storage is -cheap-, and in most environments data sets and capacity only grow.

I mentioned thin provisioning in my talk when asked about shrinking - it makes shrinking a redundant feature. All you need to do is run fstrim on the filesystem to tell the storage what regions are unused and all that unused space is returned to the storage free space pool. The filesystem has not changed at all, but the amount of space it consumes is now only the allocated blocks. It will free up more space than even shrinking the filesystem will....

Further, shrinking via thin provisioning is completely filesystem independent so the "shrink" method is common across all filesystems that supports discard operations. IOWs, there's less you need to know about individual filesystem functionality...

In comparison, shrinking is substantially more complex, requires moving data, inodes, directories and other metadata around (i.e. new transactions), requires some tricky operations (like moving the journal!), invalidates all your incremental backups (because inode numbers change), and on top of it all you have to be prepared for a shrink to fail. This means you need to take a full backup before running the operation. If a shrink operation fails you could be left in an unrecoverable situation, requiring a mkfs/restore to recover from. At that point, you may as well just do a dump/mkfs/restore cycle...

IOWs, shrinking is not a simple operation, it has a considerable risk associated with it, and requires a considerable engineering and validation effort to implement it. Those are good arguments for not supporting it, especially as thin provisioning is a more robust and faster way of managing limited storage pools.

Dave.

XFS: the filesystem of the future?

Posted Jan 23, 2012 23:10 UTC (Mon) by dlang (guest, #313) [Link] (5 responses)

shrinking can fail, that I agree with.

but shrinking should never fail in a way that leaves the filesystem invalid.

block numbers will need to change, but I don't see why inode numbers would have to change (and if you don't change those, then lots of other problems vanish), they are already independent of where on the disk the data lives.

this seems fairly obvious to me, what am I missing that makes the simple approach of

identify something to move
copy the data blocks
change the block pointers to the new blocks
free the old blocks
repeat until you have moved everything

not work? (at least for the file data)

if you try to do this on a live filesystem, then you need to do a lot of locking and other changes to make sure the data doesn't change under you (and that new data doesn't go into the space you are trying to free), but if the filesystem is offline for the shrink this shouldn't be an issue.

moving metadata will be more complex, but the worst case should be that something can't be moved, and so you can't shrink the filesystem beyond that point, but there should still be no risk.

XFS: the filesystem of the future?

Posted Jan 24, 2012 0:03 UTC (Tue) by dgc (subscriber, #6611) [Link] (3 responses)

> block numbers will need to change, but I don't see why inode numbers
> would have to change (and if you don't change those, then lots of other
> problems vanish), they are already independent of where on the disk the
> data lives.

Inode numbers in XFS are an encoding of their location on disk. To shrink, you have to physically move inodes and so their number changes.

> this seems fairly obvious to me, what am I missing that makes the
> simple approach of

[snip description of what xfs_fsr does for files]

> not work? (at least for file data)

Moving data and inodes is trivial - most of that is already there with the [almost finished] xfs_reno tool (moves inodes) and the xfs_fsr (moves data) tools. It's all the other corner cases that are complex and very hard to get right.

The "identify something to move" operation is not trivial in the case of random metadata blocks in the regions that will be shrunk. A file may have all it's data in a safe location, but it may have metadata in some place that needs to be moved (e.g. an extent tree block). Same for directories, symlinks, attributes, etc. That currently requires a complete metadata tree walk which is rather expensive. It will be easier and much faster when the reverse mapping tree goes in, though.

The biggest piece of work is metadata relocation. For each different type of metadata that needs to be relocated, the action is different - reallocation of the metadata block and then updating all the sibling, parent and multiple index blocks that point to it is not a simple thing to do. It's easy to get wrong and hard to validate. And there are a lot of different types. e.g. there are 6 different types of metadata blocks with multiply interconnected indexes in the directory structure alone.

> if you try to do this on a live filesystem

If we want it to be a fail-safe operation then it can only be done online. xfs_fsr and xfs_reno already work online and are fail-safe. Essentially, every metadata change must be atomic and recoverable and that means it has to be done through the transaction subsystem. We don't have a transaction subsystem implemented for offline userspace utilities, so a failure during an offline shrink would almost certainly result in a corrupted filesystem or data loss. :(

In case you hadn't guessed by now, one of the reasons we haven't implemented shrinking is that we know *exactly* how complex it actually is to get it right. We're not going to support a half-baked implementation that screws up, so either we do it right the first time or we don't do it at all. But if someone wants to step up to do it right then they'll get all the help they need from me. ;)

Dave.

XFS: the filesystem of the future?

Posted Jan 24, 2012 0:41 UTC (Tue) by dlang (guest, #313) [Link] (2 responses)

> Inode numbers in XFS are an encoding of their location on disk. To shrink, you have to physically move inodes and so their number changes.

If I understand this correctly, this means that a defrag operation would have the same problems. Does this mean that there is no way (other than backup/restore) to defrag XFS?

as for the rest of the problems (involving moving metadata), would a data-only shrink that couldn't move metadata make any sense at all?

XFS: the filesystem of the future?

Posted Jan 24, 2012 2:04 UTC (Tue) by dgc (subscriber, #6611) [Link] (1 responses)

xfs_fsr doesn't change the inode number. It copies the data to another temporary file and if the source file hasn't changed once the copy is complete, it atomically swaps the extents between the two inodes via a special transaction. It uses invisible IO, so not even the timestamps on the inode being defragged get changed.

As to data only shrink, that makes no sense because metadata like directories will pin the blocks high up in the filesystem. and so you won't be able to shrink it anyway....

XFS: the filesystem of the future?

Posted Jan 24, 2012 8:13 UTC (Tue) by tialaramex (subscriber, #21167) [Link]

OK, so XFS doesn't support full defrag, it can't move metadata to improve performance - but it does have a data-only defrag which will be enough for some people.

XFS: the filesystem of the future?

Posted Jan 26, 2012 4:58 UTC (Thu) by sandeen (guest, #42852) [Link]

The other thing to consider about shrinking is that without a LOT of work, it will almost certainly give you a "best fit" into your new space, not an optimal layout. I've seen extN filesystems that have gone through a lot of shrink/grow/shrink/grow and the result is quite a mess, allocation wise. That's not really even a dig at ext4; if you are constantly rescrambling any filesystem like that, you're going to stray from any optimal allocations you may have had before you started...

XFS: the filesystem of the future?

Posted Jan 24, 2012 2:06 UTC (Tue) by wdaniels (guest, #80192) [Link] (5 responses)

Hi Dave,

I would not want to argue that shrinking support is actually needed and that was not my intention, though fstrim is certainly new to me and I thank you for that pointer. Let me explain more precisely my XFS problem in case you would like to understand more of the kinds of things that happen to people that discourage them from moving away from the familiar:

I was provisioning a new 8TB server (4 x 2TB physical disks) that was to serve as a storage area for a number of different systems. Some peculiarities of the applications meant that separate partitions were desirable. Overall, I needed a number of small (~50-250GB) volumes and to be able to utilise the remaining space for VM images. Since I wasn't sure about how many of the smaller partitions I needed, I used LVM to create one 8TB PV with LVs allocated upon that to suit.

When it came to creating the largest logical volume for all the remaining free space, I decided to try XFS because I had heard it was more efficient with large files and I was slightly worried about the performance impact of LVM (it was my first time playing with LVM also).

So I provisioned around 6TB that remained for the large XFS partition and soon filled it up. Then came the half-expected requirement to add another couple of smaller partitions. No problem I thought, LVM to the rescue. Or it would have been if I could have shrunk the XFS filesystem to truncate the logical volume!

I may well misunderstand the subtleties of XFS block allocations over LVM's physical extent mappings, sparse file allocation and the like for thin-provisioning (I often do misunderstand such things) but I don't think fstrim would have helped me there even had I known about it at the time.

I only had a 100Mbps NIC on the server so it took quite some time (days if I recall) to backup all that data, recreate the filesystem as ext4 and copy it all back.

It may well be that my use case was highly unusual, my research insufficient, my knowledge limited and/or my strategy idiotic. I'm a programmer first and reluctant sysadmin. But this is not at all unusual outside of such groups of experienced experts such as you'd find at LWN.

One reason for my posting about the shrinking issue was that I hadn't seen it mentioned yet (sorry, I did not watch the video of the full talk), but really my point was that I ended up causing myself a great deal of trouble which only re-enforced to me the wisdom of the tech dinosaurs I have worked with in my time that you should avoid deviating from what you know and trust, without sufficiently good reason.

There are many who take this view, at least enough that it seems improbable to me that you will succeed in convincing the partly-informed, risk-averse and time-constrained majority to displace ext4 for XFS as a default choice in their minds.

I think it's great that you are taking the time to promote the benefits of XFS and to keep improving on it. I read this article precisely because I was made aware through my previous screw-ups that I need to invest more time learning about different filesystems, but I generally feel better knowing about where I'm likely to face problems as much as what I have to gain. And for that reason I still find it useful when people pick up on minor detractions, even if they seem unimportant in the grand scheme of things.

Hope you understand!
-Will

XFS: the filesystem of the future?

Posted Jan 24, 2012 4:49 UTC (Tue) by raven667 (subscriber, #5198) [Link]

Here's some advice. Shrinking a filesystem, any filesystem, is generally very problematic and often not supported because it is a very complicated operation as detailed elsewhere in the comments. My policy is to use LVM to size the filesystem for your immediate needs and a small amount of breathing room and extend as necessary. Good, modern filesystems can safely extend while mounted. With that policy you would never have the situation you describe, you would either add space where it is needed or there just physically isn't enough space and you need to buy more, those are the only options and both are easy to support.

XFS: the filesystem of the future?

Posted Jan 27, 2012 0:59 UTC (Fri) by dgc (subscriber, #6611) [Link] (3 responses)

Hi Will,

Your problem is poster-child case for why you should use thin provisioning. Make the filesystems as large as you want, and let actually usage of the filesystems determine where the space is used. When you then realise that the 6TB XFS volume was too large, remove stuff from it and run fstrim on it to release the free space back to the thinp pool where it is now available to be used by the other filesystems that need space. No need to shrink at all, and you have an extremely flexible solution across all your volumes and filesystems.

And if you want to limit an XFS filesystem to a specific, lesser amount of space than the entire size it was made with (after releasing all the free space), you could simply apply a directory tree quota to /...

Dave.

XFS: the filesystem of the future?

Posted Feb 7, 2012 0:22 UTC (Tue) by ArbitraryConstant (guest, #42725) [Link] (2 responses)

> Your problem is poster-child case for why you should use thin provisioning.

hm... It doesn't seem consistent to talk about how xfs is well suited to inexpensive servers, but then require features not available from inexpensive servers for important functionality.

XFS: the filesystem of the future?

Posted Feb 7, 2012 3:22 UTC (Tue) by dgc (subscriber, #6611) [Link] (1 responses)

> hm... It doesn't seem consistent to talk about how xfs is well
> suited to inexpensive servers, but then require features not
> available from inexpensive servers for important functionality.

Thin provisioning is available to any linux system via the Device Mapper module dm-thinp. You don't need storage hardware that supports this functionality any more - all recent kernels support it.

Dave.

XFS: the filesystem of the future?

Posted Feb 7, 2012 3:40 UTC (Tue) by ArbitraryConstant (guest, #42725) [Link]

Interesting - that's a very useful feature! Thanks