Filesystems (ext3, reiser, xfs, jfs) comparison on Debian Etch
Posted by hansivers on Fri 21 Apr 2006 at 11:10
There are a lot of Linux filesystems comparisons available but most of them are anecdotal, based on artificial tasks or completed under older kernels. This benchmark essay is based on 11 real-world tasks appropriate for a file server with older generation hardware (Pentium II/III, EIDE hard-drive).
Since its initial publication, this article has generated a lot of questions, comments and suggestions to improve it. Consequently, I'm currently working hard on a new batch of tests to answer as many questions as possible (within the original scope of the article). Results will be available in about two weeks (May 8, 2006). Many thanks for your interest, and keep in touch with Debian-Administration.org! Hans
Why another benchmark test?
I found two quantitative and reproducible benchmark studies using the 2.6.x kernel (see References). Benoit (2003) implemented 12 tests using large files (1+ GB) on a Pentium II 500 server with 512MB RAM. This test was quite informative, but its results are beginning to age (kernel 2.6.0) and apply mostly to settings that manipulate large files exclusively (e.g., multimedia, scientific computing, databases).
Piszcz (2006) implemented 21 tasks simulating a variety of file operations on a PIII-500 with 768MB RAM and a 400GB EIDE-133 hard disk. To date, this testing appears to be the most comprehensive work on the 2.6 kernel. However, since many tasks were "artificial" (e.g., copying and removing 10 000 empty directories, touching 10 000 files, splitting files recursively), it may be difficult to transfer some conclusions to real-world settings.
Thus, the objective of the present benchmark testing is to complement some of Piszcz's (2006) conclusions by focusing exclusively on real-world operations found in small-business file servers (see the description of selected tasks below).
Test settings
- Hardware
- Processor : Intel Celeron 533
- RAM : 512MB RAM PC100
- Motherboard : ASUS P2B
- Hard drive : WD Caviar SE 160GB (EIDE 100, 7200 RPM, 8MB Cache)
- Controller : ATA/133 PCI (Silicon Image)
- OS
- Debian Etch (kernel 2.6.15), distribution upgraded on April 18, 2006
- All optional daemons killed (cron, ssh, samba, etc.)
- Filesystems
- Ext3 (e2fsprogs 1.38)
- ReiserFS (reiserfsprogs 3.6.19)
- JFS (jfsutils 1.1.8)
- XFS (xfsprogs 2.7.14)
Description of selected tasks
- Operations on a large file (ISO image, 700MB)
- Copy ISO from a second disk to the test disk
- Recopy ISO in another location on the test disk
- Remove both copies of ISO
- Operations on a file tree (7500 files, 900 directories, 1.9GB)
- Copy file tree from a second disk to the test disk
- Recopy file tree in another location on the test disk
- Remove both copies of file tree
- Operations within the file tree
- Recursively list all contents of the file tree and save the listing on the test disk
- Find files matching a specific wildcard in the file tree
- Operations on the file system
- Creation of the filesystem (mkfs) (all FS were created with default values)
- Mount filesystem
- Umount filesystem
The sequence of 11 tasks (from creation of FS to umounting FS) was run as a Bash script which was completed three times (the average is reported). Each sequence takes about 7 minutes. The time to complete each task (in seconds), the percentage of CPU dedicated to the task and the number of major/minor page faults during the task were computed by the GNU time utility (version 1.7).
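For illustration, a single task could be timed along these lines; the exact script is not reproduced in the article, so the task, paths and output format below are assumptions:

# Illustrative sketch only -- not the actual benchmark script.
# GNU time: %e = elapsed seconds, %P = CPU percentage,
#           %F = major page faults, %R = minor page faults.
TIME_FMT="elapsed=%e s  cpu=%P  major=%F  minor=%R"
/usr/bin/time -f "$TIME_FMT" -o iso_copy.log \
    cp /mnt/source/test.iso /mnt/test/test.iso
sync   # flush buffers so the next task starts from a known state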
RESULTS
Partition capacity
Initial (after filesystem creation) and residual (after removal of all files) partition capacity was computed as the ratio of the number of available blocks to the number of blocks on the partition. Ext3 has the worst initial capacity (92.77%), while the other FS preserve almost the full partition capacity (ReiserFS = 99.83%, JFS = 99.82%, XFS = 99.95%). Interestingly, the residual capacity of Ext3 and ReiserFS was identical to the initial capacity, while JFS and XFS lost about 0.02% of their partition capacity, suggesting that these FS can dynamically grow but do not completely return to their initial state (and size) after file removal.
Conclusion : To use the maximum of your partition capacity, choose ReiserFS, JFS or XFS.
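As an illustration, the capacity ratio described above can be read straight from df; the mount point below is an assumption:

# Illustrative sketch only: available blocks / total blocks on the test partition.
df -P /mnt/test | awk 'NR == 2 { printf "capacity = %.2f%%\n", 100 * $4 / $2 }'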
File system creation, mounting and unmounting
The creation of the FS on the 20GB test partition took 14.7 secs for Ext3, compared to 2 secs or less for the other FS (ReiserFS = 2.2, JFS = 1.3, XFS = 0.7). However, ReiserFS took 5 to 15 times longer to mount the FS (2.3 secs) than the other FS (Ext3 = 0.2, JFS = 0.2, XFS = 0.5), and also 2 times longer to umount it (0.4 sec). All FS took comparable amounts of CPU to create the FS (between 59% for ReiserFS and 74% for JFS) and to mount it (between 6 and 9%). However, Ext3 and XFS took about 2 times more CPU to umount (37% and 45%) than ReiserFS and JFS (14% and 27%).
Conclusion : For quick FS creation and mounting/unmounting, choose JFS or XFS.
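For reference, creating each filesystem with default values amounts to commands like the following; the partition name is an assumption, since the article does not list the exact invocations:

# Illustrative sketch only: default-option creation on the test partition.
mkfs.ext3   /dev/hdb1     # e2fsprogs
mkreiserfs  /dev/hdb1     # reiserfsprogs
mkfs.jfs    /dev/hdb1     # jfsutils
mkfs.xfs -f /dev/hdb1     # xfsprogs (-f overwrites any existing filesystem)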
Operations on a large file (ISO image, 700MB)
The initial copy of the large file took longer on Ext3 (38.2 secs) and ReiserFS (41.8) than on JFS and XFS (35.1 and 34.8). The recopy on the same disk favoured XFS (33.1 secs) over the other FS (Ext3 = 37.3, JFS = 39.4, ReiserFS = 43.9). The ISO removal was about 100 times faster on JFS and XFS (0.02 sec for both), compared to 1.5 sec for ReiserFS and 2.5 sec for Ext3! All FS took comparable amounts of CPU to copy (between 46 and 51%) and to recopy the ISO (between 38 and 50%). ReiserFS used 49% of CPU to remove the ISO, while the other FS used about 10%. There was a clear trend of JFS using less CPU than any other FS (about 5 to 10% less). The number of minor page faults was quite similar between FS (ranging from 600 for XFS to 661 for ReiserFS).
Conclusion : For quick operations on large files, choose JFS or XFS. If you need to minimize CPU usage, prefer JFS.
Operations on a file tree (7500 files, 900 directories, 1.9GB)
The initial copy of the tree was quicker for Ext3 (158.3 secs) and XFS (166.1) than for ReiserFS and JFS (172.1 and 180.1). Similar results were observed during the recopy on the same disk, which favoured Ext3 (120 secs) over the other FS (XFS = 135.2, ReiserFS = 136.9 and JFS = 151). However, the tree removal took about 2 times longer on Ext3 (22 secs) than on ReiserFS (8.2 secs), XFS (10.5 secs) and JFS (12.5 secs)! All FS took comparable amounts of CPU to copy (between 27 and 36%) and to recopy the file tree (between 29% for JFS and 45% for ReiserFS). Surprisingly, ReiserFS and XFS used significantly more CPU to remove the file tree (86% and 65%) while the other FS (Ext3 and JFS) used about 15%. Again, there was a clear trend of JFS using less CPU than any other FS. The number of minor page faults was significantly higher for ReiserFS (total = 5843) than for the other FS (1400 to 1490). This difference appears to come from a 5 to 20 times higher rate of page faults for ReiserFS during the recopy and removal of the file tree.
Conclusion : For quick operations on a large file tree, choose Ext3 or XFS. Benchmarks from other authors have supported the use of ReiserFS for operations on large numbers of small files. However, the present results on a tree comprising thousands of files of various sizes (10KB to 5MB) suggest that Ext3 or XFS may be more appropriate for real-world file server operations. Even if JFS minimizes CPU usage, it should be noted that this FS comes with significantly higher latency for large file tree operations.
Directory listing and file search into the previous file tree
The complete (recursive) directory listing of the tree was quicker for ReiserFS (1.4 secs) and XFS (1.8) than for Ext3 and JFS (2.5 and 3.1). Similar results were observed during the file search, where ReiserFS (0.8 sec) and XFS (2.8) yielded quicker results than Ext3 (4.6 secs) and JFS (5 secs). Ext3 and JFS took comparable amounts of CPU for the directory listing (35%) and the file search (6%). XFS took more CPU for the directory listing (70%) but a comparable amount for the file search (10%). ReiserFS appears to be the most CPU-intensive FS, with 71% for the directory listing and 36% for the file search. Again, the number of minor page faults was 3 times higher for ReiserFS (total = 1991) than for the other FS (704 to 712).
Conclusion : Results suggest that, for these tasks, the filesystems can be grouped as (a) quick but more CPU-intensive (ReiserFS and XFS) or (b) slower but less CPU-intensive (Ext3 and JFS). XFS appears to be a good compromise, with relatively quick results, moderate CPU usage and an acceptable rate of page faults.
OVERALL CONCLUSION
These results replicate previous observations from Piszcz (2006) about the reduced disk capacity of Ext3, the longer mount time of ReiserFS and the longer FS creation time of Ext3. Moreover, like the present report, both previous studies observed that JFS is the FS with the lowest CPU usage. Finally, this report appears to be the first to show the high page-fault activity of ReiserFS on most usual file operations.
While each filesystem has its relative merits, only one filesystem can be installed on a given partition/disk. Based on all the testing done for this benchmark essay, XFS appears to be the most appropriate filesystem to install on a file server for home or small-business needs :
- It uses the maximum capacity of your server hard disk(s)
- It is the quickest FS to create, mount and unmount
- It is the quickest FS for operations on large files (>500MB)
- It takes a solid second place for operations on a large number of small to moderate-sized files and directories
- It constitutes a good CPU-versus-time compromise for large directory listings or file searches
- It is not the least CPU-demanding FS, but its use of system resources is quite acceptable for older-generation hardware
While Piszcz (2006) did not explicitly recommend XFS, he concluded that "Personally, I still choose XFS for filesystem performance and scalability". I can only support this conclusion.
References
Benoit, M. (2003). Linux File System Benchmarks.
Piszcz, J. (2006). Benchmarking Filesystems Part II. Linux Gazette, 122 (January 2006).
[ Parent | Reply to this comment ]
http://m.domaindlx.com/LinuxHelp/resources/fs-benchmarks.htm
The first column names the filesystem tested. The second column records the total time (in seconds) it took to run the filesystem benchmarking software bonnie++ (Version 1.93c). The third column records the total number of megabytes needed to store 655 megabytes of raw data.
SMALLER is better.
FILESYSTEM      | TIME (secs) | DISK USAGE (MB)
----------------|-------------|----------------
REISER4 (lzo)   |       1,938 |  278
REISER4 (gzip)  |       2,295 |  213
REISER4         |       3,462 |  692
EXT2            |       4,092 |  816
JFS             |       4,225 |  806
EXT4            |       4,408 |  816
EXT3            |       4,421 |  816
XFS             |       4,625 |  799
REISER3         |       6,178 |  793
FAT32           |      12,342 |  988
NTFS-3g         |     >10,414 |  772
Each test was performed 5 times and the average value recorded. SMALLER is better.
The Reiser4 filesystem clearly had the best test results.
The FAT32 filesystem had the worst test results.
The bonnie++ tests were performed with the following parameters:
bonnie++ -n128:128k:0
[ Parent | Reply to this comment ]
These days, tools and integration have been setup quite nicely by the distribution maintainers, and a fsck-interface is used by all of the ones I've tried anyway.
Of course, if you want to play with more advanced options, dump filesystems or do anything out of the ordinary, your findings may be different -- I would not know, since I rarely, if ever, use these features.
On the other hand, IMO it rarely matters which filesystem you use anyway. I would challenge anybody to guess the filesystem running on a light to medium loaded desktop or server. Differences (in speed of mature common journaling filesystems) really are rather small for general use, and it's not until you have very specific tasks to be done or very i/o loaded systems to be managed that the choice of journaling filesystem becomes a real issue.
I believe that XFS has upcoming (or perhaps already has some) support in FreeBSD, though. FreeBSD has ext2 (read) support too. And IIRC, there was a (non-microsoft, obviously) driver adding ext2 (hence, ext3) support to Windows -- if you run that OS and want to allow it to touch your nice Linux system, that is. I suppose that falls under compatibility too.
[ Parent | Reply to this comment ]
- What did you use to compare the times? What tools? Which commands?
- It was not mentioned that Ext3 reserves 5% of the disk for the root user (see http://ubuntuforums.org/showthread.php?t=215177).
- Some points in the article are too generic, simply saying that the other FS did better or used less (e.g.: "The ReiserFS used 49% of CPU to remove ISO, when other FS used about 10%"). Which other FS used about 10% of CPU?
The rest of the article I found good. My tip is to try a better server, like a Core Duo or an AMD X2, or even a Xeon or Opteron, since we're talking about business servers (but what you have is good enough if you don't have one to test). Maybe it would also be good to test on SATA or SCSI drives...
Note: I use ext3 mostly because of compatibility. I like new stuff, but I don't like to play with FSs.
[ Parent | Reply to this comment ]
I use ReiserFS because it's the only filesystem that supports shrinking a filesystem - veeeery useful with LVM! With JFS, XFS, ... it's backup - resize - restore, and with 1 TB (which is not that much anymore nowadays) that's a joke. Measuring the performance of this operation is missing from the article.
[ Parent | Reply to this comment ]
The tasks selected here were not intended to be comprehensive, since I was focusing more on adding some new hard data to Piszcz's excellent benchmark series. I will surely add your suggestion to a follow-up round of testing. Thanks!
[ Parent | Reply to this comment ]
You stated that all filesystems were created using default values. So ext3 loses approx. 5% of its capacity because of its reserved-blocks feature. For a fileserver you would create your data-partition without reserved blocks, as it is not needed there.
Or was this taken into account already?
Cheers,
Johannes
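For completeness, the reserved-blocks percentage is adjustable at creation time or afterwards; a minimal sketch, with the device name as an assumption:

# Illustrative sketch only: control the root-reserved space on an ext3 data partition.
mke2fs -j -m 0 /dev/hdb1      # create with no reserved blocks (-j adds the ext3 journal)
tune2fs -m 1 /dev/hdb1        # or change the percentage later on an existing filesystem
tune2fs -l /dev/hdb1 | grep -i 'reserved block count'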
[ Parent | Reply to this comment ]
/Nafallo
[ Parent | Reply to this comment ]
Why would you want to make a fileserver to perform badly? The 5% default reserve is in part to have some slack for use by 'root', but mostly because when a filesystem is nearly full its performance becomes very bad as fragmentation increases nonlinearly.
My experience is that 'ext3' requires at least 10% free space to perform decently over time (the 5% default is the absolute minimum that should be done) and 20-30% free space reserve is a lot better.
The problem is indeed intrinsic: as the filesystem nears 100% usage, the chances of finding contiguous or nearby free blocks when writing or extending a file become a lot smaller. This applies both to extent-based filesystems like JFS and XFS and to block-based filesystems like 'ext3' (even if extent-based filesystems usually do a bit better).
[ Parent | Reply to this comment ]
Intrinsic perhaps, but it highlights what is missing, such as how the filesystems address the issue. Part of the cost of writing ReiserFS is all that messing with binary trees; the report doesn't attempt to understand, or address, why Reiser thinks it is worth doing all this (admittedly a lot of people said it was just too expensive to be worth trying).
Very little in the way of "aging" the filesystem.
Nothing on consistency, or limits, or features.
I think the test choice is not great: filesystem creation, mounting, and manipulating ISOs are generally not time-critical tasks (well, if you don't use NTFS they aren't!). I'd happily use a filesystem that takes 100 times longer to create than any of those tested if it conveyed other discernible benefits, and it takes my CD writer over 5 minutes to write a full ISO, so a few seconds here or there matter not at all to me when manipulating ISOs.
It would be interesting to see how representative people think 7500 files totalling 1.9GB is. My understanding was that mean file size, whilst on the way up, hadn't reached that sort of figure yet. Certainly I have 15GB (df -h) of files on this box and just under 0.5 million files (find / -type f), clocking in at just under 32KB average file size, rather than the 253KB used in the test. Perhaps someone is hoarding ISOs?
It is well known there is a cost-benefit trade off in ReiserFS that means it performs relatively less well on larger files than XFS. So something like mean file size is likely to explain the difference between the results here, and of other authors, on the performance of ReiserFS.
I'd also prefer to see more edge cases examined -- what happens when 100,000 emails are delivered to a single maildir, then sorted, and a selection deleted? For most people, sensible behaviour under these edge cases probably matters far more than whether it takes 130 or 135 seconds to copy 7000 files.
I'd like to see blocking I/O cases, and similar examined, email delivery being a classic, and fairly easy to test.
Hardest of all to test: I want to know that the filesystem journalling "does what it says on the tin" -- have someone pull the plug in the middle of these transactions and see that nothing is corrupted, that everything is in a consistent state, how long the recovery to that consistent state takes, and that it is automatic.
Why no "bonnie++" statistics -- I'd have thought as a test it was trivial to run, and might show up something, even if the I/O types measured are a tad artificial.
Then again I appreciate these tests take a lot of time and effort to do.
Nothing here will shake my own choice of ReiserFS for most general-purpose filesystems; it has good performance on real-world benchmarks and is the most mature of the journalling filesystems presented. I am looking at XFS for a project, though not because of its performance but because of other features it brings to the table.
[ Parent | Reply to this comment ]
There's a myth abroad that modern drives detect falling supply voltage and do stuff to protect the media. (You might read of using the motor as a generator to provide power to "park" the heads.) If it was ever true, it was only in high-end drives, and not any more. Even making sure data really is physically on the disk surface before reporting the write complete, something we like to think of as the basic promise, is no longer widely supported, though the drives will claim otherwise. In practice you need up to a few seconds of power after a crash to drain sectors from the cache to the disk surface.
The mantra is, if reliability matters, replicate and use battery ("UPS") power.
[ Parent | Reply to this comment ]
I personally lost half of my homedir to XFS because it does only metadata journaling. Sure, all my files were there because of the journaling, but they were filled with binary zeroes. Oh and yes, the XFS FAQ says that this may happen (well at least it did when I was searching for an explanation of that behaviour back then).
I really don't want to use any "journaling" filesystem that does not journal my data, because then it's worthless. I don't need a filesystem that has a clean tree and can be mounted if I lose half of the data in it. ReiserFS was pretty good at garbling file contents ("WTF, is my mp3 playlist now that video I just downloaded?") too, before it had data journaling.
[ Parent | Reply to this comment ]
Just to let you know, Hitachi does use this technology to move the heads into the proper place if power loss is detected. However, I have no idea whether the drive is able to flush its buffers in this scenario. Perhaps it can, since Hitachi intentionally(?) restricts the write buffer size, while the read buffer is allowed to use the whole of the installed RAM IC.
[ Parent | Reply to this comment ]
I'm fairly new to linux and may well have been doing something wrong, but my box regularly had the power pulled on it (my area used to be prone to power dips and I couldn't afford a UPS).
When I was using ext3 it sometimes took me a few hours to get the system to even boot, because it refused while there were errors. Once I switched to ReiserFS all those problems went away.
In a single-user environment, which is realistic enough for me, I'd choose the FS that allows me to recover quickest over an FS that might take a second or two longer to do something. (Can't remember the last time I copied 7000 files, if ever.)
[ Parent | Reply to this comment ]
"Once I switched to ReiserFS all those problems went away."
I once used ReiserFS on a fileserver, and after a broken PSU took the server down, it was the data that went away, because the tree ReiserFS uses was corrupted.
The recovery utilities supplied for ReiserFS tried their best to recover the filesystem, but in the end I had to restore from the backup of the night before.
After this incident, I decided to give Ext3 and XFS a try. While XFS seems to be the more modern filesystem, it does lack the ability to shrink, which is a real problem in times of LVM2 and software RAID.
Ext3 hasn't let me down ever since. Its disaster recovery tools are the most mature of all the filesystems tested (in addition to being included in every live/recovery CD on the planet), and its online resizing capabilities really go together well with virtualized (as well as real) infrastructure.
On a side note, I also found Ext3 to be the most tolerant filesystem for use on "flaky" hardware. If, for example, a part of the binary tree used in ReiserFS happens to land on a defective sector of the harddisk, then it's bye-bye time for your entire FS. Ext3, on the other hand, will cope quite well and allows for full recovery using one of its redundantly stored superblocks.
[ Parent | Reply to this comment ]
I added these data to replicate other previous observations about filesystem creation and mounting time.
[ Parent | Reply to this comment ]
Pete .
[ Parent | Reply to this comment ]
Every time you choose to leave your computer turned on, you are choosing to disregard a finite chance that pollution will make this planet uninhabitable for future generations. That's not such a "simple answer" any more is it?
Wasting electricity despoils the commons. Turn the computer off when you aren't using it.
[ Parent | Reply to this comment ]
It's so much more useful to leave the machine on all the time!
Perhaps you are shutting it down to reduce noise, in which case, I commend quietpc.com to you - I can hear birdsong over mine, with the windows shut.
[ Parent | Reply to this comment ]
Why keep it running if you don't need it or can easily wake it up if you need it?
[ Parent | Reply to this comment ]
Slow the CPU (especially if it's a P4 or old Athlon) and hibernate the monitor. That itself will save a good amount of energy.
Or, better, host your home server on a passively-cooled Via system. Then you can shut-down your PC any time you want, while the server stays up, sipping watts.
[ Parent | Reply to this comment ]
Actually, SWSUSP2 works well enough now that I hibernate all of the machines at home (except the server) when they're not going to be used, such as overnight. Hibernation on my HP laptop is almost infallible -- I've got an "uptime" of over a month, hibernating once or twice (and sometimes more) every day -- and, while it takes a good minute to go into hibernation, it comes out of it within 35 seconds... that's from hitting the power switch to having my KDE desktop back up.
I just wish S3 worked as well, and that the kernel folks would adopt SWSUSP2, which works so much better than the default hibernate mechanism.
--- SER
[ Parent | Reply to this comment ]
File system creation time can't really be an issue to anyone, I guess.
[ Parent | Reply to this comment ]
The problem is that unless the IO subsystem supports mailboxing and tagged queueing, which are only available in practice on SCSI and SCSI/ATA host adapters (3ware and up), multiple concurrent accesses have awful performance.
However there are already some filesystem speed tests for suitable IO subsystems, alluded to by some other comment, for example:
http://ext2.SourceForge.net/2005-ols/ols-presentation-html/img38.html
BTW, in this graph the JFS performance comes out badly; I think an older version of JFS was used, one that had excessive locking, like 'ext3' for most of its life.
There are more links to filesystem speed tests here:
http://WWW.sabi.co.UK/Notes/anno05-3rd.html#050911
[ Parent | Reply to this comment ]
NMP
[ Parent | Reply to this comment ]
* during a file tree copy, pull the plug on the machine (this could be simulated by running the test under VMware and killing the virtual machine) - then check how well the FS recovers the data: which files got corrupted (if any), and how long it takes to fix (replaying journals, etc).
* call your initial tree t0. Make a new tree of approximately the same size called t1. For a concurrency test:
cp -a t0 t2 & # one tree
cp -a t1 t3 & # other tree
cp -a t0 t4 & # merge the trees
cp -a t1 t4 &
Hrm, should probably test mixed concurrency (deletes too!) so:
cp -a t0 t5 && rm -rf t5 &
I'm sure there's more to be added, maybe this will give you some ideas.
[ Parent | Reply to this comment ]
In these experiments the test variable was disk configuration rather than file system. A similar test across different file systems might produce a worthwhile indication of comparative reliability.
I suspect write caching would need to be disabled in the disk system to prevent corruptions of the kind being investigated in the link above from affecting the results. This would have an impact on absolute performance, but relative measurements could still be made.
[ Parent | Reply to this comment ]
Yank the plug while multiple processes are updating the disks. See what happens.
Repeat 8 or 10 times.
Yes, it's manual and time-consuming.
[ Parent | Reply to this comment ]
With ReiserFS 3, on the other hand, I've had two such events, and both times it managed to somehow completely destroy multiple files which were not even open at the time of the incident.
(Yes, this is anecdotal evidence, but I'm not using it anymore because of these incidents.)
[ Parent | Reply to this comment ]
It just isn't fun to see the filesystem break and read a publicity message about being able to ask questions for $25... (ReiserFS)
[ Parent | Reply to this comment ]
Get a recent ReiserFS and mount it with data=journal!
[ Parent | Reply to this comment ]
An interesting (and simple) test would be to simulate power failures during copy/delete operations (the large ISO file and the file tree) and see how each FS handles each situation. But I'm aware that this is only a small part of real data-integrity testing.
If anybody has seen hard data about FS reliability, feel free to post a link here. I would be very interested to investigate this, in order to produce more comprehensive and real-world benchmarks.
[ Parent | Reply to this comment ]
However, I have found one annoying feature with XFS: whenever there is a power failure, all open text files are filled with "^@^@^@^@^@^@^@^@...". You can easily replicate this by opening /etc/fstab in emacs and then unplugging the power cord. Why /etc/fstab... well, then you know why I find this feature REALLY annoying. So this power-failure test would be the first on my test list for real-world servers.
Anyway, I enjoyed reading your article. And being a professional researcher myself, I know that there is always room for improvement. Looking forward to reading the new comparison from you.
[ Parent | Reply to this comment ]
mail / # mount|grep -i xfs
/dev/hda3 on / type xfs (rw,noatime,logbufs=8,logbsize=32768,ihashsize=65567)
/dev/mapper/vg-usr on /usr type xfs (rw,nodev,noatime,logbufs=8,logbsize=32768,ihashsize=65567)
/dev/mapper/vg-home on /home type xfs (rw,nosuid,nodev,noatime,usrquota,grpquota,logbufs=8,logbsize=32768,ihashsize=65567)
/dev/mapper/vg-opt on /opt type xfs (rw,nodev,noatime,logbufs=8,logbsize=32768,ihashsize=65567)
/dev/mapper/vg-var on /var type xfs (rw,nodev,noatime,usrquota,grpquota,logbufs=8,logbsize=32768,ihashsize=65567)
/dev/mapper/vg-tmp on /tmp type xfs (rw,noexec,nosuid,nodev,noatime,usrquota,grpquota,logbufs=8,logbsize=32768,ihashsize=65567)
/dev/hda1 on /boot type xfs (rw,noatime,logbufs=8,logbsize=32768,ihashsize=65567)
mail / # ls -lah /boot/grub/*xfs*
-rw-r--r-- 1 root root 11K Jul 1 2005 /boot/grub/xfs_stage1_5
mail / # grub --version
grub (GNU GRUB 0.96)
mail / #
[ Parent | Reply to this comment ]
I have xfs on all my nfs servers. OS is SuSE 9.2/9.3/10.0.
[ Parent | Reply to this comment ]
- Peder
[ Parent | Reply to this comment ]
You said "While recognizing the relative merits of each filesystem, an system administrator has no choice but to install only one filesystem...". I don't understand why you believe there is or should be this sort of restriction.
Can't the administrator decide to 'partition' usage onto different volumes, using different fs types, based on their performance for the usage?
For example, I might create a volume to hold user homes, expecting many small files while requiring maximum speed, and so choose XFS, while for a volume holding large files (video, audio, backups, still images, etc.) I might choose JFS instead.
Note that I'm not advocating or even suggesting that the above is in some way an optimal setup, it's just an 'off the top of my head' example. The question is "Why shouldn't I be able to do this sort of thing if I so choose?" Is there something I'm missing?
[ Parent | Reply to this comment ]
For example, did you unmount the relevant partition before every single operation? You don't say, but if you did not (and not many people know that this is essential) your results are largely meaningless.
For far more sensible, documented and informative tests look at mine here:
http://WWW.sabi.co.UK/Notes/anno05-3rd.html#050908
and in a few entries around that date. Some amusing updates here:
http://WWW.sabi.co.UK/Notes/anno05-3rd.html#050913
http://WWW.sabi.co.UK/Notes/anno06-2nd.html#060416
[ Parent | Reply to this comment ]
I used to think much the same, but over time I discovered that I'd rather use JFS across the board (except for filesystems that need to be accessible from MS Windows, where I use 'ext2', as there is an excellent Windows driver for it).
The first reason is that 'ext3' performance is awesome when the filesystem has just been created and loaded, but degrades very badly over time, while JFS degrades significantly but a lot less:
http://WWW.sabi.co.UK/Notes/anno06-2nd.html#060416
The second reason is that probably because of some happenstance 'dir_index' can slow down things pretty significantly:
http://WWW.sabi.co.UK/Notes/anno05-4th.html#051204
A rather less significant advantage of JFS is that since it uses extents and dynamically allocated inodes it usually uses a lot less space for metadata, often like 3-5% of the total filesystem space.
[ Parent | Reply to this comment ]
--------
Felipe Sateler
[ Parent | Reply to this comment ]
grep -i acl config-2.6.15:
CONFIG_XFS_POSIX_ACL=y
[ Parent | Reply to this comment ]
The biggest problem however is not making or formatting a filesystem, it is how long it takes to 'fsck' it, and how much memory is necessary.
Times of over two months to 'fsck' a filesystem have been reported for 'ext3', and XFS sometimes requires more than 4GB of memory to run 'fsck' (it is possible to create and use an XFS filesystem on a system with a 32-bit CPU that can only be 'fsck'ed on a 64-bit CPU, and at least one such case has actually happened).
The basic problem is that while very large filesystems using JFS or XFS (or very recent 'ext3') perform well on RAID storage, because they take advantage of the parallel nature of the underlying storage system, 'fsck' is single threaded in every Linux file system design that I have seen. Bad news.
More details here:
http://www.sabi.co.uk/Notes/anno05-4th.html#051012
http://www.sabi.co.uk/Notes/anno05-4th.html#051009
I am very surprised that your experience is that «jfs just wasn't stable enough»; perhaps you may want to report it to the JFS mailing list, as the authors of JFS are very responsive to reports of instability and usually find a fix pretty quickly.
As to FC5 support, all Red Hat systems only support 'ext3', at least officially and in the installer, but after installation you can use any of the filesystems included in the kernel. I typically install to a small temporary partition which is 'ext3' formatted, and then convert it to JFS by copying its contents over to the real ''root'' partition which is JFS formatted.
[ Parent | Reply to this comment ]
Fig
[ Parent | Reply to this comment ]
The shred command works by writing random data, zeros, and ones over and over to the spots on the disk where the file you want to shred was located. The hope is that with enough writes, the data will actually be overwritten on the disk. (The head of the hard drive varies a small amount as it traces its path over the disk, so the data might not be completely erased.)
The problem is that journaling file systems write data to the journal before they write it to its final location on the disk. So even after shredding the file's blocks on the disk, an attacker might be able to recover data from wherever on the disk the journal is located, even if the data blocks are unreadable.
The real issue is that shredding a file, even on ext2, does not always work, because modern hard drives sometimes transparently remap bad sectors on the disk... so the drive might have moved the data away from what the operating system thinks is the location it originally wrote mysecret.txt to. An attacker could still read data from the "bad" sector using the right tools.
Realistically, shred should never be relied on. Using dm-crypt to encrypt the full filesystem is a much better solution, and with today's CPU power, the performance cost is a small enough trade-off for secrecy.
For more information about securely destroying data, you can read the TKS1 paper on this page, which is really interesting. (Scroll down to section 3 on page 4.)
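As a rough sketch of the dm-crypt suggestion above (device name and mapping name are assumptions, and cryptsetup with LUKS support is assumed to be installed):

# Illustrative sketch only: put a filesystem on an encrypted mapping instead of relying on shred.
cryptsetup luksFormat /dev/hdb1              # initialise the LUKS header (asks for a passphrase)
cryptsetup luksOpen /dev/hdb1 securedata     # expose the decrypted view as /dev/mapper/securedata
mkfs.ext3 /dev/mapper/securedata             # any filesystem can sit on top of the mapping
mount /dev/mapper/securedata /mnt/secure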
[ Parent | Reply to this comment ]
From man shred:
"... Note that shred relies on a very important assumption: that the file
system overwrites data in place. This is the traditional way to do things, but
many modern file system designs do not satisfy this assumption.
[snip ]...
In the case of ext3 file systems, the above disclaimer applies (and shred is thus
of limited effectiveness) only in data=journal mode, which journals file data in
addition to just metadata. In both the data=ordered (default) and data=writeback
modes, shred works as usual. "
[ Parent | Reply to this comment ]
That's rather wide of the mark: most filesystems are supposed to recover A CONSISTENT STATE of the METADATA ONLY from such events.
'ext3' additionally can make an attempt at recovering the contents of files too, if ordered or data journaling is enabled.
However the proper way to ensure data (as opposed to metadata) recoverability is to ensure the application handles that, using atomic data transactions, because that's the only way, and even if 'ext3' often succeeds blindly, that is not the right way.
Large-scale filesystems like JFS and XFS, designed for mission-critical applications, don't make any attempt at data recovery, because indeed that should be handled by the applications themselves.
Many people who don't understand this then complain that these two filesystems cause loss of data...
[ Parent | Reply to this comment ]
In all the cases you mentioned, your chances would indeed be best if you had your data on an Ext3 filesystem, since it's the only one not using some binary tree structure to manage where your data is stored.
The problem of today's hardware is all the caching that's going on at various levels. The application can't really tell whether a certain file has really been written to a block on the harddisk, because all of that is completely hidden away in some HAL.
I think that Sun's approach with its ZFS filesystem is suitable to tackle this challenge. It uses end-to-end checksums to detect file corruption from the harddisk right through to the application level. Too bad it isn't GPL'd, so we'll hardly see much of it in the Linux world.
[ Parent | Reply to this comment ]
BTW, your article is awesome.
Good Work :-)
[ Parent | Reply to this comment ]
«The sequence of 11 tasks (from creation of FS to umounting FS) was run as a Bash script which was completed three times (the average is reported).»
If what is written is true, the tests have been run without 'umount'/'mount' before each of them, so these are mostly buffer-cache tests, not filesystem tests.
How can the article then be awesome?
Also, the reason why JFS etc. leave more free space after formatting than 'ext3' is obvious, but it is not explained (dynamic inode allocation), and so is the reason why some, like JFS, have a little more used space after all files are deleted (the table containing the dynamically allocated inodes can only grow, not shrink).
[ Parent | Reply to this comment ]
It's the clueless leading the clueless.
[ Parent | Reply to this comment ]
private mail and rtfm'ed him, including at least terse pointers on where to read what to improve.
Hans made an effort for this one, and it's a nice -improvable- overview. And it looks like he might improve it in a followup.
Now what did you do? No data/facts/suggestions, but only a personal insult. Nice indeed, and how full of merit.
Hans, ignore that troll.
Better yet, "inappropriate comments will be removed" was written somewhere, wasn't it :)
[ Parent | Reply to this comment ]
The substantive point is that people should be much more cautious and skeptical about this report than many are being.
[ Parent | Reply to this comment ]
I would have liked to see Reiser4 on the list though (yes, I know it's not fully supported yet and is quite controversial, but it is the latest generation of Reiser tech).
I went from ReiserFS to ext3 for support reasons a number of years ago. At work my box was installed for me with JFS and seems fairly nippy.
[ Parent | Reply to this comment ]
- did you umount the fs between tests or purge the kernel file cache by other means?
- did you use the same partition for each fs? (difference between outer and inner sector)?
- did you recreate the fs between tests (fragmentation - actually, testing a fragmented fs would reflect reality better, but it is very hard to reproduce/do equally for different fs's)?
- what were the mkfs and mount options (block size, root reserved space, reiser notail, extended attributes etc etc)
- what were the test commands? Did you count the sync (you did do that, right?) after the command in the elapsed time?
and so on...
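On the cache question above, one simple way to make each test start from a cold cache is to sync and cycle the mount between runs; a minimal sketch, with device and mount point as assumptions:

# Illustrative sketch only: flush cached data between benchmark tasks.
sync                        # push dirty buffers to disk
umount /mnt/test            # unmounting discards the cached pages for that filesystem
mount /dev/hdb1 /mnt/test   # remount before the next task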
[ Parent | Reply to this comment ]
Indeed, notail matters a great deal. Also, mounting all filesystems with noatime should strongly be considered; access times are virtually useless information that is quite expensive to maintain. These, as well as numerous other FS configuration parameters, are commonly used by experienced administrators, and the comparison is meaningless without taking them into account. The author of this comparison means well but is apparently quite lacking in requisite experience and expertise.
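For example, a ReiserFS /etc/fstab entry tuned as suggested might look like this; the device and mount point are assumptions:

# Illustrative sketch only: ReiserFS without tail packing and without access-time updates.
/dev/hdb1  /srv/data  reiserfs  noatime,notail  0  2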
[ Parent | Reply to this comment ]
My experience is that reiser's performance seriously degrades over time on partitions that are changing frequently (i.e. /var). apt and dpkg-operations really fly when /var/lib/dpkg is on a fresh reiser partition but crawl after a couple of weeks following debian/unstable.
I guess that is due to the repacker that is necessary for reiser not being available in the distributions. IIRC that important tool is only available to customers of Mr. Reiser. I stopped using reiserfs due to this.
[ Parent | Reply to this comment ]
All the filesystems, some more some less, have degrading performance on filesystems with high rewrites. This is more or less inevitable, in major part because very few speed tests are about this aspect, and why waste time optimizing an issue that is not that obvious?
However, the best way by far to fix the issue is not to use a defragmenter, even a background one, like the ReiserFS repacker.
Defragmenters are both dangerous and slow, because they do same-disk copies and in-place modification.
Also, in any case one should backup before defragmenting.
Now, the best way to defragment is to do a disk-to-disk image backup followed by a re-format of the original partition and a disk-to-disk tree restore, for example (where '/dev/hda' is the active drive and '/dev/hdc' the backup drive):
umount /dev/hda6
dd bs=4k if=/dev/hda6 of=/dev/hdc6
jfs_fsck /dev/hdc6
jfs_mkfs /dev/hda6
mount /dev/hda6 /mnt/hda6
mount /dev/hdc6 /mnt/hdc6
(cd /mnt/hdc6 && tar -cS -b8 --one -f - .) | \
(cd /mnt/hda6 && tar -xS -b8 -p -f -)
umount /dev/hdc6
This is just a simplified example of the steps... it can be used with the 'root' filesystem too with some modifications (easiest though if done from a liveCD).
Doing this copy has these important advantages:
* A backup is done just before the filesystem is optimized, as part of the process itself.
* Both the backup and the restore are disk-to-disk copies, which is a lot faster than same-disk copying.
* One of the copies is a very fast image copy, and the other is a sequential read and a sequential write, which are about as fast as a logical copy can go.
The risk and slowness of in-place, same-disk defragmentation might have been acceptable when backup was economical only to slow tape; but currently backup to disk is the best value, and one should take advantage of that.
[ Parent | Reply to this comment ]
That's dumb; the "backup" is immediately followed by deleting all the data on the original, so it's not a backup at all, it's just a pointless relocation. It would make more sense to mkfs the second disk, tree copy the first disk to the second disk, and then unmount the first disk and mount the second disk on the mount point (you of course use partition labels rather than absolute device names), which leaves the first disk as a backup that can be copied to tape or other backup media without affecting performance of the live filesystem. Of course, this all assumes that the filesystem can be unmounted in the first place, which often isn't possible -- making background defragmentation the best choice.
[ Parent | Reply to this comment ]
«That's dumb; the "backup" is immediately followed by deleting all the data on the original, so it's not a backup at all, it's just a pointless relocation.»
Nahhh, there is an essential detail here: in-place defragmentation is done on the filesystem itself, and there is no backup. If the in-place fails, goodbye data.
Instead by making a copy to another spindle and copying back there is always a valid copy.
«It would make more sense to mkfs the second disk, tree copy the first disk to the second disk, and then unmount the first disk and mount the second disk on the mount point»
Well, in theory one can first image copy and then tree copy back, or vice versa as you suggest.
I would rather first do the image copy, because if the filesystem is damaged it is important to have an exact image copy, including the ''free'' bits from which to attempt recovery. Doing first a tree copy has its advantages, but only copies the ''reachable from root'' subset of the filesystem, so it is not as full a backup as an image copy.
«the filesystem can be unmounted in the first place, which often isn't possible -- making background defragmentation the best choice.»
Well, no, if the file system cannot be unmounted then we have a VLDB-style filesystem:
http://WWW.sabi.co.UK/Notes/anno05-4th.html#051009
and for them in-place live restructuring is even more dangerous. It is not clear to me how best to handle 24x7 filesystems, but I suspect tree-based mirroring is the least bad option.
[ Parent | Reply to this comment ]
Nahhh, there is an essential detail here: in-place defragmentation is done on the filesystem itself, and there is no backup. If the in-place fails, goodbye data.
You're babbling; I didn't say anything about in-place defragmentation in the statement you quoted. And when there is in-place defragmentation, who says there's no backup? One always does a backup before defragmenting. Sheesh.
Instead by making a copy to another spindle and copying back there is always a valid copy.
In your scenario, you copied one disk to another and then immediately did a mkfs on the first disk. That makes the copy STUPID, compared to simply mkfsing the second disk; in both cases you have one disk containing the original FS and the other containing an empty FS. Sheesh.
[ Parent | Reply to this comment ]
«Now, the best way to defragment, is to do a disk-to-disk image backup followed by a re-format of the original partition and a disk-to-disk tree restore»
I disagree. A good defragmenter would defragment based on usage statistics. Frequently used files would be placed where the disk is faster (outer tracks), while unused files would be placed where it is slower. Files would be grouped based on usage patterns. For example, files used at boot time would be placed together.
[ Parent | Reply to this comment ]
I am well aware that copying the data (or better, all the files containing the data) will get me back to optimal performance. Maybe you can afford the downtime necessary to copy data back and forth on the server; I absolutely need some runtime mechanism for that.
[ Parent | Reply to this comment ]
Good stuff. Oh, a table summarizing the data would be cool.
[ Parent | Reply to this comment ]
I've tried XFS, and I honestly think it's pretty slow. I used it on a similar machine as mine, and un-tarring the Linux source tree took much longer on the XFS FS than on my ReiserFS (v3.6) partition. Also, XFS caches a lot of data in RAM, so there's high risk of data loss upon improper shutdowns.
I must note: ReiserFS was designed to take advantage of the CPU to make a faster filesystem. The creator of ReiserFS noticed that most other filesystems barely used any CPU at all when doing I/O operations. Therefore, maybe if you tested all of the filesystems on a faster CPU, the results would be different.
Great article, though. :)
[ Parent | Reply to this comment ]
This seems a reasonable conclusion from several tests I have seen: XFS has relatively high overheads but scales better than others, so it is less suitable for small scale filesystems.
In particular it has high CPU overheads. Now, I have also noticed that the whole Linux IO subsystem has pretty high CPU overheads, which make it rather CPU-bound except on the latest and greatest (Athlon 64 3000+ and similar) CPUs. If one has a slower CPU and adds a relatively CPU-heavy filesystem like XFS (or ReiserFS), that is not going to be that exciting...
[ Parent | Reply to this comment ]
Most file systems, not just 'ext3', perform pretty badly when they are near capacity, and in particular extents based file systems, which derive a fair bit of their advantage from having a lot less metadata, which happens only if there are long extents.
If the anti-fragmentation effects of a 5% reserve are dubious, it is only because 5% is way too low. It should be at least 10%, and ideally 20-30%.
[ Parent | Reply to this comment ]
A large reserve is the stupidest possible way to address the fragmentation problem, and is no substitute for proper defragmentation techniques.
[ Parent | Reply to this comment ]
Here is a simple scenario for CVS:
- multiple cvs commit (in particular look at files and directories fragmentation on server)
- cvs checkout
- cvs -n update
- cvs update (in particular look at files and directories fragmentation on client)
- cvs tag TAG
- cvs co ; cvs update -j TAG
- cvs diff
- cvs diff -j TAG1 -j TAG2
To avoid network problems, server and client environments would be on the same machine, but on different disks.
Comparing different SCS would be important as each one has different filesystem usage patterns, so filesystems may be more adapted to one SCS than the others.
Olivier Mengué
[ Parent | Reply to this comment ]
Hi everybody,
Wow! I *really* did not expect this first contribution to generate so many comments and interesting discussions. Some sound a bit "hard", since this review has a limited and modest objective (to complement some data published by Piszcz with tasks I felt were unclear or missing), but I understand the points made by these authors. I apologize for missing or unclear information.
I'm currently working to rerun all my testing while taking into account many suggestions :
- Be explicit about FS creation and mounting options
- Umount and remount partition between each test
- Report initial, residual as well as full disk capacity
- Compute time and CPU usage for fsck on each FS
I'm also investigating the best way to test :
- CVS operations (thanks for this excellent suggestion)
- concurrent tasks (another excellent one!)
within the scope of small-business file server operations.
Some discussions were initiated about how to test data integrity after unexpected system shutdowns. I feel it will be a very interesting metric to benchmark, since small-business and home servers may be less likely to have power-failure protection (UPS, etc.).
QUESTION TO EXPERIENCED CONTRIBUTORS :
Since it's my first contribution to debian-administration.org, I've restricted myself to the html tags suggested in the "Submit an article" section. However, I agree with previous comments that it would be more interesting to publish graphs of the results. How can I do that here (other than uploading graphs to a personal website and linking to them here)?
Thanks everybody!
[ Parent | Reply to this comment ]
http://www.sabi.co.uk/Notes/anno05-3rd.html#050908
and later entries.
I suspect that it would be far more interesting to see tests done with large partitions/filesystem sizes than I had the patience to do myself (my tests involved 4GB/8GB). But then I do have a bit of a lull now, and a couple of partitions with 40-70GB in them, so I could do that myself. The main problem is that takes time and it is very tedious... :-)
Even better it would be to find a largish filesystem that has been churned a fair bit and compare the times between its churned state and its freshly reloaded state.
Unless you have some host adapter with tagged queueing, doing concurrent accesses is going to be a waste of time, because it will perform awfully. Simple tests as to that were done many years ago, for example:
http://groups.google.com/group/comp.arch/msg/7da6f8635c3e14db
As someone else mentioned, more sophisticated examples have already been published, though on properly massive IO subsystems (those likely to have host adapters supporting tagged queueing).
[ Parent | Reply to this comment ]
Well, most of the improvements you suggest are already in the tests I did a while ago
Yes, I've already noticed your various posts about your work. Really interesting! Too bad I became aware of it after publishing my initial report. It would have helped me to better *bullet-proof* my methodology. :D
I feel that independent replication of results is as important as good methodology for the advancement of knowledge. I've published my initial results here and, from the beginning, I've invited readers to share comments and suggestions to improve this ongoing work. As in many other scientific fields, it's the accumulation of evidence that helps to establish facts, more than waiting for the "definitive" study to come. It's probably my statistician's background speaking here, and I respect that not everybody may share this view.
I suspect that it would be far more interesting to see tests done with large partitions/filesystem sizes than I had the patience to do myself (my tests involved 4GB/8GB).
I'm actually working on a 40GB partition and I'm planning to test on a 160GB partition (transfer sizes and operations will be proportionally increased), to see whether some results scale up linearly or exponentially.
[ Parent | Reply to this comment ]
Oh well, I was too curious and I have a high tolerance for boredom, so I have redone my informal tests on a 65GiB filesystem:
http://WWW.sabi.co.UK/Notes/anno06-2nd.html#060424
[ Parent | Reply to this comment ]
Either mail me the images to go along with the article - or host them somewhere yourself and I'll copy them over.
I'm happy to include images in pieces where they are useful and this type of article would benefit from them I agree!
(I'd much prefer to host images here since then I don't have to worry about them disappearing. However if you feel strongly that this is not a good idea I would allow you to host them.)
Thanks for the article, I too was surprised how many readers and commenters appreciated it!
[ Parent | Reply to this comment ]
Maybe recorded as the total time elapsed. Many people focus on the CPU time used, but I think the real figure to focus on is the total elapsed real time, as this is what the user experiences. The FS may differ here in their efficiency, given their intelligence in grouping IO requests together and, e.g., having inodes and blocks located close to each other.
I would like to see some consideration given to common jobs that take a long time:
searches in a directory tree, like find, or creation of a tar file.
Boot time (for the same OS)
loading of huge applications, like openoffice.org
concurrent reading of files, like a file server or a mirror server.
database handling, like a backup of a database.
Also, the reported disk usage of the partitions under the different FS for your system would be interesting. The root filesystem particularly, as this is likely to be common to most systems.
A survey of recovery utils could be nice. Which FS allows to recover one or more deleted files?
[ Parent | Reply to this comment ]
Second place goes to XFS. It is more stable than ReiserFS and has about the same recovery chances, but deleted files are recovered without their real size (the recovered file size is larger, rounded up to the FS block size).
Ext3 has almost the worst recovery chances because it wipes inodes immediately. File deletion on Ext3 works through the journal, and a file can be recovered until the 'deletion update' record is pushed off the journal. The number of recoverable deleted files depends on the journal size and journal use.
I didn't use JFS on Linux, just because it is not intended for it (actually, it is my opinion that its file-system design is not fully compatible with the Linux VFS).
Data recovery techniques and software:
There are many techniques described on Linux forums using "**fs_repair", "grep" and other similar basic tools; however, read/write access to a block device holding lost information can cause permanent data loss. That's why these can be recommended for non-critical data only.
Any valuable information should be recovered in read-only mode.
There are plenty of tools from different vendors that claim support for lost/deleted file recovery on all these file systems.
Here we use Linux in production, so deleted/lost file recovery eventually became critical...
We did some tests and most accurate recovery was with commercial UFS Explorer Standard Recovery (version 4.x) from SysDev Laboratories (http://www.ufsexplorer.com/download_stdr.php).
The tests included fragmented files deletion, file system damage (with partial wipe, metadata damage) etc.
The current version of this tool misses only the most critical thing - the ability to run natively on Linux (currently it is a Windows tool).
If anyone has done more research on this subject, I'd be glad to see other opinions.
[ Parent | Reply to this comment ]
Maybe reboot too.
I don't know for sure, but I've been told that SUSE's distro-specific patches make ReiserFS perform better there than on other distros. Maybe that's why the SUSE default FS is ReiserFS. It's the only distro where I use ReiserFS. :-)
[ Parent | Reply to this comment ]
Reiser's benchmarks are here: http://www.namesys.com/benchmarks.html
[ Parent | Reply to this comment ]
To keep wear on Compact Flash / USB memory sticks / MMC / SD cards low, it would be relevant to know how many blocks are actually changed on the device when writing some smaller and some larger files.
Is this information available somewhere?
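One rough way to approximate this (a sketch, assuming the flash device shows up as /dev/sdb and that on 2.6 kernels the tenth field of /proc/diskstats is sectors written):

    before=$(awk '$3=="sdb" {print $10}' /proc/diskstats)
    cp somefile /mnt/flash/ && sync
    after=$(awk '$3=="sdb" {print $10}' /proc/diskstats)
    echo "$(( (after - before) / 2 )) KiB written to the device"

Note this counts 512-byte sectors as seen by the block layer, not the flash erase blocks actually cycled.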
[ Parent | Reply to this comment ]
- MySQL or Postgres database add/update (typical business application)
- OpenLDAP or Fedora Directory database add/update (typical business application)
- Streaming of multiple files (typical for multimedia servers)
- Reading/writing to (pseudo-)random locations within a file (many tests are inherently sequential or indexed-sequential; this guarantees you have a baseline for random access - see the sketch below)
It would be good if CPU load and kernel memory consumption were also tracked (so there was an indication of FS overhead per unit of performance), especially if the tests were run on a normal setup and on two deliberately reduced configurations, so that it would be possible to extrapolate how the filesystem would perform under any other configuration (assuming FS performance follows a simple curve).
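As a crude sketch of the random-access point above (file name, size and block count are made up for illustration): pre-create a 128 MiB file, then rewrite 4 KiB blocks at pseudo-random offsets inside it.

    dd if=/dev/zero of=/mnt/testfs/bigfile bs=4k count=32768          # 128 MiB test file
    for i in $(seq 1 1000); do
        off=$(( RANDOM % 32768 ))                                     # pick a random 4 KiB block
        dd if=/dev/zero of=/mnt/testfs/bigfile bs=4k seek=$off count=1 \
           conv=notrunc 2>/dev/null
    done

A real benchmark would also want random reads and synchronous-write variants, but this gives a baseline.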
[ Parent | Reply to this comment ]
Also, it would be nice to differentiate between metadata and full-data journalling.
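For reference, the ext3 data journalling modes are selected at mount time; these /etc/fstab lines are only illustrative (device names are placeholders):

    # full data journalling - safest, usually slowest
    /dev/hda5  /data     ext3  data=journal    0  2
    # the ext3 default - journal metadata, flush data before committing it
    /dev/hda6  /home     ext3  data=ordered    0  2
    # metadata-only journalling - fastest, stale data possible after a crash
    /dev/hda7  /scratch  ext3  data=writeback  0  2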
I tend to use ext3 unless performance is critical; in that case, for many things I use ext2. In special cases I've used JFS and others, but I tend to be very cautious about that these days. I lost critical and expensive data when a drive with a JFS filesystem went bad. Despite the personal and financial impact of the loss, I could not afford the $30k that was the best data-recovery quote I received. The damage was the kind of thing fsck could have fixed on ext2 or ext3: I would have lost some files to corruption or total loss, but most of the data would have been recovered. I've also helped a friend resign himself to his loss after a similar problem with XFS.
One of the advantages to ext2/ext3 is the very redundancy that tends to slow it down...
[ Parent | Reply to this comment ]
What that meant in practice was that the server would be up for a year or so and I'd need to do a quick reboot. The fsck interval had expired, and as filesystems have grown, the check came to take around 30 minutes (too long to count as quick).
In every way that matters to me, I have found XFS to be superior in speed and flexibility to ext3.
[ Parent | Reply to this comment ]
Your article is a benchmark article, but it would be nice to acknowledge that there are other things to consider besides performance when selecting a filesystem. For me, RAS (Reliability, Availability, Serviceability) is very important. Some aspects of this:
1) How robust is the filesystem? How does it react to power failures? Hardware failures?
2) How quickly can it be recovered from a failure? How well does it recover data from a broken filesystem? Broken hardware?
3) What utilities or tools are available for the filesystem? For example, does it have a dump/restore facility?
It is the above that led me to choose ext3 for my use.
RCP
[ Parent | Reply to this comment ]
This category of users should choose ext3 and ONLY ext3. They should remove all other filesystems from the kernel (except ext2, of course) and recompile and install that kernel.
[ Parent | Reply to this comment ]
I would like to say a few words about ext3. First of all, if you use the default settings when formatting it, it will potentially be slower. Ext3 needs to be tuned to show its full performance.
Now, someone earlier said he switched his servers to XFS because of the ext3 filesystem check. He apparently didn't know that the periodic fsck can be turned off or postponed:
tune2fs -c [max-mount-counts]
Adjust the maximal mount count between two filesystem checks. If max-mount-counts is 0 or -1, the number of times the filesystem is mounted will be disregarded by e2fsck(8) and the kernel.
tune2fs -i [interval-between-checks(d|m|w)]
Adjust the maximal time between two filesystem checks. No postfix or 'd' results in days, 'm' in months, and 'w' in weeks. A value of zero will disable the time-dependent checking.
Now that that's settled, back to performance. Ext3 supports a few journal modes; you can read more about them in the manual pages. But if you want maximum performance on a workstation, the best mode is writeback:
journal_data_writeback
When the filesystem is mounted with journalling enabled, data may be written into the main filesystem after its metadata has been committed to the journal. This may increase throughput; however, it may allow old data to appear in files after a crash and journal recovery.
Next, if you test ext3 performance, you MUST, and I mean MUST, turn on dir_index:
dir_index
Use hashed b-trees to speed up lookups in large directories.
Ext3 also supports a few nice mount switches, like commit=[time], although that won't affect performance much. Yes, default ext3 is painfully slow, but data=writeback and dir_index give it a steroid pump and it performs far better - so please, in the next edition, tune ext3! You'll be surprised...
Why do I like ext3 so much? Well, first of all, it comes from the oldest filesystem lineage on Linux and has great support, recovery tools and stability... and best of all, it's really, really tunable. Also, you can mount it as ext2 (without the journal) if you want!
Now, don't panic when I say this, but I run my workstation on Gentoo. Because new versions of programs get into Portage almost every day, I upgrade packages often, which means deleting old files and creating new ones - lots of I/O. I ran ReiserFS for a few months and noticed pretty bad fragmentation (speed degradation over time). When I forced a defragmentation (copied the whole partition to another one, formatted it, then copied the data back), ReiserFS performed unbelievably well! The bad thing is that it degrades quite quickly, and no one-off test can take this into account. Then I switched to JFS, and that's an excellent FS for a workstation - the quickest as far as I can see: not noticeably slower than ReiserFS, but without ReiserFS's degradation "feature". As for XFS, I don't know about you, but it thrashed my hard drive's heads all over the place, and I don't like that... the drive is a little noisier with it.
Now, another thing: why wasn't ext2 (with dir_index) tested too? If you tend to split your / into many partitions (/var, /tmp, /var/tmp, /usr/src, ...), then ext2 is worth checking. For example, if you have a separate /usr/src, there is little point in having a journal on that partition. The journal adds quite an overhead, so I would suggest not using it where you don't need it. All such easily recreated partitions can happily use ext2.
So, my little conclusion (a command sketch follows below):
1. If you want speed and medium safety: first JFS, then ext3 tuned for speed.
2. If you want maximum safety for your data: ext3 with data=journal.
3. If you need pure speed and the data is easily replaceable (for example /usr/src): ext2 with dir_index.
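To make the "ext3 tuned for speed" case concrete, here is a minimal sketch based on the options mentioned above (the device name is a placeholder; noatime is an extra, commonly used tweak, not something from the article):

    mke2fs -j -O dir_index /dev/hda7                              # ext3 with hashed b-tree directories
    tune2fs -c 0 -i 0 /dev/hda7                                   # disable mount-count and time-based fsck
    mount -t ext3 -o data=writeback,noatime /dev/hda7 /mnt/test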
Jakov Sosic jsosic@jsosic.homeunix.org
[ Parent | Reply to this comment ]
I'd love to see a review of the filesystems on RAID (hardware and software) some time in the future!
[ Parent | Reply to this comment ]
What I'm missing is a comparison of various tunings for the filesystems. For example (since I use ext3, all of these are ext3 features), changing the journal mode has a huge effect on speed, at the cost of safety. The dir_index flag is also a big plus for directories with many files (and is now the default).
The amount of reserved space can be tuned on ext3 (and I always set it to 0 on non-system disks). What effect does that have on speed? How do filesystems behave when they are getting full? How about filling the disk up to X% with files of random size, then repeatedly creating randomly sized files and deleting random files, keeping the FS always at X%?
How does the speed vary as X approaches 100? How does the speed vary after 0, 1000, 1000000 files have been created/removed?
The number of inodes can be tuned in ext3 too. The default inode density is quite insane for an MP3 or movie collection with an average file size of 1MB or 100MB; for a mail or news spool, on the other hand, it is vital. Setting -T largefile or -T largefile4 with mke2fs has a huge impact on creation speed and frees up several GiB on large partitions too. Does it affect other things as well? (A sketch of these tunables follows below.)
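For illustration, the tunables mentioned above look roughly like this (device names are placeholders):

    tune2fs -m 0 /dev/hdb1                    # no root-reserved blocks on a non-system disk
    mke2fs -j -T largefile  /dev/hdb2         # far fewer inodes - suits large media files
    mke2fs -j -T largefile4 /dev/hdb3         # even fewer inodes - very large files only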
Regards,
Goswin
[ Parent | Reply to this comment ]
Could it be that there is some sort of hardware-based optimization for XFS, or that XFS could be made to use certain optimizations when a particular chip is detected?
That would definitely be an argument for buying such hardware to achieve even better results, right?
[ Parent | Reply to this comment ]
Anyway, you can read linux/fs/xfs in the kernel source to get some insight into this.
// Artem S. Tashkinov
[ Parent | Reply to this comment ]
I wondered if you could give me your permission to translate this article into Spanish in order to share it with my community (ubuntu-es).
[ Parent | Reply to this comment ]
Here is another benchmark report on the ReiserFS, Ext3, JFS and XFS filesystems:
http://arnofear.free.fr/linux/files/contrib/bench/bench.html
[ Parent | Reply to this comment ]
I've had all sorts of issues with ReiserFS (SLES 9.x) in moderately loaded applications... an NFS-mounted mail store for 10,000+ Maildir accounts. Recovery from failures also left a lot to be desired compared to ext3 and XFS. Despite good tools, I always experienced data corruption using ReiserFS.
What most readers don't understand is that there is no perfect filesystem for everyone... although ext3 comes damn close and should suit most everyone's general use.
Personally, I'm phasing out my SLES installs and running CentOS 4/5 servers. All my Samba/NFS filesystems get XFS and everything else gets ext3. When I moved my mail store from ReiserFS to XFS, all my IMAP users noticed a dramatic increase in performance. The server didn't change: dual Opteron 64-bit, 4GB RAM, 1TB RAID10 storage array.
[ Parent | Reply to this comment ]
If the source was ext3, it had an unfair advantage over ReiserFS and XFS.
Did you enable barriers for ext3?
If not, ReiserFS and XFS had an unfair disadvantage, since ext3 disables barriers by default, while XFS and ReiserFS enable them by default. If barriers are enabled, there should be (if you believe the kernel people) about a 30% penalty for ext3.
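A sketch of how one might level the playing field, assuming the 2.6-era option names (device names and mount points are placeholders):

    mount -t ext3     -o barrier=1     /dev/hda5 /mnt/ext3     # turn ext3 barriers on
    mount -t reiserfs -o barrier=flush /dev/hda6 /mnt/reiser   # the reiserfs default, made explicit
    mount -t xfs      -o nobarrier     /dev/hda7 /mnt/xfs      # or disable barriers on xfs instead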
[ Parent | Reply to this comment ]
I use ext3 because most tools are written for it, and everything in the Linux world supports it.
[ Parent | Reply to this comment ]