Sparse files are a great feature of Linux filesystems. They come in very handy when working with virtualization technologies like KVM. You don’t need to think long about how big to make a VM disk; just create a disk which is definitely big enough (I normally use 20GB for my Linux-based servers). If only 1GB is used, the file occupies only that amount of physical disk space and not the whole 20GB.
QEMU already creates sparse files by default when using raw images.
Example: qemu-img create myserver.img 20G
When you add the “s” option to the ls command, you see the real used size in the first column.
ls -lhs
realsize                     virtualsize
0 -rw-r--r-- 1 gergap gergap 20.0G Aug 10 11:27 myserver.img
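The same effect can be reproduced without QEMU. This is a minimal sketch, assuming GNU coreutils (truncate, stat) and a filesystem that supports sparse files; the file name demo.img is made up for illustration.

```shell
# Create a 20 MiB file that consists only of a hole (no data is written).
workdir=$(mktemp -d)
truncate -s 20M "$workdir/demo.img"

# Apparent (virtual) size in bytes, as applications see it.
apparent=$(stat -c %s "$workdir/demo.img")

# Allocated (real) size in bytes: allocated blocks times the block unit.
blocks=$(stat -c %b "$workdir/demo.img")
unit=$(stat -c %B "$workdir/demo.img")
allocated=$((blocks * unit))

echo "apparent: $apparent bytes, allocated: $allocated bytes"
```

On a filesystem with sparse file support the allocated size should stay far below the apparent size, which is what the first column of ls -s reports.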
However, these sparse files are a problem when copying them, especially when you need to move a disk image to another machine over the network.
Local copies: When copying files locally with tools that are not aware of sparse files, the whole 20GB will be copied. It may sound strange, but that’s the desired behavior. A sparse file with 20GB should look like a normal file to applications, so they see the complete 20GB, even though most of the data is just zeros.
Luckily the “cp” command is aware of sparse files and will autodetect whether a source is a sparse file. The copy then also becomes a sparse file and only the real data gets copied, which is much faster. If the source is not sparse you can use “cp --sparse=always source dest”; then the destination will become a sparse file.
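To see “cp --sparse=always” at work, here is a small sketch, assuming GNU cp and a filesystem with sparse file support. The file names full.img and sparse.img are only placeholders.

```shell
workdir=$(mktemp -d)

# Write 4 MiB of real zero bytes, so the source is NOT a sparse file.
dd if=/dev/zero of="$workdir/full.img" bs=1M count=4 2>/dev/null

# Ask cp to turn runs of zero bytes into holes in the destination.
cp --sparse=always "$workdir/full.img" "$workdir/sparse.img"

# Allocated blocks of source and copy (the copy should need fewer).
src_blocks=$(stat -c %b "$workdir/full.img")
dst_blocks=$(stat -c %b "$workdir/sparse.img")
echo "source: $src_blocks blocks, copy: $dst_blocks blocks"

# The contents are still byte-for-byte identical.
cmp "$workdir/full.img" "$workdir/sparse.img" && echo "contents identical"
```

The copy keeps the same apparent size as the source; only the allocation on disk differs.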
Now let’s come to network transfer. Most admins use rsync, which can copy a lot of files very quickly over SSH. rsync is very efficient at detecting which files have changed and only transmits those. So it’s easy to keep e.g. an FTP mirror in sync with its source or to implement backup strategies.
KVM images are different. You don’t have many files, but the files you have are huge sparse files. You don’t want to transmit 20GB over the network if only a few MB have changed in the disk image. Even transmitting 1GB of actually used data takes quite a long time.
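The solution is to use the “--inplace” option of rsync. With this option rsync updates the existing destination file in place instead of building a new copy, so only the changed blocks of the file are transmitted and written, not the whole file. The problem with “--inplace” is that it does not create sparse files.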
But rsync can handle sparse files when you pass the “--sparse” option. Unfortunately “--sparse” and “--inplace” cannot be used together.
Solution: When copying the file for the first time, which means it does not yet exist on the target server, use “rsync --sparse”. This creates a sparse file on the target server and copies only the used data of the sparse file.
When the file already exists on the target server and you only want to update it, use “rsync --inplace”. This will only transmit the changed blocks and can also append to the existing sparse file.
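The two cases above can be wrapped in a small script. This is only a sketch: the helper names pick_rsync_option and sync_image are made up, and it assumes SSH access to the target host and an rsync that does not accept both options at once.

```shell
# Hypothetical helper: choose the rsync option for a disk image.
# $1 = "yes" if the image already exists on the target, anything else if not.
pick_rsync_option() {
    if [ "$1" = "yes" ]; then
        echo "--inplace"   # update: write changed blocks into the existing file
    else
        echo "--sparse"    # first copy: create a sparse file on the target
    fi
}

# Hypothetical wrapper: $1 = local image, $2 = target host, $3 = remote path.
# Assumes the remote path contains no spaces or shell metacharacters.
sync_image() {
    src=$1
    host=$2
    dest=$3
    if ssh "$host" test -e "$dest"; then
        exists=yes
    else
        exists=no
    fi
    rsync -av "$(pick_rsync_option "$exists")" "$src" "$host:$dest"
}
```

A call would look like “sync_image myserver.img targethost /path/to/myserver.img”, where targethost and the path are placeholders for your own setup.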
I hope rsync will become smarter in the future and allow the combination of “--inplace --sparse”, or even autodetect the best strategy. But for now we have at least a working solution.
I hope this blog was helpful for understanding sparse files and rsync.
The dash-dash options are not well formatted, take care! This is --inplace or --sparse.
Thx, I fixed the formatting.
Thank you! This information is also consistent with information on serverfault, but better explained here.
http://serverfault.com/questions/66338/how-do-you-synchronise-huge-sparse-files-vm-disk-images-between-machines
Thanks! Thought something was wrong when I ran out of disk space moving images to a *larger* box.
I’m not sure why, but this did not work as expected in my setup, where I tried to transfer LXC-based sparse files via NFS between two storage systems.
Step 1: VM is still running. Do a first transfer, using --sparse.
Step 2: Verifying. The target file came out as a sparse file as intended!
Step 3: VM stopped. Do a sync of any changes that might have occurred in the meantime. This is done, as you suggested, using --inplace.
Step 4: Verifying. The target file now suddenly is not a sparse file any more and takes up 100% of the size.
Any other people around here that deal with the same issue?
(Setup: Linux 4.4.13, rsync 3.1.1)
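The verification in steps 2 and 4 can be scripted. A rough sketch, assuming GNU stat; the function name is_sparse is made up, and the check is only a heuristic, because a filesystem with transparent compression also allocates less than the apparent size.

```shell
# Hypothetical check: succeeds if the file occupies fewer bytes on disk
# than its apparent size, which is the case for a file with holes.
is_sparse() {
    blocks=$(stat -c %b "$1")   # number of allocated blocks
    unit=$(stat -c %B "$1")     # size in bytes of one such block
    size=$(stat -c %s "$1")     # apparent size in bytes
    [ $((blocks * unit)) -lt "$size" ]
}

# Example: a file created with truncate consists only of a hole.
checkdir=$(mktemp -d)
truncate -s 8M "$checkdir/hole.img"
if is_sparse "$checkdir/hole.img"; then
    echo "hole.img is sparse"
else
    echo "hole.img is not sparse"
fi
```

Running is_sparse on the target file after each rsync step shows whether the holes survived the transfer.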
Hi Daniel,
thx for that hint.
I haven’t checked this yet, but this is possible.
The purpose of my explanation was not to save disk space, but speed up copying.
The “--sparse” option avoids copying zeros over the network in the first fresh copy.
The “--inplace” option just transmits changed blocks.
If “--inplace” makes the file non-sparse, I didn’t notice this so far, but I also didn’t care.
I normally have enough disk space and the aim was just speeding things up, not saving disk space.
You can always make the file a sparse file again locally by using “cp --sparse=always”.
I’ll check this ASAP to see if I can reproduce your issue.
regards,
Gerhard
Follow-up note for you: “cp” on HP-UX does not handle sparse files and doesn’t have any option to do so.
And, unfortunately, rsync still can’t support both --inplace and --sparse either, which causes no end of problems when ISAM database files (full of holes) get rebuilt or ‘stretched’ (to add more logical size/space). rsync sees the rebuilt file, and --inplace then picks it up, copies it over and fills in all the holes. We have a regular --inplace rsync set up to another HP-UX server, so I then have to delete the filled-in file from the target server and manually rsync again using --sparse before the next --inplace job starts.
I am so sorry my article is in Danish; however, I made some quick comparisons between the different ways of transferring KVM images across the network … and nfs + cp won big time!
A file which scp and rsync took more than 10 minutes to transfer took only 1:30 through nfs.
http://www.specialhosting.dk/kvm-overforsel-af-images-sparse-files-mellem-servere/ (I am sure Google Translate will make it readable)
Hi. This is because you are using rsync with ssh, which encrypts all the traffic. With the rsync protocol this is much faster. Have a look at my other post: https://gergap.wordpress.com/2013/08/13/optimizing-speed-in-kvm-image-synchronization-using-rsync/