Linux - Software: This forum is for Software issues.
Having a problem installing a new program? Want to know which application is best for the job? Post your question in this forum.
06-21-2025, 12:42 AM | #1
Member
Registered: Oct 2022
Posts: 132

How to split a gzip into valid smaller gzip files?
How can I split a gzip file into chunks of a pre-determined size that are themselves valid gzip files?
Something like split -b 4095M example.img.gz would not work, given that it would cut through the internal structure of the gzip file. My goal is for the gzip chunks to be re-assemblable from multiple devices, so each gzip part file has to be valid on its own.
The gzip file itself should be of a predetermined size, not the content it expands to.
For re-assembling, the following does not help, because it requires all the files to be available at the same time, which defeats splitting across devices:
Code:
cat example.img.gz.part* | gzip -d -c >> example.img
What I want is this:
Code:
gzip -d -c example.img.part*.gz >> example.img
I want each part of the gzip to be a valid gzip file of its own, so the parts can be re-assembled from multiple devices.
The pre-determined size doesn't have to be exact, just within a few megabytes of the desired value. For example, if I want 4096 MiB chunks, something like 4090 MiB is acceptable too.
This is already possible with bzip2 by "abusing" bzip2recover. bzip2recover is intended for salvaging the undamaged parts of copies of damaged archives, but it can be used to break any bzip2 file into one file per bzip2 block. Multiple blocks can then be concatenated into parts of any desired approximate size.
The reason I prefer gzip for this purpose is its much faster speed.
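For reference, the bzip2 route looks roughly like this (a sketch, untested; it assumes GNU stat and the rec* file names that bzip2recover emits):
Code:
# Break the archive into one valid .bz2 stream per compressed block.
bzip2recover example.img.bz2    # emits rec00001example.img.bz2, rec00002..., etc.

# Regroup the per-block files into ~4096 MiB parts. Concatenated bzip2
# streams form a valid bzip2 file, so every part stays decompressable.
limit=$((4096 * 1024 * 1024))
part=1; size=0
out=$(printf 'example.img.part%03d.bz2' "$part")
for f in rec*example.img.bz2; do
    fsize=$(stat -c %s "$f")
    if [ "$size" -gt 0 ] && [ $((size + fsize)) -gt "$limit" ]; then
        part=$((part + 1)); size=0
        out=$(printf 'example.img.part%03d.bz2' "$part")
    fi
    cat "$f" >> "$out"
    size=$((size + fsize))
done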
Last edited by exerceo; 06-21-2025 at 02:26 PM.
Reason: edited title for clarity
06-21-2025, 03:38 AM | #2
Senior Member
Registered: Jul 2020
Posts: 1,610
Can't be done. You'll have to use an archiver with multi-volume support - 7z or rar - and if speed is that important, reduce the compression level. Alternatively, you may split a single gzip archive using split and combine it back into a named pipe attached at the other end to the decompressing process; this way it will wait for the parts. A technique to prevent the pipe from closing after each file is described e.g. here: https://superuser.com/questions/7664...to-named-pipes
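Roughly like this (a sketch; paths and part names are hypothetical):
Code:
mkfifo /tmp/gz.fifo
gzip -d -c < /tmp/gz.fifo > example.img &   # decompressor blocks waiting for data

exec 3> /tmp/gz.fifo    # hold a writer open so the pipe never sees EOF between parts
cat /mnt/usb/example.img.gz.partaa >&3
# ...swap USB sticks, then:
cat /mnt/usb/example.img.gz.partab >&3
exec 3>&-               # close the writer; gzip sees EOF and finishes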
Last edited by lvm_; 06-21-2025 at 03:50 AM.
06-21-2025, 04:13 AM | #3
LQ Veteran
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,429
I was thinking along similar lines. Maybe lzip as well - I like the idea of lziprecover and ddrescue, but I haven't pursued it as I should have.
06-21-2025, 04:31 AM | #4
LQ Addict
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 24,578
Quote:
Originally Posted by exerceo
How can I split a gzip file into chunks of a pre-determined size that are themselves valid gzip files?
The gzip file itself should be of a predetermined size, not the content it expands to.
That is just wrong. The size of the result is determined by the user and the parameters passed to gzip.
gzip produces a single compressed archive.
If you want smaller archives, you need to split the original files.
Additionally, the size of the result cannot be calculated without actually doing the compression (it depends on the content and on the algorithm), so you can never predict the exact size.
You can also run gzip file by file and group the results into your preferred layout, though that will take a very long time.
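For example, a sketch of that file-by-file route (assuming the data is a directory tree rather than a single image; the directory name is a placeholder):
Code:
# Compress each file individually; every .gz is independently valid and
# the results can be grouped onto devices in any way you like.
find bigdir -type f -exec gzip -k -- {} +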
Even in the case of 7zip you need to have all the parts available to be able to unpack: https://askubuntu.com/questions/1342...t-7zip-archive.
By the way, there is a gzip recovery tool too, but that still won't help here.
06-21-2025, 11:21 AM | #6
Member (Original Poster)
Registered: Oct 2022
Posts: 132

gzip splitting
Quote:
Originally Posted by lvm_
Can't be done.
Quote:
Originally Posted by pan64
That is just wrong. The size of the result is determined by the user and the parameters passed to gzip. gzip produces a single compressed archive. If you want smaller archives, you need to split the original files. Additionally, the size of the result cannot be calculated without actually doing the compression (it depends on the content and on the algorithm), so you can never predict the exact size. You can also run gzip file by file and group the results into your preferred layout, though that will take a very long time.
While it is indeed impossible to predict a compressed size without doing the compression work, it should be technically possible to check the size of the gzip while it is being created. Once it comes as close as possible below the preferred size, the compressor could close the file and start a new gzip file. My goal is that each gzip is a valid file on its own.
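The closest I have found with stock GNU tools caps the uncompressed chunk size rather than the compressed part size (a sketch; each part is still a complete gzip stream on its own):
Code:
# GNU split pipes each 4 GiB uncompressed chunk through its own gzip,
# so every part is a valid gzip file; compressed part sizes will vary.
split -b 4G --filter='gzip -c > "$FILE.gz"' example.img example.img.part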
Quote:
Originally Posted by boughtonp
I want to store a huge tar file, compressed, across smaller flash drives, so it can be reassembled later by gunzipping the individual gzips back into the original uncompressed file. I want a pre-determined size in order to waste as little space as possible.
Splitting after gzipping would require two passes to get back to the original data: first reassembling the gzip file, then decompressing it, because splitting a gzip at an arbitrary position corrupts the data near the cut points.
But if the data were split across intact gzip files, you could gzip -d -c each gzip file back into the original file in a single pass. You don't have to reassemble the gzip file; you can directly reassemble the original uncompressed file from the gzip pieces.
I am surprised nothing like this has been implemented in decades. It is clearly doable.
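Restoring would then reduce to something like this, one stick at a time (hypothetical device and mount point):
Code:
: > example.img
for n in 1 2 3; do
    printf 'Insert stick %s and press Enter: ' "$n"; read -r _
    mount /dev/sdb1 /mnt/usb
    for p in /mnt/usb/example.img.part*.gz; do
        gzip -d -c "$p" >> example.img   # each part is a valid gzip stream
    done
    umount /mnt/usb
done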
06-21-2025, 11:39 AM | #7
LQ Addict
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 24,578
Quote:
Originally Posted by exerceo
While it is indeed impossible to predict a compressed size without doing the compression work, it should be technically possible to check the size of the gzip while it is being created. Once it comes as close as possible below the preferred size, the compressor could close the file and start a new gzip file. My goal is that each gzip is a valid file on its own.
No, gzip by itself cannot split input based on the compressed size. Although that looks achievable, it has not been implemented [yet]. Most probably because you need to concatenate the uncompressed parts to get the original file, so a partial compressed archive contains nothing usable on its own (only together with the other parts).
Quote:
Originally Posted by exerceo
But if the data were split across intact gzip files, you could gzip -d -c each gzip file back into the original file in a single pass. [...] I am surprised nothing like this has been implemented in decades. It is clearly doable.
Reassembling a split archive is a single command. Use a compressor that can manage it automatically, so you will not need to do that extra step by hand.
06-21-2025, 12:03 PM | #8
Moderator
Registered: Aug 2002
Posts: 26,881
tar can produce multi-volume archives, but they cannot be compressed.
The zip format supports multi-volume archives of a fixed size. You do not need to assemble the individual files, but they all need to be in the same directory to unzip. There used to be a known problem, which might still exist, where volumes whose size was a multiple of the buffer size (i.e. 16 KiB) would not unzip correctly. I have not created multi-volume zip files in decades.
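Rough invocations of both, for illustration (sizes and names are placeholders):
Code:
# tar multi-volume (uncompressed only); -L is the volume size in units of
# 1024 bytes, and tar prompts for the next volume when one fills up.
tar -cM -L 4194304 -f /mnt/usb/backup.tar bigdir/

# zip split archive with fixed 4 GiB volumes (backup.z01, backup.z02, ..., backup.zip).
zip -s 4g -r backup.zip bigdir/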
06-21-2025, 01:05 PM | #9
LQ Guru
Registered: Oct 2004
Distribution: Arch
Posts: 5,476
My 2 cents.
Code:
zcat file.gz | split -l 200 - file.part
#or
gunzip -c file.gz | split -l 200 - file.part
#or
gzip -c file | split -b 1024m - file.gz.part
cat file.gz.part* > file.gz
Code:
file /usr/share/man/man1/bash.1.gz
/usr/share/man/man1/bash.1.gz: gzip compressed data, max compression, from Unix, original size modulo 2^32 351525
ls -l /usr/share/man/man1/bash.1.gz
-rw-r--r-- 1 root root 97104 Mar 11 16:53 /usr/share/man/man1/bash.1.gz
Let me try that with a man page.
Code:
zcat /usr/share/man/man1/bash.1.gz | split -l 1000 - file.part
And that gave me file.partaa to file.partal
Code:
file file.partaa
file.partaa: troff or preprocessor input, ASCII text
ls -l file.partaa
-rw-r--r-- 1 me me 30259 Jun 21 12:49 file.partaa
And that is readable:
Code:
man ~/file.partaa
BASH(1) General Commands Manual BASH(1)
NAME
bash - GNU Bourne-Again SHell
SYNOPSIS
bash [options] [command_string | file]
COPYRIGHT
Bash is Copyright (C) 1989-2022 by the Free Software Foundation, Inc.
DESCRIPTION
Bash is an sh-compatible command language interpreter that executes commands
read from the standard input or from a file. Bash also incorporates useful
features from the Korn and C shells (ksh and csh).
...
Code:
cat file.part* > test.gz
file test.gz
test.gz: troff or preprocessor input, ASCII text
ls -l test.gz
-rw-r--r-- 1 me me 351525 Jun 21 12:52 test.gz
And that reads all the way through with man.
Code:
gzip -c test.gz > test2.gz
file test2.gz
test2.gz: gzip compressed data, was "test.gz", last modified: Sat Jun 21 17:52:47 2025, from Unix, original size modulo 2^32 351525
ls -l test2.gz
-rw-r--r-- 1 me me 97654 Jun 21 13:02 test2.gz
Code:
man test2.gz
BASH(1) General Commands Manual BASH(1)
NAME
bash - GNU Bourne-Again SHell
SYNOPSIS
bash [options] [command_string | file]
COPYRIGHT
Bash is Copyright (C) 1989-2022 by the Free Software Foundation, Inc.
DESCRIPTION
Bash is an sh-compatible command language interpreter that executes commands
read from the standard input or from a file. Bash also incorporates useful
features from the Korn and C shells (ksh and csh).
Bash is intended to be a conformant implementation of the Shell and Utilities
portion of the IEEE POSIX specification (IEEE Standard 1003.1). Bash can be
configured to be POSIX-conformant by default.
...
Last edited by teckk; 06-21-2025 at 01:08 PM.
06-21-2025, 01:33 PM | #10
Member (Original Poster)
Registered: Oct 2022
Posts: 132
Quote:
Originally Posted by pan64
Reassembling a split archive is a single command. Use a compressor that can manage it automatically, so you will not need to do that extra step by hand.
The problem is that it takes time to finish, and takes a lot of space too.
Running split on a .gz file slices through the internal structure of the gzip file, so you will not get back the original data if you gzip -d -c the resulting pieces.
With current methods, you cannot get back to the original uncompressed data in a single command unless all USB sticks with all the gzip parts are inserted in the computer at the same time. Then you could concatenate (cat) all the parts and pipe them through gzip -d -c >> original_file. But if you don't have enough USB ports, you have to cycle through the USB sticks to get back to the original file.
The only way to do that in a single pass is to have gzip part files that are valid themselves. And this seems to be a feature gap.
Quote:
Originally Posted by pan64
...so a partial compressed archive contains nothing usable on its own (only together with the other parts).
Actually, the structure of the tar format (no centralized TOC, just a sequence of headers, each followed by its file's contents) would indeed allow recovering parts of incomplete tar files. But the reason I want valid gzip files is so I can cycle through USB sticks and get back to the original file in a single pass.
Quote:
Originally Posted by michaelk
You do not need to assemble the individual files, but they all need to be in the same directory to unzip.

Unfortunately, this makes it impossible to split the archive across different devices.
Quote:
Originally Posted by teckk
My 2 cents.

I am looking to split the compressed data into valid, non-truncated gzip files, not the uncompressed data, but thanks anyway for trying.
Last edited by exerceo; 06-21-2025 at 01:37 PM.
Reason: plural
06-22-2025, 04:07 AM | #11
LQ Addict
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 24,578
Quote:
Originally Posted by exerceo
The problem is that it takes time to finish, and takes a lot of space too.
Just think about it twice: it will take space anyway; you can only change "when" - before or after decompressing. And actually, before decompressing it needs less space.
Decompressing a split archive won't require any additional space: just insert the next USB disk and continue the operation until the full file is restored.
Quote:
Originally Posted by exerceo
With current methods, you cannot get back to the original uncompressed data in a single command unless all USB sticks with all the gzip parts are inserted in the computer at the same time. [...] The only way to do that in a single pass is to have gzip part files that are valid themselves. And this seems to be a feature gap.
No, that is wrong. The decompressor can handle the changing of USB sticks. This has already been implemented and has worked for decades (yes, decades: first with floppy drives).
Quote:
Originally Posted by exerceo
Actually, the structure of the tar format (no centralized TOC, just a sequence of headers, each followed by its file's contents) would indeed allow recovering parts of incomplete tar files. But the reason I want valid gzip files is so I can cycle through USB sticks and get back to the original file in a single pass.
You cannot restore anything from a random part/slice of a tar file; recovering anything is very unlikely, and certainly not in one pass. You would need to reconstruct the whole tar archive to be able to use its contents.
Again, if the goal is to get back the original file, it is completely irrelevant whether the parts are compressed before or after the split.
Quote:
Originally Posted by exerceo
I am looking to split the compressed data into valid, non-truncated gzip files, not the uncompressed data, but thanks anyway for trying.
Let's implement it.
Last edited by pan64; 06-22-2025 at 05:00 AM.
06-23-2025, 07:14 AM | #12
Member (Original Poster)
Registered: Oct 2022
Posts: 132
Quote:
Originally Posted by pan64
Decompressing a split archive won't require any additional space: just insert the next USB disk and continue the operation until the full file is restored.
It requires additional space during the first pass of the decompression.
Before getting back the original uncompressed file, you'd have to reassemble the gzip file (pass 1) and then decompress it (pass 2). Only after that can you delete the reassembled gzip file. But with the approach I described above, you'd skip the first pass and get directly back to the uncompressed data.
Quote:
Originally Posted by pan64
The decompressor can handle the changing of USB sticks. This has already been implemented and has worked for decades (yes, decades: first with floppy drives).
Let me try.
Code:
$ split example_file.gz -b 1M
$ ls x*
xaa xac xae xag xai xak xam xao xaq xas xau xaw xay xba xbc
xab xad xaf xah xaj xal xan xap xar xat xav xax xaz xbb xbd
$ for file in x*; do gzip -d -c "$file" >> example_file.uncompressed ; done
gzip: xaa: unexpected end of file
gzip: xab: not in gzip format
gzip: xac: not in gzip format
...
gzip: xbd: not in gzip format
Quote:
Originally Posted by pan64
You cannot restore anything from a random part/slice of a tar file; recovering anything is very unlikely.
You can restore some files by dumping the data starting from the next tar header, because each tar header carries the metadata (name, size, modification time) for one file.
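A sketch of that recovery idea (assuming the slice was cut on a 512-byte boundary; matching on the magic string alone is simplistic, and the file name is hypothetical):
Code:
# Find the first tar header by its "ustar" magic (offset 257 within the
# 512-byte header), then extract from that block onward.
off=$(LC_ALL=C grep -abo ustar fragment.bin | head -n1 | cut -d: -f1)
dd if=fragment.bin bs=512 skip=$(( (off - 257) / 512 )) | tar -xf - --ignore-zeros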
7z and zip use centralized tables of contents, meaning the file listing loads immediately, but recovering portions, or appending new files without rewriting the entire archive, is not possible.
Should this be an option for gzip, or a separate tool named something like gzsplit? It could reuse much of the code from gzip.
06-23-2025, 07:34 AM | #13
LQ Addict
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 24,578
Theoretically, a decompressor can read those disks one by one on the fly and reconstruct the original file. This can be dangerous, though: if one of the disks is damaged, you could lose everything. gzip cannot do this; you would need another tool ("should it be an option" is not enough to make it work that way).
You still cannot restore anything from a random part/slice of a tar file; that is very unlikely, especially if no single file fits within the slice. You may occasionally recover a few files, if you can identify them at all, but that is not an acceptable solution.
Additionally, if you want independently decompressable slices, you need to add overhead to every slice, and you cannot use the full slice size (as discussed before), so at the end you may need more slices.
06-23-2025, 08:19 AM | #14
Moderator
Registered: Aug 2002
Posts: 26,881
tar and gzip are separate utilities, gzip being only a compression tool; 7zip and zip are each both an archiver and a compressor. I think that to fully implement this you would need something new. You could probably approximate it with a script of some sort, but it would be messy, as sketched below.
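For instance, something like this rough, untested sketch. Each part is a concatenation of complete gzip members, so each part is a valid gzip file by itself; a part can overshoot the limit by up to one member, so the limit should sit safely below the device size:
Code:
#!/bin/sh
# Hypothetical "gzsplit": read stdin, emit gzip parts of roughly LIMIT
# compressed bytes, each made of complete gzip members.
LIMIT=$((4000 * 1024 * 1024))   # target compressed bytes per part
CHUNK=$((64 * 1024 * 1024))     # uncompressed bytes per gzip member
part=1
out=$(printf 'part%03d.gz' "$part")
: > "$out"
while head -c "$CHUNK" > chunk.tmp && [ -s chunk.tmp ]; do
    gzip -c chunk.tmp >> "$out"             # append one complete member
    if [ "$(stat -c %s "$out")" -ge "$LIMIT" ]; then
        part=$((part + 1))
        out=$(printf 'part%03d.gz' "$part")
        : > "$out"
    fi
done
[ -s "$out" ] || rm -f "$out"   # drop an empty trailing part, if any
rm -f chunk.tmp

Restoring is then one pass, either gzip -d -c part*.gz >> original in one go, or part by part as sticks are swapped.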
The old DOS pkzip utility created multi-volume floppy archives that would prompt for the next disk, although it kept a centralized TOC on the last disk.