LinuxQuestions.org
Old 06-21-2025, 12:42 AM   #1
exerceo
Member
 
Registered: Oct 2022
Posts: 132

Rep: Reputation: 30
How to split a gzip into valid smaller gzip files?


How do I split a gzip file into chunks of a pre-determined size that are valid gzip files themselves?

Something like split -b 4095M example.img.gz would not work, given that it would cut through the internal structure of the gzip file. My goal is for the gzip chunks to be re-assemblable from multiple devices, so each gzip part file has to be valid on its own.

The gzip file itself should be of a predetermined size, not the content it expands to.

For re-assembly, the following does not help, because it requires all the parts to be available at the same time, which defeats splitting across devices:
Code:
cat example.img.gz.part* | gzip -d -c >> example.img
What I want is this:
Code:
gzip -d -c example.img.part*.gz >> example.img
I want each part of the gzip to be a valid gzip file of its own, so the original can be reassembled from multiple devices.

The pre-determined size doesn't have to be exact, but within a few megabytes of a desired value. For example, if I want 4096 MiB chunks, something like 4090 MiB is acceptable too.

This is already possible with bzip2 by "abusing" bzip2recover. Normally, bzip2recover is intended for salvaging the undamaged parts of damaged archives, but it can be used to break any bzip2 file down into one file per bzip2 block. This way, multiple blocks can be concatenated into parts of any desired approximate size.
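For illustration, here is a rough sketch of that bzip2recover trick; the file names and the 100 MiB target are made up, and stat -c %s assumes GNU stat. Since a concatenation of bzip2 streams is itself a valid bzip2 file, the per-block files can simply be grouped with cat:
Code:
bzip2recover example.img.bz2            # writes one valid .bz2 per block,
                                        # named like rec00001example.img.bz2
TARGET=$((100 * 1024 * 1024))           # approximate part size in bytes
n=0 size=0
part=$(printf 'example.img.part%02d.bz2' "$n"); : > "$part"
for blk in rec*example.img.bz2; do
    add=$(stat -c %s "$blk")
    if [ "$size" -gt 0 ] && [ $((size + add)) -gt "$TARGET" ]; then
        n=$((n + 1)) size=0             # part is full: start the next one
        part=$(printf 'example.img.part%02d.bz2' "$n"); : > "$part"
    fi
    cat "$blk" >> "$part"               # append whole blocks, never split them
    size=$((size + add))
done
Each resulting part then decompresses on its own with bzip2 -d -c.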

The reason I prefer gzip for this purpose is its much faster speed.

Last edited by exerceo; 06-21-2025 at 02:26 PM. Reason: edited title for clarity
 
Old 06-21-2025, 03:38 AM   #2
lvm_
Senior Member
 
Registered: Jul 2020
Posts: 1,610

Rep: Reputation: 559
Can't be done. You'll have to use an archiver with multi-volume support - 7z or rar - and if speed is so important, reduce the compression quality. Or you may split a single gzip archive using split and combine it back into a named pipe attached at the other end to the decompressing process; this way it will wait for the parts. A technique to prevent the pipe from closing after each file is described e.g. here https://superuser.com/questions/7664...to-named-pipes
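A minimal sketch of that, with illustrative file names; the spare writer held open on fd 3 is the trick from the linked answer that stops the reader from seeing EOF between parts:
Code:
split -b 4095M example.img.gz example.img.gz.part
mkfifo gzpipe
gzip -d -c < gzpipe > example.img &     # reader blocks until data arrives
exec 3> gzpipe                          # spare writer keeps the pipe open
for p in example.img.gz.part*; do       # swap media between iterations if needed
    cat "$p" > gzpipe
done
exec 3>&-                               # release the spare writer: reader sees EOF
wait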

Last edited by lvm_; 06-21-2025 at 03:50 AM.
 
Old 06-21-2025, 04:13 AM   #3
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,429

Rep: Reputation: 4200
I was thinking along similar lines. Maybe lzip as well - I like the idea of lziprecover and ddrescue but haven't pursued them as I should have.
 
Old 06-21-2025, 04:31 AM   #4
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 24,578

Rep: Reputation: 8078
Quote:
Originally Posted by exerceo View Post
How do I split a gzip file into chunks of a pre-determined size that are valid gzip files themselves?

The gzip file itself should be of a predetermined size, not the content it expands to.
That is just wrong: the size of the "result" is determined by the user and by the parameters passed to gzip.
gzip produces a single compressed archive.
If you want to make smaller archives, you need to split the original files.
Additionally, the size of the result cannot be calculated without actually doing the compression (it depends on the content and on the algorithm), so you can never predict its exact size.
You can also run gzip file by file and juggle the results to construct your preferred archive, but it will take a very long time.

Even in the case of 7zip you need to have all the parts available to be able to unpack: https://askubuntu.com/questions/1342...t-7zip-archive.
By the way, there is a gzip recovery tool too, but that still won't help here.
 
Old 06-21-2025, 07:28 AM   #5
boughtonp
Senior Member
 
Registered: Feb 2007
Location: UK
Distribution: Debian
Posts: 4,012

Rep: Reputation: 2886
Quote:
Originally Posted by exerceo View Post
My goal is for the gzip chunks to be re-assemblable from multiple devices
Why don't you describe what you're actually trying to achieve?

 
Old 06-21-2025, 11:21 AM   #6
exerceo
Member
 
Registered: Oct 2022
Posts: 132

Original Poster
Rep: Reputation: 30
gzip splitting

Quote:
Originally Posted by lvm_ View Post
Can't be done.
Quote:
Originally Posted by pan64 View Post
That is just wrong: the size of the "result" is determined by the user and by the parameters passed to gzip.
gzip produces a single compressed archive.
If you want to make smaller archives, you need to split the original files.
Additionally, the size of the result cannot be calculated without actually doing the compression (it depends on the content and on the algorithm), so you can never predict its exact size.
You can also run gzip file by file and juggle the results to construct your preferred archive, but it will take a very long time.
While it is indeed impossible to predict a compressed size without doing the compression work, it should be technically possible to check the size of the gzip file while it is being created. Once it comes as close as possible below the preferred size, the compressor should close that file and start a new gzip file. My goal is that each gzip is a valid file of its own.

I want to store a huge tar file, with compression, across smaller flash drives so it can be reassembled later by gunzipping the individual gzip files back into the original uncompressed file. I want a pre-determined size in order to waste as little space as possible.

Splitting after gzipping would require two passes to get back to the original data: first reassembling the gzip file, then decompressing it, because splitting a gzip at an arbitrary position corrupts the data near the cut points.

But if you had the data split across intact gzip files, you could gzip -d -c each gzip file back into the original file in a single pass. You don't have to reassemble the gzip file; you can directly reassemble the original uncompressed file from the gzip pieces.
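With such parts, the restore really would be one pass, one stick at a time; a sketch with a hypothetical mount point and part count:
Code:
: > example.img
for i in 00 01 02 03; do                # one iteration per part/stick
    printf 'Insert the stick holding part %s, then press Enter: ' "$i"
    read -r _
    gzip -d -c "/media/usb/example.img.part$i.gz" >> example.img
done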

I am surprised nothing like this has been implemented in decades. It is clearly doable.
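For instance, here is a minimal sketch of the idea built from stock tools, relying on the fact that gzip -d accepts a concatenation of complete gzip members as one valid gzip file; the chunk size, the names, and GNU stat -c %s are assumptions:
Code:
TARGET=$((4096 * 1024 * 1024))          # preferred compressed part size
n=0
part=$(printf 'example.img.part%02d.gz' "$n"); : > "$part"
while dd bs=64M count=1 of=chunk.tmp 2>/dev/null; [ -s chunk.tmp ]; do
    gzip -c chunk.tmp > chunk.tmp.gz    # one complete gzip member per chunk
    have=$(stat -c %s "$part"); add=$(stat -c %s chunk.tmp.gz)
    if [ "$have" -gt 0 ] && [ $((have + add)) -gt "$TARGET" ]; then
        n=$((n + 1))                    # next member would overshoot: rotate
        part=$(printf 'example.img.part%02d.gz' "$n"); : > "$part"
    fi
    cat chunk.tmp.gz >> "$part"
done < example.img
rm -f chunk.tmp chunk.tmp.gz
Each part lands within one compressed chunk below the target, which matches the "within a few megabytes" tolerance I described.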
 
Old 06-21-2025, 11:39 AM   #7
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 24,578

Rep: Reputation: 8078
Quote:
Originally Posted by exerceo View Post
While it is indeed impossible to predict a compressed size without doing the compression work, it should be technically possible to check the size of the gzip file while it is being created. Once it comes as close as possible below the preferred size, the compressor should close that file and start a new gzip file. My goal is that each gzip is a valid file of its own.
No, gzip by itself cannot split input files based on the compressed size. Although that looks achievable, it has not been implemented [yet], most probably because you need to concatenate the uncompressed parts to get the original file, so the partial compressed archives will not contain anything usable on their own (only together with the other parts).

Quote:
Originally Posted by exerceo View Post

I want to store a huge tar file, with compression, across smaller flash drives so it can be reassembled later by gunzipping the individual gzip files back into the original uncompressed file. I want a pre-determined size in order to waste as little space as possible.

Splitting after gzipping would require two passes to get back to the original data: first reassembling the gzip file, then decompressing it, because splitting a gzip at an arbitrary position corrupts the data near the cut points.

But if you had the data split across intact gzip files, you could gzip -d -c each gzip file back into the original file in a single pass. You don't have to reassemble the gzip file; you can directly reassemble the original uncompressed file from the gzip pieces.

I am surprised nothing like this has been implemented in decades. It is clearly doable.
Reassembling a split archive is a single command. Use a compressor that can manage it automatically, so you will not need to do that extra step by hand.
 
Old 06-21-2025, 12:03 PM   #8
michaelk
Moderator
 
Registered: Aug 2002
Posts: 26,881

Rep: Reputation: 6364
tar can produce multi-volume archives, but they cannot be compressed.

The zip format supports multi-volume archives of a fixed size. You do not need to assemble the individual files but they need to be all in the same directory to unzip. There used to be a known problem, which might still exist, where volumes whose size was a multiple of the buffer size (i.e. 16 KiB) would not unzip correctly. I have not created multi-volume zip files in decades.
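If I remember right, with Info-ZIP that looks roughly like this (sizes and names are illustrative); -s sets the split size at creation time, and -s 0 with --out re-joins the pieces into a single archive before unzipping:
Code:
zip -s 4g -r archive.zip example.img    # writes archive.z01, archive.z02, ..., archive.zip
# later, with every piece in the same directory:
zip -s 0 archive.zip --out whole.zip    # re-join the splits into one archive
unzip whole.zip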
 
Old 06-21-2025, 01:05 PM   #9
teckk
LQ Guru
 
Registered: Oct 2004
Distribution: Arch
Posts: 5,476
Blog Entries: 7

Rep: Reputation: 1986
My 2 cents.

Code:
zcat file.gz | split -l 200 - file.part
#or
gunzip -c file.gz | split -l 200 - file.part
#or
gzip -c file | split -b 1024m - file.gz.part

cat file.part* > file.gz
Code:
file /usr/share/man/man1/bash.1.gz
/usr/share/man/man1/bash.1.gz: gzip compressed data, max compression, from Unix, original size modulo 2^32 351525

ls -l /usr/share/man/man1/bash.1.gz
-rw-r--r-- 1 root root 97104 Mar 11 16:53 /usr/share/man/man1/bash.1.gz
Let me try that with a man page.
Code:
zcat /usr/share/man/man1/bash.1.gz | split -l 1000 - file.part
And that gave me file.partaa to file.partal
Code:
file file.partaa
file.partaa: troff or preprocessor input, ASCII text

ls -l file.partaa
-rw-r--r-- 1 me me 30259 Jun 21 12:49 file.partaa
And that is readable:
Code:
man ~/file.partaa
BASH(1)                        General Commands Manual                       BASH(1)
 
NAME
       bash - GNU Bourne-Again SHell
 
SYNOPSIS
       bash [options] [command_string | file]
 
COPYRIGHT
       Bash is Copyright (C) 1989-2022 by the Free Software Foundation, Inc.
 
DESCRIPTION
       Bash  is an sh-compatible command language interpreter that executes commands
       read from the standard input or from a file.  Bash also  incorporates  useful
       features from the Korn and C shells (ksh and csh).
...
Code:
cat file.part** > test.gz

file test.gz
test.gz: troff or preprocessor input, ASCII text

ls -l test.gz
-rw-r--r-- 1 me me 351525 Jun 21 12:52 test.gz
And that reads all the way through with man.

Code:
gzip -c test.gz > test2.gz

file test2.gz
test2.gz: gzip compressed data, was "test.gz", last modified: Sat Jun 21 17:52:47 2025, from Unix, original size modulo 2^32 351525

ls -l test2.gz
-rw-r--r-- 1 me me 97654 Jun 21 13:02 test2.gz
Code:
man test2.gz
BASH(1)                        General Commands Manual                       BASH(1)
 
NAME
       bash - GNU Bourne-Again SHell
 
SYNOPSIS
       bash [options] [command_string | file]
 
COPYRIGHT
       Bash is Copyright (C) 1989-2022 by the Free Software Foundation, Inc.
 
DESCRIPTION
       Bash  is an sh-compatible command language interpreter that executes commands
       read from the standard input or from a file.  Bash also  incorporates  useful
       features from the Korn and C shells (ksh and csh).
 
       Bash is intended to be a conformant implementation of the Shell and Utilities
       portion  of the IEEE POSIX specification (IEEE Standard 1003.1).  Bash can be
       configured to be POSIX-conformant by default.
...

Last edited by teckk; 06-21-2025 at 01:08 PM.
 
Old 06-21-2025, 01:33 PM   #10
exerceo
Member
 
Registered: Oct 2022
Posts: 132

Original Poster
Rep: Reputation: 30
Quote:
Originally Posted by pan64 View Post
Reassembling a split archive is a single command. Use a compressor that can manage it automatically, so you will not need to do that extra step by hand.
The problem is that it takes time to finish, and takes lots of space too.

Running split on a .gz file would slice through the internal structure of the gzip file, so you will not get back to the original data if you gzip -d -c the resulting files.

With current methods, you cannot get back to the original uncompressed data in a single command unless all USB sticks with all the gzip parts are inserted into the computer at the same time. Then you could concatenate (cat) all the parts and pipe them to gzip -d -c >> original_file. But if you don't have enough USB ports, you have to cycle through the USB sticks to get back to the original file.

The only way to do that in a single pass is to have gzip part files that are valid themselves. And this seems to be a feature gap.

Quote:
Originally Posted by pan64 View Post
so the partial compressed archives will not contain anything usable on their own (only together with the other parts).
Actually, the structure of the TAR format (no centralized TOC, just a bunch of headers with their file contents) would indeed allow recovering parts of incomplete TAR files. But the reason I want valid gzip files is so I can cycle through USB sticks and get back to the original file in a single pass.


Quote:
Originally Posted by michaelk View Post
You do not need to assemble the individual files but they need to be all in the same directory to unzip.
Unfortunately, this makes it impossible to split it across different devices.

Quote:
Originally Posted by teckk View Post
My 2 cents.
I am looking to split the compressed data into valid non-truncated gzip files, not the uncompressed data, but thanks anyway for trying.

Last edited by exerceo; 06-21-2025 at 01:37 PM. Reason: plural
 
Old 06-22-2025, 04:07 AM   #11
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 24,578

Rep: Reputation: 8078
Quote:
Originally Posted by exerceo View Post
The problem is that it takes time to finish, and takes lots of space too.
Just think it through: it will take space either way. You can only choose "when": before or after decompressing. And before decompressing it actually needs less space.
Decompressing a split archive won't require any additional space: just insert the next USB disk and continue the operation until the full file is restored.
Quote:
Originally Posted by exerceo View Post

With current methods, you cannot get back to the original uncompressed data in a single command unless all USB sticks with all the gzip parts are inserted into the computer at the same time. Then you could concatenate (cat) all the parts and pipe them to gzip -d -c >> original_file. But if you don't have enough USB ports, you have to cycle through the USB sticks to get back to the original file.

The only way to do that in a single pass is to have gzip part files that are valid themselves. And this seems to be a feature gap.
No, that is wrong. A decompressor can handle swapping USB sticks. That has been implemented and working for decades (yes, decades - first with floppy drives).

Quote:
Originally Posted by exerceo View Post
Actually, the structure of the TAR format (no centralized TOC, just a bunch of headers with their file contents) would indeed allow recovering parts of incomplete TAR files. But the reason I want valid gzip files is so I can cycle through USB sticks and get back to the original file in a single pass.
You cannot restore anything from a random part/slice of a tar file; that is very unlikely. And not in one pass: you will need to reconstruct the whole tar archive to be able to use the content.
Again, if the goal is to get back the original file, it is completely irrelevant whether those parts are compressed before or after the split.
Quote:
Originally Posted by exerceo View Post

I am looking to split the compressed data into valid non-truncated gzip files, not the uncompressed data, but thanks anyway for trying.
Let's implement it.

Last edited by pan64; 06-22-2025 at 05:00 AM.
 
Old 06-23-2025, 07:14 AM   #12
exerceo
Member
 
Registered: Oct 2022
Posts: 132

Original Poster
Rep: Reputation: 30
Quote:
Originally Posted by pan64 View Post
Decompressing a split archive won't require any additional space: just insert the next USB disk and continue the operation until the full file is restored.
It requires additional space during the first pass of the decompression.

Before getting back to the original uncompressed file, you'd have to reassemble the gzip file (pass 1) and then decompress it into the original file (pass 2). Only after that can you delete the reassembled gzip file. But with the approach I described above, you'd skip the first pass and go directly back to the uncompressed data.
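Spelled out with plain split parts (paths illustrative), the intermediate .gz takes up space until pass 2 finishes:
Code:
# pass 1: reassemble the compressed file, one stick at a time
: > example.img.gz
for i in aa ab ac; do                   # one iteration per part/stick
    cat "/media/usb/example.img.gz.part$i" >> example.img.gz
done
# pass 2: decompress, only possible once the whole .gz exists
gzip -d -c example.img.gz > example.img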

Quote:
A decompressor can handle swapping USB sticks. That has been implemented and working for decades (yes, decades - first with floppy drives).
Let me try.

Code:
$ split example_file.gz -b 1M
$ ls x*
xaa  xac  xae  xag  xai  xak  xam  xao  xaq  xas  xau  xaw  xay  xba  xbc
xab  xad  xaf  xah  xaj  xal  xan  xap  xar  xat  xav  xax  xaz  xbb  xbd
$ for file in x*; do gzip -d -c "$file" >> example_file.uncompressed ; done

gzip: xaa: unexpected end of file

gzip: xab: not in gzip format

[identical "not in gzip format" errors for xac through xbd omitted]
Quote:
You cannot restore anything from a random part/slice of a tar file; that is very unlikely.
You can restore some files by dumping the data starting from the next TAR header, because each TAR header contains the metadata (name, size, modified time) for one file.

7z and zip use centralized tables of contents, meaning the file listing loads immediately, but recovering portions or appending new files without rewriting the entire archive is not possible.

Quote:
Let's implement it.
Should it be an option for gzip or a separate tool named something like gzsplit? It would reuse much of the code from gzip.
 
Old 06-23-2025, 07:34 AM   #13
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 24,578

Rep: Reputation: 8078
Theoretically a decompressor can read those disks one by one, on the fly, and reconstruct the original file. This can be dangerous, especially if one of the disks is damaged and you lose everything. gzip cannot do this; you need another tool ("should it be an option" is not enough to make it work that way).
You still cannot restore anything from a random part/slice of a tar file; that is very unlikely, especially if a single file won't fit into the slice. You may occasionally restore a few files, if you can identify them at all, but that is not an acceptable solution.
Additionally, if you want to make independently decompressible slices, you will need to add overhead to every slice, and you also cannot use the full size (as discussed before), so in the end you may need more slices.
 
Old 06-23-2025, 08:19 AM   #14
michaelk
Moderator
 
Registered: Aug 2002
Posts: 26,881

Rep: Reputation: 6364
tar and gzip are separate utilities, with gzip being only a compression tool. 7zip and zip are both archive and compression tools. I think that to fully implement this you would need something new. I suppose you could implement it with a script of some sort, but it would probably be messy.

The old DOS pkzip utility created multi-volume floppy archives that would prompt for the next disk, although it does keep a centralized TOC on the last disk.
 
1 member found this post helpful.
  

