Today, 12:42 AM | #1
exerceo (Member) | Registered: Oct 2022 | Posts: 129
How to split a gzip into valid smaller gzip files?
How do I split a gzip file into chunks of a pre-determined size that are valid gzip files themselves?
Something like split -b 4095M example.img.gz would not work, given that it would cut through the internal structure of the gzip file. My goal is for the gzip chunks to be re-assemblable from multiple devices, so each gzip part file has to be valid on its own.
The gzip file itself should be of a pre-determined size, not the content it expands to.
For re-assembling, the following is no use to me because it requires all the part files to be available at the same time, which defeats the point of splitting across devices:
Code:
cat example.img.gz.part* | gzip -d -c >> example.img
What I want is this:
Code:
gzip -d -c example.img.part*.gz >> example.img
I want each part of the gzip to be a valid gzip file of its own, so they can be assembled from multiple devices.
The pre-determined size doesn't have to be exact, but within a few megabytes of a desired value. For example, if I want 4096 MiB chunks, something like 4090 MiB is acceptable too.
This is already possible with bzip2 by "abusing" bzip2recover. Normally, bzip2recover is intended for salvaging the undamaged parts of damaged archives, but it can be used to break any bzip2 file down into one file per bzip2 block. This way, multiple blocks can be concatenated into parts of any desired approximate size.
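For illustration, a rough sketch of that bzip2recover workflow (file names and the 4096 MiB target are illustrative assumptions; the size check assumes GNU stat):
Code:
# Break the archive into one small .bz2 file per bzip2 block:
bzip2recover example.img.bz2
# -> rec00001example.img.bz2, rec00002example.img.bz2, ...
# Concatenate consecutive blocks into parts of roughly 4096 MiB;
# concatenated bzip2 streams are themselves a valid bzip2 file:
target=$((4096 * 1024 * 1024))
part=0
size=0
for blk in rec*example.img.bz2; do
    cat "$blk" >> "example.img.part$part.bz2"
    size=$((size + $(stat -c %s "$blk")))
    if [ "$size" -ge "$target" ]; then
        part=$((part + 1))
        size=0
    fi
done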
The reason I prefer gzip for this purpose is its much faster speed.
Last edited by exerceo; Today at 02:26 PM.
Reason: edited title for clarity
Today, 03:38 AM | #2
lvm_ (Senior Member) | Registered: Jul 2020 | Posts: 1,602
Can't be done. You'll have to use an archiver with multi-volume support, such as 7z or rar, and if speed is so important, reduce the compression quality. Alternatively, you may split a single gzip archive using split and combine it back into a named pipe attached at the other end to the decompressing process; that way it will wait for the parts. A technique to prevent the pipe from closing after each file is described here: https://superuser.com/questions/7664...to-named-pipes
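A minimal sketch of that approach, assuming bash (names are illustrative; the exec redirection is what keeps the pipe open between parts):
Code:
mkfifo /tmp/gz.pipe
gzip -d -c < /tmp/gz.pipe > example.img &  # decompressor blocks, waiting on the pipe
exec 3> /tmp/gz.pipe                       # hold the write end open between parts
for part in example.img.gz.part*; do       # swap devices between iterations as needed
    cat "$part" >&3
done
exec 3>&-                                  # closing the pipe sends EOF; gzip finishes
rm /tmp/gz.pipe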
Last edited by lvm_; Today at 03:50 AM.
Today, 04:13 AM | #3
LQ Veteran | Registered: Aug 2003 | Location: Australia | Distribution: Lots ... | Posts: 21,418
I was thinking along similar lines. Maybe lzip as well; I like the idea of lziprecover and ddrescue, but haven't pursued it as I should have.
Today, 04:31 AM | #4
pan64 (LQ Addict) | Registered: Mar 2012 | Location: Hungary | Distribution: debian/ubuntu/suse ... | Posts: 24,515
Quote:
Originally Posted by exerceo
How do I split a gzip file into chunks of a pre-determined size that are valid gzip files themselves?
The gzip file itself should be of a pre-determined size, not the content it expands to.
That is just wrong. The size of the "result" is determined by the user via the parameters passed to gzip.
gzip produces a single compressed archive.
If you want to make smaller archives, you need to split the original files.
Additionally, the size of the result cannot be calculated without actually doing the compression (it depends on the content and on the algorithm too), therefore you can never predict its exact size.
You can also use gzip file by file and play with the results to construct your preferred archive; it will just take a very long time.
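For illustration, a rough sketch of that file-by-file approach (the directory layout, the 4 GiB target, and GNU stat are assumptions):
Code:
# Compress each file separately, then group the results into
# directories of roughly 4 GiB each:
target=$((4 * 1024 * 1024 * 1024))
bucket=0
used=0
mkdir -p "part$bucket"
for f in data/*; do
    out="part$bucket/$(basename "$f").gz"
    gzip -c "$f" > "$out"
    used=$((used + $(stat -c %s "$out")))
    if [ "$used" -ge "$target" ]; then
        bucket=$((bucket + 1))
        used=0
        mkdir -p "part$bucket"
    fi
done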
Even in the case of 7zip you need to have all the parts available to be able to unpack: https://askubuntu.com/questions/1342...t-7zip-archive
By the way, there is a gzip recovery tool too, but that still won't help here.
Today, 11:21 AM | #6
exerceo (Member, Original Poster) | Registered: Oct 2022 | Posts: 129
gzip splitting
Quote:
Originally Posted by lvm_
Can't be done.
Quote:
Originally Posted by pan64
That is just wrong. The size of the "result" is determined by the user via the parameters passed to gzip.
gzip produces a single compressed archive.
If you want to make smaller archives, you need to split the original files.
Additionally, the size of the result cannot be calculated without actually doing the compression (it depends on the content and on the algorithm too), therefore you can never predict its exact size.
You can also use gzip file by file and play with the results to construct your preferred archive; it will just take a very long time.
While it is indeed impossible to predict a compressed size without doing the compression work, it should be technically possible to check the size of the gzip file while it is being created. Once it gets as close as possible to (but still below) the preferred size, the compressor should close the current file and start a new gzip file. My goal is that each gzip is a valid file of its own.
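One practical approximation of this with standard tools (a sketch, assuming GNU split with --filter support; the 8 GiB chunk size is an illustrative guess at what compresses to roughly 4 GiB, so the compressed parts track the data rather than an exact target):
Code:
# Each 8 GiB slice of the input is piped through its own gzip process,
# so every output part is a complete, independently valid gzip file:
split -b 8G --filter='gzip -c > "$FILE.gz"' example.img example.img.part
# -> example.img.partaa.gz, example.img.partab.gz, ...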
Quote:
Originally Posted by boughtonp
I want to store a huge tar file, with compression, across smaller flash drives so it can be reassembled later by gunzipping the individual gzip files back into the original uncompressed file. I want a pre-determined size in order to waste as little space as possible.
Splitting after gzipping would require two passes to get back to the original data: first reassembling the gzip file, then decompressing it, because splitting a gzip at an arbitrary position cuts through the compressed stream.
But if the data were split across intact gzip files, you could gzip -d -c each gzip file back into the original file in a single pass. You don't have to reassemble the gzip file; you can rebuild the original uncompressed file directly from the gzip pieces.
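For instance, reassembly could then be a single pass over the sticks, one part at a time (a sketch; the mount point and part suffixes are illustrative):
Code:
: > example.img                      # start with an empty output file
for suffix in aa ab ac; do           # illustrative part suffixes
    read -p "Insert the stick holding part $suffix, then press Enter " _
    gzip -d -c "/mnt/usb/example.img.part$suffix.gz" >> example.img
done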
I am surprised nothing like this has been implemented in decades. It is clearly doable.
Today, 11:39 AM | #7
pan64 (LQ Addict) | Registered: Mar 2012 | Location: Hungary | Distribution: debian/ubuntu/suse ... | Posts: 24,515
Quote:
Originally Posted by exerceo
While it is indeed impossible to predict a compressed size without doing the compression work, it should be technically possible to check the size of the gzip file while it is being created. Once it gets as close as possible to (but still below) the preferred size, the compressor should close the current file and start a new gzip file. My goal is that each gzip is a valid file of its own.
No, gzip by itself cannot split its input based on the compressed size. Although that looks achievable, it has not been implemented [yet], most probably because you need to concatenate the uncompressed parts to get the original file back, so the partial compressed archives will not contain anything usable on their own (only together with the other parts).
Quote:
Originally Posted by exerceo
I want to store a huge tar file, with compression, across smaller flash drives so it can be reassembled later by gunzipping the individual gzip files back into the original uncompressed file. I want a pre-determined size in order to waste as little space as possible.
Splitting after gzipping would require two passes to get back to the original data: first reassembling the gzip file, then decompressing it, because splitting a gzip at an arbitrary position cuts through the compressed stream.
But if the data were split across intact gzip files, you could gzip -d -c each gzip file back into the original file in a single pass. You don't have to reassemble the gzip file; you can rebuild the original uncompressed file directly from the gzip pieces.
I am surprised nothing like this has been implemented in decades. It is clearly doable.
Reassembling a split archive is a single command. Use a compressor which can manage that automatically, so you will not need to do that extra step by hand.
Today, 12:03 PM | #8
michaelk (Moderator) | Registered: Aug 2002 | Posts: 26,864
tar can produce multi-volume archives, but they cannot be compressed.
The zip format supports multi-volume archives of a fixed size. You do not need to assemble the individual files, but they all need to be in the same directory to unzip. There used to be a known problem, which might still exist, where volumes whose size was a multiple of the buffer size (e.g. 16 KiB) would not unzip correctly. I have not created multi-volume zip files in decades.
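For what it's worth, a sketch of zip's split support, assuming Info-ZIP's zip and unzip (the 4g piece size and names are illustrative; tar's multi-volume mode, e.g. tar -cM -L SIZE, similarly produces fixed-size but uncompressed volumes):
Code:
# Create a split archive in ~4 GiB pieces:
zip -s 4g -r archive.zip bigdir/
# -> archive.z01, archive.z02, ..., archive.zip
# Rejoin the pieces into a single archive, then extract:
zip -s 0 archive.zip --out whole.zip
unzip whole.zip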
Today, 01:05 PM | #9
teckk (LQ Guru) | Registered: Oct 2004 | Distribution: Arch | Posts: 5,465
My 2 cents.
Code:
# Split the uncompressed stream into 200-line pieces:
zcat file.gz | split -l 200 - file.part
# or
gunzip -c file.gz | split -l 200 - file.part
# or split the compressed stream into 1 GiB pieces
# (these pieces are not valid gzip files on their own):
gzip -c file | split -b 1024m - file.gz.part
cat file.gz.part* > file.gz
Code:
file /usr/share/man/man1/bash.1.gz
/usr/share/man/man1/bash.1.gz: gzip compressed data, max compression, from Unix, original size modulo 2^32 351525
ls -l /usr/share/man/man1/bash.1.gz
-rw-r--r-- 1 root root 97104 Mar 11 16:53 /usr/share/man/man1/bash.1.gz
Let me try that with a man page.
Code:
zcat /usr/share/man/man1/bash.1.gz | split -l 1000 - file.part
And that gave me file.partaa to file.partal
Code:
file file.partaa
file.partaa: troff or preprocessor input, ASCII text
ls -l file.partaa
-rw-r--r-- 1 me me 30259 Jun 21 12:49 file.partaa
And that is readable:
Code:
man ~/file.partaa
BASH(1) General Commands Manual BASH(1)
NAME
bash - GNU Bourne-Again SHell
SYNOPSIS
bash [options] [command_string | file]
COPYRIGHT
Bash is Copyright (C) 1989-2022 by the Free Software Foundation, Inc.
DESCRIPTION
Bash is an sh-compatible command language interpreter that executes commands
read from the standard input or from a file. Bash also incorporates useful
features from the Korn and C shells (ksh and csh).
...
Code:
cat file.part* > test.gz
file test.gz
test.gz: troff or preprocessor input, ASCII text
ls -l test.gz
-rw-r--r-- 1 me me 351525 Jun 21 12:52 test.gz
And that reads all the way through with man.
Code:
gzip -c test.gz > test2.gz
file test2.gz
test2.gz: gzip compressed data, was "test.gz", last modified: Sat Jun 21 17:52:47 2025, from Unix, original size modulo 2^32 351525
ls -l test2.gz
-rw-r--r-- 1 me me 97654 Jun 21 13:02 test2.gz
Code:
man test2.gz
BASH(1) General Commands Manual BASH(1)
NAME
bash - GNU Bourne-Again SHell
SYNOPSIS
bash [options] [command_string | file]
COPYRIGHT
Bash is Copyright (C) 1989-2022 by the Free Software Foundation, Inc.
DESCRIPTION
Bash is an sh-compatible command language interpreter that executes commands
read from the standard input or from a file. Bash also incorporates useful
features from the Korn and C shells (ksh and csh).
Bash is intended to be a conformant implementation of the Shell and Utilities
portion of the IEEE POSIX specification (IEEE Standard 1003.1). Bash can be
configured to be POSIX-conformant by default.
...
Last edited by teckk; Today at 01:08 PM.
Today, 01:33 PM | #10
exerceo (Member, Original Poster) | Registered: Oct 2022 | Posts: 129
Quote:
Originally Posted by pan64
Reassembling a split archive is a single command. Use a compressor which can manage that automatically, so you will not need to do that extra step by hand.
The problem is that it takes time to finish, and takes a lot of space too.
Running split on a .gz file slices through the internal structure of the gzip file, so you will not get back to the original data if you gzip -d -c the resulting pieces.
With current methods, you cannot get back to the original uncompressed data in a single command unless all the USB sticks with all the gzip parts are inserted in the computer at the same time. Then you could concatenate (cat) all the parts and pipe them through gzip -d -c >> original_file. But if you don't have enough USB ports, you have to cycle through the USB sticks to get back to the original file.
The only way to do that in a single pass is to have gzip part files that are valid themselves. And this seems to be a feature gap.
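A quick demonstration of the gzip property this relies on: concatenated gzip members form one valid gzip stream, so valid parts can be decompressed one by one or all at once:
Code:
printf 'hello ' | gzip > a.gz
printf 'world\n' | gzip > b.gz
cat a.gz b.gz | gzip -d -c   # prints "hello world"
gzip -d -c a.gz b.gz         # same result, reading one member file at a time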
Quote:
Originally Posted by pan64
so the partial compressed archives will not contain anything usable on their own (only together with the other parts).
Actually, the structure of the TAR format (no centralized TOC, just a sequence of headers, each followed by its file contents) would indeed allow recovering parts of incomplete TAR files. But the reason I want valid gzip files is so I can cycle through USB sticks and get back to the original file in a single pass.
Quote:
Originally Posted by michaelk
You do not need to assemble the individual files but they need to be all in the same directory to unzip.
Unfortunately, this makes it impossible to split it across different devices.
Quote:
Originally Posted by teckk
My 2 cents.
I am looking to split the compressed data into valid, non-truncated gzip files, not the uncompressed data, but thanks anyway for trying.
Last edited by exerceo; Today at 01:37 PM.
Reason: plural