Today, 12:42 AM | #1
exerceo (Member) | Registered: Oct 2022 | Posts: 129
How to split a gzip into valid smaller gzip files?
How do I split a gzip file into chunks of a pre-determined size that are valid gzip files themselves?
Something like split -b 4095M example.img.gz would not work, given that it would cut through the internal structure of the gzip file. My goal is for the gzip chunks to be re-assemblable from multiple devices, so each gzip part file has to be valid on its own.
The gzip file itself should be of a pre-determined size, not the content it expands to.
For re-assembling, the following is no use to me because it requires all the part files to be available at the same time, which defeats the point of splitting across devices:
Code:
cat example.img.gz.part* | gzip -d -c >> example.img
What I want is this:
Code:
gzip -d -c example.img.part*.gz >> example.img
I want each part of the gzip to be a valid gzip file of its own, so they can be assembled from multiple devices.
The pre-determined size doesn't have to be exact, but within a few megabytes of a desired value. For example, if I want 4096 MiB chunks, something like 4090 MiB is acceptable too.
This is already possible with bzip2 by "abusing" bzip2recover. Normally, bzip2recover is intended for salvaging the undamaged parts of damaged archives, but it can be used to break any bzip2 file down into one file per bzip2 block. This way, multiple blocks can be concatenated into parts of any desired approximate size.
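For illustration, a rough sketch of that bzip2recover workflow (file names and the 4096 MiB target are illustrative assumptions; the size check assumes GNU stat):
Code:
# Break the archive into one small .bz2 file per bzip2 block:
bzip2recover example.img.bz2
# -> rec00001example.img.bz2, rec00002example.img.bz2, ...
# Concatenate consecutive blocks into parts of roughly 4096 MiB;
# concatenated bzip2 streams are themselves a valid bzip2 file:
target=$((4096 * 1024 * 1024))
part=0
size=0
for blk in rec*example.img.bz2; do
    cat "$blk" >> "example.img.part$part.bz2"
    size=$((size + $(stat -c %s "$blk")))
    if [ "$size" -ge "$target" ]; then
        part=$((part + 1))
        size=0
    fi
done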
The reason I prefer gzip for this purpose is its much faster speed.
Last edited by exerceo; Today at 02:26 PM.
Reason: edited title for clarity
Today, 03:38 AM | #2
lvm_ (Senior Member) | Registered: Jul 2020 | Posts: 1,602
Can't be done. You'll have to use an archiver with multi-volume support, such as 7z or rar, and if speed is so important, reduce the compression quality. Alternatively, you may split a single gzip archive using split and combine it back into a named pipe attached at the other end to the decompressing process; that way it will wait for the parts. A technique to prevent the pipe from closing after each file is described here: https://superuser.com/questions/7664...to-named-pipes
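A minimal sketch of that approach, assuming bash (names are illustrative; the exec redirection is what keeps the pipe open between parts):
Code:
mkfifo /tmp/gz.pipe
gzip -d -c < /tmp/gz.pipe > example.img &  # decompressor blocks, waiting on the pipe
exec 3> /tmp/gz.pipe                       # hold the write end open between parts
for part in example.img.gz.part*; do       # swap devices between iterations as needed
    cat "$part" >&3
done
exec 3>&-                                  # closing the pipe sends EOF; gzip finishes
rm /tmp/gz.pipe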
Last edited by lvm_; Today at 03:50 AM.
Today, 04:13 AM | #3
LQ Veteran | Registered: Aug 2003 | Location: Australia | Distribution: Lots ... | Posts: 21,418
I was thinking along similar lines. Maybe lzip as well; I like the idea of lziprecover and ddrescue, but haven't pursued it as I should have.
Today, 04:31 AM | #4
pan64 (LQ Addict) | Registered: Mar 2012 | Location: Hungary | Distribution: debian/ubuntu/suse ... | Posts: 24,515
Quote:
Originally Posted by exerceo
How do I split a gzip file into chunks of a pre-determined size that are valid gzip files themselves?
The gzip file itself should be of a pre-determined size, not the content it expands to.
That is just wrong. The size of the "result" is determined by the user via the parameters passed to gzip.
gzip produces a single compressed archive.
If you want to make smaller archives, you need to split the original files.
Additionally, the size of the result cannot be calculated without actually doing the compression (it depends on the content and on the algorithm too), therefore you can never predict its exact size.
You can also use gzip file by file and play with the results to construct your preferred archive; it will just take a very long time.
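For illustration, a rough sketch of that file-by-file approach (the directory layout, the 4 GiB target, and GNU stat are assumptions):
Code:
# Compress each file separately, then group the results into
# directories of roughly 4 GiB each:
target=$((4 * 1024 * 1024 * 1024))
bucket=0
used=0
mkdir -p "part$bucket"
for f in data/*; do
    out="part$bucket/$(basename "$f").gz"
    gzip -c "$f" > "$out"
    used=$((used + $(stat -c %s "$out")))
    if [ "$used" -ge "$target" ]; then
        bucket=$((bucket + 1))
        used=0
        mkdir -p "part$bucket"
    fi
done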
Even in the case of 7zip you need to have all the parts available to be able to unpack: https://askubuntu.com/questions/1342...t-7zip-archive
By the way, there is a gzip recovery tool too, but that still won't help here.
Today, 11:21 AM | #6
exerceo (Member, Original Poster) | Registered: Oct 2022 | Posts: 129
gzip splitting
Quote:
Originally Posted by lvm_
Can't be done.
Quote:
Originally Posted by pan64
That is just wrong. The size of the "result" is determined by the user via the parameters passed to gzip.
gzip produces a single compressed archive.
If you want to make smaller archives, you need to split the original files.
Additionally, the size of the result cannot be calculated without actually doing the compression (it depends on the content and on the algorithm too), therefore you can never predict its exact size.
You can also use gzip file by file and play with the results to construct your preferred archive; it will just take a very long time.
While it is indeed impossible to predict a compressed size without doing the compression work, it should be technically possible to check the size of the gzip file while it is being created. Once it gets as close as possible to (but still below) the preferred size, the compressor should close the current file and start a new gzip file. My goal is that each gzip is a valid file of its own.
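One practical approximation of this with standard tools (a sketch, assuming GNU split with --filter support; the 8 GiB chunk size is an illustrative guess at what compresses to roughly 4 GiB, so the compressed parts track the data rather than an exact target):
Code:
# Each 8 GiB slice of the input is piped through its own gzip process,
# so every output part is a complete, independently valid gzip file:
split -b 8G --filter='gzip -c > "$FILE.gz"' example.img example.img.part
# -> example.img.partaa.gz, example.img.partab.gz, ...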
Quote:
Originally Posted by boughtonp
I want to store a huge tar file, with compression, across smaller flash drives so it can be reassembled later by gunzipping the individual gzip files back into the original uncompressed file. I want a pre-determined size in order to waste as little space as possible.
Splitting after gzipping would require two passes to get back to the original data: first reassembling the gzip file, then decompressing it, because splitting a gzip at an arbitrary position cuts through the compressed stream.
But if the data were split across intact gzip files, you could gzip -d -c each gzip file back into the original file in a single pass. You don't have to reassemble the gzip file; you can rebuild the original uncompressed file directly from the gzip pieces.
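For instance, reassembly could then be a single pass over the sticks, one part at a time (a sketch; the mount point and part suffixes are illustrative):
Code:
: > example.img                      # start with an empty output file
for suffix in aa ab ac; do           # illustrative part suffixes
    read -p "Insert the stick holding part $suffix, then press Enter " _
    gzip -d -c "/mnt/usb/example.img.part$suffix.gz" >> example.img
done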
I am surprised nothing like this has been implemented in decades. It is clearly doable.
Today, 11:39 AM | #7
pan64 (LQ Addict) | Registered: Mar 2012 | Location: Hungary | Distribution: debian/ubuntu/suse ... | Posts: 24,515
Quote:
Originally Posted by exerceo
While it is indeed impossible to predict a compressed size without doing the compression work, it should be technically possible to check the size of the gzip file while it is being created. Once it gets as close as possible to (but still below) the preferred size, the compressor should close the current file and start a new gzip file. My goal is that each gzip is a valid file of its own.
No, gzip by itself cannot split its input based on the compressed size. Although that looks achievable, it has not been implemented [yet], most probably because you need to concatenate the uncompressed parts to get the original file back, so the partial compressed archives will not contain anything usable on their own (only together with the other parts).
Quote:
Originally Posted by exerceo
I want to store a huge tar file, with compression, across smaller flash drives so it can be reassembled later by gunzipping the individual gzip files back into the original uncompressed file. I want a pre-determined size in order to waste as little space as possible.
Splitting after gzipping would require two passes to get back to the original data: first reassembling the gzip file, then decompressing it, because splitting a gzip at an arbitrary position cuts through the compressed stream.
But if the data were split across intact gzip files, you could gzip -d -c each gzip file back into the original file in a single pass. You don't have to reassemble the gzip file; you can rebuild the original uncompressed file directly from the gzip pieces.
I am surprised nothing like this has been implemented in decades. It is clearly doable.
Reassembling a split archive is a single command. Use a compressor which can manage that automatically, so you will not need to do that extra step by hand.
Today, 12:03 PM | #8
michaelk (Moderator) | Registered: Aug 2002 | Posts: 26,864
tar can produce multi-volume archives, but they cannot be compressed.
The zip format supports multi-volume archives of a fixed size. You do not need to assemble the individual files, but they all need to be in the same directory to unzip. There used to be a known problem, which might still exist, where volumes whose size was a multiple of the buffer size (e.g. 16 KiB) would not unzip correctly. I have not created multi-volume zip files in decades.
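For what it's worth, a sketch of zip's split support, assuming Info-ZIP's zip and unzip (the 4g piece size and names are illustrative; tar's multi-volume mode, e.g. tar -cM -L SIZE, similarly produces fixed-size but uncompressed volumes):
Code:
# Create a split archive in ~4 GiB pieces:
zip -s 4g -r archive.zip bigdir/
# -> archive.z01, archive.z02, ..., archive.zip
# Rejoin the pieces into a single archive, then extract:
zip -s 0 archive.zip --out whole.zip
unzip whole.zip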
Today, 01:05 PM | #9
teckk (LQ Guru) | Registered: Oct 2004 | Distribution: Arch | Posts: 5,465
My 2 cents.
Code:
# Split the uncompressed stream into 200-line pieces:
zcat file.gz | split -l 200 - file.part
# or
gunzip -c file.gz | split -l 200 - file.part
# or split the compressed stream into 1 GiB pieces
# (these pieces are not valid gzip files on their own):
gzip -c file | split -b 1024m - file.gz.part
cat file.gz.part* > file.gz
Code:
file /usr/share/man/man1/bash.1.gz
/usr/share/man/man1/bash.1.gz: gzip compressed data, max compression, from Unix, original size modulo 2^32 351525
ls -l /usr/share/man/man1/bash.1.gz
-rw-r--r-- 1 root root 97104 Mar 11 16:53 /usr/share/man/man1/bash.1.gz
Let me try that with a man page.
Code:
zcat /usr/share/man/man1/bash.1.gz | split -l 1000 - file.part
And that gave me file.partaa to file.partal
Code:
file file.partaa
file.partaa: troff or preprocessor input, ASCII text
ls -l file.partaa
-rw-r--r-- 1 me me 30259 Jun 21 12:49 file.partaa
And that is readable:
Code:
man ~/file.partaa
BASH(1) General Commands Manual BASH(1)
NAME
bash - GNU Bourne-Again SHell
SYNOPSIS
bash [options] [command_string | file]
COPYRIGHT
Bash is Copyright (C) 1989-2022 by the Free Software Foundation, Inc.
DESCRIPTION
Bash is an sh-compatible command language interpreter that executes commands
read from the standard input or from a file. Bash also incorporates useful
features from the Korn and C shells (ksh and csh).
...
Code:
cat file.part* > test.gz
file test.gz
test.gz: troff or preprocessor input, ASCII text
ls -l test.gz
-rw-r--r-- 1 me me 351525 Jun 21 12:52 test.gz
And that reads all the way through with man.
Code:
gzip -c test.gz > test2.gz
file test2.gz
test2.gz: gzip compressed data, was "test.gz", last modified: Sat Jun 21 17:52:47 2025, from Unix, original size modulo 2^32 351525
ls -l test2.gz
-rw-r--r-- 1 me me 97654 Jun 21 13:02 test2.gz
Code:
man test2.gz
BASH(1) General Commands Manual BASH(1)
NAME
bash - GNU Bourne-Again SHell
SYNOPSIS
bash [options] [command_string | file]
COPYRIGHT
Bash is Copyright (C) 1989-2022 by the Free Software Foundation, Inc.
DESCRIPTION
Bash is an sh-compatible command language interpreter that executes commands
read from the standard input or from a file. Bash also incorporates useful
features from the Korn and C shells (ksh and csh).
Bash is intended to be a conformant implementation of the Shell and Utilities
portion of the IEEE POSIX specification (IEEE Standard 1003.1). Bash can be
configured to be POSIX-conformant by default.
...
Last edited by teckk; Today at 01:08 PM.
Today, 01:33 PM | #10
exerceo (Member, Original Poster) | Registered: Oct 2022 | Posts: 129
Quote:
Originally Posted by pan64
Reassembling a split archive is a single command. Use a compressor which can manage that automatically, so you will not need to do that extra step by hand.
The problem is that it takes time to finish, and takes a lot of space too.
Running split on a .gz file slices through the internal structure of the gzip file, so you will not get back to the original data if you gzip -d -c the resulting pieces.
With current methods, you cannot get back to the original uncompressed data in a single command unless all the USB sticks with all the gzip parts are inserted in the computer at the same time. Then you could concatenate (cat) all the parts and pipe them through gzip -d -c >> original_file. But if you don't have enough USB ports, you have to cycle through the USB sticks to get back to the original file.
The only way to do that in a single pass is to have gzip part files that are valid themselves. And this seems to be a feature gap.
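A quick demonstration of the gzip property this relies on: concatenated gzip members form one valid gzip stream, so valid parts can be decompressed one by one or all at once:
Code:
printf 'hello ' | gzip > a.gz
printf 'world\n' | gzip > b.gz
cat a.gz b.gz | gzip -d -c   # prints "hello world"
gzip -d -c a.gz b.gz         # same result, reading one member file at a time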
Quote:
Originally Posted by pan64
so the partial compressed archives will not contain anything usable on their own (only together with the other parts).
Actually, the structure of the TAR format (no centralized TOC, just a sequence of headers, each followed by its file contents) would indeed allow recovering parts of incomplete TAR files. But the reason I want valid gzip files is so I can cycle through USB sticks and get back to the original file in a single pass.
Quote:
Originally Posted by michaelk
You do not need to assemble the individual files but they need to be all in the same directory to unzip.
Unfortunately, this makes it impossible to split it across different devices.
Quote:
Originally Posted by teckk
My 2 cents.
I am looking to split the compressed data into valid, non-truncated gzip files, not the uncompressed data, but thanks anyway for trying.
Last edited by exerceo; Today at 01:37 PM.
Reason: plural