Description
Is your feature request related to a problem? Please describe.
Hello,
I have a question regarding the compression capabilities of the Zstandard algorithm. Specifically, I am interested in the scenario where I compress files using the fast mode to achieve the best compression speed. Afterward, I compress the already compressed file again using the normal mode before archiving the files.
My question is whether this second compression process would be slower and if it would result in a better compression ratio compared to compressing the file only once.
Based on my understanding, the first compression pass would likely eliminate most of the repeated patterns in the files, which could potentially slow down the second compression. However, in my tests, I observed that the second compression process runs really quickly (likely due to the file being much smaller after the first compression). Surprisingly, it compresses the file further by an additional +35%.
I would appreciate it if you could provide some insights on whether it is theoretically possible for the second compression to increase the file size instead of reducing it. I'm curious to understand the underlying factors contributing to this observation.
Thank you for your assistance!
Best Regards
Jack
Activity
Cyan4973 commented on Jul 7, 2023
The second compression pass is expected to be much faster precisely because the data was already compressed during the first pass. As a consequence, the second pass is expected to provide little, if any, compression benefit.
The fact that you found 35% savings in the second pass contradicts the second statement. But this is just one sample; it should not be construed as a generality.
The general expectation is that the second pass brings almost nothing, but counter-examples are possible. Unfortunately, these counter-examples are less easy to define. A general idea is that there might be so much redundancy in the source data that the first pass cannot remove all of it, which generally means that the compression ratio is very high. It is also related to the specific sets of parameters selected for the first and second passes.
Even when the second pass brings benefits, it generally means that, with proper parameters, a single pass would have been able to produce a better compression ratio. However, it's unlikely to be as fast.
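The effect described above is easy to observe directly. The sketch below uses the standard-library zlib module purely as a stand-in (zstd bindings are not in the Python standard library), comparing a fast pass, a strong second pass over the fast output, and a single strong pass over the original data:

```python
import zlib

# Highly redundant sample data: the case where a fast first pass
# may not remove all the redundancy.
data = b"sensor,temp,ok\n" * 200_000  # ~3 MB of repetitive rows

# Pass 1: fastest setting (analogous to a fast/low-effort mode).
pass1 = zlib.compress(data, level=1)

# Pass 2: strongest setting applied to the already-compressed bytes.
pass2 = zlib.compress(pass1, level=9)

# Single strong pass over the original data, for comparison.
single = zlib.compress(data, level=9)

print(len(data), len(pass1), len(pass2), len(single))
```

Whether the second pass shrinks or slightly grows its input depends on how much redundancy survived the first pass; the single strong pass over the original data is the fair baseline to compare against.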
If you are really into using 2 passes to compress your data, because it seems to fit a pattern that benefits from it, I suggest trying lz4 for the first pass. It messes with the data less, meaning that the outcome of the first pass will likely be better compressed by a second zstd pass. And it's also faster.

shuhuajack commented on Jul 7, 2023
Hello Yann,
Thank you for your detailed explanation. I truly appreciate it.
It appears that my specific sample data doesn't accurately represent the common cases. Surprisingly, in my tests, the "zstd -c --fast=5" command runs faster than "lz4 -1" and achieves a better compression ratio. Additionally, for the second compression pass, when using "zstd -c compressed-file" on a file compressed by zstd with the --fast option in the initial pass, it also runs faster and produces a slightly better compression ratio.
To gain more insights, I plan to conduct experiments using a wider range of data sources and compare the performance of the two algorithms in both the first and second passes.
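A minimal harness for that kind of comparison could look like the following. This is a sketch only: it uses the standard-library zlib module as a stand-in codec (the real experiment would call zstd and lz4), and the sample data is an invented CSV-like string, not real test data:

```python
import time
import zlib

def timed(label, fn):
    """Run fn once, report output size and wall-clock time."""
    start = time.perf_counter()
    out = fn()
    elapsed = time.perf_counter() - start
    print(f"{label}: {len(out)} bytes in {elapsed:.3f}s")
    return out

# Invented sparse, CSV-like sample; swap in real input files here.
data = b"smart_1_raw,smart_5_raw,,,\n" * 100_000

fast = timed("fast first pass", lambda: zlib.compress(data, level=1))
timed("strong second pass", lambda: zlib.compress(fast, level=9))
timed("single strong pass", lambda: zlib.compress(data, level=9))
```

Timing each strategy over the same inputs, and comparing the two-pass total against the single strong pass, is the comparison that matters.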
Thank you once again for your help. I hope you have a fantastic weekend.
Best Regards,
Jack
IchiruTake commented on Dec 22, 2024
Hello @shuhuajack,
My comment here is to leave some notes for others who are enthusiastic about dual compression. My test data is the HDD statistics accumulated from 2013 to Q4 2024 by Backblaze. For anyone concerned, this is their URL: https://www.backblaze.com/cloud-storage/resources/hard-drive-test-data#overviewHardDriveData
Scenario: First I download the zip file, which is around 26 - 28 GiB, on my Windows 10 machine and unzip it to get around 190 GiB of CSV data across 4192 files. Then I use the polars library to convert the CSV files to Parquet using its ZSTD compression at level 3, as below. This results in a total of 24 GiB of Parquet files on my disk (around 7.7 to 8x smaller). Then I use my 7-Zip 22.01 ZS (version below) to perform a second ZSTD compression at level 3 (same as polars), and I get down to 20.6 GiB (around 15% smaller).
While I agree that the second pass runs faster, I believe Cyan is right that my first pass still left a lot of room for compression, because the data is numeric analytics with many empty fields, even within the 1 MiB page size (the Polars default). This may not bring much for PDF or text-based formats, but I believe it surely helps with sparse data or numeric analytics.
Sample code:
P/s: Reproduction may differ from version to version, and it also depends on the options you put in the code and on the training dictionary used to learn the compression.
I used 7-Zip 22.01 ZS v1.5.5 R3 (x64)