Reproduced from stackoverflow.com.
Tags: gzip, tar

Why is tar-ing a folder containing some gzipped files as large as the unzipped files?

Posted on 2020-03-29 12:47:30

Given the following folder structure (with the size in bytes in parenthesis):

- dir
  - f1.txt (1754)
  - f2.txt (9811)

When I run gzip -r dir, I get:

 - dir
   - f1.txt.gz (654)
   - f2.txt.gz (804)

Now when I do tar -cf dir.tar dir (where dir contains the compressed files), I expect the size of dir.tar to be roughly 654 + 804 = 1458 bytes. But it turns out to be 10240, which is close to the combined size of the original f1.txt and f2.txt (1754 + 9811 = 11565)! Why?

Questioner
stackoverflowed
pmqs 2020-01-31 17:20

Let's work through an example to confirm what you are seeing.

Here I have a directory, x, with two files.

# ls -l x
total 12
-rw-r--r-- 1 root root 3902 Jan 30 17:00 log1.txt
-rw-r--r-- 1 root root 7518 Jan 30 17:00 log.txt

Compress the files

# gzip -9v x/*
x/log1.txt:  90.6% -- replaced with x/log1.txt.gz
x/log.txt:   84.5% -- replaced with x/log.txt.gz

Confirm that compression has worked

# ls -l x
total 8
-rw-r--r-- 1 root root  392 Jan 30 17:00 log1.txt.gz
-rw-r--r-- 1 root root 1195 Jan 30 17:00 log.txt.gz

Put the files into a tar, x.tar

# tar cvf x.tar x
x/
x/log1.txt.gz
x/log.txt.gz

and check the resulting size. I got 10240 as well.

# ls -l x.tar
-rw-r--r-- 1 root root 10240 Jan 31 09:02 x.tar

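The reason is quite simple: the tar format works in fixed-size blocks, so there will be a lot of padding with NUL bytes. Each member gets a 512-byte header and its data is rounded up to a multiple of 512 bytes, the archive ends with two all-zero blocks, and GNU tar then pads the whole archive to a multiple of its record size, which is 10240 bytes by default. The tar format documentation has the gory details. For small files like these the padding dominates; a hex dump of this tar file shows mostly NUL bytes.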
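
As a rough check on the padding explanation, here is a minimal sketch of the size arithmetic in shell. It assumes 512-byte blocks, one header block per member with no extended headers, two zero blocks at the end, and a 10240-byte record size (GNU tar's default blocking factor of 20). tar_size is a hypothetical helper name for illustration, not part of tar.

```shell
# Estimate the size of a tar archive from the sizes of its members.
# Usage: tar_size <number of directory entries> <file size>...
# Assumptions: 512-byte blocks, one header block per member, no extended
# headers, and a 10240-byte record size (GNU tar's default).
tar_size() {
    dirs=$1                  # directory entries are a header block with no data
    shift
    total=$(( dirs * 512 ))
    for size in "$@"; do
        # one header block, plus the data rounded up to whole blocks
        data_blocks=$(( (size + 511) / 512 ))
        total=$(( total + 512 + data_blocks * 512 ))
    done
    total=$(( total + 1024 ))                      # two zero blocks end the archive
    echo $(( (total + 10239) / 10240 * 10240 ))    # pad to a whole record
}

tar_size 1 392 1195     # the gzipped files from the example above
tar_size 1 3902 7518    # the same files before compression
```

Under those assumptions the gzipped files need only 4608 bytes of headers, data and end marker, and the rest of the 10240 is record padding, so the first call prints 10240.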

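This is why it is better to put the uncompressed version of the files into the tar, then compress that. The long runs of NUL padding compress to almost nothing, and gzip can also exploit redundancy across the files.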

Here is an example.

Put the uncompressed files into x.tar

# ls -l x
total 12
-rw-r--r-- 1 root root 3902 Jan 30 17:00 log1.txt
-rw-r--r-- 1 root root 7518 Jan 30 17:00 log.txt

# tar cvf x.tar x
x/
x/log1.txt
x/log.txt

# ls -l x.tar
-rw-r--r-- 1 root root 20480 Jan 31 09:06 x.tar

Now compress the tar file. 1761 bytes is a lot better.

# gzip -9v x.tar
x.tar:   91.7% -- replaced with x.tar.gz

# ls -l x.tar.gz 
-rw-r--r-- 1 root root 1761 Jan 31 09:06 x.tar.gz
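
The two steps above can also be combined. The following is a sketch, assuming a tar with built-in gzip support (the -z option, as in GNU tar and bsdtar). It builds its own scratch directory so it can be run as-is; the demo variable and the generated file contents are placeholders, not the files from the example.

```shell
# Self-contained demo in a scratch directory (all names here are placeholders).
demo=$(mktemp -d)
cd "$demo" || exit 1
mkdir x

# Generate some repetitive, compressible text to stand in for the log files.
i=0
while [ "$i" -lt 200 ]; do
    echo "line $i of a sample log file"
    i=$(( i + 1 ))
done > x/log.txt
cp x/log.txt x/log1.txt

tar -cf x.tar x          # plain archive, padded to whole records
tar -czf x.tar.gz x      # same archive, filtered through gzip by -z

wc -c x.tar x.tar.gz     # compare the two sizes
tar -tzf x.tar.gz        # list the members to confirm the archive is readable
```

The plain gzip -9 used above asks for maximum compression, whereas tar -z uses gzip's default level, so the sizes may differ slightly from a separate gzip -9 run.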