# Archiving and Compression
Archiving means combining multiple files into one for easier transmission or storage, while compression makes files smaller by removing redundant information.
Both processes are often used together in file archiving: the process of storing and transmitting one or more files as efficiently as possible.
Although disk space is cheap nowadays, file archiving still offers benefits when transmitting large numbers of individual files (think source files) or when working with truly large amounts of data (think logs). Additionally, some long-term storage solutions work far more efficiently when presented with a continuous stream of data rather than individual files.
# Compression
When it comes to compression, there are two types:
Lossy, in which information is removed so that decompressing the compressed file results in a slightly different file than the original. For example, slight variations of green in an image may differ before and after compression.
Lossless, in which no information is removed. The decompressed file is identical to the original.
Many image formats, JPEG for example, use lossy compression. Be aware that compressing an already compressed file will generally not make it smaller unless you are willing to sacrifice more information, which will eventually lead to unrecognizable results. Please do your own research if you are interested in more information about compression.
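
The point about already compressed files can be checked with a quick experiment. The following is a minimal sketch that uses `gzip` (introduced in the next section) for both passes; the file names and the temporary directory are made up for illustration, and the exact sizes will differ on every run because the sample data is random.

```shell
#!/bin/sh
set -eu
workdir=$(mktemp -d)   # hypothetical scratch directory

# Sample input: base64 text generated from random bytes.
# It is compressible (base64 uses only 64 symbols) but has no other structure.
head -c 100000 /dev/urandom | base64 > "$workdir/sample.txt"

# First pass: compress the text.
gzip -c "$workdir/sample.txt" > "$workdir/once.gz"
# Second pass: compress the already compressed file.
gzip -c "$workdir/once.gz" > "$workdir/twice.gz"

original=$(wc -c < "$workdir/sample.txt")
once=$(wc -c < "$workdir/once.gz")
twice=$(wc -c < "$workdir/twice.gz")
echo "original: $original bytes"
echo "compressed once: $once bytes"
echo "compressed twice: $twice bytes"

rm -rf "$workdir"
```

The second pass has no redundancy left to remove, so it should not come out smaller than the first and typically grows by a few bytes of header overhead.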
Linux provides several tools to compress files.
# gzip
`gzip` is the most commonly used compression utility for Linux. It is based on the DEFLATE algorithm, a lossless algorithm based on LZ77 and Huffman coding.
`gzip access_log` will compress the file `access_log` and replace it with its compressed version.
By default, `gzip` will append `.gz` to the filename.
`gzip -l access_log.gz` will show you statistics about the compressed file.
`gunzip` is the inverse command to `gzip`.
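
The commands above can be tried end to end. The following is a small sketch, assuming a generated stand-in for `access_log` (the log lines and the `access_log.orig` copy are hypothetical and only there so the result can be compared).

```shell
#!/bin/sh
set -eu
workdir=$(mktemp -d)   # hypothetical scratch directory
cd "$workdir"

# Stand-in for a real access log: 5000 similar lines.
seq 1 5000 | sed 's|^|GET /index.html request |' > access_log
cp access_log access_log.orig   # keep a copy for comparison

gzip access_log                 # replaces access_log with access_log.gz
ls                              # access_log is gone, access_log.gz exists
gzip -l access_log.gz           # compressed size, uncompressed size, ratio

gunzip access_log.gz            # restores access_log, removes access_log.gz

# Lossless: the restored file should match the copy byte for byte.
if cmp -s access_log access_log.orig; then
    roundtrip=identical
else
    roundtrip=different
fi
echo "round trip: $roundtrip"

cd / && rm -rf "$workdir"
```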
TIP
Try figuring out how to gunzip a file without using `gunzip`, and then check what `gunzip` actually does.
`gzip` can also be used as a filter: `pg_dump my_db | gzip > my_db_backup.gz` will take the database dump from PostgreSQL and write the gzipped result to `my_db_backup.gz`.
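
The same pattern works with any command that writes to STDOUT. Here is a minimal sketch that uses `seq` as a stand-in for `pg_dump`, so it runs without a database; `numbers.gz` is a hypothetical file name.

```shell
#!/bin/sh
set -eu
workdir=$(mktemp -d)   # hypothetical scratch directory

# seq stands in for pg_dump here: it simply writes 1000 lines to STDOUT.
seq 1 1000 | gzip > "$workdir/numbers.gz"

# Reading works the same way: decompress to STDOUT and pipe onwards.
lines=$(gunzip -c "$workdir/numbers.gz" | wc -l)
echo "lines recovered: $lines"

rm -rf "$workdir"
```

Because nothing uncompressed is ever written to disk, this approach is handy when the raw output would be very large.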
# bzip2
`bzip2` is very similar to `gzip` but usually offers better compression at the expense of more CPU time.
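
To see the trade-off on your own data, both tools can be run on the same input. The following is a rough sketch with a generated sample file; the file names are hypothetical, and the resulting sizes depend heavily on the input, so treat the numbers as an illustration rather than a benchmark.

```shell
#!/bin/sh
set -eu
# Skip the demo if bzip2 is not available on this system.
command -v bzip2 >/dev/null || { echo "bzip2 is not installed"; exit 0; }

workdir=$(mktemp -d)   # hypothetical scratch directory
cd "$workdir"

# Stand-in for a real access log.
seq 1 200000 | sed 's|^|GET /index.html request |' > access_log

# -c writes to STDOUT, so the input file is kept for both runs.
gzip  -c access_log > access_log.gz
bzip2 -c access_log > access_log.bz2

original=$(wc -c < access_log)
gz_size=$(wc -c < access_log.gz)
bz_size=$(wc -c < access_log.bz2)
echo "original: $original bytes"
echo "gzip:     $gz_size bytes"
echo "bzip2:    $bz_size bytes"

# bzip2 is lossless as well: decompressing yields the original bytes.
if bzip2 -dc access_log.bz2 | cmp -s - access_log; then
    bz_roundtrip=identical
else
    bz_roundtrip=different
fi
echo "bzip2 round trip: $bz_roundtrip"

cd / && rm -rf "$workdir"
```

Prefixing the two compression commands with `time` in an interactive shell gives a feel for the difference in CPU usage.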
# Archiving
The traditional tool for archival in the *nix world is `tar`, which stands for Tape ARchive. It was originally used to stream files to a tape for backup or file transfer. It creates a single output file which can later be split back into the original source files. This file is often called a tarball. `tar` has three basic features:
Operation | Flag |
---|---|
Create an archive | c |
List an archive's contents | t |
Extract file(s) from an archive | x |
These are often used in conjunction with compression. The `z` flag tells `tar` to compress the archive using `gzip`, while `j` makes it use `bzip2`.
For example, `tar -czf access_logs.tar.gz access_log*` would package all access logs into a gzipped tarball.
Please be aware that *nix does not treat file extensions in any special way. By convention, uncompressed tarballs end with `.tar`, while `.tar.gz` and `.tgz` denote gzipped tarballs, and `.tar.bz2` or `.tbz` are commonly used for bzip2-compressed tarballs.
To list the contents of the tarball created above, we would use `tar -tzf access_logs.tar.gz`.
TIP
Think of a way to do the same thing using `gzip` and `tar` separately.
Extracting the contents of our file can be done like this: `tar -xzf access_logs.tar.gz`.
In case you are wondering, `f` tells `tar` to use a file as its input or output. Without it, `tar` falls back to a built-in default, which on most Linux systems is STDIN/STDOUT. It is important to keep `f` at the end of the flag chain because `tar` expects whatever comes next to be the file name.
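
The three operations can be combined into a short session. The following is a sketch under a few assumptions: the log files are generated stand-ins, `restored` is a hypothetical target directory, and `-C` (not covered above) is the `tar` option that switches to a directory before extracting.

```shell
#!/bin/sh
set -eu
workdir=$(mktemp -d)   # hypothetical scratch directory
cd "$workdir"

# Hypothetical log files standing in for real access logs.
for day in 1 2 3; do
    seq 1 1000 | sed "s|^|day $day request |" > "access_log.$day"
done

# c: create, z: compress with gzip, f: the next argument is the archive name
tar -czf access_logs.tar.gz access_log*

# t: list the archive's contents
listing=$(tar -tzf access_logs.tar.gz)
echo "$listing"
count=$(printf '%s\n' "$listing" | wc -l)

# x: extract; -C extracts into another directory so the originals stay untouched
mkdir restored
tar -xzf access_logs.tar.gz -C restored

# The extracted files should match the originals.
if cmp -s access_log.1 restored/access_log.1; then
    restored_ok=yes
else
    restored_ok=no
fi
echo "files in archive: $count, first file restored correctly: $restored_ok"

cd / && rm -rf "$workdir"
```

Note that the shell expands `access_log*` before `tar` runs, so in a directory that already contains `access_logs.tar.gz` the old archive would be matched and packed as well.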
`tar` is very powerful; read its documentation for more advanced features.
# Zip
ZIP is the de facto standard for compression and archiving in the Microsoft world. Although it is not as commonly used on Linux, it is still well supported.
The commands `zip` and `unzip` do very much what you would expect them to do. However, they behave quite differently from what one is used to with `tar`/`gzip`.
One of the main differences is that both commands expect the archive to act upon as their first parameter. Also, `zip` will not recurse into subdirectories by default. This means that adding `logs` instead of `logs/*` will only add an empty directory to the archive. As an alternative, `zip` offers the `-r` flag, which results in `tar`-like behavior.
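
The recursion difference is easy to observe. Below is a minimal sketch assuming the Info-ZIP versions of `zip` and `unzip`; the directory contents and archive names are hypothetical, and `unzip -Z1` (list entry names only) is used purely to count the entries.

```shell
#!/bin/sh
set -eu
# Skip the demo if the tools are not available on this system.
if ! command -v zip >/dev/null || ! command -v unzip >/dev/null; then
    echo "zip/unzip not installed"
    exit 0
fi

workdir=$(mktemp -d)   # hypothetical scratch directory
cd "$workdir"

mkdir logs
echo "first entry"  > logs/access_log.1
echo "second entry" > logs/access_log.2

# The archive name comes first, followed by what should be added.
zip -q flat.zip logs           # no recursion: only the directory entry itself
zip -q -r recursive.zip logs   # -r descends into logs/

unzip -l flat.zip
unzip -l recursive.zip

# Count the entries in each archive (one name per line).
flat_entries=$(unzip -Z1 flat.zip | wc -l)
recursive_entries=$(unzip -Z1 recursive.zip | wc -l)
echo "flat: $flat_entries entry, recursive: $recursive_entries entries"

cd / && rm -rf "$workdir"
```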
More details can be found in the respective manpages.