# Archiving and Compression

While archiving refers to combining multiple files into one for easier transmission or storage, compression is all about making files smaller by removing redundant information.

Both processes are often used together in file archiving, the process of storing and transmitting one or more files as efficiently as possible.

Although disk space is cheap nowadays, file archiving still offers benefits when transmitting large numbers of individual files (think source files) or when working with truly large amounts of data (think logs). Additionally, some long-term storage solutions work far more efficiently when presented with a continuous stream of data rather than individual files.

# Compression

When it comes to compression there are two types:

- Lossy, in which information is removed, so that decompressing the compressed file results in a slightly different file than the original. For example, slight variations of green in an image may differ before and after compression.
- Lossless, in which no information is removed. The decompressed file is identical to the original file before it was compressed.

Many image formats, such as JPEG, use some kind of lossy compression. Be aware that compressing an already compressed file will usually not make it smaller unless you are willing to sacrifice more information, which will eventually lead to unrecognizable results. Please do your own research if you are interested in more information about compression.
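The point about compressing twice can be checked with a small experiment. This is a minimal sketch, assuming gzip (introduced below) and seq are available; the file names and the sample data are made up for illustration. It prints the three sizes so you can compare them yourself; the second pass typically gains little or nothing.

```shell
#!/bin/sh
set -eu

# Work in a throwaway directory that is removed when the script exits.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"

# Hypothetical sample data: 20000 numbered lines, which is fairly redundant.
seq 1 20000 > sample.txt

# -c writes the result to STDOUT and leaves the input file untouched.
gzip -c sample.txt > once.gz
gzip -c once.gz > twice.gz

# wc -c reports the size in bytes.
orig_size=$(wc -c < sample.txt)
once_size=$(wc -c < once.gz)
twice_size=$(wc -c < twice.gz)

echo "original:         $orig_size bytes"
echo "compressed once:  $once_size bytes"
echo "compressed twice: $twice_size bytes"
```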

Linux provides several tools to compress files.

# gzip

gzip is the most commonly used compression utility for Linux. It is based on the DEFLATE algorithm, a lossless algorithm based on LZ77 and Huffman coding.

gzip access_log

will compress the file access_log and replace it with its compressed version. By default gzip will append .gz to the filename.

gzip -l access_log.gz

will show you statistics about the compressed file.

gunzip is the inverse command to gzip.
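The three commands above can be tried together in a scratch directory. The following is a sketch under assumptions: the generated access_log is a made-up stand-in for a real log, and access_log.orig is a hypothetical reference copy kept only to verify that the round trip is lossless.

```shell
#!/bin/sh
set -eu

# Work in a throwaway directory that is removed when the script exits.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"

# Generate a made-up access log with 2000 similar lines.
i=0
while [ "$i" -lt 2000 ]; do
    printf '127.0.0.1 - - "GET /index.html HTTP/1.1" 200 %s\n' "$i"
    i=$((i + 1))
done > access_log
cp access_log access_log.orig   # reference copy for the comparison below

gzip access_log                 # access_log is replaced by access_log.gz
gzip -l access_log.gz           # compressed size, uncompressed size, ratio
gunzip access_log.gz            # access_log is back, access_log.gz is gone

# cmp -s succeeds only if both files are byte-for-byte identical.
if cmp -s access_log access_log.orig; then
    result=identical
else
    result=different
fi
echo "restored file is $result"
```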

TIP

Try figuring out how to gunzip a file without using gunzip and then check what gunzip actually does.

gzip can also be used as a filter:

pg_dump my_db | gzip > my_db_backup.gz

will take the database dump from PostgreSQL and write the gzipped result to my_db_backup.gz.
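The same pattern works with any program that writes to STDOUT. In this sketch, seq stands in for pg_dump so that it runs without a database, and numbers.gz is a hypothetical file name. It also shows the reverse direction: gunzip -c decompresses to STDOUT, so the data can flow into another pipe.

```shell
#!/bin/sh
set -eu

# Work in a throwaway directory that is removed when the script exits.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"

# seq stands in for pg_dump here; gzip reads STDIN and writes to STDOUT.
seq 1 1000 | gzip > numbers.gz

# gunzip -c writes the decompressed data to STDOUT and keeps numbers.gz,
# so no uncompressed copy is ever created on disk.
line_count=$(gunzip -c numbers.gz | wc -l)
echo "lines restored: $line_count"
```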

# bzip2

bzip2 is very similar to gzip but usually offers better compression at the expense of more CPU time.
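To see the trade-off on your own data, you can compress the same file with both tools and compare the sizes. This is a rough sketch with made-up sample data; it assumes bzip2 may not be installed everywhere and skips that part if the command is missing. The actual numbers depend heavily on the input.

```shell
#!/bin/sh
set -eu

# Work in a throwaway directory that is removed when the script exits.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"

# Hypothetical sample data.
seq 1 20000 > sample.txt
echo "original: $(wc -c < sample.txt) bytes"

# -c writes to STDOUT, so sample.txt stays in place for the next tool.
gzip -c sample.txt > sample.txt.gz
echo "gzip:     $(wc -c < sample.txt.gz) bytes"

status=skipped
if command -v bzip2 >/dev/null 2>&1; then
    bzip2 -c sample.txt > sample.txt.bz2
    echo "bzip2:    $(wc -c < sample.txt.bz2) bytes"

    # bzip2 is lossless too: decompress to STDOUT and compare with the input.
    if bunzip2 -c sample.txt.bz2 | cmp -s - sample.txt; then
        status=ok
    else
        status=mismatch
    fi
else
    echo "bzip2 is not installed, skipping"
fi
```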

# Archiving

The traditional tool for archival in the *nix world is tar which stands for Tape ARchive. It was originally used to stream files to a tape for backup or file transfer. It creates a single output file which can later be split back into the original source files. This file is often called a tarball. tar has three basic features:

| Operation | Flag |
| --- | --- |
| Create an archive | c |
| List an archive's contents | t |
| Extract file(s) from an archive | x |

These operations are often used in conjunction with compression. The z flag tells tar to compress the archive using gzip, while j makes it use bzip2.

For example,

tar -czf access_logs.tar.gz access_log*

would package all access logs into a gzipped tarball.

Please be aware that *nix does not treat file extensions in any special way. By convention, uncompressed tarballs end with .tar, while .tar.gz and .tgz denote gzipped tarballs and .tar.bz2 or .tbz are commonly used for bzip2-compressed tarballs.

To list the contents of the tarball created above, we would use

tar -tzf access_logs.tar.gz

TIP

Think of a way to do the same thing using gzip and tar separately.

Extracting the contents of our file can be done like this:

tar -xzf access_logs.tar.gz
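Create, list and extract can be tried end to end in a scratch directory. The sketch below uses two made-up log files and a hypothetical restored directory; the -C flag, which is not covered above, tells tar to change into that directory before extracting so the originals are not overwritten.

```shell
#!/bin/sh
set -eu

# Work in a throwaway directory that is removed when the script exits.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"

# Two made-up log files standing in for real access logs.
echo "first log"  > access_log
echo "second log" > access_log.1

tar -czf access_logs.tar.gz access_log*   # c: create a gzipped tarball
tar -tzf access_logs.tar.gz               # t: list its contents

mkdir restored
tar -xzf access_logs.tar.gz -C restored   # x: extract into restored/

# The extracted files match the originals byte for byte.
cmp access_log   restored/access_log
cmp access_log.1 restored/access_log.1
echo "both files restored"
```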

In case you are wondering, f tells tar to read from or write to a file. Without it, tar typically works with STDIN/STDOUT. It is important to keep f at the end of the flag chain because tar takes whatever comes next as the file name.
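The effect of misplacing f can be observed safely in a scratch directory. This is a sketch with hypothetical file names; it describes how tar implementations with the usual dash-style option parsing (such as GNU tar) are expected to behave. The second command is deliberately wrong, so its failure is tolerated.

```shell
#!/bin/sh
set -eu

# Work in a throwaway directory that is removed when the script exits.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"

echo "some notes" > notes.txt

# Correct: f is last, so notes.tar.gz is taken as the archive name.
tar -czf notes.tar.gz notes.txt

# Mistake: z directly follows f, so tar takes "z" as the archive name.
# wrong.tar.gz is then treated as a file to add, but it does not exist,
# so tar is expected to complain and no wrong.tar.gz is created.
tar -cfz wrong.tar.gz notes.txt || true

# Show what actually ended up in the directory.
ls
```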

tar is very powerful; read its documentation for more advanced features.

# Zip

ZIP is the de facto standard for compression and archiving in the Microsoft world. Although it is not as commonly used on Linux, it is still well supported. The commands zip and unzip do very much what you would expect them to do. However, they behave quite differently from what one is used to from tar and gzip. One of the main differences is that both commands expect the archive to act upon as their first parameter. Also, zip does not recurse into subdirectories by default. This means that adding logs instead of logs/* will only add an empty directory to the archive. As an alternative, zip offers the -r flag, which results in tar-like behavior.
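The difference that -r makes is easiest to see by listing two archives side by side. This is a sketch under assumptions: it relies on the Info-ZIP zip and unzip commands, which are not installed on every system, and skips itself if they are missing. The directory and archive names are made up for illustration.

```shell
#!/bin/sh
set -eu

# Work in a throwaway directory that is removed when the script exits.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"

# A made-up directory with two small log files.
mkdir logs
echo "first log"  > logs/access_log
echo "second log" > logs/error_log

status=skipped
if command -v zip >/dev/null 2>&1 && command -v unzip >/dev/null 2>&1; then
    # The archive name comes first, followed by what should be added.
    zip -q flat.zip logs        # without -r: only the directory entry
    zip -q -r full.zip logs     # with -r: the directory and its files

    unzip -l flat.zip           # list the contents of each archive
    unzip -l full.zip

    # Extract into a separate directory (-d) and verify one file.
    mkdir restored
    unzip -q full.zip -d restored
    if cmp -s logs/access_log restored/logs/access_log; then
        status=ok
    else
        status=mismatch
    fi
else
    echo "zip or unzip is not installed, skipping"
fi
```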

More details can be found in the respective manpages.
