
How Computers Work, Part I
August 2001 • Vol. 5 Issue 3
Page(s) 156-161 in print issue

Unfolding File Compression & Reduction
The Making Of Miniatures
As quickly as technology evolves and changes, there are constants. One is the need for more drive space. Today, with 30GB hard drives being common, it’s hard to imagine that a 600MB drive (only 0.6GB) once seemed spacious. Not only were older drives small, they were expensive. That’s why companies such as Microsoft, Stac Electronics, and others developed cost-effective software solutions, such as DoubleSpace and Stacker, to increase drive space. These programs compressed the contents of a hard drive to free space.

As the Internet moved into the mainstream, compression became essential for shrinking files so they could travel faster over slow dial-up connections. Today, compression is as popular as ever and has moved beyond the computer, making products such as DVD (digital versatile disc) and DSS (Digital Satellite System) possible.



 Background. Most file compression techniques use an algorithm, a step-by-step procedure that carries out the compression. The type of compression performed, lossy or lossless, depends on the algorithm.

There are two types of compression: lossless and lossy. Lossless compression uses an algorithm that searches for long, repeated strings of data and replaces them with shorter references, so the original file can be reconstructed exactly. Lossy compression uses an algorithm that searches for data it can eliminate from the file entirely. A lossless compression method is often executed after a lossy algorithm has first eliminated the unneeded data. Lossy compression generally results in higher compression ratios than lossless compression, but it isn’t suited to files, such as programs, that can’t tolerate any loss of information.
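
To make the lossless idea concrete, here is a toy sketch in Python (our own illustration, not any product’s algorithm): run-length encoding collapses repeated characters into counts, and decoding rebuilds the original exactly.

```python
# Toy run-length encoder/decoder: a minimal illustration of lossless
# compression, where the original data can always be rebuilt exactly.

def rle_encode(data):
    """Collapse runs of repeated characters into (character, count) pairs."""
    runs = []
    for ch in data:
        if runs and runs[-1][0] == ch:
            runs[-1] = (ch, runs[-1][1] + 1)
        else:
            runs.append((ch, 1))
    return runs

def rle_decode(runs):
    """Expand (character, count) pairs back into the original string."""
    return "".join(ch * count for ch, count in runs)

original = "AAAAABBBCCCCCCCCDD"
encoded = rle_encode(original)
print(encoded)                          # [('A', 5), ('B', 3), ('C', 8), ('D', 2)]
assert rle_decode(encoded) == original  # lossless: perfect reconstruction
```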



 Lossless Compression Algorithms. Most of us have been frustrated waiting for a program or image to slowly creep down a phone line. In most cases, however, these files would take twice as long to transfer without lossless data compression. If you’ve downloaded files from the Web, you’re probably familiar with Zip files (which end with the .ZIP file extension). Zip is the most popular compression format for programs downloaded from the Internet. Utilities such as WinZip (http://www.winzip.com), ZipMagic (http://www.ontrack.com), and PKZIP (http://www.pkware.com) are staples on any Internet-savvy user’s system.



Archivers, such as the popular WinZip, enable users to put several files into one package for better organization and easier distribution of files.
These utilities use various spin-offs of an algorithm designed in 1977 by Abraham Lempel and Jacob Ziv. The LZ77 algorithm searches for redundant strings of bytes in a file and replaces them with a pointer to an earlier instance of the same string. The program’s specific compression algorithm determines how many bytes a single pointer can replace.
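
Here is a simplified sketch of the LZ77 idea in Python (the window size, minimum match length, and token format are our own illustrative choices, not those of any particular utility). The check that a pointer must be worth more than the literals it replaces anticipates the LZSS refinement described below.

```python
# Simplified LZ77-style compression: look back through a sliding window for the
# longest match, emit (distance, length) pointers, and fall back to literal
# characters when no worthwhile match exists.

def lz77_compress(data, window=4096, min_match=3):
    """Return a list of tokens: ('lit', char) or ('ptr', distance, length)."""
    tokens, pos = [], 0
    while pos < len(data):
        best_len, best_dist = 0, 0
        for cand in range(max(0, pos - window), pos):
            length = 0
            while pos + length < len(data) and data[cand + length] == data[pos + length]:
                length += 1
            if length > best_len:
                best_len, best_dist = length, pos - cand
        if best_len >= min_match:       # only use a pointer when it pays for itself
            tokens.append(('ptr', best_dist, best_len))
            pos += best_len
        else:
            tokens.append(('lit', data[pos]))
            pos += 1
    return tokens

def lz77_decompress(tokens):
    out = []
    for token in tokens:
        if token[0] == 'lit':
            out.append(token[1])
        else:
            _, dist, length = token
            for _ in range(length):     # byte-by-byte copy handles overlapping matches
                out.append(out[-dist])
    return ''.join(out)

text = "this is a test. this is only a test."
assert lz77_decompress(lz77_compress(text)) == text
```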

After using LZ77, many compression methods further compress a file with Shannon-Fano coding, which assigns fewer bits to represent common characters and more bits to represent less common characters. (A bit has a binary value of either 0 or 1.) This coding method ranks the characters by the probability of their occurrence, then divides them into two groups whose total probabilities are roughly equal. It assigns a 0 to the first group and a 1 to the second; because the first group contains the most probable characters, it usually holds fewer of them. This bit becomes the first bit of the code for each character in the group. Each group is then divided in the same manner, with each division adding another bit to the code, and the process finishes when every character has a unique code. The less probable the character, the more bits its code has.
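
A sketch of that splitting procedure, again in Python and again purely illustrative (real implementations work on bytes and carefully built frequency tables):

```python
# Shannon-Fano coding: sort symbols by frequency, split them into two groups of
# roughly equal total frequency, give one group a leading 0 and the other a
# leading 1, and recurse until every symbol has a unique code.
from collections import Counter

def shannon_fano(symbols):
    """symbols: list of (symbol, frequency) pairs sorted by descending frequency."""
    if len(symbols) == 1:
        return {symbols[0][0]: ''}
    total, running, split = sum(f for _, f in symbols), 0, 1
    for i, (_, freq) in enumerate(symbols):
        running += freq
        if running >= total / 2:
            split = i + 1
            break
    split = min(split, len(symbols) - 1)        # both halves must be non-empty
    codes = {}
    for sym, code in shannon_fano(symbols[:split]).items():
        codes[sym] = '0' + code
    for sym, code in shannon_fano(symbols[split:]).items():
        codes[sym] = '1' + code
    return codes

counts = Counter("this is an example of shannon-fano coding")
table = shannon_fano(sorted(counts.items(), key=lambda kv: -kv[1]))
print(table)   # frequent characters (like the space) get short codes
```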

Huffman coding can also be used instead of Shannon-Fano. It is similar to Shannon-Fano in that it assigns variable-length codes to characters based on each character’s probability. Huffman, however, builds a coding tree from the character frequencies and reads the variable-length codes of 0s and 1s off that tree. Huffman coding is used in some PKZIP versions.
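
For comparison, here is a compact Huffman coder in Python (a generic textbook version, not PKZIP’s implementation):

```python
# Huffman coding: repeatedly merge the two least frequent nodes into a parent
# until one tree remains, then read 0/1 codes off the path from root to leaf.
import heapq
from collections import Counter
from itertools import count

def huffman_codes(text):
    freq = Counter(text)
    tiebreak = count()        # keeps heap comparisons away from the tree nodes
    heap = [(f, next(tiebreak), sym) for sym, f in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tiebreak), (left, right)))
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):      # internal node: (left, right)
            walk(node[0], prefix + '0')
            walk(node[1], prefix + '1')
        else:                            # leaf: a single character
            codes[node] = prefix or '0'
    walk(heap[0][2], '')
    return codes

phrase = "compression favors the frequent"
codes = huffman_codes(phrase)
encoded = ''.join(codes[ch] for ch in phrase)
print(codes)
print(len(encoded), "bits versus", len(phrase) * 8, "bits uncompressed")
```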

In 1982, James Storer and Thomas Szymanski developed a more efficient LZ-based algorithm now known as LZSS. LZSS checks that a pointer is actually smaller than the string it would replace; if it isn’t, the algorithm outputs the literal characters instead of a pointer and moves on. In 1978, Lempel and Ziv had introduced a second compression algorithm, known as LZ78, which places redundant strings in a dictionary and assigns each string a code word that replaces it in the file.

Terry Welch made a significant improvement on LZ78 in 1984. The resulting LZW algorithm is more efficient than the original LZ78, though the two are similar. The major difference lies not in the encoding or decoding process but in the flexibility of the dictionary: LZW supports variable code word sizes and the ability to delete dictionary entries. A variant of LZW known as LZC is the basis of the Unix compress utility.
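
A compact sketch of the dictionary-based approach LZW takes (the classic textbook version, without the variable code sizes or dictionary pruning mentioned above):

```python
# LZW: start with a dictionary of all single characters, grow it as the input
# is read so repeated strings collapse to one code, and let the decoder rebuild
# the identical dictionary on its own as it decodes.

def lzw_compress(data):
    dictionary = {chr(i): i for i in range(256)}
    next_code, current, output = 256, '', []
    for ch in data:
        if current + ch in dictionary:
            current += ch                          # keep extending the match
        else:
            output.append(dictionary[current])
            dictionary[current + ch] = next_code   # learn a new string
            next_code += 1
            current = ch
    if current:
        output.append(dictionary[current])
    return output

def lzw_decompress(codes):
    dictionary = {i: chr(i) for i in range(256)}
    next_code = 256
    previous = dictionary[codes[0]]
    result = [previous]
    for code in codes[1:]:
        entry = dictionary[code] if code in dictionary else previous + previous[0]
        result.append(entry)
        dictionary[next_code] = previous + entry[0]
        next_code += 1
        previous = entry
    return ''.join(result)

text = "TOBEORNOTTOBEORTOBEORNOT"
packed = lzw_compress(text)
print(len(packed), "codes for", len(text), "characters")
assert lzw_decompress(packed) == text
```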



 Lossless Compression Applications. Lossless compression algorithms are useful wherever data must survive intact: archiving files for Internet transmission, minimizing storage space, freeing hard drive space with drive compression utilities, and speeding up transfers over an Internet connection.

Archivers. Software such as PKZIP, WinZip, and ZipMagic are known as archivers. They compress data and package it into an archive, a single file containing one or more compressed files. Archiving makes it easier to distribute a program made up of numerous files. PKZIP and other Zip archive utilities use the LZSS algorithm, combined with Shannon-Fano coding in earlier versions and Huffman coding in later ones. Most of these utilities can also handle compression methods used by other programs, such as ARJ (http://www.arjsoftware.com) and ZOO (http://helpdesk.uvic.ca/how-to/exchange/internet/coding/zoo.html), which use compression algorithms similar to PKZIP’s.
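
To see what archiving looks like in practice, here is a brief sketch using Python’s standard zipfile module, which reads and writes the same Zip container these utilities use (modern Zip files rely on deflate, an LZ77-plus-Huffman method). The file names and contents are invented for the example.

```python
# Package several files into one Zip archive, then list and unpack it.
from pathlib import Path
import zipfile

names = ("readme.txt", "notes.txt", "data.csv")
for name in names:                               # create small placeholder files
    Path(name).write_text(f"example contents of {name}\n" * 100)

with zipfile.ZipFile("bundle.zip", "w", compression=zipfile.ZIP_DEFLATED) as archive:
    for name in names:
        archive.write(name)                      # add each file to the archive

with zipfile.ZipFile("bundle.zip") as archive:
    print(archive.namelist())                    # ['readme.txt', 'notes.txt', 'data.csv']
    archive.extractall("unpacked")               # restore the files to a folder
```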

Other utilities are more specific to particular platforms. For example, StuffIt (http://www.aladdinsys.com) is a popular Macintosh compression and archive utility.

Drive compression. When 600MB hard drives were expensive, this type of software compression was extremely useful. Several companies, including Stac Electronics and Microsoft, created drive-compression programs. The DoubleSpace drive compressor bundled with Microsoft’s MS-DOS 6.0 OS (operating system) was somewhat buggy. Stac Electronics (now Previo) provided the first reliable drive compressor with Stacker. Microsoft’s later drive compressors, such as DriveSpace 2.0 for Windows 95 and DriveSpace 3.0 (included in Microsoft Plus! for Win95 and Windows 98), were more stable than the initial versions. Cheyenne’s Infinite Disk was another program that constantly monitored a hard drive and compressed or expanded its contents as needed to balance performance and space.

DriveSpace 3.0 and other drive compression utilities stored the contents of a compressed drive inside a single large file known as a CVF (Compressed Volume File), which resided on an uncompressed host drive and was presented to the user as a virtual compressed drive. Uncompressed data, including the system files the OS needed to load, was kept on the host drive, which was typically assigned its own letter (such as H:) and could be accessed like any other drive.

Few, if any, popular drive compression utilities are still available, primarily because of larger drives and lower prices per megabyte.

Data transfer compression. As mentioned earlier, transmitting compressed files saves considerable time. Modem manufacturers realized this and adopted a compression standard, known as V.42bis, based on the LZW algorithm. The standard provides on-the-fly compression and decompression of data passing through the modem.

Of course, some data is already compressed, and compressing it again can actually slow the transfer. If the data can’t be compressed further, the sending modem signals the receiving modem to treat the incoming data as-is; the receiving modem won’t attempt to decompress arriving data again until it receives an escape signal indicating the sender has resumed compression. The V.42bis standard was expensive to license, so it saw little use outside modems.
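
That fallback behavior is easy to mimic. The sketch below is not the V.42bis algorithm itself; it simply illustrates the idea of compressing each block on the fly (here with Python’s zlib) and sending it untouched, behind a flag byte standing in for the escape signal, whenever compression doesn’t shrink it.

```python
import zlib

def send_block(block):
    """Return a framed block: flag byte 1 = compressed payload, 0 = raw payload."""
    compressed = zlib.compress(block)
    if len(compressed) < len(block):
        return b'\x01' + compressed
    return b'\x00' + block          # already-compressed data is sent untouched

def receive_block(frame):
    flag, payload = frame[0], frame[1:]
    return zlib.decompress(payload) if flag == 1 else payload

text = b"plain text compresses well " * 50
already_packed = zlib.compress(text)             # recompressing this gains nothing
for block in (text, already_packed):
    assert receive_block(send_block(block)) == block
```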



 Lossy Compression. Perhaps because the Web is a multimedia platform, lossy compression seems even more popular than lossless compression. Lossy algorithms take advantage of the limitations of the human senses to compress data further. In effect, a lossy algorithm re-creates a file that appears identical to the original but is very different at the bit level.

Lossy compression algorithms are limited in the types of files they can compress. For obvious reasons, you wouldn’t want to compress important program files using lossy compression. Multimedia files, however, are perfect candidates. Still images, video, and audio can all be compressed at much higher ratios with lossy algorithms, and depending on the algorithm, the data loss may be minimal and not even noticeable. Lossy algorithms are so well suited to multimedia files that they’re used in a wide range of digital devices, including satellite receivers, DVD players, and portable digital music players.

Unlike lossless compression, there are no general-purpose lossy algorithms. Instead, each algorithm is specialized for the kind of information it removes. These algorithms are usually developed by groups or committees that use them to define a standard, which can then be implemented in many programs. For example, MPEG (Moving Picture Experts Group) developed the algorithms used in the MPEG-1 standard, which numerous software developers have adopted.

Multimedia lossy compression. MPEG sounds like a committee that votes on the Oscars. In reality, this group of researchers defines the standards for compressed video and audio. Because of its acceptance as a standard, MPEG has quickly become popular on the computer platform and in consumer electronics. The MPEG standard has already gone through several incarnations, each one optimized for a different medium.

MPEG-1 was the original MPEG standard optimized to read video and audio from a CD at speeds up to 1.15Mbps (megabits per second). MPEG-1 uses a complex algorithm that throws away some data. It then predicts the lost data from future and past frames during playback.

The MPEG decompressor (the MPEG committee defines only the decompression side of the standard) starts from a reference frame and then uses information the encoder supplied during compression to predict intermediate frames from past and future frames.
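
A toy example of that prediction idea (not the MPEG algorithm, which works on motion-compensated blocks rather than whole pixels): guess an intermediate frame by averaging the reference frames on either side, and store only the small correction the guess gets wrong.

```python
# Predict a middle frame from a past and a future reference frame, then keep
# only the residual. For slowly changing video, the residual is mostly zeros
# and compresses extremely well.

def predict_middle(past, future):
    return [(a + b) // 2 for a, b in zip(past, future)]

def encode_middle(past, actual, future):
    predicted = predict_middle(past, future)
    return [a - p for a, p in zip(actual, predicted)]    # residual to store

def decode_middle(past, residual, future):
    predicted = predict_middle(past, future)
    return [p + r for p, r in zip(predicted, residual)]

past   = [10, 10, 50, 200]        # pixel brightness values in a reference frame
future = [12, 10, 54, 200]
middle = [11, 10, 52, 201]
residual = encode_middle(past, middle, future)
print(residual)                   # [0, 0, 0, 1] -> cheap to store
assert decode_middle(past, residual, future) == middle
```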

MPEG-1 can also encode audio tracks. If the tracks are part of a video, the decompressor can maintain timing information to synchronize the audio and video. It’s possible, however, to have an MPEG-1 encoded audio file without video or a video file without audio.

MPEG-1 audio compression removes sound data the human ear can’t detect and checks for redundancies. The standard defines three levels, or layers, of audio compression, each with increasing complexity and efficiency. Layer I algorithms do basic checks and produce a lightly compressed format. Layer II does a better job of eliminating unneeded information and compressing the file. Layer III provides the highest-quality compression for audio files but requires the most complex algorithm and the most computing power. Layer III has enjoyed popularity of late, despite finding itself a target of the record industry.
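
The sketch below is only a caricature of that principle (real Layer III coding uses filter banks, psychoacoustic masking models, and quantization): transform a block of samples to the frequency domain, silence components too faint to matter, and transform back. The signal, tone levels, and threshold are all invented for the example.

```python
# Drop frequency components that fall below a loudness threshold, then rebuild
# the waveform. The result is close to the original but not bit-identical: lossy.
import cmath, math

def dft(samples):
    n = len(samples)
    return [sum(samples[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(coeffs):
    n = len(coeffs)
    return [(sum(coeffs[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n).real
            for t in range(n)]

n = 64
loud  = [math.sin(2 * math.pi * 5 * t / n) for t in range(n)]            # audible tone
faint = [0.001 * math.sin(2 * math.pi * 20 * t / n) for t in range(n)]   # far too quiet to hear
signal = [a + b for a, b in zip(loud, faint)]

spectrum = dft(signal)
threshold = 0.01 * max(abs(c) for c in spectrum)
kept = [c if abs(c) >= threshold else 0 for c in spectrum]   # discard the faint tone

reconstructed = idft(kept)
error = max(abs(a - b) for a, b in zip(signal, reconstructed))
print(sum(1 for c in kept if c), "of", n, "components kept; max error", round(error, 4))
```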

The record industry understandably fears that a small file size (an easy download) and high audio quality are a serious threat to sales. Several companies rushed to create a new audio standard for downloadable music that includes piracy protection. Until recently, there was little success in slowing the growth of this explosive format, now commonly known as MP3 (MPEG-1 Layer III audio).

Microsoft’s Media Player is capable of playing MPEG-1 video and lower-quality audio, and it also supports MP3. Media Player is available for download from Microsoft at http://www.microsoft.com/windows/mediaplayer/en/default.asp.

MPEG-2 is a broadcast-quality video standard optimized for 4Mbps to 9Mbps transmissions. Satellite broadcasters such as DirecTV have adopted the MPEG-2 standard, and the DVD format uses MPEG-2 as well. A slightly modified version of MPEG-2 is the video standard for HDTV (high-definition television). MPEG has also put forth the MPEG-4 standard and is at work on the MPEG-7 and MPEG-21 standards.



 Image Compression. Bandwidth limitations still mean that, for many users, the Web is more of a graphical medium than an audio/video one. As a result, image compression is important to a user’s Web experience: smaller image files mean faster downloads. Of course, we also want to be impressed by high-quality graphics. Several compression formats have tried to strike a balance between picture quality and file size.

GIF. The GIF (Graphics Interchange Format) is an important type of lossless image compression, and it still enjoys some popularity on the Web because GIF files are better suited to hand-drawn graphics than other formats.

CompuServe introduced the GIF format in 1987. The GIF format uses the LZW algorithm. Unfortunately, this means commercial developers must license the LZW compression and decompression software from Unisys, which holds the patent on LZW compression.

GIF files can uniquely hold more than one image in a single file. This ability makes possible the animated GIFs seen on the Web. GIFs can also contain transparent backgrounds, which adds to their popularity in Web design.

PNG. PNG (pronounced “ping”) files were created as a lossless alternative to the GIF format. Instead of the LZW algorithm, PNG is based on a variation of the LZ77 algorithm (thus eliminating licensing fees). The variation is similar to LZW, but instead of indexing redundant strings in a dictionary, it replaces each string with a pointer to an earlier occurrence of the same string.

PNG is similar to GIF but supports higher-quality images, with greater color depth and better transparency. Not every GIF feature carries over, however: a PNG file holds only a single image, so PNG can’t replace animated GIFs.

JPEG. The lossy answer to GIF and PNG comes from JPEG (the Joint Photographic Experts Group), a subcommittee of the ISO (International Organization for Standardization). JPEG files have essentially replaced GIF as ruler of the Internet file compression world, primarily because JPEG images can store 24 bits of color per pixel (16 million colors), as opposed to GIF’s 8 bits per pixel (256 colors). JPEG is a lossy algorithm, so some information is lost during compression. Although small variations in color occur at standard levels of compression, they are almost unnoticeable to the human eye. At higher compression ratios, JPEG files can appear blocky and blurred. To address this, JPEG lets users control the level of compression, ensuring an appropriate balance between picture quality and file size.
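
That quality dial comes mostly from the quantization step. A sketch of the idea follows (JPEG first transforms 8 x 8 pixel blocks with a discrete cosine transform, which is omitted here, and the scaling formula and table values below are illustrative, not the ones in the JPEG specification): dividing coefficients by larger steps rounds more of them to zero, shrinking the file at the cost of detail.

```python
# Quantize a row of transform coefficients at several quality settings. Lower
# quality -> larger divisors -> more coefficients rounded to zero -> smaller
# file, but a blockier, blurrier picture when dequantized.

def quantize(coeffs, table, quality):
    scale = 50.0 / quality                 # illustrative scaling, not the JPEG formula
    return [round(c / (q * scale)) for c, q in zip(coeffs, table)]

def dequantize(values, table, quality):
    scale = 50.0 / quality
    return [v * q * scale for v, q in zip(values, table)]

coeffs = [312.0, -48.0, 21.5, -9.0, 4.2, -1.8, 0.9, 0.3]   # sample DCT-style coefficients
table  = [16, 11, 10, 16, 24, 40, 51, 61]                  # sample quantization divisors

for quality in (90, 50, 10):
    q = quantize(coeffs, table, quality)
    restored = dequantize(q, table, quality)
    error = max(abs(a - b) for a, b in zip(coeffs, restored))
    print(f"quality {quality}: {q.count(0)} zero coefficients, worst error {error:.1f}")
```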

JPEG isn’t a single file format but rather a family of defined algorithms, so several variations of JPEG exist. The majority of JPEG files are “baseline.” One variation on the baseline format is progressive JPEG, which repositions data within the file so users can watch a progressive JPEG decompress and display at increasingly better quality. Other variations include a lossless JPEG compression format, which hasn’t achieved much popularity because of its exceptionally low compression ratios compared with baseline JPEG. A newer algorithm, JPEG-LS, has the potential to improve on the older lossless JPEG standard.

JPEG offers good compression and quality for natural images, but GIF is still best for drawn graphics. In addition, JPEG files don’t support transparent backgrounds, and the slight color variations lossy compression introduces make transparency difficult to fake by matching colors. JPEG has nevertheless become the most popular graphical compression algorithm on the Internet.

TIFF. TIFF (Tagged Image File Format) is another popular standard. Developed by Aldus, it can be used in Windows, Linux, and Macintosh systems. The TIFF format takes advantage of several different algorithms, some lossless and some lossy. Both JPEG and LZW are supported by TIFF.

TIFF does have its problems. Because TIFF supports so many optional features and compression methods, some image readers can’t read TIFF files written by other TIFF programs. This can leave one application unable to open a TIFF file produced by another application, even when both run on the same machine.



 Future Techniques. Research hasn’t ended with the success of LZ and the various lossy compression algorithms; work continues in fields that may replace current standards. Wavelet compression and WinRAR are promising, as is the newer Windows Media Audio and Video 8. Looking further ahead, fractal compression is an intriguing and controversial method that holds some promise.

Wavelet compression. Wavelet is a method for compressing images. As with JPEG, users can adjust the amount of compression to balance file size against picture quality. Wavelet files can often offer greater compression than JPEG while maintaining similar picture quality.

Wavelet is a lossy format and is similar to JPEG. Where JPEG divides an image into smaller blocks before compressing, wavelet algorithms compress the entire image at once, which lets a wavelet algorithm take advantage of more redundancies while eliminating the blockiness JPEG can cause.
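
A minimal one-dimensional Haar transform shows the flavor of the approach (real wavelet codecs use two-dimensional transforms and smarter wavelet functions): each pass stores pairwise averages plus “detail” values, and smooth data produces many tiny details that can be thrown away.

```python
# One level of a Haar wavelet transform on a row of pixel values, followed by a
# lossy step that discards small detail coefficients before reconstructing.

def haar_forward(values):
    averages = [(a + b) / 2 for a, b in zip(values[::2], values[1::2])]
    details  = [(a - b) / 2 for a, b in zip(values[::2], values[1::2])]
    return averages, details

def haar_inverse(averages, details):
    out = []
    for avg, det in zip(averages, details):
        out.extend([avg + det, avg - det])
    return out

row = [100, 102, 101, 99, 150, 152, 151, 149]     # one row of pixel brightness values
averages, details = haar_forward(row)
print(details)                                    # [-1.0, 1.0, -1.0, 1.0]: all tiny

kept = [d if abs(d) > 1.5 else 0 for d in details]  # throw away the small details
approx = haar_inverse(averages, kept)
print(approx)   # close to the original row, but no longer identical: lossy
```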

The Compression Engine is a compressor/decompressor for wavelet images available from CE Engines (http://www.cengines.com). We tested wavelet against JPEG compression and liked the results. Using a standard 800 x 600-pixel image of 1,440,054 bytes, JPEG compression reduced the image to an average of 227,320 bytes, 84% smaller than the original. Wavelet compression (at a standard compression ratio) reduced the image to 44,669 bytes, or 97% smaller than the original file, and the picture quality at the default level was comparable to that of the JPEG image. At the highest compression ratio, wavelet compression reduced the file to 2,285 bytes, more than 99% smaller than the original, but the picture became very blocky and blurred. (NOTE: Compression amounts vary from file to file and from algorithm to algorithm.)

RAR archives. Another compression and archive utility is WinRAR (http://www.rarsoft.com), which can perform many of the same functions as PKZIP. However, its original algorithm can provide better compression ratios, especially for multimedia files.

Fractal compression. Fractal compression is another method that promises nearly obscene compression ratios. Unfortunately, this method still appears to be a ways off.

In 1982, IBM scientist Benoit B. Mandelbrot published “The Fractal Geometry of Nature.” He coined the term “fractal” and defined it as a mathematical expression of an irregularly shaped object, such as a cloud. Before fractals, mathematical expressions could only describe rigid objects built from straight lines and set angles.

Michael Barnsley’s book “Fractals Everywhere” introduced the collage theorem, which calculates what a mathematical expression must look like to represent a natural image. This has led scientists to probe whether the process could be reversed to generate a mathematical expression from a natural image. One tongue-in-cheek solution is known as the Grad Student Algorithm, which consists of putting grad students in a lab with an image and not letting them out until they find the correct mathematical expression to represent it. This approach is popular with tenured professors, but understandably less so with grad students. To see examples of fractal images, check out the Images Created by Man & Chaos Web page at http://www.glyphs.com/art/fractals/frac_art.html.



 The Skinny. Compression is one of the underappreciated enablers of the digital age. Imagine open source developers trying to share their work without compression. We joke that the WWW should stand for World Wide Wait, but what if all those fancy Web images weren’t compressed? We shudder at the thought.

by Chad Denton

(The print issue includes accompanying graphics: Compressed Images, How File Compression Works, and The Squeeze Is On chart.)









© Copyright by Sandhills Publishing Company 2001. All rights reserved.