LZW Compression and Decompression

by Yi-Jou Chen

INTRODUCTION

Distributing concurrent computations provides parallelism and enhanced resource utilization. Parallelism is achieved by the simultaneous execution of different portions of the visualization on each of the workstations. Enhanced resource utilization is achieved by assigning computationally intensive portions of the visualization to the more powerful workstations.

In Data Explorer, distributed processing is accomplished through "execution groups", sets of tools that can be assigned to workstations. Once groups are created, they can be assigned to the workstations over which the visualization is to be distributed.

However, data transferred over the network generally contains significant redundancy, and on networks without high bandwidth this transfer cost often erodes the benefit of parallelism for complicated visualization computations.

The purpose of this project is to provide compression/decompression modules that reduce the amount of data sent over the network, using the LZW compression algorithm [1]. The main reasons for choosing the LZW algorithm are its speed and its compression ratio.

GOALS

The goal of this project is to analyze the data structures of Data Explorer and to apply LZW compression and decompression, implemented as C modules, to reduce network transmission time. To measure the performance of the compression module, two testing modules, "Time" and "Irregular", must also be created.

Another goal of the project is to measure the compression ratio achieved on Data Explorer's data structures (its data flow model) after applying the LZW algorithm. In addition, the compression ratio and the network time consumed before and after applying LZW must be compared.

APPROACH

This project consists of three main stages: analysis of the data model, creation of the C modules, and performance testing.

Since the LZW algorithm is usually applied to text files or directories in a file system, we must first analyze the whole data model of Data Explorer and find the relations among its different data structures. The Data Explorer data model supports various types of scientific simulation and observational data; in general, these fall into two categories, regular and irregular data structures. Other concerns of the data model include the data types, object types, standard components, the relations between components, and the shared data flow model.

In fact, most data structures in Data Explorer are compact: they do not actually occupy the amount of memory implied by their declaration. The compression module must ignore such structures; otherwise a simple regular data structure would be expanded into a complex irregular one, wasting both compression and transmission time.

Another important consideration is the attributes and relations among different data components, because the data model allows object sharing in the data flow network. Compression should not undermine the efficient implementation of the data flow programming model; therefore, the relations and attributes of the components must be preserved after the compression module is applied.

Based on the above analysis, these concerns can be integrated into the LZW compression program. The main advantage of the LZW algorithm is its "greedy" parsing: the input is examined character-serially in a single pass, and at each step the longest string recognized in the string table is parsed off. Another important advantage is that no extra space is needed for the table in the compressed data; the decompression module reverses the process and rebuilds its own table to restore the original data, in one pass, using no information from the compression module except the code stream itself.

The compression module has two inputs: the input object, and an integer that selects which components should be compressed. Currently four kinds of components can be compressed, "position", "connection", "color", and "data", because these four components often require large arrays. The input integer is formed by combining (bitwise "or", i.e. summing) four flag values, 1, 2, 4, and 8, which represent "position", "data", "connection", and "color" respectively. For example, an input integer of 15 means that all four components will be compressed.
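
As an illustration, here is a minimal sketch of how the module might decode this selection integer. The macro names are hypothetical; only the numeric values and component names come from the description above.

	#define COMP_POSITION   1   /* "position" component   */
	#define COMP_DATA       2   /* "data" component       */
	#define COMP_CONNECTION 4   /* "connection" component */
	#define COMP_COLOR      8   /* "color" component      */

	int flags = 15;                       /* 1+2+4+8: compress all four */
	if (flags & COMP_POSITION)   { /* compress the "positions" array   */ }
	if (flags & COMP_DATA)       { /* compress the "data" array        */ }
	if (flags & COMP_CONNECTION) { /* compress the "connections" array */ }
	if (flags & COMP_COLOR)      { /* compress the "colors" array      */ }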

In the UNIX file system, LZW compression is typically applied to a single file or to all the files in a directory. In the Data Explorer object structure, even though it has a similar hierarchical structure, the granularity of compression is different: within the same object, some components may be compressed and others not. Two integers are therefore placed at the head of each compressed component to distinguish it from non-compressed data.
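
The exact layout of this two-integer header is not reproduced here; purely as a hypothetical illustration of the idea, it might carry a marker value and the original size:

	/* Hypothetical illustration only; the actual header values used by
	   the module are not specified in this report. */
	struct compressed_header {
	    int marker;          /* distinguishes compressed from plain components */
	    int original_bytes;  /* uncompressed size, needed when restoring       */
	};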


LZW COMPRESSION MODULE DESIGN

The "LZWCompression" module written for DX 2.0 is based on the modified version of the LZW algorithm as described in IEEE Computer, June 1984. This section describes the feature how I implement the LZW algorithm and some detail about the concerns of DX's data structure and the relevant function in its library.

LZW Algorithm

The LZW algorithm is organized around a translation table that maps strings of input characters into fixed-length codes. The LZW string table has a prefix property: for every string in the table, its prefix string is also in the table. That is, if the string wK, composed of some string w and some single character K, is in the table, then w is also in the table. K is called the extension character on the prefix string w. The LZW string table contains strings that have been encountered previously in the message being compressed. It consists of a running sample of strings in the message, so the available strings reflect the statistics of the message.

Here is the LZW algorithm as given in [1].

Compression Module

	Initialize table to contain single-character strings
	Read first input character -> prefix string w
	Step:	Read next input character K
		if no such K (input exhausted):
			code(w) -> output; EXIT
		if wK exists in string table:
			wK -> w; repeat Step.
		else (wK not in string table):
			code(w) -> output;
			wK -> string table;
			K -> w; repeat Step.
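
Below is a minimal, self-contained C sketch of the compression loop above. It is illustrative only: it emits fixed 16-bit codes instead of the variable-length codes used by the real module, and it uses a simple linear string table rather than a hash table.

	#include <stdio.h>

	#define TABLE_SIZE 4096          /* 12-bit code space                  */

	static int prefix[TABLE_SIZE];   /* code of prefix string w            */
	static int extend[TABLE_SIZE];   /* extension character K              */
	static int next_code;            /* next free code in the string table */

	/* Look up the string wK in the table; return its code or -1. */
	static int find(int w, int K)
	{
	    int i;
	    for (i = 256; i < next_code; i++)
	        if (prefix[i] == w && extend[i] == K)
	            return i;
	    return -1;
	}

	void lzw_compress(FILE *in, FILE *out)
	{
	    int w, K, code;

	    next_code = 256;             /* codes 0..255 are single characters */
	    w = getc(in);                /* first input character -> w         */
	    if (w == EOF)
	        return;

	    while ((K = getc(in)) != EOF) {
	        code = find(w, K);
	        if (code >= 0) {         /* wK is in the table: wK -> w        */
	            w = code;
	        } else {                 /* wK not in the table                */
	            putc(w >> 8, out);   /* code(w) -> output (16 bits here)   */
	            putc(w & 0xff, out);
	            if (next_code < TABLE_SIZE) {   /* wK -> string table      */
	                prefix[next_code] = w;
	                extend[next_code] = K;
	                next_code++;
	            }
	            w = K;               /* K -> w                             */
	        }
	    }
	    putc(w >> 8, out);           /* code(w) -> output; EXIT            */
	    putc(w & 0xff, out);
	}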
			
Decompression Module

	Decompression:	First input code -> CODE -> OLDcode;
			with	CODE = code(K),
				K -> output;
				K -> FINchar;
	Next Code:	Next input code -> CODE -> INcode;
			if no new code: EXIT
			if CODE not defined (special case):
				FINchar -> output;
				OLDcode -> CODE;
				code(OLDcode, FINchar) -> INcode;
	Next Symbol:	if CODE = code(wK):
				K -> stack;
				code(w) -> CODE;
				Go to Next Symbol;
			if CODE = code(K):
				K -> output;
				K -> FINchar;
			Do while stack not empty:
				stack top -> output;
				POP stack;
			OLDcode, K -> string table;
			INcode -> OLDcode;
			Go to Next Code;
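
A matching C sketch of the decompression loop is shown below. It uses the same simplifications as the compression sketch (fixed 16-bit codes, linear string table) and is illustrative only.

	#include <stdio.h>

	#define TABLE_SIZE 4096

	static int prefix[TABLE_SIZE];
	static int extend[TABLE_SIZE];
	static int next_code;

	/* Emit the string for 'code' (stack the extension characters, then pop
	   them in order); return its first character (FINchar). */
	static int output_string(int code, FILE *out)
	{
	    int stack[TABLE_SIZE], sp = 0, K;

	    while (code >= 256) {        /* CODE = code(wK): K -> stack, code(w) -> CODE */
	        stack[sp++] = extend[code];
	        code = prefix[code];
	    }
	    K = code;                    /* CODE = code(K): K -> output, K -> FINchar    */
	    putc(K, out);
	    while (sp > 0)               /* empty the stack                              */
	        putc(stack[--sp], out);
	    return K;
	}

	static int read_code(FILE *in)   /* read one fixed 16-bit code */
	{
	    int hi = getc(in), lo;
	    if (hi == EOF) return EOF;
	    lo = getc(in);
	    return (hi << 8) | lo;
	}

	void lzw_decompress(FILE *in, FILE *out)
	{
	    int oldcode, incode, code, finchar;

	    next_code = 256;
	    oldcode = code = read_code(in);   /* first input code -> CODE -> OLDcode */
	    if (code == EOF)
	        return;
	    finchar = code;
	    putc(finchar, out);               /* with CODE = code(K): K -> output    */

	    while ((incode = code = read_code(in)) != EOF) {
	        if (code >= next_code) {      /* CODE not defined (special case)     */
	            finchar = output_string(oldcode, out);
	            putc(finchar, out);       /* FINchar -> output                   */
	        } else {
	            finchar = output_string(code, out);
	        }
	        if (next_code < TABLE_SIZE) { /* OLDcode, K -> string table          */
	            prefix[next_code] = oldcode;
	            extend[next_code] = finchar;
	            next_code++;
	        }
	        oldcode = incode;             /* INcode -> OLDcode                   */
	    }
	}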
			
 	        		
           Figure 1. Shared Components among Different Fields

DX Data Model

As mentioned before, the DX data model differs from the file system but has a similar hierarchy. In addition to DX's special data structures, we must use the library supplied by Data Explorer when creating our own modules. The Data Explorer library is object oriented; that is, Data Explorer objects are data structures in global memory that are passed by reference and whose contents are private to the implementation. Furthermore, different objects can share common components (Figure 1).

Some terminology of DX's data model must be introduced before describing the implementation in detail.
Object: a data structure stored in memory that contains an indication of the object's type, along with additional type-dependent information.
Field: represents a mapping from some domain to some data space. The information in a field is represented by some number of named components.
Array: array objects hold the actual data, positions, connections, and so on.
Attribute: stores the relations among different objects.
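
As a sketch of how a module can visit the named components of a Field, the loop below assumes the library calls DXGetEnumeratedComponentValue and DXGetObjectClass behave as documented in the DX programmer's reference; "field" is an illustrative variable name.

	int    i;
	char  *name;
	Object comp;

	for (i = 0; (comp = DXGetEnumeratedComponentValue(field, i, &name)); i++) {
	    if (DXGetObjectClass(comp) != CLASS_ARRAY)
	        continue;                 /* only Array components hold bulk data */
	    /* check whether "name" ("positions", "colors", ...) was selected
	       for compression by the input integer described earlier            */
	}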


Regular vs Irregular Array

Irregular array: the most general way to specify the contents of an array; every item is stored explicitly.
Regular array: a set of n-dimensional points lying on a line in n-space with a constant n-dimensional delta between them; it can be represented compactly (by an origin and a delta) instead of storing every item.
		
		Figure 2. Irregular vs Regular Array


As described above, a regular array does not actually occupy memory according to its declaration. Therefore, before applying the compression procedure, we must check the array class of every member of the input object and skip regular arrays. In addition to the array class, we must also obtain other information about the array, such as its data type, number of items, and rank. One important difference between an array here and a file in the file system is that there is no EOF marker; hence we must determine the array's size before compressing it.
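
A sketch of this check is given below, assuming the DX library calls DXGetArrayClass, DXGetArrayInfo, and DXGetItemSize with their documented behavior.

	/* Sketch: skip compact (regular) arrays and compute the byte size of
	   an irregular array before compressing it. */
	static int array_byte_size(Array a)
	{
	    int      items, rank;
	    Type     type;
	    Category category;

	    if (DXGetArrayClass(a) != CLASS_ARRAY)   /* regular, product, mesh, ... */
	        return 0;                            /* leave compact arrays alone  */

	    if (!DXGetArrayInfo(a, &items, &type, &category, &rank, NULL))
	        return 0;

	    /* unlike a file, an array has no EOF marker: size = items * item size */
	    return items * DXGetItemSize(a);
	}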

Furthermore, after compressing an array, we must store the description of the original array, because the array must be restored when the decompression module executes. In addition to the array size, the attributes among the objects must remain unchanged; if one of these attributes is lost, Data Explorer cannot find the references and relations between two objects, which causes a fatal error.

Constraints of DX library

In DX's data model, deleting a component involves two important considerations. First, the deletion may simply decrement a reference count. Second, other components may depend on the deleted object. This reference and dependency information can only be accessed through the functions supplied by the DX library. Consequently, restrictions in the following function prevent the compression module from using the low-level data access that could give better performance.

DXAddArrayData: the input data can be replaced with the compressed data by this function; all attributes associated with the prior value are copied to the new value and supersede any attributes already attached to it. Usually, DXEndField must be called after applying this function, because it checks that the number of elements in a component declared to depend on a related component actually matches the number of positions in the field. In this project, however, no component can syntactically depend on any other component, because the size of the data is meaningless after compression. Therefore, DXEndField cannot be called as usual.
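
A rough sketch of this replacement step follows, assuming the library calls DXNewArray, DXAddArrayData, and DXSetComponentValue behave as documented; "field", "name", "buffer", and "nbytes" are illustrative names, not the module's actual code.

	/* Sketch: attach the compressed bytes as the new component value. */
	Array compressed = DXNewArray(TYPE_UBYTE, CATEGORY_REAL, 0);
	if (!compressed || !DXAddArrayData(compressed, 0, nbytes, (Pointer)buffer))
	    return ERROR;
	if (!DXSetComponentValue(field, name, (Object)compressed))
	    return ERROR;
	/* DXEndField is deliberately NOT called here: the compressed size no
	   longer satisfies the "dep" consistency check it would perform.     */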

Performance Testing Modules

There are several routines that can be used to measure the performance of the system. Calls to DXMarkTime() are made at key points in the system. Time marks are batched until a call to DXPrintTimes(). The printing of timing messages can be enabled by calling DXTraceTime().
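
For instance, the compression call can be bracketed as follows; this is simply a usage sketch of the three routines named above.

	DXTraceTime(1);                     /* enable printing of timing messages */
	DXMarkTime("before compression");   /* record a time mark                 */
	/* ... body of the LZWCompression module ... */
	DXMarkTime("after compression");
	DXPrintTimes();                     /* print the batched time marks       */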

In order to obtain large data arrays for performance testing, the Irregular module is created. This module forces the input object to expand its components from regular to irregular arrays.


RESULTS

Two examples are tested in this project:

EXAMPLE 1: The main objects use "Texture Mapping" and "Parametric Equation" to
generate two identical 3-D earths, and these two objects are distributed to two
different HP workstations, "grieg" and "wagner". A simple "translation"
operation is applied on the remote workstations. After these computations are
done, the two objects are received and collected into a compound object on the
master (local) machine. In this case, only the color component is compressed,
because it is the only one with an irregular array.

EXAMPLE 2: The same flow and operations as EXAMPLE 1, but using the "Quadric
Surface" module to generate the main objects. All four components are
compressed after applying the "Irregular" module to expand the regular
components.
			Table 1	

The static information is shown in Table 1. The LZW algorithm achieves a very
good compression ratio for the image data in EXAMPLE 1. On the other hand, the
compression ratio on DX's ordinary data structures is not as good as expected.
			Table 2

The network performance is illustrated in Table 2. The data is collected from
five executions of each example. The "NON" columns show the comparison case
that did not apply the LZW compression module but used the same flows and
structures. "MIN" and "MAX" are the minimum and maximum times over the five
executions. As expected, EXAMPLE 1 performs well in each category. On the other
hand, EXAMPLE 2 does not save time, because the time saved on the network does
not cover the time spent compressing the data.

GETTING THE MODULES

C source code

Examples


USING THE MODULES & EXAMPLES

Creating Execution Groups

By default, all tools in a visual program are in a single unnamed execution group, which is executed on the master workstation. New execution groups are created, and existing execution groups are modified and deleted, using the Execution Group dialog box.

After invoking the dialog box, assign a group name and click the "Add To" button to add the currently selected tools to an existing execution group. You can also use the "Show" button to select and display all the tools on the canvas that are members of the selected group.

In the example used for testing, I created four execution groups: two groups apply the compression module, and the other two are comparison groups. The two compression groups simply perform the "scale" computation.

Assigning Execution Groups to Workstations
Once you have decomposed your visual program into execution groups, you can assign these groups to workstations (or hosts). If you do not specify a host for a particular execution group, the group will be executed on the master. Execution groups are assigned to a host using the "Execution Group Assignment" dialog box.

Using the LZWCompression Module

           Figure 3. Using LZW Modules
As shown in Figure 3, the compression and decompression modules must be used as a pair. In general, the compression module is added to the diagram before the objects are distributed to different workstations, and the decompression module is executed after the remote workstation receives the objects.

CONCLUSIONS & FUTURE WORK

In general, the LZW algorithm is a fast compression method that achieves a good compression ratio on different kinds of data, and this project obtained the expected result on that point. However, we did not obtain a satisfactory improvement in network performance. According to my study, several reasons explain these results.

First, the DX library provides no low-level routines to access and manage its data structures, while compression techniques generally need system-level routines to manage the data. Data Explorer, as mentioned before, has an object-oriented data model with information hiding. Besides, it has its own data control units that determine what portions of the data model can be manipulated.

Second, when compressing for network transfer, we need to know the exact number of packets sent or received, both for performance testing and for applying the compression algorithm itself. Data Explorer does not provide any routines for network programming.

Finally, it is very difficult to get precise performance data in this project because of DX's data flow model. In such a model, we cannot know exactly the dependencies and references among objects; that is, we cannot get real-time information about the total size of the current objects.

Generally speaking, there may be a way to solve the above dilemma. At the network level, another process could intercept and compress the packets before DX sends them to the remote workstations; on the receiver side, it would intercept the incoming data and decompress it before passing it to DX's execution module. Such a scheme would be embedded in Data Explorer, so the user could not specify the compression module interactively.

As suggested by my advisor, the LZW compression module can be applied to Data Explorer in many other ways. One is to compress the data generated by the "Export" module. Future work could interactively interpret DX's data format and decompress the data from the stored file. This could also save a tremendous amount of file space when many users share Data Explorer on the same file system, since "file system full" problems are common in the CS 418 course.

References

[1] Terry A. Welch, "A Technique for High-Performance Data Compression," IEEE Computer, June 1984, pp. 8-19.

[2] IBM Data Explorer 2.0.