ECE 491 Fall 2007
Ankur Kumar (email@example.com)
Humans are limited to visualizing objects and datasets in at most three dimensions, yet visualizing objects in high dimensions (HiD) is often useful. Multivariate datasets can be difficult to visualize and graph, and several visualization schemes exist for them, such as attribute mapping, Chernoff faces, glyphs, dimensional reduction using linking, and parallel coordinates. Parallel coordinates (PC) is a visualization scheme in which all the dimensions are drawn parallel to each other, and each point is graphed as a polyline intersecting every parallel axis at the corresponding coordinate of the point. This is highly advantageous because it allows any number of dimensions to be graphed side by side, admits dualities with Cartesian coordinates, and preserves the geometry of the entire multivariate dataset for further analysis. To human intuition, however, this geometry can be harder to analyze: intuition for parallel coordinates must be learned through examples of familiar datasets. The goal of this project was to investigate visualizations of familiar datasets and to perform clustering, data mining, and related techniques on real and synthetic datasets, through code examples and by reading research papers on the topic. It is assumed that the reader has a knowledge of basic physics and computer science fundamentals. This website is organized as follows: Background gives a basic overview of parallel coordinates; Programs applies basic concepts of clustering and other algorithms to analyze and visualize datasets; Conclusions explains what has been learned from this project and the limitations of the techniques applied in parallel coordinates; and lastly, the Appendix collects all the references, code, animations, and datasets.
The purpose of this section is to give the reader a basic understanding of parallel coordinates before delving into the Programs section of the report. While researching multidimensional geometry, Alfred Inselberg realized that no good visualization technique existed for graphing multivariate datasets. Pursuing this interest, he came up with a visualization scheme for graphing multidimensional shapes, such as hypercubes, and for graphing multivariate datasets. This scheme was parallel coordinates. Professor Bruce Land, under whose supervision this project was carried out, explains the basics of parallel coordinates on his website; some of his work has been incorporated into this report.
Points and Lines
The primary duality between the Cartesian coordinate system and parallel coordinates concerns points and lines: a point in Cartesian coordinates becomes a polyline in PC, while a line in Cartesian coordinates becomes a point in PC.
Consider the three dimensions (x1, x2, x3) in Cartesian coordinates; in PC these dimensions are graphed parallel to each other. The coordinate values of a point (p1, p2, p3) in Cartesian coordinates are plotted as values on the respective parallel axes. Not only are the axes parallel to each other, but they also share a common scale. The figure below shows two sets of two points in Cartesian and parallel coordinates:
The plots above graph two sets of two points in PC (top of the figure) and in Cartesian coordinates (bottom). The first set consists of the points (-0.43, -1.67, 0.13) and (0.29, -1.15, 1.19), and the second set consists of the points (6.19, 4.96, 0.33) and (5.17, 4.81, 0.726). In Cartesian coordinates the points are graphed intuitively. In parallel coordinates, each dimension is graphed in parallel and the coordinate values of a point are connected by a polyline across these parallel dimensions. For instance, consider the point (6.19, 4.96, 0.33): in PC, the value 6.19 is located on the scale of dimension x1 and connected by a line segment to 4.96 on dimension x2; a second segment is then drawn from 4.96 to 0.33 on x3. This completes the graphing of a three-dimensional point in PC. As one can see, a point in Cartesian coordinates is a polyline in PC, and conversely, such a polyline in PC corresponds to a point in Cartesian coordinates.
The second duality is that a line in Cartesian coordinates is a point in PC. The following is an excerpt from Professor Bruce Land’s PC website:
The line segment x1=t; x2=1-.5*t; x3=0.5-.1*t; with 0<t<1 is drawn in normal 3D coordinates and in PC. The color is relative t value. Note that each point on the line becomes two line segments in PC (x1-to-x2 and x2-to-x3). The obvious convergence of all the line segments between the x1 and x2 axis is a direct consequence of the linearity of the points and partly defines the line. A less obvious convergence of line segments to the right of x3 completes the definition of the line. If the slope of the line relating one axis to another (e.g. x1 to x2) is m and the intercept is b in normal coordinates, then the position of the convergence in PC is the point [1/(1-m), b/(1-m)], assuming that the first axis (x1) is located at horizontal position zero and x2 is located at horizontal position 1. For the line given above the slope from x1 to x2, m=-0.5 and b=1. Thus the first convergence point should be [0.67, 0.67] as shown. The slope from x2 to x3, m=0.2 and b=0.3. Thus the second convergence point should be [1.25, 0.375], or just to the right of the x3 axis.
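The convergence point described in the excerpt can be checked numerically. The following is a minimal Python sketch (the project's own code is Matlab); the function name pc_dual_point is an illustrative choice, and the formula is the one quoted above, with the first axis at horizontal position 0 and the second at position 1.

```python
def pc_dual_point(m, b):
    """Dual point in PC of the Cartesian line x2 = m*x1 + b: the point where
    all PC segments of the line's points meet, per [1/(1-m), b/(1-m)]."""
    if m == 1:
        # A line of slope 1 maps to a point at infinity (segments are parallel).
        raise ValueError("slope 1 has no finite dual point")
    return (1 / (1 - m), b / (1 - m))

# The excerpt's line: x2 = 1 - 0.5*x1, i.e. m = -0.5, b = 1
print(pc_dual_point(-0.5, 1.0))  # ≈ (0.667, 0.667), the first convergence point
# x3 = 0.3 + 0.2*x2, i.e. m = 0.2, b = 0.3 (measured from the x2 axis)
print(pc_dual_point(0.2, 0.3))   # ≈ (1.25, 0.375), the second convergence point
```

Note how a slope near 1 pushes the dual point far from the axes, which is why nearly parallel Cartesian lines are awkward to read in PC.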
Other dualities include:
A simple example of a geometric shape that can be extended into high dimensions is a cube. Cartesian coordinates are limited to a 3-D cube, whereas parallel coordinates can graph an n-dimensional cube for any integer n. The hypercube code can graph any n-dimensional hypercube, provided that the computer has enough memory. The figures below show a 3-D cube followed by a 6-D hypercube in parallel coordinates. As one can see, a hypercube is easy to visualize in parallel coordinates and can be easily identified as well.
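The vertex generation behind such a plot is simple: each vertex of the n-cube is one polyline across n parallel axes. A Python sketch (the report's hypercube code is Matlab, so this is only an illustration of the idea):

```python
from itertools import product

def hypercube_polylines(n):
    """Vertices of the n-dimensional unit hypercube. In parallel coordinates,
    each vertex is drawn as one polyline through its n coordinate values."""
    return [list(v) for v in product([0, 1], repeat=n)]

lines = hypercube_polylines(6)
print(len(lines))      # 64 vertices, i.e. 64 polylines over 6 parallel axes
print(lines[0], lines[-1])  # [0, 0, 0, 0, 0, 0] and [1, 1, 1, 1, 1, 1]
```

Plotting is then just drawing each list against axis positions 1..n, which is why memory, not dimensionality, is the limiting factor: the vertex count doubles with every added dimension.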
As mentioned before, there is a duality between rotation and translation. A rotating 3-D cube is used to show this duality: in the rotating cube code, a 3-D cube is rotated in Cartesian coordinates, which appears as a translation of the cube in parallel coordinates. This video clearly demonstrates the effect. The figure below shows a 3-D cube rotated around the blue line.
The next section, Programs, gives in-depth examples involving clustering, principal component analysis, grand tour, density plots, oscillators, attractors, and fractals. All the programs are coded in Matlab.
This program aims to numerically solve for the displacement of a mass-spring system, a simple harmonic oscillator, in 2-D. A harmonic oscillator is a system in which the mass, m, is displaced from its equilibrium position, and experiences a restoring force F proportional to the displacement and the spring constant, k, according to Hooke’s Law, which is stated below in the table of equations. An animation from Wikipedia is shown below:
In two dimensions, each dimension is treated independently, and Hooke's Law applies separately in each. To solve for the displacements in each dimension, x and y, Newton's second law is integrated numerically with a time step, dT, of 0.1. Since the displacement is solved in 2-D, the harmonic oscillator must be started with an initial displacement, ix, at an angle, theta. The table of equations is given below.
The spring-mass oscillator code accepts the following parameters: the spring constant (k), the mass (m), the initial displacement (ix), and the angle of initial displacement (theta), measured in radians. The figure below shows the results for the arguments [100, 1, 1, pi/3] over 100 units of time.
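The integration scheme can be sketched in Python with the same parameters [100, 1, 1, pi/3]. One labeled deviation from the report: the report's Matlab code uses dT = 0.1, while this sketch uses a smaller step and the semi-implicit (velocity-first) Euler update, since plain explicit Euler with dT = 0.1 is numerically unstable for k/m = 100.

```python
import math

def spring_mass_2d(k, m, ix, theta, dt=0.01, steps=10000):
    """Semi-implicit Euler integration of a 2-D spring-mass oscillator.
    Each dimension is independent: F = -k*x (Hooke's law), a = F/m (Newton)."""
    x, y = ix * math.cos(theta), ix * math.sin(theta)  # initial displacement at angle theta
    vx = vy = 0.0
    xs, ys = [x], [y]
    for _ in range(steps):
        vx += (-k / m * x) * dt   # update velocity from acceleration first...
        vy += (-k / m * y) * dt
        x += vx * dt              # ...then position from the new velocity
        y += vy * dt
        xs.append(x)
        ys.append(y)
    return xs, ys

xs, ys = spring_mass_2d(100, 1, 1, math.pi / 3)
print(max(abs(v) for v in xs))  # stays near cos(pi/3) = 0.5, the initial x-amplitude
```

Because the two dimensions decouple, each trace is an ordinary cosine; in PC, each time sample (x, y) becomes a two-axis polyline.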
Another interesting problem is the coupled oscillator in 1-D. In this scenario, two masses are joined to each other by a spring, and the system is attached to fixed walls by two more springs. Energy is transferred between the two oscillators. Analytically, the equations for this system can be rather cumbersome, but using Hooke's and Newton's laws one can solve it numerically. The system of coupled oscillators can be considered as two identical free oscillators that are mutually coupled, which leads to the resulting differential equations:
The above equations can be difficult to solve. However, using the concept of a simple harmonic oscillator, the equations in the table below can be derived. The subscripts 1 and 2 correspond to the two individual oscillators coupled together.
The coupled oscillator code accepts the following parameters: the masses (m1, m2), the spring constants (k1, k2, k3), and the initial displacements (x1, x2). The figure below shows the displacement of both oscillators for the parameters [1, 1, 50, 66, 50, 40, -40]. The result can be verified using this coupled oscillator applet.
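A Python sketch of the same numerical solution follows (the report's code is Matlab). Assumptions are labeled in the comments: the spring ordering wall-k1-m1-k2-m2-k3-wall is assumed, since the report does not state which constant belongs to the middle spring, and semi-implicit Euler is used for stability.

```python
def coupled_oscillators(m1, m2, k1, k2, k3, x1, x2, dt=0.001, steps=20000):
    """Semi-implicit Euler solution of two masses joined by a middle spring.
    Assumed layout: wall - k1 - m1 - k2 - m2 - k3 - wall (not stated in the report)."""
    v1 = v2 = 0.0
    xs1, xs2 = [x1], [x2]
    for _ in range(steps):
        a1 = (-k1 * x1 + k2 * (x2 - x1)) / m1  # Hooke's law on mass 1
        a2 = (-k3 * x2 - k2 * (x2 - x1)) / m2  # Hooke's law on mass 2
        v1 += a1 * dt
        v2 += a2 * dt
        x1 += v1 * dt
        x2 += v2 * dt
        xs1.append(x1)
        xs2.append(x2)
    return xs1, xs2

# Report parameters [1, 1, 50, 66, 50, 40, -40]: equal masses, symmetric outer
# springs, antisymmetric start, so the motion stays antisymmetric: x1(t) = -x2(t).
xs1, xs2 = coupled_oscillators(1, 1, 50, 66, 50, 40, -40)
print(max(abs(a + b) for a, b in zip(xs1, xs2)))  # ~0
```

With a non-symmetric start (e.g. x2 = 0), the same code shows the energy-beating behavior between the two masses that the report describes.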
Lorenz Attractor and Julia Set Fractal
The Lorenz Attractor was introduced by Edward Lorenz in 1963. It is a nonlinear, three-dimensional, deterministic system, noted for its butterfly shape and its chaotic flow; it is a dynamical system that evolves in a non-repeating pattern. The Lorenz Attractor is a good example of a dynamical system to graph in parallel coordinates to visualize its properties. It is given by the following equations:
With different values of σ, ρ, and β, the structure of the Lorenz Attractor can be controlled, and interesting patterns such as a torus, T(3,2), can be plotted in both Cartesian and parallel coordinates. The code for the Lorenz Attractor requires solving a non-stiff differential equation, which is done using two Matlab scripts - lorenz.m and sol.m. The figures below show the Lorenz Attractor for various values of σ, ρ, and β in Cartesian coordinates, and in parallel coordinates, with the start point (1, 1, 1).
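The system itself is easy to integrate in a few lines. The sketch below uses forward Euler with a small step purely as an illustration; the report's lorenz.m/sol.m use a proper Matlab non-stiff ODE solver, and the default parameter values are the classic ones (σ=10, ρ=28, β=8/3), not necessarily those used in every figure.

```python
def lorenz_trajectory(sigma=10.0, rho=28.0, beta=8.0 / 3.0,
                      start=(1.0, 1.0, 1.0), dt=0.001, steps=20000):
    """Forward-Euler integration of the Lorenz system
    dx = sigma*(y-x), dy = x*(rho-z) - y, dz = x*y - beta*z."""
    x, y, z = start
    traj = [(x, y, z)]
    for _ in range(steps):
        dx = sigma * (y - x)
        dy = x * (rho - z) - y
        dz = x * y - beta * z
        x, y, z = x + dx * dt, y + dy * dt, z + dz * dt
        traj.append((x, y, z))
    return traj

traj = lorenz_trajectory()
# In PC, each sample (x, y, z) becomes one polyline over three parallel axes,
# so the whole trajectory is a band of 20001 polylines.
print(len(traj))
```

Despite the chaotic flow, the trajectory remains bounded on the attractor, which is what makes the PC band a stable, recognizable shape.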
From the figures above, one can see that the Lorenz Attractor yields interesting results for different values of σ, ρ, and β, and a lot of time can be spent on this dynamical system revealing fascinating structures. Another interesting problem to examine is a fractal, a complex dynamical system. The Julia set was picked because its values can be calculated from a family of quadratic polynomials. From Wikipedia, if f is a polynomial, then J(f) is the boundary of the filled Julia set; that is, of those points whose orbits under f remain bounded. The family of quadratic polynomials is a popular complex dynamical system and can be expressed as follows:
The parameter plane of quadratic polynomials, which is the plane of possible values of c, gives rise to the Mandelbrot set. For the purpose of this project, a simple backwards iteration algorithm for fc was coded to visualize the Julia set. Better algorithms can produce all of the parts of the Julia set image; however, implementing them is beyond the scope of this project. The backwards iteration algorithm is described as follows:
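The essence of backwards iteration can be sketched in Python (the report's implementation is Matlab; the starting point and the particular c value below are illustrative choices, not taken from the report). Each step applies a randomly chosen branch of the inverse map of fc(z) = z² + c, and the iterates accumulate on the Julia set.

```python
import cmath
import random

def julia_backwards(c, n_points=5000, seed=1):
    """Backwards (inverse) iteration for the Julia set of f_c(z) = z^2 + c:
    repeatedly apply a random branch of z -> +/- sqrt(z - c)."""
    random.seed(seed)
    z = complex(1.0, 1.0)            # almost any non-exceptional start works
    pts = []
    for i in range(n_points + 50):
        z = cmath.sqrt(z - c)        # principal branch of the inverse map
        if random.random() < 0.5:    # pick the other branch half the time
            z = -z
        if i >= 50:                  # discard a short transient
            pts.append(z)
    return pts

pts = julia_backwards(complex(-0.8, 0.156))
# All iterates stay inside the disk of radius (1 + sqrt(1 + 4|c|)) / 2.
print(max(abs(z) for z in pts))
```

The known weakness mentioned above is visible here: random branch choices visit some parts of J(f) far more often than others, so the plotted image misses regions that better algorithms fill in.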
Below are figures of some Julia set images calculated using backwards iteration algorithm in Cartesian coordinates and parallel coordinates.
From the fractals in PC shown above, one can see that parallel coordinates, in some sense, is restrictive when it comes to visualizing these fractals.
Graphing multi-dimensional data in parallel coordinates can make visualization easy. However, graphing the data alone does not reveal its structure for analysis purposes. The following sections extract the structure of multivariate datasets for further analysis.
Clustering is one of the fundamental techniques for finding structure in a dataset. It groups similar data points into clusters; statistics can then be run on those clusters and knowledge discovered in them. Clustering is often used to classify datasets into subgroups. These subgroups are formed because their members satisfy some distance measure constraint, which can be a common property shared by all the members of the subgroup.
Several clustering algorithms exist, and all of them have their strengths and weaknesses. Clustering algorithms fall into the following categories: hierarchical, partitioning, evolutionary, and constraint-based. This project deals with partitioning algorithms, especially the k-means algorithm, in which the dataset is partitioned into non-overlapping clusters. The k-means technique assigns each point to the nearest of k centroids and repeatedly recomputes each centroid until the k centroids, i.e. k clusters, stabilize. The centroid, C, of an n-dimensional dataset with m entries can be calculated as shown below. The symbol d_i denotes a single n-dimensional entry in the dataset.
The above formula calculates the centroid of the dataset; it is also the mean of the dataset. In an n-dimensional dataset of m entries, the centroid will have n-dimensions, and the sum of the squared error (SSE) of a cluster in the dataset can be calculated as shown below. SSE helps in understanding how distributed the members of a cluster are around the cluster’s centroid.
The issue with the k-means algorithm is that its results can depend strongly on the initial guess of the centroids. The number of clusters in the dataset is not known in advance, so the value of k must be guessed or defined. The algorithm is also sensitive to outliers, which can cause wild swings in the centroids and lead to unbalanced clusters. Another limitation is that a local optimum computed for a cluster need not be a global optimum for the entire dataset. For these reasons, a bisecting k-means algorithm was employed on the dataset. The algorithm is outlined as follows:
The advantage of the bisecting k-means algorithm is that no initialization of the centroids is required, although the desired number of clusters still needs to be defined. Bisecting k-means produces nearly uniform clusters with greater similarity among the members of each cluster, and its clusters have less entropy than those produced by the basic k-means algorithm. The total SSE of the entire dataset is the sum of every cluster's SSE. The total SSE needs to be reduced for optimal clustering, and this is achieved by the post-processing of step 7.
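The core of the procedure can be sketched in Python (the report's implementation is Matlab). This sketch repeatedly splits the cluster with the largest SSE using a basic 2-means step; the step-7 post-processing that refines the final centroids, and the report's handling of points whose inclusion would raise the SSE, are omitted.

```python
import random

def centroid(pts):
    """Mean of a list of equal-length tuples, one value per dimension."""
    n = len(pts[0])
    return [sum(p[d] for p in pts) / len(pts) for d in range(n)]

def dist2(p, c):
    return sum((p[d] - c[d]) ** 2 for d in range(len(c)))

def sse(pts, c):
    """Sum of squared errors of a cluster around its centroid."""
    return sum(dist2(p, c) for p in pts)

def two_means(pts, iters=20, seed=0):
    """Basic k-means with k = 2, used as the bisecting step."""
    random.seed(seed)
    c1, c2 = random.sample(pts, 2)
    for _ in range(iters):
        a = [p for p in pts if dist2(p, c1) <= dist2(p, c2)]
        b = [p for p in pts if dist2(p, c1) > dist2(p, c2)]
        if not a or not b:                       # degenerate split: halve instead
            return pts[: len(pts) // 2], pts[len(pts) // 2:]
        c1, c2 = centroid(a), centroid(b)
    return a, b

def bisecting_kmeans(pts, k):
    """Repeatedly bisect the cluster with the largest SSE until k clusters remain."""
    clusters = [pts]
    while len(clusters) < k:
        worst = max(clusters, key=lambda c: sse(c, centroid(c)))
        clusters.remove(worst)
        clusters.extend(two_means(worst))
    return clusters

# Sixteen points in four well-separated corner groups, split into k = 4 clusters.
data = [(x, y) for x in (0, 1, 10, 11) for y in (0, 1, 10, 11)]
clusters = bisecting_kmeans(data, 4)
print(len(clusters), sum(len(c) for c in clusters))  # 4 16
```

Note that no initial centroids are chosen for the overall clustering; randomness enters only inside each 2-means bisection, which is far less sensitive than seeding k centroids at once.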
Now that the clustering technique has been explained, a financial dataset is used to demonstrate clustering in parallel coordinates. Financial data for Lehman Brothers, in the format of date, opening price, high price, low price, closing price, and volume, was selected for analysis by the clustering program. The time period of the data runs from October 2006 to September 2007, numbered from 0 to 230. The dataset, available in the Appendix, was loaded using importfile.m. The number of desired clusters, k, was chosen as 4. The results are shown below.
The total number of data points is 230; however, only 208 points were clustered. The remaining points failed to join any of the desired clusters because their inclusion would increase the SSE. From the clusters, one can see that four time periods have been highlighted. A reader knowledgeable about the financial markets during these periods can analyze all the dimensions of the dataset in one pass. For example, the worst episode of the subprime mortgage crisis hit the market from mid-July 2007 to early September 2007, and the cyan cluster exemplifies the impact of this crisis on the stock value of Lehman Brothers: the stock is worth less than it was in October 2006, which is marked by the blue cluster. However, the cyan cluster also shows exceptional volume in comparison to the other clusters. This can be viewed as traders bargain hunting and loading up on Lehman Brothers' stock in hopes of future appreciation. The red cluster marks the first appearance of the subprime mortgage crisis in February 2007, when the stock value of Lehman Brothers clearly decreased. The stock then recovered somewhat, as the yellow cluster conveys, before the crisis stormed through the market once again in mid-July 2007. Overall, clustering can be very useful for finding structure in datasets and explaining certain aspects of the data.
Principal Component Analysis (PCA)
Principal Component Analysis (PCA) is a common technique for finding the principal components of a dataset. It is often useful for dimension reduction, reconstruction of the original dataset, and determining which dimension affects the dataset the most; it can also be used to find patterns in the dataset. Essentially, PCA is a transformation to a new coordinate system such that the greatest variance under any projection of the dataset comes to lie on the first coordinate, called the principal component. PCA is computed from the covariance matrix of the n-dimensional dataset, where covariance measures the amount by which the dimensions of a dataset vary from the mean with respect to each other. The eigenvectors of this covariance matrix describe the patterns and characteristics present in the dataset. The basic PCA algorithm is outlined below.
After finding the principal components of the dataset, the reconstructed data can be shown to be:
The above equation returns the original data solely in terms of the eigenvectors chosen in step 4. If all the eigenvectors were selected, the original data is restored perfectly. The equations shown below support the retrieval of the original data.
After performing PCA, dimension reduction can be performed and the dataset can be reconstructed by using only some of the principal components. Following this, one can think about applying cluster analysis to group these principal components together to see if any of these dimensions are related to each other. This can prove useful in drawing associations in multivariate datasets.
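The covariance-and-eigenvector computation can be sketched in Python for the two-dimensional case, where the eigendecomposition of the 2x2 covariance matrix has a closed form (the report's Matlab code handles n dimensions; the test data along y = 2x is an illustrative assumption, not from the report).

```python
import math

def pca_2d(data):
    """First principal component of 2-D data via the closed-form
    eigendecomposition of the 2x2 covariance matrix."""
    n = len(data)
    mx = sum(x for x, _ in data) / n          # step: subtract the mean
    my = sum(y for _, y in data) / n
    centered = [(x - mx, y - my) for x, y in data]
    # covariance matrix [[cxx, cxy], [cxy, cyy]]
    cxx = sum(x * x for x, _ in centered) / (n - 1)
    cyy = sum(y * y for _, y in centered) / (n - 1)
    cxy = sum(x * y for x, y in centered) / (n - 1)
    # largest eigenvalue of the symmetric 2x2 matrix
    tr, det = cxx + cyy, cxx * cyy - cxy * cxy
    l1 = tr / 2 + math.sqrt(tr * tr / 4 - det)
    # corresponding eigenvector = first principal component
    if abs(cxy) > 1e-12:
        v = (cxy, l1 - cxx)
    else:
        v = (1.0, 0.0) if cxx >= cyy else (0.0, 1.0)
    norm = math.hypot(*v)
    return (v[0] / norm, v[1] / norm), l1

# Points scattered tightly along y = 2x: the first PC should point along (1, 2).
data = [(t, 2 * t + 0.01 * ((-1) ** i)) for i, t in enumerate(range(-5, 6))]
pc1, var1 = pca_2d(data)
print(pc1)  # ≈ (0.447, 0.894), i.e. (1, 2) normalised
```

Projecting the centered data onto pc1 and multiplying back by pc1 gives the rank-1 reconstruction described by the equations above; keeping fewer eigenvectors trades reconstruction fidelity for dimension reduction.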
The dataset used to demonstrate the PCA code was financial data for the following companies: Apple, Amazon.com, Boeing, Baidu.com, Bear Sterns, Dow Jones Industrial Average Index, Google, Garmin, Goldman Sachs, Intuitive Surgical, Lehman Brothers, Lockheed Martin, Merrill Lynch, Nokia, Petro China, Research in Motion, Schlumberger, USD-Yen, Valero, and Exxon Mobil, from October 29th, 2006 to October 28th, 2007. The datasets are available in the Appendix. These companies were picked because of their activity during the time period mentioned, and they span five sectors of the market. First, the closing prices for all these companies are retrieved and their derivatives are calculated using the central difference formula. The entire dataset has 20 dimensions, with each company's change in closing price forming one dimension. Principal component analysis is then conducted on this dataset. The reconstruction is done with 15 dimensions, and most of the attributes of the original dataset are captured. The graphs shown below demonstrate PCA on the dataset described above. The labels on the graph that reconstructs the original data using only 15 principal components should be ignored.
After conducting PCA, basic k-means algorithm with k=5 was employed on the computed eigenvectors to cluster the principal components. The graphs of all these clusters are shown below.
Someone following the market during this time period can interpret why some of these companies are grouped together in clusters. Basic k-means clustering may not be the most appropriate technique for clustering these companies, but one can still see whether the clusters indicate anything. Since the dataset consists of the derivatives of the closing prices, it makes sense that the clustering relates to the change in closing price of these companies. Examining the cluster graphs above, one can see that companies with momentum, such as Google, Baidu, and Intuitive Surgical, form a cluster because their stock prices swing a good amount on market sentiment, while cluster 3 swings in a moderate fashion. At first it was thought that these companies were clustered by their year-to-date return; however, this was proven incorrect by the year-to-date return values retrieved from Scottrade. In a way, the higher the swing, the riskier the investment: cluster 1 shows companies that are less risky, whose swings on market sentiment are small compared to the high-momentum companies in cluster 5.
An interesting problem noticed while following the market for a few months was the effect of changes in USD-Yen. It seemed as if every time the Yen rose against the USD, the market would be in the red. This can be attributed to the carry trade, in which investors borrow a low-yielding currency such as the Yen and lend a high-yielding currency such as the USD. Consequently, if the Yen rises against the USD, investors experience a loss and the market goes down. Another common trend is the change in the Dow Jones Industrial Average Index: if the Dow rises, then chances are that other stocks will rise too. The two graphs below show the effect of changes in USD-Yen and the Dow Jones on some of the companies examined. Blue encodes a positive change and yellow a negative change in USD-Yen and the Dow Jones. One can see which companies beat the market on a bearish day, i.e. which companies have yellow components on the positive side of the scale. An example of this is Intuitive Surgical in both graphs shown below.
Financial data can be treated as time-series data. If a Fast Fourier Transform (FFT) of the dataset is taken, one can see what frequencies the dataset is made of; the amplitudes of the frequency spectrum indicate the strength of the frequency components present. In the financial scenario, this can roughly be translated into the volatility of a company's stock. One can see that richly valued companies such as Baidu, Google, Intuitive Surgical, and Petro China have the largest amplitudes in the frequency spectrum graphed below, which translates to volatility, or great changes in the prices of these stocks over the time period. Looking at historic values, Google increased from approximately $475 to $675 between October 2006 and October 2007. A peak in amplitude can also be seen at Merrill Lynch in the graph below. This peak can be attributed to volatility, as historic data suggests that Merrill Lynch had an up-and-down year: the stock started at $84, shot up to $98, fell to $79, climbed to $94, fell to $70, climbed to $76, and then dropped to $66. Another company whose value changed a great deal is Baidu, which appreciated from $86 to $355 over the period and thus has a large frequency amplitude due to this appreciation.
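The link between choppiness and spectral amplitude can be sketched with a direct DFT in Python (the report uses Matlab's fft; the synthetic "smooth trend" and "choppy" series below are illustrative assumptions, not the actual price data).

```python
import cmath
import math

def amplitude_spectrum(series):
    """Amplitude at each frequency bin of a real time series, via a direct
    O(N^2) DFT - adequate for a sketch, where Matlab would use fft."""
    n = len(series)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(series))) / n
            for k in range(n // 2 + 1)]

# A smooth trend puts its energy at low frequencies; a fast oscillation
# (a stand-in for a volatile stock) concentrates energy at a high bin.
smooth = [t / 32 for t in range(64)]
choppy = [math.sin(2 * math.pi * 10 * t / 64) for t in range(64)]
print(amplitude_spectrum(choppy)[10])  # ≈ 0.5: all energy in bin 10
```

In the report's setting, each company's change-in-closing-price series plays the role of one such signal, and the companies with the wildest price swings show the tallest high-frequency amplitudes.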
In high dimensions, it becomes very hard for humans to visualize the structure of a massive multidimensional dataset. In Cartesian coordinates, for example, a 3-D structure can be viewed from different angles in all three dimensions so that the viewer can interpret its structure. In parallel coordinates, however, this becomes vague and difficult, especially when the dataset is massive with numerous points. This is where Asimov's Grand Tour algorithm helps. Wegman and Luo applied the Grand Tour algorithm to Inselberg's parallel coordinates to arrive at a better understanding of a dataset, especially after it has gone through clustering, regression, and data pruning. All permutations of the dimensions of the dataset are considered in the Grand Tour rotation. The Grand Tour algorithm provides a generalized rotation of the coordinate axes in a high-dimensional space, accompanied by the full-dimensional plots allowed by a parallel coordinate display and by 2-D plots of pairs of dimensions, also known as a scatterplot matrix. The data analyst can thereby explore data that is both high-dimensional and massive in size.
As mentioned above, the Grand Tour algorithm views a dataset by rotating it through different angles. The algorithm, also known as the torus method, extends to n-dimensional space and is outlined below.
In the above matrix, the cosine and sine entries are in the i-th and j-th columns and rows.
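The single-plane rotation that the matrix above describes can be sketched in Python (the report's implementation is Matlab; the 5-D example vector is an illustrative assumption). The full Grand Tour composes such rotations over all pairs of dimensions with slowly advancing angles; this sketch shows only the building block.

```python
import math

def rotation_ij(n, i, j, theta):
    """n x n identity with a rotation by theta in the (i, j) coordinate plane:
    the cosine and sine entries sit in rows/columns i and j, as in the matrix above."""
    r = [[1.0 if a == b else 0.0 for b in range(n)] for a in range(n)]
    r[i][i] = r[j][j] = math.cos(theta)
    r[i][j] = -math.sin(theta)
    r[j][i] = math.sin(theta)
    return r

def apply(r, v):
    """Matrix-vector product."""
    return [sum(r[a][b] * v[b] for b in range(len(v))) for a in range(len(v))]

# One Grand Tour step: rotate a 5-D point by 90 degrees in the (1, 3) plane.
# Coordinates 0, 2, 4 are untouched and the vector's length is preserved.
v = [1.0, 2.0, 3.0, 4.0, 5.0]
w = apply(rotation_ij(5, 1, 3, math.pi / 2), v)
print([round(x, 6) for x in w])  # [1.0, -4.0, 3.0, 2.0, 5.0]
```

Applying a slowly changing product of these matrices to every data row, then redrawing the PC plot each step, yields the rotating-dataset animations described in this section.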
This code for the Grand Tour algorithm follows the outline above. Midway through the code, however, the dataset is changed to the "pruned" data discussed in the next section; these lines have been marked in the code and can be removed. The Grand Tour code uses a synthetic dataset, pollen, which is provided in the Appendix and can be found here. The pollen dataset is graphed below for convenience.
This synthetic dataset has a six-dimensional internal structure, which reveals the word "EUREKA," as shown by this figure. The word is revealed when the Grand Tour is performed on the dataset. However, pruning the dataset is not explained in any of the research papers read. Therefore, a scheme was coded using clustering to "prune" the dataset by finding clusters within the clustered dataset, and so on; this yields different results. When a scheme using density plots was instead developed to prune the dataset, shown in the next section, the Grand Tour again yielded different results, but still no "EUREKA." The research papers read do not explain how their authors reduced the 3848 points in the pollen dataset to the 99 points that reveal "EUREKA." The limitations of the pruning technique discussed in the next section touch more on the retrieval of these 99 points. Here are the animations, 1, 2, 3, and 4, of the Grand Tour algorithm on the "pruned" dataset. Animation 4 records the Grand Tour executed on dimension 2 vs. dimension 3. As one can see, there is no "EUREKA"; however, on close examination, the structure of the dataset remains the same even as it is rotated through all angles in any two of the n dimensions. The Grand Tour algorithm is useful for finding structure and for visualizing an n-dimensional dataset from different angles to gain a better understanding of the data.
As one can imagine, graphing massive multidimensional datasets in parallel coordinates leads to a great deal of visual clutter. A good example is the pollen dataset used in the previous section: it is nearly impossible to visualize the relevant structure of the dataset. A technique such as clustering could help, but basic k-means clustering can pick arbitrary initial centroids, leading to futile results. Data pruning using brushing and hue saturation, also known as highlighting, is an excellent way of identifying correlations among attributes or the presence of interesting clusters. Density plots visualize the pertinent information in a cluttered parallel coordinates plot by considering the overlaps of the graphed points (polylines, if you are thinking in Cartesian coordinates). Creating a density plot involves finding the areas through which the most points pass and assigning a frequency value to each area. The density plot shows the most frequented areas of the plot the brightest, helping the data analyst visualize the densest structure of the dataset. There are several ways of highlighting pertinent information in a parallel coordinates plot. One common way was to use a Silicon Graphics workstation: since it lets the user control the raster and the image through software, one can add up all the pixel values and compute the density plot almost instantly. The high-density regions can then be stored and the low-density regions pruned away. Using this technique, Wegman pruned the pollen dataset to retrieve "EUREKA." Unfortunately, Silicon Graphics workstations are obsolete and the company is defunct, so a new scheme had to be developed to accomplish the same task.
A symposium paper by Artero, Oliveira, and Levkowitz, cited in the Appendix, was used as a reference for creating density plots to retrieve structures within a dataset.
Before the algorithm is discussed, the pixel resolution of the graphed plot needs to be known. A sample Matlab plot was measured using VisionX, a computer vision software package from Professor Reeves; the pixel resolution of a Matlab plot is approximately 435 by 343 pixels. The discrete raster representation of the Matlab plot is the input to the algorithm. The algorithm uses a frequency matrix which stores how often every pixel is hit by the lines drawn in the original parallel coordinates plot. Once all the frequencies are found, the nonzero frequencies and their indices in the frequency matrix are used to find pairs of pixels. Bresenham's line algorithm, a famous computer graphics line-drawing algorithm that can be found here, is used to draw a line through each pair of pixels. This line is stored in the output discrete raster representation; when the same pixel is passed through more than once, its value is incremented. In this way, less dense regions have low frequency values and are colored lighter than denser regions with higher frequency values. This frequency plot, known as the "IPC frequency plot" in the algorithm displayed below, is converted to a density plot by applying a smoothing filter, which is just a regular 3 x 3 mean filter.
The algorithm discussed above is described in detail below:
The density plot is obtained by applying the smoothing filter prior to step 3.2. density.m was executed on the pollen dataset, and the resulting density plot is shown below. As the reader can clearly see, there is a very well-defined structure of high density in the dataset, highlighted bright red against the rest of the data. Wegman highlighted this same structure using a Silicon Graphics workstation, which is also shown below; one can see how similar the structures are. Another dataset examined for its density and structure is the financial dataset used throughout this project, also shown below.
Once the density plot is calculated, the data analyst wants to know which points lie in the dense regions. This problem is not an easy one to solve: the density plot is a discrete raster representation, an image, so the dense regions are encoded in pixel values and do not correspond directly to points in the dataset. A recovery has to be made. No research papers were found on this aspect of the problem, although Wegman seems to have solved it somehow. Given the time constraints of this project, an intuitive algorithm, based on reversing the density plot algorithm above, was devised as stated below:
The retrieval code, which uses the stored x, y, and auxiliary matrices from density.m, was executed on the density plot of the pollen dataset. The resulting parallel coordinates plot is shown below. The points identified by the density plot as members of the dense region are colored red, while the rest of the dataset is colored gray. These points constitute a pruning of the pollen dataset.
Looking at the plot above, several points are missing. Experimenting with the threshold value could probably fix this, and a better algorithm for retrieving the data points of the dense structure from a discrete raster representation would give a firmer grasp of the structure on which the Grand Tour algorithm can be run. The Grand Tour algorithm discussed above was run on this pruned dataset, highlighted in red in the figure above, to see whether the word "EUREKA" was displayed. Unfortunately, it was not, which means the dataset was not adequately retrieved and a better algorithm needs to be implemented. The animations are provided in the Grand Tour section and the Appendix.
Parallel coordinates is an innovative way of visualizing multidimensional datasets by drawing all the axes parallel to each other. There are several dualities between the Cartesian coordinate system and parallel coordinates, such as the duality between point and line. Through the examples shown, one can visualize hypercubes, dynamical systems, multidimensional structures in datasets, and clusters, and even view an n-dimensional dataset from all viewing angles. Several data visualization techniques have been learned from this project. Cluster analysis, PCA, the Grand Tour, and the creation of density plots let the data analyst visualize a multivariate dataset with ease and determine its structure, on which further analysis such as regression can then be executed. Using clustering techniques, one can group multivariate data points in one pass without tracking trends from dimension to dimension. Principal Component Analysis (PCA) in parallel coordinates can provide the major components of the dataset for further analysis in one pass and in one graph. Familiar systems such as the harmonic oscillator, fractals, and the Lorenz attractor have also been graphed in parallel coordinates. However, their plots reveal little structure, because a change in parameters, especially for fractals, leads to nearly identical parallel coordinates plots; this is one limitation of parallel coordinates. Another limitation is the cluttering of lines in massive datasets, which makes structure difficult to analyze. To remedy this, a density plot is created to find the densest regions of the parallel coordinates plot; however, this leads to the problem of retrieving the points in these dense regions, since the density plot is merely a raster representation.
An attempt was made at solving this problem, but the word "EUREKA" from the pollen dataset was not displayed after running the Grand Tour algorithm on the retrieved dataset. The Grand Tour algorithm was used to view a multidimensional dataset from all viewing angles over time. Overall, parallel coordinates is a powerful tool for visualizing an n-dimensional object or dataset in the 2-D plane fairly easily, and data analysis can be done on a multivariate dataset in one pass.
A Tutorial on Parallel Coordinates by Alfred Inselberg
Basic Parallel Coordinates by Professor Bruce Land
High Dimensional Clustering using Parallel Coordinates and the Grand Tour by Wegman and Luo (pdf)
Uncovering Clusters in Crowded Parallel Coordinates Visualizations by Artero, Oliveira, and Levkowitz (IEEE) (pdf)
Visualization and Knowledge Discovery for High Dimensional Data by Alfred Inselberg
Visualizing the Behavior of Higher Dimensional Dynamical Systems by Wegenkittl, Loffelmann, and Groller
Introduction to Clustering Procedures by SAS/STAT
Dynamics of Coupled and Driven Harmonic Oscillators by Imperial College Press (applet)
Financial dataset (zip)
Lehman Brothers (csv)
Rotating Cube: 1
Matlab Code (.m files) by Ankur Kumar
The code for points and lines can be found on Prof. Bruce Land's website
I would like to thank Prof. Bruce Land for his guidance throughout this project and for giving me this project to work on. I would like to thank Prof. Hutchins, Stephen Jhun, and Jon Del Gaizo for their help with the coupled harmonic oscillator concepts. Also, I would like to thank Franco Cedano for his aid in changing the header for this website. Finally, I would like to thank everyone who kept me company during the all-nighters spent coding and understanding this project.