Principal Component Analysis

Principal component analysis (PCA) is a multidimensional scaling method well suited for previewing the relative layout of the data before applying algorithms such as self-organizing maps and k-means. The dimensionality reduction performed by this method helps us visualize the data and plan further analysis.

By clicking the button associated with the PCA subprogram, we pass the loaded data to the PCA module. This module reduces the dimensionality of the dataset by calculating its eigenvectors. The mathematical details of this reduction will not be discussed here. The data can now be visualized in a two- or three-dimensional plot along the two or three most significant eigenvector dimensions.
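The reduction described above can be sketched as follows. This is a minimal illustration of PCA via eigen-decomposition of the covariance matrix, not the module's actual implementation; the function name and signature are hypothetical.

```python
import numpy as np

def pca(data, n_components=2):
    """Reduce `data` (samples x features) to its top principal components."""
    # Center each feature on its mean.
    centered = data - data.mean(axis=0)
    # Eigen-decompose the covariance matrix of the features.
    cov = np.cov(centered, rowvar=False)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    # eigh returns eigenvalues in ascending order; reverse to descending.
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues = eigenvalues[order]
    eigenvectors = eigenvectors[:, order]
    # Project the centered data onto the most significant components.
    scores = centered @ eigenvectors[:, :n_components]
    return scores, eigenvalues, eigenvectors
```

The returned scores are the coordinates plotted in the two- or three-dimensional preview.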

The two comboboxes are used to select the two eigenvector dimensions you want to represent the data by. The eigenvector dimensions are sorted by their corresponding eigenvalues in descending order, and the major percentage of the variance is usually accounted for by the first three components.
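The percentage of variance accounted for by each component follows directly from the sorted eigenvalues. A small sketch (the function name is illustrative, not part of the program):

```python
import numpy as np

def explained_variance_pct(eigenvalues):
    """Percentage of total variance accounted for by each component.

    Assumes the eigenvalues are already sorted in descending order.
    """
    eigenvalues = np.asarray(eigenvalues, dtype=float)
    return 100.0 * eigenvalues / eigenvalues.sum()

# For eigenvalues 6, 3 and 1, the components account for
# 60%, 30% and 10% of the variance, respectively.
```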

The "Components" button opens a gene graph window and a list of all the eigenvectors. The selected eigenvectors are visualized in the graph window. A click on the "Variance/comp" button opens a graph window showing the percentage of variance accounted for by each eigenvector.

By clicking and dragging the mouse over the plot, a subset of the input data can be selected and inserted into a gene graph. Because the input data are ordered with respect to the relative distances between them, a small selection will contain genes that are relatively closely related. If the selected set still contains too much variance, it can be further refined in the gene graph subprogram.
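Conceptually, the drag selection reduces to keeping the points whose component scores fall inside the dragged rectangle. A minimal sketch under that assumption (the function and its parameters are illustrative):

```python
import numpy as np

def select_in_rectangle(scores, x_range, y_range):
    """Indices of points whose first two component scores fall inside
    the dragged rectangle; x_range and y_range are (min, max) pairs."""
    x, y = scores[:, 0], scores[:, 1]
    mask = ((x >= x_range[0]) & (x <= x_range[1]) &
            (y >= y_range[0]) & (y <= y_range[1]))
    return np.nonzero(mask)[0]
```

The returned indices identify the genes to pass on to the gene graph.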

See the excellent article on PCA by Soumya Raychaudhuri, Joshua M. Stuart, and Russ B. Altman:
"Principal components analysis to summarize microarray experiments: application to sporulation time series"


Planning further analysis by performing PCA

By carefully studying the PCA plot, we might get a clue about how many clusters it is natural to group the data into. This might prevent forcing a small set of data groups into a large number of clusters, and vice versa, when applying k-means or a SOM algorithm to the data.

It is, however, important to understand that the plot does not give an exact picture of the variance in the dataset. Mathematical analysis using PCA results in a matrix known as the eigenvectors of the input set; these eigenvectors are associated with eigenvalues that are arranged from the highest to the lowest value. Only the most significant components are selected to represent the set in the plot. This means that some information is lost in the lower dimensions: if the dataset has more than three dimensions, it is almost impossible to visualize it without losing any information. Theoretically, some interesting plots could be generated by selecting other dimensions of the eigenvectors, but this function is left to be implemented sometime in the future.
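The information lost by plotting only the first few dimensions can be quantified from the eigenvalues: the fraction of variance retained is the sum of the leading eigenvalues over the total. A small sketch (the function name is illustrative):

```python
import numpy as np

def variance_retained(eigenvalues, n_components):
    """Fraction of total variance kept when only the first
    `n_components` dimensions are shown in the plot."""
    vals = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    return vals[:n_components].sum() / vals.sum()

# With eigenvalues 5, 3, 1 and 1, a two-dimensional plot
# retains (5 + 3) / 10 = 80% of the variance.
```

A value well below 1 for two or three components is a warning that the plot may hide structure present in the discarded dimensions.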