Assignment a5_CrystalVisGenes By Dr. Carr Topics Minimal spanning trees Writing out data files for software Producing a minimal spanning tree for each cluster 1. Discussion: Am interesting Gene expression and coherence analysis 2. Reading the data 2.1 The basic data 2.2 Clustering 3. Write out a dataset for use in CrystalVision 4. Produces and write out a minimal spanning tree for each group 5. Install CrystalVision on your PC 6. Looking at the data 6.1 The scatterplot matrix view 6.2 3-D View with minimal spanning tree lines 6.3 Parallel coordinates, brushing General Goal Cluster Criticism Notes The files produced in 3. and 4. will be used in CrystalVision R does not come with the minimal spanning tree function in Splus. A mstree() function is available in the ade4() package You may want to print this script to follow the Crystal Vision directions in 5. and 5. Requires The ade4 package: obtain and install it Crystal Vision Demo Version See 4. Due Screeen shoots as indicated in 6.1, 6.2 and 6.3 References: Wen, X., S. Fuhrman, G. S. Michaels, D. B. Carr, S. Smith, J. L. Barker, and R. Somogyi. 1998. "Large-scale Temporal Gene Expression Mapping of Central Nervous System Development," Proceedings National Academy of Science. Vol. 95, pp. 334-339. Michaels, G. S., D. B. Carr, M. Askenazi, S. Fuhrman, X. Wen and R. Somogyi. 1998. "Cluster Analysis and Data Visualization of Large-Scale Gene Expression Data", Proceedings of the Pacific Symposium on Biocomputing, edited by Russ Altman, World Scientific Publishing Co, River Edge, NJ, USA 3:42-53. Carr, D. B., R Somogyi and G. Michaels. 1997. “Templates for Looking at Gene Expression Clustering,” Statistical Computing & Graphics Newsletter, Vol. 8 No. 1 pp. 20-29. 1. Discussion: Gene expression, and interest plot, and coherence analysis. This was used in a PNAS paper and was perhaps the last little gene expression data set to be before the microarray data totally dominated in published papers. I like to think the graphics were helpful in getting the paper accepted. One parallel coordinate plot the sorted two sets of cluster categories to minimize line crossing when they connect to each other and to gene function categories. The "string art" drew some visual attention and appeared in a GMU art and science exhibition. However the implications seem to largely have escaped apprecation. 1) The sorting was an effort to maximize the coherence between the data and the existing scientific ontology for gene function 2) The resultin parallel coordinate plots induces a new distance measure that could be used for higher order clustering 3) The is a possibility simple appearing forms in terms of human perpection actually match what happens in nature. I don't know if seeking coherence seeking and pursuing higher order analysis is a well-established research area. (Call it coherence analysis for short.) Likely there are examples in many areas such as in papers data fusion and data synthesis. There should also be connections to meta-analysis. Possibly there are connections to correspondence analysis. In any case I suspect there are an opportunities to develop new coherence seeking algorithms and to engage in higher order analysis. Earlier in the class I showed a bit of Amy Bravermans' talk on using clustering methods to compress JPL atmospheric data. I now call attention to her methodology for finding bivariate probability distributions (for mulltiple clusters from two regions) that were maximally consistent with the margin probabilities. This kind of algorithm is compatible with my rough notion of coherence analysis. I hope her methodology will helpful in the search for large scale climate patterns. 2. Reading the data From the class data directory get the gemIdGene.csv data ## Run genes = read.csv(file='gemIdGene.csv',row.names=1) names(genes) row.names(genes) 2.1 The basic data The row names are the names of 112 genes related to the central nervous system (CNS). This is set is tiny by today's standards. The first set of variables have names E11, E13, E15, E18, E21, P0, P7, P14, and A. This gives the gene expression levels in rats at different ages. E prefixes the day gestation. P prefixes day after birth. A indicates adult animals. The set of nine values for each gene have been normalized by dividing by the expression level. The minimum was not substracted so the minimum in not necessarily zero. 2.2 Clustering In the PNAS paper, coauthors generated "slopes" from the nine variables and added these to the original data. # mat=genes[,1:9] # slopes = t(apply(mat,1,diff)) # bigmat = cbind(mat,slopes) The composite was used to cluster the data into 6 groups. (I later showed that in terms of Euclidean distance adding the slopes was equivalent to differently weighting the original variables.) The cluster membership is in column 13 (labeled ES.clus) in the genes data.frame. In returning to this old data, I have not as yet been able to reproduce the exact clustering. The group variable is factor with 6 levels. attributes(genes[,13])$levels #[1] "constant" "other" "wave_1" "wave_2" "wave_3" "wave_4" Constant genes were active over the life the rat. The waves reflect different visual patterns over time. The task here is to use Crystal Vision to look for possible problems of internal data consistency. 3. Write out a dataset for use in CrystalVision The plan is to focus attention on the origin values and not on the slopes. The CrystalVision data file has an ascii file with the header. The top of the file will look something like. variables: 10 labels: E11 E13 E15 E18 E21 P0 P7 P14 A GRP 0.81 0.72 0.72 0.68 0.85 1 0.89 0.85 0.97 1 0.91 0.56 0.57 0.47 0.6 0.92 0.71 0.66 1 3 ... We could hand edit the header but instead use a script to write the whole file. This approach will generalize to preparing data for other software such as GLISTEN ##Run subs = c(1:9,13) # put the desired data in a matrix mat = genes[,subs] mat[,10] = as.integer(mat[,10]) # use factor levels as integers nams = names(genes)[subs] # get the variable names nams[10]='GRP' nc = length(subs) # Write the header # Note appending to the same file after the first write which # opens the file write(paste('variables:',nc),file='genePNAS.txt',ncol=1) write('labels:',file='genePNAS.txt',ncol=1,append=T) write(nams,file='genePNAS.txt',ncol=1,append=T) # Now add the data # The writing order increases the row index first # Here want a case at a time for transpose the matrix write(t(mat),file='genePNAS.txt',ncol=nc,append=T) ##End 4. Produces a minimal spanning tree for each group. Write out the pair of row numbers for each edge in the minimal spanning tree for each group. A minimal spanning tree with n points has n-1 edges. Note: If you have problems installing the ade4 package, and cannot produce the file genePNASlines.txt below you can download a version from the class web site and use it in crystal vision. library(ade4) # need to obtain a minimal spanning tree function grp = as.integer(genes[,13]) # Same group as above for (i in 1:max(grp)){ loc = seq(along=grp)[grp==i] #rows of the group i genes dat = genes[loc,1:9] # data for the ith group datDist = dist(dat) mst = unclass(mstree(datDist)) # points to connected minimal spanning tree mat = cbind(loc[mst[,1]],loc[mst[,2]]) # use row numbers from # the full data set write(t(mat),file='genePNASlines.txt',ncol=2,append=i>1) # The rows are in different order than in the file on the web site. # Some of the pairs are reverse # However, all the points to connect points should be the same } 5. Install CrystalVision on your PC Use class schedule link or ftp://galaxy.gmu.edu/pub/software/ get the CrystalVisionDemo (older version) and install it on you PC. The more recent version omits a feature used in this exercise. Perhaps there is a more recent release with the feature included. 6. Looking at the data Start CrystalVision and open the file genePNAS.txt that you created above. 6.1 The scatterplot matrix view Panel selection: Click in the panel in the lower left column of the matrix to enlarge and appear in the top right. (You can pick other panels to enlarge later) Alphablending: Turn off the alphablending for now. To do this locate the icon with an A and a red slash, top center of the screen, and click it. Dot size: Increase the size of the square dots. Use the slider at the top right. Now at least we can see points There are six rows of points corresponding to the six groups (clusters). Brushing: The next task is to use different colors for the different groups. Find the icon on the left with the paint brush and click it. The cursor should change shape. Select the color red by clicking the red bar below. We will leave the top row in the enlarged panel white. Paint the second row red. That is click and hold down with the brush slight to the top left of the second row of points, drag slightly past bottom right of the second row and release. Now change the third row to green. Click the green bar on the screen. Click and hold, drag and then release to create a box around the third row. Repeat this process until six rows of points have colors that correspond to the color bars. Then click on the pointer icon at the left to disable the brushing for the present. Screen shot: At the top right you can down size the CrystalVision window and the resize it by hand to something smaller than full screen size. Use ALT + Print Screen button to capture the image and the paste it into a word file. You can then return the CrystalVision Screen to full size. There was nothing wrong with the full size image but sometimes the image we want in the documents appears better when the original is smaller. 6.2 3-D View with minimal spanning tree lines The 3-D view requires that the 3 variables be selected. A little below the brush is an icon with a script v. Click this. It enables variable selection for different uses. Under the Select column toggle the NOs to Yes for variable E13, E21 and P14. Then click Okay at the lower right. Select the 3-D View: Among the icons across the top is an icon with three axis. Click this. A 3-D scatterplot appears. Adjust dot size: We have not picked really bright colors so they may be a bit hard to see. As a quick fix you may want to increase the point size a little. Spinning: To spin the plot, click in the image, drag the mouse and release the mouse while the mouse is still moving. This should spin the plot and give a 3-D effect. Adding lines: With six colors the plot looks kind of complicated. Lines can help us focus attention. It is time to add the minimal spanning tree lines, even though we are looking at only three of the variables used in defining the groups and the spanning trees. Under the Extras select import connections. Pick the genesPNASlines.txt file created earlier. To turn on the lines, click the line icon (yellow dots and with lines) between the 4D button and the Alphablending toggle button. Comments: When point in different groups appears close they likely have other coordinates that are discrepant causing them to be in different groups. When the point on the minimal spanning tree for a group appears to produce a long segments in a view of three coordinates (you can try different views) one or more of its coordinates may be suspect. When a mstree skeletal structure does not seem compact, perhaps more clusters should have been used. Later we take a look using the first three principle components using stereo plots and have reason to asked if adding more data would help refine some of the spread out clusters. Screen Shot: Make a screen shot of a 3-D view. 6.3 Parallel coordinates, brushing Here the goal are to bring out specialize view of emphasize the groups called Wave 2 and Wave 4. Select parallel coordinates: The parallel coordinates view button is just to the left of the 3-D (axis icon) button. It has 4 horizontal lines more vertical line segments. Select the Parallel coordinates view. Alphablending: Click the alphablending button to turn it back on. Slide the alphablending slider all the way to the right. When plotting on pixel it receives the color with full brightness mixing it with what ever is already there. (The purpose here not focus on representing density by having point and lines getting brigher due to overplotting.) Now click on the brush icon to turn on brushing. Then click on the blue bar to pick blue as the background color. Now click at the left of all line, just below the GRP axis line, drag to the right and release. This should turn all the points blue. The 4 cluster of points along the GRP line is Wave 2. Click on the white color bar. The click near the GRP axis just to the Wave 2 group, drag to the right of the wave2 group and release. The lines selected appear white. Note the low values for gestation days 11 and 13, the rapid rise in gene expression after that followed by continuing levels of higher expression during the rest of the animal life time. This characterizes wave 2. There are of course some exceptions. It is not a really tight cluster. Make a screen shot Repeat this process for wave 4, and make a screen shot to turn in. Describe in a few words the pattern you see for Wave 4. Technical note: Crystal Vision appears to plot points in storage order. Brushed points do not necessarly appear on top. With alphablending we can plot the points we want to see in full intensity white. When full intensity white mixes with anything the result is white. Thus we clearly see the white points even if other points are overplotted.