# Assignment a3_tours # Copyright 2004,2005,2006,2007, 2008 class use permitted # By Daniel B. Carr # Topics: 1. Grand Tour Computations for 4-D 1.1 Generate 4-D data 1.2 Givens rotation matrix function 1.3 4-d composite rotation matrix function 1.4 Base for linearly independent angle sequences 1.5 A short 2-D view tour of 4D data 2. Comments 2.1 Showing more than two coordinates, CrystalVision, XGobi, GGobi, ESTAT 2.2 Pixel tours, Random Tours, Starting Points, and other applications 2.3 Very large number of variables, dimension reduction, semi-supervised principal components, and borrowing from other disciplines 2.4 Visual analysis of virtual graphics and GGobi 2.5 Comment: Units of measure and variable scaling 3. View three coordinates of a tour using color anaglyph stereo 3.1 Natural homogenous coordinate projection matrix function 3.2 A short 3-D Color anaglyph stereo tour 4. A Random Tour for a 2-D view of 5-D data 4.1 Random matrix with two orthonormal columns function 4.2 Generate some 5-D data 4.3 The 2-D random tour 5. Viewing the 4D tour in 4D using stereo ray glyphs Note: An older Splus version did not have double buffering. I overplotted point in white to "remove" the before plotting new ones. Without the white I would see the interesting trails until the interior of the plot became saturated. Some variant using trails might be useful. For example over plot the last three time steps with low moderate and high contrast point respectively. The older dots might be made smaller as well. Ware indicates that in statics view of trails or tails the high contrast side appears as the leading side. # Due One screen shot each from 1,3,4, and 5. 1. Grand Tour Computations for 4-D ======================================= 1.1 Generate 4-D data The data is to be stored in a 4 column matrix called dat. ## Run dat = matrix(rnorm(120),ncol=4) ##End 1.2 Define a function the produces a Givens rotation matrix with the default dimensions for 4-D data The following is not careful the location negative sin() in the matrix which is not important in this context. ##Run givens = function(theta,i,j,d=4){ a = cos(theta) b = sin(theta) mat = diag(d) # creates a identity matrix mat[i,i] = a mat[i,j] = -b mat[j,i] = b mat[j,j] = a return(mat) } ##End 1.3 Define a function that produces a matrix that is the product of 4 choose 2 = 6 4-D Givens rotation matrices. Let the argument be a vector of six angles in radians. ##Run rot4dMat = function(angles){ mat = givens(angles[1],1,2) %*% givens(angles[2],1,3) %*% givens(angles[3],1,4) %*% givens(angles[4],2,3) %*% givens(angles[5],2,4) %*% givens(angles[6],3,4) return(mat) } ##End 1.4 Generate a set of 6 linear independent numbers base 2*pi ##Run base = 2*pi angleInc = .005 angles = angleInc*sqrt(c(2,3,5,7,11,13)) %% base ##End Note that integer multiples of the above will generate a path on the 4-dimensional sphere. In polar coordinates only D angles are need to specify a point on a D dimensional sphere. The use of D choose 2 angles is part of the methodology developed by Daniel Asimov (1985) for defining a path around a D dimensional sphere that would eventually get with a few degrees of all points on the sphere. The length of time to see all 2-D views within a few degrees of all 2-D views is prohibitively long even for six dimensional data. The time increase rapidly as the dimension increases. Dr. Wegman says he often sees interesting patterns early in the tour. He uses the random tour described in 2) below. 1.5 A short 2-D view tour The word “modify” below is an option, not a directive to be followed for grading. angleInc: Modify this to change the multivariate vector of angles being incremented in integer multiples nSteps: Modify this to run a longer or shorter sequence Interrupts: I don't know how to interrupt the animation yet. Speed: Animation can run too fast. Putting a pause before each drawing can slow the animation. Lines commented out below repeatedly call Sys.time() and allow the loop to proceed after a selected elapsed time. This works to slow the animation. Making angleInc small can produce lots of nearly identical views so the animation seems to go slow but does get as far in nSteps. I think it should be possible to adjust animation speed via the double buffering controls. When a screen device is double-buffered (the default) the screen is updated 100ms after last plotting call or every 500ms during continuous plotting. These times can be altered by setting options("windowsTimeout"). My experiments with this did not seem to produce changes. ##Run #options(windowsTimeout=c(50,500)) base = 2*pi angleInc = .002 angles = angleInc*sqrt(c(2,3,5,7,11,13)) %% base cex = 1.4 nSteps = 1000 cmat = matrix(c(0,0,100,255,255,255)/255,ncol=3,byrow=T) colors = rgb(cmat[,1],cmat[,2],cmat[,3]) oldPalette = palette(colors) par(pty='s') #z1 = Sys.time() for(step in 2:nSteps){ nangles = (step*angles)%%base mat =(dat%*%rot4dMat(nangles))[,1:2] # while(difftime(z2,z1,units='secs')< .03)z2 = Sys.time() # z1 = Sys.time() plot(mat,col=c("blue","red","purple","black"), ylab='',xlab='',main='',pch=16, xlim=c(-3.5,3.5),ylim=c(-3.5,3.5),xaxs='i',yaxs='i',cex=cex) } ##End 2. Comments ======================================================= 2.1 Showing more the two coordinates, projection pursuit and alternative software The tour above produces a sequence of 4D data coordinates. We visualized the sequence with two coordinates, and omitted the other two coordinates. This can be wasteful in that the action can be in views involving the omitted coordinates. As part of my NSF research funded in 1991, I designed software called ExplorN. This provided alternative views of multivariate data. The views included all variables (up to 20) in a parallel coordinate plot, all pairs of variables in a scatterplot matrix, and four selected variables in a stereo ray glyph plot. During a tour each of the views shows more than two coordinates. I hired Qiang Luo, Dr. Wegman's student to program the first implementation of ExplorN under my grant in the summer of 1992. This included an implementation Dr. Wegman's generalized touring algorithm. Today most of the ExplorN SGI capabilities are available on the PC in software called CrystalVision. CrystalVision is old but I still find it useful. The packet header example for one day of the Los Alamos Green Network data. The example showed live interaction with over 300000 cases (a large number had been omitted) and 12 variables. Today there are preferable alternatives in specific parts of the graphics. For I like many of the parallel coordinates features available in ESTAT from Penn State. Choices for touring include XGobi and GGobi. I generally prefer projecting pursuit methodology to grand and random tours. Projection pursuit methods seek projections that optimize a feature of merit calculated from the projected image. I used to include XGobi as part of my class to show projection pursuit. (The early XGobi versions were limited by showing only two coordinates of tour.) GGobi is the sequel to XGobi and has many desirable features. For the Online book see http://www.public.iastate.edu/~dicook/GGobi-book/GGobi.html The PC version was not stable so I did not pursue develop examples for class exercises. This month (March 2007), learned that current version is much improved. I again plan incorporate GGobi exercises in the class but it may not this semester. The purpose of the exercises here is to provide a better understanding of what is going behind the images. Some algorithms are instructive but can be improved. For example many multiplication by zero in multiplying a sequence of Givens rotation matrices can be avoided. In tours there are components of the motion that spinning Within the subspace being visualized. This like sticking a pin in a scatterplot and spinning it around and not helpful. Better algorithms can control what is called whip spin. The list of things to improve is endless. 2.2 Pixel Tours, Random Tours, Starting Points, and Other Applications A few several years ago Dr. Wegman and Qiang Luo developed pixel tours for multispectral images. Linear combinations can be useful for many purpose from detecting land mines to obtaining better views of painted over (defaced) archeological artifacts. One does necessarily know which linear combinations of spectral intensities will be useful. Generalized rotation provides way to investigated different linear combinations. With hyperspectral the data there can be many variables. Random tours provides an alternative to grand tours that is more tractable computationally. How good is the coverage of random tours? Dr. Wegman say he often finds interesting patterns very early. Sometime just looking at thinking about the data just a little be provides a big step forward. So, maybe random tours serve a useful purpose even if the coverage is limit. One applicatin of random tours has been to get away from local optima found using a project pursuit pursuit algorithm. After running the random tour awhile the optimization is turn back on and may well lead to a different local maximum. I had thought about selecting starting linear combinations based on roughly regular spacings on a p-dimensional sphere. In 3-D the Platonic solid vertices provide the only nice solution to equally space points. Years ago I heard from Carey Priebe that a mathematician (Sloane?) had developed good procedure finding k almost regular spaced points in p-dimensions. I think Di Cook (Iowa State) picked up on this idea and pursued it. There may be an option in GGobi. The idea of using linear combination continues to be use useful and this is behind the touring approach. In his Ph.D thesis one Dr. Wegman's students considered linear combinations of EKG signals (time series) to obtain a better view of the back of heart. The conventional views based on the standard lead placement left a this part of the heart poorly examined. Likely you can think of additional applications were viewing linear combinations of data could be helpful. 2.3 Very large number of variables, dimension reduction, semi-supervised principal components, and borrowing from other disciplines Today's data collection methods are producing data sets with thousands to million and more of variables. Some examples are multispectral images, multi-gene humans, multi-word documents collections, multi-propertied drug collections. The AIRS multispectral radiometer produces intensities for roughly 2400 spectral bands. Gene chips can yield Human gene expression levels for over 20000 genes. (I have heard that some trees have many more genes than humans.) Document collection can easily have counts on more than 60000 words. Drug properties can easily be in the millions. The rule thumb for classic multivariate analysis methods was to have at least three times more cases than variables. I doubt if this was really tested with a huge number of variables. Many of today's data sets don't come close to satisfying the rule of thumb. Gene expression data sets often involve less than a hundred people or tissues. Often we don't know what variables are relevant to a particular phenomena of interest. If we use linear combinations involving thousands of irrelevant variables the signal in the relevant variables will often be hidden by noise. Further, and some of the myriad patterns generated by noise are likely to appear to be related to the phenomena of interest. Class presentations will address some facets the challenge and point to methodologies the addressing the challenge. For example lasso regression had proved useful in removing variables in a regression context. One of the points here is that classic variable selection and dimension reduction methods such as stepwise regression and principle components can fail in both ways with a large number of variables. A second point here is that it can be advantageous to lift disciplinary blinders and see how people in other disciplines are addressed somewhat similar challenges. Document analysis methods can be useful in image analysis and so on. Of course methods need to be adapted to different circumstances but often enough key ideas still apply. At the moment I want to call attention to the genetics research at Stanford by Bair and Tibshirani. They have developed methodology called semi-supervised principle components. This for the situation with a dependent variable, a modest number of cases and thousands of variables. See E. Bair and R. Tibshirani, 2004, "Semi-Supervised Methods to predict Patient Survival from Gene Expression Data," PLOC, Vol. 2(4), p522. You can Goggle to get to Tibshirani's web site that contains links to papers and an R package implementation. Some this methodology may make today's touring seem no more like an intellectual curiosity than a powerful tool. The general idea of touring is still a good one and more focused tours may emerge. It is not yet time to remove humans from the analysts loop. Rather it is high time we took human capabilities and limitation seriously in developing tools for analysts. 2.4 Visual analysis of virtual graphics In my opinion that there are too many potential graphics to examine. I seek tools that will leverage my time by guiding to me potentially important views. While I am an not alone in this, Bill Cleveland, whom I highly respect, has a different view. He gave a talk at a MSRI (Mathematical Science Research Institute) workshop in 2006. indicating that it was important to look at all the data. Hopefully our opinions are not too far apart. Many years ago Paul Tukey, a fifth cousin to and sometimes collaborator with John Tukey gave a talk on scagnostics that inspired several attendees include myself and Leland Wilkinson. This lead me to develop an algorithm to find usually patterns that I called arms, in 2-D and 3-D scatterplots. Having forgotten the term scagnostics I used a John Tukey term, cognostics, as a generic label. My example application watched a computational fluid dynamics code on a parallel processor and as assessed virtual plots of things such as vorticity and partial derivatives of velocity. The code ran on one processor one time step behind the model. See Carr 1991. “Looking at Large Data Sets Using Binned Data Plots." Computing and Graphics in Statistics, Ed. A Buja and P. Tukey. Springer-Verlag pp. 7-39. To my delight Lee Wilkinson and collaborators have recently resurrected Paul Tukey’s ideas and pushing them forward. One of their papers is Wilkinson, Anand, and Grossman, 2006. "High-Dimensional Visual Analytics: Interactive Exploration Guided by Pairwise Views of Point Distributions." IEEE Transaction Visualization and Computer Graphics, Vol 12, No. 6. This month, I heard from Nicholas Levin-Koh that progress had been made toward making an implementation available as package in R. Further an invited conference talk by Heike Hofmann (Iowa State) describes some results from an implementation and extension in test version of GGobi. The scagnostics approach focuses on a data set with several variables. The point patterns of variable pairs are conceptually shown in a scatterplot matrix. The Wilkinson paper develops about ten features of interest to calculated for each plot in the scatterplot matrix. The continuous feature values are constrained to lie in the interval [0 1]. Two example features indicate the presence of outliers and the presence of striation. Both hexagon binning and computation geometry algorithms as used to make the computation fast. Pairs of feature values then lead to a new scatterplot matrix in which each point represent a plot from the previous scatterplot matrix. Mousing on a point can produce the scatterplot. Brushing can show where the brushed scatterplots appear in all the diagnostics plots. The same features can be applied to create a scatterplot matrix of points representing the diagnostic plots in from the diagnostics scatterplot matrix. Researchers can the answer higher order questions such as "which diagnostic plots seem to be the most extreme in terms outliers and striation.” 2.5 Units of Measure and Variable Scaling Taking linear combinations of variables measured in different units produce results that can be hard to interpret. A class will address different transformations of the data. For example one can "standardize" variables be subtracting the mean and dividing by the standard deviations. The resulting variables are unitless, so linear combinations are not mixing units of measure. At the same time we do seek understanding in terms of the measured variables or sometimes latent variables that have units of measure. 3. A Function for Natural Homogeneous Coordinate Projection ========== ##Run nhcProjMat = function(d=24,cop='M',heye=1.1){ # The default heye (half eye separation) is in inches # Thus d (distance) and the data should also be in inches. # # cop stands for center of projection # The default M stands for mid eye # options R and L are for Right eye and Left eye heyeN = switch(casefold(cop,upper=T), R= -heye, L= heye, 0) mat = diag(c(d,d,0,d)) mat[3,1] = heyeN mat[3,4] = -1 return(mat) } ##End 3.2 A short 3-D Color anaglyph stereo view of the first three touring coordinates As we take another look at the data for 1.1. We conveniently note that the data generation lets us get by without scaling into inches and that we can guess at a plausible global scale for the tour views. This script is intended to be instructive A much faster running script is possible. There are some redundant calculation in the script below since the y coordinates for the left and right eye projections are the same. ## Run cmat = matrix(c( 0,0,0, 1,1,1, .8,.8,.8, 1,0,0, 0,.55,.55),ncol=3,byrow=T) RedCyan=rgb(cmat[,1],cmat[,2],cmat[,3]) palette(RedCyan) projRmat = nhcProjMat(d=20,cop='R') # righteye projLmat = nhcProjMat(d=20,cop='L') # left eye windows() par(pty='s') cex = 1.4 nr = nrow(dat) one=rep(1,nr) cols = c(rep(4,nr),rep(5,nr)) # c(rep(4,nr),rep(5,nr)) for (step in 1:nSteps){ nangles = (step*angles)%%base ndat =dat%*%rot4dMat(nangles) # tours ndat[,4] = one # using the last coordinate for nhc pR2 = ndat %*% projRmat # projecting right eye pL2 = ndat %*% projLmat # projecting left eye pR2 = pR2[,1:2]/pR2[,4] # convert to Euclidean coordinates pL2 = pL2[,1:2]/pL2[,4] # convert to Euclidena coordinates plot(rbind(pR2,pL2),col=cols,xlab='',ylab='', xlim=c(-3.5,3.5),ylim=c(-3.5,3.5),axes=F,cex=1.5) } ##End 4. A Random Tour in a 2-D view ======================================== For a random tour we start with two random matrices, A and B, each containing two orthonormal vectors. The length of the vectors is same as the number of variables in the data set DAT being used. Just add more orthogonal columns to A and B to produce more Coordinates for viewing. Let XY1 = DAT %*% A # DAT n x d %*% A d x 2 yields XY1 n x 2 XY2 = DAT %*% B Then XY1 and XY2 are two different 2 column matrices of x and y values to plot Let theta be a value between 0 and pi/2 then XY = DAT %*% (cos(theta)*A + sin(theta)* B) This also gives two variables to plot and we can vary theta. At theta = 0, cos(theta) = 1 and sin(theta) = 0 so the A transformation is used. At theta = pi/2 cos(theta) = 0 and sin(theta)=1, so the B transformation is used. To continue the random tour 1) A = B 2) generate a new random B 3) step through angles again from 0 to pi/2 4) repeat 1-3 4.1 Random matrix with two orthonormal vectors function There are many ways to do this. The following is a quick implementation the Gram-Schmidt approach ##Run gsRand = function(d){ x = runif(d,-1,1) x = x/sqrt(sum(x*x)) # normalize y = runif(d,-1,1) y = y - sum(x*y)*x # regression residuals test = sum(y*y) if(abs(test) < 1e-12){ # try again y = runif(d,-1,1) y = y - sum(x*y)*x test = sum(y*y) } y = y/sqrt(test) # normalize return(cbind(x,y)) } A = gsRand(5) t(A)%*% A # check ## End 4.2 Generate some 5-D data ##Run dat = matrix(rnorm(200),ncol=5) ##End 4.3 2-D random tour with 4 sets of linear combinations ##Run windows() palette(c("blue","white")) par(pty='s') cex = 1.2 plot(c(-4.5,4.5),c(-4.5,4.5),type='n') # Compute a sequence of cosines and sines n = 100 angles = seq(0,pi/2,length=n) cAng = cos(angles) sAng = sin(angles) # first view of data B = gsRand(5) for (iter in 1:10){ A = B B = gsRand(5) for (i in 1:length(cAng)){ new = dat %*% (cAng[i]*A + sAng[i]*B) plot(new,xlim=c(-4.5,4.5),ylim=c(-4.5,5.5),cex=1.3) } } ##End Here are some observations and ideas for variations. The change in direction that occurs when replacing previous orthonormal vector with a new one is often evident. The trajectories (histories) can be viewed as layers in 3-D: Use the sequence number as the third coordinate Connect the dots for each point. A one step trajectories can be shown. We could cluster the data and assign colors to the trajectories based on cluster membership. As simpler view with less trajectories we could compute the cluster means and use them or select one or two points from each cluster and used them. If we are touring points, we could show a minimal spanning tree in each 2-D or 3-D view. 5. Viewing the 4D tour in 4D using stereo ray glyphs ============== Color anaglyph stereo is the worst form of stereo. Shuttering glass or polarized light stereo allows full use of color. This example really needs alpha blending to mix the overplotted colors so both eyes see the points. With light addition I would be inclined to make the colors lighter and plot on a dark background. ## Run dat = matrix(rnorm(200),ncol=4) ##End cmat = matrix(c( 0,0,0, 1,1,1, .8,.8,.8, 1,0,0, 0,.55,.55),ncol=3,byrow=T) RedCyan=rgb(cmat[,1],cmat[,2],cmat[,3]) palette(RedCyan) projRmat = nhcProjMat(d=20,cop='R') # righteye projLmat = nhcProjMat(d=20,cop='L') # left eye windows() par(pty='s') cex = 1.4 rayLength = .25 angleInc = .002 angles = angleInc*sqrt(c(2,3,5,7,11,13)) %% base nSteps = 3000 nr = nrow(dat) one = rep(1,nr) # global scaling is a issue # for rays as well. for (step in 1:nSteps){ nangles = (step*angles)%%base ndat =dat%*%rot4dMat(nangles) rayAngle = pi*ndat[,4]/6.4 # roughly scaled into -pi/2 to pi/2 dx = rayLength*cos(rayAngle) dy = rayLength*sin(rayAngle) # tours ndat[,4] = one # using the last coordinate for nhc pR = ndat %*% projRmat # projecting right eye pL = ndat %*% projLmat # projecting left eye pR = pR[,1:2]/pR[,4] # convert to Euclidean coordinates pL = pL[,1:2]/pL[,4] # convert to Euclidena coordinates plot(c(-3.3,3.3),c(-3.3,3.3),type='n',col=cols,xlab='',ylab='',axes=F) segments(pL[,1],pL[,2],pL[,1]+dx,pL[,2]+dy,col=4,lwd=2) segments(pR[,1],pR[,2],pR[,1]+dx,pR[,2]+dy,col=5,lwd=2) points(pL,col=4,pch=15,cex=.6) points(pR,col=5,pch=15,cex=.6) # text(pR,labels=1:30) } ##End