Assignment a3_qqplots Topics: 1. Linear interpolation and plotting missing values 1.1 A single segment case 1.2 Piecewise linear case with missing values 1.3 Plotting missing values 2. Construction of QQplots for comparing two data sets 2.1 Two sets with the same size 2.2 Two sets with different sizes 2.3 Comments on paired values and satellite images 3. Normal QQplots 4. QQplots comparing a data set to given distribution Install: Source the gridPlot functions from the schedule Turn in: 1.2 Answers, 1.3 Plot 2.1 Plot, 2.2 Plot 3. A t3 qqplot, one uniform qqplot and one weibull qqplot 4. The one plot 1. Linear interpolation using approx() 1.1 As single segment case Consider points (x,y) on the line segment from (3,5) to (4,17). If x is 3.2, what is y? If x is 3.5, what is y? If x is 3.7, what is y? R has a function called approx () for linear interpolation. A set of point pairs, (x,y), sorted in ascending x order provide the foundation for linear interpolation. The arguments to approx() in order are The vector of x elements in the x sorted pairs. The corresponding vectory elements A vector of where new x values where new y values are desired. ## Run x = c(3,4) y = c(4,17) newx = c(3.2,3.5,3.7) ans = approx(x,y,newx) is.list(ans) names(ans) newy = ans$y newy ## End The result, ans, is a list structure with components ans$x and ans$y. The component ans$x has the same values as the third argument which is newx in the about example. The component ans$y contains the interpolated y values. If these new y values are all that is needed, the list structure, ans, does not have to be saved in the working directory. One can save just the interpolated values. newy = approx(x,y,newx)$y 1.2 Piecewise linear curves and NA values The above example had only one line segment. The piecewise linear curve below has multiple line segments whose end points fall on a quadratic curve. ## Run x = 1:10 y = 2*x^2 + .5*x + 3 newx = seq(.3,9.7,length=20) newy = approx(x,y,newx)$y cbind(newx,newy) ## END Why are the first two values for newy set as NA? By default many functions in R will not work on data with NA values. These functions often have an argument, na.rm= or na.action= for the analyst to set. The values to these argument tells R which of the available options to use For example in range(x,y,z,na.rm=T) the na.rm=T will remove all the NA values before calculating the range. So the question here is, what does plot() do if some of the x or y values are NA? ## Run plot(newx,newy,type='b') ## End Are the first two point pairs included? 1.3 Adding labels and missing data to a plot Add the following labels to plot below: title: Interpolation x-axis: Independent variable y-axis: Dependent variable Use ?plot if you don't remember the labeling arguments to the plot() function ## Run plot(newx,newy,type="b", ? ? ? ) plotLimits = par("usr") # min(y) is the their element points(newx[1:2],rep(plotLimits[3],2),cex=1.3,col="red",pch=16,xpd=T) ## End I used the idea on the plot outline to show missing data for a scatterplot matrix in a 1984 publication. This examle made it clear that the missing chemistry values for rain samples were associated with a small amount of rain. This placed doubt the accuracy of the chemistry values that were reported for smaller amounts of rain fall. 2. Constructing QQplots comparing two samples A QQplot is a plot of paired quantiles coming from samples or probability distributions. Whatever the source, the basis for pairing quantiles is having the same cumulative probabilities. When the two sample sizes are the same, the cumulative probabilities are the same. We can simple pair the sorted values and plot. The smallest value of sample 1 goes with the smallest value of sample 2 and so on. 2.1 Comparing two samples of the same size Suppose we have two samples called set1 and set2. If the samples have the same number of observations, the cumulative probability values are the same. ## Run set1 = rnorm(100) # normal distribution set1Quants = sort(set1) set2 = runif(100,min=-3,max=3) # uniform distribution set2Quants = sort(set2) set1CumProb = ppoints(set1) set2CumProb = ppoints(set2) all(set1CumProb == set2CumProb) ##End In this case the quantiles (the sorted values) can be directly paired to produce a qqplot. No interpolation is needed. ##Run gPlot(set1Quants,set2Quants, main="QQplot", xlab='Sampled Standard Normal Quantiles', ylab='Uniform Quantiles') abline(c(0,1),lwd=2,col="blue") # reference line for the same distribution ## End I don't like the reference line on top of the data points. The grid may be a little distracting, so the white grid line can be made thinner. Also the reference line and points can be made to stand out. ##Run gPlot(set1Quants,set2Quants,type='n', main="QQplot", xlab='Sampled Standard Normal Quantiles', ylab='Uniform Quantiles',gridLwd=1) abline(c(0,1),lwd=2,col="blue") # reference line for the same distribution points(set1Quants,set2Quants,cex=1.2,pch=16) ## End 2.2 Two batches with different sample sizes When the samples sizes are different, we use interpolation to obtain quantiles corresponding to the same cumulative probabilities. A reasonable choice is often to chose the cumulative probabilities for the smaller sample as the interpolation locations for obtaining the matching quantiles from the larger sample. Why don't we pick the cumulative probabilities from the large sampler as the new interpolation locations for the smaller sample? Another choice, sometimes used with very large samples, is to use a fixed sequence cumulative probabilities, such as c(.01, .02, ... .99) and separate interpolations to get quantiles from the two distributions. I used this approach in CCmaps when I wanted real time response to sliders and was dealing with thousands of values. ## Run set1= rnorm(41) set1Quants = sort(set1) cumProb1 = ppoints(set1) #cumulative probabilities set2 = rnorm(72,mean=1) set2Quants = sort(set2) cumProb2 = ppoints(set2) #cumulative probabilities set2MatchedQuants = approx(cumProb2,set2Quants,cumProb1)$y gPlot(set1Quants,set2MatchedQuants,type="n", main="QQplot", xlab="Sampled Standard Normal Quantiles", ylab="Sampled Normal Mean 2 Quantiles") abline(c(0,1),col="#4488FF",lwd=2) # equality line points(set1Quants,set2MatchedQuants,col="red",pch=16,cex=1.2) # The greenish-blue line is the reference line for equalty of distributions # Some times the distributions are so far apart the line doesn't show. # # Showing a reference line fitting the quantiles is very helpful # We can talk about constant and slope of the fitted line compare the two lines # # In Splus I used robust regresson to fit the line # # Here I do a quick regression downweighting the tails # (There are more appropriate regressions) probs = if(length(set1Quants) < length(set2Quants)) cumProb1 else cumProb2 qx = approx(cumProb1,set1Quants,probs)$y qy = approx(cumProb2,set2Quants,probs)$y coef= lm(qy~qx,weights = 1-abs(.5-probs)) abline(coef,col="red",lwd=2) ## End Without setting the random number seed in the script, your samples may be substantially different than mine. If my comments don't match you plots very well, that is likely the result of sampling variability. In my plots (I tried several different smaples) the red lines fits the red dots pretty well. This suggests that the two samples differ primarily in terms location (mean) and scale (standard deviation). In other words the two samples do not differ much in terms of moments higher than two. They have the same skewness (based on the third moment)and kurtosis (based on the fourth moment. When running the above script sequence many times, the fitted line and the equality reference line often (but not always) had roughly the same slopes. Having the same slope means the same standard deviations. In the cases where the slopes were similar, the red line is shifted upward about 1 from the greenish blue equality line. This means the he sample 2 quantiles are about 1 bigger than the sample 1 quantiles. The means differ by 1. If the slopes are the same, visually assessing the shift in mean can be a little easier using an MD plot (mean difference) plot. 2.3 Comments on Pair values and Satellite images Plotting paired values is useful in many situations. Consider two carefully aligned satellite images of a rain forest taken a years apart. The alignment provides one good basis for plotting paired pixel values. One plot could be a scatterplot of pair values. Departures from the no change reference line with slope 1 are of interest. It could be the changes might be associated more with high early values than low early values. A second plot could be an image plot with color encoding that shows the difference (or possibly the ratio) of paired values. This could indicate the locations where there is change. A third plot could be a QQplot where the pairing is by cumulative probabilities. This could better convey the overall all change in distribution. Geospatial context is often important, but views without this context can bring out other patterns. A future assign will include comparison to two satellite images. 3. Normal QQplots R has a qqplot()function that plots sample quantiles again standard normal scores. It also as a qqline() function that draws a line through the point pairs for the first and third quartiles unless specific otherwise. The examples below sample from different members of distribution families. There is a window for each chose member of the family. You will need to drag the windows to see the differences. When you look at the QQplot decide if the left and right tails are thinner, about the same as, or thicker than the normal tails. A thick left tail will have a few points below the reference line on the left of the plot or perhaps one or two points that are far below the reference line. A thick right tails will have a few points above the reference line on the right of the plot or perhaps one or two points that are far about the reference line A normal QQplot can reveal multi-modal distributions. A density plot provides a more direct view. A graphics.off() command appear before the qqplots for each family to remove the windows for the previous family. # Normal QQplots n = 100 # Sample Size Normal Distribution_______________________________________ ## Run graphics.off() windows(width=6,height=6) m0s1 = rnorm(n,mean=0,sd=1) qqnorm(m0s1,ylab= paste("Standard Normal Sample, N=",n,sep='')) qqline(m0s1) windows(width=6,height=6) m100s16 = rnorm(n,mean=100,sd=16) qqnorm(m100s16,ylab= paste("Normal Sample, Mean=100 SD=16, N=",n,sep='')) qqline(m100s16) ## End t-distribution_______________________________________________ ##Run graphics.off() windows(width=6,height=6) df = 3 tSamp = rt(n,df=df) qqnorm(tSamp,ylab= paste("t-Distribution Sample, df=",df,", N=",n,sep='')) qqline(tSamp) windows(width=6,height=6) df = 7 tSamp = rt(n,df=df) qqnorm(tSamp,ylab= paste("t-Distribution Sample, df=",df,", N=",n,sep='')) qqline(tSamp) windows(width=6,height=6) df = 30 tSamp = rt(n,df=df) qqnorm(tSamp,ylab= paste("t-Distribution Sample, df=",df,", N=",n,sep='')) qqline(tSamp) ##End Gamma distribution__________________________________________________ ## Run graphics.off() windows() shape = 1 gam = rgamma(n,shape=shape) qqnorm(gam,ylab= paste("Gamma Sample, Shape=",shape,", N=",n,sep='')) qqline(gam) windows() shape = 2 gam = rgamma(n,shape=shape) qqnorm(gam,ylab= paste("Gamma Sample, Shape=",shape,", N=",n,sep='')) qqline(gam) windows() shape = 5 gam = rgamma(n,shape=shape) qqnorm(gam,ylab=paste("Gamma Sample, Shape=",shape,", N=",n,sep='')) qqline(gam) windows() shape = 30 gam = rgamma(n,shape=shape) qqnorm(gam,ylab=paste("Gamma Sample, Shape=",shape,", N=",n,sep='')) qqline(gam) ##End windows() shape = 30 gam = rgamma(n,shape=shape, rate=5) qqnorm(gam,ylab=paste("Gamma Sample, Shape=",shape,", N=",n," ,Rate = 5",sep='')) qqline(gam) ##End Uniform Distribution_______________________________ ##Run windows() un01 = runif(n) qqnorm(un01,ylab= paste("Uniform [0 1] Sample, N=",n,sep="")) qqline(un01) windows() un1020 = runif(n,min=10,max=20) qqnorm(un1020,ylab= paste("Uniform [10 20] Sample, N=",n,sep="")) qqline(un1020) ##End Weibull Distribution_________________________________ ## Run windows() shape=1 weib = rweibull(n,shape=shape) qqnorm(weib,ylab= paste("Weibull Sample, Shape=",shape," ,N=",n,sep='')) qqline(weib) windows() shape=20 weib = rweibull(n,shape=shape) qqnorm(weib,ylab=paste("Weibull Sample, Shape=",shape," ,N=",n,sep='')) qqline(weib) ## End A Mixture of two normals_____________________________________________ graphics.off() ## Run x1 = rnorm(50,mean=60, sd=15) x2 = rnorm(50,mean=120,sd=15) samp = c(x1,x2) qqnorm(samp,ylab="Mixture of Two Normals, N=100") qqline(samp) ## End #4. Comparing Theoretical Distribution to Normal scores_____________________________________ A lot can be learned by taking repeated samples from a distribution and making normal qqplots. In particular this conveys the sampling variability. On occasion a sample from a normal distribution will suggest that the population has a thick tail. Sometimes a sample from a normal distribution will provide a poor indication of population mean or standard deviation. Sampling variability is one of the reasons that in science it is for there to be confirmatin from different researcher using different data. Sampling variabily makes it hard to develop a notion of the underlying relationship of the population quantiles to standard normal quantiles based on samples. It is easier to use theoretical quantilte from the target population to learn about the expected relationship. ##Run windows() cumProb = ppoints(n) x = qnorm(cumProb,mean=100,sd=16) y = qgamma(cumProb,shape=1) gPlot(x,y,main='QQplot', xlab="Standard Normal Quantiles", ylab="Theoretical Gamma Shape 1 Quantiles", pch=16,col="red",cex=1.2) coef= lm(y~x,weights = 1-abs(.5-cumProb)) abline(coef,col="red",lwd=2) ##End Here we clearly see the thin tail on the left and the thick tail on the right.