Assignment: a3_quantiles Topics: 1. Setup: Install plotFunctions 2. Quantiles and cumulative probability plots for a continuous data sample 3. Quantile and cumulative probability plots for a continuous theoretical distributions 4. Scaling for superimposed plots 5. The empirical distribution function Due: 2.2-2.3 Quantile plot with all the arrows 2.4 Plot 3. Gamma plot 4. Second plot 5. Plot Notes: 2. The primary purpose of presenting quantile and cumulative probability plots is to provide background for comparing distributions using QQ Plots. Quaniles are also related to the later topics of box plots and legends for choropleth maps. 3. Note that we can have cumulative probabilities and quantiles for both data and and theoretical distributions. QQplots will often be used to compare sample quantiles with theoretical quantiles, expecially standard normal quantiles. If fact varying parameters of a distributional family to make the theoretical quantiles match the sample quantiles provides a visual way of estimating distribution parameters. If there is a straight line fit to the points in a qqplot varying pure location and scale parameters can produce a good fit. Often there is distribution mismatch due to descripancies in moments higher than 2. For example sample might be multimodal and the theoretical family tried unimodal. This will not produce a straight line fit and playing with location and scale parameters will not help much. 4. When looking at a single distribution, analysts tend to prefer density plots to quantile plots. Consequently we do not dwell on this use of quantile plots. 5. The empirical cumulative distribution plays an important role in the development of statistical theory. An example seems justified even though we won't use it in this class. 1. Install Plot Functions Download the plotFunctions file from the class web site to a directory. With the R Console active and use the Source R code item under the File menu to install the functions in the script file. The two functions are plotGrid() and plotOutline(). They are used in this and some of the subsequent assignments. The functions are useful in plotting sequences that start with a plot command that just sets up the plot. plot(rx,ry,type='n',axes=F,...) plotGrid() #points() #lines() or other plot components plotOutline() Using plotGrid() with default arguments fills the plot with light gray, adds white grid lines and black grid line labels. Using plotOutline() with the default arguments outlines the plot in black. Entering a function name in R will list the function definition and reveal the arguments that control colors and line widths. The key script lines in plotOutline() could be move into plotGrid() and save a step. Sometimes lines in the plot can overplot the plot outline and look sloppy so I historically chose to plot the outline last. R provides control for the ends of the thick lines so with work there is a away to avoid the problem and outline can be used with the background grid. 2. Quantile and Cumulative Probability Plots for a Continuous Data Sample What are quantiles? Consider a sample of numbers {8, 3, 11, 9, 2} We call the sorted or ordered numbers (2, 3, 8, 9, 11) quantiles We also all them order statistics. To be more specific about individual quantiles we associate each quantile with a cumulative probability. The cumulative probabilities we will use for five quantiles are the fractions (.1, .3, .5, .7, .9). Pairing the corresponding elements yields the following( q, p ) pairs. q p ( 2, .1) ( 3, .3) ( 8, .5) ( 9, .7) (11, .9) We can now say the 2 is the .1 quantile, 3 is the .3 quantile, and so on. We can use different but equivalent words say 2 is the 10th percentile. We can use additional words for special quantiles. For example 8,the .5 quantile, is also all the median. Cleveland uses the general notation q = Q(p). For the about sample 2 = Q(.1) ans 8 = Q(.5). 2.1 Calculation of cumulative probabilities____________________ The study of order statistics lead to a simple function for providing a cumulative probabilities corresponding to the order statistics. The cumulative probability for the ith order statistic from a sample of size n is cumProb = (i - a)/(n+1-2*a) where [0 <= a <= 1] Sometimes there is a best "a" for a family of distributions. For example a=.375 is best from the normal family (at least from one perspective.) In the data exploration context we have data and rarely know the probability family. Hence we want something reasonable and choose something simple. We use the formula with a=.5 as advocated by Cleveland. The formula the reduces to (i-.5)/n where "i" is the rank or position of the quantile or order statistic in the sorted list and "n" is the sample size. order cumulative statistics rank probability or or or quantiles position f value q i p = (i-.5)/n --------------------------------------- 2 1 .1 3 2 .3 8 3 .5 9 4 .7 11 n=5 .9 Cleveland calls the cumulative probabilities f values, possibly because they are fractions between zero and one. Since not many statisticians use the label f value, I tend to use the letter p and the label cumulative probabilites. It would be nice if our batch of data is an independent identically distributed random sample from some population. This may not be the case. Technically there may not be probability foundation for our batch of data. We can still use the language of probabilty to label our procedures. We can calculate cumulative probabilities using R. ##Run samp = c( 8, 3, 11, 9, 2) q = sort(samp) # quantiles n = length(q) i = 1:n # positions p = (i-.5)/n # cumulative probabilities cbind(q,i,p) # table cbind(q,p) # q and p pairs ##End 2.2 Cumulative Probability and Quantile Plots______________________ If we plot the (q,p) pairs and put the cumulative probabilities on the Y axis, then we have a Cumulative Probability plot. Here we assume the situation of continuous data and the absence of gaps, so choose to connect the dots. ##Run plot(q,p,type="b", xlab="Quantiles", ylab = "Cumulative Probabilties", main="Cumulative Probability Plot" ) ##End If we plot pairs but put the quantiles on the Y axis, we have a Quantile plot. Again we assume we are working with continuous data and connect the points. ##Run plot(p,q,type="b", xlab = "Cumulative Probabilities", ylab="Quantiles", main="Quantile Plot" ) ##End In a quantile plot, we can read a quantile for a given cumulative probability. We starting at the probability, going straight up to the quantile line. and then go straight left to read from the Y-axis. ##Run plotBounds = par()$usr # plot limits xMin,xMax,yMin,yMax in data units xMin = plotBounds[1] yMin = plotBounds[3] arrows(p,rep(yMin,length(q)),p,q,length=.09,col="blue") arrows(p,q,rep(xMin,length(p)),q,length=.09, col="blue") ##End 2.3 Using linear interpotation to compute more quantiles_______________ We can select additional cumulative probabilities, and compute the matching quantiles using R's linear interpolation function called approx(). Use of approx() is illustrated below. ## Run pNew = c(.15,.2,.25,.35,.4,.45,.55,.6,.65,.75,.80,.85) result = approx(p,q,pNew) # p must be in ascending order result ## End The returned object, result, is a list with components $x and $y. $x is just a copy of x values we gave to the function (the values from pNew in this case) $y are the new quantiles we seek ## Run qNew = result$y arrows(pNew,rep(yMin,length(q)),pNew,qNew,length=.09,col="red") arrows(pNew,qNew,rep(xMin,length(p)),qNew,length=.09, col="red") ##End 2.4 Plotting at Selected Cumulative Percents In general we can specify the cumulative probability where we want(p,q) pairs plotted and obtain the quantiles by interpolation. It is desirable to cover most of if not all of the "domain" of cumulative probabilities. It is also important not to go outside of the domain. The function approx is designed for interpolation, not extrapolation. Language note: We are treating cumulative probabilities as values for the x-axis and often use the term "domain" for the range of x-values in the context of y=f(x). If the sample sizes is n=100, the minimum cumulative propability is (1-.5)/100 = .005 and the maximum is (100-.5)/100 = .995. Thus the cumulative probability domain for a sample of size 100 is [.005 .995]. Cleveland suggests that if we must extrapolate to obtain quantiles outside such intervals we should continue to use the minimum data value for all small cumulative probabilities and the maximum data value for all large cumulative probabililties. For our class purposes it will typically suffice to use pSuffice = seq(.01,.99,length=99) for most samples larger than 100. ##Run samp = rnorm(1000) q = sort(samp) p = ppoints(q) # An R function providing (i-.5)/n pSuffice = seq(.01,.99,length=99) pq = approx(p,q,pSuffice) plot(pq$x,pq$y, type='n',axes=F, xlab="Cumulative Probabilities", ylab="Quantiles", main= "Interpolated Quantiles Based on Sample") plotGrid() lines(pq$x,pq$y,type='o',lwd=2) plotOutline() ## End 3. Theoretical Quantile Plots________________________________________________ Rather than base our quantile plots on data we can chose to use a particular member of a probability family. We select cumulative probabilities, specific the quantile function for the familty and privde the specific parameters as argments In R The quantile function for a family starts with the letter "q". The cumulative probability function for a family starts with the letter "p". The probabilty density function for a family starts with the letter "d". The random number generator function for a family starts with the letter "r" For example R has a two parameter gamma family quantile function ##Run windows() p = ppoints(100) # An R function providing (i-.5)/n for n >= 10 # The function use a=.375 for n<10. This is okay. q = qnorm(p) plot(p,q, type='n',axes=F, xlab="Cumulative Probabilities", ylab="Standard Normal Parameters: Mean = 0, Standard Deviation = 1", main="Quantile Plot for a Normal Distribution") plotGrid() lines(p,q,type='o',lwd=2) ## End ##Run windows() p = ppoints(100) # An R function providing (i-.5)/n q = qnorm(p,mean=100,sd=16) plot(p,q, type='o',axes=F, xlab="Cumulative Probabilities", ylab="Normal Distribution Parameters: Mean = 100, Standard Deviantion = 16", main="Quantile Plot for a Normal Distribution") plotGrid() lines(p,q,type='o',lwd=2) plotOutline() ## End ##Run windows() p = ppoints(100) # An R function providing (i-.5)/n q = qgamma(p,shape=2,rate=2) # quantiles from a two parameter gamma family # cumulative probabilities are the required # argument plot(p,q, type='n',axes=F, xlab="Cumulative Probabilities", ylab="Gamma Parameters: Shape=2 and Rate=2", main="Quantile Plot for a Gamma Disbribution") plotGrid() lines(p,q,type='o',lwd=1,cex=1.2) ## End 4. Scaling for superimposed quantile plots___________________________________________ ##Run n1= 250 samp1 = rnorm(n1,mean=100,sd=16) cumProb1 = ppoints(n1) n2 = 300 samp2 = rnorm(n2,mean=120,sd=16) cumProb2 = ppoints(n2) # find the range of the combined quantiles to set the ylimits for # for the plot. quantileR = range(samp1,samp2) plot(c(0,1),quantileR,type='n',xlab='',ylab='Quantiles', main='Overlaying Quantiles for Two Samples With Different Means', axes=F) mtext(side=1,line=1.7,"Cumulative Probabilities") # better spacing x axis plotGrid() lines(cumProb1,sort(samp1),col='black',lwd=3) lines(cumProb2,sort(samp2),col='blue',lwd=3) ##End ##Run n1= 100 samp1 = rnorm(n1,mean=100,sd=16) cumProb1 = ppoints(n1) n2 = 80 samp2 = rnorm(n2,mean=100,sd=8) cumProb2 = ppoints(n2) quantileR = range(samp1,samp2) plot(c(0,1),quantileR,type='n',xlab='',ylab='Quantiles', main='Superimposed Sample Quantiles With Different Standard Deviations', axes=F) mtext(side=1,line=1.7,"Cumulative Probabilities") # better spacing x axis plotGrid() lines(cumProb1,sort(samp1),type='o',col='black',lwd=2) lines(cumProb2,sort(samp2),type='o',col='blue',lwd=2) ##End 5. Empirical cumulative distribution function (ecdf)_______________________________ A cumulative distribution function (cdf) is defined on the interval from (-infinity, infinity). The function values (cumulative probabilities) go from 0 to 1. The empiricial cdf is a step function with jumps at the observations. If there are no ties, and the sample size is n, the jumps are all 1/n. By convention the function is continuous from the right. ##Run n = 25 dat = rt(n,df=2) q = sort(dat) jump = rep(1/n,n) p = cumsum(jump) q=c(q[1],q) # add 0 prob at first quantile (technically just to the left) p=c(0,p) plot(q,p,type='s',) plot(q,p,type='n',xlab='',ylab='Cumulative Probabilities', main='Empirical Cumulative Distribution Function',axes=F) mtext(side=1,line=1.7,"Sorted Data") # better spacing x axis plotGrid() lines(q,p,type='s',col='black',lwd=3) dx = .03*diff(range(q)) arrows(q[n+1],1,q[n+1]+dx,1,lwd=3,length=.1,col='red') arrows(q[1],0,q[1]-dx,0,lwd=3,length=.1,col='red') ##End