File    a12_randomForests
By      Quotes from http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
        Adapted from randomForest documentation
        A little creativity in 8.

Sections
1.  Introduction to random forests
      How random forests work
      The out of bag error estimate
      Variable importance
2.  Look at the iris data
3.  Build a random forest
      Show classification rates
      Observe two different individual trees
      Note classes in an MDS plot
4.  Variable importance
5.  Prototype case for each species and a scatterplot
6.  Predictor outlier measures for cases based on proximity
7.  Voting margins for cases
8.  Showing low dimensional prediction regions
    with multivariate graphics
9.  Imputing missing data - just mentions functions
10. Selected random forest arguments
11. Partial dependency plots

Setup   Use the install options under the Packages menu
        to install randomForest

Due     Plots from 2,3,4.3,5,6,7,8,11

Note    This assignment emphasizes classification problems only.
        Random forests can do much more.

1. Introduction to random forests

Random forests provide a powerful way to model data and make
predictions for new data. Random forest methodology should
be part of a data analyst's set of tools.

Research continues and methodology advances. In terms of
supervised classification my understanding is that
1) Support vector machines are still held in high regard.
2) Random forests are competitive with support vector machines,
   and more people find random forests easier to understand
   and to use properly.
3) RuleFit, which builds on the random forest methodology,
   has some merits over random forests.

This understanding, even if it was correct when it was formed,
may already be dated. Suffice it to say that random forest
modeling is still likely to be among the very best data modeling
approaches, and learning about random forests is worthwhile for
those who will have reason to build classification or
regression models.

As a brief introduction I choose to quote from the beginning of the web site.
Please refer to this site if the following sections are not clear
or perhaps seem inaccurate.
http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm

1.1 "How random forests work

To understand and use the various options, further information
about how they are computed is useful. Most of the options depend
on two data objects generated by random forests.

When the training set for the current tree is drawn by sampling
with replacement, about one-third of the cases are left out of
the sample. This oob (out-of-bag) data is used to get a running
unbiased estimate of the classification error as trees are added
to the forest. It is also used to get estimates of variable
importance.

After each tree is built, all of the data are run down the tree,
and proximities are computed for each pair of cases. If two cases
occupy the same terminal node, their proximity is increased by
one. At the end of the run, the proximities are normalized by
dividing by the number of trees. Proximities are used in replacing
missing data, locating outliers, and producing illuminating
low-dimensional views of the data."

1.2 "The out-of-bag (oob) error estimate

In random forests, there is no need for cross-validation or a
separate test set to get an unbiased estimate of the test set
error. It is estimated internally, during the run, as follows:

Each tree is constructed using a different bootstrap sample from
the original data. About one-third of the cases are left out of
the bootstrap sample and not used in the construction of the
kth tree.

Put each case left out in the construction of the kth tree down
the kth tree to get a classification. In this way, a test set
classification is obtained for each case in about one-third of
the trees. At the end of the run, take j to be the class that got
most of the votes every time case n was oob. The proportion of
times that j is not equal to the true class of n averaged over
all cases is the oob error estimate.
This has proven to be unbiased in many tests."

1.3 "Variable importance

In every tree grown in the forest, put down the oob cases and
count the number of votes cast for the correct class. Now randomly
permute the values of variable m in the oob cases and put these
cases down the tree. Subtract the number of votes for the correct
class in the variable-m-permuted oob data from the number of votes
for the correct class in the untouched oob data. The average of
this number over all trees in the forest is the raw importance
score for variable m.

If the values of this score from tree to tree are independent,
then the standard error can be computed by a standard computation.
The correlations of these scores between trees have been computed
for a number of data sets and proved to be quite low, therefore we
compute standard errors in the classical way, divide the raw score
by its standard error to get a z-score, and assign a significance
level to the z-score assuming normality."

2. Look at the Iris data___________________________________

4 continuous predictors:
  Sepal length and width - centimeters
  Petal length and width - centimeters
1 categorical dependent variable:
  species: setosa, versicolor, virginica

## Run
library(randomForest)
data(iris)

# look at 5 random cases from each species
subs = c(sample(1:50,5),sample(51:100,5),sample(101:150,5))
iris[subs,]

species = as.numeric(iris$Species)       # stored as a factor
irisColor = c("red","blue","#008000")    # one color per species
caseColor = irisColor[species]           # one color per case

# windows() works on Windows only; use dev.new() elsewhere
windows(width=8,height=8.5)
pairs(iris[,1:4],gap=0,pch=21,cex=1.2,
  col=caseColor,   # outline
  bg=caseColor)    # fill
## End

3. Build a random forest______________________________________

## Run
# 500 trees created by default
irisRf = randomForest(x=iris[,-5],y=iris[,5],
  keep.forest=TRUE, proximity=TRUE)

# Note the classification error counts
irisRf

# Show cases in a 2D MDS view
# There is a lot of overplotting
# Setosa species colored red is well separated
# Candidate blue and green misclassifications are pretty obvious.
windows(width=8,height=8.5)
MDSplot(irisRf,fac=iris$Species,k=2,palette=irisColor)

# The first tree of 500
# status: -1 = terminal (leaf) node, 1 = internal (split) node
# prediction for leaf nodes: 1 = setosa, etc.
getTree(irisRf,k=1,labelVar=FALSE)

# The 100th tree of 500
getTree(irisRf,k=100,labelVar=FALSE)
## End

4. Variable Use Counts and Importance__________________________

4.1 Counts of tree branches using the variables
    Note that Petal length and width are used more often

## Run
cnt = varUsed(irisRf)
names(cnt) = colnames(iris[,-5])
cnt
## End

4.2 Variable Importance

Remember from 1.3 that out of bag cases are used in assessing
importance. If a variable is important and its oob values are
permuted before the tree is used for prediction, we would expect
a decline in tree prediction accuracy. We would also expect splits
on an important variable to produce large decreases in node
impurity.

Tree prediction accuracy decline is assessed by
  Classification: Increase in percent of misclassifications
  Regression: Increase in squared residuals

Note: The declines for each tree are averaged and normalized
      by the standard error. (If the standard error is 0,
      the normalization does not occur)

Node impurity decrease from splitting on the variable is assessed by
  Classification: Gini Index
  Regression: Residual sum of squares

## Run
set.seed(4543)
irisTempRf = randomForest(iris[,-5],iris[,5],ntree=1000,
  keep.forest=FALSE,importance=TRUE)
importance(irisTempRf)
## End

4.3 Variable Importance Dot Plot

## Run
varImpPlot(irisTempRf)
## End

5. Prototype case for each species and a scatterplot_______________

The procedure for the classCenter function below
  For each class
    Pick the case that has the most of its nNbr nearest neighbors
      from its own class
    Compute the median for numeric variables
      of the own class neighbor cases
    For categorical variables use the most frequent value
  For a second prototype, repeat using the nNbr closest neighbors
    not previously used.

## Run
irisP = classCenter(iris[,-5],iris[,5],irisRf$prox)
species = as.numeric(iris$Species)
irisColor = c("red","blue","green")
plot(iris[,3],iris[,4],pch=21,xlab=names(iris)[3],ylab=names(iris)[4],
  bg=irisColor[species],main="Iris Data with Prototypes")
points(irisP[,3],irisP[,4],pch=21,cex=2,bg=irisColor)
## End

6. Predictor outliers for cases________________________________

Remember from 1.1 that proximity is based on counting
pairs of cases that share a leaf node.

Here the outlier measures are in a numeric vector
with one value per case.

The outlier measure for a case is computed as
n / sum(squared proximity), normalized by subtracting
the median and dividing by the MAD, within each class.

## Run
plot(outlier(irisRf),type="h",
  col=caseColor,lwd=2)
## End

7. Voting margins for cases____________________________________

For random forests, margin methods for classification
are not like regression.

For EACH case the margin is
  The proportion of votes for the correct class MINUS
  the highest proportion of votes among the wrong classes

When the difference is positive, majority rule predicts
the right class.

## Run
set.seed(1)
data(iris)
windows(width=8,height=8.5)
x = seq(along=iris$Species)
y = margin(irisRf,iris$Species)
# base plot() is used here; gPlot() is not defined in this file
plot(x,y,main="Random Forest Margin Plot for Iris Data",
  pch=21,bg=caseColor)

# use identify
#   left click to label the 6 lowest points
#   right click to access the stop option.
identify(x,y)
## End

8. Showing Low dimensional prediction regions
   with multivariate graphics__________________________________

There are 4 continuous predictor variables.
We can observe prediction regions using 4D graphics + species color

4D graphics options include
  parallel coordinate plots
  scatterplot matrices
  casement display
  stereo or rotating ray glyphs
  other glyphs

Issues include
  overplotting
  selecting variables to emphasize
  distinguishing data from non-data domains

Current choices
  Overplotting: Casement Display
  Resolution Emphasis: Petal Length and Width
  Real data domain highlighting: Not done

## Run
# get ranges for predictors
irisMin = apply(iris[,1:4],2,min)
irisMax = apply(iris[,1:4],2,max)
irisR = irisMax-irisMin

# Select resolution of points across the range
# Petal length and width are the important variables
# Give them more resolution
gridSl = seq(irisMin[1],irisMax[1],len=5)
gridSw = seq(irisMin[2],irisMax[2],len=5)
gridPl = seq(irisMin[3],irisMax[3],len=10)
gridPw = seq(irisMin[4],irisMax[4],len=10)

# Generate predictor matrix and predict
grid4D = expand.grid(list(sl=gridSl,sw=gridSw,pl=gridPl,pw=gridPw))
mat4D = as.matrix(grid4D)
colnames(mat4D) = names(iris)[1:4]
irisPredict = predict(irisRf,mat4D)
predictCaseColor = irisColor[as.numeric(irisPredict)]

# Construct casement display plotting coordinates
#   nest Sepal Length in Petal Length
#   Scale range of centered sepal length to
#     range of petal length/12.5
#   Handle width similarly
incX = scale(grid4D$sl,scale=12.5*irisR[1]/irisR[3])
incY = scale(grid4D$sw,scale=12.5*irisR[2]/irisR[4])
xNew = mat4D[,3]+incX
yNew = mat4D[,4]+incY

# Expand the plot ranges by 4.5% about their midpoints
xNewR = range(xNew)
xNewR = 1.045*(xNewR-mean(xNewR))+mean(xNewR)
yNewR = range(yNew)
yNewR = 1.045*(yNewR-mean(yNewR))+mean(yNewR)

windows(width=8,height=8.6)
plot(xNewR,yNewR,type='n',
  xaxs='i',yaxs='i',las=1,
  xlab="Petal Length refined by Sepal Length",
  ylab="Petal Width refined by Sepal Width",
  main="Prediction domains for three Iris Species")
tmp = par()$usr
rect(tmp[1],tmp[3],tmp[2],tmp[4],col="#A0A0A0")
points(xNew,yNew,pch=22,col="#B0B0B0",
  bg=predictCaseColor,cex=2.1)
mtext(side=3,line=.3,"Setosa=Red, Versicolor=Blue, Virginica=Green")
## End

9. Imputing missing values in data.frames___________________________

This is not discussed here but is of potential use
See na.roughfix()
    rfImpute()

10. Selected Random Forest arguments________________________________

ntree:    The number of trees to grow
mtry:     The number of variables sampled as candidates at each split
            (p is the number of predictors)
            classification default: sqrt(p)
            regression default: p/3
sampsize: Size(s) of sample to draw.
            Classification: if a vector with one entry per class,
            the sample sizes for the respective classes
nperm:    The number of permutations in assessing predictor importance

11. Partial Dependency Plots________________________________________

Cases with a value in an interval for a predictor variable are
assigned to classes based on voting. The log of the class voting
fraction for a particular class can be compared against the
average log of the class vote fractions for all classes.
When the difference of the two values is greater than zero,
a predictor variable value being in the interval partially
supports the class assignment.

Below, the mid-range of Petal.Width is supportive of
the versicolor assignment.

## Run
set.seed(345)
partialPlot(irisRf,iris,Petal.Width, "versicolor")
## End
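
The comparison described in section 11 can be sketched in base R
without fitting a forest. This is only an illustration under stated
assumptions: the vote fractions below are made up (they are not
randomForest output), and the names votes, safeVotes, logitLike and
partialValue are hypothetical. As far as I understand it, partialPlot
repeats a computation like this over a grid of predictor values;
that loop is omitted here.

```r
## Run
# Hypothetical vote fractions (NOT randomForest output):
# rows are cases, columns are classes, each row sums to 1.
votes = matrix(c(0.10, 0.80, 0.10,
                 0.05, 0.70, 0.25,
                 0.00, 0.60, 0.40),
  ncol=3, byrow=TRUE,
  dimnames=list(NULL, c("setosa","versicolor","virginica")))

# Replace zero fractions by a tiny positive number so log() stays finite.
# ifelse() keeps the matrix shape and the column names.
eps = .Machine$double.eps
safeVotes = ifelse(votes > 0, votes, eps)

# For each case: log vote fraction for the focus class MINUS
# the average log vote fraction over all classes
focus = "versicolor"
logitLike = log(safeVotes[,focus]) - rowMeans(log(safeVotes))

# Average over cases: one height of the curve for one predictor value.
# A value above zero supports the focus class.
partialValue = mean(logitLike)
partialValue
## End
```

Here every case gives versicolor the largest vote fraction, so each
per-case difference and their average are greater than zero.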