File a7_CV_LanlPacketHeaders
By Daniel B. Carr

Uses:

Tasks

  Get CrystalVision and install it. Directions are in a7_CV_genes
  Read the background description
  Get the data from the class web site and use the script
    to produce data for use in CrystalVision
  Select the x=time y=source port plot
    Enlarge the points (slider top right)
    Turn off the alpha blending (icon with A and slash)
    Brush the band at the bottom, changing the point color to green
      (Click the Brush icon top left and the color bar below.
       Select using the rubberband box.)
    Brush an interval at the top, changing the point color to magenta
      (similar to above)
    Click the pointer icon to exit brushing mode
  Click the variable button (v icon)
    Under the Select column set the following to Yes:
      TimeHr, Sip12, sPort, and dPort

Turn in

  The 4 kinds of screen shots (Alt PrntScrn keys) noted below
  Paste into Word (Ctrl V keys) and email as an attachment
    Scatterplot matrix (Scatterplot icon)
    Parallel coordinates (parallel coord. icon)
    3-D plot (3 axis icon)
    Ray plot (3 axis icon + icon with "4D")

__________________Background___________________________

Data

  Los Alamos Green Network Session Data, May 1, 2002
  Omits Green Network sessions going to destination port 80 (web).
  Hour 0 only: just 878 sessions
  Full day: 307325 sessions

IP addresses:

  Four fields separated by periods, e.g. 117.24.126.13
  The fields are called octets
  Octet range [0 255]    255 = 2^8 - 1
  Possible addresses 2^(8*4) = 4.29 billion

  Classes of networks
    Class A addressed by the first octet
      leaves 2^(8*3) = 16.8 million addresses
    Class B addressed by the first two octets
      leaves 2^(8*2) = 65 thousand addresses
    Class C (often subnets)
      leaves 2^8 = 256 addresses

Packets:

  Packets are units of information sent between computers.
  Their structure follows certain protocols.
  The packet headers contain information about
    Source and destination addresses
    Source and destination ports
    Session starting and ending sequences
    The sequence of packets within the session
  The packet header information is collected with a sniffer.
  The packet data sent with a sequence of packets may be encrypted
  and is typically not kept for analysis.

Session

  A session consists of a sequence of packets between computers.
  There are special beginning packet sequences, with a data packet
  sequence in the middle.
  The proper session-closing handshaking sequence of packets
  may not happen for many reasons.
    The session may not have been properly started, as indicated below.
    Either the source or the destination computer may stop sending
    packets for a variety of reasons.
  In the LANL green network data over 89% of the sessions
  were not properly closed.

Backscatter

  Hackers send messages that may spoof the source address.
  The destination computer will send a packet to the computer
  with the spoofed address. This is to move forward in the
  handshaking process to begin dialogue.
  The spoofed source computer is not expecting the packet
  since it didn't start the handshaking sequence.
  The receipt of such packets is called backscatter.
  Under some assumptions it is possible to estimate
  the amount of spoofing that occurs.

Source Ports

  2^16 values [0 65535]
  Different operating systems increment the source ports in different ways.
  This can be used to make inferences about operating systems.

  1024-4999 is the default ephemeral port range for Linux 2.2, Windows
  and some BSD systems.
  Linux 2.4 uses 32768 - 61000.
  AIX and Solaris use 32768 - 65535.
  BSD/OS and HPUX use 49152 - 65535.
  For more info see:
  http://www.ncftpd.com/ncftpd/doc/misc/ephemeral_ports.html

Graphics challenges

  Value ranges can lead to problematic overplotting using an ordinary scale.
  Suppose there are 1024 pixels (distinct plotting locations)
    Port numbers: 2^16 / 2^10 = 64 possible distinct values plotted on a pixel
    Time in milliseconds for a day:
      24 hours x 60 minutes x 60 seconds x 1000 = 86.4 million
      Roughly 84000 distinct values assigned to a pixel
    4 octet IP addresses: 4.29 billion
      Roughly 4.19 million distinct values assigned to a pixel.
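The overplotting arithmetic can be checked with a few lines of R. This is a minimal sketch, assuming the 1024 plotting locations used in the text. The names pixels, n.values and per.pixel are illustrative and are not part of the class script.

```r
# Number of plotting locations (pixels) along one axis, as assumed in the text
pixels <- 2^10

# Number of distinct values each variable can take
n.values <- c(port = 2^16,            # source or destination port
              time = 24*60*60*1000,   # milliseconds in a day
              ip   = 2^32)            # 4 octet IP address

# Average number of distinct values that share one pixel
per.pixel <- n.values/pixels
per.pixel
# port: 64   time: 84375   ip: 4194304
```

Changing pixels to the actual width of the plot window gives the corresponding figures for a particular display.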
  Rapid change
    Source port numbers may go through full cycles
    faster than the sampling intervals.
    The observed samples may appear random.

  Hacker evolution
    When intrusion patterns are easy to spot and the word gets out,
    hackers develop more sophisticated methods. Much of this is
    outside the background of the instructor and will be beyond
    the scope of this course.

Data Mining

  It is interesting to see the patterns that appear over twenty-four hours.
  Some of the interesting things, such as negative session durations,
  turn up in the clean-up stage before getting to the intended graphics.

CrystalVision limitations

  Brushing with the full data (300,000 cases) is slow
  The 3-D plots are suggestive but the axes are not labeled.
    Interpretation must be from memory of the construction
  The 4-D ray plots
    Lack labels
    Scale the fourth variable into angles ranging from 0 to 360,
    with the max overplotting the min.
  The assignment of variables to X, Y, Z, and Angle is limited
  to the data order in the file.

#__________________Reading and Fixing data______________________

hr0 <- scan(file='GreenNet1May02hr0.csv',
   what=list(SBytes=0,SPacket=0,SIP='A',SrcPort=0,Dst='A',
        Broken='A',Src='A',DPort=0,DIP='A',DPackets=0,
        DBytes=0,Start='A',End='A'),skip=1,sep=',')

# Scan seems to run faster than some of the import options.
# This is only the first 878 records so speed is not much of an issue.
# However the full day was over 300,000 records

hr0 <- as.data.frame(hr0,stringsAsFactors=FALSE) # create a data frame
                                     # the argument stops the conversion
                                     # of strings to factors

# Look at first and last few cases
# and at selected variables

nr <- nrow(hr0)
cases <- c(1:4,(nr-3):nr)
cols <- c(1:3,6,12:13)
hr0[cases,cols]  # first and last cases for selected columns

#    SBytes SPacket           SIP Broken        Start          End
#  1   1500       1 204.121.16.41      * 00:00:04.479 00:00:04.479
#  2     80       2  204.121.6.32      ^ 00:00:05.139 00:02:00.708
#  3   4571       6   204.121.6.6      ^ 00:00:05.419 00:00:05.289
#  4    288       6 134.167.3.110      * 00:00:05.497 00:02:46.755
#875    168       4 211.187.11.84      ^ 00:59:54.494 01:00:08.310
#876   3732      88 211.187.11.84      ^ 00:59:55.269 01:00:03.806
#877    404       7 204.121.16.75      ^ 00:59:56.407 00:59:57.796
#878    232       5 211.187.11.84      ^ 00:59:59.903 01:00:02.498

# Discussion
# SIP is the source IP address.
# For now we will convert this into two numbers, 0 to 65535, using pairs
# of octets
# Broken indicates that the session did not close with
# the proper acknowledgements between computers
# Start and End times are in hours, minutes and seconds
# We will convert these to hours as real numbers and
# compute the session duration

# ____________________Grab Source IP Octets__________________________________
#
# There are many ways to do this.  The following
# was my third try, much faster than the first two.
tmp <- hr0$SIP

# check the first four characters for a period
v1 <- ifelse(substring(tmp,1,1)=='.',1,5)
v2 <- ifelse(substring(tmp,2,2)=='.',2,5)
v3 <- ifelse(substring(tmp,3,3)=='.',3,5)
v4 <- ifelse(substring(tmp,4,4)=='.',4,5)

# find location of the first period
#
loc <- pmin(v1,v2,v3,v4)
sip1 <- as.numeric(substring(tmp,1,loc-1))

# drop the first octet and repeat for the second
tmp <- substring(tmp,loc+1,nchar(tmp))
v1 <- ifelse(substring(tmp,1,1)=='.',1,5)
v2 <- ifelse(substring(tmp,2,2)=='.',2,5)
v3 <- ifelse(substring(tmp,3,3)=='.',3,5)
v4 <- ifelse(substring(tmp,4,4)=='.',4,5)
loc <- pmin(v1,v2,v3,v4)
sip2 <- as.numeric(substring(tmp,1,loc-1))

# drop the second octet; the remainder holds the third and fourth
tmp <- substring(tmp,loc+1,nchar(tmp))
v1 <- ifelse(substring(tmp,1,1)=='.',1,5)
v2 <- ifelse(substring(tmp,2,2)=='.',2,5)
v3 <- ifelse(substring(tmp,3,3)=='.',3,5)
v4 <- ifelse(substring(tmp,4,4)=='.',4,5)
loc <- pmin(v1,v2,v3,v4)
sip3 <- as.numeric(substring(tmp,1,loc-1))
sip4 <- as.numeric(substring(tmp,loc+1,nchar(tmp)))

# check
sip4[cases]
# [1]  41  32   6 110  84  84  75  84
# Checks with above

# ____________________Grab Destination IP Octets__________________________________
#
# There are many ways to do this.  The following
# was my third try, much faster than the first two.
tmp <- hr0$DIP

# check the first four characters for a period
v1 <- ifelse(substring(tmp,1,1)=='.',1,5)
v2 <- ifelse(substring(tmp,2,2)=='.',2,5)
v3 <- ifelse(substring(tmp,3,3)=='.',3,5)
v4 <- ifelse(substring(tmp,4,4)=='.',4,5)
loc <- pmin(v1,v2,v3,v4)
dip1 <- as.numeric(substring(tmp,1,loc-1))

tmp <- substring(tmp,loc+1,nchar(tmp))
v1 <- ifelse(substring(tmp,1,1)=='.',1,5)
v2 <- ifelse(substring(tmp,2,2)=='.',2,5)
v3 <- ifelse(substring(tmp,3,3)=='.',3,5)
v4 <- ifelse(substring(tmp,4,4)=='.',4,5)
loc <- pmin(v1,v2,v3,v4)
dip2 <- as.numeric(substring(tmp,1,loc-1))

tmp <- substring(tmp,loc+1,nchar(tmp))
v1 <- ifelse(substring(tmp,1,1)=='.',1,5)
v2 <- ifelse(substring(tmp,2,2)=='.',2,5)
v3 <- ifelse(substring(tmp,3,3)=='.',3,5)
v4 <- ifelse(substring(tmp,4,4)=='.',4,5)
loc <- pmin(v1,v2,v3,v4)
dip3 <- as.numeric(substring(tmp,1,loc-1))
dip4 <- as.numeric(substring(tmp,loc+1,nchar(tmp)))

# check
cbind(dip4[cases],hr0$DIP[cases])
#     [,1]  [,2]
#[1,] "2"   "203.199.33.2"
#[2,] "12"  "137.94.215.12"
#[3,] "152" "63.240.211.152"
#[4,] "1"   "204.121.3.1"
#[5,] "32"  "204.121.6.32"
#[6,] "32"  "204.121.6.32"
#[7,] "194" "206.246.121.194"
#[8,] "32"  "204.121.6.32"
# dip4 is converted to character strings in cbind
# values check.

#______________Look at broken connections_______________

# "*" are broken connections
table(hr0$Broken)
#   *   ^
# 363 515

363/nr
# 41 percent are broken during hour 0.
# About 90 percent of the connections are broken
# during the whole day.  Things get worse.

# Convert to logical
tmp <- hr0$Broken=='*'
tmp
hr0$Broken <- tmp

# The first two octets of the green network are
# 204.121.
green.source <- sip1==204 & sip2==121
table(hr0$Broken,green.source)
#       FALSE TRUE
#FALSE    350  165
# TRUE    322   41

# Row is the broken status
# Column is the source status
# Broken and Green is 41.  A smaller percentage was broken
# when the green network was the source.
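
# The comparison of broken percentages can be sketched from the four
# counts printed in the table above. This is only an illustration:
# the names counts and pct.broken are not part of the original script,
# and the counts are typed in by hand rather than taken from hr0.

```r
# Counts copied from the printed table (rows: broken, columns: green.source)
counts <- matrix(c(350,322,165,41), nrow=2,
                 dimnames=list(broken=c('FALSE','TRUE'),
                               green.source=c('FALSE','TRUE')))

# Column proportions: fraction broken within each source group
pct.broken <- prop.table(counts, margin=2)
round(100*pct.broken['TRUE',], 1)
# FALSE: 47.9   TRUE: 19.9
```

# With the data loaded, prop.table(table(hr0$Broken,green.source),2)
# should give the same proportions directly.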
# Convert Start to fractional hours
# Calculate session duration in fractional minutes

b <- as.character(hr0$Start)
e <- as.character(hr0$End)

nb <- nchar(b)
tb <- 60*as.numeric(substring(b,1,2))+    # time begin (minutes)
         as.numeric(substring(b,4,5))+
         as.numeric(substring(b,7,nb))/60

ne <- nchar(e)
te <- 60*as.numeric(substring(e,1,2))+    # time end (minutes)
         as.numeric(substring(e,4,5))+
         as.numeric(substring(e,7,ne))/60

dur <- te-tb
bad <- dur < 0     # recorded end time precedes the start time
sum(bad)
# [1] 7

start.hour <- tb/60

# Green Network IP Address 204.121.....
slocal <- sip1==204 & sip2==121
sum(slocal)
# 206

# Note that all of the Green Network web traffic (dport 80)
# was supposed to be removed

dlocal <- ifelse(dip1==204 & dip2==121,1,0)
sum(dlocal)
# [1] 672

sip12 <- 256*sip1+sip2
sip34 <- 256*sip3+sip4
dip12 <- 256*dip1+dip2
dip34 <- 256*dip3+dip4

broken <- ifelse(hr0$Broken,1,0)

# ________________Write out data set for Crystal Vision______________

# Variables:
#   Time, Duration, Sip12,Sip34,Dip12,Dip34,
#   sPort,dPort, sPack,sByte,dPack,dByte,
#   broken,dlocal

# Put the data (numeric) in a data frame
dat <- cbind(start.hour,dur,sip12,sip34,dip12,dip34,
             hr0[,c(4,8,2,1,10,11)],broken,dlocal)
names(dat) <- c('TimeHr', 'DurMin', 'Sip12', 'Sip34', 'Dip12', 'Dip34',
                'sPort', 'dPort', 'sPac', 'sByte', 'dPac', 'dByte',
                'Broken', 'dLocal')

# Strings for the Crystal Vision file header
head <- c(paste('variables:',ncol(dat),sep=''),
          'labels:',
          names(dat))

write(head,file='LanlGreenHr0.txt',ncol=1)  # one string per line

# append the data with one row per line (white space delimiters)
write(t(round(dat,6)),file='LanlGreenHr0.txt',ncol=ncol(dat),append=T)
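
# One way to check the output file is to read it back into R.
# The sketch below writes a small made-up data frame in the same layout
# as the script above (a 'variables:' line, a 'labels:' line, one label
# per line, then the data) and reads it back. The names toy, out, hdr
# and back are illustrative, and the values are invented for the example.

```r
# Toy data standing in for dat (values are made up)
toy <- data.frame(TimeHr=c(0.001244,0.001428), sPort=c(1025,32768))
out <- tempfile(fileext='.txt')

# Write the header and data in the same layout as the script
hdr <- c(paste('variables:',ncol(toy),sep=''), 'labels:', names(toy))
write(hdr, file=out, ncolumns=1)
write(t(as.matrix(toy)), file=out, ncolumns=ncol(toy), append=TRUE)

# Read it back: the header occupies 2 + (number of variables) lines
lines <- readLines(out)
nvar <- as.numeric(sub('variables:','',lines[1]))
labels <- lines[3:(2+nvar)]
back <- read.table(out, skip=2+nvar, col.names=labels)
back
```

# Replacing out with 'LanlGreenHr0.txt' and skipping the two write calls
# applies the same check to the real file.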