Computation of the kth Nearest Neighbor Estimate of Entropy of Molecules Using Parallel Processing
E. James Harner, (West Virginia University), jharner@stat.wvu.edu,
Jun Tan, (West Virginia University), jtan@stat.wvu.edu,
Shengqiao Li, (West Virginia University and NIOSH), shli@stat.wvu.edu, and
Harshinder Singh, (West Virginia University and NIOSH), hsingh@stat.wvu.edu
Abstract
Entropy is a statistical measure of the random fluctuations in molecules and its estimation is important for investigating the stability of molecular conformations, for modeling the binding of ligands to proteins, and for studying issues relating to drug designs. Singh et al. (American Journal of Mathematical and Management Sciences, 2003, 23, 301-321) introduced a nonparametric approach for estimating entropy using the kth nearest neighbor distances between sample points which extends the first nearest neighbor approach of Kozachenko and Leonenko (Problems of Information Transmission, 1987, 23, 95-101). Entropy of a molecule depends on random fluctuations in the internal coordinates. The high dimension of the internal coordinates and a large number of observations on the coordinates cause computational challenges for computing the kth nearest neighbor distances required for obtaining the kth nearest neighbor estimate of entropy of a large molecule. We are experimenting with computing the kth nearest neighbor distances for the sample points on a high-performance computer having a large number of parallel processors (nodes). On each processor, we use the ANN method (Arya et al. (1994), Proceedings of the Fifth Annual ACM-SIAM Symposium on Discrete Algorithms, 573-582) for computing kth nearest neighbor distances. ANN builds a tree structure and searches for nearest neighbors on it. In order to search for nearest neighbors concurrently using high-performance computing with many processors, we have developed a program to duplicate the search trees on multiple processors (slave nodes) and each node computes the kth nearest neighbor distances for part of the data. When the dimension of the coordinates and the number of observations becomes large, the gain in performance becomes obvious. Using the ANN method, the whole data set needs to be loaded into the memory to build the tree. Therefore, the maximum size of the data set t! hat ANN can handle is limited by the size of the available memory. We are also developing a program that will break the data set into parts and build tree structures on the slave nodes. To find the kth nearest neighbor of a point, the program will search all the trees concurrently and then merge all the results to get the kth nearest neighbor for the data point. The proposed program will enhance the capability of handling extremely large data sets, such as those obtained using molecular dynamics simulation, and it will improve the efficiency of computing the kth nearest neighbor distances for the sample points.