Visual Clustering and Classification:
The Oronsay Particle Size Data Set Revisited
- Journal Version -
By
Adalbert F. X. Wilhelm
Edward J. Wegman
and
Jürgen Symanzik
A (Unix) compressed PostScript version (2.2MB, expands to 53.3MB) and a
PDF version (523KB) of the text are available.
All the figures from the text are available as GIF files below.
Legends for Figures
-
Figure 1:
An excerpt from a map of the Oronsay island, Inner Hebrides.
It shows the locations
of the two archeological sites labeled Caisteal nan Gillean I
and Cnoc Coig and the four transects TG, CC, ECC, and TUS
where ``modern'' sand samples have been collected.
Reprinted with permission from Fieller et al. (1987).
-
Figure 2(a) and
Figure 2(b):
(a) Scatterplot of summed weights (vertical) versus
group (horizontal). While most weights scatter around 60g,
values around 70g have been observed for most samples from groups
1 and 22. Group 5 has the smallest sum (51.8g) marked with a
``filled circle'' but also the largest sum (82.4g) marked with a
``filled box''.
(b) A parallel coordinate plot of group 5 reveals that the ``filled circle''
sample has an unusually small value for variable 9, i.e., particle size
[0.125, 0.18) mm, but it also has higher values for
variables 1 and 7. The ``filled box'' sample has a shape that matches the
shape of the remaining samples of group 5.
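The check behind panel (a) can be sketched as follows: sum the sieve weights of each sample and locate the smallest and largest totals within a group. This is only a minimal sketch. The weights are made-up placeholders, not Oronsay measurements, and the function name `total_weights` is hypothetical.

```python
def total_weights(samples):
    """Return the summed sieve weight (in grams) of each sample."""
    return [sum(weights) for weights in samples]

# Hypothetical placeholder data: three samples, four sieve fractions each (grams).
group = [
    [1.0, 20.0, 30.0, 9.0],
    [0.5, 18.0, 28.0, 5.0],
    [2.0, 25.0, 40.0, 15.0],
]

totals = total_weights(group)
lightest = totals.index(min(totals))  # index of the sample with the smallest sum
heaviest = totals.index(max(totals))  # index of the sample with the largest sum
```

Samples flagged this way would then be inspected in a parallel coordinate plot, as in panel (b), to see which size classes account for the unusual total.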
-
Figure 3(a) and
Figure 3(b):
(a) A dotplot of group for the 149 modern samples from known sampling
locations. Within each of the groups 6 to 20 two training samples have
been selected. The ``+'' symbols for all observations in groups 18 to 20
have been obtained through linked brushing in (b) a projection
of the grand tour. Here, two clusters are clearly distinguishable
that separate Caisteal nan Gillean samples (the ``+'' samples) from Cnoc Coig samples.
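A single frame of a grand tour is an orthogonal projection of the data onto a two-dimensional plane. The sketch below draws one random plane by Gram-Schmidt orthonormalization and projects a point onto it. It is an illustration under stated assumptions, not the software used for the figures: it does not interpolate between planes as a real grand tour does, and the dimension, seed, and sample point are placeholders.

```python
import math
import random

def random_plane(dim, rng):
    """Return two orthonormal vectors spanning a random 2-D plane in dim-space."""
    u = [rng.gauss(0, 1) for _ in range(dim)]
    v = [rng.gauss(0, 1) for _ in range(dim)]
    norm_u = math.sqrt(sum(x * x for x in u))
    u = [x / norm_u for x in u]
    # Gram-Schmidt step: remove the component of v that lies along u.
    dot_uv = sum(a * b for a, b in zip(u, v))
    v = [b - dot_uv * a for a, b in zip(u, v)]
    norm_v = math.sqrt(sum(x * x for x in v))
    v = [x / norm_v for x in v]
    return u, v

def project(point, plane):
    """Project one point onto the plane, giving its 2-D screen coordinates."""
    u, v = plane
    return (sum(p * a for p, a in zip(point, u)),
            sum(p * b for p, b in zip(point, v)))

rng = random.Random(0)        # fixed seed so the sketch is repeatable
plane = random_plane(12, rng)  # 12 is a placeholder dimension
sample = [1.0] * 12            # placeholder sample, not real data
xy = project(sample, plane)
```

Linked brushing then amounts to marking points in this 2-D view and carrying the marks over to the other displays, such as the dotplot in (a).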
-
Figure 4(a),
Figure 4(b),
Figure 4(c), and
Figure 4(d):
(a) When considering only the Cnoc Coig samples (groups 6 to 17),
a dotplot reveals that the cluster brushed with a ``+'' symbol
in (b) a projection of the grand tour contains mid beach (group 15)
and upper beach (group 17) samples. (c) A dotplot of the final
clustering of the Cnoc Coig samples shows the correct classification
of dune samples (groups 8, 12 to 14) and the
misclassification of two entire groups of upper beach samples
(groups 7 and 11) and 1 mid beach sample (from group 10)
based on (d) a projection of the grand tour where a
homogeneous group of points has been marked with an ``×''
symbol, assuming these are all dune samples.
This projection also shows how the ``+'' cluster
brushed in (b) separates into two subclusters.
-
Figure 5(a) and
Figure 5(b):
(a) When considering only groups 6 to 17 for the reevaluation
of Figure 4c in Fieller et al. (1984) a dotplot shows the symbols used
to mark the 12 known groups at the Cnoc Coig site. (b) A projection
showing a local optimum based on the projection-pursuit-guided
grand tour and a manually drawn dividing line shows a separation
between beach samples (left of the line) and dune samples
(right of the line). However, upper beach locations from the
CC transect (group 7) and from the TG transect (group 11) also
fall right of the line.
-
Figure 6(a) and
Figure 6(b):
Symbols used to mark the 5 groups from
the Caisteal nan Gillean site for the reevaluation of Figure 4d in
Fieller et al. (1984)
are a small ``+'' (group 18),
a large ``+'' (group 19),
a ``×'' (group 20),
a ``filled circle'' (group 5),
and an ``empty box'' (group 21).
Projection (a) shows a circular arrangement
and projection (b) shows a linear arrangement, each obtained
as a local optimum based on the projection-pursuit-guided
grand tour. These and many similar projections
show more differences than similarities between
archaeological samples (groups 5 and 21) and modern samples (groups
18 to 20).
-
Figure 7(a) and
Figure 7(b):
(a) A dotplot shows the symbols used for all 226 samples.
(b) This projection separates sites
(the big ``+'' and ``×''
symbols are Cnoc Coig samples, the small ``+'' and ``×'' symbols
are Caisteal nan Gillean samples) and sands within sites (the ``+'' symbols are
beach samples and the ``×'' symbols are ``dune-like'' samples).
We see that archaeological Caisteal nan Gillean samples fall close to modern
Caisteal nan Gillean samples (beach and dune). Archaeological Cnoc Coig samples
are clearly distinguishable from modern Cnoc Coig beach.
Sands above CC Midden (group 1) and Sands below CC Midden (groups 2 and 3)
are close to modern Cnoc Coig dunes.
CC Shell Midden (group 22) and CC Soil Pit (group 4)
have some overlap with the other
archaeological Cnoc Coig samples but they have very little in common
with modern Cnoc Coig dunes.
-
Figure 8:
Original parallel coordinate plot of all 149 modern samples from
Cnoc Coig and Caisteal nan Gillean.
The Cnoc Coig data are in black, the Caisteal nan Gillean data in gray (red). Data from the two
sources strongly separate with the Cnoc Coig sand generally being much finer than
the sand from the Caisteal nan Gillean site.
-
Figure 9:
Parallel coordinate plot of Cnoc Coig data after completing the
BRUSH-TOUR strategy. The data are divided into six clusters with red and
magenta being ``dune-like'' sand. In the present image, all points are
rendered in grayscale. The reader is referred to the webpage for full
color illustrations.
-
Figure 10:
Sequence of decompositions for Cnoc Coig sand data. Red and magenta
are basically the ``dune-like'' sands, other colors represent the
``beach-like'' sands. The strongest splits tend to occur early in the
BRUSH-TOUR strategy. Thus, the ``dune-beach'' split was the most evident.
-
Figure 11:
Parallel coordinate display of Cnoc Coig known (training) data and Cnoc Coig
groups 2 and 3 (sands below CC Midden) after partial grand tour. The group
2 and 3 data are given
in black. The group 2 and 3 data generally follow the red-magenta
``dune-like'' sand data. However, the group 2 and 3 data depart
markedly in certain dimensions, notably along the .50-.71 mm axis in
this illustration.
-
Figure 12:
Simplified parallel coordinate display of all Cnoc Coig data after
partial grand tour. ``Dune-like'' sands are shown in medium gray (red),
``beach-like'' sands
are shown in light gray (green), and ``unknown'' sands are shown in black. The
unknown class
is distinct from both ``dune-like'' sand and ``beach-like'' sand. This is
particularly clear in the .25-.355 and the .50-.71 axes.
-
Figure 13:
Simplified scatterplot matrix of all Cnoc Coig data after partial
grand tour. The coloring is as in Figure 12. A density plot
of the circled scatterplot is shown in the upper right. In the density plot, the
tallest bump/mode corresponds to the red ``dune-like'' sand, the two smaller
bumps/modes to the left-center correspond to the green ``beach-like''
sand, and the smaller bumps/modes on the right correspond to the black
``unknown'' sand. The ``unknown'' sand is most like the ``dune-like''
sand, but still rather distinct.
-
Figure 14:
Dotplots of all variables for all 149 modern samples
from known sampling locations.
Bright colors represent many points and dark colors only a few
points. Measurements on the extreme particle sizes >2.0mm, 1.4 - 2.0mm,
1.0 - 1.4mm, .71 - 1.0mm, .063 - .09mm, and <.06mm are highly
quantized.
Large gaps in the data are apparent for variables .355 - .50mm and .25 -
.355mm.
-
Figure 15:
Boxplots for all 149 modern (training) samples.
Variables >2.0mm, 1.4 - 2.0mm,
1.0 - 1.4mm, ..., .063 - .09mm, and <.06mm are displayed from
left to right. Only two variables, .18 - .25mm and
.09 - .125mm, show no outliers. These two distributions are also only
slightly skewed. Data on the other variables are skewed to the right,
except for particle size .125 - .18mm, which is skewed to the left.
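How a boxplot flags outliers and hints at skewness can be sketched as below. The sketch assumes the common 1.5 x IQR fence rule and compares mean with median as a crude skew indicator. Plotting software differs in its quartile definition, so this is not necessarily the rule behind the figure, and the values are placeholders rather than Oronsay data.

```python
import statistics

def boxplot_summary(values):
    """Return (outliers, skew) using 1.5 * IQR fences and a mean-vs-median check."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    outliers = [v for v in values if v < low_fence or v > high_fence]
    # Crude indicator: a mean above the median suggests a longer right tail.
    mean = statistics.fmean(values)
    if mean > median:
        skew = "right"
    elif mean < median:
        skew = "left"
    else:
        skew = "none"
    return outliers, skew

# Hypothetical weights (grams) for one size class.
weights = [1, 2, 2, 3, 3, 3, 4, 4, 5, 20]
outliers, skew = boxplot_summary(weights)
```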
-
Figure 16:
Clear classification between sites Cnoc Coig (top) and Caisteal nan Gillean (bottom) when
selecting
clusters of variable `[0.25, 0.355) mm'. Also the
variables `[0.355, 0.5) mm', `[0.125, 0.18) mm', and
`[0.09, 0.125) mm' allow
a clear separation between Cnoc Coig and Caisteal nan Gillean.
-
Figure 17:
We apply the classification rule derived from the training data to the entire
data set of 226 samples. (a) Selecting the right-hand cluster in variable
.25 - .355mm highlights all training samples at Caisteal nan Gillean, one sample of
group 5, and all but
one sample of group 21. (b) However, 16 points
fall between the previously established clusters. Those points are
classified by using the classification rule based on the training data
for variable .355 - .50mm. (c) The resulting classification
is correct for groups 18 to 21, but misses two samples in group 5 and
misclassifies five samples in group 4.
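The two-stage rule described in (a) to (c) can be sketched as a cascade of single-variable cuts: a sample falling in a cluster on the first variable is labeled at once, and a sample falling in the gap is passed to the next variable. The cut points below are hypothetical, not the values read off the training dotplots. That larger weights on .355 - .50mm also indicate Caisteal nan Gillean is an assumption made here by analogy with the right-hand cluster in .25 - .355mm.

```python
CNOC_COIG = "Cnoc Coig"
CAISTEAL = "Caisteal nan Gillean"

def classify_site(sample, rules):
    """Classify one sample by site using an ordered list of single-variable rules.

    Each rule is (variable, low_cut, high_cut): values at or below low_cut fall in
    the left cluster, values at or above high_cut in the right cluster, and values
    in between are handed on to the next rule.
    """
    for variable, low_cut, high_cut in rules:
        value = sample[variable]
        if value <= low_cut:
            return CNOC_COIG
        if value >= high_cut:
            return CAISTEAL
    return "unclassified"

# Hypothetical cut points in grams (assumptions, not taken from the paper).
rules = [
    ("0.25-0.355", 10.0, 20.0),
    ("0.355-0.50", 4.0, 8.0),
]

samples = [
    {"0.25-0.355": 5.0, "0.355-0.50": 2.0},   # left cluster on the first variable
    {"0.25-0.355": 25.0, "0.355-0.50": 9.0},  # right cluster on the first variable
    {"0.25-0.355": 15.0, "0.355-0.50": 9.0},  # in the gap, decided by the fallback
]
labels = [classify_site(s, rules) for s in samples]
```

Keeping the rules in an ordered list makes it easy to append further variables if some samples still fall between clusters.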
-
Figure 18:
The dune samples at Cnoc Coig split into two groups for variables
.18 - .25mm and .125 - .18mm. Group 12 (marked red, i.e., bigger
light dots) is
different from groups 8, 13, and 14 (all marked blue, i.e., bigger
dark dots). From the
position of the two groups it can be concluded that dune sands of
group 12 are much finer, since these samples have greater weights from
the finer sieves and smaller weights from the coarser sieves.
-
Figure 19:
The distribution for particle size `[0.09, 0.125)mm' seems to be a
mixture of four individual distributions: one for the Caisteal nan Gillean
samples, one for groups 7, 8, 11, 13, and 14 (in a), one for groups
6, 10, 12, and 17 (in b), and one for groups 9, 15, and 16 (in c).
-
Figure 20:
Individual separation of group 12 by sequentially
selecting subclusters (the dashed areas) in the respective highlighting of
particle sizes
`[0.09, 0.125)mm', `[0.25, 0.355)mm', and `[0.355, 0.50)mm'. No further
clustering can be found in the dotplots of the other variables
.125 - .18mm and .18 - .25mm (right under the bar chart for group).
Last Update January 25, 1999