Georg F. Weiller & Adrian Gibbs (1995), CABIOS 9-4
DIPLOMO: The tool for a new type of evolutionary analysis
Abstract
A package of computer programs called DIPLOMO (DIstance PLOt MOnitor)
has been developed for making pairwise comparisons of different
estimates of the distances between a set of taxa by plotting them
against each other in a simple scatter plot. Taxa with similar relative
distance characteristics are thereby grouped graphically. Groupings
of different taxa may be directly identified, and the distance characteristics
of chosen groups visualised and compared using devices to give them
different colours or symbols. The program is particularly useful
for detecting and analysing subtle trends in gene sequence evolution.
This is done by comparing different components of change, for example
synonymous versus non-synonymous nucleotide changes, transversions
versus transitions and changes in different genes of the same set
of taxa, etc. The program has a wide range of other uses, for example
comparing different methods of sequence analysis, assessing which
components of genetic change correlate best with phenotypic change
or with geographical separation. This paper describes the DIPLOMO
package, and illustrates typical DIPLOMO analyses using lentivirus
gene sequence data.
Introduction
The evolutionary relationships
between taxa are most often represented as phylogenetic trees,
and many different algorithms for tree construction have been
developed (Swofford and Olsen 1990, Weiller et al. in press).
One severe limitation of tree diagrams is that each tree represents
only one particular comparison metric. This might, for instance,
be a particular distance measure such as the percentage of different
nucleotides, a particular scoring scheme for character changes
such as in parsimony programs, or a particular algorithm that
estimates the maximum likelihood of a given tree topology. However,
a set of real data often includes a variety of different and sometimes
conflicting evolutionary signals. These could, of course, be represented
in the form of individual trees, each appropriately focussed on
a particular signal. Still, the resulting tree diagrams would
not be easy to compare, particularly if they had different topologies.
Multivariate analysis methods, like multi-dimensional scaling
or principal coordinate analysis (Everitt and Dunn 1992), are
better suited to detect and separate different evolutionary signals
in data, but it is usually difficult or even impossible to interpret
the exact nature of the signals detected. In addition, some of
these methods require Euclidean distances (Higgins 1992) and this
limits the choice of metrics that can be used for ordination.
Here we report an alternative, which is to calculate a variety
of different distance matrices from a set of taxa using different
distance measures. The values of the resulting distance matrices
can then be plotted against each other, a pair at a time, in a
simple scatter plot (Figure 1). The resulting
graph shows how well the two distance measures correlate and whether
the comparisons of all taxa in the data set correlate similarly
or fall into distinct subsets. In the following we refer to this
type of scatter plot as a 'distance-plot'.
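As a minimal sketch of how the points of a distance-plot can be assembled, the following Python fragment pairs the corresponding entries of two symmetric distance matrices, one point per unordered pair of taxa. The function name and the toy matrices are illustrative assumptions and are not part of DIPLOMO; n taxa yield n(n-1)/2 points, so the 40 taxa used later give 780 comparisons.

```python
from itertools import combinations

def distance_plot_points(taxa, matrix_x, matrix_y):
    """Return one (taxon_a, taxon_b, x, y) tuple per unordered pair of taxa.

    matrix_x and matrix_y are square, symmetric matrices (lists of lists)
    whose rows and columns follow the order of `taxa`.
    """
    points = []
    for i, j in combinations(range(len(taxa)), 2):
        points.append((taxa[i], taxa[j], matrix_x[i][j], matrix_y[i][j]))
    return points

# Toy data (hypothetical values, for illustration only).
taxa = ["A", "B", "C"]
m_x = [[0.0, 0.10, 0.30],
       [0.10, 0.0, 0.25],
       [0.30, 0.25, 0.0]]
m_y = [[0.0, 0.05, 0.20],
       [0.05, 0.0, 0.18],
       [0.20, 0.18, 0.0]]

points = distance_plot_points(taxa, m_x, m_y)
for a, b, x, y in points:
    print(f"{a}:{b}  x={x:.2f}  y={y:.2f}")
```

Keeping the taxon names with each point is what later allows an individual data-point to be traced back to the comparison that produced it.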
Distance-plots
The principle of distance-plots is simple yet very versatile. Depending
on the distance data compared, a distance-plot reveals how various methods of
phylogenetic inference correlate and provides insight into evolutionary
processes and trends. For instance, if the distance matrices 1 and
2 in Figure 1 were calculated from the same
data but using different algorithms for distance estimation, the
resulting distance-plot would provide information on the relative
characteristics of the algorithms, and whether the algorithms perform
similarly on all taxa. Uses for this type of distance-plot include
comparisons of distances estimated using different scoring matrices
or methods of multiple hit correction and distances estimated from
RFLP or RAPD banding patterns using different algorithms. Of particular
interest is to compare a distance matrix that has been used to reconstruct
a phylogenetic tree with a distance matrix obtained from the resulting
tree (patristic distance), as this indicates how well the tree represents
the original distance data.
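The patristic comparison can be sketched as follows, assuming the tree is available as a list of branches with lengths. The patristic distance between two taxa is the sum of the branch lengths along the path joining them; the tree, the node names n1 and n2, and the function name are hypothetical.

```python
from itertools import combinations

def patristic_distances(edges, leaves):
    """Sum branch lengths along the unique path between every pair of leaves.

    `edges` is a list of (node, node, branch_length) tuples describing a tree.
    """
    adjacency = {}
    for u, v, length in edges:
        adjacency.setdefault(u, []).append((v, length))
        adjacency.setdefault(v, []).append((u, length))

    def path_lengths_from(start):
        # Depth-first walk; in a tree every node is reached by exactly one path.
        dist = {start: 0.0}
        stack = [start]
        while stack:
            node = stack.pop()
            for neighbour, length in adjacency[node]:
                if neighbour not in dist:
                    dist[neighbour] = dist[node] + length
                    stack.append(neighbour)
        return dist

    from_leaf = {leaf: path_lengths_from(leaf) for leaf in leaves}
    return {(a, b): from_leaf[a][b] for a, b in combinations(leaves, 2)}

# Hypothetical four-taxon tree with internal nodes n1 and n2.
edges = [("A", "n1", 0.125), ("B", "n1", 0.25), ("n1", "n2", 0.5),
         ("C", "n2", 0.125), ("D", "n2", 0.25)]
patristic = patristic_distances(edges, ["A", "B", "C", "D"])
```

Plotting each patristic value against the corresponding entry of the matrix from which the tree was built gives the distance-plot described above; points far from the diagonal mark comparisons that the tree represents poorly.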
A second type of comparison involves distances estimated from different
data samples of the same taxa, such as the sequences of two different
genes from the same set of organisms. The resulting distance-plot
will show whether the differences in one gene correlate with differences
in the other. Other comparisons that can be made easily are between
morphological, anatomical, physiological, temporal and geographical
distances.
A third type of comparison involves dissecting distance estimates
into their component parts. For instance, one can divide the nucleotide
substitutions between a set of genes into silent and non-silent
substitutions and obtain slope lines that give an indication of
the relative selection pressures for such changes in the various
taxa. Similar clues are obtained from comparisons of mutations in
different codon positions. Other methods of dissecting the evolutionary
process include comparisons of transitions and transversions, individual
substitutions like A-G versus C-T, or the changes in different parts
of a set of aligned sequences, like the variable versus conserved
regions of rRNA genes or exposed versus buried amino acids of a
globular protein.
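One such dissection, transitions versus transversions, can be sketched in a few lines. This is an illustrative calculation of uncorrected proportions for two aligned nucleotide sequences, not the algorithm of any particular program; skipping positions with gaps or ambiguity codes is an assumption made here for simplicity.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def transition_transversion_proportions(seq_a, seq_b):
    """Return (transitions, transversions) as proportions of compared sites."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = transitions = transversions = 0
    for x, y in zip(seq_a.upper(), seq_b.upper()):
        if x not in "ACGT" or y not in "ACGT":
            continue  # assumption: ignore gaps and ambiguity codes
        compared += 1
        if x == y:
            continue
        if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
            transitions += 1      # purine<->purine or pyrimidine<->pyrimidine
        else:
            transversions += 1    # purine<->pyrimidine
    if compared == 0:
        raise ValueError("no comparable sites")
    return transitions / compared, transversions / compared

# Toy alignment: two transitions (A-G, G-A) and one transversion (A-T) in 8 sites.
ts, tv = transition_transversion_proportions("ACGTACGT", "GCGTTCAT")
```

Computing both proportions for every pair of taxa gives two matrices that can be plotted against each other as a distance-plot.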
Despite their simplicity, the application of distance-plots has
various practical limitations. First, it is initially not always
clear which distance measures should be compared in order to best
separate signals in the data set. In practice several rounds of
trial and error are needed to explore the data. Second, if the data-points
representing some comparisons are found to be atypical of the total
data set, it can be very complicated and time-consuming to establish
which comparison has given rise to an individual data-point in the
plot, and even more so when a group of points is atypical. Nevertheless,
individual points or groups of points must be identified, if one
is to associate distinct characteristics with distinct subgroups
or phyla in the data set. It is particularly important for this
association to be established quickly and easily if many plots are
to be analysed during data exploration, as each result is only setting
the stage for the next question. Third, we might wish to correlate
more than two distance measures. Although multi-dimensional plots
are conceptually easy, they cannot be visualised satisfactorily.
This can be partially circumvented by switching between several
two-dimensional plots, and by using distinctive colours and symbols
to consistently represent the comparisons of particular taxa, so
that their location in the various plots is readily seen.
The DIPLOMO program (an acronym for DIstance PLOt MOnitor) has been
designed to make it easy to create, analyse and compare two-dimensional
distance-plots. Special consideration has been given to providing
an appealing, intuitive and responsive user interface, allowing
the investigator to quickly verify the various conclusions developed
during the exploration of his/her data.
The DIPLOMO program
DIPLOMO produces a scatter plot from the values of two different
distance matrices in its 'plot-window'. The program does not produce
the matrices, as this would severely limit the scope of possible
analyses, hence a 'DIP-file' containing all matrices must be provided
(see below). The 'view' options allow the scaling or magnification
of particular plot regions. The 'label' options enable the user
to independently specify the symbols (letters or graphic signs)
and/or colours that are used to represent comparisons between certain
taxa or groups of taxa. When the user selects particular data-points
by 'dragging' a selection frame over an area in the plot-window
using the mouse or keyboard, the corresponding taxa are immediately
identified in the 'monitor-window'. A variety of identification
modes can be chosen to accommodate single data-points as well as
groups. The label option can also be used to mark the selected data-points,
and as all labels are retained when different distance matrices
are plotted, it is easy to follow the properties of individual taxa
when compared by a variety of different distance measures. The entire
state of the plot window can be saved in a file, so a series
of previously labelled and saved plots can be instantly redisplayed,
and interrupted DIPLOMO sessions resumed.
One option of DIPLOMO provides an elementary statistical analysis
including correlation and linear regression of all comparisons that
have been selected or given a specific label. These data can also
be exported to a generic file for further analysis using, for instance,
spreadsheet or statistical analysis programs. In addition, regression
lines representing labelled data-points can be superimposed in the
plot window. The analysis can be temporarily restricted to particular
taxa using the 'hide-OTU' option, and the corresponding pruned distance
matrices can be exported for further phylogenetic analysis using, for example,
tree building or ordination programs.
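The kind of elementary statistics described here can be reproduced outside the program from exported data. The sketch below computes an ordinary least-squares regression line and the Pearson correlation coefficient for the points carrying one label; the data layout (label, x, y) and all names are assumptions for illustration and do not describe DIPLOMO's export format.

```python
from math import sqrt

def regression_summary(points):
    """Return (slope, intercept, r) for a list of (x, y) pairs."""
    n = len(points)
    if n < 2:
        raise ValueError("at least two points are required")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must each vary")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy)
    return slope, intercept, r

# Hypothetical labelled comparisons: (label, x, y).
labelled = [("A", 0.25, 0.125), ("A", 0.5, 0.25), ("A", 0.75, 0.375),
            ("1", 0.25, 0.30), ("1", 0.5, 0.55)]

# Restrict the analysis to the comparisons labelled 'A'.
selected = [(x, y) for label, x, y in labelled if label == "A"]
slope, intercept, r = regression_summary(selected)
```

Note that the points of a distance-plot are not statistically independent, because each taxon contributes to many comparisons, so significance tests that assume independent observations should be interpreted with caution.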
Although DIPLOMO has a comprehensive help-text file, all documentation
is also available from within the program and quickly accessible
using the help-index window. A context-sensitive help system provides
additional help on every current operation.
DIPLOMO files
The 'DIP-file' (file extension '.DIP') is the main input file to
DIPLOMO. This simple ASCII file contains the names of all taxa and
the distance matrices to be compared. All distance matrices in a
DIP-file contain the same number and order of taxa and only differ
in the actual distance values. Any relationship that can be represented
as a distance can be used to produce a matrix and included in a
DIP-file, which, as the format is very simple, will require little
editing.
We are currently developing 'DisCalc' (Weiller et al. in prep.),
a computer program that produces DIP-files by compiling distance
matrices from sequence alignments, and the 'RAPDistance' package
(Armstrong et al. in prep.), which produces DIP-files from RAPD and
RFLP banding patterns. 'DIG-files' (DIplomo Group) contain the group
definitions. These are symbolic taxon names which are used by DIPLOMO
for defining groups of taxa. Groups can be nested so that group
structures, like trees, can be examined. As the same group definitions
can be used by more than one DIP-file, they are stored separately.
DIG-files can be created by any ASCII editor or, more conveniently,
with the group-editor in DIPLOMO.
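Nested group definitions can be resolved recursively. The sketch below uses the convention from Table 1 that group names start with a '$' sign; holding the definitions in a Python dictionary is an assumption made for illustration, and the on-disk syntax of DIG-files is not reproduced here.

```python
def expand_group(name, groups, _path=frozenset()):
    """Return the set of taxa in a group, expanding nested '$' groups."""
    if name in _path:
        raise ValueError(f"group {name} is defined in terms of itself")
    taxa = set()
    for member in groups[name]:
        if member.startswith("$"):
            # A member that is itself a group contributes all of its taxa.
            taxa |= expand_group(member, groups, _path | {name})
        else:
            taxa.add(member)
    return taxa

# A subset of the group definitions listed in Table 1.
groups = {
    "$SIVmac": ["sivmm1a1", "sivmm32h", "sivmm142", "sivmm239",
                "sivmm251", "sivmne"],
    "$SIVsmm": ["sivsmmh4", "sivsmmpb"],
    "$SIVsm-rel": ["$SIVmac", "$SIVsmm", "sivstm"],
}

sm_related = expand_group("$SIVsm-rel", groups)
```

Because a group may contain other groups, the clades of a tree can be written as groups and then examined one level at a time.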
'DIV-files' (DIplomo View) store a snap-shot of a DIPLOMO session.
They are created and utilised by DIPLOMO to recall previous DIPLOMO
sessions. The supplied utility program DIV2PS uses DIV-files to
reproduce the DIPLOMO plot in PostScript format for publication-quality
prints in black and white or colour.
The file 'diplomo.cfg' contains the default program configuration
and can be customised using an ASCII editor.
Example applications
The DIPLOMO screen shown in Figure 2 gives
an example using the env gene sequences of 40 HIV/SIV lentiviruses.
The two distance measures used to produce the graph in this example
are the proportion of amino acid differences (Y-axis) and the proportion
of nucleotide differences in the third codon position (X-axis), respectively.
This type of comparison, similar to a comparison of non-silent and
silent substitutions, shows how strongly changes in the amino acid
sequence are selected against in the individual taxa. The taxa involved
are given in Table 1 along with the group
definitions and labelling strategy used. Immediately it can be seen
that the ratio of amino acid to nucleotide changes differs in the
various virus groups, and is greatest in HIV1 (red '1') and SIVmac
(light blue 'M') isolates, followed by HIV2 (dark blue '2') and
some SIVsm-related isolates (light blue '+'). It is smallest in
the SIVagm group (dark blue 'A') and in comparisons between different
groups.
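The two distance measures used in this example can be sketched for a single pair of sequences as follows. The fragment assumes in-frame, aligned coding sequences and the standard genetic code, and it computes uncorrected proportions; the function names and the toy sequences are illustrative, and codons containing gaps or ambiguity codes are simply skipped.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}

def codons(seq):
    """Split a sequence into complete codons, dropping a trailing partial one."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]

def third_position_and_amino_acid_differences(seq_a, seq_b):
    """Return (proportion of 3rd-position differences, proportion of amino acid differences)."""
    compared = third_diff = aa_diff = 0
    for ca, cb in zip(codons(seq_a), codons(seq_b)):
        if ca not in CODON_TABLE or cb not in CODON_TABLE:
            continue  # assumption: skip codons with gaps or ambiguity codes
        compared += 1
        third_diff += ca[2] != cb[2]
        aa_diff += CODON_TABLE[ca] != CODON_TABLE[cb]
    if compared == 0:
        raise ValueError("no comparable codons")
    return third_diff / compared, aa_diff / compared

# Toy example: Met-Ala-Lys-Gly versus Met-Ala-Arg-Gly.
x, y = third_position_and_amino_acid_differences("ATGGCTAAAGGT",
                                                 "ATGGCCAGAGGA")
```

In this toy pair two of the four codons differ at the third position without changing the amino acid, and one codon differs at the second position and changes the amino acid, giving the point x = 0.5, y = 0.25.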
There are two possible explanations for these differences. First,
it is possible that env proteins of some viruses were positively
selected for their ability to escape immune surveillance and thus
for changes in their amino acid sequence. An alternative explanation
is that an organism living in one environment for an evolutionarily
long time becomes fully adapted to that environment by selection,
and asymptotically approaches a minimal rate of evolutionary change.
If the organism's environment changes, such as when a virus moves
to a new host, its genes are freed from the stabilising selection,
and the organism may experience rapid evolution. Both possibilities,
or some amalgam of them, will be revealed by a suitable DIPLOMO
analysis; they are evidence of a 'punctuated equilibrium' (Gould
et al. 1987) or the 'incessant evolutionary dance' between parasites
and their hosts (Haldane 1949).
In terms of Gould's hypothesis, the DIPLOMO plot suggests that HIV1
is a recently arisen viral population as has been concluded by most
HIV investigators. DIPLOMO places HIV2 comparisons slightly to the
right of the HIV1 comparisons, suggesting that HIV2 is evolutionarily
slightly more stable in, and adapted to, the human population than
HIV1, and therefore probably older. By contrast there are the SIVagm
isolates, which are believed to have been present for at least several
thousands of years in their present host, as the four extant species
of African green monkeys harbour four distinctive subtypes of SIVagm
(Allan 1992). The SIVsm-related isolates (light blue) are even more
instructive as they fall into two groups. While most of the comparisons
(light blue '+') fall between HIV2 and SIVagm, comparisons within
the SIVmac group ('M') and the comparison between sivsmmpb and sivsmmh4
('S') fall on the diagonal typical for HIV1 isolates, suggesting
that these isolates, like HIV1, have recently moved to the host
from which they were isolated, and this agrees with the evidence;
SIVmac viruses probably arose recently from SIVsm virus infections
in primate centres in the USA, where macaques were housed with and
infected from sooty mangabeys, the natural host of SIVsm (Novembre
et al. 1992), and the sivsmmpb isolate arose when a mangabey virus
was used to infect a rhesus macaque (Dewhurst et al. 1990).
It is clear that the optimal adaptation between host and parasite
for long term survival of both involves a controlled pathogenicity
of the parasite, and indeed, the pathogenicity of the viruses in
this example correlates well with the diagonal representing them.
The SIVmac and sivsmmpb isolates are the most pathogenic lentiviruses,
killing their host within weeks of infection and having little likelihood
of long term survival. HIV1 is less destructive and HIV2 is even
less virulent, whereas the other virus isolates are not known to
kill their natural hosts.
The example above demonstrates how DIPLOMO can be used to seek clues
to molecular evolutionary trends such as 'punctuated equilibria'
and to define groups of isolates with similar genotype (3rd codon
position) to phenotype (amino acid) behaviour, which in this instance
seems to be related to their pathogenicity. An indication of the
influence of the host's immune system can be obtained by comparing
the distance-plot in Figure 2 with similar
distance-plots of the same viruses, using genes that are less exposed
to the immune system. Distance-plots using the same taxa and distance
measures but different genes (gag or pol) give similar results,
with the separation of the various groups less pronounced in gag
and smallest in pol (distance-plots not shown). These differences
are most clear when one compares the proportion of amino acid changes
of different genes with each other directly. Figure
3 gives a gag versus pol (a) and an env versus pol (b) distance-plot.
Both distance-plots group the data similarly and this grouping is
reminiscent of the previous example. The main difference is the
relative positions of HIV1/sivcpz comparisons (red 'C') which have
shifted from the SIVagm towards the HIV1 diagonal, and the comparisons
between isolates from different SIV-related groups (blue/light blue
'+') which have shifted from the SIVagm towards the HIV2 diagonal.
Thus genes of some isolates differ in their relative rates of change.
These results illustrate the sensitivity of the DIPLOMO analytical
method and its ability to detect subtle evolutionary trends. However,
as the purpose of this paper is to introduce the DIPLOMO program
rather than give a comprehensive analysis of lentivirus evolution,
we leave further interpretation of these distance-plots to a later
publication.
System and methods
DIPLOMO is an object-oriented, event-driven C++ program with a graphical
user interface. Nevertheless, the program is designed to be ported
to a variety of platforms with different operating systems. This
is possible by use of the Zinc™ Interface Library (Zinc
Software Incorporated, Pleasant Grove, Utah), which is now available
for MS-DOS text and graphic mode, Microsoft Windows, Windows-NT,
OS/2, Apple Macintosh, UNIX curses and X-Windows. The currently available
version of DIPLOMO was developed and compiled with Borland C++ 3.1
and can be used on microcomputers with MS-DOS in text and graphic
mode, and under MS-Windows.
The minimum requirement for this version is an IBM compatible personal
computer with 480K RAM. To analyse large data sets, 600K free RAM,
a fast microprocessor, hard disk, VGA compatible graphic card with
colour monitor, and mouse are recommended. The MS-DOS version of
DIPLOMO works with up to 180 taxa. The number of distance matrices
depends on the free disk space and is practically unlimited.
Program availability
For non-commercial use, DIPLOMO can be obtained free of charge either
on floppy disk, by sending an empty disk and a self-addressed return
envelope to GFW, or over the Internet via anonymous ftp or gopher from
life.anu.edu.au (IP 150.203.38.74). The DIPLOMO package is in the directory
'pub/software/diplomo' in the file dipxxx.exe (where xxx is the version
number). This file is a self-extracting archive and must be transferred
in binary mode.
References
- Allan, J. S. (1992) Viral evolution and AIDS. J NIH Res, 4,
51-54.
- Dewhurst, S., Embretson, J. E., Anderson, D. C., Mullins, J.
I. and Fultz, P. N. (1990) Sequence analysis and acute pathogenicity
of molecular cloned SIV-smmpbj14. Nature, 345, 636-640.
- Everitt, B. S. and Dunn, G. (1992) Applied multivariate data
analysis. Oxford University Press, New York.
- Gould, S. J., Gilinsky, N. L. and German, R. Z. (1987) Asymmetry
of lineages and the direction of evolutionary time. Science,
236, 1437-1441.
- Haldane, J. B. S. (1949) Disease and evolution. Ric Sci Suppl,
19, 68-76.
- Higgins, D. G. (1992) Sequence ordinations: A multivariate
analysis approach to analysing large sequence data sets. CABIOS,
8, 15-22.
- Novembre, F. J., Hirsch, V. M., McClure, H. M., Fultz, P. N.
and Johnson, P. R. (1992) SIV from stump-tailed macaques: Molecular
characterization of a highly transmissible primate lentivirus.
Virology, 186, 783-787.
- Swofford, D. L. and Olsen, G. J. (1990) Phylogeny reconstruction.
In Molecular systematics. Edited by D. M. Hillis and C. Moritz,
pp. 411-501, Sinauer Assoc., Sunderland, Mass.
- Weiller, G. F., McClure, M. A. and
Gibbs, A. J. (in press) Molecular phylogenetic analysis.
In Molecular basis of virus evolution. Edited by A. J. Gibbs,
C. H. Calisher and F. Garcia-Arenal, Cambridge University Press.

Figure 1. The distance matrices 1 and 2 contain different pairwise distance
estimates of the taxa A to Z. Each comparison of two taxa is plotted
at y and x coordinates representing the corresponding distance values
of matrix 1 and 2 respectively.

Figure 2. The plot window displays a distance-plot of the env gene of 40 HIV/SIV
viruses. The X and Y axes represent the proportion of 3rd codon
position and amino acid differences respectively. A linear regression
line is fitted to comparisons between SIVagm isolates. The monitor
window displays the taxon names of all comparisons represented in
red.

Figure 3. The proportions of different amino acids in env versus pol (a) and
gag versus pol (b) are compared. The individual taxa and their representation
are as in Fig. 2 and given in Table 1.
Table 1 - Taxa and their representation.
Taxa:
hiv2ben hiv2d194 hiv2d205 hiv2gh1 hiv2isyr hiv2nihz hiv2rod hiv2st
hivcam1 hivd31 hiveli hivhan hivhxb2r hivjrcsf hivjrfl hivlai hivmal
hivmn hivndk hivnl43 hivny5cg hivoyi hivrf hivsf2 hivyu2 hivz2z6
sivag3 sivag155 sivag677 sivagty sivcpz sivmm1a1 sivmm32h sivmm142
sivmm239 sivmm251 sivmne sivsmmh4 sivsmmpb sivstm
Groups:
$HIV1 - { hivcam1 hivd31 hiveli hivhan hivhxb2r hivjrcsf hivjrfl hivlai hivmal hivmn hivndk hivnl43 hivny5cg hivoyi hivrf hivsf2 hivyu2 hivz2z6 }
$HIV2 - { hiv2ben hiv2d194 hiv2gh1 hiv2isyr hiv2nihz hiv2rod hiv2st }
$SIVagm - { sivag3 sivag155 sivag677 sivagty }
$SIVmac - { sivmm1a1 sivmm32h sivmm142 sivmm239 sivmm251 sivmne }
$SIVsmm - { sivsmmh4 sivsmmpb }
$SIVsm-rel - { $SIVmac $SIVsmm sivstm }
$SIV-rel - { $HIV2 $SIVagm $SIVsm-rel hiv2d205 }
$HIV-rel - { $HIV1 sivcpz }
Labels:
$HIV-rel:$HIV-rel - red
$HIV-rel:$HIV1 - '1'
$HIV-rel:sivcpz - 'C'
$SIV-rel:$SIV-rel - dark blue
$HIV2:$HIV2 - '2'
$SIVagm:$SIVagm - 'A'
$SIVsm-rel:$SIVsm-rel - light blue
$SIVmac:$SIVmac - 'M'
$SIVsmm:$SIVsmm - 'S'
$HIV-rel:$SIV-rel - magenta
all other comparisons - '+' and black
The top section gives the
names of all viruses used in Figs. 2 and
3. These were grouped as shown in section
2. Note that all group names start with a '$' sign and groups
can contain other groups. The third section gives the colours
and symbols used to represent comparisons between the individual
taxa in Figs. 2 and 3.
The order of these assignments is significant, for example, all
comparisons between SIV-related viruses were first coloured dark
blue, but the colour for the SIVsm-related subset was subsequently
changed to light blue.
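The effect of assignment order can be illustrated with a small sketch in which each rule sets a colour, a symbol, or both, and later rules override earlier ones. The rule representation and the reduced taxon sets are assumptions for illustration; only the ordering behaviour is taken from the description above.

```python
def label_comparison(pair, rules, default=("black", "+")):
    """Apply rules in order; a later matching rule overrides an earlier one."""
    colour, symbol = default
    a, b = pair
    for side_1, side_2, new_colour, new_symbol in rules:
        if (a in side_1 and b in side_2) or (a in side_2 and b in side_1):
            colour = new_colour or colour   # None leaves the colour unchanged
            symbol = new_symbol or symbol   # None leaves the symbol unchanged
    return colour, symbol

# Reduced, hypothetical subsets of the groups in Table 1.
siv_agm = {"sivag3", "sivag155"}
siv_mac = {"sivmm239", "sivmm251"}
siv_sm_rel = siv_mac | {"sivstm"}
siv_rel = siv_agm | siv_sm_rel

# Same order as the corresponding assignments in the Labels section.
rules = [
    (siv_rel, siv_rel, "dark blue", None),
    (siv_agm, siv_agm, None, "A"),
    (siv_sm_rel, siv_sm_rel, "light blue", None),
    (siv_mac, siv_mac, None, "M"),
]

within_agm = label_comparison(("sivag3", "sivag155"), rules)
within_mac = label_comparison(("sivmm239", "sivmm251"), rules)
mac_vs_stm = label_comparison(("sivmm239", "sivstm"), rules)
agm_vs_stm = label_comparison(("sivag3", "sivstm"), rules)
```

With these rules the SIVmac comparisons end up light blue 'M' even though they were first coloured dark blue, whereas comparisons between SIVagm and SIVsm-related isolates keep the dark blue colour and the default '+' symbol.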