The Australian National University
Research School of Biological Sciences
    

Georg F. Weiller & Adrian Gibbs (1995), CABIOS 9-4

 

DIPLOMO: The tool for a new type of evolutionary analysis

Abstract

A package of computer programs called DIPLOMO (DIstance PLOt MOnitor) has been developed for making pairwise comparisons of different estimates of the distances between a set of taxa by plotting them against each other in a simple scatter plot. Taxa with similar relative distance characteristics are thereby grouped graphically. Groupings of different taxa may be directly identified, and the distance characteristics of chosen groups visualised and compared using devices to give them different colours or symbols. The program is particularly useful for detecting and analysing subtle trends in gene sequence evolution. This is done by comparing different components of change, for example synonymous versus non-synonymous nucleotide changes, transversions versus transitions and changes in different genes of the same set of taxa, etc. The program has a wide range of other uses, for example comparing different methods of sequence analysis, assessing which components of genetic change correlate best with phenotypic change or with geographical separation. This paper describes the DIPLOMO package, and illustrates typical DIPLOMO analyses using lentivirus gene sequence data.

Introduction

The evolutionary relationships between taxa are most often represented as phylogenetic trees, and many different algorithms for tree construction have been developed (Swofford and Olsen 1990, Weiller et al. in press). One severe limitation of tree diagrams is that each tree represents only one particular comparison metric. This might, for instance, be a particular distance measure such as the percentage of different nucleotides, a particular scoring scheme for character changes such as in parsimony programs, or a particular algorithm that estimates the maximum likelihood of a given tree topology. However, a set of real data often includes a variety of different and sometimes conflicting evolutionary signals. These could, of course, be represented in the form of individual trees, each appropriately focussed on a particular signal. Still, the resulting tree diagrams would not be easy to compare, particularly if they had different topologies.


Multivariate analysis methods, like multi-dimensional scaling or principal coordinate analysis (Everitt and Dunn 1992), are better suited to detect and separate different evolutionary signals in data, but it is usually difficult or even impossible to interpret the exact nature of the signals detected. In addition, some of these methods require Euclidean distances (Higgins 1992) and this limits the choice of metrics that can be used for ordination.


Here we report an alternative, which is to calculate a variety of different distance matrices from a set of taxa using different distance measures. The values of the resulting distance matrices can then be plotted against each other, a pair at a time, in a simple scatter plot (Figure 1). The resulting graph shows how well the two distance measures correlate and whether the comparisons of all taxa in the data set correlate similarly or fall into distinct subsets. In the following we refer to this type of scatter plot as a 'distance-plot'.
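The pairing step described above can be sketched in a few lines of Python. This is only an illustration of the principle, not DIPLOMO's own code: the function name `distance_plot_points`, the list-of-lists matrix layout and the numeric values are all assumptions made for the example.

```python
from itertools import combinations


def distance_plot_points(taxa, matrix_x, matrix_y):
    """Return one (taxon_i, taxon_j, x, y) tuple per pairwise comparison.

    Both matrices are assumed to be square, symmetric and to list the
    taxa in the same order, so only the upper triangle is read.
    """
    if not (len(taxa) == len(matrix_x) == len(matrix_y)):
        raise ValueError("both matrices must cover the same taxa in the same order")
    points = []
    for i, j in combinations(range(len(taxa)), 2):
        points.append((taxa[i], taxa[j], matrix_x[i][j], matrix_y[i][j]))
    return points


# Invented values, for illustration only.
taxa = ["A", "B", "C"]
matrix_1 = [[0.00, 0.10, 0.30],
            [0.10, 0.00, 0.25],
            [0.30, 0.25, 0.00]]
matrix_2 = [[0.00, 0.05, 0.40],
            [0.05, 0.00, 0.35],
            [0.40, 0.35, 0.00]]

# As in Figure 1: matrix 2 supplies the x value, matrix 1 the y value.
points = distance_plot_points(taxa, matrix_2, matrix_1)
for first, second, x, y in points:
    print(f"{first}:{second}  x={x:.2f}  y={y:.2f}")
```

A set of n taxa yields n(n-1)/2 data-points, each of which remains tied to the pair of taxa it came from; that link is what later allows individual points to be identified.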

Distance-plots

The principle of distance-plots is simple yet very versatile. Depending on the distance data compared, distance-plots reveal how various methods of phylogenetic inference correlate and provide insight into evolutionary processes and trends. For instance, if the distance matrices A and B in Figure 1 were calculated from the same data but using different algorithms for distance estimation, the resulting distance-plot would provide information on the relative characteristics of the algorithms, and whether the algorithms perform similarly on all taxa. Uses for this type of distance-plot include comparisons of distances estimated using different scoring matrices or methods of multiple hit correction, and of distances estimated from RFLP or RAPD banding patterns using different algorithms. Of particular interest is the comparison of a distance matrix that has been used to reconstruct a phylogenetic tree with a distance matrix obtained from the resulting tree (patristic distance), as this indicates how well the tree represents the original distance data.
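As one illustration of this first type of comparison, the sketch below computes two distance estimates from the same aligned sequences: the uncorrected proportion of differing sites, and the same value after a multiple hit correction. The Jukes-Cantor formula is used here only as a familiar example of such a correction; it is our choice for the sketch and is not named in the text. The function names and the sequences are invented.

```python
import math


def p_distance(seq_a, seq_b):
    """Proportion of differing sites, ignoring positions with a gap ('-')."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in pairs) / len(pairs)


def jukes_cantor_distance(p):
    """Jukes-Cantor correction: d = -3/4 * ln(1 - 4p/3)."""
    if p >= 0.75:
        return math.inf  # the correction is undefined at saturation
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# Invented aligned sequences, for illustration only.
alignment = {
    "taxon1": "ACGTACGTAC",
    "taxon2": "ACGTACGTTC",
    "taxon3": "ACCTAGGTTA",
}

names = sorted(alignment)
points = []
for i, first in enumerate(names):
    for second in names[i + 1:]:
        p = p_distance(alignment[first], alignment[second])
        points.append((first, second, p, jukes_cantor_distance(p)))

for first, second, x, y in points:
    print(f"{first}:{second}  uncorrected={x:.3f}  corrected={y:.3f}")
```

Plotting the corrected against the uncorrected values would give a curve that departs from the diagonal as the distances grow, which is the kind of relative characteristic such a distance-plot is meant to expose.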
A second type of comparison involves distances estimated from different data samples of the same taxa, such as the sequences of two different genes from the same set of organisms. The resulting distance-plot will show whether the differences in one gene correlate with differences in the other. Other comparisons that can be made easily are between morphological, anatomical, physiological, temporal and geographical distances.
A third type of comparison involves dissecting distance estimates into their component parts. For instance, one can divide the nucleotide substitutions between a set of genes into silent and non-silent substitutions and obtain slope lines that give an indication of the relative selection pressures for such changes in the various taxa. Similar clues are obtained from comparisons of mutations in different codon positions. Other methods of dissecting the evolutionary process include comparisons of transitions and transversions, individual substitutions like A-G versus C-T, or the changes in different parts of a set of aligned sequences, like the variable versus conserved regions of rRNA genes or exposed versus buried amino acids of a globular protein.
Despite their simplicity, the application of distance-plots has various practical limitations. First, it is not always clear at the outset which distance measures should be compared in order to best separate signals in the data set. In practice, several rounds of trial and error are needed to explore the data. Second, if the data-points representing some comparisons are found to be atypical of the total data set, it can be very complicated and time-consuming to establish which comparison has given rise to an individual data-point in the plot, and even more so when a group of points is atypical. Nevertheless, individual points or groups of points must be identified if one is to associate distinct characteristics with distinct subgroups or phyla in the data set. It is particularly important for this association to be established quickly and easily if many plots are to be analysed during data exploration, as each result only sets the stage for the next question. Third, we might wish to correlate more than two distance measures. Although multi-dimensional plots are conceptually easy, they cannot be visualised satisfactorily. This can be partially circumvented by switching between several two-dimensional plots, and by using distinctive colours and symbols to consistently represent the comparisons of particular taxa, so that their location in the various plots is readily seen.
The DIPLOMO program (an acronym for DIstance PLOt MOnitor) has been designed to make it easy to create, analyse and compare two-dimensional distance-plots. Special consideration has been given to providing an appealing, intuitive and responsive user interface, allowing the investigator to quickly verify the various conclusions developed during the exploration of his/her data.

The DIPLOMO program

DIPLOMO produces a scatter plot from the values of two different distance matrices in its 'plot-window'. The program does not itself produce the matrices, as this would severely limit the scope of possible analyses; hence a 'DIP-file' containing all matrices must be provided (see below). The 'view' options allow the scaling or magnification of particular plot regions. The 'label' options enable the user to independently specify the symbols (letters or graphic signs) and/or colours that are used to represent comparisons between certain taxa or groups of taxa. When the user selects particular data-points by 'dragging' a selection frame over an area in the plot-window using the mouse or keyboard, the corresponding taxa are immediately identified in the 'monitor-window'. A variety of identification modes can be chosen to accommodate single data-points as well as groups. The label option can also be used to mark the selected data-points, and as all labels are retained when different distance matrices are plotted, it is easy to follow the properties of individual taxa when compared by a variety of different distance measures. The entire state of the plot window can be saved in a file, so that a series of previously labelled and saved plots can be instantly redisplayed, and interrupted DIPLOMO sessions resumed.
One option of DIPLOMO provides an elementary statistical analysis, including correlation and linear regression, of all comparisons that have been selected or given a specific label. These data can also be exported to a generic file for further analysis using, for instance, spreadsheet or statistical analysis programs. In addition, regression lines representing labelled data-points can be superimposed in the plot window. The analysis can be temporarily restricted to particular taxa using the 'hide-OTU' option, and the corresponding pruned distance matrices exported for further phylogenetic analysis using, for example, tree building or ordination programs.
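The kind of summary involved can be sketched as follows, assuming ordinary least squares and the Pearson correlation coefficient; the text does not specify which estimators DIPLOMO uses. The function names `regression_summary` and `hide_otus`, and all values, are hypothetical.

```python
import math


def regression_summary(points):
    """Least-squares slope and intercept, and Pearson r, for (x, y) pairs."""
    n = len(points)
    if n < 2:
        raise ValueError("at least two data-points are needed")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must each vary")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r


def hide_otus(taxa, matrix, hidden):
    """Return the taxon list and distance matrix with the hidden taxa removed."""
    keep = [i for i, name in enumerate(taxa) if name not in hidden]
    pruned = [[matrix[i][j] for j in keep] for i in keep]
    return [taxa[i] for i in keep], pruned


# Invented data-points for one labelled group.
selected = [(0.1, 0.2), (0.2, 0.4), (0.3, 0.6)]
slope, intercept, r = regression_summary(selected)
print(f"slope={slope:.3f} intercept={intercept:.3f} r={r:.3f}")

# Invented matrix, pruned to exclude taxon B.
taxa = ["A", "B", "C"]
matrix = [[0.0, 0.1, 0.3],
          [0.1, 0.0, 0.2],
          [0.3, 0.2, 0.0]]
kept_taxa, pruned = hide_otus(taxa, matrix, {"B"})
```

Fitting each labelled group separately, rather than the whole data set, is what allows the slopes of different groups to be compared.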
Although DIPLOMO has a comprehensive help-text file, all documentation is also available from within the program and quickly accessible using the help-index window. A context sensitive help system provides additional help on every current operation.

 

DIPLOMO files

The 'DIP-file' (file extension '.DIP') is the main input file to DIPLOMO. This simple ASCII file contains the names of all taxa and the distance matrices to be compared. All distance matrices in a DIP-file contain the same number and order of taxa and only differ in the actual distance values. Any relationship that can be represented as a distance can be used to produce a matrix and included in a DIP-file, which, as the format is very simple, will require little editing.
We are currently developing 'DisCalc' (Weiller et al. in prep.), a computer program that produces DIP-files by compiling distance matrices from sequence alignments, and the 'RAPDistance' package (Armstrong et al. in prep.), which produces DIP-files from RAPD and RFLP banding patterns.
'DIG-files' (DIplomo Group) contain the group definitions. These are symbolic taxon names which are used by DIPLOMO for defining groups of taxa. Groups can be nested so that group structures, like trees, can be examined. As the same group definitions can be used by more than one DIP-file, they are stored separately. DIG-files can be created by any ASCII editor or, more conveniently, with the group-editor in DIPLOMO.
'DIV-files' (DIplomo View) store a snap-shot of a DIPLOMO session. They are created and utilised by DIPLOMO to recall previous DIPLOMO sessions. The supplied utility program DIV2PS uses DIV-files to reproduce the DIPLOMO plot in Postscript format for publication quality prints in black and white or colour.
The file 'diplomo.cfg' contains the default program configuration and can be customised using an ASCII editor.

 

Example applications

The DIPLOMO screen shown in Figure 2 gives an example using the env gene sequences of 40 HIV/SIV lentiviruses. The two distance measures used to produce the graph in this example are the proportion of amino acid differences (Y-axis) and the proportion of nucleotide differences in the third codon position (X-axis). This type of comparison, similar to a comparison of non-silent and silent substitutions, shows how strongly changes in the amino acid sequence are selected against in the individual taxa. The taxa involved are given in Table 1 along with the group definitions and labelling strategy used. It can be seen immediately that the ratio of amino acid to nucleotide changes differs in the various virus groups, and is greatest in HIV1 (red '1') and SIVmac (light blue 'M') isolates, followed by HIV2 (dark blue '2') and some SIVsm-related isolates (light blue '+'). It is smallest in the SIVagm group (dark blue 'A') and in comparisons between different groups.
There are two possible explanations for these differences. First, it is possible that env proteins of some viruses were positively selected for their ability to escape immune surveillance and thus for changes in their amino acid sequence. An alternative explanation is that an organism living in one environment for an evolutionarily long time becomes fully adapted to that environment by selection, and asymptotically approaches a minimal rate of evolutionary change. If the organism's environment changes, such as when a virus moves to a new host, its genes are freed from the stabilising selection, and the organism may experience rapid evolution. Both possibilities, or some amalgam of them, will be revealed by a suitable DIPLOMO analysis; they are evidence of a 'punctuated equilibrium' (Gould et al. 1987) or the 'incessant evolutionary dance' between parasites and their hosts (Haldane 1949).
In terms of Gould's hypothesis, the DIPLOMO plot suggests that HIV1 is a recently arisen viral population, as has been concluded by most HIV investigators. DIPLOMO places HIV2 comparisons slightly to the right of the HIV1 comparisons, suggesting that HIV2 is evolutionarily slightly more stable in, and adapted to, the human population than HIV1, and therefore probably older. By contrast there are the SIVagm isolates, which are believed to have been present for at least several thousands of years in their present host, as the four extant species of African green monkeys harbour four distinctive subtypes of SIVagm (Allan 1992). The SIVsm-related isolates (light blue) are even more instructive as they fall into two groups. While most of the comparisons (light blue '+') fall between HIV2 and SIVagm, comparisons within the SIVmac group ('M') and the comparison between sivsmmpb and sivsmmh4 ('S') fall on the diagonal typical for HIV1 isolates, suggesting that these isolates, like HIV1, have recently moved to the host from which they were isolated. This agrees with the evidence: SIVmac viruses probably arose recently from SIVsm virus infections in primate centres in the USA, where macaques were housed with and infected from sooty mangabeys, the natural host of SIVsm (Novembre et al. 1992), and the sivsmmpb isolate arose when a mangabey virus was used to infect a rhesus macaque (Dewhurst et al. 1990).
It is clear that the optimal adaptation between host and parasite for long term survival of both involves a controlled pathogenicity of the parasite, and indeed, the pathogenicity of the viruses in this example correlates well with the diagonal representing them. The SIVmac and sivsmmpb isolates are the most pathogenic lentiviruses, killing their host within weeks of infection and having little likelihood of long term survival. HIV1 is less destructive and HIV2 is even less virulent, whereas the other virus isolates are not known to kill their natural hosts.
The example above demonstrates how DIPLOMO can be used to seek clues of molecular evolutionary trends such as 'punctuated equilibria' and to define groups of isolates with similar genotype (3rd codon position) to phenotype (amino acid) behaviour, which in this instance seems to be related to their pathogenicity. An indication of the influence of the host's immune system can be obtained by comparing the distance-plot in Figure 2 with similar distance-plots of the same viruses, using genes that are less exposed to the immune system. Distance-plots using the same taxa and distance measures but different genes (gag or pol) give similar results, with the separation of the various groups less pronounced in gag and smallest in pol (distance-plots not shown). These differences are clearest when one compares the proportion of amino acid changes of different genes with each other directly. Figure 3 gives a gag versus pol (a) and an env versus pol (b) distance-plot. Both distance-plots group the data similarly and this grouping is reminiscent of the previous example. The main difference is the relative positions of HIV1/sivcpz comparisons (red 'C'), which have shifted from the SIVagm towards the HIV1 diagonal, and the comparisons between isolates from different SIV-related groups (blue/light blue '+'), which have shifted from the SIVagm towards the HIV2 diagonal. Thus genes of some isolates differ in their relative rates of change. These results illustrate the sensitivity of the DIPLOMO analytical method and its ability to detect subtle evolutionary trends. However, as the purpose of this paper is to introduce the DIPLOMO program rather than to give a comprehensive analysis of lentivirus evolution, we leave further interpretation of these distance-plots to a later publication.

 

System and methods

DIPLOMO is an object-oriented, event-driven C++ program with a graphical user interface. Nevertheless the program is designed to be ported to a variety of platforms with different operating systems. This is made possible by use of the Zinc™ Interface Library (Zinc Software Incorporated, Pleasant Grove, Utah), which is now available for MS-DOS text and graphic mode, Microsoft Windows, Windows-NT, OS/2, Apple Macintosh, UNIX curses and X-Windows. The currently available version of DIPLOMO was developed and compiled with Borland C++ 3.1 and can be used on microcomputers with MS-DOS in text and graphic mode, and under MS-Windows.
The minimum requirement for this version is an IBM compatible personal computer with 480K RAM. To analyse large data sets, 600K free RAM, a fast microprocessor, hard disk, VGA compatible graphic card with colour monitor, and mouse are recommended. The MS-DOS version of DIPLOMO works with up to 180 taxa. The number of distance matrices depends on the free disk space and is practically unlimited.

 

Program availability

For non-commercial use, DIPLOMO can be obtained free of charge either on floppy disk, by sending an empty disk and a self-addressed return envelope to GFW, or on the internet via anonymous ftp or gopher from life.anu.edu.au (IP 150.203.38.74). The DIPLOMO package is in directory 'pub/software/diplomo', file dipxxx.exe (where xxx is the version number). This file is a self-extracting archive and must be transferred in binary mode.

References

  1. Allan, J. S. (1992) Viral evolution and AIDS. J NIH Res, 4, 51-54.
  2. Dewhurst, S., Embretson, J. E., Anderson, D. C., Mullins, J. I. and Fultz, P. N. (1990) Sequence analysis and acute pathogenicity of molecularly cloned SIV-smmpbj14. Nature, 345, 636-640.
  3. Everitt, B. S. and Dunn, G. (1992) Applied multivariate data analysis. Oxford University Press, New York.
  4. Gould, S. J., Gilinsky, N. L. and German, R. Z. (1987) Asymmetry of lineages and the direction of evolutionary time. Science, 236, 1437-1441.
  5. Haldane, J. B. S. (1949) Disease and evolution. Ric Sci Suppl, 19, 68-76.
  6. Higgins, D. G. (1992) Sequence ordinations: A multivariate analysis approach to analysing large sequence data sets. CABIOS, 8, 15-22.
  7. Novembre, F. J., Hirsch, V. M., McClure, H. M., Fultz, P. N. and Johnson, P. R. (1992) SIV from stump-tailed macaques: Molecular characterization of a highly transmissible primate lentivirus. Virology, 186, 783-787.
  8. Swofford, D. L. and Olsen, G. J. (1990) Phylogeny reconstruction. In Molecular systematics. Edited by D. M. Hillis and C. Moritz, p 411-501, Sinauer Assoc., Sunderland, Mass.
  9. Weiller, G. F., McClure, M. A. and Gibbs, A. J. (in press) Molecular phylogenetic analysis. In Molecular basis of virus evolution. Edited by A. J. Gibbs, C. H. Calisher and F. Garcia-Arenal, Cambridge University Press.

Figures

Fig. 1 - The distance-plot principle


The distance matrices 1 and 2 contain different pairwise distance estimates of the taxa A to Z. Each comparison of two taxa is plotted at y and x coordinates representing the corresponding distance values of matrix 1 and 2 respectively.

 

Fig. 2 - DIPLOMO screen


The plot window displays a distance-plot of the env gene of 40 HIV/SIV viruses. The X and Y axes represent the proportion of 3rd codon position and amino acid differences respectively. A linear regression line is fitted to comparisons between SIVagm isolates. The monitor window displays the taxon names of all comparisons represented in red.

 

Fig. 3 - Distance-plots produced by DIV2PS


The proportions of different amino acids in env versus pol (a) and gag versus pol (b) are compared. The individual taxa and their representation are as in Fig. 2 and given in Table 1.

Table

Table 1 - Taxa and their representation.

Taxa:
hiv2ben hiv2d194 hiv2d205 hiv2gh1 hiv2isyr hiv2nihz hiv2rod hiv2st hivcam1 hivd31 hiveli hivhan hivhxb2r hivjrcsf hivjrfl hivlai hivmal hivmn hivndk hivnl43 hivny5cg hivoyi hivrf hivsf2 hivyu2 hivz2z6 sivag3 sivag155 sivag677 sivagty sivcpz sivmm1a1 sivmm32h sivmm142 sivmm239 sivmm251 sivmne sivsmmh4 sivsmmpb sivstm

Groups:
$HIV1 - { hivcam1 hivd31 hiveli hivhan hivhxb2r hivjrcsf hivjrfl hivlai hivmal hivmn hivndk hivnl43 hivny5cg hivoyi hivrf hivsf2 hivyu2 hivz2z6 }
$HIV2 - { hiv2ben hiv2d194 hiv2gh1 hiv2isyr hiv2nihz hiv2rod hiv2st }
$SIVagm - { sivag3 sivag155 sivag677 sivagty }
$SIVmac - { sivmm1a1 sivmm32h sivmm142 sivmm239 sivmm251 sivmne }
$SIVsmm - { sivsmmh4 sivsmmpb }
$SIVsm-rel - { $SIVmac $SIVsmm sivstm }
$SIV-rel - { $HIV2 $SIVagm $SIVsm-rel hiv2d205 }
$HIV-rel - { $HIV1 sivcpz }

Labels:
$HIV-rel:$HIV-rel - red
$HIV-rel:$HIV1 - '1'
$HIV-rel:sivcpz - 'C'
$SIV-rel:$SIV-rel - dark blue
$HIV2:$HIV2 - '2'
$SIVagm:$SIVagm - 'A'
$SIVsm-rel:$SIVsm-rel - light blue
$SIVmac:$SIVmac - 'M'
$SIVsmm:$SIVsmm - 'S'
$HIV-rel:$SIV-rel - magenta
all other comparisons - '+' and black

 

The top section gives the names of all viruses used in Figs. 2 and 3. These were grouped as shown in the second section. Note that all group names start with a '$' sign and that groups can contain other groups. The third section gives the colours and symbols used to represent comparisons between the individual taxa in Figs. 2 and 3. The order of these assignments is significant: for example, all comparisons between SIV-related viruses were first coloured dark blue, but the colour for the SIVsm-related subset was subsequently changed to light blue.
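The nested groups and the order-dependent labelling of Table 1 can be mimicked with the short sketch below. It is our reading of the table, not DIPLOMO's implementation: the data structures, the function names `expand` and `label_comparison`, and the rule that an assignment X:Y applies to any comparison with one taxon in X and the other in Y are assumptions. Group membership is abbreviated to a few taxa per group to keep the example short.

```python
# Abbreviated group definitions in the style of Table 1.
GROUPS = {
    "$HIV1": ["hivlai", "hivmn"],
    "$HIV2": ["hiv2rod", "hiv2st"],
    "$SIVagm": ["sivag3", "sivagty"],
    "$SIVmac": ["sivmm239", "sivmm251"],
    "$SIVsmm": ["sivsmmh4", "sivsmmpb"],
    "$SIVsm-rel": ["$SIVmac", "$SIVsmm", "sivstm"],
    "$SIV-rel": ["$HIV2", "$SIVagm", "$SIVsm-rel", "hiv2d205"],
    "$HIV-rel": ["$HIV1", "sivcpz"],
}

# Assignments in the order given in Table 1; later ones override earlier ones.
RULES = [
    ("$HIV-rel", "$HIV-rel", "colour", "red"),
    ("$HIV-rel", "$HIV1", "symbol", "1"),
    ("$HIV-rel", "sivcpz", "symbol", "C"),
    ("$SIV-rel", "$SIV-rel", "colour", "dark blue"),
    ("$HIV2", "$HIV2", "symbol", "2"),
    ("$SIVagm", "$SIVagm", "symbol", "A"),
    ("$SIVsm-rel", "$SIVsm-rel", "colour", "light blue"),
    ("$SIVmac", "$SIVmac", "symbol", "M"),
    ("$SIVsmm", "$SIVsmm", "symbol", "S"),
    ("$HIV-rel", "$SIV-rel", "colour", "magenta"),
]


def expand(name, groups=GROUPS):
    """Return the set of taxa named by a taxon or a (possibly nested) '$' group."""
    if not name.startswith("$"):
        return {name}
    members = set()
    for item in groups[name]:
        members |= expand(item, groups)
    return members


def label_comparison(taxon_a, taxon_b, rules=RULES, groups=GROUPS):
    """Apply the assignments in order and return the final (colour, symbol)."""
    colour, symbol = "black", "+"  # default for all other comparisons
    for left, right, kind, value in rules:
        left_set, right_set = expand(left, groups), expand(right, groups)
        if (taxon_a in left_set and taxon_b in right_set) or \
           (taxon_b in left_set and taxon_a in right_set):
            if kind == "colour":
                colour = value
            else:
                symbol = value
    return colour, symbol


print(label_comparison("sivmm239", "sivmm251"))
print(label_comparison("hivlai", "sivcpz"))
```

Under this reading a comparison within SIVmac is first coloured dark blue as a member of the SIV-related set, is then recoloured light blue, and finally receives the symbol 'M', which reproduces the override behaviour described for the table.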



 
