Molecular Biology and Evolution 15(3): 326-335. 1998.
Phylogenetic Profiles: A graphical method for detecting genetic
recombinations in homologous sequences
Georg F. Weiller
Bioinformatics Laboratory, Research School of Biological Sciences,
Australian National University, Canberra, ACT 0200, AUSTRALIA,
ph: +61 (2) 6249-5916, fx: +61 (2) 6249-4437
Keywords: genetic recombination, computer algorithm, Salmonella enterica,
molecular phylogeny, evolution
Abstract
Phylogenetic profiles constitute a
novel way of displaying graphically the coherence of the sequence
relationships over the entire length of a set of aligned homologous
sequences. Using a sliding window technique, the method determines
the pairwise distances of all sequences in the windows and evaluates,
for each sequence, the degree to which the patterns of distances
in these regions agree. The method is suited for exploring data
consistency as well as detecting recombinant sequences. A computer
program implementing the algorithm has been developed and examples
with simulated and natural sequences are given to demonstrate
the sensitivity and accuracy of the method for identifying recombinant
sequences and their recombination junctions as well as detecting
hot-spots of recombinational activity.
Introduction
Gene conversion and
other recombinational events such as transposition, transduction
and intron homing are important processes that influence biological
evolution. They also complicate the work of the molecular phylogeneticist,
as genomes rearranged in such ways become a mosaic of regions
with different phylogenetic histories. In viruses, horizontal
gene transfer over a wide range of phylogenetic distances has
been a major evolutionary force (Gorbalenya 1995, Gibbs and
Keese 1995, Nuttall et al. 1995), and similar processes have
been suggested to occur in procaryotes (Whatmore and Kehoe 1994,
Bik et al. 1995), eucaryotes (Assali, Mache and de Goer 1990)
and even between kingdoms (Doolittle et al. 1990). Hence phylogenetic
relationships deduced from gene sequences represent only the
evolutionary history of those genes unless recombination can
be excluded. However, genetic recombination is not limited to
exchanges involving whole genes and there is strong evidence
for intragenic recombination in viruses (Hahn et al. 1988, Sandmeier
1994), bacteria (DuBose, Dykhuizen and Hartl 1988, Reeves 1993,
Li et al. 1994, Thampapillai, Lan and Reeves 1994) and eucaryotes
(Stephens 1985, Weiller, Schueller and Schweyen 1989, Paquin,
Laforest and Lang 1994). The recognition of extra- and intragenic
recombination is not only important for unravelling the phylogenetic
history of genes; it is also crucial for molecular phylogenetic
inference, as trees derived from different genes or gene regions
may differ in topology, and taxa with mosaic sequences will
be placed incorrectly. The analytical challenges posed by horizontal
gene-transfer are reviewed by Syvanen (1994).
Various methods for
detecting gene conversion and recombination in homologous DNA
sequences have been described. In Stephens’ (1985) method, a
set of aligned sequences is split into two subsets at every
variable position, and the distribution of all variable sites
that support particular splits is examined. Significant deviations
from a uniform distribution are taken as an indication that
some of the sequences are recombinants. However, in samples
of more than a few sequences the appropriate splits are hard
to find, and sites with more than two alternative nucleotides
present a problem for Stephens’ method, although the statistical
difficulties created by regions with variable mutation rates
are lessened by simply excluding invariant sites.
Sawyer’s (1989) method
reduces the problem of variable mutation rates by focussing
on silent polymorphic sites. His method also overcomes the partitioning
problem of Stephens’ method by analysing the distribution of
maximal length segments common to some pairs of sequences. A
Monte Carlo test involving permutation of sites is used to estimate
the significance of the distribution. Limitations
of this method include the drastically reduced amount of usable
data when only silent polymorphic sites are used and uncertainty
about the validity of the significance test.
In Fitch and Goodman’s
(1991) ‘Phylogenetic-Scanning’ method, sets of phylogenetic
trees are constructed at different intervals in the sequence
alignment. The support for some of these trees is then evaluated
at all intervals using the parsimony principle and presented
graphically. The main computational complication of this method
arises with the very large number of possible trees when more
than a few sequences are analysed. Only a tiny subset of all
possible trees can be analysed, and it is not always clear which
trees to choose. While the graphs can be very informative, the
requirement that each tree is represented as a column in the
graph further reduces the number of alternatives that can be
tested.
Hein (1993) has developed
a method that employs a heuristic extension of the parsimony
principle to infer phylogenies from recombinant sequences. The
method assumes that a correct tree can be found for some sequence
regions and tries to reconstruct the recombinational steps required
to explain the tree topology found in other regions. Like
Fitch and Goodman’s method, this method is only applicable
for a comparatively small number of sequences and cannot detect
recombinations that do not change the topology of a tree.
Recently, methods
have been developed specifically for the analysis of HIV sequences
(Robertson et al. 1995, Salminen et al. 1995) whereby sliding
windows are used to compare the relationships of aligned sequences
with previously determined HIV prototype sequences. The success
of these methods depends largely on the availability of suitable
prototype sequences as only recombinations that switch the prototype
can be detected.
Two newly developed
and related computer methods make use of compatibility matrices
(Jakobsen and Easteal 1996) and partition matrices (Jakobsen,
Wilson and Easteal 1997) to graphically display the consistency
of the phylogenetic signal in all columns of a multiple sequence
alignment. These methods make fewer assumptions and do not require
the prior knowledge (or even existence) of a single phylogeny.
They are therefore especially helpful for exploratory analysis
of a limited number of sequences.
The ‘phylogenetic
profile’ method, described below, is a new computer graphic
method that overcomes some of the limitations of the currently
available methods. Similar to other methods mentioned above,
it is based on the principle that phylogenetic relationships
derived from different regions of a multiple sequence alignment
will be similar when no recombination has occurred. Thus the
method attempts to establish consistency in sequence relationships
between different parts of the alignment. Rather than tree topologies
or compatibility matrices, the method uses distance data to
describe the relationships and thus avoids many of the difficulties
posed by constructing and comparing tree topologies. The distance
approach makes it possible to detect recombinations that do
not change the tree topology, is very fast to compute and thus
allows analysis of a large dataset with more than a thousand
sequences. The estimate that a recombinational event has occurred
is then plotted for every position of every sequence and the
entire information is displayed in a single diagram.
The Phylogenetic Profile Algorithm
The method introduces
the ‘phylogenetic correlation’ measure that quantifies the coherence
of the sequence interrelationships in two different regions
of a multiple alignment. The phylogenetic correlation of any
given position is determined by evaluating regions immediately
upstream and downstream of this position. Positions in which
sequence relationships in the upstream region clearly differ
from their downstream counterparts exhibit low phylogenetic
correlations and are likely recombination sites.
To determine the phylogenetic
correlation of a given test sequence at a given test location
the method defines two sequence windows, located immediately
before (upstream) and after (downstream) the test location and
determines the differences between the test sequence and all
other sequences in the windows, resulting in two vectors of
distance data. If the test sequence relates to the other sequences
similarly in both windows then the two distance vectors will
exhibit the same trend and correlate well. Conversely, if the
test sequence has recombined so that the sequence fragments
in both windows have different phylogenetic histories then the
two sets of sequence relationships would correlate poorly. Accordingly,
the phylogenetic correlation has been defined as the correlation
coefficient of the two distance vectors.
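The definition above can be sketched in a few lines of Python. This is a minimal illustration rather than the PhylPro implementation: the function names and the toy alignment are assumptions made here, distances are plain difference counts, and the two windows simply cover everything before and after the split.

```python
from math import sqrt

def hamming(a, b):
    """Count the positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def pearson(u, v):
    """Linear (Pearson) correlation coefficient of two equal-length vectors."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((x - mu) * (y - mv) for x, y in zip(u, v))
    var_u = sum((x - mu) ** 2 for x in u)
    var_v = sum((y - mv) ** 2 for y in v)
    if var_u == 0 or var_v == 0:
        return float("nan")  # undefined when one distance vector is constant
    return cov / sqrt(var_u * var_v)

def phylogenetic_correlation(alignment, test, split):
    """Correlate the upstream and downstream distance vectors of one sequence.

    alignment: list of equal-length strings (hypothetical input format)
    test:      index of the test sequence
    split:     index of the first column of the downstream window
    """
    up, down = [], []
    for k, other in enumerate(alignment):
        if k == test:
            continue  # a sequence is not compared with itself
        up.append(hamming(alignment[test][:split], other[:split]))
        down.append(hamming(alignment[test][split:], other[split:]))
    return pearson(up, down)

# Toy data: two clades (A-like and G-like) and one recombinant of both.
toy = ["AAAAAAAAAAAA", "AAAAACAAAAAC",
       "GGGGGGGGGGGG", "GGGGGTGGGGGT",
       "AAAAAAGGGGGG"]  # last entry: upstream from clade 1, downstream from clade 2
values = [phylogenetic_correlation(toy, i, 6) for i in range(len(toy))]
```

In this toy set the recombinant receives a strongly negative value while the four other sequences stay positive, which mirrors the pattern described for R1 in the text.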
Table 1 demonstrates the computation of the phylogenetic correlation
for nine of the sequences described in the legend to Figure 1.
As the recombinant sequence R1 matches sequence S1 upstream
and sequence S6 downstream of the recombination site at position 500,
the R1 distance vectors are identical to the S1 and S6 distance
vectors in the respective regions. The phylogenetic correlations
of all nine sequences at the R1 recombination site are given
in the bottom row of Table 1.
Note the poor phylogenetic correlation of sequence R1. The phylogenetic
correlations for the sequences S1 and S6 are slightly lower than
the values for the other sequences but clearly higher than the
R1 value. These values reflect the degree to which the relationships
of the test sequence differ in the two windows. While the upstream
and downstream distance vectors of the recombinant R1 differ greatly,
the vector pairs of the other sequences (S1-8) are closely related,
varying mainly in their R1 component; this variation is especially
pronounced for S1 and S6, which were used to construct R1.
For each individual
sequence in the alignment, the phylogenetic correlations are
computed at every position using sliding window techniques.
If a recombination site is not located exactly between the two
windows but inside one of them, then a part of the test sequence
in the two windows will still have similar relationships, resulting
in an intermediate phylogenetic correlation. Consequently, when
the windows move over a recombination junction, the phylogenetic
correlation decreases; it is smallest when the recombination
junction lies exactly at the boundary between the two windows. The
plot of all phylogenetic correlations of a sequence against
the sequence positions is termed a ‘phylogenetic profile’, and
the profiles of all individual sequences are typically superimposed
in a single diagram. By examining and comparing the phylogenetic
profiles for all sequences, the recombinant sequences and the
location of recombination junctions are easily detected.
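Reading junction candidates off a profile amounts to locating its pronounced minima. The sketch below assumes the profile of one sequence is already available as a list of correlation values, one per position; the function name and the cut-off of 0.5 are hypothetical choices for illustration and are not taken from the paper.

```python
def profile_minima(profile, threshold=0.5):
    """Return the indices of local minima whose value lies below `threshold`.

    profile:   phylogenetic correlations of one sequence, one per position
    threshold: illustrative cut-off (an assumption, not a published value)
    """
    hits = []
    for i in range(1, len(profile) - 1):
        value = profile[i]
        # A candidate must be low in absolute terms and lower than its neighbours.
        if value < threshold and value <= profile[i - 1] and value < profile[i + 1]:
            hits.append(i)
    return hits

example_profile = [0.95, 0.90, 0.60, 0.10, 0.55, 0.90, 0.92, 0.40, 0.91]
candidates = profile_minima(example_profile)
```

For the example values the candidates are positions 3 and 7. In practice the visual comparison of all superimposed profiles, as described above, remains the primary tool; a threshold of this kind only helps to tabulate what the plot shows.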
Phylogenetic profiles
can exploit a variety of different measures for estimating the
sequence distances as well as for determining the phylogenetic
correlation of distance vectors. In addition, two different
sliding window techniques can be used. These parameters are
briefly discussed below.
Distance estimates
Although a simple count of different
nucleotides or amino acids (Hamming distance) gives an adequate
measure of distance, the fraction of differences (p-distance)
is preferable if alignment gaps impede a significant number
of pairwise comparisons. The phylogenetic profile principle
is however amenable to any distance metric that provides sensible
distance values including multiple hit corrections and various
nucleotide or amino acid scoring matrices (PAMs etc.). For a
more detailed treatment of distance values see Weiller,
McClure and Gibbs (1995).
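The difference between a raw count and the fraction of differences matters once gaps remove sites from some comparisons. A minimal sketch of the p-distance follows, assuming that gaps are coded as "-" and that a site is skipped whenever either sequence has a gap there; both conventions are assumptions made for this illustration.

```python
def p_distance(a, b, gap="-"):
    """Fraction of differing sites among the sites where neither sequence has a gap."""
    compared = differing = 0
    for x, y in zip(a, b):
        if x == gap or y == gap:
            continue  # this site cannot be compared for this pair
        compared += 1
        differing += x != y
    # With no comparable site the distance is undefined.
    return differing / compared if compared else float("nan")

d = p_distance("ACGT-A", "ACCTTA")  # 1 difference over 5 comparable sites
```

Dividing by the number of comparable sites keeps pairs with many gaps from appearing artificially close, which a simple count of differences would not do.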
Inter-correlation measures
A number of standard coefficients
can be used to determine the phylogenetic correlation of distance
vectors. The Bray-Curtis distance, Canberra metric, chi-squared
distance, average Manhattan distance, and the linear correlation
coefficient as well as nonparametric correlations like the Spearman
Rank-Order Correlation were explored in a variety of simulated
and real sequence sets. As all these measures gave similar results,
the linear correlation coefficient (Pearson coefficient) was
used for all data presented here. Note that multiplication of
a distance vector with a constant will not change the correlation
value. It is therefore not necessary to normalise the sequence
differences by the window width even when the widths of the
upstream and downstream window differ. For a more detailed treatment
of correlation measures see Rohlf (1993) and William et al.
(1992).
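The scale invariance noted above follows directly from the definition of the linear correlation coefficient. Writing u and v for the upstream and downstream distance vectors (symbols introduced here for illustration only) and a > 0 for any constant factor, such as the reciprocal of a window width:

```latex
r(u, v) = \frac{\sum_k (u_k - \bar{u})(v_k - \bar{v})}
               {\sqrt{\sum_k (u_k - \bar{u})^2}\,\sqrt{\sum_k (v_k - \bar{v})^2}},
\qquad
r(a u, v) = \frac{a \sum_k (u_k - \bar{u})(v_k - \bar{v})}
                 {a \sqrt{\sum_k (u_k - \bar{u})^2}\,\sqrt{\sum_k (v_k - \bar{v})^2}}
          = r(u, v) \quad (a > 0).
```

The argument applies to the Pearson coefficient used here; the other measures listed above are not all invariant in this way and may need normalised distances.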
Sequence windows
In general, the recombination
signal will be strongest, i.e. the phylogenetic correlation will
be minimal, when the window used for determining one distance
vector contains only sequence from one ancestor, while the other
window contains only sequence from a different and phylogenetically
discordant ancestor. Multiple recombination sites within one
window will probably decrease the resolution of the method.
Hence, sequences with many recombination sites are best analysed
using appropriately narrow windows. Wider windows, by contrast,
will be less discriminatory but will provide more sites for
estimating the interrelationships of the sequences, and therefore
enhance the signal-to-noise ratio in the resulting plot.
Two different techniques
are used to control ‘movement’ of the sequence windows, and
these are optimal for different types of sequence data. The
first method uses the entire sequence in two variable sized
windows with the left edge of the upstream window fixed at the
beginning of the aligned sequences, the right edge of the downstream
window fixed at the sequence end and a sliding split between
them. Thus for the first comparison, the upstream window covers
only the first site of the alignment, while the downstream window
covers all remaining sites. In successive steps the upstream
window grows by a single site, while the downstream window decreases
by one site until the downstream window covers the last site
only and the upstream window covers the remaining sites.
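The variable-width scheme just described can be written as a small generator. This is a sketch with a hypothetical function name; windows are expressed as Python slices over 0-based column indices.

```python
def variable_windows(n_sites):
    """Yield (upstream, downstream) slices for every split of the alignment.

    The upstream window always starts at the first site and the downstream
    window always ends at the last site; only the split between them moves.
    """
    for split in range(1, n_sites):
        yield slice(0, split), slice(split, n_sites)

windows = list(variable_windows(5))
first_up, first_down = windows[0]    # 1 site upstream, 4 sites downstream
last_up, last_down = windows[-1]     # 4 sites upstream, 1 site downstream
```

An alignment of n sites therefore yields n - 1 comparisons, and the windows at either extreme contain very few sites.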
The second method
uses two windows with identical and fixed widths and consequently
cannot analyse sites that lie less than one window width from
either end of the sequence; the width of the two windows is
specified at the beginning of the scanning process. The appropriate
minimal width depends on the variability of the sequences analysed,
as the method requires sufficient variable sites inside each
window to reliably determine the sequence relationships. To
allow for datasets that have an uneven distribution of variable
positions, invariant sites are removed from the sequences before
analysis, as these differentially dilute the recombination signal.
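The two preparatory steps of the fixed-width scheme, removing invariant sites and enumerating the admissible window pairs, might look as follows. The function names are hypothetical, and treating every character (including a gap symbol) as an ordinary state is an assumption of this sketch.

```python
def variable_site_columns(alignment):
    """Return the indices of columns in which at least two sequences differ."""
    return [j for j, column in enumerate(zip(*alignment)) if len(set(column)) > 1]

def condense(alignment):
    """Keep only the variable columns of an alignment (list of equal-length strings)."""
    keep = variable_site_columns(alignment)
    return ["".join(seq[j] for j in keep) for seq in alignment]

def fixed_windows(n_sites, width):
    """Yield (upstream, downstream) slices of equal width for every admissible split.

    Splits closer than `width` sites to either end are skipped, because one of
    the two windows would not fit inside the alignment.
    """
    for split in range(width, n_sites - width + 1):
        yield slice(split - width, split), slice(split, split + width)

aln = ["AACGT", "AATGA", "AACGA"]
condensed = condense(aln)            # columns 2 and 4 are the only variable ones
pairs = list(fixed_windows(6, 2))    # splits 2, 3 and 4
```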
Explorations with simulated data
To demonstrate the properties of phylogenetic
profiles, and to explore their dependency on parameters and
sequence inclusion sets, several phylogenetic profiles have
been produced using simulated sequences as well as recombinants
of these constructed in-silico.
Single recombinant sequences
Eight related sequences
(S1-8) and two recombinant sequences, one with a single crossover
(R1) and one with a central insertion (R2) were constructed.
Dendrograms showing the relationships of these data are given
in Figure 1. The 10 aligned sequences
were then condensed to the 733 variable positions only. In the
condensed sequences, the recombination junctions corresponded
to the positions 349 (R1) and 237/465 (R2) respectively. Several
phylogenetic profiles were derived from this dataset (Figure
2). A simple count of different nucleotides was used to
estimate sequence distances, and the linear correlation coefficient
of the distance vectors was calculated to determine their phylogenetic
correlation.
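Recombinants of the R1 type (single crossover) and the R2 type (central insertion) can be assembled in silico by joining parental segments, and junction positions can then be translated into the coordinates of the condensed alignment. The sketch below uses hypothetical helper names, short toy parents and 0-based positions; whether it reproduces the published coordinates exactly depends on indexing conventions that are not stated in the text.

```python
from bisect import bisect_left

def mosaic(parents, junctions):
    """Join consecutive segments taken from the listed parents.

    parents:   one parent sequence per segment, in order
    junctions: ascending cut positions (index of the first site of the next segment)
    """
    bounds = [0, *junctions, len(parents[0])]
    return "".join(p[a:b] for p, a, b in zip(parents, bounds, bounds[1:]))

def to_condensed(position, variable_columns):
    """Map an alignment position to the number of variable columns before it."""
    return bisect_left(variable_columns, position)

s1, s6 = "A" * 9, "G" * 9                # toy stand-ins for S1 and S6
r1_like = mosaic([s1, s6], [5])          # single crossover
r2_like = mosaic([s1, s6, s1], [3, 6])   # central segment donated by the second parent
```

With the real data the same mapping would turn a junction given in alignment coordinates into its position among the variable sites, given the sorted list of variable column indices.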
The profiles on the
top row (a) of Figure 2 include
the recombinant sequences R1 (left column) and R1/R2 (right
column) together with sequences S1-S8. The progenitor sequences
S1 and S6 of the recombinant sequences R1 and R2 were omitted
from the plots b-d. Nevertheless, the recombinant sequences can
still be identified in these plots, even in series 2, where
the phylogenetic background signal is fairly weak as 2 of the
8 sequences are recombinant.
Series a and b use
the variable width window technique, which utilises the entire
length of the alignment for analysis, but it can be seen that
the estimate of phylogenetic correlations becomes ‘noisy’ at
either end, as one or other of the windows becomes too narrow.
Note that the phylogenetic correlation for sequence R1 (bold
in a1 and b1) is small over the entire sequence, as the R1 recombination
junction is always included in one of the sequence windows.
The value is smallest at the R1 junction, as at this point,
each window contains sequences exclusively from different parents.
Note that data shown in Table 1
are taken from the R1 junction in a1. The profile of recombinant
R2 (bold in a2 and b2) is decreased less, as the R2 sequence
contains some sites donated by S1 in both windows, irrespective
of the position of the window split. The recombination junctions
are nevertheless clearly visible, as the profile has its minima
at the two junction sites (237 and 465), where one window contains
sequence exclusively derived from S1, while the other window
contains sequence from both parents (S1 and S6). Note also the
large phylogenetic correlation of R2 around site 350, where
the R2 sequences in both windows contain a similar mixture of
sites of both parental sequences. This large value indicates
that R2 has an insertion and would not have been observed if
the 5’ and 3’ sequences of R2 had come from different parents.
The recombination
junction of R1 cannot be determined precisely from b2 alone,
as the second recombinant R2 distorts the phylogenetic correlation
of R1. This distortion is exceptionally pronounced because R2
was constructed from the same donor sequences as R1, and these
were excluded from the analysis represented by b2.
Series c and d use
fixed size windows of size 70 and 35 respectively. Note that
the smaller fixed size windows in c2, because they contain fewer
contradictory sites, are better suited to pinpoint the three
crossover sites than the maximal size windows in b2. The series
d graphs demonstrate that when the window width is too small,
the noise generated by sampling errors obscures the recombination
signals.
Recombination hot-spots and multiple recombinants
The phylogenetic profile
method determines the phylogenetic correlation of a particular
sequence in different regions by comparing it with other reference
sequences. However, some of the reference sequences may themselves
be recombinants; indeed, when the analysis includes recombination
hot-spots, all or most sequences might be recombinant.
To demonstrate the
properties of phylogenetic profiles in these situations, the
artificial dataset was modified, generating two series (A and
B) of reciprocal recombinant sequences by combining every sequence
s(i) with the sequence s(i+2). This resulted in the recombinant
sequences of type S1:3, S2:4,
S3:5, S4:6, S5:7, S6:8, S7:1 and S8:2. Series A contained these
16 reciprocally recombined sequences, which resembled R1 in the
example above, with a single site of recombination again at
site 500 (349 in the condensed variable sites sequences). Series
B was constructed in a manner analogous to that used to construct
R2, by exchanging the centre and flanking regions of s(i)
with s(i+2). This also yielded 16 reciprocally
recombinant sequences with recombination junctions occurring
at sites 333 and 666 (237 and 465 in the condensed variable
sites sequences). Note that the s(i):s(i+2)
scheme for producing recombinants resulted in the
sequences combining regions of different similarity. While the
sequences S1:3, S2:4, S5:7 and S6:8 are relatively closely related,
sequences S3:5, S4:6, S7:1 and S8:2 are more distantly related
and so stronger recombination signals can be expected from combinations
of the latter. Figure 3 gives
the phylogenetic profiles of the two series of recombinants,
as well as a combination of both. Note that the sites of recombination
can be clearly seen in all three graphs. Strong and weak signals
cannot be distinguished in graphs a) and b) of Figure 3,
because here all sequences have their recombination
junction at the same site and the algorithm has no means to
determine which sequence is closest to the unrecombined ‘wild-type’
sequence. However, when the sequences of both series, A and B,
are included, the strength of the recombination signal is revealed
(graph c in Figure 3). This
is because the recombination sites of the two series differ;
therefore there are always some partial sequences in the parental
(non-recombined) configuration at every site in the dataset,
allowing the algorithm to distinguish between strong and weak
recombination signals. Consequently, it can be seen that the
phylogenetic correlation of some sequences is particularly small
(< 0 in Figure 3 c), and these
come from the recombinants formed from the most distantly related
sequences S3:5, S4:6, S7:1 and S8:2. However, the dataset still
does not contain sufficient information to distinguish clearly
the recombinants of closely related sequences from parental
sequences. In general, if a large proportion of the sequences
are recombinants, the phylogenetic profile method may not be
able to determine which of the sequences are the parents; the
identification of recombinational hot-spots, however, is not impaired.
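The pairing scheme used to build series A, in which every sequence is combined with the sequence two places further on and the numbering wraps around after the eighth, can be sketched as follows. The function names are hypothetical, and the labels given to the reciprocal products (for example S3:1 as the counterpart of S1:3) are a naming assumption made here.

```python
def reciprocal_pairs(n, offset=2):
    """Pair sequence i with sequence i+offset (1-based), wrapping around after n."""
    return [(i, (i + offset - 1) % n + 1) for i in range(1, n + 1)]

def reciprocal_recombinants(seqs, junction, offset=2):
    """Return both single-crossover products for every pair, keyed by label."""
    products = {}
    for i, j in reciprocal_pairs(len(seqs), offset):
        a, b = seqs[i - 1], seqs[j - 1]
        products[f"S{i}:{j}"] = a[:junction] + b[junction:]
        products[f"S{j}:{i}"] = b[:junction] + a[junction:]  # reciprocal product
    return products

toy_parents = [str(k) * 4 for k in range(1, 9)]   # "1111", "2222", ..., "8888"
series_a = reciprocal_recombinants(toy_parents, 2)
```

Eight pairs with two products each give the 16 sequences of a series. A series of the B type would use two junctions per pair instead of one.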
Complex Phylogenies
Multiple recombinations during the
phylogenetic development of sequences can lead to very complex
phylogenetic relationships, resulting in the translocation of
entire subtrees. In addition, continuing evolutionary changes
overwrite the initial signal in the sequences. Hudson and Kaplan
(1985) have demonstrated that recombination events may leave
no evidence in the extant sequences. The simulation given in
Figure 4 was chosen to demonstrate
the behaviour of phylogenetic profiles in this more realistic
situation. A randomly generated sequence containing 25% of each
nucleotide was evolved over four generations with 2% nucleotide
changes per generation. At every generation but the last, one
recombinant sequence was created (x, y and z in Figure 4)
and added to the population. A phylogenetic profile of
the resulting 30 sequences is given in Figure 5.
Only the parsimoniously informative sites are used for
distance calculation. Note the four descendants of the recombinant
sequence y and the two descendants of the recombinant sequence
z are clearly identified. The identification of the eight descendants
of the recombinant x is more difficult albeit still possible
and could be considered as close to the limit of the resolution
of this particular profile. The sequence window was chosen deliberately
wide (60 parsimonious sites) in order to collect sufficient
signal. When smaller windows (10 - 40 parsimonious sites) were
used, only the recombinants derived from sequences y and z could
be clearly identified (data not shown). This was to be expected
as the recombination x combines two sequences (sB and sA) with
only 4% sequence differences. Evolution of sequence x for 3
more generations added an additional 12% changes to the sequences,
which obscure the recombination signal. Note that although 12
of the 30 sequences are recombinant, the dataset still provides
sufficient background signal to distinguish all recombinant
sequences. When a large proportion of the dataset is represented
by recombinant sequences the exclusion of sequences with regions
of low phylogenetic correlation can help to improve the detection
of sequences with modest recombination signals. This is shown
with the following example using a real dataset.
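The simulation scheme used for Figure 4 (a random ancestor, bifurcating descent with 2% change per generation, and one recombinant added in every generation but the last) can be sketched as below. Several details are assumptions of this sketch and are not taken from the paper: the sequence length, the choice of recombination parents (first and last member of the current population), the junction at the midpoint, and the rule that exactly the stated fraction of sites is changed to a different base.

```python
import random

def random_sequence(length, rng):
    """Random sequence in which each nucleotide has probability 25%."""
    return "".join(rng.choice("ACGT") for _ in range(length))

def mutate(seq, fraction, rng):
    """Change exactly round(fraction * len(seq)) distinct sites to a different base."""
    sites = rng.sample(range(len(seq)), round(fraction * len(seq)))
    bases = list(seq)
    for s in sites:
        bases[s] = rng.choice([b for b in "ACGT" if b != bases[s]])
    return "".join(bases)

def simulate(length=1000, generations=4, fraction=0.02, seed=1):
    """Return the final population after bifurcating descent with recombinants."""
    rng = random.Random(seed)
    population = [random_sequence(length, rng)]
    for g in range(generations):
        # Every sequence leaves two independently mutated offspring.
        population = [mutate(s, fraction, rng) for s in population for _ in range(2)]
        if g < generations - 1:  # no recombinant is added in the last generation
            a, b = population[0], population[-1]   # hypothetical choice of parents
            cut = length // 2                      # hypothetical junction
            population.append(a[:cut] + b[cut:])
    return population

final_population = simulate(length=200)
```

With four generations the population grows as 2, 3, 6, 7, 14, 15 and finally 30 sequences, matching the number analysed in Figure 5.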
Example with real data
In order to test the phylogenetic
profile method with real gene sequences, a set of 34 gnd
gene sequences of Salmonella enterica was scanned for
recombinations. These sequences and the sites where recombinations
have probably occurred were previously reported by Thampapillai,
Lan and Reeves (1994). The sequences are 1329 bp long and correspond
to position 16 to 1344 of the gnd coding region. Only
the variable sites were used for the phylogenetic profiles and
these are given in Figure 6.
Individual sequences
As many of these sequences
are recombinants, it could be expected that strong recombination
signals would obscure the weaker ones. To avoid this, the analysis
was repeated many times, and after each analysis the sequence
that gave the smallest phylogenetic correlation (i.e. the strongest
recombination signal) was removed. Once the 14 sequences with
the smallest phylogenetic correlation were removed, they were
reintroduced into the dataset individually. Some of the resulting
phylogenetic profiles are given in Figure
7. The top graph in Figure 7
shows a phylogenetic profile of all 34 sequences, whereas the
remaining profiles are each of 21 sequences, with the profile
of the reintroduced sequence highlighted. It can easily be seen
that the reintroduced sequences have one or more distinct minima
in their phylogenetic profiles, suggesting that recombinations
might have occurred at or close to these minima. The possible
recombination junctions, as deduced from the phylogenetic profiles
in Figure 7, are summarised in
Table 2. Changing the various
scanning parameters resulted in very similar plots (data not
shown), varying only the relative strength of the individual
recombination signals slightly but not changing the conclusions
that can be drawn from the analysis.
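The iterative procedure described in this section, in which the sequence with the strongest recombination signal is removed after each scan, can be outlined as follows. The scoring step is passed in as a function because it stands for a complete profile computation; that callable, the function name and the toy scores are hypothetical.

```python
def peel_strongest_signals(names, lowest_correlation, n_remove):
    """Repeatedly drop the sequence with the smallest phylogenetic correlation.

    names:              labels of all sequences in the dataset
    lowest_correlation: callable taking the remaining labels and returning
                        {label: smallest value in that label's profile},
                        recomputed against the remaining sequences only
    n_remove:           number of removal rounds
    """
    remaining, removed = list(names), []
    for _ in range(n_remove):
        scores = lowest_correlation(remaining)
        worst = min(remaining, key=scores.__getitem__)
        remaining.remove(worst)
        removed.append(worst)
    return remaining, removed

# Toy scores that do not depend on the remaining set (a simplification).
toy_scores = {"a": 0.9, "b": -0.2, "c": 0.4, "d": 0.1}
kept, dropped = peel_strongest_signals(
    list("abcd"), lambda rest: {n: toy_scores[n] for n in rest}, 2)

# Each removed sequence is then analysed against the retained background alone.
reintroduced_sets = [kept + [name] for name in dropped]
```

Each reintroduction set contains the retained sequences plus one removed sequence, which corresponds to the single highlighted sequence in each of the lower graphs of Figure 7.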
As the purpose of
these tests was to examine the capability of the phylogenetic
profile method rather than give a comprehensive analysis of
the evolution of the gnd locus, further interpretation
of these profiles is left to a later publication. A comprehensive
analysis of the evolution of the gnd locus of S. enterica
has been made by Thampapillai, Lan and Reeves (1994), who
searched for evidence of recombination in the sequences of the
strains m318, m298, m130, m38, m322 and m321. All of the recombinations
reported by the authors are detected by the phylogenetic profile
method and the positions of the recombination junctions agree
in the two studies. These sequences are shown in the graphs
in the left column of Figure 7,
whereas the right column features sequences with similarly strong
recombination signals that have previously not been recognised.
The additional sites detected illustrate the strength and sensitivity
of the phylogenetic profile method.
Recombination hotspots
Many of the gnd
sequences appear to exhibit particularly small phylogenetic
correlations in their central parts. Nine of the 22 local minima
(marked with * in Table 2) overlap
the variable sites 143-153, indicating a particularly large
number of recombinations in this region. As previously reported
by Thampapillai, Lan and Reeves (1994), a variant of the general
recombination stimulating sequence, chi, is located in this
region at position 744-751 (variable site 145-150) of many strains
in this analysis. All of the nine recombinant sequences mentioned
above have the canonical sequence (5’-CCTGGTGG-3’), which
differs by a single bp from the E. coli chi motif (5’-GCTGGTGG-3’).
This sequence could possibly be regarded as the S. enterica
equivalent to E. coli chi.
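Chi-like motifs of this kind can be located by scanning a sequence for 8-mers that differ from the E. coli chi motif at no more than one position. The following sketch searches one strand only; the function name, the mismatch limit and the toy sequence are illustrative assumptions.

```python
ECOLI_CHI = "GCTGGTGG"

def chi_like_sites(seq, motif=ECOLI_CHI, max_mismatch=1):
    """Return (0-based start, matched word, mismatches) for near matches to `motif`."""
    k = len(motif)
    hits = []
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        mismatches = sum(x != y for x, y in zip(word, motif))
        if mismatches <= max_mismatch:
            hits.append((i, word, mismatches))
    return hits

hits = chi_like_sites("AAACCTGGTGGTTT")
```

The toy sequence contains the variant CCTGGTGG, which is reported with one mismatch at the first position of the motif.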
In order to examine
the influence of this sequence motif on the phylogenetic profile
of the sequences, a profile was created exclusively of sequences
which had other variants of chi in this location. As can be
seen in Figure 8 a), none of
them has a local minimum at this site in its phylogenetic profile.
In contrast, all sequences with local phylogenetic correlation
minima near the chi site also have the canonical (CCTGGTGG)
chi motif (Figure 8 b). This
chi motif is also found in sequences w2D1, m38, m130, lind and
316, which were not included in Figure
8 b) as their phylogenetic profiles are more difficult to
interpret when analysed in the context of the sequences in Figure
8 b). However, as shown above (Figure
7 and Table 2), recombinations
at or close to the chi-like motif are also likely for these
five sequences.
Implementation
All the phylogenetic profiles were
generated using the PhylPro
program. PhylPro is a Microsoft Windows application developed
in C++ by the author. A prototype of the program is available
from the author free of charge. The finalised version of the
program will be released to the public domain later in 1997.
Acknowledgments
I thank Prof. Peter Reeves for communicating
the S. enterica sequences prior to publication, Prof.
Adrian Gibbs for helpful discussions during the development
of the phylogenetic profile method and Holger Averdunk for his
help in exploring the parameter space of the method.
Literature cited
Assali, N. E., R.
Mache, and S. L. de Goer. 1990. Evidence for a composite phylogenetic
origin of the plastid genome of the brown alga Pylaiella
littoralis (L.) Kjellm. Plant. Mol. Biol. 15:307-315.
Bik, E. M., A. E.
Bunschoten, R. D. Gouw, and F. R. Mooi. 1995. Genesis of the
novel epidemic Vibrio cholerae O139 strain: evidence
for horizontal transfer of genes involved in polysaccharide
synthesis. EMBO J. 14:209-216.
Doolittle, R. F.,
D. F. Feng, K. L. Anderson, and M. R. Alberro. 1990. A naturally
occurring horizontal gene transfer from eukaryote to prokaryote.
J. Mol. Evol. 31:383-388.
DuBose, R., D. Dykhuizen,
and D. Hartl. 1988. Genetic exchange among natural isolates
of bacteria: recombination within the phoA locus of Escherichia
coli. Proc. Natl. Acad. Sci. USA 85:7036-7040.
Felsenstein, J. 1991.
PHYLIP: phylogeny inference package. Version 3.4. University
of Washington, Seattle.
Fitch, D. H. A., and
M. Goodman. 1991. Phylogenetic scanning: a computer-assisted
algorithm for mapping gene conversions and other recombinational
events. CABIOS 7:207-215.
Gibbs, A. J., and
P. K. Keese. 1995. In search of origins of viral genes. Pp.
76-91 in A. J. Gibbs, C. H. Calisher and F. Garcia-Arenal,
eds. Molecular Basis of Virus Evolution. Cambridge University
Press, Cambridge, UK.
Gorbalenya, A. E.
1995. Origin of RNA viral genomes: approaching the problem by
comparative sequence analysis. Pp. 49-67 in A. J. Gibbs,
C. H. Calisher and F. Garcia-Arenal, eds. Molecular Basis of
Virus Evolution. Cambridge University Press, Cambridge, UK.
Hahn, C. S., S. Lustig,
S. Strauss, and E. G. Strauss. 1988. Western equine encephalitis
virus is a recombinant virus. Proc. Natl. Acad. Sci. USA 85:5997-6001.
Hein, J. 1993. A heuristic
method to reconstruct the history of sequences subject to recombination.
J. Mol. Evol. 36:396-405.
Hudson, R. R., and
N. L. Kaplan. 1985. Statistical properties of the number of
recombination events in the history of a sample of DNA sequences.
Genetics 111:147-164.
Jakobsen, I. B., and
S. Easteal. 1996. A program for calculating and displaying compatibility
matrices as an aid in determining reticulate evolution in molecular
sequences. CABIOS 12:291-295.
Jakobsen I. B., S.
R. Wilson, and S. Easteal. 1997. The partition matrix: Exploring
variable phylogenetic signals along nucleotide sequence alignments.
Mol. Biol. Evol. 14:474-484.
Li, J., K. Nelson,
A. C. McWhorter, T. S. Whittam, and R. K. Selander. 1994. Recombinational
basis of serovar diversity in Salmonella enterica. Proc.
Natl. Acad. Sci. USA 91:2252-2256.
Nuttall, P. A., M.
A. Morse, L. D. Jones, and A. Portela. 1995. Adaptation of members
of the Orthomyxoviridae family to transmission by ticks. Pp.
416-26 in A. J. Gibbs, C. H. Calisher and F. Garcia-Arenal,
eds. Molecular Basis of Virus Evolution. Cambridge University
Press, Cambridge, UK.
Paquin, B., M. J.
Laforest, and B. F. Lang. 1994. Interspecific transfer of mitochondrial
genes in fungi and creation of a homologous hybrid gene. Proc.
Natl. Acad. Sci. USA 91:11807-11810.
Reeves, P. R. 1993.
Evolution of Salmonella O antigen variation by interspecific
gene transfer on a large scale. Trends Genet. 9:17-22.
Robertson, D. L.,
P. M. Sharp, F. E. McCutchan, and B. H. Hahn. 1995. Recombination
in HIV-1. Nature 374:124-126.
Rohlf, F. J. 1993.
NTSYS-pc: Numerical taxonomy and multivariate analysis system,
Applied Biostatistics Inc., New York 11733, ISBN:0-925031-22-4
Salminen, M. O., J.
K. Carr, D. S. Burke, and F. E. McCutchan. 1995. Identification
of breakpoints in intergenotypic recombinants of HIV Type 1
by bootscanning. AIDS Res. Hum. Retroviruses 11:1423-1425.
Sandmeier, H. 1994.
Acquisition and rearrangement of sequence motifs in the evolution
of bacteriophage tail fibres. Mol. Microbiol. 12:343-350.
Sawyer, S. A. 1989.
Statistical tests for detecting gene conversion. Mol. Biol.
Evol. 6:526-538.
Schoeniger, A., and
A. Haeseler. 1995. Simulating efficiently the evolution of DNA
sequences. CABIOS 11:111-115.
Stephens, J. C. 1985.
Statistical method of DNA sequence analysis: detection of intragenic
recombination or gene conversion. Mol. Biol. Evol. 2:539-556.
Syvanen, M. 1994.
Horizontal gene transfer: evidence and possible consequences.
Ann. Rev. Genet. 28:237-261.
Thampapillai, G.,
R. Lan, and P. R. Reeves. 1994. Molecular evolution in the gnd
locus of Salmonella enterica. Mol. Biol. Evol.
11:813-828.
Weiller, G. F., C.
M. E. Schueller, and R. J. Schweyen. 1989. Putative target sites
for mobile G+C rich clusters in yeast mitochondrial DNA: Single
elements and tandem arrays. Mol. Gen. Genet. 218:272-283.
Weiller, G. F., M. A. McClure, and
A. J. Gibbs. 1995. Molecular
phylogenetic analysis. Pp. 553-85 in A. J. Gibbs,
C. H. Calisher and F. Garcia-Arenal, eds. Molecular Basis of
Virus Evolution. Cambridge University Press, Cambridge, UK.
Press, W. H., S.
A. Teukolsky, W. T. Vetterling, and B. P. Flannery. 1992. Numerical
Recipes in C. Cambridge University Press, Cambridge, UK.
Whatmore, A. M., and
M. A. Kehoe. 1994. Horizontal gene transfer in the evolution
of group A streptococcal emm-like genes: gene mosaics and variation
in Vir regulons. Mol. Microbiol. 11:363-374.
Table 1
Computation of the phylogenetic correlations at position 500 of the demonstration dataset
|      | R1 a | S1   | S2   | S3   | S4   | S5   | S6   | S7   | S8   |
| distances in upstream window (positions 1-500) b |
| R1   |    0 |    0 |   78 |  148 |  147 |  170 |  178 |  182 |  163 |
| S1   |    0 |    0 |   78 |  148 |  147 |  170 |  178 |  182 |  163 |
| S2   |   78 |   78 |    0 |  125 |  130 |  161 |  164 |  183 |  160 |
| S3   |  148 |  148 |  125 |    0 |   79 |  179 |  177 |  187 |  173 |
| S4   |  147 |  147 |  130 |   79 |    0 |  178 |  177 |  185 |  168 |
| S5   |  170 |  170 |  161 |  179 |  178 |    0 |   83 |  148 |  129 |
| S6   |  178 |  178 |  164 |  177 |  177 |   83 |    0 |  141 |  125 |
| S7   |  182 |  182 |  183 |  187 |  185 |  148 |  141 |    0 |   91 |
| S8   |  163 |  163 |  160 |  173 |  168 |  129 |  125 |   91 |    0 |
| distances in downstream window (positions 501-1000) |
| R1   |    0 |  193 |  200 |  199 |  198 |   84 |    0 |  145 |  139 |
| S1   |  193 |    0 |   85 |  143 |  153 |  192 |  193 |  203 |  187 |
| S2   |  200 |   85 |    0 |  144 |  149 |  199 |  200 |  199 |  190 |
| S3   |  199 |  143 |  144 |    0 |   91 |  195 |  199 |  186 |  181 |
| S4   |  198 |  153 |  149 |   91 |    0 |  201 |  198 |  185 |  190 |
| S5   |   84 |  192 |  199 |  195 |  201 |    0 |   84 |  148 |  148 |
| S6   |    0 |  193 |  200 |  199 |  198 |   84 |    0 |  145 |  139 |
| S7   |  145 |  203 |  199 |  186 |  185 |  148 |  145 |    0 |   82 |
| S8   |  139 |  187 |  190 |  181 |  190 |  148 |  139 |   82 |    0 |
| phylogenetic correlations at position 500 c |
|      | 0.01 | 0.63 | 0.85 | 0.97 | 0.98 | 0.86 | 0.62 | 0.97 | 0.96 |
a The sequences
S1-8 and the recombinant sequence R1 are described in Figure
1.
b Distances were computed
as the number of differences (Hamming distance) within the specified
region of the multiple sequence alignment.
c The phylogenetic
correlation for each sequence is calculated as the linear correlation
coefficient of the upstream and downstream distance vectors
(columns in the matrices above).
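As a check on footnote c, the following minimal Python sketch recomputes the phylogenetic correlations of R1 and S1 from the corresponding columns of Table 1. The function name `pearson` and the variable names are illustrative and are not taken from the program described in the paper. The sketch assumes that the self-distance (0) is kept in each vector, since that assumption reproduces the tabulated values.

```python
from math import sqrt

def pearson(x, y):
    """Linear (Pearson) correlation coefficient of two equal-length vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Columns of Table 1; entries are ordered R1, S1, S2, ..., S8.
upstream_r1 = [0, 0, 78, 148, 147, 170, 178, 182, 163]
downstream_r1 = [0, 193, 200, 199, 198, 84, 0, 145, 139]
upstream_s1 = [0, 0, 78, 148, 147, 170, 178, 182, 163]
downstream_s1 = [193, 0, 85, 143, 153, 192, 193, 203, 187]

r_r1 = pearson(upstream_r1, downstream_r1)
r_s1 = pearson(upstream_s1, downstream_s1)
print(round(r_r1, 2), round(r_s1, 2))
```

Rounded to two decimals, the two coefficients agree with the first two entries of the last row of Table 1 (0.01 for R1 and 0.63 for S1).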
Table 2
Local minima in the phylogenetic profiles of Figure 7 (possible recombination sites)
| Strain | gene position  | variable position |
| m318   | 735-774 * +    | 152-165           |
| m298   | 687-762 * +    | 138-152           |
|        | 1017-1066      | 224-238           |
| 130    | 705-738 *      | 141-144           |
|        | 414-477 (*) +  | 73-83             |
| m38    | 705-738 *      | 141-144           |
|        | 414-477 (*) +  | 73-83             |
| m322   | 889-894 +      | 172-175           |
| m321   | 930-978 +      | 187-205           |
| w2Di   | 494-546        | 87-103            |
|        | 735-774 *      | 153-165           |
| west   | 705-744 *      | 141-145           |
|        | 930-1018       | 185-225           |
| m317   | 990-1067       | 210-230           |
|        | 648-738 *      | 131-144           |
| w3038  | 507-741 *      | 90-145            |
|        | 989-1024       | 210-228           |
| lind   | 555-741 *      | 105-145           |
|        | 930-989        | 185-210           |
| m316   | 1018-1074      | 225-243           |
* overlapping with
or close to the chi-like sequence at pos. 744-751 (145-150)
(*) overlapping
with or close to a chi-like sequence at pos. 424-431 (74-76)
+ independently
identified by Thampapillai, Lan and Reeves (1994)
Figures
Figure 1
Neighbor joining trees of the demonstration
dataset
Eight artificial sequences (S1-S8), each 1000 nucleotides long,
were generated using EVOL-TREE
(Schoeniger and Haeseler, 1995) with 10% nucleotide substitution
between generations and 25% of each type of nucleotide.
The recombinant sequence R1 was constructed by combining the
500 bp 5’-fragment of sequence S1 with the 500 bp 3’-fragment
of sequence S6, and the recombinant sequence R2 was constructed
by replacing the region 333-666 of S1 with the corresponding
fragment of S6.
The dendrograms were
produced using the neighbor joining program of the Phylip package
(Felsenstein, 1991). Part a) of the figure shows the relationships
of simulated sequences S1 to S8. Part b) also includes the recombinant
sequences R1 and R2.
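The construction of R1 and R2 described in this caption can be sketched in a few lines of Python. The helper `splice` and the random stand-in sequences are hypothetical: the stand-ins are not the EVOL-TREE sequences used in the paper, and treating the region 333-666 as 1-based and inclusive is an assumption.

```python
import random

def splice(recipient, donor, start, end):
    """Replace 1-based inclusive positions start..end of recipient
    with the corresponding positions of donor (aligned, equal length)."""
    if len(recipient) != len(donor):
        raise ValueError("sequences must be aligned and of equal length")
    return recipient[:start - 1] + donor[start - 1:end] + recipient[end:]

# Random stand-ins for S1 and S6 (assumption: NOT the paper's sequences).
rng = random.Random(1)
s1 = "".join(rng.choice("ACGT") for _ in range(1000))
s6 = "".join(rng.choice("ACGT") for _ in range(1000))

r1 = splice(s1, s6, 501, 1000)  # 5' half of S1 joined to 3' half of S6
r2 = splice(s1, s6, 333, 666)   # S1 with region 333-666 taken from S6
```

Both recombinants keep the alignment length of 1000, so they can be added to the dataset without realignment.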
Figure 2
Phylogenetic profiles of sequences
S1-8 and R1-2
Series 1 (left column)
contains only one recombinant sequence, R1 (bold line), with a
single crossover at site 349. Series 2 (right column) additionally
contains R2 (bold line) with a double crossover at sites 237
and 465. The parental sequences S1 and S6 of both recombinants
are omitted from rows b-d. Rows a and b use variable
window widths, while rows c and d use fixed-size windows
with widths of 70 and 35 positions, respectively (see text).
Figure 3
Phylogenetic profiles of chimaeric
sequences created from the demonstration dataset
All profiles use Hamming
distances and fixed windows of 50 bp width. All sequences in
the profiles are recombinant. Plot a contains 16 sequences each
with a crossover at position 349 (Series A). Plot b contains
16 sequences each with a double crossover at positions 237 and
465 (Series B). Plot c combines all 32 sequences used in a and
b (see text).
Figure 4
Simulation of a complex phylogeny
A 1000 bp long random
progenitor sequence (s) containing 25% each of A, C, T, and G was assembled.
Descendant sequences were created by introducing 20 random nucleotide
changes into their respective progenitors. The recombinant sequence
x was built by combining the first 500 nucleotides of sequence sB
with the last 500 nucleotides of sequence sA. The remaining recombinants
y and z were created similarly by crossing sequences sAB with
sBA at position 300 and sABB with sBAA at position 700. The
recombinants were evolved further yielding a total of 30 sequences.
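One step of this simulation can be sketched as follows. This is an illustration under stated assumptions, not the procedure actually used for the figure: the function names are hypothetical, the progenitor is drawn with equal base probabilities rather than exactly 25% of each nucleotide, and the 20 changes are placed at distinct sites and always alter the base, so each descendant differs from its progenitor at exactly 20 positions.

```python
import random

NUCLEOTIDES = "ACGT"

def random_sequence(length, rng):
    """Random sequence with equiprobable nucleotides (assumption)."""
    return "".join(rng.choice(NUCLEOTIDES) for _ in range(length))

def mutate(seq, n_changes, rng):
    """Return a descendant with n_changes substitutions at distinct sites."""
    bases = list(seq)
    for pos in rng.sample(range(len(bases)), n_changes):
        # Always pick a base different from the current one.
        bases[pos] = rng.choice([b for b in NUCLEOTIDES if b != bases[pos]])
    return "".join(bases)

def hamming(a, b):
    """Number of differing sites between two aligned sequences."""
    return sum(x != y for x, y in zip(a, b))

rng = random.Random(42)
s = random_sequence(1000, rng)   # progenitor
sA = mutate(s, 20, rng)          # two descendants of s
sB = mutate(s, 20, rng)
x = sB[:500] + sA[500:]          # recombinant x: first 500 of sB, last 500 of sA
```

Applying `mutate` repeatedly to `sA`, `sB` and `x` would extend the lineages in the same way.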
Figure 5
Phylogenetic profile of a simulated
complex phylogeny
The source of the
30 sequences is described in Figure 4.
The profile uses only the 304 parsimonious sites.
For convenience, the graph is mapped to display all positions;
the x-axis spans positions 100 to 900.
The y-axis gives the phylogenetic correlation from +1 to -0.5.
The boxes labelled x, y, and z mark the sequences derived from
the recombinants x, y and z, respectively (see text and Figure
4).
Figure 6
gnd gene sequence of 34
Salmonella enterica strains
Only the variable
sites and their locations within the gene are given. The black
bar indicates a chi-like region.
Figure 7
Phylogenetic profiles of the gnd
locus in Salmonella enterica strains
All graphs used a
fixed window size of 60 bp and only the variable sites, as given
in Figure 6. The proportion of
different nucleotides was used to determine the relationships
of the sequences and the linear correlation coefficient was
calculated to determine phylogenetic correlations. The top graph
shows the profiles of all 34 sequences as given in Figure
6. All but one of the sequences of the strains that gave
strong recombination signals (m318, m298, m130, m38, m322, m321,
w2DI, west, m317, w3038, lind, m316, m325, m287) were removed
from the remaining plots. The profile of the sequence with the
strongest recombination signal is plotted in bold, and the name
of the corresponding strain is given (see text).
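The procedure described in this caption (fixed windows, proportion of differing nucleotides, linear correlation) can be sketched in Python as below. This is a minimal sketch, not the author's program. The function names, the exact window boundaries (upstream window ending at the profiled position, downstream window starting just after it, as in Table 1), the inclusion of the self-distance, and the handling of constant distance vectors are all assumptions. The toy alignment is invented for illustration.

```python
from math import sqrt

def p_distance(a, b):
    """Proportion of differing sites between two aligned fragments."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def correlation(x, y):
    """Pearson correlation; None when either vector is (numerically) constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx < 1e-12 or syy < 1e-12:
        return None
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sqrt(sxx * syy)

def phylogenetic_profile(seqs, index, window):
    """Correlate the upstream and downstream distance vectors of seqs[index]
    at every position where both fixed-size windows fit in the alignment."""
    target = seqs[index]
    profile = []
    for pos in range(window, len(target) - window + 1):
        up = [p_distance(target[pos - window:pos], s[pos - window:pos])
              for s in seqs]
        down = [p_distance(target[pos:pos + window], s[pos:pos + window])
                for s in seqs]
        profile.append((pos, correlation(up, down)))
    return profile

# Toy alignment (invented): the last sequence joins the first half of
# sequence 0 to the second half of sequence 3.
toy = ["AAAAAAAA", "AAATAAAT", "AATTAATT", "TTTTTTTT", "AAAATTTT"]
parent_profile = phylogenetic_profile(toy, 0, window=4)
recombinant_profile = phylogenetic_profile(toy, 4, window=4)
```

In this toy alignment the only profiled position is 4, the junction of the chimaeric sequence. There the correlation is negative for the chimaera and positive for sequence 0, which is the kind of local drop that the profiles in this figure display.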
Figure 8
Phylogenetic profiles of sequences
with and without chi-like motif
The canonical chi-like
motif (5’ 744-CCTGGTGG-751 3’) is marked (variable sites 145-150)
in both graphs. The graph in a) includes all sequences (s71,
m229, m311, sofia, m261, m287, m319, m35, m320, m313, m314,
m321, m322, m326) that have different variations of this chi-like
sequence, while all sequences in b) (lt2, s41, m46, m36, m73,
m13, m55, m298, m295, west, m317, w3038, m318, m324, m325) have
the canonical chi-like sequence. The thick line gives the averages
of the phylogenetic correlations of all sequences that are included
in each graph.