
Molecular Biology and Evolution 15(3): 326-335. 1998.

Phylogenetic Profiles: A graphical method for detecting genetic recombinations in homologous sequences

Georg F. Weiller

Bioinformatics Laboratory, Research School of Biological Sciences, Australian National University

Canberra, ACT 0200, AUSTRALIA, ph: +61 (2) 6249-5916, fx: +61 (2) 6249-4437

Keywords: genetic recombination, computer algorithm, Salmonella enterica, molecular phylogeny, evolution

 

Abstract

Phylogenetic profiles constitute a novel way of displaying graphically the coherence of the sequence relationships over the entire length of a set of aligned homologous sequences. Using a sliding window technique, the method determines the pairwise distances of all sequences in the windows and evaluates, for each sequence, the degree to which the patterns of distances in these regions agree. The method is suited for exploring data consistency as well as detecting recombinant sequences. A computer program implementing the algorithm has been developed and examples with simulated and natural sequences are given to demonstrate the sensitivity and accuracy of the method for identifying recombinant sequences and their recombination junctions as well as detecting hot-spots of recombinational activity.

 

Introduction

Gene conversion and other recombinational events, such as transposition, transduction and intron homing, are important processes that influence biological evolution. They also complicate the work of the molecular phylogeneticist, as genomes rearranged in such ways become a mosaic of regions with different phylogenetic histories. In viruses, horizontal gene transfer over a wide range of phylogenetic distances has been a major evolutionary force (Gorbalenya 1995, Gibbs and Keese 1995, Nuttall et al. 1995), and similar processes have been suggested to occur in procaryotes (Whatmore and Kehoe 1994, Bik et al. 1995), eucaryotes (Assali, Mache and de Goer 1990) and even between kingdoms (Doolittle et al. 1990). Hence, phylogenetic relationships deduced from gene sequences represent only the evolutionary history of those genes unless recombination can be excluded. However, genetic recombination is not limited to exchanges involving whole genes, and there is strong evidence for intragenic recombination in viruses (Hahn et al. 1988, Sandmeier 1994), bacteria (DuBose, Dykhuizen and Hartl 1988, Reeves 1993, Li et al. 1994, Thampapillai, Lan and Reeves 1994) and eucaryotes (Stephens 1985, Weiller, Schueller and Schweyen 1989, Paquin, Laforest and Lang 1994). The recognition of extra- and intragenic recombination is not only important for unravelling the phylogenetic history of genes; it is also crucial for molecular phylogenetic inference, as trees derived from different genes or gene regions may differ in topology, and taxa with mosaic sequences will be placed incorrectly. The analytical challenges posed by horizontal gene transfer are reviewed by Syvanen (1994).

Various methods for detecting gene conversion and recombination in homologous DNA sequences have been described. In Stephens’ (1985) method, a set of aligned sequences is split into two subsets at every variable position, and the distribution of all variable sites that support particular splits is examined. Significant deviations from a uniform distribution are taken as an indication that some of the sequences are recombinants. However, in samples of more than a few sequences the appropriate splits are hard to find, and sites with more than two alternative nucleotides present a problem for the method. The statistical difficulties created by regions with variable mutation rates are, on the other hand, lessened by simply excluding invariant sites.

Sawyer’s (1989) method reduces the problem of variable mutation rates by focussing on silent polymorphic sites. His method also overcomes the partitioning problem of Stephens’ method by analysing the distribution of maximal length segments common to some pairs of sequences. A Monte Carlo test involving permutation of sites is used to estimate the significance of the distribution. Weaknesses of this method include the drastically reduced amount of usable data when only silent polymorphic sites are used, and doubts about the validity of the significance test.

In Fitch and Goodman’s (1991) ‘Phylogenetic-Scanning’ method, sets of phylogenetic trees are constructed at different intervals in the sequence alignment. The support for some of these trees is then evaluated at all intervals using the parsimony principle and presented graphically. The main computational complication of this method arises with the very large number of possible trees when more than a few sequences are analysed. Only a tiny subset of all possible trees can be analysed, and it is not always clear which trees to choose. While the graphs can be very informative, the requirement that each tree is represented as a column in the graph further reduces the number of alternatives that can be tested.

Hein (1993) has developed a method that employs a heuristic extension of the parsimony principle to infer phylogenies from recombinant sequences. The method assumes that a correct tree can be found for some sequence regions and tries to reconstruct the recombinational steps required to explain the tree topology found in other regions. Similar to Fitch and Goodman’s method, this method is only applicable for a comparatively small number of sequences and cannot detect recombinations that do not change the topology of a tree.

Recently, methods have been developed specifically for the analysis of HIV sequences (Robertson et al. 1995, Salimen et al. 1995) whereby sliding windows are used to compare the relationships of aligned sequences with previously determined HIV prototype sequences. The success of these methods depends largely on the availability of suitable prototype sequences as only recombinations that switch the prototype can be detected.

Two newly developed and related computer methods make use of compatibility matrices (Jakobsen & Easteal 1996) and partition matrices (Jakobsen, Wilson and Easteal 1997) to graphically display the consistency of the phylogenetic signal in all columns of a multiple sequence alignment. These methods make fewer assumptions and do not require the prior knowledge (or even existence) of a single phylogeny. They are therefore especially helpful for exploratory analysis of a limited number of sequences.

The ‘phylogenetic profile’ method, described below, is a new computer graphic method that overcomes some of the limitations of the currently available methods. Similar to other methods mentioned above, it is based on the principle that phylogenetic relationships derived from different regions of a multiple sequence alignment will be similar when no recombination has occurred. Thus the method attempts to establish consistency in sequence relationships between different parts of the alignment. Rather than tree topologies or compatibility matrices, the method uses distance data to describe the relationships and thus avoids many of the difficulties posed by constructing and comparing tree topologies. The distance approach makes it possible to detect recombinations that do not change the tree topology, is very fast to compute and thus allows analysis of a large dataset with more than a thousand sequences. The estimate that a recombinational event has occurred is then plotted for every position of every sequence and the entire information is displayed in a single diagram.

The Phylogenetic Profile Algorithm

The method introduces the ‘phylogenetic correlation’ measure that quantifies the coherence of the sequence interrelationships in two different regions of a multiple alignment. The phylogenetic correlation of any given position is determined by evaluating regions immediately upstream and downstream of this position. Positions in which sequence relationships in the upstream region clearly differ from their downstream counterparts exhibit low phylogenetic correlations and are likely recombination sites.

To determine the phylogenetic correlation of a given test sequence at a given test location the method defines two sequence windows, located immediately before (upstream) and after (downstream) the test location and determines the differences between the test sequence and all other sequences in the windows, resulting in two vectors of distance data. If the test sequence relates to the other sequences similarly in both windows then the two distance vectors will exhibit the same trend and correlate well. Conversely, if the test sequence has recombined so that the sequence fragments in both windows have different phylogenetic histories then the two sets of sequence relationships would correlate poorly. Accordingly, the phylogenetic correlation has been defined as the correlation coefficient of the two distance vectors.
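The definition above can be illustrated with a minimal Python sketch. It assumes a simple difference count as the distance and the linear correlation coefficient as the correlation measure; the function names, the 0-based indexing and the five toy sequences are illustrative assumptions and are not taken from PhylPro.

```python
from math import sqrt

def hamming(a, b):
    """Count the positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def pearson(u, v):
    """Linear (Pearson) correlation coefficient of two equal-length vectors."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    sd_u = sqrt(sum((a - mean_u) ** 2 for a in u))
    sd_v = sqrt(sum((b - mean_v) ** 2 for b in v))
    if sd_u == 0 or sd_v == 0:
        return float("nan")  # undefined when one vector is constant
    return cov / (sd_u * sd_v)

def phylogenetic_correlation(alignment, test, pos, width):
    """Correlate the distances from sequence `test` to all other sequences
    in the upstream window [pos - width, pos) and the downstream window
    [pos, pos + width)."""
    up, down = slice(pos - width, pos), slice(pos, pos + width)
    others = [i for i in range(len(alignment)) if i != test]
    d_up = [hamming(alignment[test][up], alignment[i][up]) for i in others]
    d_down = [hamming(alignment[test][down], alignment[i][down]) for i in others]
    return pearson(d_up, d_down)

# Toy alignment (invented for illustration): two clades and one recombinant
# that joins the first half of sequence 0 to the second half of sequence 2.
toy = ["AAAAAAAA", "AAACAAAC", "CCCCCCCC", "CCCACCCA", "AAAACCCC"]
r_recombinant = phylogenetic_correlation(toy, test=4, pos=4, width=4)
r_parent = phylogenetic_correlation(toy, test=0, pos=4, width=4)
```

In this toy case the recombinant's two distance vectors are exactly anti-correlated, whereas the parental sequence keeps a positive correlation that is lowered only by its distance to the recombinant.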

Table 1 demonstrates the computation of the phylogenetic correlation for nine of the sequences described in the legend to Figure 1. As the recombinant sequence R1 matches sequence S1 upstream and sequence S6 downstream of the recombination site at position 500, the R1 distance vectors are identical to the S1 and S6 distance vectors in the respective regions. The phylogenetic correlations of all nine sequences at the R1 recombination site are given in the bottom row of Table 1. Note the poor phylogenetic correlation of sequence R1. The phylogenetic correlations of sequences S1 and S6 are slightly lower than the values for the other sequences but clearly higher than the R1 value. These values reflect the degree to which the relationships of the test sequence differ in the two windows. While the upstream and downstream distance vectors of the recombinant R1 differ greatly, the vector pairs of the other sequences (S1-8) are closely related, varying mainly in their R1 component; this variation is especially pronounced for S1 and S6, which were used to construct R1.

For each individual sequence in the alignment, the phylogenetic correlations are computed at every position using sliding window techniques. If a recombination site is not located exactly between the two windows but inside one of them, then a part of the test sequence in the two windows will still have similar relationships, resulting in an intermediate phylogenetic correlation. Consequently, when the window moves over a recombination junction, the phylogenetic correlation decreases; it is smallest when the recombination junction lies exactly at the junction of the two windows. The plot of all phylogenetic correlations of a sequence against the sequence positions is termed a ‘phylogenetic profile’, and the profiles of all individual sequences are typically superimposed in a single diagram. By examining and comparing the phylogenetic profiles for all sequences, the recombinant sequences and the location of recombination junctions are easily detected.
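In the paper the minima are read off the superimposed plot by eye. As a rough programmatic counterpart, the sketch below reports, for each profile, the position of its lowest value if that value falls below a cut-off. The function name, the cut-off and the toy profile values are assumptions made purely for illustration.

```python
import math

def profile_minima(profiles, threshold):
    """Return {name: (position, value)} for every profile whose lowest
    defined value lies below `threshold` (a hypothetical cut-off)."""
    hits = {}
    for name, values in profiles.items():
        # Ignore positions where the correlation is undefined (NaN).
        defined = [(v, pos) for pos, v in enumerate(values) if not math.isnan(v)]
        if not defined:
            continue
        value, pos = min(defined)  # lowest value; ties go to the leftmost position
        if value < threshold:
            hits[name] = (pos, value)
    return hits

# Invented profile values: one flat profile and one with a clear dip.
profiles = {
    "S1": [0.95, 0.90, 0.85, 0.90, 0.96],
    "R1": [0.60, 0.20, -0.40, 0.10, 0.55],
}
hits = profile_minima(profiles, threshold=0.5)
```

Only the profile with the dip is reported, together with the position of its minimum, which is the candidate recombination junction.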

Phylogenetic profiles can exploit a variety of different measures for estimating the sequence distances as well as for determining the phylogenetic correlation of distance vectors. In addition, two different sliding window techniques can be used. These parameters are briefly discussed below.

Distance estimates

Although a simple count of different nucleotides or amino acids (Hamming distance) gives an adequate measure of distance, the fraction of differences (p-distance) is preferable if alignment gaps impede a significant number of pairwise comparisons. The phylogenetic profile principle is however amenable to any distance metric that provides sensible distance values including multiple hit corrections and various nucleotide or amino acid scoring matrices (PAMs etc.). For a more detailed treatment of distance values see Weiller, McClure and Gibbs (1995).
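The difference between the two simple measures can be shown with a short sketch. It assumes that alignment gaps are written as "-" and that any site with a gap in either sequence is skipped; both conventions, and the function name, are illustrative assumptions.

```python
def hamming_and_p_distance(a, b, gap="-"):
    """Return (number of differences, fraction of differences) over the
    sites at which neither sequence has a gap."""
    compared = differences = 0
    for x, y in zip(a, b):
        if x == gap or y == gap:
            continue  # gapped sites cannot be compared
        compared += 1
        differences += x != y
    p = differences / compared if compared else float("nan")
    return differences, p

count_gapped, p_gapped = hamming_and_p_distance("ACGTACGT", "ACGA--GA")
count_full, p_full = hamming_and_p_distance("ACGTACGT", "TCGAACGT")
```

Both pairs show two differences, but the gapped pair has only six comparable sites, so its p-distance (2/6) is larger than that of the ungapped pair (2/8).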

Inter-correlation measures

A number of standard coefficients can be used to determine the phylogenetic correlation of distance vectors. The Bray-Curtis distance, Canberra metric, chi-squared distance, average Manhattan distance, and the linear correlation coefficient as well as nonparametric correlations like the Spearman Rank-Order Correlation were explored in a variety of simulated and real sequence sets. As all these measures gave similar results, the linear correlation coefficient (Pearson coefficient) was used for all data presented here. Note that multiplication of a distance vector with a constant will not change the correlation value. It is therefore not necessary to normalise the sequence differences by the window width even when the widths of the upstream and downstream window differ. For a more detailed treatment of correlation measures see Rohlf (1993) and William et al. (1992).
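The scale invariance mentioned above follows directly from the definition of the linear correlation coefficient. As a short derivation, for two distance vectors x and y and a positive constant c (symbols chosen here for illustration; dividing by a window width w corresponds to c = 1/w):

```latex
r(c\,x,\,y)
  = \frac{\sum_i (c x_i - c\bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_i (c x_i - c\bar{x})^2}\;\sqrt{\sum_i (y_i - \bar{y})^2}}
  = \frac{c \sum_i (x_i - \bar{x})(y_i - \bar{y})}
         {|c|\,\sqrt{\sum_i (x_i - \bar{x})^2}\;\sqrt{\sum_i (y_i - \bar{y})^2}}
  = r(x,\,y) \qquad (c > 0)
```

The same argument applies to a constant multiplying y, so differing upstream and downstream window widths leave the coefficient unchanged.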

Sequence windows

In general, the recombination signal will be strongest, i.e. the phylogenetic correlation will be minimal, when the window used for determining one distance vector contains only sequence from one ancestor, while the other window contains only sequence from a different and phylogenetically discordant ancestor. Multiple recombination sites within one window will probably decrease the resolution of the method. Hence, sequences with many recombination sites are best analysed using appropriately narrow windows. Wider windows, by contrast, will be less discriminatory but will provide more sites for estimating the interrelationships of the sequences, and therefore enhance the signal-to-noise ratio in the resulting plot.

Two different techniques are used to control ‘movement’ of the sequence windows, and these are optimal for different types of sequence data. The first method uses the entire sequence in two variable sized windows with the left edge of the upstream window fixed at the beginning of the aligned sequences, the right edge of the downstream window fixed at the sequence end and a sliding split between them. Thus for the first comparison, the upstream window covers only the first site of the alignment, while the downstream window covers all remaining sites. In successive steps the upstream window grows by a single site, while the downstream window decreases by one site until the downstream window covers the last site only and the upstream window covers the remaining sites.
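The variable-width scheme can be sketched as a generator of window pairs. The function name and the use of 0-based, half-open Python slices are assumptions made for illustration.

```python
def variable_width_windows(n_sites):
    """Yield (upstream, downstream) slices for every split of an alignment
    with `n_sites` columns. The upstream window always starts at the first
    site and the downstream window always ends at the last site."""
    for split in range(1, n_sites):
        yield slice(0, split), slice(split, n_sites)

pairs = list(variable_width_windows(5))
```

For five sites this gives four splits: the first pairs a one-site upstream window with a four-site downstream window, and the last pairs a four-site upstream window with a one-site downstream window.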

The second method uses two windows with identical and fixed widths and consequently cannot analyse sites that lie less than one window width from either end of the sequence; the width of the two windows is specified at the beginning of the scanning process. The appropriate minimal width depends on the variability of the sequences analysed, as the method requires sufficient variable sites inside each window to determine the sequence relationships reliably. To allow for datasets that have an uneven distribution of variable positions, invariant sites are removed from the sequences before analysis, as these differentially dilute the recombination signal.
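A minimal sketch of these two steps, removing invariant columns and then sliding a pair of fixed-width windows, is given below. The function names are illustrative, indexing is 0-based, and a gap character is treated like any other state, which is an assumption and not something the text specifies.

```python
def condense(alignment):
    """Drop invariant columns. Returns the condensed sequences and the
    original indices of the columns that were kept."""
    n_sites = len(alignment[0])
    keep = [j for j in range(n_sites) if len({s[j] for s in alignment}) > 1]
    return ["".join(s[j] for j in keep) for s in alignment], keep

def fixed_width_windows(n_sites, width):
    """Yield (upstream, downstream) slices of equal width around every
    split point that leaves room for a full window on both sides."""
    for split in range(width, n_sites - width + 1):
        yield slice(split - width, split), slice(split, split + width)

condensed, kept = condense(["ACGTA", "ACCTA", "AGGTT"])
windows = list(fixed_width_windows(6, 2))
```

Returning the kept column indices makes it possible to translate a junction found in the condensed alignment back to a position in the full alignment.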

Explorations with simulated data

To demonstrate the properties of phylogenetic profiles, and to explore their dependency on parameters and sequence inclusion sets, several phylogenetic profiles have been produced using simulated sequences as well as recombinants of these constructed in-silico.

Single recombinant sequences

Eight related sequences (S1-8) and two recombinant sequences, one with a single crossover (R1) and one with a central insertion (R2) were constructed. Dendrograms showing the relationships of these data are given in Figure 1. The 10 aligned sequences were then condensed to the 733 variable positions only. In the condensed sequences, the recombination junctions corresponded to the positions 349 (R1) and 237/465 (R2) respectively. Several phylogenetic profiles were derived from this dataset (Figure 2). A simple count of different nucleotides was used to estimate sequence distances, and the linear correlation coefficient of the distance vectors was calculated to determine their phylogenetic correlation.
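The two kinds of in-silico recombinant, and the translation of a junction into condensed coordinates, can be sketched as follows. The function names, the 0-based coordinates and the toy data are assumptions for illustration; the actual sequences are those described in the legend to Figure 1.

```python
from bisect import bisect_left

def single_crossover(left_parent, right_parent, junction):
    """Recombinant taking sites before `junction` from one parent and the
    remaining sites from the other (the R1 type)."""
    return left_parent[:junction] + right_parent[junction:]

def central_insertion(flank_parent, insert_parent, start, end):
    """Recombinant with sites [start, end) from `insert_parent` embedded in
    `flank_parent` (the R2 type)."""
    return flank_parent[:start] + insert_parent[start:end] + flank_parent[end:]

def condensed_position(variable_sites, junction):
    """Index of `junction` in the condensed alignment, i.e. the number of
    variable sites that precede it. `variable_sites` must be sorted."""
    return bisect_left(variable_sites, junction)

p1, p2 = "A" * 12, "C" * 12          # toy parents
r1 = single_crossover(p1, p2, 6)
r2 = central_insertion(p1, p2, 4, 8)
pos = condensed_position([1, 3, 4, 7, 9], 5)
```

The same counting explains why junctions at positions 500, 333 and 666 of the full alignment appear at positions 349, 237 and 465 once only the 733 variable sites are retained.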

The profiles on the top row (a) of Figure 2 include the recombinant sequences R1 (left column) and R1/R2 (right column) together with sequences S1-S8. The progenitor sequences S1 and S6 of the recombinant sequences R1 and R2 were omitted from plots b-d. Nevertheless, the recombinant sequences can still be identified in these plots, even in series 2, where the phylogenetic background signal is fairly weak as 2 of the 8 sequences are recombinant.

Series a and b use the variable width window technique, which utilises the entire length of the alignment for analysis, but it can be seen that the estimate of phylogenetic correlations becomes ‘noisy’ at either end, as one or other of the windows becomes too narrow. Note that the phylogenetic correlation for sequence R1 (bold in a1 and b1) is small over the entire sequence, as the R1 recombination junction is always included in one of the sequence windows. The value is smallest at the R1 junction, as at this point, each window contains sequences exclusively from different parents. Note that data shown in Table 1 are taken from the R1 junction in a1. The profile of recombinant R2 (bold in a2 and b2) is decreased less, as the R2 sequence contains some sites donated by S1 in both windows, irrespective of the position of the window split. The recombination junctions are nevertheless clearly visible, as the profile has its minima at the two junction sites (237 and 465), where one window contains sequence exclusively derived from S1, while the other window contains sequence from both parents (S1 and S6). Note also the large phylogenetic correlation of R2 around site 350, where the R2 sequences in both windows contain a similar mixture of sites of both parental sequences. This large value indicates that R2 has an insertion and would not have been observed if the 5’ and 3’ sequences of R2 had come from different parents.

The recombination junction of R1 cannot be determined precisely from b2 alone, as the second recombinant R2 distorts the phylogenetic correlation of R1. This distortion is exceptionally pronounced because R2 was constructed from the same donor sequences as R1, and these were excluded from the analysis represented by b2.

Series c and d use fixed size windows of size 70 and 35 respectively. Note that the smaller fixed size windows in c2, because they contain fewer contradictory sites, are better suited to pinpoint the three crossover sites than the maximal size windows in b2. The series d graphs demonstrate that when the window width is too small, the noise generated by sampling errors obscures the recombination signals.

Recombination hot-spots and multiple recombinants

The phylogenetic profile method determines the phylogenetic correlation of a particular sequence in different regions by comparing it with other reference sequences. However, some of the reference sequences may themselves be recombinants, indeed, when the analysis includes recombination hot-spots, all or most sequences might be recombinant.

To demonstrate the properties of phylogenetic profiles in these situations, the artificial dataset was modified, generating two series (A and B) of reciprocal recombinant sequences by combining every sequence s_i with the sequence s_(i+2). This resulted in the recombinant sequences of type S1:3, S2:4, S3:5, S4:6, S5:7, S6:8, S7:1 and S8:2. Series A contained the 16 reciprocally recombined sequences that resembled R1 in the example above, with a single site of recombination again at site 500 (349 in the condensed variable sites sequences). Series B was constructed in a manner analogous to that used to construct R2, by exchanging the centre and flanking regions of s_i with s_(i+2). This also yielded 16 reciprocally recombinant sequences, with recombination junctions occurring at sites 333 and 666 (237 and 465 in the condensed variable sites sequences). Note that the s_i:s_(i+2) scheme for producing recombinants resulted in the sequences combining regions of different similarity. While the sequences S1:3, S2:4, S5:7 and S6:8 are relatively closely related, sequences S3:5, S4:6, S7:1 and S8:2 are more distantly related, and so stronger recombination signals can be expected from combinations of the latter. Figure 3 gives the phylogenetic profiles of the two series of recombinants, as well as a combination of both. Note that the sites of recombination can be clearly seen in all three graphs. Strong and weak signals cannot be distinguished in graphs a) and b) of Figure 3, because here all sequences have their recombination junction at the same site and the algorithm has no means to determine which sequence is closest to the unrecombined ‘wild-type’ sequence. However, when the sequences of both series, A and B, are included, the strength of the recombination signal is revealed (graph c in Figure 3).
This is because the recombination sites of the two series differ, so that at every site in the dataset some partial sequences are in the parental (non-recombined) configuration, allowing the algorithm to distinguish between strong and weak recombination signals. Consequently, it can be seen that the phylogenetic correlation of some sequences is particularly small (< 0 in Figure 3c), and these come from the recombinants formed from the most distantly related sequences S3:5, S4:6, S7:1 and S8:2. However, the dataset still does not contain sufficient information to distinguish clearly the recombinants of closely related sequences from parental sequences. In general, if a large proportion of the sequences are recombinants, the phylogenetic profile method may not be able to determine which of the sequences are the parents; the identification of recombinational hot-spots, however, is not impaired.

Complex Phylogenies

Multiple recombinations during the phylogenetic development of sequences can lead to very complex phylogenetic relationships, resulting in the translocation of entire subtrees. In addition, continuing evolutionary changes overwrite the initial signal in the sequences. Hudson and Kaplan (1985) have demonstrated that recombination events may leave no evidence in the extant sequences. The simulation given in Figure 4 was chosen to demonstrate the behaviour of phylogenetic profiles in this more realistic situation. A randomly generated sequence containing 25% of each nucleotide was evolved over four generations with 2% nucleotide changes per generation. At every generation but the last, one recombinant sequence was created (x, y and z in Figure 4) and added to the population. A phylogenetic profile of the resulting 30 sequences is given in Figure 5. Only the parsimony-informative sites are used for distance calculation. Note that the four descendants of the recombinant sequence y and the two descendants of the recombinant sequence z are clearly identified. The identification of the eight descendants of the recombinant x is more difficult, albeit still possible, and could be considered as close to the limit of the resolution of this particular profile. The sequence window was chosen deliberately wide (60 parsimony-informative sites) in order to collect sufficient signal. When smaller windows (10-40 parsimony-informative sites) were used, only the recombinants derived from sequences y and z could be clearly identified (data not shown). This was to be expected, as the recombinant x combines two sequences (sB and sA) with only 4% sequence differences. Evolution of sequence x for 3 more generations added a further 12% changes to the sequences, which obscure the recombination signal. Note that although 12 of the 30 sequences are recombinant, the dataset still provides sufficient background signal to distinguish all recombinant sequences.
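The restriction to parsimony-informative sites can be sketched as a simple column filter. The sketch assumes the usual criterion, namely that a site is informative when at least two character states each occur in at least two sequences; the text does not spell out its criterion, and the function name is illustrative.

```python
from collections import Counter

def parsimony_informative_sites(alignment):
    """Return the 0-based indices of columns in which at least two states
    are each present in at least two sequences."""
    sites = []
    for j in range(len(alignment[0])):
        counts = Counter(seq[j] for seq in alignment)
        shared_states = sum(1 for c in counts.values() if c >= 2)
        if shared_states >= 2:
            sites.append(j)
    return sites

informative = parsimony_informative_sites(["AAGA", "AAGC", "ACTA", "ACTG"])
```

In the toy alignment the first column is invariant and the last column has only one state shared by two sequences, so only the two middle columns are kept.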
When a large proportion of the dataset is represented by recombinant sequences the exclusion of sequences with regions of low phylogenetic correlation can help to improve the detection of sequences with modest recombination signals. This is shown with the following example using a real dataset.

Example with real data

In order to test the phylogenetic profile method with real gene sequences, a set of 34 gnd gene sequences of Salmonella enterica was scanned for recombinations. These sequences and the sites where recombinations have probably occurred were previously reported by Thampapillai, Lan and Reeves (1994). The sequences are 1329 bp long and correspond to position 16 to 1344 of the gnd coding region. Only the variable sites were used for the phylogenetic profiles and these are given in Figure 6.

Individual sequences

As many of these sequences are recombinants, it could be expected that strong recombination signals would obscure the weaker ones. To avoid this, the analysis was repeated many times, and after each analysis the sequence that gave the smallest phylogenetic correlation (i.e. the strongest recombination signal) was removed. Once the 14 sequences with the smallest phylogenetic correlation were removed, they were reintroduced into the dataset individually. Some of the resulting phylogenetic profiles are given in Figure 7. The top graph in Figure 7 shows a phylogenetic profile of all 34 sequences, whereas the remaining profiles are each of 21 sequences, with the profile of the reintroduced sequence highlighted. It can easily be seen that the reintroduced sequences have one or more distinct minima in their phylogenetic profiles, suggesting that recombinations might have occurred at or close to these minima. The possible recombination junctions, as deduced from the phylogenetic profiles in Figure 7, are summarised in Table 2. Changing the various scanning parameters resulted in very similar plots (data not shown), varying only slightly the relative strength of the individual recombination signals and not changing the conclusions that can be drawn from the analysis.
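The remove-and-reintroduce procedure can be outlined as below. The scoring step is passed in as a function, here called `min_correlation`, that is assumed to recompute the profiles for the remaining sequences and return the lowest phylogenetic correlation of each; this callable, the other names and the toy scores are hypothetical.

```python
def peel_recombinants(names, min_correlation, n_remove):
    """Repeatedly rescan the remaining sequences and remove the one whose
    profile reaches the lowest value. Returns (remaining, removed)."""
    remaining, removed = list(names), []
    for _ in range(n_remove):
        scores = min_correlation(remaining)  # recomputed after every removal
        worst = min(remaining, key=lambda name: scores[name])
        remaining.remove(worst)
        removed.append(worst)
    return remaining, removed

def reintroduction_sets(remaining, removed):
    """One dataset per removed sequence: the background plus that sequence."""
    return {name: remaining + [name] for name in removed}

# Toy scoring function with fixed, invented values.
toy_scores = {"a": 0.9, "b": -0.2, "c": 0.4, "d": 0.8}
score = lambda rem: {name: toy_scores[name] for name in rem}
background, peeled = peel_recombinants(["a", "b", "c", "d"], score, n_remove=2)
datasets = reintroduction_sets(background, peeled)
```

With 34 sequences and 14 removals this leaves a background of 20 sequences, so each reintroduction dataset holds 21 sequences, matching the profiles shown in Figure 7.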

As the purpose of these tests was to examine the capability of the phylogenetic profile method rather than give a comprehensive analysis of the evolution of the gnd locus, further interpretation of these profiles is left to a later publication. A comprehensive analysis of the evolution of the gnd locus of S. enterica has been made by Thampapillai, Lan and Reeves (1994), who searched for evidence of recombination in the sequences of the strains m318, m298, m130, m38, m322 and m321. All of the recombinations reported by the authors are detected by the phylogenetic profile method and the positions of the recombination junctions agree in the two studies. These sequences are shown in the graphs in the left column of Figure 7, whereas the right column features sequences with similarly strong recombination signals that have previously not been recognised. The additional sites detected illustrate the strength and sensitivity of the phylogenetic profile method.

Recombination hotspots

Many of the gnd sequences appear to exhibit particularly small phylogenetic correlations in their central parts. Nine of the 22 local minima (marked with * in Table 2) overlap the variable sites 143-153, indicating a particularly large number of recombinations in this region. As previously reported by Thampapillai, Lan and Reeves (1994), a variant of the general recombination-stimulating sequence, chi, is located in this region at positions 744-751 (variable sites 145-150) of many strains in this analysis. All of the nine recombinant sequences mentioned above have the canonical sequence (5’-CCTGGTGG-3’), which differs by a single bp from the E. coli chi motif (5’-GCTGGTGG-3’). This sequence could possibly be regarded as the S. enterica equivalent to E. coli chi.
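The comparison of the two motifs, and a search for the motif in a sequence, can be sketched in a few lines. The function names and the short example sequence are illustrative assumptions; positions are 0-based.

```python
def mismatches(a, b):
    """Number of positions at which two equal-length motifs differ."""
    return sum(x != y for x, y in zip(a, b))

def find_motif(sequence, motif):
    """Return the 0-based start positions of all exact matches of `motif`."""
    last_start = len(sequence) - len(motif)
    return [i for i in range(last_start + 1) if sequence.startswith(motif, i)]

SALMONELLA_VARIANT = "CCTGGTGG"  # motif reported in the text
ECOLI_CHI = "GCTGGTGG"           # E. coli chi motif reported in the text

n_diff = mismatches(SALMONELLA_VARIANT, ECOLI_CHI)
starts = find_motif("AACCTGGTGGTT", SALMONELLA_VARIANT)  # invented sequence
```

The two motifs differ only at their first base, and the search returns the start of the single match in the example sequence.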

In order to examine the influence of this sequence motif on the phylogenetic profile of the sequences, a profile was created exclusively of sequences which had other variants of chi in this location. As can be seen in Figure 8 a), none of them has a local minimum at this site in its phylogenetic profile. In contrast, all sequences with local phylogenetic correlation minima near the chi site also have the canonical (CCTGGTGG) chi motif (Figure 8 b). This chi motif is also found in sequences w2D1, m38, m130, lind and 316, which were not included in Figure 8 b) as their phylogenetic profiles are more difficult to interpret when analysed in the context of the sequences in Figure 8 b). However, as shown above (Figure 7 and Table 2), recombinations at or close to the chi-like motif are also likely for these five sequences.

Implementation

All the phylogenetic profiles were generated using the PhylPro program. PhylPro is a Microsoft Windows application developed in C++ by the author. A prototype of the program is available from the author free of charge. The finalised version of the program will be released to the public domain later in 1997.

Acknowledgments

I thank Prof. Peter Reeves for communicating the S. enterica sequences prior to publication, Prof. Adrian Gibbs for helpful discussions during the development of the phylogenetic profile method and Holger Averdunk for his help in exploring the parameter space of the method.


Literature cited

Assali, N. E., R. Mache, and S. L. de Goer. 1990. Evidence for a composite phylogenetic origin of the plastid genome of the brown alga Pylaiella littoralis (L.) Kjellm. Plant. Mol. Biol. 15:307-315.

Bik, E. M., A. E. Bunschoten, R. D. Gouw, and F. R. Mooi. 1995. Genesis of the novel epidemic Vibrio cholerae O139 strain: evidence for horizontal transfer of genes involved in polysaccharide synthesis. EMBO J. 14:209-216.

Doolittle, R. F., D. F. Feng, K. L. Anderson, and M. R. Alberro. 1990. A naturally occurring horizontal gene transfer from eukaryote to prokaryote. J. Mol. Evol. 31:383-388.

DuBose, R., D. Dykhuizen, and D. Hartl. 1988. Genetic exchange among natural isolates of bacteria: recombination within the phoA locus of Escherichia coli. Proc. Natl. Acad. Sci. USA 85:7036-7040.

Felsenstein, J. 1991. PHYLIP: phylogeny inference package. Version 3.4. University of Washington, Seattle.

Fitch, D. H. A., and M. Goodman. 1991. Phylogenetic scanning: a computer-assisted algorithm for mapping gene conversions and other recombinational events. CABIOS 7:207-215.

Gibbs, A. J., and P. K. Keese. 1995. In search of origins of viral genes. Pp. 76-91 in A. J. Gibbs, C. H. Calisher and F. Garcia-Arenal, eds. Molecular Basis of Virus Evolution. Cambridge University Press, Cambridge, UK.

Gorbalenya, A. E. 1995. Origin of RNA viral genomes: approaching the problem by comparative sequence analysis. Pp. 49-67 in A. J. Gibbs, C. H. Calisher and F. Garcia-Arenal, eds. Molecular Basis of Virus Evolution. Cambridge University Press, Cambridge, UK.

Hahn, C. S., S. Lustig, S. Strauss, and E. G. Strauss. 1988. Western equine encephalitis virus is a recombinant virus. Proc. Natl. Acad. Sci. USA 85:5997-6001.

Hein, J. 1993. A heuristic method to reconstruct the history of sequences subject to recombination. J. Mol. Evol. 36:396-405.

Hudson, R. R., and N. L. Kaplan. 1985. Statistical properties of the number of recombination events in the history of a sample of DNA sequences. Genetics 111:147-164.

Jakobsen, I. B., and S. Easteal. 1996. A program for calculating and displaying compatibility matrices as an aid in determining reticulate evolution in molecular sequences. CABIOS 12:291-295.

Jakobsen, I. B., S. R. Wilson, and S. Easteal. 1997. The partition matrix: Exploring variable phylogenetic signals along nucleotide sequence alignments. Mol. Biol. Evol. 14:474-484.

Li, J., K. Nelson, A. C. McWhorter, T. S. Whittam, and R. K. Selander. 1994. Recombinational basis of serovar diversity in Salmonella enterica. Proc. Natl. Acad. Sci. USA 91:2252-2256.

Nuttall, P. A., M. A. Morse, L. D. Jones, and A. Portela. 1995. Adaptation of members of the Orthomyxoviridae family to transmission by ticks. Pp. 416-26 in A. J. Gibbs, C. H. Calisher and F. Garcia-Arenal, eds. Molecular Basis of Virus Evolution. Cambridge University Press, Cambridge, UK.

Paquin, B., M. J. Laforest, and B. F. Lang. 1994. Interspecific transfer of mitochondrial genes in fungi and creation of a homologous hybrid gene. Proc. Natl. Acad. Sci. USA 91:11807-11810.

Reeves, P. R. 1993. Evolution of Salmonella O antigen variation by interspecific gene transfer on a large scale. Trends Genet. 9:17-22.

Robertson, D. L., P. M. Sharp, F. E. McCutchan, and B. H. Hahn. 1995. Recombination in HIV-1. Nature 374:124-126.

Rohlf, F. J. 1993. NTSYS-pc: Numerical taxonomy and multivariate analysis system, Applied Biostatistics Inc., New York 11733, ISBN:0-925031-22-4

Salminen, M. O., J. K. Carr, D. S. Burke, and F. E. McCutchan. 1995. Identification of breakpoints in intergenotypic recombinants of HIV Type 1 by bootscanning. AIDS Res. Hum. Retroviruses 11:1423-1425.

Sandmeier, H. 1994. Acquisition and rearrangement of sequence motifs in the evolution of bacteriophage tail fibres. Mol. Microbiol. 12:343-350.

Sawyer, S. A. 1989. Statistical tests for detecting gene conversion. Mol. Biol. Evol. 6:526-538.

Schoeniger, A., and A. Haeseler. 1995. Simulating efficiently the evolution of DNA sequences. CABIOS 11:111-115.

Stephens, J. C. 1985. Statistical method of DNA sequence analysis: detection of intragenic recombination or gene conversion. Mol. Biol. Evol. 2:539-556.

Syvanen, M. 1994. Horizontal gene transfer: evidence and possible consequences. Ann. Rev. Genet. 28:237-261.

Thampapillai, G., R. Lan, and P. R. Reeves. 1994. Molecular evolution in the gnd locus of Salmonella enterica. Mol. Biol. Evol. 11:813-828.

Weiller, G. F., C. M. E. Schueller, and R. J. Schweyen. 1989. Putative target sites for mobile G+C rich clusters in yeast mitochondrial DNA: Single elements and tandem arrays. Mol. Gen. Genet. 218:272-283.

Weiller, G. F., M. A. McClure, and A. J. Gibbs. 1995. Molecular phylogenetic analysis. Pp. 553-85 in A. J. Gibbs, C. H. Calisher and F. Garcia-Arenal, eds. Molecular Basis of Virus Evolution. Cambridge University Press, Cambridge, UK.

Press, W. H., S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery. 1992. Numerical Recipes in C. Cambridge University Press, Cambridge, UK.

Whatmore, A. M., and M. A. Kehoe. 1994. Horizontal gene transfer in the evolution of group A streptococcal emm-like genes: gene mosaics and variation in Vir regulons. Mol. Microbiol. 11:363-374.


Table 1

Computation of the phylogenetic correlations at position 500 of the demonstration dataset

       R1(a)    S1    S2    S3    S4    S5    S6    S7    S8

distances in upstream window (positions 1-500) (b)

R1         0     0    78   148   147   170   178   182   163
S1         0     0    78   148   147   170   178   182   163
S2        78    78     0   125   130   161   164   183   160
S3       148   148   125     0    79   179   177   187   173
S4       147   147   130    79     0   178   177   185   168
S5       170   170   161   179   178     0    83   148   129
S6       178   178   164   177   177    83     0   141   125
S7       182   182   183   187   185   148   141     0    91
S8       163   163   160   173   168   129   125    91     0

distances in downstream window (positions 501-1000)

R1         0   193   200   199   198    84     0   145   139
S1       193     0    85   143   153   192   193   203   187
S2       200    85     0   144   149   199   200   199   190
S3       199   143   144     0    91   195   199   186   181
S4       198   153   149    91     0   201   198   185   190
S5        84   192   199   195   201     0    84   148   148
S6         0   193   200   199   198    84     0   145   139
S7       145   203   199   186   185   148   145     0    82
S8       139   187   190   181   190   148   139    82     0

phylogenetic correlations at position 500 (c)

        0.01  0.63  0.85  0.97  0.98  0.86  0.62  0.97  0.96

 

 

a The sequences S1-8 and the recombinant sequence R1 are described in Figure 1.

b Distances were computed as the number of differences (Hamming distance) within the specified region of the multiple sequence alignment.

c The phylogenetic correlation for each sequence is calculated as the linear correlation coefficient of the upstream and downstream distance vectors (columns in the matrices above).
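
The calculation described in footnote c can be retraced with a short script. The following is a minimal sketch, not the author's program: it assumes that the "linear correlation coefficient" is the ordinary Pearson coefficient and that each distance vector includes the sequence's zero distance to itself. The names `pearson` and `phylogenetic_correlations` are illustrative only. Under these assumptions the values for R1 and S1 come out at roughly 0.01 and 0.63, in line with the table.

```python
import math

LABELS = ["R1", "S1", "S2", "S3", "S4", "S5", "S6", "S7", "S8"]

# Hamming distances in the upstream window (positions 1-500), from Table 1.
UPSTREAM = [
    [0, 0, 78, 148, 147, 170, 178, 182, 163],
    [0, 0, 78, 148, 147, 170, 178, 182, 163],
    [78, 78, 0, 125, 130, 161, 164, 183, 160],
    [148, 148, 125, 0, 79, 179, 177, 187, 173],
    [147, 147, 130, 79, 0, 178, 177, 185, 168],
    [170, 170, 161, 179, 178, 0, 83, 148, 129],
    [178, 178, 164, 177, 177, 83, 0, 141, 125],
    [182, 182, 183, 187, 185, 148, 141, 0, 91],
    [163, 163, 160, 173, 168, 129, 125, 91, 0],
]

# Hamming distances in the downstream window (positions 501-1000), from Table 1.
DOWNSTREAM = [
    [0, 193, 200, 199, 198, 84, 0, 145, 139],
    [193, 0, 85, 143, 153, 192, 193, 203, 187],
    [200, 85, 0, 144, 149, 199, 200, 199, 190],
    [199, 143, 144, 0, 91, 195, 199, 186, 181],
    [198, 153, 149, 91, 0, 201, 198, 185, 190],
    [84, 192, 199, 195, 201, 0, 84, 148, 148],
    [0, 193, 200, 199, 198, 84, 0, 145, 139],
    [145, 203, 199, 186, 185, 148, 145, 0, 82],
    [139, 187, 190, 181, 190, 148, 139, 82, 0],
]


def pearson(u, v):
    """Pearson correlation of two equal-length numeric vectors."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((x - mean_u) * (y - mean_v) for x, y in zip(u, v))
    var_u = sum((x - mean_u) ** 2 for x in u)
    var_v = sum((y - mean_v) ** 2 for y in v)
    return cov / math.sqrt(var_u * var_v)


def phylogenetic_correlations(up, down):
    """Correlate matching columns of the upstream and downstream matrices."""
    n = len(up)
    return [
        pearson([row[j] for row in up], [row[j] for row in down])
        for j in range(n)
    ]


correlations = phylogenetic_correlations(UPSTREAM, DOWNSTREAM)
for label, r in zip(LABELS, correlations):
    print(f"{label}: {r:.2f}")
```

The low value for R1 reflects that its upstream distances follow S1 while its downstream distances follow S6.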


Table 2

Local minima in the phylogenetic profiles of Figure 7 (possible recombination sites)

Strain    gene position     variable position

m318      735-774 * +       152-165
m298      687-762 * +       138-152
          1017-1066         224-238
130       705-738 *         141-144
          414-477 (*) +     73-83
m38       705-738 *         141-144
          414-477 (*) +     73-83
m322      889-894 +         172-175
m321      930-978 +         187-205
w2Di      494-546           87-103
          735-774 *         153-165
west      705-744 *         141-145
          930-1018          185-225
m317      990-1067          210-230
          648-738 *         131-144
w3038     507-741 *         90-145
          989-1024          210-228
lind      555-741 *         105-145
          930-989           185-210
m316      1018-1074         225-243

 

* overlapping with or close to the chi-like sequence at pos. 744-751 (145-150)

(*) overlapping with or close to a chi-like sequence at pos. 424-431 (74-76)

+ independently identified by Thampapillai, Lan and Reeves (1994)

 

Figures


Figure 1

Neighbor joining trees of the demonstration dataset

Eight artificial sequences (S1-S8), each 1000 nucleotides long, were generated using EVOL-TREE (Schoeniger and Haeseler, 1995) with 10% nucleotide substitution between generations and 25% of each type of nucleotide. The recombinant sequence R1 was constructed by combining the 500 bp 5’-fragment of sequence S1 with the 500 bp 3’-fragment of sequence S6, and the recombinant sequence R2 was constructed by replacing region 333-666 of S1 with the corresponding fragment of S6.

The dendrograms were produced using the neighbor joining program of the Phylip package (Felsenstein, 1991). Part a) of the figure shows the relationships of simulated sequences S1 to S8. Part b) also includes the recombinant sequences R1 and R2.
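
The construction of the two recombinants can be illustrated as follows. This is a sketch under stated assumptions: the random strings stand in for the EVOL-TREE sequences S1 and S6, which are not reproduced here, positions are taken to be 1-based, and the region 333-666 is taken to be inclusive at both ends. The helper names `recombine_single` and `recombine_double` are hypothetical.

```python
import random


def recombine_single(a, b, breakpoint):
    """First `breakpoint` positions of a, followed by the remainder of b."""
    return a[:breakpoint] + b[breakpoint:]


def recombine_double(a, b, start, end):
    """Replace the 1-based inclusive region start..end of a with that of b."""
    return a[:start - 1] + b[start - 1:end] + a[end:]


# Placeholder sequences; the real S1 and S6 come from the simulated tree.
rng = random.Random(0)
s1 = "".join(rng.choice("ACGT") for _ in range(1000))
s6 = "".join(rng.choice("ACGT") for _ in range(1000))

r1 = recombine_single(s1, s6, 500)       # 5' half of S1, 3' half of S6
r2 = recombine_double(s1, s6, 333, 666)  # S1 with region 333-666 from S6
```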


Figure 2

Phylogenetic profiles of sequences S1-8 and R1-2

Series 1 (left column) contains only one recombinant sequence, R1 (bold line), with a single crossover at site 349. Series 2 (right column) additionally contains R2 (bold line), with a double crossover at sites 237 and 465. The parental sequences S1 and S6 of both recombinants are omitted from rows b-d. Rows a and b use variable window widths, while rows c and d use fixed-size windows with widths of 70 and 35 positions, respectively (see text).


Figure 3

Phylogenetic profiles of chimaeric sequences created from the demonstration dataset

All profiles use Hamming distances and fixed windows of 50 bp width. All sequences in the profiles are recombinant. Plot a contains 16 sequences each with a crossover at position 349 (Series A). Plot b contains 16 sequences each with a double crossover at positions 237 and 465 (Series B). Plot c combines all 32 sequences used in a and b (see text).


Figure 4

Simulation of a complex phylogeny

A 1000 bp long random progenitor sequence (s) containing 25% each of A, C, T, and G was assembled. Descendant sequences were created by introducing 20 random nucleotide changes into their respective progenitors. The recombinant sequence x was built by combining the first 500 nucleotides of sequence sB with the last 500 nucleotides of sequence sA. The remaining recombinants y and z were created similarly by crossing sequences sAB with sBA at position 300 and sABB with sBAA at position 700. The recombinants were evolved further, yielding a total of 30 sequences.
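
One step of this simulation might look like the sketch below. It rests on assumptions the caption does not settle: the 20 changes are placed at 20 distinct positions, each change substitutes a different base, sA and sB are direct descendants of s, and the base composition is 25% each only in expectation. The function names and the seed are illustrative.

```python
import random

BASES = "ACGT"


def random_sequence(length, rng):
    """Random sequence with each base drawn with probability 1/4."""
    return "".join(rng.choice(BASES) for _ in range(length))


def mutate(seq, n_changes, rng):
    """Return a descendant differing from seq at exactly n_changes positions."""
    chars = list(seq)
    for i in rng.sample(range(len(chars)), n_changes):
        # Choose a base different from the current one so the site really changes.
        chars[i] = rng.choice([b for b in BASES if b != chars[i]])
    return "".join(chars)


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


rng = random.Random(1)
s = random_sequence(1000, rng)   # progenitor
sA = mutate(s, 20, rng)          # two descendants of s
sB = mutate(s, 20, rng)
x = sB[:500] + sA[500:]          # recombinant x: first half of sB, second half of sA
```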

 


Figure 5

Phylogenetic profile of a simulated complex phylogeny

The source of the 30 sequences is described in Figure 4. The profile utilises only the 304 parsimonious sites. For convenience, the graph is mapped to display all positions, whereby the x-axis gives the range from position 100 to 900. The y-axis gives the phylogenetic correlation from +1 to -0.5. The boxes labelled x, y, and z mark the sequences derived from the recombinants x, y and z, respectively (see text and Figure 4).


Figure 6

gnd gene sequence of 34 Salmonella enterica strains

Only the variable sites and their locations within the gene are given. The black bar indicates a chi-like region.


Figure 7

Phylogenetic profiles of the gnd locus in Salmonella enterica strains

All graphs used a fixed window size of 60 bp and only the variable sites, as given in Figure 6. The proportion of different nucleotides was used to determine the relationships of the sequences and the linear correlation coefficient was calculated to determine phylogenetic correlations. The top graph shows the profiles of all 34 sequences as given in Figure 6. All but one of the sequences of the strains that gave strong recombination signals (m318, m298, m130, m38, m322, m321, w2DI, west, m317, w3038, lind, m316, m325, m287) were removed from the remaining plots. The profile of the sequence with the strongest recombination signal is plotted in bold, and the name of the corresponding strain is given (see text).
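
A fixed-window profile of the kind described here could be computed along the following lines. This is a rough sketch and not the published program: it assumes that the window size applies to each side of the position examined, that positions where a full window does not fit on both sides are skipped, and that the correlation is left undefined (`None`) when one of the distance vectors is constant. The names `proportion_diff`, `pearson` and `profile` are hypothetical.

```python
import math


def proportion_diff(a, b):
    """Proportion of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def pearson(u, v):
    """Pearson correlation, or None if either vector has zero variance."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((x - mean_u) * (y - mean_v) for x, y in zip(u, v))
    var_u = sum((x - mean_u) ** 2 for x in u)
    var_v = sum((y - mean_v) ** 2 for y in v)
    if var_u == 0 or var_v == 0:
        return None
    return cov / math.sqrt(var_u * var_v)


def profile(seqs, target, window):
    """Map each boundary position to the phylogenetic correlation of seqs[target].

    At boundary `pos`, the upstream window covers [pos - window, pos) and the
    downstream window covers [pos, pos + window), using 0-based indices.
    """
    ref = seqs[target]
    result = {}
    for pos in range(window, len(ref) - window + 1):
        up = [proportion_diff(ref[pos - window:pos], s[pos - window:pos])
              for s in seqs]
        down = [proportion_diff(ref[pos:pos + window], s[pos:pos + window])
                for s in seqs]
        result[pos] = pearson(up, down)
    return result


# Toy alignment of variable sites only; real input would be the sites of Figure 6.
toy = ["AAAAAAAA", "AAAACCCC", "CCCCAAAA", "CCCCCCCC"]
toy_profile = profile(toy, target=0, window=4)
```

In the toy alignment the upstream and downstream distance patterns of the first sequence are unrelated, so its correlation at the single available boundary is zero.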


Figure 8

Phylogenetic profiles of sequences with and without chi-like motif

The canonical chi-like motif (5’ 744-CCTGGTGG-751 3’) is marked (variable sites 145-150) in both graphs. The graph in a) includes all sequences (s71, m229, m311, sofia, m261, m287, m319, m35, m320, m313, m314, m321, m322, m326) that have different variations of this chi-like sequence, while all sequences in b) (lt2, s41, m46, m36, m73, m13, m55, m298, m295, west, m317, w3038, m318, m324, m325) have the canonical chi-like sequence. The thick line gives the averages of the phylogenetic correlations of all sequences that are included in each graph.

