

1 Unité de Biochimie Bactérienne, UR477, INRA, 78350 Jouy-en-Josas, France
2 Unité Mathématique Informatique et Génome, UR1077, INRA, 78350 Jouy-en-Josas, France
3 Unité de Génétique Microbienne, UR895, INRA, 78350 Jouy-en-Josas, France
Correspondence
Rozenn Gardan
rozenn.gardan{at}jouy.inra.fr
| ABSTRACT |
|---|
These authors contributed equally to this work.
Two supplementary tables listing spCDSs of the genomes studied that passed the filtration of pseudogenes and false predictions, and predicted and annotated short CDSs supported by extrinsic evidence in one Enterococcus faecalis and 18 streptococcal genomes, and a supplementary figure showing the chromosomal context of CDSs encoding the SHPs associated with rgg genes, are available with the online version of this paper.
| INTRODUCTION |
|---|
All microbial genomes contain an abundance of small ORFs, potentially encoding peptides. However, the number of short coding sequences (CDSs) is still debated, and their identification and the precise localization of start codons remain the most difficult problems affecting bacterial CDS detection in silico (Harrison et al., 2003; Nielsen & Krogh, 2005; Ochman, 2002; Skovgaard et al., 2001). The short length of these CDSs probably affects both pillars of CDS prediction, namely intrinsic and extrinsic approaches (Borodovsky et al., 1994). Intrinsic approaches evaluate the coding potential of ORFs on the basis of nucleotide composition and the presence of RBSs. In this context, the short length of the ORFs limits the amount of information available, and the large number of small ORFs is expected to markedly increase the risk of false prediction (Larsen & Krogh, 2003). Extrinsic approaches rely on sequence comparisons. Here again, the small size of the ORFs potentially decreases the sensitivity of the comparisons. In addition, the sequence comparisons used in extrinsic analyses are often restricted to a search against annotated proteins and thus miss short CDSs which are unannotated in other genomes (Borodovsky et al., 1994; Pearson et al., 1997).
Besides the difficulty of in silico prediction of the short CDSs, the overrepresentation of genes of unknown function among the annotated short CDSs also seems to point to a deficit of biological knowledge regarding this class of genes (Zuber, 2001). The specific challenges posed by the experimental study of these genes and the corresponding peptides probably partly explain this situation. Classical hybridization methods such as Northern blotting often fail to detect their transcription as the corresponding RNAs are small, and gene disruption is also difficult because of the need for a double recombination event. Furthermore, in comparison with proteins, the resulting peptides are difficult to detect either by staining or by UV absorbance.
Taken together, the central role of small peptides in complex biological functions and the difficulty in identifying short CDSs both in silico and experimentally clearly call for a careful in silico study aimed at producing a reliable list of small genes that is as exhaustive as possible. In order to produce such a list, we implemented a novel three-step strategy that combined intrinsic and extrinsic approaches, to circumvent the limitations of both types of approach. We started by complementing genome annotations by the identification of new short putative CDSs (spCDSs). During the second step, pseudogene fragments and false predictions were filtered out. Finally, the remaining spCDSs were screened for three lines of extrinsic evidence involving sequence and gene context comparisons.
This strategy was applied to the genomes of the Streptococcus genus and Enterococcus faecalis V583 (Paulsen et al., 2003). The number of complete streptococcal genomes available is particularly high due to the medical importance of several species such as Streptococcus pneumoniae (Hoskins et al., 2001; Tettelin et al., 2001), Streptococcus pyogenes (Banks et al., 2004; Beres et al., 2002, 2006; Ferretti et al., 2001; Green et al., 2005; Nakagawa et al., 2003; Smoot et al., 2002; Sumby et al., 2005), Streptococcus agalactiae (Glaser et al., 2002; Tettelin et al., 2002, 2005) and Streptococcus mutans (Ajdic et al., 2002). Only one sequenced species, Streptococcus thermophilus (Bolotin et al., 2004), is classified as generally recognized as safe, and this species is of major importance to the food industry, since it is widely used for the manufacture of yoghurt and cheese. This dense taxon sampling increases the power of the extrinsic analyses and therefore made the genus particularly attractive for testing our strategy.
As an illustration of the utility of our list, we report the discovery of a new family of CDSs that encode hydrophobic peptides (20–23 aa) located just upstream of genes encoding transcriptional regulators of the Rgg (also called MutR) family. We also examined the transcription of seven genes of S. thermophilus, six selected from those with extrinsic evidence that could be linked to quorum sensing and one selected from those with only intrinsic evidence, using real-time quantitative RT-PCR. Finally, one peptide from the list, encoded by a gene that is not annotated in GenBank, was identified in a cell-envelope-enriched fraction of S. thermophilus CNRZ1066 by LC-MS/MS analysis.
| METHODS |
|---|
spCDS detection based on intrinsic evidence.
BactgeneSHOW (options -m 4C_si -rbs m1 -duprev -cdst 0.1) was used to predict additional CDSs (Bryson et al., 2006). This program uses a hidden Markov model (HMM) that accounts for the presence of RBSs and four types of nucleotide composition of the coding sequences (see below). It also features fully unsupervised parameter estimation from the raw sequence (Nicolas et al., 2002) carried out in a maximum-likelihood framework with the iterative expectation–maximization algorithm (Rabiner, 1989). Each prediction comes with a confidence measure, referred to here as pHMM, that corresponds to the probability for the ORF to be a CDS (i.e. a true positive prediction) as computed after posterior decoding with the HMM. To ensure the homogeneity of the set of spCDSs, predicted start codons were preferred over annotated start codons when these did not match (annotation policies sometimes strongly favour the most upstream ATG triplet; Besemer et al., 2001).
Identification of pseudogenes and false predictions.
Protein–protein comparisons used the exact Smith–Waterman algorithm implemented in SSEARCH with default parameters (FASTA3.4 package; Pearson & Lipman, 1988; Smith & Waterman, 1981). DNA–protein comparisons involved both strands of the spCDS plus an additional 300 bp on each side and used FASTX (FASTA3.4; Pearson, 2000), a program that allows frameshifts and stop codons in the alignments. Only FASTX alignments that encompassed the middle of the spCDSs were incorporated in our analysis. Both SSEARCH and FASTX comparisons were conducted separately for each pair of genomes; E-values were therefore multiplied by the number of comparisons (20) to account for multiple testing.
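The two filtering rules in this paragraph can be illustrated with a minimal Python sketch. The function names and the coordinate convention (positions on a common axis) are assumptions made for illustration; only the factor of 20 and the mid-point rule come from the text.

```python
N_COMPARISONS = 20  # number of comparisons stated in the text


def corrected_evalue(evalue, n_comparisons=N_COMPARISONS):
    """Multiply a per-comparison E-value by the number of comparisons."""
    return evalue * n_comparisons


def covers_midpoint(aln_start, aln_end, cds_start, cds_end):
    """Return True if the alignment spans the middle of the spCDS.

    All coordinates are assumed to lie on the same axis
    (hypothetical convention, not taken from the paper).
    """
    mid = (cds_start + cds_end) / 2
    lo, hi = sorted((aln_start, aln_end))
    return lo <= mid <= hi
```

An alignment covering positions 100 to 250 would be kept for an spCDS spanning 150 to 300, whereas one covering 100 to 200 would be discarded.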
Statistical assessment of extrinsic evidence.
P-E1 assesses the statistical confidence in extrinsic evidence E1 that corresponds to protein sequence similarity between two spCDSs. P-E1 is the E-value reported by SSEARCH after correction for multiple comparisons (genome pairs were treated separately).
P-E2 assesses the statistical confidence in extrinsic evidence E2 that corresponds to protein sequence similarity between a query spCDS and another spCDS of a similar length (allowing a 10 aa difference) in a similar gene context. Two spCDSs in the same position relative to two nearby reference genes that are homologous were considered to be in a similar gene context. The position of the spCDS relative to the reference gene was categorized into one of four possible configurations depending on the location (upstream or downstream) and the direction (same or opposite) of the spCDS. Each CDS annotated in GenBank with an extremity less than 500 bp away from an spCDS was considered as a possible reference for this spCDS. The set of homologous genes for each reference gene was constructed using BLASTP (version 2.2.14) (Altschul et al., 1997), imposing an E-value threshold of 10⁻⁶ and an alignment spanning at least 70 % of both the query and the subject. Pair-wise protein-sequence similarity between spCDSs in a similar gene context was quantified using PRSS (FASTA3.4). P-E2 is an E-value that accounts for both protein sequence similarity and gene context similarity. It is obtained as the P-value reported by PRSS multiplied by the expected number of comparisons with other spCDSs in a similar gene context. To compute this expected number, spCDSs were modelled as randomly distributed across the intergenic regions of the genome. Specifically, we used a homogeneous Poisson process (Nelson, 1995) to model the occurrences of the spCDSs across the regions not occupied by annotated or predicted CDSs more than 10 aa longer than the query sequence. The intensity of the Poisson process was estimated separately for each spCDS length range and each genome as the ratio between the number of spCDSs in the genome and the total number of available positions for the spCDSs to occur (maximum-likelihood estimate).
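The P-E2 computation described above can be sketched as follows. This is an illustrative reading, not the authors' code: it assumes that the expected number of comparisons is the sum, over homologous reference genes, of the genome-specific Poisson intensity multiplied by the number of positions available in the matching gene context. The function and parameter names are hypothetical.

```python
def poisson_intensity(n_spcds, n_available_positions):
    """Maximum-likelihood intensity of a homogeneous Poisson process:
    number of events divided by the number of available positions."""
    return n_spcds / n_available_positions


def p_e2(prss_pvalue, intensities, context_positions):
    """E-value combining sequence and gene-context similarity.

    intensities[i]       -- intensity for the genome of homologue i
    context_positions[i] -- positions available in the matching context
                            near homologue i (assumed definition)
    """
    expected = sum(lam * pos for lam, pos in zip(intensities, context_positions))
    return prss_pvalue * expected
```

Because the expected number of comparisons is usually far smaller than the number of spCDSs in a genome, a given PRSS P-value translates into a smaller E-value than in an unrestricted search, which is the rationale the text gives for E2.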
Extrinsic evidence E1 and E2 also involved testing the null hypothesis of random distribution of the DNA differences across the three codon positions. Pair-wise DNA alignments were deduced from pair-wise protein alignments, and only those alignments with fewer differences in the second than in the third codon position were further analysed. The statistical test was a variant of the Fisher exact test constructed as follows. Let n1, n2 and n3 denote the number of differences in the first, second and third codon positions across the pair-wise alignment and N be the total number of aligned codons. In addition, let P(x,y,z) be the probability of observing x differences in the first codon position, y in the second and z in the third, under the null hypothesis given the value of x+y+z. P(x,y,z) is computed as:

$$P(x,y,z) = \frac{\binom{N}{x}\binom{N}{y}\binom{N}{z}}{\binom{3N}{x+y+z}}$$

The P-value of the test is the sum of P(x,y,z) over all configurations (x,y,z) such that x+y+z = n1+n2+n3, y < z and P(x,y,z) ≤ P(n1,n2,n3).

P-E3 assesses the statistical confidence in evidence E3 that corresponds to a higher number of spCDSs of a particular size (allowing a 10 aa difference) in a particular gene context than that expected by chance. As for E2, the gene context is assessed by the position of an spCDS relative to a reference gene and the occurrences of spCDSs are modelled as randomly distributed across intergenic regions. P-E3 is a P-value based on the distribution of the sum of independent but non-identically distributed Bernoulli variables, where each variable corresponds to the presence or absence of at least one spCDS, and the sum is taken over the different homologues of the reference. In order to avoid counting spCDSs found on very closely related DNA sequences as distinct occurrences, only homologous reference genes that diverged to the extent that they were less than 85 % identical at the third codon position were taken into account.
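Both tests in this section lend themselves to a short numerical sketch. The code below assumes that the null distribution of differences across codon positions is multivariate hypergeometric (N sites per codon position, 3N sites in total) and that the tail is restricted to configurations with fewer second-position than third-position differences; these are assumptions about the exact form of the test. The second function computes the upper tail of a sum of independent, non-identical Bernoulli variables by dynamic programming, as one possible way to obtain a P-E3-like value. All names are hypothetical.

```python
from math import comb


def p_config(x, y, z, n_codons):
    """Probability of (x, y, z) differences at codon positions 1, 2, 3,
    given x + y + z, under random placement over 3 * n_codons sites."""
    return (comb(n_codons, x) * comb(n_codons, y) * comb(n_codons, z)
            / comb(3 * n_codons, x + y + z))


def codon_position_pvalue(n1, n2, n3, n_codons):
    """Sum P(x, y, z) over configurations with the same total, y < z,
    and probability no larger than that of the observed configuration."""
    total = n1 + n2 + n3
    p_obs = p_config(n1, n2, n3, n_codons)
    pval = 0.0
    for x in range(min(total, n_codons) + 1):
        for y in range(min(total - x, n_codons) + 1):
            z = total - x - y
            if z > n_codons or y >= z:
                continue
            p = p_config(x, y, z, n_codons)
            if p <= p_obs * (1 + 1e-9):  # tolerance for float ties
                pval += p
    return pval


def poisson_binomial_tail(probs, k):
    """P(S >= k) where S is a sum of independent Bernoulli(probs[i])."""
    dist = [1.0]  # dist[j] = P(partial sum == j)
    for p in probs:
        new = [0.0] * (len(dist) + 1)
        for j, q in enumerate(dist):
            new[j] += q * (1 - p)
            new[j + 1] += q * p
        dist = new
    return sum(dist[k:])
```

In this reading, `probs` would hold, for each sufficiently diverged homologue of the reference gene, the Poisson-derived probability of finding at least one spCDS of the right size in the right context, and `k` would be the number of homologues where one was actually observed.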
Phylogenetic analysis of the Rgg family.
The seven genes annotated as being representatives of the MutR family in S. thermophilus CNRZ1066 and their homologues in S. agalactiae A909, S. mutans UA159, S. pyogenes MGAS10270 and S. pneumoniae R6 were collected (alignments encompassing more than 70 % of both the query and the subject and an E-value threshold of 10⁻⁶ in a BLASTP comparison). The multiple protein sequence alignment was created with CLUSTALW1.83 (default parameters) (Thompson et al., 1994) and the positions with gaps were removed. Tree reconstruction was carried out with PROTDIST and NEIGHBOR (PHYLIP3.6 package; Felsenstein, 1989) using a Jones, Taylor and Thornton (JTT) model of protein evolution and a gamma model of rate variations across sites. The coefficient of variation of the gamma distribution was set to 0.52, the value estimated by TREE-PUZZLE5.2 (Schmidt et al., 2002) on this particular dataset. Bootstrap confidence values were computed on 1000 replicates created using SEQBOOT (PHYLIP3.6).
Growth conditions and bacterial strain.
The S. thermophilus CNRZ1066 strain used in this study was derived from the Institut National de la Recherche Agronomique–Centre National de Recherches Zootechniques (CNRZ) bacteria collection. Depending on the experiment, S. thermophilus CNRZ1066 was grown in M17 (Difco) supplemented with 10 g lactose l⁻¹ (M17lac) or in a chemically defined medium (CDM; Letort & Juillard, 2001). The cultures were incubated at 42 °C. OD600 of the cultures was measured using a spectrophotometer (Uvikon 931, Kontron).
RNA extraction and reverse transcription.
Total RNA was extracted from cells grown in M17lac or CDM. Cells were harvested during exponential (OD600 1.2) and stationary phases of growth (OD600 2.3 in CDM or OD600 2.6 in M17lac) by centrifugation (5000 g for 1 min at 4 °C) and stored at –80 °C. Bacterial pellets were resuspended in 400 µl buffer containing 10 % glucose, 12.5 mM Tris-HCl (pH 7.6) and 70 mM EDTA, and transferred into tubes containing 500 µl acid phenol (pH 4.5) and 0.6 g 0.1 mm diameter glass beads (Sigma). Cells were subjected to mechanical disruption with a Fastprep apparatus (two 30 s cycles of homogenization at maximum speed with 1 min intervals on ice). After a centrifugation step (19 000 g for 10 min at 4 °C), RNA was purified by TRIzol/chloroform/isoamyl alcohol (Invitrogen) and chloroform/isoamyl alcohol extractions, and 2-propanol precipitation. RNA samples were resuspended in 50 µl TE (10 mM Tris, 1 mM EDTA, pH 7.6) and treated with DNase I (Ambion) as recommended by the manufacturer in order to eliminate contaminating DNA. Total RNA concentrations were determined by measuring A260 using a spectrophotometer (BioPhotometer, Eppendorf). RNA integrity was checked with an Agilent 2100 bioanalyser. Three extractions were performed independently for each condition. Furthermore, cDNA synthesis was performed from 1 µg RNA using M-MLV reverse transcriptase (Invitrogen) according to the manufacturer's protocol.
Real-time RT-PCR.
The primers used in this study are listed in Table 1. The primers were designed using Primer Express (version 2.0; Applied Biosystems). The length of PCR products ranged from 69 to 150 bp. The efficiency of amplification was determined by running a standard curve with serial dilutions of PCR products encompassing each target gene. Efficiency (E) was calculated using the formula E = [10^(−1/s) − 1] × 100, where s is the slope of the regression line obtained for each pair of primers. Real-time PCR was carried out with 25 ng reverse-transcribed cDNA using the SYBR Green PCR Master Mix (Applied Biosystems), as recommended by the manufacturer. A 200 nM concentration of primers was used, except for the amplification of gene 3, where 100 nM was used to avoid dimer formation. PCR reactions were run on the ABI Prism 7700 sequence detector (Perkin–Elmer Applied Biosystems) under the following conditions: Taq polymerase activation at 95 °C for 10 min, and then 40 cycles of 15 s at 95 °C and 1 min at 60 °C. A melting-curve analysis was performed at the end of each run for all primer sets. This resulted in single-product-specific melting curves, and no primer-dimers were generated during the runs. A no-template control (replacing cDNAs with distilled H2O) and an RT-negative control (replacing cDNAs with RNA samples which had not undergone the reverse transcription step) were included in each run in order to confirm the absence of contaminating DNA. SYBR Green PCRs were performed in triplicate, and for each condition the experiments were repeated independently in triplicate. Data were either recorded as the threshold cycle (CT) and expressed as mean±SD or computed using the comparative critical threshold method (2^(−ΔΔCT)) described by Livak & Schmittgen (2001). In this case, ldhL was used to normalize data. For each gene, an analysis of variance was performed on the CT corrected by the CT of ldhL in order to determine whether the relative expression levels between two conditions were significantly different (P<0.05).
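The two calculations used in this section, amplification efficiency from the standard-curve slope and relative expression by the comparative threshold method, can be written as a short Python sketch. The function names and the example CT values are illustrative and are not taken from the study.

```python
def amplification_efficiency(slope):
    """Efficiency in percent from the standard-curve slope s:
    E = (10 ** (-1 / s) - 1) * 100."""
    return (10 ** (-1.0 / slope) - 1.0) * 100.0


def relative_expression(ct_target_test, ct_ref_test,
                        ct_target_ctrl, ct_ref_ctrl):
    """Fold change by the comparative CT method, 2 ** (-ddCT).

    The reference gene (ldhL in the text) normalizes each condition;
    the control condition serves as calibrator.
    """
    d_ct_test = ct_target_test - ct_ref_test
    d_ct_ctrl = ct_target_ctrl - ct_ref_ctrl
    return 2.0 ** (-(d_ct_test - d_ct_ctrl))
```

A slope close to −3.32 corresponds to an efficiency near 100 %, i.e. a doubling of product at each cycle, which is the condition under which the 2^(−ΔΔCT) calculation is valid.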
LC-MS/MS analysis and database searching.
LC-MS/MS analysis was performed with an Ultimate 3000 LC system (Dionex) connected to a linear ion trap mass spectrometer (LTQ, Thermo Fisher) by nanoelectrospray. Tryptic peptide mixtures (4 µl) were loaded at a flow rate of 20 µl min⁻¹ onto Pepmap C18 precolumns (0.3×5 mm, 10 nm, 5 µm; Dionex). After 4 min, the precolumn was connected to the Pepmap C18 separating nanocolumn (0.075 mm diameter × 150 mm length, 10 nm internal diameter, 3 µm particle size), and the linear gradient from 2 to 36 % buffer B (0.1 % formic acid, 80 % acetonitrile) in buffer A (0.1 % formic acid, 2 % acetonitrile) at 300 nl min⁻¹ over 90 min was started. Ionization was performed on the liquid junction with a spray voltage of 1.3 kV applied to a non-coated capillary probe (PicoTip EMITTER 10 µm tip ID; New Objective). Peptide ions were analysed by the Nth-dependent method as follows: (i) full MS scan (m/z 300–1500), (ii) ZoomScan (scan of the three major ions), (iii) MS/MS on these three ions.
All protein identification was performed with Bioworks 3.3 (Thermo Fisher). All peaks, lists of precursors and fragment ions were matched automatically against a database which compiled GenBank annotations of S. thermophilus CNRZ1066 and new predictions. The Bioworks search parameters were: trypsin-specificity with one missed cleavage, oxidation variable for methionine, and mass tolerance fixed to 1.4 Da for precursor ions and 0.5 Da for fragment ions. The search results were filtered using Bioworks 3.3. A multiple-threshold filter applied at the peptide level consisted of the following criteria: Xcorr magnitude up to 1.7, 2.5 and 3.0 for mono-, di- and tri-charged peptides, respectively, peptide probability lower than 0.01, and ΔCn greater than 0.1.
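The multiple-threshold filter can be expressed as a small predicate. This sketch assumes that the charge-dependent Xcorr values act as minimum scores, which is the usual convention but is an interpretation of the wording above; the function name and argument layout are hypothetical and do not reproduce the Bioworks interface.

```python
# Charge state -> Xcorr threshold, as listed in the text
XCORR_MIN = {1: 1.7, 2: 2.5, 3: 3.0}


def passes_filter(charge, xcorr, probability, delta_cn):
    """Return True if a peptide match satisfies all three criteria."""
    threshold = XCORR_MIN.get(charge)
    if threshold is None:  # charge states not covered by the filter
        return False
    return xcorr >= threshold and probability < 0.01 and delta_cn > 0.1
```

A doubly charged peptide with Xcorr 2.8, probability 0.001 and ΔCn 0.3 would be retained, whereas the same peptide with Xcorr 2.0 would be rejected.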
| RESULTS |
|---|
0.1).
GenBank annotations and new predictions were compiled into a set of non-redundant spCDSs (≤60 aa) whose number ranged from 109 to 335 per Mbp. Table 2, column Step (1), gives the results for the 20 genomes. The highest number and density of spCDSs was found in S. thermophilus genomes, probably reflecting the distinctively high number of pseudogene fragments in this species (Bolotin et al., 2004).
0.01).
The fraction of spCDSs eliminated by this systematic filtering varied across genomes, from 19 % in E. faecalis V583 to 50 % in S. thermophilus CNRZ1066. Table 2, column Step (2), lists the number of CDSs present after this step of the analysis, and the list of genes is given in Supplementary Table S1. As an indication of the high sensitivity of our filter, all 62 spCDSs annotated as pseudogene fragments in S. thermophilus CNRZ1066 were discarded after this step of the analysis.
Screening for extrinsic evidence
The final stage of the strategy to identify short CDSs consisted of screening the remaining spCDSs for support by at least one of three lines of extrinsic evidence in which we could be confident. These three lines were: conserved protein sequence (E1); similar gene context and conserved protein sequence (E2); and a significantly high number of spCDSs of a particular size in a particular gene context (E3). Both E1 and E2 are based on a Smith–Waterman search against similarly sized spCDSs, allowing a difference in length of up to 10 aa, but differ in terms of the set of candidates against which the search is conducted. In E1, all similarly sized spCDSs are considered. In E2, the search is restricted to those spCDSs of similar size in the same gene context as characterized by their location (either upstream or downstream) and by the strand (either identical or opposite), relative to a reference gene. The idea behind this distinction is that E2 may be more sensitive for finding weak similarities between short CDSs by reducing the number of comparisons. Importantly, significant sequence similarities identified in E1 and E2 at the protein level were verified as not simply being the result of the evolutionary proximity between DNA sequences (Ochman, 2002). As sequence conservation is expected to depend on codon position in coding sequences under purifying selection (Kimura, 1977), DNA sequence divergence was required to be significantly influenced by codon position (P<0.05) or to reach 40 % over the whole alignment. The different lines of extrinsic evidence were evaluated separately, and evidence was considered to be found when its associated E- or P-value (P-E1, P-E2 or P-E3) was below 0.001, a choice that balanced high sensitivity with very few expected false positives.
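The decision rules of this screening step can be summarized in a few lines of Python. The sketch assumes strict inequalities for the 0.001 and 0.05 cut-offs and treats the 40 % divergence criterion as inclusive; the function names are hypothetical.

```python
EVIDENCE_THRESHOLD = 0.001  # cut-off applied to P-E1, P-E2 and P-E3


def supported_lines(p_e1=None, p_e2=None, p_e3=None):
    """Return the lines of extrinsic evidence whose value is below the cut-off.
    None means the corresponding evidence could not be evaluated."""
    values = {"E1": p_e1, "E2": p_e2, "E3": p_e3}
    return [name for name, p in values.items()
            if p is not None and p < EVIDENCE_THRESHOLD]


def passes_codon_criterion(p_codon, divergence):
    """Codon-position check applied to E1 and E2 hits: either a significant
    codon-position effect or at least 40 % DNA divergence (as a fraction)."""
    return p_codon < 0.05 or divergence >= 0.40
```

An spCDS is kept in the final list as soon as `supported_lines` returns a non-empty list, since the three lines of evidence are evaluated separately.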
The results of this screening are shown in Table 2, column Step (3). Over the 20 genomes, a total of 789 spCDSs were supported by at least one of the three extrinsic lines of evidence, corresponding to an average rate of 20 per Mbp. This rate was remarkably stable, as it ranged from 15 in E. faecalis V583 to 25 in S. pyogenes MGAS10394. Among the 789 spCDSs, only 5.5 % were missed by our ab initio prediction but annotated in GenBank. The fraction of spCDSs supported by extrinsic evidence and unannotated in GenBank exhibited extensive variations across the 20 genomes (from 61 % in S. thermophilus CNRZ1066 to 11 % in S. pneumoniae TIGR4) and corresponded to 31 % of the 789 spCDSs. As expected, this proportion decreased when the length of the CDS increased, being as high as 67 % of 204 for those shorter than 40 aa and as low as 14 % of 374 for those with lengths ranging from 50 to 60 aa.
Interestingly, the presence/absence of annotated CDSs in any of the GenBank records never closely matched the set of short CDSs with extrinsic support: either many short CDSs were annotated and most of these were not supported by extrinsic evidence, or a few short CDSs were annotated and the annotation missed an important fraction of the short CDSs with extrinsic support (Fig. 1). For instance, the 33 S. thermophilus CNRZ1066 annotated spCDSs had one of the highest percentages of extrinsic support (42 %) but also had the lowest coverage of the spCDSs with extrinsic evidence. Conversely, the 200 annotated spCDSs of S. pneumoniae TIGR4 had the lowest rate of extrinsic support (16 %) but covered the highest fraction of supported short CDSs.
Table 3 lists the 35 spCDSs that received extrinsic support in the S. thermophilus CNRZ1066 genome. Among them, five were found to belong to a new family of CDSs that encode short hydrophobic peptides (SHPs), and are described below. Six other CDSs (plus another that did not receive extrinsic support) were selected for transcriptional analysis. Supplementary Table S2 gives the same information as Table 3 for the other 19 genomes.
In order to determine whether these seven selected CDSs were transcribed, we measured the CT for each gene using cDNA obtained from RNA extracted from cells grown in rich medium (M17lac) in exponential phase, as described in Methods. As shown in Table 4, we obtained CT values of between 19 and 28, indicating that all CDSs were transcribed.
| DISCUSSION |
|---|
The added value of our genome-wide survey of the short CDSs is well illustrated by the fact that in none of the GenBank records did the presence/absence of CDS annotations closely match the list of spCDSs supported by extrinsic lines of evidence. Indeed, it is unlikely that either intrinsic or extrinsic approaches taken separately could ever produce such a list. Intrinsic approaches are typically misled by the presence of pseudogene fragments, and it is apparently difficult to distinguish some functional CDSs from the intergenic background without accepting a relatively high level of false predictions. However, intrinsic approaches proved very helpful in establishing lists of candidate CDSs that could be subsequently screened for extrinsic lines of evidence.
We also compared our list of short CDSs supported by extrinsic lines of evidence to the CDSs accompanied by functional annotation in GenBank. Some biologically known CDSs could not be confirmed by extrinsic in silico approaches with the sequence data currently available. This demonstrated that the list reported here is incomplete, although it is already very large compared to the number of known short CDSs.
The experimental use and validation of these results were initiated using the list of predicted short CDSs of S. thermophilus CNRZ1066. For all seven genes chosen for experimental studies, we were able to detect the presence of RNA transcripts using real-time quantitative RT-PCR. Very little reasonable doubt now remains concerning their translation into proteins, as they were all targeted on the basis of a number of criteria exclusively related to translation (RBSs, tri-periodic nucleotide composition and protein sequence conservation). The seven genes were selected on the basis of three criteria: the presence of potential signal sequences in the corresponding peptides, proximity to genes encoding ABC transporters and two-component systems, and proximity to genes encoding proteins of known function. The first two criteria were motivated by our interest in peptides that might be involved in quorum-sensing functions. Although blpC, which encodes a bacteriocin-like inducer peptide precursor from a class II bacteriocin locus, fulfils these criteria, it was not selected because this locus is probably not functional in S. thermophilus LMG18311 and CNRZ1066 (Hols et al., 2005). We chose to compare the expression of the seven genes under different conditions in order to see whether some of them were similarly regulated, which would suggest a functional link between them. Instead, comparison of their expression in rich medium and CDM and during the exponential and stationary phases revealed distinct patterns of expression for all genes. One of our objectives is now to identify their functions. We have already started to study the consequences of inactivating g-1 in S. thermophilus in order to assess its potential involvement in iron metabolism.
Another experimental validation of our results came from the MS analysis of peptides purified from the cell envelope of S. thermophilus CNRZ1066. Three peptides were identified in this preliminary experiment, and all belonged to our final list of spCDSs supported by extrinsic evidence (Table 3). Besides two peptides corresponding to ribosomal proteins, one corresponded to a short CDS genetically linked to the gene miaA, which encodes a tRNA delta(2) isopentenylpyrophosphate transferase (Table 3). The presence of a signal peptide (predicted by SignalP v3.0) indicates that the encoded product of this short CDS is probably either secreted or anchored in the membrane.
During this study, one family of CDSs encoding SHPs was identified on the basis of the conservation of their context with genes encoding transcriptional regulators of the Rgg family. Although they were present in at least one copy in each Streptococcus species, none of them was annotated in GenBank. Most of them shared strong sequence similarities corresponding to E1 and E2 support, but E3 support was necessary to detect all members of this family. This example emphasizes the importance of our three extrinsic lines of evidence. SHPs have a conserved size of 22–23 aa, they always have at least one positively charged amino acid (lysine) in the N-terminal portion and one glycine in the C-terminal portion, and their sequence is mainly composed of hydrophobic amino acids interrupted by a negatively charged amino acid (glutamate or aspartate). Some of these features [size, one or more lysine(s) in the N-terminal portion, and an abundance of hydrophobic amino acids] are shared with the inhibitor pheromones involved in the regulation of conjugation in E. faecalis (Chandler & Dunny, 2004). However, the similarities between SHPs and inhibitor pheromones were too weak to be detected automatically by our analysis. Like shp, the inhibitor peptides are encoded upstream of and divergently from transcriptional regulators which nevertheless do not belong to the Rgg family. It has recently been demonstrated that one of these inhibitor pheromones (which are secreted and reimported) interacts directly with its regulator, PrgX (Kozlowicz et al., 2006). These common features of SHPs and the inhibitor pheromones of E. faecalis suggest that SHPs could be involved in quorum-sensing regulation. We studied in detail one shp–rgg locus which is specific to S. thermophilus strain LMD9. We demonstrated that the Rgg regulator positively controls the transcription of a gene (ster1357) that encodes a cyclic peptide, and that in the shp or an ami (oligopeptide transport system) knockout mutant, the transcription of ster1357 is drastically decreased (M. Ibrahim and others, unpublished results). These results confirm the biological role of at least one shp gene. Phylogenetic analysis of the Rgg-like regulators of streptococci identified two distinct groups of Rgg-like regulators associated with SHPs. The sequences of the corresponding SHPs were conserved within these two groups, thus reinforcing a functional link between SHPs and Rgg regulators. A challenge concerning this family of regulators will be to discover their targets. The rgg-like regulators that have been characterized so far are not associated with SHPs, and appear to positively regulate the transcription of at least the adjacent genes (Lyon et al., 1998; Qi et al., 1999; Rawlinson et al., 2002; Sanders et al., 1998; Vickerman & Minick, 2002). We can hypothesize that genes downstream of rgg genes and transcribed in the same orientation are putative targets. None of these genes encodes a protein of known function except mutM (fpg) and coaE, which encode a formamidopyrimidine-DNA glycosylase and a dephospho-CoA kinase, respectively. However, we identified three groups of proteins, SAM radical enzymes, exporters and small peptides, suggesting a functional link between these potential targets.
In conclusion, we present a reliable list of short genes discovered in streptococci and E. faecalis genomes. A preliminary study using these data has been performed on S. thermophilus. The results obtained suggest that this expression study should be extended to other streptococci, perhaps employing more global approaches such as microarray technology. This list could also be used as a valuable tool to check for the presence of short genes in defined loci.
| ACKNOWLEDGEMENTS |
|---|
Edited by: M. Kleerebezem
| REFERENCES |
|---|
Ajdic, D., McShan, W. M., McLaughlin, R. E., Savic, G., Chang, J., Carson, M. B., Primeaux, C., Tian, R., Kenton, S. & other authors (2002). Genome sequence of Streptococcus mutans UA159, a cariogenic dental pathogen. Proc Natl Acad Sci U S A 99, 14434–14439.
Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang, Z., Miller, W. & Lipman, D. J. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25, 3389–3402.
Banks, D. J., Porcella, S. F., Barbian, K. D., Beres, S. B., Philips, L. E., Voyich, J. M., DeLeo, F. R., Martin, J. M., Somerville, G. A. & Musser, J. M. (2004). Progress toward characterization of the group A Streptococcus metagenome: complete genome sequence of a macrolide-resistant serotype M6 strain. J Infect Dis 190, 727–738.
Beres, S. B., Sylva, G. L., Barbian, K. D., Lei, B., Hoff, J. S., Mammarella, N. D., Liu, M. Y., Smoot, J. C., Porcella, S. F. & other authors (2002). Genome sequence of a serotype M3 strain of group A Streptococcus: phage-encoded toxins, the high-virulence phenotype, and clone emergence. Proc Natl Acad Sci U S A 99, 10078–10083.
Beres, S. B., Richter, E. W., Nagiec, M. J., Sumby, P., Porcella, S. F., DeLeo, F. R. & Musser, J. M. (2006). Molecular genetic anatomy of inter- and intraserotype variation in the human bacterial pathogen group A Streptococcus. Proc Natl Acad Sci U S A 103, 7059–7064.
Besemer, J., Lomsadze, A. & Borodovsky, M. (2001). GeneMarkS: a self-training method for prediction of gene starts in microbial genomes. Implications for finding sequence motifs in regulatory regions. Nucleic Acids Res 29, 2607–2618.
Bolotin, A., Quinquis, B., Renault, P., Sorokin, A., Ehrlich, S. D., Kulakauskas, S., Lapidus, A., Goltsman, E., Mazur, M. & other authors (2004). Complete sequence and comparative genome analysis of the dairy bacterium Streptococcus thermophilus. Nat Biotechnol 22, 1554–1558.
Borodovsky, M., Rudd, K. E. & Koonin, E. V. (1994). Intrinsic and extrinsic approaches for detecting genes in a bacterial genome. Nucleic Acids Res 22, 4756–4767.
Bryson, K., Loux, V., Bossy, R., Nicolas, P., Chaillou, S., van de Guchte, M., Penaud, S., Maguin, E., Hoebeke, M. & other authors (2006). AGMIAL: implementing an annotation strategy for prokaryote genomes as a distributed system. Nucleic Acids Res 34, 3533–3545.
Chandler, J. R. & Dunny, G. M. (2004). Enterococcal peptide sex pheromones: synthesis and control of biological activity. Peptides 25, 1377–1388.
Felsenstein, J. (1989). PHYLIP – phylogeny inference package (version 3.2). Cladistics 5, 164–166.
Ferretti, J. J., McShan, W. M., Ajdic, D., Savic, D. J., Savic, G., Lyon, K., Primeaux, C., Sezate, S., Suvorov, A. N. & other authors (2001). Complete genome sequence of an M1 strain of Streptococcus pyogenes. Proc Natl Acad Sci U S A 98, 4658–4663.
Gardy, J. L., Laird, M. R., Chen, F., Rey, S., Walsh, C. J., Ester, M. & Brinkman, F. S. (2005). PSORTb v.2.0: expanded prediction of bacterial protein subcellular localization and insights gained from comparative proteome analysis. Bioinformatics 21, 617–623.
Gitton, C., Meyrand, M., Wang, J., Caron, C., Trubuil, A., Guillot, A. & Mistou, M. Y. (2005). Proteomic signature of Lactococcus lactis NCDO763 cultivated in milk. Appl Environ Microbiol 71, 7152–7163.
Glaser, P., Rusniok, C., Buchrieser, C., Chevalier, F., Frangeul, L., Msadek, T., Zouine, M., Couvé, E., Lalioui, L. & other authors (2002). Genome sequence of Streptococcus agalactiae, a pathogen causing invasive neonatal disease. Mol Microbiol 45, 1499–1513.
Green, N. M., Zhang, S., Porcella, S. F., Nagiec, M. J., Barbian, K. D., Beres, S. B., LeFebvre, R. B. & Musser, J. M. (2005). Genome sequence of a serotype M28 strain of group A Streptococcus: potential new insights into puerperal sepsis and bacterial disease specificity. J Infect Dis 192, 760–770.
Hamoen, L. W., Venema, G. & Kuipers, O. P. (2003). Controlling competence in Bacillus subtilis: shared use of regulators. Microbiology 149, 9–17.
Harrison, P. M., Carriero, N., Liu, Y. & Gerstein, M. (2003). A "polyORFomic" analysis of prokaryote genomes using disabled-homology filtering reveals conserved but undiscovered short ORFs. J Mol Biol 333, 885–892.
Hols, P., Hancy, F., Fontaine, L., Grossiord, B., Prozzi, D., Leblond-Bourget, N., Decaris, B., Bolotin, A., Delorme, C. & other authors (2005). New insights in the molecular biology and physiology of Streptococcus thermophilus revealed by comparative genomics. FEMS Microbiol Rev 29, 435–463.
Hoskins, J., Alborn, W. E., Jr, Arnold, J., Blaszczak, L. C., Burgett, S., DeHoff, B. S., Estrem, S. T., Fritz, L., Fu, D. J. & other authors (2001). Genome of the bacterium Streptococcus pneumoniae strain R6. J Bacteriol 183, 5709–5717.
Kimura, M. (1977). Preponderance of synonymous changes as evidence for the neutral theory of molecular evolution. Nature 267, 275–276.
Kleerebezem, M. (2004). Quorum sensing control of lantibiotic production; nisin and subtilin autoregulate their own biosynthesis. Peptides 25, 1405–1414.
Kozlowicz, B. K., Shi, K., Gu, Z. Y., Ohlendorf, D. H., Earhart, C. A. & Dunny, G. M. (2006). Molecular basis for control of conjugation by bacterial pheromone and inhibitor peptides. Mol Microbiol 62, 958–969.
Larsen, T. S. & Krogh, A. (2003). EasyGene – a prokaryotic gene finder that ranks ORFs by statistical significance. BMC Bioinformatics 4, 21.
Letort, C. & Juillard, V. (2001). Development of a minimal chemically-defined medium for the exponential growth of Streptococcus thermophilus. J Appl Microbiol 91, 1023–1029.
Livak, K. J. & Schmittgen, T. D. (2001). Analysis of relative gene expression data using real-time quantitative PCR and the 2^(−ΔΔCT) method. Methods 25, 402–408.
Lyon, G. J. & Novick, R. P. (2004). Peptide signaling in Staphylococcus aureus and other Gram-positive bacteria. Peptides 25, 1389–1403.
Lyon, W. R., Gibson, C. M. & Caparon, M. G. (1998). A role for trigger factor and an Rgg-like regulator in the transcription, secretion and processing of the cysteine proteinase of Streptococcus pyogenes. EMBO J 17, 6263–6275.
Martin, B., Quentin, Y., Fichant, G. & Claverys, J. P. (2006). Independent evolution of competence regulatory cascades in streptococci? Trends Microbiol 14, 339–345.
Nakagawa, I., Kurokawa, K., Yamashita, A., Nakata, M., Tomiyasu, Y., Okahashi, N., Kawabata, S., Yamazaki, K., Shiba, T. & other authors (2003). Genome sequence of an M3 strain of Streptococcus pyogenes reveals a large-scale genomic rearrangement in invasive strains and new insights into phage evolution. Genome Res 13, 1042–1055.
Nelson, B. L. (1995). Stochastic Modeling: Analysis and Simulation. Mineola, NY: Dover Publications.
Nicolas, P., Bize, L., Muri, F., Hoebeke, M., Rodolphe, F., Ehrlich, S. D., Prum, B. & Bessières, P. (2002). Mining Bacillus subtilis chromosome heterogeneities using hidden Markov models. Nucleic Acids Res 30, 1418–1426.
Nielsen, P. & Krogh, A. (2005). Large-scale prokaryotic gene prediction and comparison to genome annotation. Bioinformatics 21, 4322–4329.
Nielsen, H., Engelbrecht, J., Brunak, S. & von Heijne, G. (1997). Identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites. Protein Eng 10, 1–6.
Ochman, H. (2002). Distinguishing the ORFs from the ELFs: short bacterial genes and the annotation of genomes. Trends Genet 18, 335–337.
Paulsen, I. T., Banerjei, L., Myers, G. S., Nelson, K. E., Seshadri, R., Read, T. D., Fouts, D. E., Eisen, J. A., Gill, S. R. & other authors (2003). Role of mobile DNA in the evolution of vancomycin-resistant Enterococcus faecalis. Science 299, 2071–2074.
Pearson, W. R. (2000). Flexible sequence similarity searching with the FASTA3 program package. Methods Mol Biol 132, 185–219.
Pearson, W. R. & Lipman, D. J. (1988). Improved tools for biological sequence comparison. Proc Natl Acad Sci U S A 85, 2444–2448.
Pearson, W. R., Wood, T., Zhang, Z. & Miller, W. (1997). Comparison of DNA sequences with protein sequences. Genomics 46, 24–36.
Qi, F., Chen, P. & Caufield, P. W. (1999). Functional analyses of the promoters in the lantibiotic mutacin II biosynthetic locus in Streptococcus mutans. Appl Environ Microbiol 65, 652–658.
Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proc IEEE 77, 257–286.
Rawlinson, E. L., Nes, I. F. & Skaugen, M. (2002). LasX, a transcriptional regulator of the lactocin S biosynthetic genes in Lactobacillus sakei L45, acts both as an activator and a repressor. Biochimie 84, 559–567.
Sanders, J. W., Leenhouts, K., Burghoorn, J., Brands, J. R., Venema, G. & Kok, J. (1998). A chloride-inducible acid resistance mechanism in Lactococcus lactis and its regulation. Mol Microbiol 27, 299–310.
Schmidt, H. A., Strimmer, K., Vingron, M. & von Haeseler, A. (2002). TREE-PUZZLE: maximum likelihood phylogenetic analysis using quartets and parallel computing. Bioinformatics 18, 502–504.
Skovgaard, M., Jensen, L. J., Brunak, S., Ussery, D. & Krogh, A. (2001). On the total number of genes and their length distribution in complete microbial genomes. Trends Genet 17, 425–428.
Slamti, L. & Lereclus, D. (2005). Specificity and polymorphism of the PlcR–PapR quorum-sensing system in the Bacillus cereus group. J Bacteriol 187, 1182–1187.
Smith, T. F. & Waterman, M. S. (1981). Identification of common molecular subsequences. J Mol Biol 147, 195–197.
Smoot, J. C., Barbian, K. D., Van Gompel, J. J., Smoot, L. M., Chaussee, M. S., Sylva, G. L., Sturdevant, D. E., Ricklefs, S. M., Porcella, S. F. & other authors (2002). Genome sequence and comparative microarray analysis of serotype M18 group A Streptococcus strains associated with acute rheumatic fever outbreaks. Proc Natl Acad Sci U S A 99, 4668–4673.
Sumby, P., Porcella, S. F., Madrigal, A. G., Barbian, K. D., Virtaneva, K., Ricklefs, S. M., Sturdevant, D. E., Graham, M. R., Vuopio-Varkila, J. & other authors (2005). Evolutionary origin and emergence of a highly successful clone of serotype M1 group A Streptococcus involved multiple horizontal gene transfer events. J Infect Dis 192, 771–782.
Tettelin, H., Nelson, K. E., Paulsen, I. T., Eisen, J. A., Read, T. D., Peterson, S., Heidelberg, J., DeBoy, R. T., Haft, D. H. & other authors (2001). Complete genome sequence of a virulent isolate of Streptococcus pneumoniae. Science 293, 498–506.
Tettelin, H., Masignani, V., Cieslewicz, M. J., Eisen, J. A., Peterson, S., Wessels, M. R., Paulsen, I. T., Nelson, K. E., Margarit, I. & other authors (2002). Complete genome sequence and comparative genomic analysis of an emerging human pathogen, serotype V Streptococcus agalactiae. Proc Natl Acad Sci U S A 99, 12391–12396.
Tettelin, H., Masignani, V., Cieslewicz, M. J., Donati, C., Medini, D., Ward, N. L., Angiuoli, S. V., Crabtree, J., Jones, A. L. & other authors (2005). Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae: implications for the microbial "pan-genome". Proc Natl Acad Sci U S A 102, 13950–13955.
Thompson, J. D., Higgins, D. G. & Gibson, T. J. (1994). CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 22, 4673–4680.
Vickerman, M. M. & Minick, P. E. (2002). Genetic analysis of the rgg–gtfG junctional region and its role in Streptococcus gordonii glucosyltransferase activity. Infect Immun 70, 1703–1714.
Zuber, P. (2001). A peptide profile of the Bacillus subtilis genome. Peptides 22, 1555–1577.
Received 20 January 2007; revised 8 June 2007; accepted 22 June 2007.