Microbiology
Microbiology 153 (2007), 3631-3644; DOI  10.1099/mic.0.2007/006205-0
© 2007 Society for General Microbiology

A genome-wide survey of short coding sequences in streptococci

Mariam Ibrahim1,†, Pierre Nicolas2,†, Philippe Bessières2, Alexander Bolotin3, Véronique Monnet1 and Rozenn Gardan1

1 Unité de Biochimie Bactérienne, UR477, INRA, 78350 Jouy-en-Josas, France
2 Unité Mathématique Informatique et Génome, UR1077, INRA, 78350 Jouy-en-Josas, France
3 Unité de Génétique Microbienne, UR895, INRA, 78350 Jouy-en-Josas, France

Correspondence
Rozenn Gardan
rozenn.gardan@jouy.inra.fr


    ABSTRACT
 
Identification of short genes that encode peptides of fewer than 60 aa is challenging, both experimentally and in silico. As a consequence, the universe of these short coding sequences (CDSs) remains largely unknown, although some are acknowledged to play important roles in cell–cell communication, particularly in Gram-positive bacteria. This paper reports a thorough search for short CDSs across streptococcal genomes. Our bioinformatic approach relied on a combination of advanced intrinsic and extrinsic methods. In the first step, intrinsic sequence information (nucleotide composition and presence of RBSs) served to identify new short putative CDSs (spCDSs) and to eliminate the differences between annotation policies. In the second step, pseudogene fragments and false predictions were filtered out. The last step consisted of screening the remaining spCDSs for lines of extrinsic evidence involving sequence and gene-context comparisons. A total of 789 spCDSs across 20 complete genomes (19 Streptococcus and one Enterococcus) received the support of at least one line of extrinsic evidence, which corresponds to an average of 20 short CDSs per million base pairs. Most of these had no known function, and a significant fraction (31 %) are not even annotated as hypothetical genes in GenBank records. As an illustration of the value of this list, we describe a new family of CDSs, encoding very short hydrophobic peptides (20–23 aa) situated just upstream of some of the positive transcriptional regulators of the Rgg family. The expression of seven other short CDSs from Streptococcus thermophilus CNRZ1066 that encode peptides ranging in length from 41 to 56 aa was confirmed by real-time quantitative RT-PCR and revealed a variety of expression patterns. Finally, one peptide from this list, encoded by a gene that is not annotated in GenBank, was identified in a cell-envelope-enriched fraction of S. thermophilus CNRZ1066.


Abbreviations: CDS, coding sequence; CT, threshold cycle; HMM, hidden Markov model; LC-MS/MS, liquid chromatography–tandem MS; pHMM, probability of a ‘true positive’ prediction after decoding with HMM; SHP, short hydrophobic peptide; spCDS, short putative CDS

†These authors contributed equally to this work.

Two supplementary tables listing spCDSs of the genomes studied that passed the filtration of pseudogenes and false predictions, and predicted and annotated short CDSs supported by extrinsic evidence in one Enterococcus faecalis and 18 streptococcal genomes, and a supplementary figure showing the chromosomal context of CDSs encoding the SHPs associated with rgg genes, are available with the online version of this paper.


    INTRODUCTION
 
In Gram-positive bacteria, it is now well established that small peptides play significant roles, as they are involved in such diverse functions as secretion, stress response, metabolism and translation (Zuber, 2001). They are also involved in the signalling and regulation of gene expression in quorum-sensing-dependent processes. In this case, these peptides (also called pheromones) are secreted, and at some threshold concentration, either interact with the transmembrane receptors of two-component regulatory systems or are imported back via oligopeptide permease systems. In both cases, pheromones directly or indirectly activate intracellular regulators, which, in turn, modulate the expression of target genes. In Gram-positive bacteria, quorum sensing controls the expression of genes involved in the competence of Bacillus subtilis (Hamoen et al., 2003) and Streptococcus pneumoniae (Martin et al., 2006), the virulence of Bacillus cereus (Slamti & Lereclus, 2005) and Staphylococcus aureus (Lyon & Novick, 2004), the conjugation of Enterococcus faecalis (Chandler & Dunny, 2004) and the production of antimicrobial peptides in lactic acid bacteria (Kleerebezem, 2004).

All microbial genomes contain an abundance of small ORFs, potentially encoding peptides. However, the number of short coding sequences (CDSs) is still debated, and their identification and the precise localization of start codons remain the most difficult problems affecting bacterial CDS detection in silico (Harrison et al., 2003; Nielsen & Krogh, 2005; Ochman, 2002; Skovgaard et al., 2001). The short length of these CDSs probably affects both pillars of CDS prediction, namely intrinsic and extrinsic approaches (Borodovsky et al., 1994). Intrinsic approaches evaluate the coding potential of ORFs on the basis of nucleotide composition and the presence of RBSs. In this context, the short length of the ORFs limits the amount of information available, and the large number of small ORFs is expected to markedly increase the risk of false prediction (Larsen & Krogh, 2003). Extrinsic approaches rely on sequence comparisons. Here again, the small size of the ORFs potentially decreases the sensitivity of the comparisons. In addition, the sequence comparisons used in extrinsic analyses are often restricted to a search against annotated proteins and thus miss short CDSs which are unannotated in other genomes (Borodovsky et al., 1994; Pearson et al., 1997).

Besides the difficulty of in silico prediction of the short CDSs, the overrepresentation of genes of unknown function among the annotated short CDSs also seems to point to a deficit of biological knowledge regarding this class of genes (Zuber, 2001). The specific challenges posed by the experimental study of these genes and the corresponding peptides probably partly explain this situation. Classical hybridization methods such as Northern blotting often fail to detect their transcription as the corresponding RNAs are small, and gene disruption is also difficult because of the need for a double recombination event. Furthermore, in comparison with proteins, the resulting peptides are difficult to detect either by staining or by UV absorbance.

Taken together, the central role of small peptides in complex biological functions and the difficulty in identifying short CDSs both in silico and experimentally clearly call for a careful in silico study aimed at producing a reliable list of small genes that is as exhaustive as possible. In order to produce such a list, we implemented a novel three-step strategy that combined intrinsic and extrinsic approaches, to circumvent the limitations of both types of approach. We started by complementing genome annotations by the identification of new short putative CDSs (spCDSs). During the second step, pseudogene fragments and false predictions were filtered out. Finally, the remaining spCDSs were screened for three lines of extrinsic evidence involving sequence and gene context comparisons.

This strategy was applied to the genomes of the Streptococcus genus and Enterococcus faecalis V583 (Paulsen et al., 2003). The number of complete streptococcal genomes available is particularly high due to the medical importance of several species such as Streptococcus pneumoniae (Hoskins et al., 2001; Tettelin et al., 2001), Streptococcus pyogenes (Banks et al., 2004; Beres et al., 2002, 2006; Ferretti et al., 2001; Green et al., 2005; Nakagawa et al., 2003; Smoot et al., 2002; Sumby et al., 2005), Streptococcus agalactiae (Glaser et al., 2002; Tettelin et al., 2002, 2005) and Streptococcus mutans (Ajdic et al., 2002). Only one sequenced species, Streptococcus thermophilus (Bolotin et al., 2004), is classified as ‘generally recognized as safe’, and this species is of major importance to the food industry, since it is widely used for the manufacture of yoghurt and cheese. This dense taxon sampling increases the power of the extrinsic analyses and therefore made the genus particularly attractive for testing our strategy.

As an illustration of the utility of our list, we report the discovery of a new family of CDSs that encode hydrophobic peptides (20–23 aa) located just upstream of genes encoding transcriptional regulators of the Rgg (also called MutR) family. We also examined the transcription of seven genes of S. thermophilus, six selected from those with extrinsic evidence that could be linked to quorum sensing and one selected from those with only intrinsic evidence, using real-time quantitative RT-PCR. Finally, one peptide from the list, encoded by a gene that is not annotated in GenBank, was identified in a cell-envelope-enriched fraction of S. thermophilus CNRZ1066 by LC-MS/MS analysis.


    METHODS
 
Sequences and annotations.
Complete genome records of Streptococcus, available in GenBank as of July 2006, were downloaded from the NCBI website (http://www.ncbi.nlm.nih.gov/). With E. faecalis (strain V583; GenBank accession numbers AE016830–AE016833), a closely related outgroup, this represented 20 genomes from six species: S. agalactiae (strains 2603VR, A909 and NEM316; accession numbers AE009948, CP000114 and AL732656), S. mutans (strain UA159; accession number AE014133), S. pneumoniae (strains R6 and TIGR4; accession numbers AE007317 and AE005672), S. pyogenes (strains M1GAS, MGAS10270, MGAS10394, MGAS10750, MGAS2096, MGAS315, MGAS5005, MGAS6180, MGAS8232, MGAS9429 and SSI1; accession numbers AE004092, CP000260, CP000003, CP000262, CP000261, AE014074, CP000017, CP000056, AE009949, CP000259 and BA000034), S. thermophilus (strains CNRZ1066 and LMG18311; accession numbers CP000023 and CP000024). Two plasmids, pAM373 and pCF10 (accession numbers AE002565 and AY855841), from E. faecalis were also included.

spCDS detection based on intrinsic evidence.
BactgeneSHOW (options -m 4C_si -rbs m1 -duprev -cdst 0.1) was used to predict additional CDSs (Bryson et al., 2006). This program uses a hidden Markov model (HMM) that accounts for the presence of RBSs and four types of nucleotide composition of the coding sequences (see below). It also features fully unsupervised parameter estimation from the raw sequence (Nicolas et al., 2002) carried out in a maximum-likelihood framework with the iterative expectation–maximization algorithm (Rabiner, 1989). Each prediction comes with a confidence measure, referred to here as pHMM, that corresponds to the probability for the ORF to be a CDS (i.e. a ‘true positive’ prediction) as computed after posterior decoding with the HMM. To ensure the homogeneity of the set of spCDSs, predicted start codons were preferred over annotated start codons when these did not match (annotation policies sometimes strongly favour the most upstream ATG triplet; Besemer et al., 2001).

Identification of pseudogenes and false predictions.
Protein–protein comparisons used the exact Smith–Waterman algorithm implemented in SSEARCH with default parameters (FASTA3.4 package; Pearson & Lipman, 1988; Smith & Waterman, 1981). DNA–protein comparisons involved both strands of the spCDS plus an additional 300 bp on each side and used FASTX (FASTA3.4; Pearson, 2000), a program that allows frameshifts and stop codons in the alignments. Only FASTX alignments that encompassed the middle of the spCDSs were incorporated in our analysis. Both SSEARCH and FASTX comparisons were conducted separately for each pair of genomes; E-values were therefore multiplied by the number of comparisons (20) to account for multiple testing.
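The two filtering rules in this paragraph (correction of E-values for the 20 comparisons, and the requirement that a FASTX alignment covers the middle of the spCDS) can be sketched as follows. This is only an illustration: the function names and the inclusive coordinate convention are assumptions made here and are not part of the FASTA package or of the original analysis scripts.

```python
def corrected_evalue(evalue, n_comparisons=20):
    """Scale a per-genome-pair E-value by the number of comparisons."""
    return evalue * n_comparisons


def covers_middle(aln_start, aln_end, cds_start, cds_end):
    """Return True if the alignment interval contains the spCDS midpoint.

    All coordinates are assumed to be on the same strand and inclusive.
    """
    midpoint = (cds_start + cds_end) / 2
    return aln_start <= midpoint <= aln_end


if __name__ == "__main__":
    # Hypothetical hit: raw E-value 4e-4, alignment 250..400, spCDS 301..420.
    print(corrected_evalue(4e-4))
    print(covers_middle(250, 400, 301, 420))
```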

Statistical assessment of extrinsic evidence.
P-E1 assesses the statistical confidence in extrinsic evidence E1 that corresponds to protein sequence similarity between two spCDSs. P-E1 is the E-value reported by SSEARCH after correction for multiple comparisons (genome pairs were treated separately).

P-E2 assesses the statistical confidence in extrinsic evidence E2 that corresponds to protein sequence similarity between a query spCDS and another spCDS of a similar length (allowing a 10 aa difference) in a similar gene context. Two spCDSs in the same position relative to two nearby reference genes that are homologous were considered to be in a similar gene context. The position of the spCDS relative to the reference gene was categorized into one of four possible configurations depending on the location (upstream or downstream) and the direction (same or opposite) of the spCDS. Each CDS annotated in GenBank with an extremity less than 500 bp away from an spCDS was considered as a possible reference for this spCDS. The set of homologous genes for each reference gene was constructed using BLASTP (version 2.2.14) (Altschul et al., 1997), imposing an E-value ≤10^–6 and an alignment spanning at least 70 % of both the query and the subject. Pair-wise protein-sequence similarity between spCDSs in a similar gene context was quantified using PRSS (FASTA3.4). P-E2 is an E-value that accounts for both protein sequence similarity and gene context similarity. It is obtained as the P-value reported by PRSS multiplied by the expected number of comparisons with other spCDSs in a similar gene context. To compute this expected number, spCDSs were modelled as randomly distributed across the intergenic regions of the genome. Specifically, we used a homogeneous Poisson process (Nelson, 1995) to model the occurrences of the spCDSs across the regions not occupied by annotated or predicted CDSs more than 10 aa longer than the query sequence. The intensity of the Poisson process was estimated separately for each spCDS length range and each genome as the ratio between the number of spCDSs in the genome and the total number of available positions for the spCDSs to occur (maximum-likelihood estimate).
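A minimal sketch of the quantities described above is given below. It classifies an spCDS into one of the four gene-context configurations, estimates the Poisson intensity as a simple ratio, and scales a PRSS P-value by the expected number of comparisons. The function names, the way 'upstream' is derived from the strand of the reference gene, and the summary argument context_positions (the number of available positions in the matching contexts) are illustrative assumptions, not details taken from the text.

```python
def gene_context(sp_start, sp_strand, ref_start, ref_end, ref_strand):
    """Return (location, direction) of an spCDS relative to a reference gene.

    Assumption: 'upstream' means on the 5' side of the reference gene,
    so it depends on the strand of the reference gene.
    """
    if ref_strand == "+":
        location = "upstream" if sp_start < ref_start else "downstream"
    else:
        location = "upstream" if sp_start > ref_end else "downstream"
    direction = "same" if sp_strand == ref_strand else "opposite"
    return location, direction


def poisson_intensity(n_spcds, n_available_positions):
    """Maximum-likelihood intensity: spCDS count per available position."""
    return n_spcds / n_available_positions


def p_e2(prss_pvalue, intensity, context_positions):
    """PRSS P-value times the expected number of spCDSs in similar contexts."""
    expected_comparisons = intensity * context_positions
    return prss_pvalue * expected_comparisons


if __name__ == "__main__":
    lam = poisson_intensity(200, 400_000)  # hypothetical counts
    print(gene_context(100, "+", 500, 1400, "+"))
    print(p_e2(1e-4, lam, 2000))
```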

Extrinsic evidence E1 and E2 also involved testing the null hypothesis of random distribution of the DNA differences across the three codon positions. Pair-wise DNA alignments were deduced from pair-wise protein alignments, and only those alignments with fewer differences in the second rather than the third codon position were further analysed. The statistical test was a variant of the Fisher exact test constructed as follows. Let n1, n2 and n3 denote the number of differences in the first, second and third codon positions across the pair-wise alignment and N be the total number of aligned codons. In addition, let P(x,y,z) be the probability of observing x differences in the first codon position, y in the second and z in the third, under the null hypothesis given the value of x+y+z. P(x,y,z) is computed as:

$$P(x,y,z)=\frac{\binom{N}{x}\binom{N}{y}\binom{N}{z}}{\binom{3N}{x+y+z}}$$
The P-value is the sum of P(x,y,z) over the triplets of integers (x,y,z) such that x+y+z=n1+n2+n3, y ≤ z and P(x,y,z)≤P(n1,n2,n3).
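The test can be sketched in a few lines of Python. The sketch assumes that, under the null hypothesis, the differences follow a multivariate hypergeometric distribution over the three codon positions (N sites per position, total number of differences fixed); this form is an assumption consistent with the definitions above rather than a confirmed transcription of the published formula. Function names are hypothetical.

```python
from math import comb


def prob(x, y, z, n_codons):
    """Null probability of (x, y, z) differences at codon positions 1, 2, 3.

    Assumed multivariate hypergeometric form, conditional on x + y + z.
    """
    return (comb(n_codons, x) * comb(n_codons, y) * comb(n_codons, z)
            / comb(3 * n_codons, x + y + z))


def codon_position_pvalue(n1, n2, n3, n_codons):
    """Sum P(x,y,z) over triplets with the observed total, y <= z,
    and probability no larger than that of the observed triplet."""
    total = n1 + n2 + n3
    p_obs = prob(n1, n2, n3, n_codons)
    pvalue = 0.0
    for x in range(min(total, n_codons) + 1):
        for y in range(min(total - x, n_codons) + 1):
            z = total - x - y
            if z > n_codons or y > z:
                continue
            p = prob(x, y, z, n_codons)
            # Small relative tolerance so ties with p_obs are included.
            if p <= p_obs * (1 + 1e-9):
                pvalue += p
    return pvalue


if __name__ == "__main__":
    # Hypothetical alignment of 20 codons with 2, 0 and 10 differences.
    print(codon_position_pvalue(2, 0, 10, 20))
```

Differences concentrated at the third codon position, as in the hypothetical call above, give a small P-value, whereas an even spread across positions does not.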

P-E3 assesses the statistical confidence in evidence E3 that corresponds to a higher number of spCDSs of a particular size (allowing a 10 aa difference) in a particular gene context than that expected by chance. As for E2, the gene context is assessed by the position of an spCDS relative to a reference gene and the occurrences of spCDSs are modelled as randomly distributed across intergenic regions. P-E3 is a P-value based on the distribution of the sum of independent but non-identically distributed Bernoulli variables, where each variable corresponds to the presence or absence of at least one spCDS, and the sum runs over the different homologues of the reference gene. In order to avoid counting spCDSs found on very closely related DNA sequences as distinct occurrences, only homologous reference genes that diverged to the extent that they were less than 85 % identical at the third codon position were taken into account.
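The distribution of a sum of independent, non-identically distributed Bernoulli variables (often called a Poisson-binomial distribution) can be computed exactly by dynamic programming. The sketch below is one possible way to obtain an upper-tail P-value of this kind; the function names are hypothetical, and deriving each presence probability from the Poisson model as 1 – exp(–intensity × positions) is an assumption made here for illustration.

```python
from math import exp


def presence_probability(intensity, window_positions):
    """Chance of at least one spCDS in a window under a Poisson model."""
    return 1 - exp(-intensity * window_positions)


def poisson_binomial_tail(probs, k_observed):
    """P(sum of independent Bernoulli(p_i) >= k_observed)."""
    dist = [1.0]  # dist[k] = probability of exactly k successes so far
    for p in probs:
        new = [0.0] * (len(dist) + 1)
        for k, mass in enumerate(dist):
            new[k] += mass * (1 - p)
            new[k + 1] += mass * p
        dist = new
    return sum(dist[k_observed:])


if __name__ == "__main__":
    # Hypothetical: five homologues, spCDS observed next to four of them.
    probs = [presence_probability(0.0005, w) for w in (200, 300, 250, 400, 150)]
    print(poisson_binomial_tail(probs, 4))
```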

Phylogenetic analysis of the Rgg family.
The seven genes annotated as being representatives of the MutR family in S. thermophilus CNRZ1066 and their homologues in S. agalactiae A909, S. mutans UA159, S. pyogenes MGAS10270 and S. pneumoniae R6 were collected (alignments encompassing more than 70 % of both the query and the subject and E-value ≤10^–6 in a BLASTP comparison). The multiple protein sequence alignment was created with CLUSTALW1.83 (default parameters) (Thompson et al., 1994) and the positions with gaps were removed. Tree reconstruction was carried out with PROTDIST and NEIGHBOR (PHYLIP3.6 package; Felsenstein, 1989) using a Jones, Taylor and Thornton (JTT) model of protein evolution and a gamma model of rate variations across sites. The coefficient of variation of the gamma distribution was set to 0.52, the value estimated by TREE-PUZZLE5.2 (Schmidt et al., 2002) on this particular dataset. Bootstrap confidence values were computed on 1000 replicates created using SEQBOOT (PHYLIP3.6).

Growth conditions and bacterial strain.
The S. thermophilus CNRZ1066 strain used in this study was derived from the Institut National de la Recherche Agronomique–Centre National de Recherches Zootechniques (CNRZ) bacteria collection. Depending on the experiment, S. thermophilus CNRZ1066 was grown in M17 (Difco) supplemented with 10 g lactose l^–1 (M17lac) or in a chemically defined medium (CDM; Letort & Juillard, 2001). The cultures were incubated at 42 °C. OD600 of the cultures was measured using a spectrophotometer (Uvikon 931, Kontron).

RNA extraction and reverse transcription.
Total RNA was extracted from cells grown in M17lac or CDM. Cells were harvested during exponential (OD600 1.2) and stationary phases of growth (OD600 2.3 in CDM or OD600 2.6 in M17lac) by centrifugation (5000 g for 1 min at 4 °C) and stored at –80 °C. Bacterial pellets were resuspended in 400 µl buffer containing 10 % glucose, 12.5 mM Tris-HCl (pH 7.6) and 70 mM EDTA, and transferred into tubes containing 500 µl acid phenol (pH 4.5) and 0.6 g 0.1 mm diameter glass beads (Sigma). Cells were subjected to mechanical disruption with a Fastprep apparatus (two 30 s cycles of homogenization at maximum speed with 1 min intervals on ice). After a centrifugation step (19 000 g for 10 min at 4 °C), RNA was purified by TRIzol/chloroform/isoamyl alcohol (Invitrogen) and chloroform/isoamyl alcohol extractions, and 2-propanol precipitation. RNA samples were resuspended in 50 µl TE (10 mM Tris, 1 mM EDTA, pH 7.6) and treated with DNase I (Ambion) as recommended by the manufacturer in order to eliminate contaminating DNA. Total RNA concentrations were determined by measuring A260 using a spectrophotometer (BioPhotometer, Eppendorf). RNA integrity was checked with an Agilent 2100 bioanalyser. Three extractions were performed independently for each condition. Furthermore, cDNA synthesis was performed from 1 µg RNA using M-MLV reverse transcriptase (Invitrogen) according to the manufacturer's protocol.

Real-time RT-PCR.
The primers used in this study are listed in Table 1. The primers were designed using Primer Express (version 2.0; Applied Biosystems). The length of PCR products ranged from 69 to 150 bp. The efficiency of amplification was determined by running a standard curve with serial dilutions of PCR products encompassing each target gene. Efficiency (E) was calculated using the formula E=[10^(–1/s)–1]×100, where s is the slope of the regression line obtained for each pair of primers. Real-time PCR was carried out with 25 ng reverse-transcribed cDNA using the SYBR Green PCR Master Mix (Applied Biosystems), as recommended by the manufacturer. A 200 nM concentration of primers was used, except for the amplification of gene 3, where 100 nM was used to avoid dimer formation. PCR reactions were run on the ABI Prism 7700 sequence detector (Perkin–Elmer Applied Biosystems) under the following conditions: Taq polymerase activation at 95 °C for 10 min, and then 40 cycles of 15 s at 95 °C and 1 min at 60 °C. A melting-curve analysis was performed at the end of each run for all primer sets. This resulted in single-product-specific melting curves, and no primer-dimers were generated during the runs. A ‘no-template control’ (replacing cDNAs with distilled H2O) and an ‘RT-negative control’ (replacing cDNAs with RNA samples which had not undergone the reverse transcription step) were included in each run in order to confirm the absence of contaminating DNA. SYBR Green PCRs were performed in triplicate, and for each condition the experiments were repeated independently in triplicate. Data were either recorded as the threshold cycle (CT) and expressed as mean±SD or computed using the comparative critical threshold method (2^–ΔΔCT) described by Livak & Schmittgen (2001). In this case, ldhL was used to normalize data.
For each gene, an analysis of variance was performed on the CT corrected by the CT of ldhL in order to determine whether the relative expression levels between two conditions were significantly different (P<0.05).
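The two calculations used for the real-time PCR data, amplification efficiency from the slope of the standard curve and relative expression by the comparative CT method of Livak & Schmittgen (2001), can be illustrated as follows. The function names and the CT values in the example are hypothetical; the reference gene plays the role of ldhL.

```python
def amplification_efficiency(slope):
    """Efficiency in percent from the standard-curve slope (CT vs log10 quantity)."""
    return (10 ** (-1 / slope) - 1) * 100


def relative_expression(ct_target_test, ct_ref_test, ct_target_ctrl, ct_ref_ctrl):
    """Fold change of the target gene by the comparative CT method.

    Each CT is first normalized to the reference gene, and the test
    condition is then compared with the control condition.
    """
    delta_test = ct_target_test - ct_ref_test
    delta_ctrl = ct_target_ctrl - ct_ref_ctrl
    return 2 ** (-(delta_test - delta_ctrl))


if __name__ == "__main__":
    # A slope close to -3.32 corresponds to roughly 100 % efficiency.
    print(round(amplification_efficiency(-3.32), 1))
    # Hypothetical CT values: target 22 vs 24, reference 18 in both conditions.
    print(relative_expression(22, 18, 24, 18))
```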


Table 1. Primers used for real-time quantitative RT-PCR

 
Purification of cell-envelope-associated peptides.
Proteins were extracted from cells grown and harvested under the same conditions as described for RNA extraction. Cell-envelope-associated protein extracts were prepared as described elsewhere (Gitton et al., 2005). Proteins were separated by 1D electrophoresis. Cell-envelope-associated proteins (20 µg) resuspended in 20 % (v/v) glycerol, 25 mM DTT, 2 % SDS, 50 mM Tris, pH 6.8, 0.1 % bromophenol blue were loaded on a denaturing SDS-PAGE gel (NuPage, 4–12 % acrylamide; Invitrogen). The gel was stained with Blue Safe Stain (Invitrogen) and washed twice with water. For each sample, the portion of the gel containing proteins with molecular masses lower than 10 kDa was removed and washed twice with 50 mM NH4HCO3, 50 % acetonitrile. The gel pieces were dried at room temperature, digested overnight at 37 °C with 1 µg trypsin (Promega), and resuspended in 50 mM NH4HCO3. Peptides released into the buffer were pooled with peptides extracted by three additional steps: one with 40 µl 50 mM NH4HCO3, one with 40 µl 1 % formic acid, 50 % acetonitrile, and one with 40 µl 1 % formic acid, 90 % acetonitrile. The pools of peptides were dried in a Speed-Vacuum concentrator for 1 h, and then resolubilized in 25 µl loading buffer containing 0.08 % trifluoroacetic acid and 2 % acetonitrile in water, prior to liquid chromatography–tandem MS (LC-MS/MS) analysis.

LC-MS/MS analysis and database searching.
LC-MS/MS analysis was performed with an Ultimate 3000 LC system (Dionex) connected to a linear ion trap mass spectrometer (LTQ, Thermo Fisher) by nanoelectrospray. Tryptic peptide mixtures (4 µl) were loaded at a flow rate of 20 µl min^–1 onto Pepmap C18 precolumns (0.3×5 mm, 10 nm, 5 µm; Dionex). After 4 min, the precolumn was connected to the Pepmap C18 separating nanocolumn (0.075 mm diameter × 150 mm length, 10 nm internal diameter, 3 µm particle size), and the linear gradient from 2 to 36 % buffer B (0.1 % formic acid, 80 % acetonitrile) in buffer A (0.1 % formic acid, 2 % acetonitrile) at 300 nl min^–1 over 90 min was started. Ionization was performed on the liquid junction with a spray voltage of 1.3 kV applied to a non-coated capillary probe (PicoTip EMITTER 10 µm tip ID; New Objective). Peptide ions were analysed by the Nth-dependent method as follows: (i) full MS scan (m/z 300–1500), (ii) ZoomScan (scan of the three major ions), (iii) MS/MS on these three ions.

All protein identification was performed with Bioworks 3.3 (Thermo Fisher). All peaks, lists of precursors and fragment ions were matched automatically against a database which compiled GenBank annotations of S. thermophilus CNRZ1066 and new predictions. The Bioworks search parameters were: trypsin-specificity with one missed cleavage, oxidation variable for methionine, and mass tolerance fixed to 1.4 Da for precursor ions and 0.5 Da for fragment ions. The search results were filtered using Bioworks 3.3. A multiple-threshold filter applied at the peptide level consisted of the following criteria: Xcorr magnitude up to 1.7, 2.5 and 3.0 for mono-, di- and tri-charged peptides, respectively, peptide probability lower than 0.01, ΔCn greater than 0.1.


    RESULTS
 
spCDSs indicated by intrinsic evidence
The number of annotated CDSs encoding products shorter than 60 aa exhibits extensive variation across the 20 GenBank records. It ranges from 17 CDSs per Mbp in S. agalactiae NEM316 to 96 in S. pneumoniae TIGR4. Although some of the variations probably reflect the underlying biology, differences between strains in the same species suggest that the primary source of variations is annotation policies (17 vs 56 CDSs per Mbp when comparing the S. agalactiae strains NEM316 and 2603VR, 49 vs 96 when comparing S. pneumoniae strains R6 and TIGR4). Ab initio gene prediction independent of GenBank annotation was carried out in order to equalize these differences and generate a reasonably comprehensive pool of spCDSs. A low-probability cut-off was adopted to minimize the number of true short CDSs missed at this stage of the analysis. We considered as spCDSs all those predictions longer than 10 aa that were calculated to have more than a one in ten chance of being true positives (pHMM ≥0.1).
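As a simple illustration of this selection rule, the sketch below keeps predictions longer than 10 aa with pHMM ≥0.1. The upper bound of 60 aa is taken from the definition of spCDSs used in the rest of the paper; the function name and the example records are hypothetical.

```python
def is_spcds(length_aa, p_hmm, min_len=10, max_len=60, p_cutoff=0.1):
    """Keep predictions longer than min_len aa, at most max_len aa,
    with a posterior probability of at least p_cutoff."""
    return min_len < length_aa <= max_len and p_hmm >= p_cutoff


if __name__ == "__main__":
    # Hypothetical predictions: (identifier, length in aa, pHMM).
    predictions = [("orfA", 23, 0.45), ("orfB", 9, 0.90), ("orfC", 55, 0.05)]
    kept = [name for name, length, p in predictions if is_spcds(length, p)]
    print(kept)
```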

GenBank annotations and new predictions were compiled into a set of non-redundant spCDSs (≤60 aa) whose number ranged from 109 to 335 per Mbp. Table 2, column Step (1), gives the results for the 20 genomes. The highest number and density of spCDSs was found in S. thermophilus genomes, probably reflecting the distinctively high number of pseudogene fragments in this species (Bolotin et al., 2004).


Table 2. Three-step strategy to identify short CDSs

 
Filtering pseudogenes and false predictions
The set of spCDSs was contaminated by pseudogene fragments and false-prediction artefacts. The second stage of the analysis consisted of removing as many of these contaminants as possible. Pseudogene fragments are a particular problem if they are not filtered out: they cause protein matches that can be misinterpreted as extrinsic support for the CDSs (Harrison et al., 2003). We used protein–protein or DNA–protein comparisons to remove all spCDSs with significant matches to protein products longer than 100 aa (E-value ≤0.01).
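The filtering criterion can be expressed as a short predicate. The sketch below assumes that each hit has already been reduced to the length of the matched protein and its corrected E-value; the dictionary keys and the function name are hypothetical.

```python
def is_contaminant(hits, min_subject_len=100, e_cutoff=0.01):
    """True if any hit matches a protein longer than min_subject_len aa
    with an E-value at or below e_cutoff."""
    return any(
        hit["subject_len"] > min_subject_len and hit["evalue"] <= e_cutoff
        for hit in hits
    )


if __name__ == "__main__":
    # Hypothetical hits for two spCDSs.
    fragment_hits = [{"subject_len": 412, "evalue": 1e-8}]
    short_gene_hits = [{"subject_len": 48, "evalue": 1e-12}]
    print(is_contaminant(fragment_hits), is_contaminant(short_gene_hits))
```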

The fraction of spCDSs eliminated by this systematic filtering varied across genomes, from 19 % in E. faecalis V583 to 50 % in S. thermophilus CNRZ1066. Table 2, column Step (2), lists the number of CDSs present after this step of the analysis, and the list of genes is given in Supplementary Table S1. As an indication of the high sensitivity of our filter, all 62 spCDSs annotated as pseudogene fragments in S. thermophilus CNRZ1066 were discarded after this step of the analysis.

Screening for extrinsic evidence
The final stage of the strategy to identify short CDSs consisted of screening the remaining spCDSs for support by at least one of three lines of extrinsic evidence in which we could be confident. These three lines were: conserved protein sequence (E1); similar gene context and conserved protein sequence (E2); and a significantly high number of spCDSs of a particular size in particular gene context (E3). Both E1 and E2 are based on a Smith–Waterman search against similarly sized spCDSs, allowing a difference in length of up to 10 aa, but differ in terms of the set of candidates against which the search is conducted. In E1, all similarly sized spCDSs are considered. In E2, the search is restricted to those spCDSs of similar size in the same gene context as characterized by their location (either upstream or downstream) and by the strand (either identical or opposite), relative to a reference gene. The idea behind this distinction is that E2 may be more sensitive for finding weak similarities between short CDSs by reducing the number of comparisons. Importantly, significant sequence similarities identified in E1 and E2 at the protein level were verified as not simply being the result of the evolutionary proximity between DNA sequences (Ochman, 2002). As sequence conservation is expected to depend on codon position in coding sequences under purifying selection (Kimura, 1977), DNA sequence divergence was required to be significantly influenced by codon position (P≤0.05) or to reach 40 % over the whole alignment. The different lines of extrinsic evidence were evaluated separately, and evidence was considered to be found when its associated E- or P-value (P-E1, P-E2 or P-E3) was below 0.001, a choice that balanced high sensitivity with very few expected false positives.
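The decision rule for E1 and E2 combines two conditions, which the following sketch makes explicit. It assumes that the similarity value, the codon-position P-value and the DNA divergence (as a fraction) have already been computed for a given pair of spCDSs; the function name and argument names are hypothetical.

```python
def supports_e1_or_e2(p_similarity, p_codon, dna_divergence,
                      evidence_cutoff=0.001, codon_cutoff=0.05,
                      divergence_cutoff=0.40):
    """True if protein similarity is significant and is not merely due
    to close DNA relatedness.

    p_similarity   -- P-E1 or P-E2 for the pair
    p_codon        -- P-value of the codon-position test
    dna_divergence -- fraction of differing positions over the alignment
    """
    significant = p_similarity < evidence_cutoff
    not_just_proximity = (p_codon <= codon_cutoff
                          or dna_divergence >= divergence_cutoff)
    return significant and not_just_proximity


if __name__ == "__main__":
    # Hypothetical pair: strong similarity but nearly identical DNA
    # and no codon-position signal, so no extrinsic support is granted.
    print(supports_e1_or_e2(1e-6, 0.4, 0.02))
```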

The results of this screening are shown in Table 2, column Step (3). Over the 20 genomes, a total of 789 spCDSs were supported by at least one of the three extrinsic lines of evidence, corresponding to an average rate of 20 per Mbp. This rate was remarkably stable, as it ranged from 15 in E. faecalis V583 to 25 in S. pyogenes MGAS10394. Among the 789 spCDSs, only 5.5 % were missed by our ab initio prediction but annotated in GenBank. The fraction of spCDSs supported by extrinsic evidence and unannotated in GenBank exhibited extensive variations across the 20 genomes (from 61 % in S. thermophilus CNRZ1066 to 11 % in S. pneumoniae TIGR4) and corresponded to 31 % of the 789 spCDSs. As expected, this proportion decreased when the length of the CDS increased, being as high as 67 % of 204 for those shorter than 40 aa and as low as 14 % of 374 for those with lengths ranging from 50 to 60 aa.

Interestingly, the presence/absence of annotated CDSs in any of the GenBank records never closely matched the set of short CDSs with extrinsic support: either many short CDSs were annotated and most of these were not supported by extrinsic evidence, or a few short CDSs were annotated and the annotation missed an important fraction of the short CDSs with extrinsic support (Fig. 1). For instance, the 33 S. thermophilus CNRZ1066 annotated spCDSs had one of the highest percentages of extrinsic support (42 %) but also had the lowest coverage of the spCDSs with extrinsic evidence. Conversely, the 200 annotated spCDSs of S. pneumoniae TIGR4 had the lowest rate of extrinsic support (16 %) but covered the highest fraction of supported short CDSs.


 
Fig. 1. GenBank annotations and extrinsic support. For each GenBank record, the fraction of annotated CDSs receiving extrinsic support is plotted against the fraction of the spCDSs with extrinsic support covered by the annotation. A clear negative correlation between the two quantities is visible (Spearman's rho=–0.81, P<10^-5).
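For readers who wish to recompute a rank correlation of the kind reported in the caption of Fig. 1, the following dependency-free Python sketch may help. It is not the authors' code; the function names are hypothetical, and the per-genome fractions would have to be taken from Table 2 or Fig. 1, as they are not reproduced here. Spearman's rho is computed as the Pearson correlation of the ranks, with tied values given their average rank.

```python
from math import sqrt
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rank correlation: Pearson correlation of the two rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in rx))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in ry))
    return cov / (sd_x * sd_y)
```

Here `x` would hold, for each genome, the fraction of annotated CDSs with extrinsic support, and `y` the fraction of supported spCDSs covered by the annotation. The associated P-value is not computed in this sketch.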

 
As another point of comparison, we used the CDSs annotated in GenBank with functional annotation (having at least a gene name) in the 19 streptococcal genomes recorded in GenBank (CDSs annotated as putative or hypothetical CDSs were discarded) and checked if they were supported by extrinsic lines of evidence. Short genes encoding ribosomal proteins were all supported by extrinsic lines of evidence, except the abnormally short gene rpmF in S. pyogenes M1GAS. The other genes encode bacteriocins (blpO, srtA and salA), a cytolytic toxin (sagA), pheromones (blpC, also annotated ip, and pep27), a 4-oxalocrotonate tautomerase (xylH; also annotated xylM), a subunit of the protein translocation complex (secE) and peptides induced during competence (orf51 and orf47). blpC or ip, blpO, xylH or xylM and secE were supported by extrinsic lines of evidence, while sagA, srtA, salA, pep27, orf51 and orf47 were not. The comC gene received extrinsic support in S. mutans but not in S. pneumoniae. This clearly indicates that the list of short genes with extrinsic support cannot be considered as an exhaustive list of genes encoding functional peptides.

Table 3 lists the 35 spCDSs that received extrinsic support in the S. thermophilus CNRZ1066 genome. Among them, five were found to belong to a new family of CDSs that encode short hydrophobic peptides (SHPs), and are described below. Six other CDSs (plus another that did not receive extrinsic support) were selected for transcriptional analysis. Supplementary Table S2 gives the same information as Table 3 for the other 19 genomes.


 
Table 3. Thirty-five predicted and annotated short CDSs supported by extrinsic evidence in S. thermophilus CNRZ1066

 
Identification of a new family of CDSs encoding SHPs
One of the most striking results of short CDS detection in the streptococcal genomes presented above concerned the identification of a new family of CDSs that encode SHPs. As shown in Table 3 and Supplementary Table S2, all these peptides received at least E3 support, corresponding to a conserved genetic location upstream of and divergent from genes encoding transcriptional regulators of the Rgg (MutR) family. One copy was found in S. mutans, in S. pneumoniae strain R6 and in the three strains of S. agalactiae. Two copies were found in all S. pyogenes strains and five in both strains of S. thermophilus. An analysis of the context of all the genes encoding the SHP–Rgg pairs (see Supplementary Fig. S1) indicated that the surrounding genes were not conserved. Genes located upstream of the shp genes were transcribed in both orientations and encoded proteins with unknown functions or functions linked to transcription (nifR), translation (syfB and tyrSE), amino acid metabolism (aroE2) and cell envelope biogenesis (rffD). Genes downstream from the rgg genes were, in most cases, transcribed in the same orientation as the rgg genes. The functions of the corresponding proteins are unknown, but some shared features can be highlighted: three proteins belong to the S-adenosylmethionine (SAM) radical enzymes family, seven are efflux transporters and four are small peptides. In order to visualize the evolutionary relationship between genes encoding SHPs and those encoding Rgg regulators, a phylogenetic tree (Fig. 2) was constructed using all sequences of Rgg regulators detected in the streptococcal genomes analysed in this study. The Rgg regulators associated with SHPs (SHP–Rgg) could be classified into two groups. Group I corresponded to the SHP–Rgg of S. agalactiae and S. pneumoniae, the two SHP–Rgg of S. pyogenes and one of S. thermophilus. As shown in Fig. 2, there were marked similarities between the SHPs in this group. All of them received support from the three extrinsic lines of evidence, except one from S. pyogenes encoded by a gene located upstream of spy0441, for which E1 and E2 were not calculated because of very great similarities at the DNA level with the SHP from S. pneumoniae. Group II corresponded to the SHP–Rgg of S. mutans and three genes from S. thermophilus. In this group, the similarities between the SHPs were marked, but less so than in group I, and corresponded to heterogeneity in extrinsic support: the SHPs of S. thermophilus received E2 and E3 support, and sometimes E1 support as well, and the SHP of S. mutans received only E3 support. One SHP–Rgg regulator in S. thermophilus could not be classified. The corresponding SHP exhibited weak similarities to the other SHPs and was only supported by E3 evidence.
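The gene context shared by the shp genes, "upstream of and divergent from" an rgg gene, can be made explicit with a small Python sketch. It is an illustration under stated assumptions rather than the procedure used in this study: the `Gene` class, the function names and the coordinates in the usage note are hypothetical, and genes are assumed to be described by their leftmost and rightmost genome coordinates plus a strand.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Gene:
    """Hypothetical gene record: leftmost/rightmost coordinates and strand."""
    start: int
    end: int
    strand: str  # "+" or "-"


def gene_context(short_cds: Gene, reference: Gene) -> Tuple[str, str]:
    """Describe a short CDS relative to a reference gene.

    Returns (location, orientation), where location is "upstream" or
    "downstream" with respect to the direction of transcription of the
    reference gene, and orientation is "identical" or "opposite" strand.
    """
    if short_cds.end < reference.start:
        location = "upstream" if reference.strand == "+" else "downstream"
    elif short_cds.start > reference.end:
        location = "downstream" if reference.strand == "+" else "upstream"
    else:
        raise ValueError("the short CDS overlaps the reference gene")
    orientation = "identical" if short_cds.strand == reference.strand else "opposite"
    return location, orientation


def is_upstream_and_divergent(short_cds: Gene, reference: Gene) -> bool:
    """True for the shp-like context: upstream of the reference, opposite strand."""
    return gene_context(short_cds, reference) == ("upstream", "opposite")
```

For example, with invented coordinates, `is_upstream_and_divergent(Gene(900, 970, "-"), Gene(1000, 1900, "+"))` is true, because the short CDS lies before the start of a forward-strand regulator gene and is transcribed in the opposite direction.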


 
Fig. 2. Protein phylogenetic tree reconstruction of the Rgg family. Locus tags or gene names are indicated when available, and in bold type when a short CDS encoding a hydrophobic peptide was found upstream of the Rgg regulator and divergently transcribed. In this case, the amino acid sequence of the corresponding peptide is presented. A three-letter prefix indicates the species, the strains considered here being S. thermophilus CNRZ1066, S. agalactiae A909, S. mutans UA159, S. pyogenes MGAS10270 and S. pneumoniae R6. Only high-confidence internal branches are shown as solid lines (more than 80 % bootstrap support, indicated at the side of each branch). The branch scale is expressed as the expected number of substitutions.

 
Transcriptional analysis of seven short CDSs from S. thermophilus CNRZ1066
Among the CDSs of S. thermophilus CNRZ1066 with extrinsic support, six, named g-1 to g-6 (Table 3), were selected for experimental expression studies using real-time quantitative RT-PCR. This technique was used to detect transcription, so as to confirm the functionality of these putative genes. We chose to select CDSs longer than 100 bp because it is very difficult to design primers that meet the requirements imposed by real-time quantitative RT-PCR when the genes are too short. This constraint excluded expression studies on genes encoding SHPs. We chose to study CDSs with unknown functions on the basis of three features. The first feature, fulfilled by g-1, g-3, g-4 and g-6, was the presence of a potential signal sequence [detected by SignalP (Nielsen et al., 1997) and/or Psort (Gardy et al., 2005)], indicating that they could be secreted and play a role as pheromones. The second feature was a context usually associated with Gram-positive pheromones (ABC transporters and/or two-component systems). This applied to g-2 and g-3. The third feature was a context that provided a clue to a potential function. Thus, g-1 is located downstream of feoAB, which probably encodes a ferrous iron transporter, while g-5 is located inside an operon potentially linked to anaerobic nucleoside metabolism. Finally, the six selected genes displayed various extrinsic evidence profiles, as shown in Table 3. We also chose to add to our selection a CDS, g-7, located downstream of an Rgg regulator and not supported by any line of extrinsic evidence. The g-7 gene is located between nucleotides 842 428 and 842 577 and has a pHMM of 1; the corresponding peptide has 49 aa.

In order to determine whether these seven selected CDSs were transcribed, we measured the CT for each gene using cDNA obtained from RNA extracted from cells grown in rich medium (M17lac) in exponential phase, as described in Methods. As shown in Table 4, we obtained CT values of between 19 and 28, indicating that all CDSs were transcribed.


 
Table 4. CT for seven CDSs of S. thermophilus CNRZ1066 after growth in M17lac medium during exponential phase (OD600 1.2)

 
Because the optimum conditions of expression for these CDSs were unknown, total RNA was also extracted from cells grown in M17lac medium during early stationary phase and in CDM during mid-exponential and early stationary phases. The CT values obtained under these four different conditions were used to assess the effect of the growth stage and composition of the medium on gene expression using the comparative CT method ($2^{-\Delta\Delta C_T}$). The results in Fig. 3(a) present a comparison of the expression of each gene in the exponential and stationary phases in M17lac medium and CDM. The expression of all genes was affected by the growth phase in at least one medium, and different patterns were obtained. In both media, two genes (g-1 and g-5) were more strongly expressed in exponential phase and one (g-6) in stationary phase. The remaining genes were affected by the growth phase either in only one medium (g-3 and g-4 more strongly expressed in exponential phase in M17lac only, and g-7 in CDM only) or differently in both media (g-2 more strongly expressed in exponential phase in M17lac and in stationary phase in CDM). Overall, most of the genes were more strongly expressed in exponential phase. The results in Fig. 3(b) present a comparison of the expression of each gene between M17lac medium and CDM at the exponential or stationary phases of growth. The expression of three genes (g-1, g-3 and g-6) was not significantly affected by the composition of the medium in either phase of growth. The others were affected in different ways. Only one gene (g-5) was affected by the medium composition at both stages of growth (g-5 was more strongly expressed in CDM than in M17lac). The remaining genes (g-2, g-4 and g-7) were affected by the medium composition in only one phase: exponential for g-2 and g-7, and stationary for g-4. All these genes, except one (g-2), were more strongly expressed in CDM than in M17lac.
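The comparative CT calculation of Livak & Schmittgen (2001) used above can be written out as a short Python sketch. In its standard form, ΔCT is the CT of the target gene minus the CT of a normalizing reference gene in the same sample, ΔΔCT is the difference between the ΔCT of the test condition and that of the calibrator condition, and the relative expression is 2 raised to the power –ΔΔCT. The function name and the CT values in the usage note are invented for illustration; the reference gene actually used is the one specified in Methods.

```python
def relative_expression(ct_target_test: float, ct_ref_test: float,
                        ct_target_calib: float, ct_ref_calib: float) -> float:
    """Fold change of a target gene (test vs calibrator) by the 2^(-ddCT) method.

    Each CT is a threshold cycle; a lower CT means more template. The method
    assumes that target and reference amplify with efficiencies close to 100 %.
    """
    delta_ct_test = ct_target_test - ct_ref_test      # normalize the test sample
    delta_ct_calib = ct_target_calib - ct_ref_calib   # normalize the calibrator
    delta_delta_ct = delta_ct_test - delta_ct_calib
    return 2.0 ** (-delta_delta_ct)
```

With hypothetical values, a target gene measured at CT 22 in exponential phase and CT 24 in stationary phase, with the reference gene at CT 18 in both, gives `relative_expression(22.0, 18.0, 24.0, 18.0)`, a fourfold higher expression in exponential phase relative to stationary phase.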


 
Fig. 3. Relative expression levels of seven short CDSs of S. thermophilus CNRZ1066 (a) between exponential and stationary phase in M17lac medium and CDM, and (b) between CDM and M17lac medium during exponential and stationary phases. Relative expression levels were computed using the comparative critical threshold method ($2^{-\Delta\Delta C_T}$), as described by Livak & Schmittgen (2001). Data are expressed as means from three independent experiments. Asterisks indicate significant results according to an analysis of variance (P<0.05).

 
Identification of cell-envelope-associated peptides of S. thermophilus CNRZ1066 by LC-MS/MS analysis
In order to go further in the validation of our predictions, and as we were more interested in peptides that could be involved in communication functions, we chose to purify cell-envelope-associated peptides (<10 kDa) of S. thermophilus CNRZ1066 and to identify them by LC-MS/MS analysis. The peptides were extracted from cells grown in rich medium (M17lac) and CDM at exponential and stationary phase, as described in Methods. Three peptides smaller than 60 aa were identified. Two were identified under all growth conditions tested and correspond to ribosomal proteins RpmD and RpmGB (data not shown). This result is not surprising because the samples that were analysed by LC-MS/MS were enriched in proteins preferentially located at the cytosol–membrane interface, but also contained proteins organized in complexes such as the translational complex. One peptide was identified under only three growth conditions: M17lac medium during stationary phase, and CDM during exponential and stationary phases, from one, five and six peptide fragments, respectively. Results obtained with the six peptide fragments are presented in detail in Table 5, and the percentage amino acid coverage of the entire peptide was 69 %. This peptide is not annotated in GenBank and is located between positions 1 031 982 and 1 032 158. It was highlighted by our genomic analysis of short CDSs because it received E1, E2 and E3 extrinsic support (see Table 3).


 
Table 5. Sequence, charge state and filter parameter values for peptides used for the identification of the peptide corresponding to an unannotated CDS in S. thermophilus CNRZ1066

Cells were grown in CDM and harvested during stationary phase. The gene encoding the peptide was located between position 1 031 982 and position 1 032 158 in the genome. The positions of the identified peptides in the 58 aa peptide sequence are shown in Fig. 4.
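The relationship between the genome coordinates, the peptide length and the sequence coverage quoted above can be checked with a short Python sketch. It assumes that the quoted coordinates are inclusive and contain the stop codon, which is consistent with the stated lengths (177 nt for a 58 aa peptide, and 150 nt for the 49 aa product of g-7). The function names are hypothetical, and the fragment positions in the usage note are invented; the real positions are those shown in Fig. 4.

```python
from typing import Iterable, Tuple


def peptide_length_from_coords(start: int, end: int, includes_stop: bool = True) -> int:
    """Peptide length (aa) encoded between two inclusive genome coordinates."""
    n_nucleotides = end - start + 1
    if n_nucleotides % 3 != 0:
        raise ValueError("the interval is not a whole number of codons")
    n_codons = n_nucleotides // 3
    # The stop codon does not encode an amino acid.
    return n_codons - 1 if includes_stop else n_codons


def coverage_percent(peptide_length: int, fragments: Iterable[Tuple[int, int]]) -> float:
    """Percentage of residues covered by at least one identified fragment.

    Fragments are (first, last) residue positions, 1-based and inclusive;
    overlapping fragments are counted once.
    """
    covered = set()
    for first, last in fragments:
        covered.update(range(first, last + 1))
    return 100.0 * len(covered) / peptide_length
```

For instance, `peptide_length_from_coords(1031982, 1032158)` gives 58. A coverage of 69 % of a 58 aa peptide corresponds to about 40 residues; two made-up overlapping fragments such as `[(1, 20), (15, 40)]` cover 40 residues and therefore give a value that rounds to 69 %.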

 

    DISCUSSION
 
The universe of short CDSs remains largely unknown in streptococci. Very few genes have been studied experimentally, and most of the CDSs annotated in GenBank are hypothetical and of unknown function. This study describes a reliable list of short genes of streptococci that could be tested in biological experiments.

The added value of our genome-wide survey of the short CDSs is well illustrated by the fact that in none of the GenBank records did the presence/absence of CDS annotations closely match the list of spCDSs supported by extrinsic lines of evidence. Indeed, it is unlikely that either intrinsic or extrinsic approaches taken separately could ever produce such a list. Intrinsic approaches are typically misled by the presence of pseudogene fragments, and it is apparently difficult to distinguish some functional CDSs from the intergenic background without accepting a relatively high level of false predictions. However, intrinsic approaches proved very helpful in establishing lists of candidate CDSs that could be subsequently screened for extrinsic lines of evidence.

We also compared our list of short CDSs supported by extrinsic lines of evidence to the CDSs accompanied by functional annotation in GenBank. Some biologically known CDSs could not be confirmed by extrinsic in silico approaches with the sequence data currently available. This demonstrated that the list reported here is incomplete, although it is already very large compared to the number of known short CDSs.

The experimental use and validation of these results were initiated using the list of predicted short CDSs of S. thermophilus CNRZ1066. For all seven genes chosen for experimental studies, we were able to detect the presence of RNA transcripts using real-time quantitative RT-PCR. Little doubt now remains concerning their translation into proteins, as they were all targeted on the basis of a number of criteria exclusively related to translation (RBSs, tri-periodic nucleotide composition and protein sequence conservation). The seven genes were selected on the basis of three criteria: the presence of potential signal sequences in the corresponding peptides, proximity to genes encoding ABC transporters and two-component systems, and proximity to genes encoding proteins of known function. The first two criteria were motivated by our interest in peptides that might be involved in quorum-sensing functions. Although blpC, which encodes a bacteriocin-like inducer peptide precursor from a class II bacteriocin locus, fulfils these criteria, it was not selected because this locus is probably not functional in S. thermophilus LMG18311 and CNRZ1066 (Hols et al., 2005). We chose to compare the expression of the seven genes under different conditions in order to see whether some of them were similarly regulated, which would suggest a functional link between them. Instead, comparison of their expression in rich medium and CDM and during the exponential and stationary phases revealed distinct patterns of expression for all genes. One of our objectives is now to identify their functions. We have already started to study the consequences of inactivating g-1 in S. thermophilus in order to assess its potential involvement in iron metabolism.

Another experimental validation of our results came from the MS analysis of peptides purified from the cell envelope of S. thermophilus CNRZ1066. Three peptides were identified in this preliminary experiment, and all belonged to our final list of spCDSs supported by extrinsic evidence (Table 3). Besides two peptides corresponding to ribosomal proteins, one corresponded to a short CDS genetically linked to the gene miaA, which encodes a tRNA delta(2) isopentenylpyrophosphate transferase (Table 3). The presence of a signal peptide (predicted by SignalP v3.0) indicates that the encoded product of this short CDS is probably either secreted or anchored in the membrane.

During this study, one family of CDSs encoding SHPs was identified on the basis of the conservation of their context with genes encoding transcriptional regulators of the Rgg family. Although they were present in at least one copy in each Streptococcus species, none of them was annotated in GenBank. Most of them shared strong sequence similarities corresponding to E1 and E2 support, but E3 support was necessary to detect all members of this family. This example emphasizes the importance of our three extrinsic lines of evidence. SHPs have a conserved size of 22–23 aa, they always have at least one positively charged amino acid (lysine) in the N-terminal portion and one glycine in the C-terminal portion, and their sequence is mainly composed of hydrophobic amino acids interrupted by a negatively charged amino acid (glutamate or aspartate). Some of these features [size, one or more lysine(s) in the N-terminal portion, and an abundance of hydrophobic amino acids] are shared with the inhibitor pheromones involved in the regulation of conjugation in E. faecalis (Chandler & Dunny, 2004). However, the similarities between SHPs and inhibitor pheromones were too weak to be detected automatically by our analysis. Like SHPs, the inhibitor peptides are encoded upstream of, and divergently from, genes encoding transcriptional regulators, which nevertheless do not belong to the Rgg family. It has recently been demonstrated that one of these inhibitor pheromones (which are secreted and reimported) interacts directly with its regulator, PrgX (Kozlowicz et al., 2006). These common features of SHPs and the inhibitor pheromones of E. faecalis suggest that SHPs could be involved in quorum-sensing regulation. We studied in detail one shp–rgg locus which is specific to S. thermophilus strain LMD9. We demonstrated that the Rgg regulator positively controls the transcription of a gene (ster1357) that encodes a cyclic peptide, and that in the shp or an ami (oligopeptide transport system) knockout mutant, the transcription of ster1357 is drastically decreased (M. Ibrahim and others, unpublished results). These results confirm the biological role of at least one shp gene. Phylogenetic analysis of the Rgg-like regulators of streptococci identified two distinct groups of Rgg-like regulators associated with SHPs. The sequences of the corresponding SHPs were conserved within these two groups, thus reinforcing a functional link between SHPs and Rgg regulators. A challenge concerning this family of regulators will be to discover their targets. The Rgg-like regulators that have been characterized so far are not associated with SHPs, and appear to positively regulate the transcription of at least the adjacent genes (Lyon et al., 1998; Qi et al., 1999; Rawlinson et al., 2002; Sanders et al., 1998; Vickerman & Minick, 2002). We can hypothesize that genes downstream of rgg genes and transcribed in the same orientation are putative targets. None of these genes encodes a protein of known function except mutM (fpg) and coaE, which encode a formamidopyrimidine-DNA glycosylase and a dephospho-CoA kinase, respectively. However, we identified three groups of proteins, SAM radical enzymes, exporters and small peptides, suggesting a functional link between these potential targets.

In conclusion, we present a reliable list of short genes discovered in streptococci and E. faecalis genomes. A preliminary study using these data has been performed on S. thermophilus. The results obtained suggest that this expression study should be extended to other streptococci, perhaps employing more global approaches such as microarray technology. This list could also be used as a valuable tool to check for the presence of short genes in defined loci.


 
Fig. 4. Positions of the identified peptides in the 58 aa peptide sequence (see Table 5). Peptide positions are shown by black bars. The predicted signal sequence is indicated in bold type.

 

    ACKNOWLEDGEMENTS
 
We would like to thank Françoise Wessner, Isabelle Guillouard, Jean-Pierre Furet, Liliana Lopez, Charlotte Beltramo and Christophe Gitton for technical advice, Françoise Rul for critical reading of the manuscript and the ‘Plateau d'Instrumentation et de Compétences en Transcriptomique (PICT)’ (INRA, Jouy-en-Josas, France) for advice concerning the use of the Agilent 2100 bioanalyser. We also thank Alain Guillot and Coralie Deladrière from the ‘Plateau d'analyse protéomique par séquençage et spectrométrie de masse (PAPSS)’ (INRA, Jouy-en-Josas, France) for LC-MS/MS analysis. We are grateful to Pascal Hols for access to the sequence of S. thermophilus LMG18311 before publication. Pierre Nicolas and Philippe Bessières received partial support from the Agence Nationale de la Recherche (decision number ANR-05-BLAN-0049-01).

Edited by: M. Kleerebezem


    REFERENCES
 
Ajdic, D., McShan, W. M., McLaughlin, R. E., Savic, G., Chang, J., Carson, M. B., Primeaux, C., Tian, R., Kenton, S. & other authors (2002). Genome sequence of Streptococcus mutans UA159, a cariogenic dental pathogen. Proc Natl Acad Sci U S A 99, 14434–14439.[Abstract/Free Full Text]

Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang, Z., Miller, W. & Lipman, D. J. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25, 3389–3402.[Abstract/Free Full Text]

Banks, D. J., Porcella, S. F., Barbian, K. D., Beres, S. B., Philips, L. E., Voyich, J. M., DeLeo, F. R., Martin, J. M., Somerville, G. A. & Musser, J. M. (2004). Progress toward characterization of the group A Streptococcus metagenome: complete genome sequence of a macrolide-resistant serotype M6 strain. J Infect Dis 190, 727–738.[CrossRef][Medline]

Beres, S. B., Sylva, G. L., Barbian, K. D., Lei, B., Hoff, J. S., Mammarella, N. D., Liu, M. Y., Smoot, J. C., Porcella, S. F. & other authors (2002). Genome sequence of a serotype M3 strain of group A Streptococcus: phage-encoded toxins, the high-virulence phenotype, and clone emergence. Proc Natl Acad Sci U S A 99, 10078–10083.[Abstract/Free Full Text]

Beres, S. B., Richter, E. W., Nagiec, M. J., Sumby, P., Porcella, S. F., DeLeo, F. R. & Musser, J. M. (2006). Molecular genetic anatomy of inter- and intraserotype variation in the human bacterial pathogen group A Streptococcus. Proc Natl Acad Sci U S A 103, 7059–7064.[Abstract/Free Full Text]

Besemer, J., Lomsadze, A. & Borodovsky, M. (2001). GeneMarkS: a self-training method for prediction of gene starts in microbial genomes. Implications for finding sequence motifs in regulatory regions. Nucleic Acids Res 29, 2607–2618.[Abstract/Free Full Text]

Bolotin, A., Quinquis, B., Renault, P., Sorokin, A., Ehrlich, S. D., Kulakauskas, S., Lapidus, A., Goltsman, E., Mazur, M. & other authors (2004). Complete sequence and comparative genome analysis of the dairy bacterium Streptococcus thermophilus. Nat Biotechnol 22, 1554–1558.[CrossRef][Medline]

Borodovsky, M., Rudd, K. E. & Koonin, E. V. (1994). Intrinsic and extrinsic approaches for detecting genes in a bacterial genome. Nucleic Acids Res 22, 4756–4767.[Abstract/Free Full Text]

Bryson, K., Loux, V., Bossy, R., Nicolas, P., Chaillou, S., van de Guchte, M., Penaud, S., Maguin, E., Hoebeke, M. & other authors (2006). AGMIAL: implementing an annotation strategy for prokaryote genomes as a distributed system. Nucleic Acids Res 34, 3533–3545.[Abstract/Free Full Text]

Chandler, J. R. & Dunny, G. M. (2004). Enterococcal peptide sex pheromones: synthesis and control of biological activity. Peptides 25, 1377–1388.[CrossRef][Medline]

Felsenstein, J. (1989). PHYLIP – phylogeny inference package (version 3.2). Cladistics 5, 164–166.

Ferretti, J. J., McShan, W. M., Ajdic, D., Savic, D. J., Savic, G., Lyon, K., Primeaux, C., Sezate, S., Suvorov, A. N. & other authors (2001). Complete genome sequence of an M1 strain of Streptococcus pyogenes. Proc Natl Acad Sci U S A 98, 4658–4663.[Abstract/Free Full Text]

Gardy, J. L., Laird, M. R., Chen, F., Rey, S., Walsh, C. J., Ester, M. & Brinkman, F. S. (2005). PSORTb v.2.0: expanded prediction of bacterial protein subcellular localization and insights gained from comparative proteome analysis. Bioinformatics 21, 617–623.[Abstract/Free Full Text]

Gitton, C., Meyrand, M., Wang, J., Caron, C., Trubuil, A., Guillot, A. & Mistou, M. Y. (2005). Proteomic signature of Lactococcus lactis NCDO763 cultivated in milk. Appl Environ Microbiol 71, 7152–7163.[Abstract/Free Full Text]

Glaser, P., Rusniok, C., Buchrieser, C., Chevalier, F., Frangeul, L., Msadek, T., Zouine, M., Couvé, E., Lalioui, L. & other authors (2002). Genome sequence of Streptococcus agalactiae, a pathogen causing invasive neonatal disease. Mol Microbiol 45, 1499–1513.[CrossRef][Medline]

Green, N. M., Zhang, S., Porcella, S. F., Nagiec, M. J., Barbian, K. D., Beres, S. B., LeFebvre, R. B. & Musser, J. M. (2005). Genome sequence of a serotype M28 strain of group A Streptococcus: potential new insights into puerperal sepsis and bacterial disease specificity. J Infect Dis 192, 760–770.[CrossRef][Medline]

Hamoen, L. W., Venema, G. & Kuipers, O. P. (2003). Controlling competence in Bacillus subtilis: shared use of regulators. Microbiology 149, 9–17.[Abstract/Free Full Text]

Harrison, P. M., Carriero, N., Liu, Y. & Gerstein, M. (2003). A "polyORFomic" analysis of prokaryote genomes using disabled-homology filtering reveals conserved but undiscovered short ORFs. J Mol Biol 333, 885–892.[CrossRef][Medline]

Hols, P., Hancy, F., Fontaine, L., Grossiord, B., Prozzi, D., Leblond-Bourget, N., Decaris, B., Bolotin, A., Delorme, C. & other authors (2005). New insights in the molecular biology and physiology of Streptococcus thermophilus revealed by comparative genomics. FEMS Microbiol Rev 29, 435–463.[CrossRef][Medline]

Hoskins, J., Alborn, W. E., Jr, Arnold, J., Blaszczak, L. C., Burgett, S., DeHoff, B. S., Estrem, S. T., Fritz, L., Fu, D. J. & other authors (2001). Genome of the bacterium Streptococcus pneumoniae strain R6. J Bacteriol 183, 5709–5717.[Abstract/Free Full Text]

Kimura, M. (1977). Preponderance of synonymous changes as evidence for the neutral theory of molecular evolution. Nature 267, 275–276.[CrossRef][Medline]

Kleerebezem, M. (2004). Quorum sensing control of lantibiotic production; nisin and subtilin autoregulate their own biosynthesis. Peptides 25, 1405–1414.[CrossRef][Medline]

Kozlowicz, B. K., Shi, K., Gu, Z. Y., Ohlendorf, D. H., Earhart, C. A. & Dunny, G. M. (2006). Molecular basis for control of conjugation by bacterial pheromone and inhibitor peptides. Mol Microbiol 62, 958–969.[CrossRef][Medline]

Larsen, T. S. & Krogh, A. (2003). EasyGene – a prokaryotic gene finder that ranks ORFs by statistical significance. BMC Bioinformatics 4, 21[Medline]

Letort, C. & Juillard, V. (2001). Development of a minimal chemically-defined medium for the exponential growth of Streptococcus thermophilus. J Appl Microbiol 91, 1023–1029.[CrossRef][Medline]

Livak, K. J. & Schmittgen, T. D. (2001). Analysis of relative gene expression data using real-time quantitative PCR and the Formula method. Methods 25, 402–408.[CrossRef][Medline]

Lyon, G. J. & Novick, R. P. (2004). Peptide signaling in Staphylococcus aureus and other Gram-positive bacteria. Peptides 25, 1389–1403.[CrossRef][Medline]

Lyon, W. R., Gibson, C. M. & Caparon, M. G. (1998). A role for trigger factor and an Rgg-like regulator in the transcription, secretion and processing of the cysteine proteinase of Streptococcus pyogenes. EMBO J 17, 6263–6275.[CrossRef][Medline]

Martin, B., Quentin, Y., Fichant, G. & Claverys, J. P. (2006). Independent evolution of competence regulatory cascades in streptococci?. Trends Microbiol 14, 339–345.[CrossRef][Medline]

Nakagawa, I., Kurokawa, K., Yamashita, A., Nakata, M., Tomiyasu, Y., Okahashi, N., Kawabata, S., Yamazaki, K., Shiba, T. & other authors (2003). Genome sequence of an M3 strain of Streptococcus pyogenes reveals a large-scale genomic rearrangement in invasive strains and new insights into phage evolution. Genome Res 13, 1042–1055.[Abstract/Free Full Text]

Nelson, B. L. (1995). Stochastic Modeling: Analysis and Simulation. Mineola, NY: Dover Publications.

Nicolas, P., Bize, L., Muri, F., Hoebeke, M., Rodolphe, F., Ehrlich, S. D., Prum, B. & Bessières, P. (2002). Mining Bacillus subtilis chromosome heterogeneities using hidden Markov models. Nucleic Acids Res 30, 1418–1426.[Abstract/Free Full Text]

Nielsen, P. & Krogh, A. (2005). Large-scale prokaryotic gene prediction and comparison to genome annotation. Bioinformatics 21, 4322–4329.[Abstract/Free Full Text]

Nielsen, H., Engelbrecht, J., Brunak, S. & von Heijne, G. (1997). Identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites. Protein Eng 10, 1–6.[Medline]

Ochman, H. (2002). Distinguishing the ORFs from the ELFs: short bacterial genes and the annotation of genomes. Trends Genet 18, 335–337.[CrossRef][Medline]

Paulsen, I. T., Banerjei, L., Myers, G. S., Nelson, K. E., Seshadri, R., Read, T. D., Fouts, D. E., Eisen, J. A., Gill, S. R. & other authors (2003). Role of mobile DNA in the evolution of vancomycin-resistant Enterococcus faecalis. Science 299, 2071–2074.[Abstract/Free Full Text]

Pearson, W. R. (2000). Flexible sequence similarity searching with the FASTA3 program package. Methods Mol Biol 132, 185–219.[Medline]

Pearson, W. R. & Lipman, D. J. (1988). Improved tools for biological sequence comparison. Proc Natl Acad Sci U S A 85, 2444–2448.[Abstract/Free Full Text]

Pearson, W. R., Wood, T., Zhang, Z. & Miller, W. (1997). Comparison of DNA sequences with protein sequences. Genomics 46, 24–36.[CrossRef][Medline]

Qi, F., Chen, P. & Caufield, P. W. (1999). Functional analyses of the promoters in the lantibiotic mutacin II biosynthetic locus in Streptococcus mutans. Appl Environ Microbiol 65, 652–658.[Abstract/Free Full Text]

Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proc IEEE 77, 257–286.[CrossRef]

Rawlinson, E. L., Nes, I. F. & Skaugen, M. (2002). LasX, a transcriptional regulator of the lactocin S biosynthetic genes in Lactobacillus sakei L45, acts both as an activator and a repressor. Biochimie 84, 559–567.[Medline]

Sanders, J. W., Leenhouts, K., Burghoorn, J., Brands, J. R., Venema, G. & Kok, J. (1998). A chloride-inducible acid resistance mechanism in Lactococcus lactis and its regulation. Mol Microbiol 27, 299–310.[CrossRef][Medline]

Schmidt, H. A., Strimmer, K., Vingron, M. & von Haeseler, A. (2002). TREE-PUZZLE: maximum likelihood phylogenetic analysis using quartets and parallel computing. Bioinformatics 18, 502–504.[Abstract/Free Full Text]

Skovgaard, M., Jensen, L. J., Brunak, S., Ussery, D. & Krogh, A. (2001). On the total number of genes and their length distribution in complete microbial genomes. Trends Genet 17, 425–428.[CrossRef][Medline]

Slamti, L. & Lereclus, D. (2005). Specificity and polymorphism of the PlcR–PapR quorum-sensing system in the Bacillus cereus group. J Bacteriol 187, 1182–1187.[Abstract/Free Full Text]

Smith, T. F. & Waterman, M. S. (1981). Identification of common molecular subsequences. J Mol Biol 147, 195–197.

Smoot, J. C., Barbian, K. D., Van Gompel, J. J., Smoot, L. M., Chaussee, M. S., Sylva, G. L., Sturdevant, D. E., Ricklefs, S. M., Porcella, S. F. & other authors (2002). Genome sequence and comparative microarray analysis of serotype M18 group A Streptococcus strains associated with acute rheumatic fever outbreaks. Proc Natl Acad Sci U S A 99, 4668–4673.

Sumby, P., Porcella, S. F., Madrigal, A. G., Barbian, K. D., Virtaneva, K., Ricklefs, S. M., Sturdevant, D. E., Graham, M. R., Vuopio-Varkila, J. & other authors (2005). Evolutionary origin and emergence of a highly successful clone of serotype M1 group A Streptococcus involved multiple horizontal gene transfer events. J Infect Dis 192, 771–782.

Tettelin, H., Nelson, K. E., Paulsen, I. T., Eisen, J. A., Read, T. D., Peterson, S., Heidelberg, J., DeBoy, R. T., Haft, D. H. & other authors (2001). Complete genome sequence of a virulent isolate of Streptococcus pneumoniae. Science 293, 498–506.

Tettelin, H., Masignani, V., Cieslewicz, M. J., Eisen, J. A., Peterson, S., Wessels, M. R., Paulsen, I. T., Nelson, K. E., Margarit, I. & other authors (2002). Complete genome sequence and comparative genomic analysis of an emerging human pathogen, serotype V Streptococcus agalactiae. Proc Natl Acad Sci U S A 99, 12391–12396.

Tettelin, H., Masignani, V., Cieslewicz, M. J., Donati, C., Medini, D., Ward, N. L., Angiuoli, S. V., Crabtree, J., Jones, A. L. & other authors (2005). Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae: implications for the microbial "pan-genome". Proc Natl Acad Sci U S A 102, 13950–13955.

Thompson, J. D., Higgins, D. G. & Gibson, T. J. (1994). CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 22, 4673–4680.

Vickerman, M. M. & Minick, P. E. (2002). Genetic analysis of the rgg–gtfG junctional region and its role in Streptococcus gordonii glucosyltransferase activity. Infect Immun 70, 1703–1714.

Zuber, P. (2001). A peptide profile of the Bacillus subtilis genome. Peptides 22, 1555–1577.

Received 20 January 2007; revised 8 June 2007; accepted 22 June 2007.


This article has been cited by other articles:


R. Gardan, C. Besset, A. Guillot, C. Gitton, and V. Monnet
The Oligopeptide Transport System Is Essential for the Development of Natural Competence in Streptococcus thermophilus Strain LMD-9
J. Bacteriol., July 15, 2009; 191(14): 4647 - 4655.


M. Liu, R. J. Siezen, and A. Nauta
In Silico Prediction of Horizontal Gene Transfer Events in Lactobacillus bulgaricus and Streptococcus thermophilus Reveals Protocooperation in Yogurt Manufacturing
Appl. Envir. Microbiol., June 15, 2009; 75(12): 4120 - 4129.


M. Ibrahim, A. Guillot, F. Wessner, F. Algaron, C. Besset, P. Courtin, R. Gardan, and V. Monnet
Control of the Transcription of a Short Gene Encoding a Cyclic Peptide in Streptococcus thermophilus: a New Quorum-Sensing System?
J. Bacteriol., December 15, 2007; 189(24): 8844 - 8854.




Copyright © 2007 Society for General Microbiology.