Microbiology
Microbiology 152 (2006), 257-272; DOI  10.1099/mic.0.28278-0
© 2006 Society for General Microbiology

Multivariate analysis of microarray data by principal component discriminant analysis: prioritizing relevant transcripts linked to the degradation of different carbohydrates in Pseudomonas putida S12

Mariët J. van der Werf, Bart Pieterse†, Nicole van Luijk, Frank Schuren, Bianca van der Werff-van der Vat, Karin Overkamp and Renger H. Jellema

TNO Quality of Life, PO Box 360, 3700 AJ Zeist, The Netherlands

Correspondence
Mariët J. van der Werf
vanderWerf@voeding.tno.nl


    ABSTRACT
The value of the multivariate data analysis tools principal component analysis (PCA) and principal component discriminant analysis (PCDA) for prioritizing leads generated by microarrays was evaluated. To this end, Pseudomonas putida S12 was grown in independent triplicate fermentations on four different carbon sources, i.e. fructose, glucose, gluconate and succinate. RNA isolated from these samples was analysed in duplicate on an anonymous clone-based array to avoid bias during data analysis. The relevant transcripts were identified by analysing the loadings of the principal components (PC) and discriminants (D) in PCA and PCDA, respectively. Even more specifically, the relevant transcripts for a specific phenotype could also be ranked from the loadings under an angle (biplot) obtained after PCDA analysis. The leads identified in this way were compared with those identified using the commonly applied fold-difference and hierarchical clustering approaches. The different data analysis methods gave different results. The methods used were complementary and together resulted in a comprehensive picture of the processes important for the different carbon sources studied. For the more subtle, regulatory processes in a cell, the PCDA approach seemed to be the most effective. Except for glucose and gluconate dehydrogenase, all genes involved in the degradation of glucose, gluconate and fructose were identified. Moreover, the transcriptomics approach resulted in potential new insights into the physiology of the degradation of these carbon sources. Indications of iron limitation were observed with cells grown on glucose, gluconate or succinate but not with fructose-grown cells. Moreover, several cytochrome- or quinone-associated genes seemed to be specifically up- or downregulated, indicating that the composition of the electron-transport chain in P. putida S12 might change significantly in fructose-grown cells compared to glucose-, gluconate- or succinate-grown cells.


Abbreviations: HCA, hierarchical cluster analysis; MVDA, multivariate data analysis; PCA, principal component analysis; PCDA, principal component discriminant analysis

Supplementary tables are available with the online version of this paper.

†Present address: BioDetection Systems BV, Kruislaan 406, 1098 SM Amsterdam, The Netherlands.


    INTRODUCTION
Increasingly, functional genomics tools like transcriptomics, proteomics and metabolomics are used for the identification of biological processes that are important for a specific subject of study. The most challenging aspect of functional genomics studies is the comprehensive and accurate interpretation of the overwhelming amount of data generated.

Currently, two approaches are commonly employed in order to identify the relevant biomolecules from functional genomics data. Most frequently, analysis of the fold difference in expression (univariate data analysis) between the condition of interest and a reference condition is used to identify the biomolecules (i.e. genes, proteins or metabolites) that are potentially of interest. Subsequently, biomolecules whose response is above a certain threshold (e.g. more than twofold difference) are selected and studied in more detail. Increasingly, hierarchical cluster analysis (HCA) is employed for analysing functional genomics data. In this case, biomolecules or experiments with similar expression profiles are grouped. HCA is only useful when more than two functional genomics datasets of more than two situations are available. Visual inspection of the hierarchical clusters results in the identification of coexpressed biomolecules that are specifically regulated under the condition of interest (e.g. Heyer et al., 1999; Tefferi et al., 2002).

The drawback of both these approaches is that they generally result in large numbers of leads: in the order of tens or hundreds. As it is too time-consuming to analyse all of these in more detail with molecular biological, biochemical or bioinformatic approaches in order to (experimentally) validate the targets identified, there is a need for data analysis tools that allow one to rank the potential targets. Potentially, the fold difference in expression level can be used to rank the targets. However, the fundamental basis behind such a ranking, i.e. that biomolecules that show the largest response are also the most important for the question under study, is doubtful (van der Werf, 2005).

Multivariate data analysis (MVDA) tools seem much better suited to prioritize leads from functional genomics datasets. These tools take into account the inherent interdependency of biomolecules. Principal component analysis (PCA) is the most frequently applied multivariate statistics tool. It has been applied for many decades in epidemiology, econometrics, ecology, etc., but only recently has the potential of these tools in cellular biology been recognized (Orr & Scherf, 2002; Michaud et al., 2003). Although the mathematics behind PCA might seem complex to the untrained cellular biologist, the basic principle behind it is straightforward, i.e. PCA combines two, or more, correlated factors (i.e. transcripts) into one new variable, a principal component (PC) (Orr & Scherf, 2002; van der Werf et al., 2005). Thus in PCA the dimensionality of the dataset is reduced by replacing the original variables by a smaller number of newly formed variables that are linear combinations of the original variables and that explain the majority of the information (variability) from the experiment. For each PC, loadings (or weights) reflect the influence of the original variables, whereas scores (coefficients of the PC) reflect the contribution of each PC in every sample. PCA and the related tool principal component discriminant analysis (PCDA; Hoogerbrugge et al., 1983) are currently mostly used for the classification of samples with a similar expression pattern, e.g. related to a specific treatment or phenotype. However, these tools are not only descriptive but also allow the identification of the specific biomolecules that are most important for the differences between the groups. The most important variables are identified by analysing the strength of the correlation of every biomolecule with the biological process of interest.
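
The relationship between loadings and scores described above can be illustrated with a minimal sketch. It is not the PLS Toolbox routine used in this study: it extracts only the first PC by power iteration on the covariance matrix, uses the Python standard library, and runs on a small hypothetical dataset. The function names and the data are assumptions made for illustration.

```python
import math

def centre_columns(rows):
    """Subtract the column mean from every value (one column per transcript)."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[value - mean for value, mean in zip(row, means)] for row in rows]

def first_principal_component(rows, iterations=500):
    """Return (loadings, scores, explained_fraction) for PC1 via power iteration."""
    x = centre_columns(rows)
    n, p = len(x), len(x[0])
    # Covariance matrix between the p variables (transcripts).
    cov = [[sum(x[k][i] * x[k][j] for k in range(n)) / (n - 1) for j in range(p)]
           for i in range(p)]
    # Arbitrary non-uniform start vector; repeated multiplication by the
    # covariance matrix converges to the direction of largest variance.
    v = [1.0 / (i + 1) for i in range(p)]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    eigenvalue = sum(v[i] * cov[i][j] * v[j] for i in range(p) for j in range(p))
    explained = eigenvalue / sum(cov[i][i] for i in range(p))
    # Scores: projection of every centred sample on the loading vector.
    scores = [sum(value * weight for value, weight in zip(row, v)) for row in x]
    return v, scores, explained

# Hypothetical data: 4 samples (rows) x 3 transcripts (columns).
# Transcripts 0 and 1 vary together; transcript 2 barely changes.
samples = [[1.0, 2.0, 0.1], [2.0, 4.0, 0.0], [3.0, 6.0, 0.1], [4.0, 8.0, 0.0]]
loadings, scores, explained = first_principal_component(samples)
```

In this toy case the two correlated transcripts receive the largest absolute loadings and PC1 captures nearly all of the variance, which is the property exploited when transcripts are ranked by their loadings.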

The goal of the research described here was to empirically investigate the effectiveness of the multivariate data analysis tools PCA and PCDA for the ranking of important transcripts from microarray data and to compare the results with those obtained by the fold-difference and HCA approaches. To this end, genes from Pseudomonas putida S12 were identified whose expression is specific for growth on one or more of four different carbon sources (i.e. glucose, gluconate, fructose and succinate), as the genes encoding enzymes of carbohydrate catabolism and their regulation are still largely unknown (Petruschka et al., 2002). In order to avoid biological prejudgments during the evaluation of the data analysis tools, anonymous clone-based arrays were used. The identity of the genes only became known after completing the data analysis phase by sequencing the inserts in the clones corresponding to the relevant spots.


    METHODS
Micro-organism and cultivation conditions.
P. putida S12 was previously isolated on styrene as the sole source of carbon and energy (Hartmans et al., 1990). Cultures were grown in batch fermentations at 30 °C in a Bioflow II (New Brunswick Scientific) bioreactor containing 2 litres of mineral salts medium with, as carbon source, 20 g l⁻¹ glucose, or the equivalent amount on a C-mol basis of fructose (20 g l⁻¹), disodium succinate.6H2O (45 g l⁻¹) or sodium gluconate (24 g l⁻¹). The mineral salts medium contained the following (per litre of demineralized water): 1·55 g K2HPO4, 0·85 g NaH2PO4.H2O, 8·0 g NH4Cl, 0·5 g (NH4)2SO4, 0·3 g MgCl2.6H2O, 40 mg EDTA, 2 mg ZnSO4.7H2O, 1 mg CaCl2.2H2O, 15 mg FeSO4.7H2O, 0·2 mg Na2MoO4.2H2O, 2 mg CuSO4.5H2O, 0·4 mg CoCl2.6H2O and 1 mg MnCl2.4H2O. The culture was inoculated with 5 % (v/v) of a preculture grown without shaking for 24 h on the same medium. A constant pH (pH 7·0) was maintained by automatic titration with 2 M KOH and 1 M H2SO4. After an overnight oxygen-limited growth phase, the dissolved oxygen concentration was maintained at 20 % by manually adjusting the stirrer speed and inlet airflow. Samples were taken from the bioreactor at an OD600 of 10, and immediately quenched in –45 °C methanol as previously described (Pieterse et al., 2005). Cell pellets were stored at –45 °C until used.

RNA isolation.
RNA was isolated from the cells using the hot borate method, basically as described by Wan & Wilkins (1994). RNA purity and concentrations were determined both spectrophotometrically and on agarose gel. The RNA isolates were checked for residual RNase activity by comparing samples that were incubated for 1 h at 37 °C with the initial material on an agarose gel.

Array design.
The microarray used is a clone-based array. To this end, a chromosomal library of P. putida S12 was constructed by Baseclear (Leiden, The Netherlands). DNA fragments were obtained by shearing, and fragments of 2–3 kb were blunt-end cloned in pSMARTLC (Lucigen). The chromosomal library was ordered in 96-well microtitre plates and stored as glycerol stocks at –80 °C. Subsequently, in total 5000 genomic fragments were amplified by PCR, purified and arrayed as described previously (Pieterse et al., 2005).

Fluorescent labelling and hybridization.
Differential transcript levels were determined by two-colour fluorescent hybridizations of the corresponding cDNAs on the clone-array. The RNA samples were labelled by random-hexamer-primed in vitro reverse transcription with either Cy5- or Cy3-labelled dUTP. Labelling, hybridization and washing were performed as previously described (Pieterse et al., 2005). In all instances, Cy3-labelled cDNA of a batch of the fructose-F3 sample was used as the reference condition.

Image analysis.
The fluorescent signals from the two different labels on the hybridized arrays were quantified with a ScanArray Express scanner (Packard Bioscience) and Imagene 4.2 software (BioDiscovery). Spots with a Flag 0 (spots the quality of which was approved by the Imagene software package) or a Flag 3 (spots which obtained a warning by the Imagene software package for a manual check on their quality) were selected. Subsequently, spots from which the difference between the mean signal of the spot and the mean signal of the background was smaller than zero times the background standard deviation were excluded from further analysis. After removal of the empty spots and the spots from which the signal exceeded the detection limit of the scanner, 3676 spots (68 %) remained for further analysis.

Data preprocessing and normalization.
The data from the microarray analysis were delivered as Excel files and the data were imported into Matlab (version 6.5.1; The Mathworks). Within-slide, intensity-dependent normalizations were performed with the scatter plot smoother LOWESS (polynomial order=1) using a Matlab routine (copyright 1998 by Datatool). The user-defined fraction of data used for smoothing at each point was set at 25 % for all slides. Subsequently, these preprocessed data were used as the input for significance and MVDA analysis.
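
As an illustration of within-slide, intensity-dependent normalization, the sketch below fits a local linear (order 1) trend with tricube weights to the log ratios as a function of mean log intensity, using 25 % of the spots for each local fit, and subtracts that trend. It is not the Datatool Matlab routine used here: the robustness iterations of full LOWESS are omitted, and the function names and simulated slide are assumptions.

```python
import math

def lowess_fit(x, y, fraction=0.25):
    """Local linear (order 1) fit with tricube weights; returns the fit at each x."""
    n = len(x)
    k = max(3, math.ceil(fraction * n))  # number of neighbours per local fit
    fitted = []
    for x0 in x:
        nearest = sorted((abs(xi - x0), i) for i, xi in enumerate(x))[:k]
        h = nearest[-1][0] or 1.0  # bandwidth = distance to the k-th neighbour
        sw = swu = swy = swuu = swuy = 0.0
        for d, i in nearest:
            w = (1.0 - (d / h) ** 3) ** 3  # tricube weight
            u = x[i] - x0                  # centre on x0 so the intercept is the fit
            sw += w
            swu += w * u
            swy += w * y[i]
            swuu += w * u * u
            swuy += w * u * y[i]
        denom = sw * swuu - swu * swu
        if abs(denom) < 1e-12:
            fitted.append(swy / sw)  # degenerate window: fall back to weighted mean
        else:
            slope = (sw * swuy - swu * swy) / denom
            fitted.append((swy - slope * swu) / sw)
    return fitted

def normalize_log_ratios(cy5, cy3, fraction=0.25):
    """Subtract the intensity-dependent trend from the log2 ratios of one slide."""
    a = [0.5 * (math.log2(r) + math.log2(g)) for r, g in zip(cy5, cy3)]  # mean log intensity
    m = [math.log2(r / g) for r, g in zip(cy5, cy3)]                      # log ratio
    trend = lowess_fit(a, m, fraction)
    return [mi - ti for mi, ti in zip(m, trend)]

# Hypothetical slide with a constant twofold dye bias: after normalization
# every log ratio should be close to zero.
cy3_signal = [100.0 * 1.5 ** i for i in range(20)]
cy5_signal = [2.0 * g for g in cy3_signal]
normalized = normalize_log_ratios(cy5_signal, cy3_signal)
```

Centring each local regression on the point being fitted keeps the arithmetic well conditioned, since the fitted value is then simply the local intercept.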

Significance analysis.
Prior to significance analysis, a data transformation was applied to the normalized ratios in order to obtain a normal distribution of the data (Pieterse et al., 2005). Significance analysis was performed by means of a one-way ANOVA. Subsequently a Tukey HSD test was performed to determine whether a significant differential expression level (99 % confidence interval) was observed under a specific condition.
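
The one-way ANOVA step can be sketched for a single transcript as follows. The example computes only the F statistic and its degrees of freedom; the subsequent Tukey HSD comparison requires the studentized range distribution and is not reproduced. The function name and the expression values are hypothetical.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups of measurements."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    # Variation of the replicates around their own group mean.
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, group_means) for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within

# Hypothetical transformed ratios of one transcript on the four carbon sources.
fructose = [0.0, 0.1, -0.1]
glucose = [1.0, 1.1, 0.9]
gluconate = [1.0, 0.9, 1.1]
succinate = [2.0, 2.1, 1.9]
f_value, df_between, df_within = one_way_anova_f([fructose, glucose, gluconate, succinate])
```

A large F value indicates that the differences between carbon sources are large relative to the replicate-to-replicate variation, which is the situation in which a post hoc test such as Tukey HSD is informative.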

Multivariate data analysis.
Datasets were scaled [x/(xmax – xmin)] per variable prior to MVDA analysis. Two different MVDA tools were applied in Matlab (The Mathworks): (i) principal component analysis (PCA) (PLS Toolbox for Matlab, version 3.0.2, Eigenvector Research), and (ii) principal component discriminant analysis (PCDA) [algorithm reproduced from Hoogerbrugge et al. (1983) and programmed into Matlab]. In PCDA, the centre of a group was established by taking the mean score on D1 and D2. Subsequently, the loadings were determined under the angle present between the fructose group and that of one of the other three carbon sources.
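
A minimal sketch of the per-variable scaling step follows, assuming that each variable is divided by its range (maximum minus minimum). How a variable with zero range was handled is not stated; leaving it unchanged is an assumption of this sketch, as are the function name and the data.

```python
def range_scale(variables):
    """Divide every value of each variable by that variable's range."""
    scaled = []
    for values in variables:
        span = max(values) - min(values)
        if span == 0:
            scaled.append(list(values))  # assumption: constant variables are left as-is
        else:
            scaled.append([v / span for v in values])
    return scaled

# Hypothetical data: one list per variable (transcript), one value per sample.
raw = [[2.0, 4.0, 6.0], [10.0, 10.5, 11.0]]
scaled = range_scale(raw)
```

After scaling, every non-constant variable spans exactly one unit, so transcripts with large absolute signal differences do not dominate the multivariate model simply because of their magnitude.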

For the hierarchical clustering and visualization of the results, the programs CLUSTER and TREEVIEW were applied (Eisen et al., 1998). Only those genes or operons were included that fell within the 99 % confidence interval of the Tukey HSD test. Average linkage clustering was performed on the natural logarithm of the ratios of the spots taken into account.
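
The principle of average linkage clustering on log-transformed ratios can be sketched as below. This is not the CLUSTER program: the distance measure is not specified in the text, so Euclidean distance is assumed here, and the function name and expression profiles are hypothetical.

```python
import math

def average_linkage(profiles):
    """Agglomerative clustering; returns the merges as (cluster_a, cluster_b, distance)."""
    # Natural logarithm of the ratios, as described in the text.
    logged = {name: [math.log(r) for r in ratios] for name, ratios in profiles.items()}

    def distance(a, b):
        # Assumption: Euclidean distance between log-ratio profiles.
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(logged[a], logged[b])))

    clusters = [(name,) for name in profiles]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Average linkage: mean distance over all pairs of members.
                pairs = [(a, b) for a in clusters[i] for b in clusters[j]]
                d = sum(distance(a, b) for a, b in pairs) / len(pairs)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

# Hypothetical expression ratios under three conditions.
profiles = {
    "gene_a": [2.0, 4.0, 0.5],
    "gene_b": [2.1, 3.9, 0.5],
    "gene_c": [0.5, 0.25, 2.0],
}
merges = average_linkage(profiles)
```

The two co-regulated genes are joined first and the oppositely regulated gene is attached last, mirroring how coexpressed transcripts end up in the same branch of the tree.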

Nucleotide sequencing and sequence analysis.
Clones containing the inserts selected by MVDA as being relevant were traced back in the 96-well plates. Approximately 500 bp of both the 3'- and the 5'-end of the inserts in these clones was subsequently sequenced using universal primers based on pSMART by Baseclear (Leiden, The Netherlands), and the complete gene content of these inserts was inferred from the P. putida KT2440 genome sequence (Nelson et al., 2002). DNA sequences were identified by similarity searches against the TIGR (www.tigr.org/) and NCBI (www.ncbi.nlm.nih.gov/) database libraries using BLAST. Gene numbers used in this study (PP numbers) are based on the gene numbering of the P. putida KT2440 genome (Nelson et al., 2002).


    RESULTS
Experimental design
As the full genome sequence of P. putida S12 is not known, and in order to select the relevant clones without bias using PC(D)A (see below), an anonymous clone-based array of P. putida S12 was constructed. It was assumed that the presence of 200 nucleotides of a specific gene was sufficient to detect a hybridization signal with a spot. Using the formula of Akopyants (Akopyants et al., 2001), and based on an estimated genome size of 6·2 Mb for P. putida S12 (see also Nelson et al., 2002) and the fact that 5000 spots could be spotted on the microarray slides, chromosomal fragments of, on average, 2·5 kb were generated by shearing, resulting in a full genome coverage of 95 %.

In order to address the biological and technical (array) variability, P. putida S12 was grown on the four carbon sources, i.e. D-fructose [F] (µ=0·18±0·02 h⁻¹), D-glucose [G] (µ=0·28±0·02 h⁻¹), D-gluconate [N] (µ=0·21 h⁻¹) and succinate [S] (µ=0·21±0·01 h⁻¹), in triplicate in independent batch cultures. The cells were harvested and immediately quenched in order to prevent alterations in the mRNA composition of the samples (Pieterse et al., 2005). In one instance, two samples were harvested from the same fermenter (fermentation 3 of glucose-grown cells). Subsequently, mRNA was isolated from these samples. In all instances, RNA isolated from the third fructose fermentation was used as the reference. Two independent microarray hybridizations of every sample were performed.

PCA analysis of transcriptomes
In order to identify the transcripts that are the most important for the differences between the cells grown on the different carbon sources, the multivariate data analysis tool PCA was applied. In Fig. 1(a), the results of the PCA analysis of the transcription datasets are visualized in a two-dimensional plot. A cloud of points is observed, with each point representing the transcriptome of the different samples (van der Werf et al., 2005). Transcriptomes that end up close together are overall more similar, while more dissimilar transcriptomes are further apart. It can clearly be seen that, with the exception of transcriptomes of gluconate- and succinate-grown cells, the transcriptomes of cells grown on the same carbon source are more similar than the transcriptomes of cells grown on different carbon sources. Also in plots of PC1 versus PC3 and of PC2 versus PC3, the transcriptome datasets of gluconate- and succinate-grown cells slightly overlapped (results not shown). The separation of the different groups of transcriptomes originating from the same carbon sources by PCA indicates that the overall variation in the datasets due to biological and technical variation is less than the differences introduced by changing the growth substrate of P. putida S12. PC1 explains 49 % of the total variance in these datasets, while PC2 explains only 6 % of the variance.



Fig. 1. PCA (a) and PCDA (b) plots of the transcription profiles. The letters F (fructose), G (glucose), N (gluconate) and S (succinate) refer to the carbon sources from which these transcription profiles were obtained. The line between the fructose and the glucose group in (b) is a new discriminant axis through the centre of the fructose group and the centre of the glucose group.

 
PCDA analysis
A supervised variant of PCA is PCDA (Hoogerbrugge et al., 1983). In contrast to PCA, PCDA also takes into account information about external variables (group information) when reducing the dimensionality of the datasets. Therefore, it is better suited than PCA for clustering analysis. The result of analysing the transcriptome data by PCDA is shown in Fig. 1(b). The transcriptomes originating from cells grown on the same carbon source form much more compact groups than with PCA analysis (Fig. 1a) and, moreover, the distances between the groups are larger, indicating a more optimal data analysis with respect to the carbon sources studied. The fact that non-overlapping groups are observed after PCDA analysis indicates that there is information present in the transcriptome datasets that is specific for each of the four different carbon sources.

As with PCA, in PCDA the strongly correlating variables are combined into one new variable that is now called a discriminant (D). The discriminants are linear combinations of the original variables, i.e. the transcripts. When the (absolute values of the) loadings for each of the transcripts in the different D's are studied, transcripts can be identified that are the most important for the variance explained by that D. D1 is mainly responsible for explaining the difference between fructose and succinate. This can most easily be seen by projecting all transcriptomes in Fig. 1(b) on D1 (the x-axis). In a similar way, transcripts with a high absolute loading in D2 are important for explaining the difference between glucose and succinate. The 3'- and the 5'-ends of the inserts in the clones belonging to the spots with the highest absolute value for the loadings on D1 and D2 were subsequently sequenced. The identities of the genes present on these inserts (Table 1) were identified by performing a homology search using BLAST (Altschul et al., 1990) based mainly on the annotated P. putida KT2440 genome sequence (Nelson et al., 2002).


Table 1. Identity of the genes present on the 30 clones with the highest absolute loadings on the discriminants

 
Several of the spots that are the most important in D1 contain the genes encoding fructose-specific phosphotransferase and phosphofructokinase (Fig. 2). This operon was present on two separate clones with high loadings on D1, indicating that it is statistically relevant for the variance explained by D1, and that it is not a false positive. The presence of these genes is in agreement with the transcriptome groups originating from the same carbon sources separated on D1 (mainly fructose and succinate). Another group of genes that is present more than once amongst the 30 clones with the highest loading on D1 is that encoding the outer-membrane protein H1 and the transcriptional regulator PhoP/sensor protein PhoQ operon. For D2, (i) the gluconokinase/gluconate transporter, (ii) the ferric siderophore receptor, (iii) the 2-ketogluconate/2-ketogluconate kinase/epimerase/regulator, (iv) the genes involved in the biosynthesis of flagella (chemotaxis) and (v) the glucose-6-phosphate dehydrogenase/6-phosphogluconolactonase/2-dehydro-3-deoxyphosphogluconate aldolase gene clusters are present more than once amongst the variables that have the highest loading in D2 (see also Fig. 2). Also in this case, the fact that specifically these genes are important for D2 is not surprising in view of the fact that the difference between the groups of glucose- and succinate-grown cells is mainly explained by D2 (Fig. 1b).



Fig. 2. Degradation of fructose, glucose and gluconate by the cyclic Entner–Doudoroff pathway in pseudomonads (adapted from Lessie & Phibbs, 1984). GLN, gluconolactone; 2KGA, 2-ketogluconate; 2KGP, 2-keto-6-phosphogluconate; 6PGA, 6-phosphogluconate; KDPG, 2-keto-3-deoxy-6-phosphogluconate; GAP, glyceraldehyde-3-phosphate; Pyr, pyruvate; DHAP, dihydroxyacetone phosphate; Fruc1,6diP, fructose 1,6-diphosphate; Fruc6P, fructose 6-phosphate; Fruc1P, fructose 1-phosphate; Gluc6P, glucose 6-phosphate; 6PGL, 6-phosphogluconolactone; DPGA, 1,3-diphosphoglycerate; 3PGA, 3-phosphoglycerate; 2PGA, 2-phosphoglycerate; PEP, phosphoenolpyruvate; PPP, pentose phosphate pathway; TCA cycle, tricarboxylic acid cycle. Gcd, glucose dehydrogenase; Gcl, gluconolactonase; Gad, gluconate dehydrogenase; Fpt, fructose phosphotransferase; Gct, glucose transporter; Gat, gluconate transporter; Kgt, 2KGA transporter; Fck, Fruc1P kinase; Gck, glucose kinase; Gnk, gluconate kinase; Kgk, 2KGA kinase; Kgr, 2KGP reductase; Zwf, Gluc6P dehydrogenase; Pgl, 6-phosphogluconolactonase; Gnd, 6PGA dehydrogenase; Edd, 6PGA dehydratase; Eda, KDPG aldolase; Pgi, Gluc6P isomerase; Fdp, fructose-1,6-diphosphatase; Fda, Fruc1,6diP aldolase; Tpi, triosephosphate isomerase; Gpd, GAP dehydrogenase; Pgk, 3PGA kinase; Pgm, phosphoglyceromutase; Eno, enolase; Pyk, Pyr kinase.

 
Identification of transcripts important for the different carbon sources
Although the identification of the genes that have a high loading in the different D's is helpful in identifying genes that are important for the growth on a specific carbon source, it is not always possible to identify genes that are important for one of the different carbon sources: compare, for example, the glucose and gluconate group information on D1 and the gluconate and fructose group information on D2 (Fig. 1b). Therefore, the loadings under an angle were determined. Now, three new discriminants were defined through the centre of the fructose group and the centres of one of the three other groups (Fig. 1b). Subsequently, transcripts important for the difference between the fructose group and one of the other three groups were identified by selecting the original variables with the highest absolute values for the loadings (under an angle) on these newly defined discriminants. Table 2 lists the results of this biplot analysis.
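
One way to read the "loadings under an angle" is as the projection of each transcript's (D1, D2) loading pair onto the unit vector that points from the centre of the fructose group to the centre of another group. The sketch below follows that reading, which is an interpretation rather than the authors' Matlab code; all names and numbers are hypothetical.

```python
import math

def group_centre(scores):
    """Mean score of a group on D1 and D2."""
    n = len(scores)
    return (sum(s[0] for s in scores) / n, sum(s[1] for s in scores) / n)

def loadings_under_angle(loadings, centre_from, centre_to):
    """Project each (D1, D2) loading pair onto the axis between two group centres."""
    dx = centre_to[0] - centre_from[0]
    dy = centre_to[1] - centre_from[1]
    length = math.hypot(dx, dy)
    ux, uy = dx / length, dy / length  # unit vector of the new discriminant axis
    return {spot: l1 * ux + l2 * uy for spot, (l1, l2) in loadings.items()}

def rank_by_absolute_loading(projected, top=30):
    """Spots ordered by decreasing absolute loading on the new axis."""
    return sorted(projected, key=lambda spot: abs(projected[spot]), reverse=True)[:top]

# Hypothetical PCDA scores (D1, D2) of the replicate samples of two groups.
fructose_scores = [(-2.0, 0.1), (-2.2, -0.1), (-1.8, 0.0)]
glucose_scores = [(1.0, 3.9), (1.2, 4.1), (0.8, 4.0)]
# Hypothetical loadings of three spots on D1 and D2.
spot_loadings = {"spot_A": (0.5, 0.5), "spot_B": (-0.8, -0.1), "spot_C": (0.4, -0.3)}

projected = loadings_under_angle(
    spot_loadings, group_centre(fructose_scores), group_centre(glucose_scores)
)
ranking = rank_by_absolute_loading(projected)
```

The sign of the projected loading is kept, so a spot such as spot_B, with a negative value, would be read as negatively correlated with the glucose group relative to fructose.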


Table 2. Identity of the genes present on the 30 genome fragments present on the spots with the highest absolute loadings under an angle, after PCDA biplot analysis, in the direction from the centre of the fructose group to the centre of the glucose, gluconate and succinate groups, respectively

 
For the glucose group, (i) the ferric siderophore receptor, (ii) the PP0897–PP0904 gene cluster, (iii) the fructose transport (fructose-specific phosphotransferase)/phosphofructokinase, (iv) the gluconokinase/gluconate transporter, (v) the hypothetical protein PP3459–PP3462 and (vi) the 2-ketogluconate/2-ketogluconate kinase/epimerase/regulator gene clusters are present more than once amongst the variables that have the highest absolute loading under an angle in the direction of the glucose group. For the gluconate group, (i) the fructose transport (fructose-specific phosphotransferase)/phosphofructokinase, (ii) the PP0897–PP0904 gene cluster, (iii) the outer-membrane protein H1/transcriptional regulator PhoP/sensor protein PhoQ and (iv) the conserved hypothetical protein PP3797–PP3800 gene clusters are present more than once. For succinate, (i) the glycine betaine/carnitine/choline ABC transporter, (ii) the C4-dicarboxylate transport protein, (iii) the two-component regulator PhoP/PhoQ and (iv) the cold-shock domain family protein gene clusters are present more than once. The glucose and gluconate group have the fructose transport (fructose-specific phosphotransferase)/phosphofructokinase gene cluster in common. Moreover, this cluster is also highly ranked amongst the succinate-specific transcripts (results not shown), indicating that this cluster is not so important for glucose, gluconate and succinate, but in contrast is important for fructose-grown cells (see also Fig. 2), the reference condition for this differential transcript profiling study. Moreover, the genes of this cluster have a negative loading under an angle, indicating a negative correlation with the glucose, gluconate or succinate group, while most of the other important genes for these three carbon sources have a positive loading under an angle (Table 2). A similar phenomenon is observed for the spots containing genes PP0897–PP0904 (Table 2).

Comparing PCDA analysis with the fold-difference and hierarchical clustering approaches
Currently, two approaches other than the above-described PCA and PCDA analysis are commonly used for identifying the important transcripts from microarray studies: the fold-difference approach and HCA. We also applied the fold-difference approach to rank the transcripts based on the ratio or, when the ratio was <1, i.e. in the case of down-regulated genes, on 1/ratio. Again, inserts of the 30 clones belonging to the spots with the highest value for (1/)ratio were sequenced and the genes present on these inserts identified (see Table S1, available as supplementary data with the online version of this paper). By and large, completely different transcripts were identified compared to the PCDA approach: only 30 % of the spots were the same.
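
The fold-difference ranking described above, in which ratios below 1 are replaced by their reciprocal so that up- and downregulated spots are ranked on the same scale, can be sketched as follows, together with a helper for the fraction of spots shared by two top lists. The function names and ratios are hypothetical.

```python
def fold_change_magnitude(ratio):
    """Ratio for upregulated spots, 1/ratio for downregulated spots."""
    return ratio if ratio >= 1.0 else 1.0 / ratio

def rank_by_fold_difference(ratios, top=30):
    """Spots ordered by decreasing fold-change magnitude."""
    return sorted(ratios, key=lambda spot: fold_change_magnitude(ratios[spot]),
                  reverse=True)[:top]

def overlap_fraction(list_a, list_b):
    """Fraction of the spots in list_a that also occur in list_b."""
    return len(set(list_a) & set(list_b)) / len(list_a)

# Hypothetical expression ratios relative to the reference condition.
ratios = {"spot_1": 7.0, "spot_2": 0.04, "spot_3": 1.5, "spot_4": 0.5}
fold_ranking = rank_by_fold_difference(ratios)
shared = overlap_fraction(["a", "b", "c", "d"], ["b", "x", "y", "z"])
```

Here the strongly downregulated spot_2 (ratio 0.04, a 25-fold decrease) ranks above the sevenfold upregulated spot_1, which a ranking on the raw ratio alone would miss.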

The datasets were also analysed by HCA (Fig. 3) and subsequently clusters of transcripts whose expression was specifically affected by (one of) the carbon sources were identified (yellow boxes in Fig. 3). All inserts of the spots in these clusters were sequenced and the genes present on these inserts identified (see supplementary Table S2). Most of the genes identified by HCA were also identified by the PCDA and/or the fold-difference approach (see also Table 3). However, several gene clusters, encoding proteins such as surface adhesion protein, formate dehydrogenase and flagellar proteins, were not identified by either of the other two approaches. Remarkably, the gene clusters encoding proteins such as 6-phosphogluconate dehydratase/glucokinase (PP1010–PP1011 – downregulated in S), glucose-6-phosphate dehydrogenase/6-phosphogluconolactonase/2-dehydro-3-deoxyphosphogluconate aldolase (PP1022–PP1024 – slightly upregulated in G and N, downregulated in S) and cytochrome o ubiquinol oxidase (PP0812–PP0814 – upregulated in G, N and S), which are of key importance in the degradation of (one of the) carbon sources studied (Fig. 2), are also clearly visible as clusters in the HCA plot (Fig. 3), but do not end up in the top 30 by either PCDA or ratio analysis.



Fig. 3. Hierarchical cluster of the transcription profiles of P. putida S12 grown on the four different carbon sources. Red indicates increased expression, green decreased expression and black unchanged expression.

 

Table 3. Overview of the data analysis methods by which specific groups of genes relevant for processes that proved to be important for growth on (one of the) different carbon sources were identified

The letters G, N and S in parentheses refer to the data analysis of the carbon source (see legend to Fig. 1) with which these transcripts were identified. The ratios are the mean fold difference in expression of the different inserts on which the gene of interest was present.

 
The effectiveness of the different data analysis tools for identifying specific groups of genes relevant for processes that proved to be important for growth on (one of the) different carbon sources used (see Discussion) is shown in Table 3. It can clearly be seen that the genes involved in specific processes of importance for growth on (one of the) different carbon sources are not always identified amongst the 30 most important spots by each of the data analysis methods tested in this study. Instead, the different data analysis approaches seem to be complementary.


    DISCUSSION
The true challenge in functional genomics is the translation of the avalanche of data generated by these analytical tools into information. So far, no gold standard for analysis of microarray data to accomplish this goal has emerged. Currently, many sophisticated data analysis tools for functional genomics data are being developed. This has led to the analysis of biological data tending to become the field for statisticians, where the development of new, ‘fancy’, data analysis tools seems to be more of an issue than their applicability for the extraction of relevant biological information from functional genomics data. As the method used has a profound influence on the interpretation of the results (Quackenbush, 2001), the back-up of the results of data analysis by biological studies is becoming a critical issue. There is, therefore, a need for comparing, preferably readily available and straightforward, data analysis tools in order to evaluate which tool is ‘the best’ for obtaining biologically relevant information (Carpentier et al., 2004). In this paper we compare four different data analysis approaches for selecting and ranking transcripts (genes) important for growth of P. putida on four carbon sources – fructose, glucose, gluconate and succinate – for which the degradation pathways have been well established in this micro-organism (Lessie & Phibbs, 1984; Temple et al., 1998).

An anonymous clone-based array was used to avoid bias during data analysis, i.e. to prevent a perceived immediate understanding of the importance of a detected transcript (or transcripts) from directing the data analysis process. Genome fragments of P. putida were cloned, and the inserts were amplified and spotted on glass slides. Only when spots were identified as relevant by one of the data analysis methods were the inserts sequenced and the identity of the transcripts unravelled. In many instances genes or operons of relevance proved to be present on multiple clones, indicating that clone-based arrays are a reliable means of identifying genes that are relevant for a specific biological process. This generic approach therefore seems very suitable for studying the comprehensive transcript response of micro-organisms whose full genome sequence is not available.

This study clearly shows that the data analysis method chosen has a profound effect on the transcripts identified as being the ‘most’ relevant; large differences were observed between the transcripts that ranked the highest based on the PCA approach (results not shown), the PCDA approach or the fold-difference approach. The most frequently used method for selecting data from transcriptomics experiments, the fold-difference or ratio approach, has the disadvantage that genes that inherently show a low response, like constitutively expressed genes, are overlooked (van der Werf, 2005; Wu, 2001; Slonim, 2002). In this study, this is clearly illustrated by the 6-phosphogluconate dehydratase/glucokinase (PP1010–PP1011) and glucose-6-phosphate dehydrogenase/6-phosphogluconolactonase/2-dehydro-3-deoxyphosphogluconate aldolase (PP1022–PP1024) operons that are only slightly up- or downregulated (at most a factor of 2; Table 3) compared to the responses of the 2-ketogluconate 6-phosphate reductase/2-ketogluconate transporter/2-ketogluconate kinase/epimerase KguE/transcriptional regulator PtxS (PP3376–PP3380) and the fructose degradation (PP0792–PP0795) gene clusters that are up- and downregulated by a factor of 7 to 24 (Table 3). The other frequently used tool for analysing transcriptome data, HCA, has the disadvantage that only gene clusters that show a specific expression profile are identified as relevant. For instance, in this study, the fructose degradation (PP0792–PP0795) gene cluster, which is strongly downregulated in cells grown on glucose, gluconate and succinate, was not identified by HCA as it ended up in the large bulk of downregulated genes (lower third of the hierarchical cluster; Fig. 3), and was therefore not specifically identified.
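The limitation of the fold-difference approach can be illustrated with a small sketch. The numbers and transcript names below are invented, and the variability-scaled score is only a simple stand-in for a variance-aware ranking; it is not the PCA or PCDA procedure used in this study.

```python
import numpy as np

# Hypothetical log2 intensities: rows = transcripts, columns = replicate samples.
# Columns 0-2: condition A, columns 3-5: condition B. All values are invented.
names = ["strong_noisy", "weak_consistent", "unchanged"]
log2_expr = np.array([
    [8.0, 11.0, 9.5, 11.0, 13.5, 12.5],   # large average shift, large replicate scatter
    [10.0, 10.1, 9.9, 10.9, 11.0, 11.1],  # about a 2-fold shift, very reproducible
    [7.0, 7.4, 6.6, 7.2, 6.8, 7.0],       # no shift
])
group_a, group_b = log2_expr[:, :3], log2_expr[:, 3:]

# Fold-difference approach: rank by the absolute difference of group means (log2 scale).
log2_fold = group_b.mean(axis=1) - group_a.mean(axis=1)

# Variability-scaled alternative: divide the same difference by the pooled
# within-group standard deviation, so small but reproducible changes score highly.
pooled_sd = np.sqrt((group_a.var(axis=1, ddof=1) + group_b.var(axis=1, ddof=1)) / 2)
scaled = log2_fold / pooled_sd

rank_by_fold = [names[i] for i in np.argsort(-np.abs(log2_fold))]
rank_by_scaled = [names[i] for i in np.argsort(-np.abs(scaled))]
print(rank_by_fold)
print(rank_by_scaled)
```

In this toy example the fold-difference ranking puts the noisy, strongly shifted transcript first, whereas the scaled ranking puts the reproducible 2-fold change first, which is the kind of low-amplitude response a pure ratio cut-off tends to miss.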

So far, there have only been a few isolated studies in which PC(D)A biplots, i.e. making use of the loadings under an angle, have been applied for analysing transcriptome data (Chapman et al., 2001). PCDA is particularly well suited for the analysis of functional genomics datasets derived from samples originating from more than two different biological groups. This study clearly demonstrates that the loadings under an angle resulting from PCDA analysis are an appropriate quantitative statistical parameter with which relevant transcripts for a specific phenotype can be ranked, as illustrated by the fact that many genes encoding enzymes known to be involved in the degradation of the carbon sources studied [Fig. 2, i.e. the fructose degradation operon, the gluconokinase/gluconate transporter (PP3415–PP3417) gene cluster, the 2-ketogluconate 6-phosphate reductase/2-ketogluconate transporter/2-ketogluconate kinase/epimerase KguE/transcriptional regulator PtxS (PP3376–PP3380) gene cluster, and the C4-dicarboxylate transporter (PP1188)] were identified in this way. Moreover, in many instances, a transcript identified to be relevant for a specific carbon source by PCDA analysis was present on genome fragments of several other spots that were the most important, again indicating that this is relevant information, and that it is not chance correlations that identified these genes as being relevant. In this respect, the many regulatory genes on the inserts of spots identified by PCDA to strongly correlate with one of the different carbon sources are of special interest.
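The ranking idea can be sketched on synthetic data. The example below is an illustrative reconstruction, not the authors' implementation. It assumes that PCDA amounts to PCA followed by linear discriminant analysis on the retained PC scores, and that a ranking by 'loadings under an angle' can be approximated by projecting each variable's (D1, D2) loading onto the direction of a group's centroid in the discriminant score plot. The data, the number of retained PCs (k) and all names are invented.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for an expression matrix (all values invented):
# 3 groups x 6 replicate samples, 40 variables ("spots").
n_groups, n_rep, n_var = 3, 6, 40
labels = np.repeat(np.arange(n_groups), n_rep)
X = rng.normal(0.0, 0.5, size=(n_groups * n_rep, n_var))
for g in range(n_groups):
    # Spots 3g..3g+2 are raised in group g only.
    X[labels == g, 3 * g:3 * g + 3] += 3.0

# Step 1: PCA on column-centred data, keeping k components.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 3
T = U[:, :k] * s[:k]          # PC scores (samples x k)
P = Vt[:k].T                  # PC loadings (variables x k)

# Step 2: linear discriminant analysis on the PC scores.
Sw = np.zeros((k, k))         # within-group scatter
Sb = np.zeros((k, k))         # between-group scatter
for g in range(n_groups):
    Tg = T[labels == g]
    mg = Tg.mean(axis=0)
    Sw += (Tg - mg).T @ (Tg - mg)
    Sb += len(Tg) * np.outer(mg, mg)   # grand mean of T is zero after centring
evals, evecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
order = np.argsort(-evals.real)[:2]
W = evecs[:, order].real      # weights of D1 and D2 on the PCs

# Step 3: express the discriminants in terms of the original variables.
L = P @ W                     # discriminant loadings (variables x 2)
D = Xc @ L                    # discriminant scores (samples x 2)

def rank_for_group(g):
    """Rank variables by the projection of their (D1, D2) loading onto
    the direction from the origin to the centroid of group g."""
    centroid = D[labels == g].mean(axis=0)
    direction = centroid / np.linalg.norm(centroid)
    return np.argsort(-(L @ direction))

top_group0 = rank_for_group(0)[:3]
print(sorted(top_group0.tolist()))
```

With the planted signal used here, the three highest-ranked variables for group 0 are expected to be the spots that were raised in that group. Because the loadings and the scores are flipped together when a discriminant changes sign, the projection does not depend on the arbitrary sign of the eigenvectors.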

However, PCDA did not identify all the genes that are involved in the degradation of the different carbon sources; the complete set of genes involved in the degradation of the carbon sources studied (Fig. 2) was only obtained by combining the results of PCDA analysis, the fold-difference and the HCA approach. In this respect, the complementary nature of the three approaches (HCA, fold-difference and PCDA; Table 3) is very notable.
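A minimal sketch of how such complementarity could be tabulated is given below. The spot identifiers and the assignment of spots to methods are invented for illustration and do not reproduce Table 3.

```python
# Hypothetical top-ranked spots per method (identifiers invented for illustration).
top_hits = {
    "fold_difference": {"spot_01", "spot_02", "spot_03", "spot_07"},
    "HCA": {"spot_02", "spot_04", "spot_05"},
    "PCDA": {"spot_01", "spot_05", "spot_06", "spot_08"},
}

# The combined picture is the union of all lists.
combined = set().union(*top_hits.values())

# Spots that only one method ranked highly: that method's complementary contribution.
unique_to = {
    method: hits - set().union(*(h for m, h in top_hits.items() if m != method))
    for method, hits in top_hits.items()
}

# Spots supported by every method.
shared_by_all = set.intersection(*top_hits.values())

for method, spots in unique_to.items():
    print(method, sorted(spots))
print("combined:", len(combined), "shared by all:", sorted(shared_by_all))
```

In this invented example each method contributes at least one spot that the others miss and no spot is found by all three, which mirrors the situation in which only the combined lists give the complete set.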

Besides the complementary nature of the different data analysis tools studied, this paper also demonstrates the strength of the clone-based array approach for the identification of relevant transcripts. It resulted in the identification of all except two of the genes involved in the degradation of the different carbohydrates studied (Fig. 2; Lessie & Phibbs, 1984; Temple et al., 1998): the fructose utilization gene cluster (PP0792–PP0795), the glucose utilization gene cluster (PP1010–PP1012; Sage et al., 1996), the zwf–pgl–eda gene cluster (PP1021–PP1024; Petruschka et al., 2002; Hager et al., 2000), the 2-ketogluconate utilization gene cluster (PP3376–PP3380; Swanson et al., 2000) and the gluconate utilization gene cluster (PP3415–PP3417). Only the genes encoding glucose dehydrogenase and gluconate dehydrogenase were not positively identified. Unfortunately, the genes encoding these enzymes have so far not been isolated from a Pseudomonas species. Although these genes were annotated in the P. putida KT2440 genome (PP1444 and PP3383, respectively), a BLAST study showed no significant homology between these genes and any of the functionally characterized glucose and gluconate dehydrogenase genes. Therefore, it is possible that one of the many genes encoding hypothetical proteins identified in this study encodes one of these two enzymes. In addition, a sugar ABC transporter gene cluster (PP1015–PP1019; Wylie & Worobec, 1994, 1995) was identified that was specifically induced upon growth on glucose (Table 3). This gene cluster encodes a glucose porin and a sugar transporter, of which the sugar-binding protein (PP1015) is very likely the previously purified glucose-specific glucose-binding protein (Stinson et al., 1977), as this gene encodes a protein of a similar size to the purified protein (44·5 kDa) and has a similar amino acid composition.

The clone-based array approach, in combination with the different data analysis tools, not only resulted in the identification of genes encoding the enzymes known to be involved in the degradation of the carbon sources studied, but also gave new insights into the physiology of their degradation. Most remarkable was an upregulation of a large number of genes that respond to iron limitation in glucose-, gluconate- or succinate-grown cells in comparison with fructose-grown cells (Table 3). This includes six different iron chelate receptors [the siderophore receptors (PP0267, PP0535, PP0861, PP3155 and PP4217) and the ferric citrate receptor FecA (PP0867; Enz et al., 2003)], TonB, which is involved in the translocation of the iron chelate bound to the siderophore and ferric citrate receptors across the outer membrane (PP3612; Moeck & Coulton, 1998), and the iron-responsive transcript FagA (PP0943; Hassett et al., 1997). This coincides with an upregulation of two RNA polymerase σ70 factors of the ECF subfamily (PP0162 and PP0704) that are involved in the regulation of siderophore biosynthesis (Redly & Poole, 2003). All these transcripts are under control of the Fur repressor protein. The Fur repressor (PP4730) was not amongst the 200 clones that were sequenced in this study.

Moreover, several cytochrome- or quinone-associated genes were specifically upregulated (i.e. PP0812, PP0813, PP0814, PP0815, PP3606) or downregulated (i.e. PP0071, PP1841, PP2010 and PP3823). This indicates that the composition of the electron-transport chain in P. putida S12 is different in fructose-grown cells compared to glucose-, gluconate- or succinate-grown cells. This is in agreement with the distinct degradation pathway for fructose compared to that of glucose and gluconate in pseudomonads (Fig. 2; Lessie & Phibbs, 1984; Temple et al., 1998). Potentially, the different composition of the electron-transport chain reflects its greater importance in cells grown on glucose or gluconate, which are initially degraded extracellularly in a sequence from glucose to gluconate and subsequently 2-ketogluconate by a PQQ-dependent glucose dehydrogenase (Matsushita & Ameyama, 1982) and a cytochrome-containing gluconate dehydrogenase (Matsushita et al., 1979), respectively. Both glucose and gluconate dehydrogenase are directly linked to the electron-transport chain. The observed iron limitation when cells were grown on glucose or gluconate suggests that there is a larger demand for iron as a prosthetic group in proteins, such as cytochrome-containing enzymes, when P. putida S12 is cultivated on either of these carbon sources. Gluconate dehydrogenase is known to contain dihaem cytochrome c as a prosthetic group (Matsushita et al., 1979). Succinate dehydrogenase is also directly linked to the electron-transport chain and is a cytochrome-containing enzyme (http://www.tigr.org/tigr-scripts/CMR2/GenomePage3.spl?database=gpp).

The multivariate data analysis tools that are currently used in functional genomics research originate from other research fields. Although specific adaptations have been made to these tools in order to optimize them for biological purposes (Eisen et al., 1998; Heyer et al., 1999; Tamayo et al., 1999; Tavazoie et al., 1999), this paper demonstrates an important role for multiple complementary approaches. In the near future further improvement of multivariate data analysis methods for analysing functional genomics datasets is to be expected using mathematical considerations that are based on a molecular biological rationale. Further improvements are also expected to overcome the problem of having far more variables than samples available for statistical analysis. This can lead to both false positives and false negatives with the existing multivariate data analysis tools when applied to functional genomics datasets.
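The problem of having far more variables than samples can be made concrete with a small simulation. The sketch below uses pure noise and a minimum-norm least-squares discriminant as a generic stand-in for a multivariate classifier; the sample and variable counts are arbitrary assumptions and do not correspond to the array used in this study.

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples, n_vars = 12, 2000
y = np.array([-1.0] * 6 + [1.0] * 6)            # two arbitrary "groups"
X_train = rng.normal(size=(n_samples, n_vars))  # pure noise, no real group effect

# Minimum-norm least-squares discriminant: with more variables than samples
# the group labels can be reproduced exactly, even from noise.
w, *_ = np.linalg.lstsq(X_train, y, rcond=None)
train_fit = X_train @ w
train_accuracy = np.mean(np.sign(train_fit) == y)

# Fresh noise samples with arbitrary labels: the "discriminant" does not generalise.
X_new = rng.normal(size=(200, n_vars))
y_new = np.where(np.arange(200) < 100, -1.0, 1.0)
new_accuracy = np.mean(np.sign(X_new @ w) == y_new)

print(train_accuracy, new_accuracy)
```

The training samples are separated perfectly although the data contain no group structure, while the accuracy on new samples stays close to chance. Variables with large weights in such a model would be false positives, which is one reason why replication and independent biological confirmation matter when the number of variables far exceeds the number of samples.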

In conclusion, this paper clearly demonstrates that the data analysis method used has a large effect on the ranking of the transcripts that are relevant for a specific phenotype. The methods used in this study were complementary: only when the highest-ranked transcripts from the different methods were combined did a complete picture of the processes important for the catabolism of the different carbon sources studied become apparent. For the more subtle, regulatory processes in a cell, the multivariate data analysis tool PCDA in particular seemed to be very effective, as relatively more regulator genes were identified by this method. Moreover, this study showed that anonymous clone-based arrays provide a reliable means of identifying relevant genes from micro-organisms whose full genome sequence is not available.


    ACKNOWLEDGEMENTS
 
We thank Roelie Bijl, Evelyn Wesseling, Martien Caspers, Mieke Havekes and Alie Kat Angelino for technical assistance, Ted van der Lende for fruitful discussions, Carina Rubingh and Sabina Bijlsma for critically reading the manuscript, and Baseclear (Leiden, The Netherlands) for constructing the chromosomal library of P. putida S12 and sequence analysis.


    REFERENCES
 
Akopyants, N. S., Clifton, S. W., Martin, J., Pape, D., Wylie, T., Li, L., Kissinger, J. C., Roos, D. S. & Beverley, S. M. (2001). A survey of the Leishmania major Friedlin strain V1 genome by shotgun sequencing: a resource for DNA microarrays and expression profiling. Mol Biochem Parasitol 113, 337–340.

Altschul, S. F., Gish, W., Miller, W., Meyers, E. W. & Lipman, D. J. (1990). Basic local alignment search tool. J Mol Biol 215, 403–410.

Carpentier, A.-S., Riva, A., Tisseur, P., Didier, G. & Henaut, A. (2004). The operons, a criterion to compare the reliability of transcriptome analysis tools: ICA is more reliable than ANOVA, PLS and PCA. Comput Biol Chem 28, 3–10.

Chapman, S., Schenk, P., Kazan, K. & Manners, J. (2001). Using biplots to interpret gene expression patterns in plants. Bioinformatics 18, 202–204.

Eisen, M. B., Spellman, P. T., Brown, P. O. & Botstein, D. (1998). Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci U S A 95, 14863–14868.

Enz, S., Brand, H., Orellana, C., Mahren, S. & Braun, V. (2003). Sites of interaction between the FecA and FecR signal transduction proteins of ferric citrate transport in Escherichia coli K-12. J Bacteriol 185, 3745–3752.

Hager, P. W., Calfee, M. W. & Phibbs, P. V. (2000). The Pseudomonas aeruginosa devB/SOL homolog, pgl, is a member of the hex regulon and encodes 6-phosphogluconolactonase. J Bacteriol 182, 3934–3941.

Hartmans, S., van der Werf, M. J. & de Bont, J. A. M. (1990). Bacterial degradation of styrene involving a novel flavin adenine dinucleotide-dependent styrene monooxygenase. Appl Environ Microbiol 56, 1347–1351.

Hassett, D. J., Howell, M. L., Ochsner, U. A., Vasil, M. L., Johnson, Z. & Dean, G. E. (1997). An operon containing fumC and sodA encoding fumarase C and manganese superoxide dismutase is controlled by the ferric uptake regulator in Pseudomonas aeruginosa: fur mutants produce elevated alginate levels. J Bacteriol 179, 1452–1459.

Heyer, L. J., Kruglyak, S. & Yooseph, S. (1999). Exploring expression data: identification and analysis of coexpressed genes. Genome Res 9, 1106–1115.

Hoogerbrugge, R., Willig, S. J. & Kistemaker, P. G. (1983). Discriminant analysis by double stage principal component analysis. Anal Chem 55, 1710–1712.

Lessie, T. G. & Phibbs, P. V. (1984). Alternative pathways of carbohydrate utilization in Pseudomonas. Annu Rev Microbiol 38, 359–387.

Matsushita, K. & Ameyama, M. (1982). D-Glucose dehydrogenase from Pseudomonas fluorescens, membrane-bound. Methods Enzymol 89, 149–154.

Matsushita, K., Shinagawa, E., Adachi, O. & Ameyama, M. (1979). Membrane-bound D-gluconate dehydrogenase from Pseudomonas aeruginosa. J Biochem 85, 1173–1181.

Michaud, D. J., Marsh, A. G. & Dhurjati, P. S. (2003). eXPatGen: generating dynamic expression patterns for the systematic evaluation of analytical methods. Bioinformatics 19, 1140–1146.

Moeck, G. S. & Coulton, J. W. (1998). TonB-dependent iron acquisition: mechanisms of siderophore-mediated active transport. Mol Microbiol 28, 675–681.

Nelson, K. E., Weinel, C., Paulsen, I. T. & 40 other authors (2002). Complete genome sequence and comparative analysis of the metabolically versatile Pseudomonas putida KT2440. Environ Microbiol 4, 799–808.

Orr, M. S. & Scherf, U. (2002). Large-scale gene expression analysis in molecular target discovery. Leukemia 16, 473–477.

Petruschka, L., Adlf, K., Burchardt, G., Dernedde, J., Jurgensen, J. & Hermann, H. (2002). Analysis of the zwf-pgl-eda operon in Pseudomonas putida strains H and KT2440. FEMS Microbiol Lett 215, 89–95.

Pieterse, B., Jellema, R. H. & van der Werf, M. J. (2005). Quenching of microbial samples for increased reliability of microarray data. J Microbiol Methods doi:10.1016/j.mimet.2005.04.035

Quackenbush, J. (2001). Computational analysis of microarray data. Nat Rev Genet 2, 418–427.

Redly, G. A. & Poole, K. (2003). Pyoverdine-mediated regulation of FpvA synthesis in Pseudomonas aeruginosa: involvement of a probable extracytoplasmic-function sigma factor, FpvI. J Bacteriol 185, 1261–1265.

Sage, A. E., Proctor, W. D. & Phibbs, P. V. (1996). A two-component response regulator, gltR, is required for glucose transport activity in Pseudomonas aeruginosa PAO1. J Bacteriol 178, 6064–6066.

Slonim, D. K. (2002). From patterns to pathways: gene expression data analysis comes of age. Nat Genet Suppl 32, S502–S508.

Stinson, M. W., Cohen, M. A. & Merrick, J. M. (1977). Purification and properties of the periplasmic glucose-binding protein of Pseudomonas aeruginosa. J Bacteriol 131, 672–681.

Swanson, B. L., Hager, P., Phibbs, P., Ochsner, U., Vasil, M. & Hamood, A. N. (2000). Characterization of the 2-ketogluconate utilization operon in Pseudomonas aeruginosa PAO1. Mol Microbiol 37, 561–573.

Tamayo, P., Slonim, D., Mesirov, J., Zhu, Q., Kitareewan, S., Dmitrovsky, E., Lander, E. S. & Golub, T. R. (1999). Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. Proc Natl Acad Sci U S A 96, 2907–2912.

Tavazoie, S., Hughes, J. D., Campbell, M. J., Cho, R. J. & Church, G. M. (1999). Systematic determination of genetic network architecture. Nat Genet 22, 281–285.

Tefferi, A., Bolander, M. E., Ansell, S. M., Wieben, E. D. & Spelsberg, T. C. (2002). Primer on medical genomics. Part III: microarray experiments and data analysis. Mayo Clin Proc 77, 927–940.

Temple, L. M., Sage, A. E., Schweizer, H. P. & Phibbs, P. V. (1998). Carbohydrate catabolism in Pseudomonas aeruginosa. In Pseudomonas, pp. 35–72. Edited by T. C. Montie. New York: Plenum.

van der Werf, M. J. (2005). Towards replacing closed with open target selection strategies. Trends Biotechnol 23, 11–16.

van der Werf, M. J., Jellema, R. H. & Hankemeier, T. (2005). Microbial metabolomics: replacing trial-and-error by the unbiased selection and ranking of targets. J Ind Microbiol Biotechnol 32, 234–252.

Wan, C.-Y. & Wilkins, T. A. (1994). A modified hot borate method significantly enhances the yield of high-quality RNA from cotton (Gossypium hirsutum L.). Anal Biochem 223, 7–12.

Wu, T. D. (2001). Analysing gene expression data from DNA microarrays to identify candidate genes. J Pathol 195, 53–65.

Wylie, J. L. & Worobec, E. A. (1994). Cloning and nucleotide sequence of the Pseudomonas aeruginosa glucose-selective OprB porin gene and distribution of oprB within the family Pseudomonaceae. Eur J Biochem 220, 505–512.

Wylie, J. L. & Worobec, E. A. (1995). The OprB porin plays a central role in carbohydrate uptake in Pseudomonas aeruginosa. J Bacteriol 177, 3021–3026.

Received 17 June 2005; revised 21 October 2005; accepted 26 October 2005.

