Proceedings of the 18th International Conference on Soil Mechanics and Geotechnical Engineering, Paris 2013
6 GENE SEQUENCE DATA ANALYSIS
Sequence data is usually provided in a text file in FASTA
format, where a description line (beginning with ">") is followed
by the sequence of nucleotides reported as single-letter codes
(A, G, C, T). In a
Geoenvironmental context, the purpose of sequencing a gene is
usually to identify the species from which the sequence came.
This is done by comparison with open-access databases such as
GenBank, the EMBL nucleotide sequence database, or the DNA Data
Bank of Japan.
These databases are maintained by public bodies in the USA,
Europe and Japan collaborating as the International Nucleotide
Sequence Database Collaboration.
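The FASTA layout described above is simple enough to read with a
few lines of code. The following is a minimal sketch in Python,
not part of any of the tools named in this paper; the function
name parse_fasta and the two clone records are hypothetical and
are included only to show the description-line/sequence structure.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {description: sequence} for FASTA-formatted lines."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A description line starts a new record.
            header = line[1:].strip()
            records[header] = ""
        elif header is not None:
            # Sequence lines may be wrapped; join them and upper-case.
            records[header] += line.upper()
    return records


# Hypothetical example records (not real data).
example = """>clone_01 hypothetical 16S rRNA fragment
AGAGTTTGATCCTGGCTCAG
gattgaacgctggcgg
>clone_02 hypothetical 16S rRNA fragment
AGAGTTTGATCATGGCTCAG
""".splitlines()

records = parse_fasta(example)
for description, sequence in records.items():
    print(description, len(sequence))
```

Sequences read in this way can then be submitted to the database
search tools described below.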
Sequences obtained from samples can be compared with
sequences in the database using a variety of free, public domain
software. BLAST (Basic Local Alignment Search Tool) makes
pair-wise comparisons with sequences in the chosen database
and reports the statistically most significant matches.
SEQMATCH, available from the Ribosomal Database Project,
performs a similar function,
and readily allows the user to restrict the quality of sequences to
which matches are reported (e.g. type species, isolates, long
read lengths, “good” quality).
CLASSIFIER, which is also available from the Ribosomal
Database Project, is a naïve Bayesian Classifier that can place
bacterial 16S rRNA sequences within Bergey’s Taxonomic
Outline of the Prokaryotes (Wang et al. 2007). It is easy to use,
and can be used for classifying single rRNA gene sequences or
for the analysis of libraries of thousands of sequences.
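The principle behind a naïve Bayesian sequence classifier can be
sketched as follows. This is a simplified illustration and not the
CLASSIFIER implementation: the word length (K = 4, chosen only
because the toy sequences are short), the smoothing constants, the
assumption of equal prior probability for every taxon, and the
taxon names and sequences are all assumptions made for the example.

```python
import math
from collections import defaultdict
from typing import Dict, List, Set, Tuple

K = 4  # assumed word length for this toy example


def kmers(seq: str, k: int) -> Set[str]:
    """Set of distinct overlapping words of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def train(labelled: List[Tuple[str, str]], k: int):
    """Count, per taxon, how many training sequences contain each word."""
    word_counts: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    seq_counts: Dict[str, int] = defaultdict(int)
    for taxon, seq in labelled:
        seq_counts[taxon] += 1
        for word in kmers(seq, k):
            word_counts[taxon][word] += 1
    return word_counts, seq_counts


def classify(query: str, word_counts, seq_counts, k: int) -> str:
    """Return the taxon with the highest summed log word probability."""
    best_taxon, best_score = "", -math.inf
    for taxon, n_seqs in seq_counts.items():
        counts = word_counts[taxon]
        # Smoothed estimate of P(word | taxon); words are treated as
        # independent, which is the "naive" assumption.
        score = sum(
            math.log((counts.get(word, 0) + 0.5) / (n_seqs + 1))
            for word in kmers(query, k)
        )
        if score > best_score:
            best_taxon, best_score = taxon, score
    return best_taxon


# Hypothetical training set and query (not real data).
training = [
    ("Taxon_A", "ACGTACGTACGTACGTAAAA"),
    ("Taxon_A", "ACGTACGTACGTTCGTAAAA"),
    ("Taxon_B", "GGCCTTGGCCTTGGCCTTGG"),
    ("Taxon_B", "GGCCTTGGCCATGGCCTTGG"),
]
word_counts, seq_counts = train(training, K)
predicted = classify("ACGTACGTACGA", word_counts, seq_counts, K)
print(predicted)
```

Because each word contributes independently to the score, the same
procedure scales from a single query to a library of many
sequences by repeating the classify step.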
For some types of analysis it may be necessary to align
sequences from the same gene of different species prior to
detailed analysis. An alignment is a way of arranging gene
sequences to identify regions of similarity that indicate
functional, structural, or evolutionary relationships between the
sequences (Mount, 2004). There is a variety of open-access
software available for aligning gene sequences, two of the more
popular of which are ClustalW (Cluster Analysis) and
MUSCLE (MUltiple Sequence Comparison by Log-Expectation),
both of which are available from the European
Bioinformatics Institute website (amongst other sources).
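What an alignment does can be illustrated with a minimal pairwise
sketch. The code below uses Needleman-Wunsch global alignment with
a linear gap penalty, which is a much simpler technique than the
multiple sequence alignment methods used by ClustalW or MUSCLE;
the scoring values (match, mismatch, gap) and the two short
sequences are arbitrary assumptions chosen for illustration.

```python
from typing import Tuple


def global_align(a: str, b: str, match: int = 1, mismatch: int = -1,
                 gap: int = -2) -> Tuple[str, str, int]:
    """Needleman-Wunsch global alignment with a linear gap penalty."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i: int, j: int) -> int:
        # Substitution score for a[i-1] against b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the dynamic-programming table.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover the alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]


aligned_a, aligned_b, total = global_align("GATTACA", "GATACA")
print(aligned_a)
print(aligned_b)
print(total)
```

Gap characters ("-") mark positions where one sequence has an
insertion or deletion relative to the other, so that similar
regions line up column by column.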
Phylogenetic relationships between the aligned sequences can be
displayed as phylogenetic trees using software such as
TreeView (Page, 1996), or organised into “operational taxonomic
units” (OTUs) using software such as MOTHUR (mothur.org/;
Schloss et al., 2009). In this context an OTU is a
grouping defined by sequence similarity, which can be set by
the user to correspond roughly with phylum, class, order,
family, genus or species, as appropriate. Rarefaction analysis
(which can also be undertaken by MOTHUR) can characterize
the diversity of a clone library using either rarefaction curves or
a numerical indicator such as the Shannon Index (Krebs, 1999).
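Both diversity measures can be computed directly from the number
of sequences assigned to each OTU. The sketch below is
illustrative and is not MOTHUR code: it uses the Shannon Index in
its usual form, H' = -sum(p_i ln p_i), and one common formulation
of rarefaction, the expected number of OTUs in a random subsample
of n sequences drawn without replacement. The OTU counts are
hypothetical.

```python
import math
from typing import Sequence


def shannon_index(counts: Sequence[int]) -> float:
    """Shannon Index H' = -sum(p_i * ln(p_i)) over OTU proportions."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def rarefied_richness(counts: Sequence[int], n: int) -> float:
    """Expected number of OTUs in a random subsample of n sequences."""
    total = sum(counts)
    if not 0 <= n <= total:
        raise ValueError("subsample size must lie between 0 and the library size")
    denom = math.comb(total, n)
    # For each OTU, 1 - P(no member of that OTU is drawn).
    return sum(1 - math.comb(total - c, n) / denom for c in counts if c > 0)


# Hypothetical clone library: sequences per OTU (not real data).
otu_counts = [40, 25, 15, 10, 5, 3, 1, 1]

print(round(shannon_index(otu_counts), 3))
# Points on a rarefaction curve: subsample size vs expected OTUs.
for n in (10, 25, 50, 100):
    print(n, round(rarefied_richness(otu_counts, n), 2))
```

A rarefaction curve that is still rising steeply at the full
library size suggests that further sequencing would reveal
additional OTUs.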
Next generation sequencing can produce 2-3 orders of
magnitude more data than traditional approaches based on
cloning and sequencing. Thus, while the basic stages in analysis
are similar to the traditional approach, the task of applying it to
many thousands of sequences in parallel usually requires the use
of different software. The Ribosomal Database Project (described above) has a
pyrosequencing pipeline that “processes and converts the data to
formats suitable for common ecological and statistical
packages”. Similarly, QIIME (Quantitative Insights Into
Microbial Ecology) is an open source software package for
analysing high-throughput amplicon sequencing data, such as
16S rRNA gene sequences.
7 DISCUSSION AND CONCLUSIONS
Microbes can be expected to impact most if not all processes
occurring in the geo-environment, and geotechnical engineers
should be aware of the potential for harnessing microbial
metabolism to bring about desired aims. PCR-based
methodologies permit the detection of the microbes present and
of how the community changes with changing conditions. PCR is
relatively easy to use in an engineering setting, and the
availability of reagents in kit form (along with detailed
protocols) means that the barriers to adoption are reasonably
low. However, this is a rapidly moving field, and the advent of
high-throughput deep sequencing technologies has led to the
development of ‘metagenomics’ and ‘metatranscriptomics’, which
investigate the composite genetic potential of an ecological niche.
Instrumentation and sample analysis costs are still relatively
high but are likely to fall as capacity increases and technology
improves. The
sheer volume of data generated poses a significant challenge in
terms of bioinformatics, and fully exploiting these technologies
will require multidisciplinary collaborations between engineers,
molecular biologists and informaticians.
8 REFERENCES
Acinas, S.G. et al. 2004. Fine-scale phylogenetic architecture of a
complex bacterial community. Nature 430(6999), 551-554.
Acinas, S. G. et al. 2005. "PCR-Induced Sequence Artifacts and Bias:
Insights from Comparison of Two 16S rRNA Clone Libraries
Constructed from the Same Sample." Appl. Environ. Microbiol.
71(12): 8966-8969.
Borneman, J. & Triplett, E.W. 1997. Molecular microbial diversity in
soils from eastern Amazonia: evidence for unusual microorganisms
and microbial population shifts associated with deforestation. Appl.
Environ. Microbiol. 63: 2647-2653.
Burke, I.T. et al. 2012. Biogeochemical reduction processes in a
hyper-alkaline affected leachate soil profile. Geomicrobiology
Journal 29(9), 769–779.
Cardinale, M. et al. 2004. Comparison of different primer sets for use in
automated ribosomal intergenic spacer analysis of complex
bacterial communities. Appl. Environ. Microbiol. 70, 6147-6156.
Krebs, C.J. 1999. Ecological Methodology. Addison-Wesley
Educational Publishers Inc, Menlo Park, CA.
Islam, F.S. et al. 2004. Role of metal-reducing bacteria in arsenic
release from Bengal delta sediments. Nature 430(6995), 68-71.
Marchetti, A. et al. 2012. Comparative metatranscriptomics identifies
molecular bases for the physiological responses of phytoplankton to
varying iron availability. PNAS
/pnas.1118408109
Metzker, M.L. 2010. Sequencing technologies - the next generation.
Nature Reviews Genetics 11, 31-46.
Mount, D.W. 2004. Bioinformatics: Sequence and Genome Analysis.
Cold Spring Harbor Laboratory Press: Cold Spring Harbor, NY.
Polz, M. F. and C. M. Cavanaugh 1998. "Bias in Template-to-Product
Ratios in Multitemplate PCR." Appl. Environ. Microbiol. 64(10):
3724-3730.
Promega 2012. GoTaq® DNA Polymerase Protocol. Last accessed 4th
December 2012.
Qiu, X., Wu, L. et al. (2001). "Evaluation of PCR-Generated Chimeras,
Mutations, and Heteroduplexes with 16S rRNA Gene-Based
Cloning." Appl. Environ. Microbiol. 67(2): 880-887.
Roche 2011a. FastStart Taq DNA Polymerase dNTPpack: Version 7.
Last accessed 4th December 2012.
Roche 2011b. 454 Sequencing System Guidelines for Amplicon
Experimental Design. Last accessed 10-12-12.
Sunar, N.M. et al. 2009. Enumeration of salmonella in compost material
by a non-culture based method. Sardinia 2009: 12th Int. Waste
Management and Landfill Symp., 1005-1006.
Page, R.D.M. 1996. TREEVIEW: An application to display
phylogenetic trees on personal computers. Computer Applications
in the Biosciences 12: 357-358.
Wang, Q. et al. 2007. Naive Bayesian classifier for rapid assignment of
rRNA sequences into the new bacterial taxonomy. Appl. Environ.
Microbiol. 73, 5261–5267.
Wang, Z. et al. 2009. RNA-Seq: a revolutionary tool for transcriptomics.
Nature Reviews Genetics 10, 57-63.