Methods
Click here for a schematic representation of the methods (PDF format).

Input data, masking of low quality and contaminant sequences

Zea mays genomic survey sequences (GSSs) from the inbred line B73 obtained from methyl-filtration (MF), high Cot (HC)-filtration, shotgun, or BAC-end sequencing were used for ISU MAGIs series 2 assemblies (e.g., version 2.31). The MAGI 3.1 assembly used only MF, HC, and TIGR-generated shotgun sequences within TraceDB, which yield a significantly lower inherent error rate after trimming (Fu et al., 2004) with the parameters Bracket [20 0.003], Window [10 0.01] and Error [0.005 0.002]. The latest MAGI assembly, version 4.0, uses the same input as version 3.1 but also incorporates similarly trimmed JGI-generated shotgun sequences, BAC shotgun reads from the Consortium for Maize Genomics and Methylation-spanning Linker Library (MSLL) BAC end sequences generated at the University of Arizona trimmed with the default Lucy parameters (Bracket [10 0.02], Window [50 0.08 10 0.3] and Error [0.025 0.02]). Please see Tables 1-3 for more details.

All sequences were obtained from the GenBank TraceDB. Note that the base calling of the PGIR clones (Table 1) in TraceDB differs from the base calling of the corresponding GSSs in GenBank.

These data were first checked for sequences with a large percentage of undetermined bases, external sequence contamination, and extensive simple repeats with the SeqClean script. Short vector contamination at the terminal ends of sequences was located using NCBI's UniVec_Core database and trimmed. Bacterial, mitochondrial, and chloroplast contamination was identified via strong sequence similarity to any one of the following: E. coli K12 (GenBank accession U00096), bacteriophage phi_X174 (GenBank accession J02482), the Zea mays chloroplast genome (GenBank accession NC_001666), and the draft Zea mays mitochondria NB genome (C. Fauron, University of Utah).
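As a rough illustration of the first screening step (rejecting sequences with a large percentage of undetermined bases), the check can be sketched in Python as follows. This is not the SeqClean script itself. The function names and the 3% cutoff are assumptions chosen for the example, since the text does not state the threshold that was used.

```python
# Illustrative sketch only: NOT the SeqClean implementation.
# The 3% cutoff below is a hypothetical value chosen for the example.

def undetermined_fraction(seq):
    """Return the fraction of bases in seq that are undetermined (N)."""
    if not seq:
        return 1.0  # treat an empty read as fully undetermined
    n_count = sum(1 for base in seq.upper() if base == "N")
    return n_count / len(seq)


def passes_n_screen(seq, max_n_fraction=0.03):
    """True if the read has an acceptably small share of undetermined bases."""
    return undetermined_fraction(seq) <= max_n_fraction


def screen_reads(reads, max_n_fraction=0.03):
    """Keep only the reads (name -> sequence) that pass the N screen."""
    return {
        name: seq
        for name, seq in reads.items()
        if passes_n_screen(seq, max_n_fraction)
    }
```

For example, `screen_reads({"good": "ACGT" * 25, "bad": "N" * 10 + "A" * 90})` keeps only the read named `good`, because the second read is 10% undetermined.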
Because assembly including the most substantial mitochondrial contamination produces contigs with equally high similarity to the mitochondrial genome, masking should not throw away any autosomal regions.

Determining and masking repetitive elements

Masking prior to assembly uses a Perl script that relies upon standalone BLAST. BLAST hits with at least 80% identity over thirty bases with an associated minimum E-value of 5e-4 are masked, along with any hit with 80% identity over more than sixty bases. The latter criterion was added to mask AT-rich LTRs that do not meet the statistical criteria outlined above because of their biased composition. Based on a series of tests, the false positive and negative rates of masking are very close to the minimum BLAST E-value used.

Assembly of non-uniform genomic fragments using PaCE

Sequence alignment is a time-expensive operation. The problem with non-uniform sampling is that there are potentially a quadratic number of promising pairs of sequences that need to be aligned to determine overlaps. Our pipeline uses the parallel EST clustering tool PaCE (Parallel Clustering of ESTs, Kalyanaraman et al., 2003) to significantly reduce the problems inherent in non-uniform samples and to quickly assemble maize genomic islands. Full details of the advantages of our "clustering-layout-consensus" assembly strategy can be found in Emrich et al. (2004) and Emrich et al. (2005). More recently, we have updated and optimized the PaCE software to run on IBM's Blue Gene/L supercomputer (BG/L), and this software was used to generate the MAGI version 4.0 assembly. High-performance genome assembly on IBM BG/L will be discussed elsewhere (Kalyanaraman et al., 2006).

Assembly of PaCE clusters

MAGI versions 3 and higher incorporate both improved sequence quality (Fu et al., 2004) and clone pair information. Besides increasing overall quality due to these physical constraints, clone pair information also allows bridging of potential "repeat-masked" gaps, since unmasked fragments are used in the assembly of PaCE clusters. MAGI version 4 incorporates quality file information for even greater improvement in overall assembly sequence fidelity.

CAP3 is our current assembly engine and is used with the following parameters: 98% identity, 80 bp overlap, 60 bp overhang. Using more stringent assembly options, as supported by empirical estimates of sequencing errors, allows our pipeline to potentially differentiate more paralogs within the maize genome than a lower threshold would.
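The two repeat-masking criteria given under "Determining and masking repetitive elements" can be sketched in Python as shown here. This is a minimal illustration, not the Perl script used in the pipeline. The `BlastHit` record, the function names, and the mask character `X` are hypothetical. The sketch also assumes that the E-value criterion means an E-value of 5e-4 or smaller, and that hit coordinates are 0-based with an exclusive end.

```python
from dataclasses import dataclass

# Thresholds taken from the text; everything else here is an assumption.
MIN_IDENTITY = 80.0    # percent identity required by both criteria
SHORT_HIT_LENGTH = 30  # bases; hits this long must also pass the E-value test
MAX_EVALUE = 5e-4      # assumed reading: E-value must be at most 5e-4
LONG_HIT_LENGTH = 60   # bases; longer hits are masked regardless of E-value


@dataclass
class BlastHit:
    """Hypothetical record for one BLAST hit on a query sequence."""
    start: int       # 0-based, inclusive
    end: int         # 0-based, exclusive
    identity: float  # percent identity
    evalue: float

    @property
    def length(self):
        return self.end - self.start


def should_mask(hit):
    """Apply the two masking criteria described in the text."""
    if hit.identity < MIN_IDENTITY:
        return False
    significant = hit.length >= SHORT_HIT_LENGTH and hit.evalue <= MAX_EVALUE
    # Second criterion: catches AT-rich LTRs with weak E-values.
    long_hit = hit.length > LONG_HIT_LENGTH
    return significant or long_hit


def mask_sequence(seq, hits, mask_char="X"):
    """Replace every base covered by a qualifying hit with mask_char."""
    masked = list(seq)
    for hit in hits:
        if should_mask(hit):
            for i in range(max(0, hit.start), min(len(seq), hit.end)):
                masked[i] = mask_char
    return "".join(masked)
```

In this sketch, a 30-base hit at 85% identity with an E-value of 1e-5 is masked, and the same hit with an E-value of 0.01 is left alone. A 61-base hit at 80% identity is masked whatever its E-value, which mirrors the second criterion added for AT-rich LTRs.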