We recently completed a project involving a metagenomics-scale comparison of the output from several next-generation sequencing runs against a RefSeq genome collection.
A very high-throughput sequencing service facility using next-generation sequencing technology produces more than one million sequences per day. An equally fast, or faster, data QA process was required to enable timely, high-quality data shipments to the facility's customers.
The best way to evaluate new sequence data quality for errors such as contamination and sample mix-ups is to compare the new data to all available existing data, including both finished and unfinished genomes. Using BLAST, such a comparison will identify the genetic sequence of contaminating organisms as well as any other out-of-order data.
The sequence output from a modest sequencing facility can be evaluated for quality using relatively straightforward, single-step BLAST comparisons run on a typical bioinformatics facility server, or even on a desktop computer with algorithm-specific hardware enhancements. But such solutions are no match for the output of high-throughput sequencing facilities, which use “next-generation” instrumentation to produce 3,000 times as many sequences per day. The gap in scale between new sequence data analysis requirements and current bioinformatics capabilities is made even more acute by the ongoing exponential growth of the worldwide genome sequence data collections that provide the basis for QA analysis of new sequences. A strategic effort to maintain parity by adding dedicated BLAST servers and support personnel at rates proportional to sequence data production and growth would soon overwhelm the original charter of the facility.
Facility directors determined that, using their existing bioinformatics solutions, pre-delivery QA of a single sequencing order consisting of approximately one-half million sequences would require several weeks of analysis and processing time. It seemed that the service facility would have to choose between delivering data of unexamined quality and accepting project turnaround times of more than a month, which would severely constrain its ability to ship data for revenue.
GenomeQuest Inc.'s consulting service organization provided another option. Using GenomeQuest's “fast high identity” sequence comparison algorithm, the GenomeQuest High-Throughput Extension (HTx), BLAST, and a small Linux cluster composed of off-the-shelf hardware, the single run of approximately one-half million sequences was completed and analyzed within six hours.
Using a phased approach, GenomeQuest first found all the public database sequences that were potentially identical to those in the new sequence data set. This step considerably reduced the size and complexity of the remaining analysis. In the second phase, the potential sequence identities were evaluated and confirmed using a more precise BLAST-like alignment tool. This approach provided results much faster than an ordinary single-step BLAST comparison. For more details, read about our HS3 algorithm, specifically designed for next-gen data.
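The general shape of such a two-phase comparison can be illustrated with a small sketch. This is not GenomeQuest's HS3 algorithm or the HTx engine, whose internals are not described here. It is a generic filter-then-verify example that assumes a k-mer index for the fast first pass and uses Python's standard difflib as a stand-in for the more precise alignment step. All function names, the k-mer length, and the thresholds are hypothetical choices made for illustration.

```python
from collections import defaultdict
from difflib import SequenceMatcher

K = 8  # k-mer length; an assumed value for illustration only


def build_kmer_index(references, k=K):
    """Map each k-mer to the set of reference ids that contain it."""
    index = defaultdict(set)
    for ref_id, seq in references.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(ref_id)
    return index


def candidate_references(query, index, k=K, min_shared=2):
    """Phase 1: cheap filter. Keep references sharing at least
    min_shared distinct k-mers with the query."""
    counts = defaultdict(int)
    query_kmers = {query[i:i + k] for i in range(len(query) - k + 1)}
    for kmer in query_kmers:
        for ref_id in index.get(kmer, ()):
            counts[ref_id] += 1
    return [ref_id for ref_id, n in counts.items() if n >= min_shared]


def identity(query, reference):
    """Phase 2 stand-in: fraction of query bases that fall inside
    matching blocks. A real pipeline would use a proper aligner."""
    matcher = SequenceMatcher(None, query, reference, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(query)


def two_phase_hits(query, references, index, min_identity=0.95):
    """Run the expensive check only on candidates that survive phase 1."""
    hits = []
    for ref_id in candidate_references(query, index):
        score = identity(query, references[ref_id])
        if score >= min_identity:
            hits.append((ref_id, score))
    return sorted(hits, key=lambda hit: -hit[1])
```

The point of the design is that the costly comparison in the second phase runs only against the handful of references that pass the inexpensive first-phase lookup, rather than against the whole collection.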
Though the hardware requirements for this solution were modest, the GenomeQuest HTx engine and fast high precision sequence comparison algorithm were required enabling technologies. The entire consulting project was limited in scope to a single set of approximately one-half million sequences, and required two weeks of time to design the workflow, run the analysis, and produce the reports.
Now that the workflow exists, sequencing facility staff members can use the GenomeQuest High-Throughput Extension to run new data sets themselves. The six-hour time frame required for this QA analysis is approximately commensurate with the rate of data production from the sequencing service facility. With the GenomeQuest solution, the sequencing facility's data can be analyzed and delivered nearly as rapidly as they are produced.
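As a rough check on that claim, the figures quoted in this case study can be turned into a back-of-the-envelope capacity estimate. The sketch below assumes that QA runs of about one-half million sequences can be executed back to back on the same cluster with no idle time, which is an assumption and not something measured in the project. The function name is hypothetical.

```python
def qa_capacity_per_day(sequences_per_run, hours_per_run):
    """Sequences that can be QA-checked in 24 hours, assuming
    runs are executed back to back with no idle time."""
    return sequences_per_run * 24 / hours_per_run


# Figures taken from the case study (both are approximate).
capacity = qa_capacity_per_day(500_000, 6)
facility_output_per_day = 1_000_000  # "more than one million sequences per day"

print(f"Estimated QA capacity: {capacity:,.0f} sequences per day")
print(f"Ratio to facility output: {capacity / facility_output_per_day:.1f}x")
```

Under these assumptions the estimated QA capacity is of the same order as the facility's daily output, which is consistent with the statement that analysis can roughly keep pace with production.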