IBM and CLC bio provide an accelerated genomics research platform to convert sequencer data to usable genomic insight.
Imagine a world where medical diagnoses and treatment regimens are based on a person’s specific genetic makeup—reducing side effects and improving patient outcomes. That’s the promise of personalized medicine, which is rapidly becoming a reality through advances in genomic sequencing and analysis.
APPLYING GENOMIC SEQUENCING TO THERAPEUTICS
Dr. Lukas Wartman has firsthand experience with the power of genomic sequencing. A genetics researcher at Washington University in St. Louis, Missouri, Dr. Wartman ended up contracting the very disease he was studying: adult acute lymphoblastic leukemia. His condition deteriorated rapidly, and there was no known treatment for the cancer.
His colleagues decided to fully sequence the genes of both his cancerous cells and healthy cells using the High Performance Computing cluster housed in the Genome Institute at Washington University. They discovered something completely unexpected: one of Dr. Wartman’s normal genes, FLT3, was malfunctioning, producing massive quantities of a protein that was feeding the cancer.
The team found a drug typically used to control the overactive FLT3 gene in patients with kidney cancer. Dr. Wartman became the first person to take this drug for leukemia, and his cancer is now in remission. Dr. Wartman’s case demonstrates how genomic sequencing enables researchers to understand the role of genes in fueling a specific cancer. Consequently, cancer treatment could be customized with drugs that target a gene rather than the tumor or tissue where the cancer first appears.
IBM Technical Computing
ESTABLISHING HIGH-THROUGHPUT PERFORMANCE
Because each human genome comprises over three billion base pairs, whole genomic sequencing requires tremendous processing power and storage capacity in order to correlate the variants in the genome with the relevant patient symptoms. Facing increased demand for sequencing, the industry is challenged to drive down cost while speeding up the assembly, mapping and analysis involved in the sequencing process.
To address these issues, IBM and CLC bio have undertaken a joint effort to develop the IBM Application Ready Solution for CLC bio, a next-generation sequencing (NGS) platform. The system was built for practitioners, requiring little IT administration, yet it is scalable, flexible and extendable. This end-to-end solution integrates a computing cluster built on advanced IBM hardware and software, CLC Genomics Server software for high-throughput sequencing, and CLC Genomics Workbench client/desktop software for analyzing and visualizing NGS data.
The cluster compute nodes consist of IBM® Flex System™ x240 powered by Intel® Xeon® E5-2680v2 processors. These nodes are connected to an IBM Storwize® V7000 Unified network attached storage system that consolidates block and file workloads. The Storwize V7000 Unified system also has a single, easy-to-use management interface that supports both block and file storage, helping to simplify administration.
The Storwize V7000 Unified system supports file data storage using the IBM General Parallel File System (GPFS™). With its leading file system performance and its ability to scale based on customer needs, GPFS is used in the world’s largest high-performance computing (HPC) installations in addition to mainstream technical computing environments. In addition, CLC bio software uses a shared-disk file management solution that provides fast, reliable access to NGS data for optimizing performance.
Life Sciences
To simplify the deployment and management of the cluster, IBM Platform™ HPC provides a complete set of technical and high performance computing (HPC) management capabilities in a single product. The rich set of out-of-the-box features reduces the complexity and cost of managing and running an optimized genomics sequencing cluster. Integrated workload management features have been designed to help improve time-to-results and asset utilization.
PROVIDING A SCALABLE, TURNKEY SOLUTION
IBM Application Ready Solution for CLC bio has been developed in partnership with CLC bio to deliver a scalable, high performance genomics sequencing platform based on an IBM reference architecture. A turnkey solution is available from IBM business partner Re-Store, LLC. It comes pre-integrated with CLC Genomics Server and CLC Genomics Workbench and includes global support and service. The solution is easy to deploy and use, simplifying IT administration and boosting productivity. It has also been designed to scale as workloads expand over time. The solution provides up to 90 TB of effective storage capacity, and administrators can easily add storage extensions and more compute nodes as necessary.
The three analytics solutions (Small, Medium and Large) have been benchmarked for their mapping, variant calling and filtering performance.
CLC Genomics Workbench 6.5 and the Platform HPC-enabled CLC Genomics Server 5.5 were installed on an IBM server backed by Storwize V7000 Unified and GPFS. The benchmark was executed using a 37x coverage human genome data set (1,415,483,596 reads, 100 bp/read) and 150x coverage exome reads (NA12878) from the Illumina Genome Analyzer II. Benchmarking showed that the Analytics Solutions perform as follows (see Figure 1).
Turnkey solution options:

| Component | Small Analytics Solution | Medium Analytics Solution | Large Analytics Solution |
| --- | --- | --- | --- |
| Workload size per week | 15 human genome (37x) or 120 human exome (150x) | 30 human genome (37x) or 240 human exome (150x) | 60 human genome (37x) or 480 human exome (150x) |
| Applications | CLC Genomics Server 5.5x, CLC Genomics Workbench: 9 static licenses | CLC Genomics Server 5.5x, CLC Genomics Workbench: 12 static licenses | CLC Genomics Server 5.5x, CLC Genomics Workbench: 15 static licenses |
| Application maintenance | Three years of full maintenance (support and all upgrades) on CLC bio software | Three years of full maintenance (support and all upgrades) on CLC bio software | Three years of full maintenance (support and all upgrades) on CLC bio software |
| Management software | IBM® Platform™ HPC | IBM Platform HPC | IBM Platform HPC |
| System rack | One 25U rack | One 25U rack | One 42U rack |
| System switch | Top-of-rack network switch | Top-of-rack network switch | Top-of-rack network switch |
| System management node | One IBM Flex System x240 with 16 CPU cores and 64 GB RAM | One IBM Flex System x240 with 16 CPU cores and 64 GB RAM | One IBM Flex System x240 with 16 CPU cores and 64 GB RAM |
| System compute nodes | Three IBM Flex System x240 with 60 CPU cores and 384 GB RAM | Six IBM Flex System x240 with 120 CPU cores and 768 GB RAM | Twelve IBM Flex System x240 with 240 CPU cores and 1536 GB RAM |
| CPU/compute node | 2 Intel Xeon 10C Processor Model E5-2680v2 115W 2.8GHz/1866MHz/25MB | 2 Intel Xeon 10C Processor Model E5-2680v2 115W 2.8GHz/1866MHz/25MB | 2 Intel Xeon 10C Processor Model E5-2680v2 115W 2.8GHz/1866MHz/25MB |
| Memory/compute node | 128 GB DDR3 | 128 GB DDR3 | 128 GB DDR3 |
| System internal storage | 6 TB, 7,200 rpm NL SAS | 6 TB, 7,200 rpm NL SAS | 6 TB, 7,200 rpm NL SAS |
| Storwize V7000 Unified | 20 TB effective storage capacity | 55 TB effective storage capacity | 90 TB effective storage capacity |
| System maintenance | 3 Year Onsite Repair 24x7, 4 Hour Response | 3 Year Onsite Repair 24x7, 4 Hour Response | 3 Year Onsite Repair 24x7, 4 Hour Response |
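The sizing figures in the turnkey options table can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: the `Solution` class, its field names and the `per_node_figures` helper are hypothetical and not part of any IBM or CLC bio tooling, while the numbers are copied from the table. It divides each solution's totals by its compute node count to see whether the totals agree with the per-node specification (two 10-core processors and 128 GB of RAM per compute node).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Solution:
    """One column of the turnkey options table (hypothetical structure)."""
    name: str
    compute_nodes: int
    total_cores: int
    total_ram_gb: int
    genomes_per_week: int   # 37x human genomes
    exomes_per_week: int    # 150x human exomes


# Values copied from the turnkey solution options table.
SOLUTIONS = [
    Solution("Small", 3, 60, 384, 15, 120),
    Solution("Medium", 6, 120, 768, 30, 240),
    Solution("Large", 12, 240, 1536, 60, 480),
]

# Per-node specification from the table: 2 x 10-core Xeon, 128 GB DDR3.
CORES_PER_NODE = 2 * 10
RAM_PER_NODE_GB = 128


def per_node_figures(s: Solution) -> tuple[int, int, float, float]:
    """Return (cores, RAM in GB, genomes/week, exomes/week) per compute node."""
    return (
        s.total_cores // s.compute_nodes,
        s.total_ram_gb // s.compute_nodes,
        s.genomes_per_week / s.compute_nodes,
        s.exomes_per_week / s.compute_nodes,
    )


for s in SOLUTIONS:
    cores, ram, genomes, exomes = per_node_figures(s)
    consistent = cores == CORES_PER_NODE and ram == RAM_PER_NODE_GB
    print(f"{s.name}: {cores} cores, {ram} GB, "
          f"{genomes:g} genomes or {exomes:g} exomes per node per week "
          f"(matches node spec: {consistent})")
```

Under these figures every option works out to 20 cores, 128 GB of RAM and a weekly workload of 5 genomes or 40 exomes per compute node, so the three sizes appear to differ only in node count, rack size and storage capacity.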
| Workflow step | 37x Coverage WGS (HH:MM:SS) | 150x Coverage WEX (HH:MM:SS) |
| --- | --- | --- |
| Mapping | 5:56:04 | 0:42:05 |
| Variant calling | 16:33:33 | 1:27:21 |
| Filtering | 0:19:32 | 0:14:31 |


Figure 1. NGS workflow benchmark performance for 37x coverage whole human genome reads and 150x coverage whole human exome reads on a single IBM compute node. The workflow includes read mapping, variant calling, and filtering of variants against a known database (a database of common SNPs/INDELs).
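To put the Figure 1 timings in context, the following sketch sums the three workflow steps for each data set and estimates how many samples a single compute node could finish in a week. It is a rough upper bound under stated assumptions: the steps run back to back, one sample at a time, with no queueing, data transfer or idle time, none of which the brochure confirms. The helper names (`parse_hms`, `total_time`, `samples_per_week`) are hypothetical; the timings are taken from Figure 1.

```python
from datetime import timedelta

# Step timings from Figure 1 (single compute node), as H:MM:SS strings.
BENCHMARK = {
    "37x WGS": {
        "mapping": "5:56:04",
        "variant calling": "16:33:33",
        "filtering": "0:19:32",
    },
    "150x WEX": {
        "mapping": "0:42:05",
        "variant calling": "1:27:21",
        "filtering": "0:14:31",
    },
}

WEEK = timedelta(weeks=1)


def parse_hms(text: str) -> timedelta:
    """Convert an 'H:MM:SS' string into a timedelta."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def total_time(steps: dict[str, str]) -> timedelta:
    """Sum the step durations, assuming they run strictly one after another."""
    return sum((parse_hms(t) for t in steps.values()), timedelta())


def samples_per_week(steps: dict[str, str]) -> int:
    """Whole samples that fit in one week on one node (idealized ceiling)."""
    return WEEK // total_time(steps)


for dataset, steps in BENCHMARK.items():
    print(f"{dataset}: {total_time(steps)} per sample, "
          f"at most {samples_per_week(steps)} samples per node per week")
```

With the Figure 1 values this gives 22:49:09 per whole genome (at most 7 per node per week) and 2:23:57 per exome (at most 70 per node per week). For comparison, the Small Analytics Solution is rated at 15 genomes or 120 exomes per week on three compute nodes, which is below this idealized ceiling; the brochure does not state how the rated figures were derived.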
PROVIDING A FOUNDATION FOR FULL-GENOME ANALYSIS
In the future, a person’s entire genome sequence will become part of his or her electronic medical records. A full individual genome can be compared to a reference human genome, but the assembly, mapping and analysis involved could previously take weeks or months. Benchmarking shows that the performance of IBM Application Ready Solution for CLC bio integrated with CLC Genomics Server enables researchers to obtain this critical information in a matter of days, or even hours. The solution provides a scalable, flexible, high-performance platform that helps accelerate genomic research and leads to a deep understanding of the associations between genetic variations and diseases, and of potential cures.