With backing from the National Science Foundation and the National Institutes of Health, as well as private academic and commercial clients, the National Center for Genome Resources has held a leadership position in the bioinformatics arena since its establishment in 1994. NCGR uses Sybase Adaptive Server Enterprise (ASE) to manage its ever-growing pool of genome information and support the multiple applications accessing it.
Supporting the Human Genome Project
The nonprofit National Center for Genome Resources (NCGR) was originally spun off from Los Alamos National Laboratory to support the Human Genome Project. Since 1994, this information-intensive organization has relied on Sybase Adaptive Server Enterprise (ASE) to hold its genetic sequence data. ASE plays a key role in the success of the organization and the pursuit of its mission to improve human health and nutrition. To meet the endless demands of scientists and practitioners, NCGR uses ASE to manage a total data store of around 14 terabytes, staking its claim among very large databases (VLDB).
Bioinformatics on the Building Blocks of Life
Genomic research looks at differences and similarities between genetic sequences. The human genome and many plant and animal DNA sequences have been fully or partially mapped and used as baseline data. NCGR's services include bioinformatics, an in-depth analysis of customer-submitted sequences and statistical comparison against existing reference baselines. NCGR also provides sequencing services, where a test tube sample of organic material is analyzed to determine its genetic sequence.
The National Center for Genome Resources uses ASE for its data in multiple applications, including the Grindstone Lab Information Management System (LIMS), which stores mission-critical information in support of the genome sequencing lab, as well as sequencing data for its Alpheus® bioinformatics application. Alpheus® data includes two basic sets of information: seven terabytes of reference sequences and another seven and a half terabytes of new data pipelined from the resequencing instruments. The new data is compared against the reference sequences. To maintain manageable table sizes, the data is commonly partitioned by species and sample. NCGR provides these services and information to a variety of clients – from academic organizations to commercial entities, such as the agriculture and biotech industries.
Neil Miller, principal software engineer and informatics team lead, explains, "Many of our clients come to us for sequencing, as well as for data analysis assistance. Our industry is currently undergoing what is called the ‘Second Genome Revolution’ because, in the last year or two, the cost of sequencing has dropped dramatically. However, it still costs more than a small lab might be able to afford, so they pay us to do the sequencing. Additionally, we perform complex informatics for many customers."
Simple Data, Daunting Quantities
The basics of genome research are very simple: adenine, cytosine, guanine, and thymine (ACGT) are the four bases which form pairs to make up the rungs of the DNA helix. The technology stores these (essentially simple character strings) as a group of 36 to 46 characters that make up a single sequence. The diversity of life on earth is in the number and ordering of these bases.
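To make the "simple character strings" point concrete, here is a minimal Python sketch of how a single read might be checked. It assumes reads contain only the four letters A, C, G, and T and fall within the 36 to 46 character range mentioned above. The names `READ_PATTERN`, `is_valid_read`, and `base_counts` are hypothetical and are not taken from NCGR's software.

```python
import re

# Assumption: a read is 36 to 46 characters long and uses only A, C, G, T.
# Real instrument output may include ambiguity codes such as N; this
# sketch simply rejects them.
READ_PATTERN = re.compile(r"[ACGT]{36,46}")


def is_valid_read(read: str) -> bool:
    """Return True if the read is 36-46 characters of A, C, G, or T."""
    return READ_PATTERN.fullmatch(read) is not None


def base_counts(read: str) -> dict:
    """Count how often each of the four bases occurs in a read."""
    return {base: read.count(base) for base in "ACGT"}
```

A read such as `"ACGT" * 9` (36 characters) passes the check, while a shorter string or one containing other letters does not.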
The human genome has about three billion base pairs, comprising approximately 27,000 individual genes that are grouped into 23 pairs of chromosomes. Species like frogs, with multiple life stages, have even more. While the basic information is very simple, the complexity exists in the quantity and the associated issues of storage, retrieval, annotation, and pattern matching.
NCGR has six sequencing instruments that analyze samples. Each instrument produces 50 million new sequences for comparison every three days. This translates into adding 32 GB to ASE with each run, amounting to over ¼ TB of data each week. In fact, during the next year, NCGR plans to accommodate loading 60 TB of data.
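The weekly figure can be reproduced with a short calculation. The sketch below rests on one reading of the numbers above, namely that 32 GB is loaded per instrument per run and that all six instruments run back to back; the text does not state this explicitly. The function name `weekly_load_gb` is hypothetical.

```python
# Figures quoted in the text.
INSTRUMENTS = 6    # sequencing instruments
GB_PER_RUN = 32    # data added to ASE with each run
DAYS_PER_RUN = 3   # each run takes three days


def weekly_load_gb(instruments: int = INSTRUMENTS,
                   gb_per_run: float = GB_PER_RUN,
                   days_per_run: float = DAYS_PER_RUN) -> float:
    """Estimate the GB loaded per week.

    Assumption: every instrument runs continuously and each run
    contributes gb_per_run on its own.
    """
    return instruments * gb_per_run * 7 / days_per_run


if __name__ == "__main__":
    print(f"Estimated weekly load: {weekly_load_gb():.0f} GB")
```

Under these assumptions the estimate is about 448 GB per week, which is consistent with "over ¼ TB". If 32 GB were instead the combined total for all instruments, the weekly load would be roughly 75 GB, so the per-instrument reading fits the quoted figure better.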
Alpheus® – 24x7 Software as a Service in Genome Research
A few years ago, a group of scientists approached NCGR to work on a Web-based bioinformatics system. Neil Miller recalls, "They had this particular idea for sequencing but there were no available informatics tools for handling what they needed. This group then approached NCGR to partner on the project. Over the course of six intense weeks, a group of us put together the first draft of the system. Initially, we designed the system to service one particular client. Once we realized the value contained in the tool, we began modifying it to broaden its use. The result is our hosted Alpheus® system."
Alpheus® is NCGR's Sequence Variant Detection Pipeline application. The Web-based, software as a service (SaaS) application imports user data in multiple popular data formats and then matches the imported sequences against NCGR's reference data, looking for notable characteristics and differences. The results are reported both visually and as downloadable datasets. Alpheus® is surprisingly easy to use considering the breadth of functionality it packs into a Web application. ASE serves as the backend of Alpheus®, supplying massive quantities of data to NCGR's multiple toolsets. Currently, NCGR hosts 49 Alpheus® databases containing data for 22 different organisms, which are used in a wide variety of studies in the areas of human health and nutrition.
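As a rough illustration of what "matching imported sequences against reference data" can mean at its simplest, the Python sketch below compares one read with a reference string at a known position and lists the bases that differ. This is a toy example, not the Alpheus® pipeline: it assumes the alignment offset is already known and considers only single-base substitutions. The name `find_variants` is hypothetical.

```python
def find_variants(reference: str, read: str, offset: int) -> list:
    """Compare a read against the reference at a known offset.

    Returns a list of (reference_position, reference_base, read_base)
    tuples, one for each position where the two sequences differ.
    Assumption: the read is already aligned at `offset`, with no
    insertions or deletions.
    """
    if offset < 0:
        raise ValueError("offset must not be negative")
    window = reference[offset:offset + len(read)]
    if len(window) != len(read):
        raise ValueError("read extends past the end of the reference")
    return [
        (offset + i, ref_base, read_base)
        for i, (ref_base, read_base) in enumerate(zip(window, read))
        if ref_base != read_base
    ]
```

For example, comparing the read `"GTTC"` with the reference `"ACGTACGTAC"` at offset 2 reports a single difference at reference position 4, where the reference has `A` and the read has `T`. A production pipeline would also have to find the alignment position, handle insertions and deletions, and weigh the statistical support for each difference.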
Kathy Myers, the senior database administrator, describes Alpheus®: "With these GUI tools we now have the ability to drill down into a gene and see what it looks like in a graphical context, right down to the actual nucleotides. Alpheus® is a 24/7 system providing access worldwide. We had a client who actually located and identified a new gene in a plant and that is a huge new scientific breakthrough using ASE and the Alpheus® tool."
ASE at the Heart of Groundbreaking Bioinformatics Applications
While most businesses and development organizations refine existing, well-understood areas of knowledge, NCGR has been an early participant in genome information study and continues to innovate in a rapidly evolving field.
"This is a brave new world in genetics and we've managed to become one of the largest centers doing this type of work. We have a terrific history in creating the software tools and analysis techniques for genetic data, and, in this field, that's actually a fairly new capability," says Kathy Myers.
Neil Miller adds, "Our main product was created at a time when nothing existed for scientists and doctors to take advantage of the advances in sequencing technology. Using a relational database is a key component of creating the tools to help people analyze the data. Before that, without these kinds of tools, scientists were left holding this enormous bag of sequence data they didn't know how to make sense of and were scrambling to find tools they could use on their own."
Sybase has worked well for the team at NCGR, which has developed many applications around ASE as the central database. The tools are written in Java, Perl, and Ruby on Rails. "Quite simply, it's robust. We trust our data integrity to it," says Neil Miller. Adds Kathy Myers, "ASE has been remarkably stable. It is an easy database to use, and an easy database to understand. I should also mention that the few times when I have needed technical support, it's been terrific. I immediately got to people who knew what they were doing, and they quickly gave me solutions to the few issues I had."
Applied Technology to Human Health for the Greater Good
NCGR's scope goes beyond providing tools for scientists and commercial users. NCGR also uses these tools for world-class research projects of its own. For example, NCGR is internationally recognized for its work on identifying genetic components of schizophrenia. This type of identification is a boon for both early diagnosis and treatment, taking some of the guesswork out of a disease that, when untreated or treated late, can be devastating for sufferers and families.
Another study is a public-private multi-disciplinary collaboration involving investigators at NCGR along with other organizations including Duke University Medical Center, Henry Ford Hospital, Eli Lilly and Co., Pfizer, Inc., Roche Diagnostics Corp. and Metabolon. This study uses advanced bioinformatics technologies to identify specific biomarkers in patient blood samples that predict outcomes in sepsis and community-acquired pneumonia. Development of biomarker-based tests will enable patient-specific diagnoses and early targeted treatments with much higher success rates.
Mesothelioma is a cancer linked to asbestos exposure. With the help of NCGR, doctors from the International Mesothelioma Program at Brigham and Women's Hospital were able to characterize the genetic mutations found in the tumors. This characterization helps shed light on the underlying processes of the disease, including the specific genes involved in the mutation. Kathy Myers explains, "This kind of conclusion is absolutely critical to ongoing prevention and treatment efforts."
The predictive nature of genetics is saving lives by identifying individuals who are at risk from exposures that are benign for most people. For example, NCGR recently worked with a company that was studying adverse reactions to a smallpox vaccine in a few people, and the reaction appears to have a genetic component. Perfecting this type of test can bring the benefit of the drug to large groups of people, while identifying and ensuring the safety of the few who would suffer.
Neil Miller finds himself in a rare and enviable position as a programmer: "As a technical person I am always amazed that a technical product can be used to further our knowledge of genetics and actually make a difference in human health. Before I came here I did not study a lot of biology, and it is great to be working on something that people are putting to very good, long-term use."
Ensuring the Food Supply
In the plant world, NCGR's Legume Information System catalogs the genetic similarities and differences of beans. Beans provide a ready source of protein throughout the world and are easily grown in depleted soils. The Legume Information System integrates genetic and molecular data from multiple legume species for use in agricultural research.
Water molds are very common plant pathogens, and worldwide they cause crop and ecological damage estimated in the hundreds of billions of dollars each year. Water molds impact food crops, ornamentals, forest products, and seafood. NCGR is highly involved in analyzing and annotating gene sequences of various water molds.
Indeed, anything living, and even entities like viruses that are not necessarily a form of life, is worthy of study and analysis. Using ASE, the National Center for Genome Resources has become a pillar of this rapid-growth industry.
Over the life of ASE at NCGR, its use has evolved. "Sybase ASE is really the cornerstone of our research. ASE performs remarkably well, given the exceptional amount of data we are throwing at it right now. ASE will continue to be our solution for the individual database projects that will be generating data in the hundreds of gigabyte range," says Kathy Myers.
With regard to data sizes, Myers notes, "This whole area of research becomes a problem of statistics. How many healthy people will we need to sequence to come up with a true normal, a true genetic sequence to compare against? Once we establish a range for normal, then how many sets of patients will we need to compare against to have a statistically significant number? The numbers and amounts become enormous; for a database person this is a very exciting time."
With the Second Genome Revolution, data sizes are exploding. At nearly 15 terabytes, and with rapid growth in data store size projected for the near future, NCGR is looking to move to the next level of very large storage systems. A prime candidate is Sybase IQ, a column-based analytics server. Sybase IQ offers a simple migration path from ASE, and many Sybase customers use a distributed data management architecture with a mix of Sybase ASE and Sybase IQ. As NCGR stores more reference sequences to use as baselines for informatics analysis, its needs are shifting from a highly transactional approach to that of a large data warehouse.
Hitting the Target
The National Center for Genome Resources has a concrete and powerful mission: "To improve human health and nutrition by genome sequencing and analysis." NCGR plays an important role in genome research by furthering the state of the art in genetic research.
It could be said that genetic research has reached the end of the beginning. The core elements are in place, and with increasingly affordable sequencing costs, what comes next is an explosion of baseline sequence libraries and a surge in applied genome techniques. The National Center for Genome Resources is a nonprofit organization whose work will pay dividends to future generations.