Academic Paper

Declarative nested data transformations at scale and biomedical applications
Document Type
Electronic Thesis or Dissertation
Author
Source
Subject
Genomics
Data processing
Query languages (Computer science)
Big data
Language
English
Abstract
While large-scale distributed data processing platforms have become an attractive target for query processing, these systems are problematic for applications that deal with nested collections. Programmers are forced either to perform non-trivial translations of collection programs or to employ automated flattening procedures, both of which lead to performance problems. These challenges only worsen for nested collections with skewed cardinalities, where both handcrafted rewriting and automated flattening are unable to enforce load balancing across partitions. This work proposes the TraNCE compilation framework, which translates a program manipulating nested collections into a set of semantically equivalent shredded queries that can be efficiently evaluated. The framework employs a combination of query compilation techniques, an efficient data representation for nested collections, and automated skew-handling. Biomedical case studies are presented that outline research and clinical applications for the platform, including data integration support for building feature sets for classification. An extensive experimental evaluation is provided using both synthetic and real-world datasets from the biomedical domain. The evaluation shows that the system is capable of outperforming the common alternative, based on "flattening" complex data structures, and runs efficiently when alternative approaches are unable to perform at all.
