Considerations around industrializing a distributed clinical genomics NGS bioinformatics data solution at scale

Written by
Elaine Gee, PhD
Founder of BigHead Analytics Group

The core computational engine underlying a next-generation sequencing (NGS)-based assay is the data processing pipeline, a series of complex bioinformatics algorithms and quality control steps. The analytic pipeline is just one component of the bioinformatics ecosystem; the computational infrastructure supporting the analytics provides the critically important platform for maintaining security and performance at scale. Additionally, bioinformatics pipelines that serve clinical assays must address a specific clinical utility and therefore undergo a rigorous validation process, among other requirements, prior to clinical use. When protected health information (PHI) enters the bioinformatics process, the infrastructure must also provide the data privacy and security protections required by the applicable laws and regulations at the local, state, and national levels.

With many bioinformatics tools now hosted as services by third-party vendors, pipelines can outsource components, from running specialized variant calling algorithms to accessing up-to-date curated annotations to leveraging secure and scalable compute resources. Such distributed pipelines involve data transfer between physical locations and require special attention to ensure analytic performance as well as compliance with evolving laws and regulations governing, for example, laboratory accreditation, data handling, and security. Furthermore, deploying bioinformatics on a global scale requires in-depth, comprehensive expertise to navigate the complex regulatory landscape defined by the various relevant organizations. Deploying clinical bioinformatics as part of a distributed global operation requires a foundational platform built for decentralized scalability and compliance while maintaining centralized data operations at the core.

Regardless of a clinical pipeline's complexity or scale, creating a high-quality, robust, and comprehensive pipeline that addresses the clinical utility of the NGS assay at hand requires many considerations. BigHead Analytics Group and BlueBee have teamed up to share our top 10 considerations around designing a scalable clinical bioinformatics ecosystem, whether starting from scratch or modernizing a legacy system for increased interoperability and scalability. We have based these considerations on common experiences building and validating analytical pipelines at scale, including global roll-outs.

Top 10 considerations around industrializing a distributed clinical genomics NGS bioinformatics data solution at scale

  1. The bioinformatics analytical pipeline provides computational support for a clinical assay. Understand the assay's analytic needs, such as the relevant target gene regions and the variant types of interest with their specified limits of detection, which are foundational to pipeline development (see the target-region sketch after this list).
  2. The laboratory sample preparation process can present challenges to the downstream bioinformatics analyses. Upstream processes can introduce systematic errors or other artifacts into the sequencing data that require appropriate handling during analytical pipeline development.
  3. Some genomic regions present challenges to bioinformatics processing of short-read sequencing data. Low-complexity sequence, complex repeat patterns, genomic segmental duplications, and pseudogenes can all pose mappability challenges to reference-based bioinformatics pipelines.
  4. Variant calling algorithms can generate false positive calls, miss true variants (false negatives), and/or produce representations that require normalization for proper annotation. Careful optimization and familiarization with the overall pipeline during development is needed to understand the limits of its analytic performance (see the normalization sketch after this list).
  5. There are many third-party resources available for performing specialized functions within a bioinformatics pipeline, particularly for variant calling and annotation. Systematic review of candidate tools during design and development is critical to identify the best option for integration into the pipeline. Iterative rounds of optimization and familiarization to explore parameter settings will fine-tune pipeline performance ahead of validation.
  6. Automation is key for operational efficiency and standardization in a clinical laboratory setting. Consider data provenance and workflow version control around data upload, pipeline execution, and report generation to create the traceable audit trail that clinical implementation requires (see the provenance sketch after this list).
  7. Ensure that the data workflow and computational infrastructure handle clinical metadata and any PHI in compliance with HIPAA, GDPR, and other applicable health data standards and regulations. Distributed pipelines must comply with the laws governing each region or country in which they operate, at the local, state, and national levels.
  8. A thorough validation of analytic performance across the target variant types is necessary. Show that the complete end-to-end pipeline, run against the given laboratory processes, meets the clinical utility of the assay, and identify the limitations of the analytic methods for communication in the test report (see the performance-metrics sketch after this list).
  9. Develop a data management practice that adheres to your data ownership and control governance while balancing data availability against storage costs. Strike a balance between cheaper solutions for long-term archiving and investment in high-performance nearline storage for frequently accessed data (see the lifecycle-policy sketch after this list). Data aggregation solutions can help your organization leverage massive amounts of data to add value to clinical testing, for example by applying data science techniques to identify trends and patterns for QC analysis and more.
  10. At the center of every clinical operation are people: the patient, clinicians, molecular pathologists, genetic counselors, laboratory staff, operations leaders, and more. Consider your bioinformatics pipeline and computational infrastructure from each stakeholder's perspective to ensure that development matches the expected deliverables, from the clinical and QC reports to the user interface. Once live in production, provide each stakeholder with efficient resolution procedures for timely support.
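
To ground consideration 1, here is a minimal Python sketch of loading an assay's target regions from a BED file and summarizing the territory the pipeline must cover. The file name and three-column format are illustrative assumptions, not a prescribed standard.

```python
def load_targets(bed_path):
    """Parse target regions from a BED file (chrom, 0-based start, end)."""
    targets = []
    with open(bed_path) as fh:
        for line in fh:
            if not line.strip() or line.startswith(("#", "track", "browser")):
                continue  # skip blank lines, comments, and header lines
            chrom, start, end = line.rstrip("\n").split("\t")[:3]
            targets.append((chrom, int(start), int(end)))
    return targets

# Hypothetical usage: report how much territory the assay targets.
targets = load_targets("assay_targets.bed")
total_bp = sum(end - start for _, start, end in targets)
print(f"{len(targets)} regions covering {total_bp:,} bp")
```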
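For consideration 4, variant normalization begins with trimming shared bases so that indels are represented parsimoniously; full normalization also left-aligns indels against the reference genome, which this sketch deliberately omits. Production pipelines typically delegate normalization to an established tool such as bcftools norm; the function below is only a sketch of the trimming step.

```python
def trim_variant(pos, ref, alt):
    """Trim shared suffix then prefix bases from REF/ALT (parsimony only)."""
    # Trim common trailing bases, keeping at least one base per allele.
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    # Trim common leading bases, advancing the coordinate as we go.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt

# Example: a padded 1-bp deletion collapses to its minimal representation.
print(trim_variant(1000, "CTTT", "CTT"))  # -> (1000, 'CT', 'C')
```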
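For consideration 6, a provenance record captured before each pipeline run is one simple way to anchor an audit trail. A minimal sketch follows, assuming a hypothetical record layout; the field names and file paths are illustrative.

```python
import hashlib
import json
from datetime import datetime, timezone

def sha256(path):
    """Checksum an input file in 1 MB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def provenance_record(sample_id, input_paths, pipeline_version, reference_build):
    """Assemble an audit-trail entry for one pipeline execution."""
    return {
        "sample_id": sample_id,
        "inputs": {p: sha256(p) for p in input_paths},
        "pipeline_version": pipeline_version,  # e.g., a git tag or container digest
        "reference_build": reference_build,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }

# Hypothetical usage: persist the record alongside the run outputs.
record = provenance_record("S001", ["S001_R1.fastq.gz", "S001_R2.fastq.gz"],
                           "pipeline-v2.3.1", "GRCh38")
print(json.dumps(record, indent=2))
```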
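For consideration 8, validation typically compares pipeline calls against a truth set and reports standard performance metrics per variant type. The sketch below assumes that true positive, false positive, and false negative counts have already been tallied from such a comparison.

```python
def performance_metrics(tp, fp, fn):
    """Sensitivity (recall), PPV (precision), and F1 from confusion counts."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    ppv = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (2 * ppv * sensitivity / (ppv + sensitivity)
          if (ppv + sensitivity) else 0.0)
    return {"sensitivity": sensitivity, "ppv": ppv, "f1": f1}

# Hypothetical SNV tallies from a truth-set comparison.
print(performance_metrics(tp=4920, fp=35, fn=80))
# -> sensitivity 0.984, PPV ~0.993, F1 ~0.988
```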
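For consideration 9, object-store lifecycle rules are one common way to tier storage costs. As an illustration, the boto3 sketch below transitions run outputs to an archival storage class after a retention window; the bucket name, prefix, and 90-day threshold are assumptions to adapt to your own governance policy.

```python
import boto3  # AWS SDK for Python; assumes credentials are configured

s3 = boto3.client("s3")

# Hypothetical policy: keep run outputs on standard storage for 90 days
# (the frequently accessed window), then archive them to Glacier.
s3.put_bucket_lifecycle_configuration(
    Bucket="example-clinical-ngs-results",      # hypothetical bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-run-outputs",
                "Filter": {"Prefix": "runs/"},  # hypothetical prefix
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 90, "StorageClass": "GLACIER"},
                ],
            }
        ]
    },
)
```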


In short, designing and deploying a high-performance clinical bioinformatics ecosystem at scale, in a secure and compliant manner and without sacrificing analytical performance, is a challenging endeavor. It requires vertical integration of resources and expertise spanning bioinformatics, software engineering and development, and IT. The considerations outlined in this article aim to serve as a foundation for building a scalable bioinformatics solution for NGS analytical workflows, and they represent only a fraction of the full set of design considerations required by a mature clinical bioinformatics ecosystem. While every pipeline and platform differs in its opportunities for optimization during development and deployment, there are common themes in building successful clinical bioinformatics data solutions that deliver high-quality results at scale.

About BigHead Analytics Group

BigHead Analytics Group (BHA) provides consulting services to help clients build and scale their bioinformatics pipelines, whether for research or clinical applications, with a focus on quality. Our expertise includes industry experience in designing, developing, and validating clinical NGS bioinformatics pipelines at scale to support laboratory developed tests. Other core competencies include algorithm and tool development that leverages advanced machine learning and data science techniques to deliver data-driven solutions, as well as analytic platform offerings for automating the clinical bioinformatics development lifecycle. Learn more about how we make big data byte-sized. Connect with Elaine on LinkedIn.

About BlueBee

BlueBee offers secure, scalable, and globally available clinical bioinformatics solutions, from single-sample workflows through knowledge discovery and data management. BlueBee deploys fit-for-purpose data solutions that run on its foundational BlueBee Genomics Platform, used by diagnostic and research assay manufacturers, clinical testing laboratories, population-scale genomics initiatives, and biopharma and CROs. BlueBee partners with you and with experts in the field such as BigHead Analytics Group to define the optimal pipeline and deploy it as a complete data solution under a shared mission: to power Precision Medicine.

Interested in discussing your bioinformatics data workflow? Reach out to BigHead Analytics Group and BlueBee here.