Overcoming High Sequence Complexity with the PacBio De Novo Assembly Pipeline, on the BlueBee Platform


Complex elements (repetitive regions, copy number variants, or structural variations) are challenging, if not impossible, to sequence using short-read technologies. PacBio’s long-read technology, based on Single Molecule, Real-Time (SMRT) sequencing, delivers long reads with uniform coverage, allowing comprehensive de novo genome assemblies that overcome the challenges short-read sequencing faces in complex DNA regions. Interpreting this type of sequencing result requires powerful compute resources.

The BlueBee de Novo Assembly Pipeline follows the PacBio HGAP framework and resolves sequence complexity in three steps: pre-assembly, in which shorter reads are aligned to the longest seed reads to create preads (pre-assembled reads); assembly, in which preads are aligned to each other to create a draft genome assembly; and polishing, in which a consensus sequence is generated from the sequencing results.
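To make the pre-assembly step more concrete, here is a minimal sketch of how reads might be split into long "seed" reads and the shorter reads that get aligned to them. It is an illustration only, not the BlueBee or PacBio implementation: the function name `split_seeds` and the parameters `genome_size` and `seed_coverage` are assumptions introduced for this example, and the alignment and consensus steps themselves are not shown.

```python
from typing import Dict, List, Tuple


def split_seeds(
    read_lengths: Dict[str, int],
    genome_size: int,
    seed_coverage: float,
) -> Tuple[List[str], List[str]]:
    """Split reads into seed reads and shorter reads (hypothetical helper).

    The longest reads are taken as seeds until their combined length
    reaches genome_size * seed_coverage bases. Every remaining read is
    treated as a shorter read to be aligned to the seeds.
    """
    target_bases = genome_size * seed_coverage

    # Longest reads first; ties are broken by name so the result is stable.
    ordered = sorted(read_lengths, key=lambda name: (-read_lengths[name], name))

    seeds: List[str] = []
    total = 0
    for name in ordered:
        if total >= target_bases:
            break
        seeds.append(name)
        total += read_lengths[name]

    shorter = ordered[len(seeds):]
    return seeds, shorter


if __name__ == "__main__":
    # Toy read lengths, far smaller than real PacBio subreads.
    reads = {"r1": 1000, "r2": 800, "r3": 300, "r4": 200}
    seeds, shorter = split_seeds(reads, genome_size=900, seed_coverage=2.0)
    print("seeds:", seeds)
    print("shorter reads:", shorter)
```

In this toy run the two longest reads together reach the target of 1,800 bases, so they become the seeds and the other two reads are left to be aligned against them.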

This new pipeline, released in October 2017, supports researchers in their efforts to understand high sequence complexity, which until now has been a significant challenge for both research applications and disease diagnosis.

Data input

The pipeline accepts one or more unmapped PacBio subreads BAM files belonging to a single sample.
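Before uploading several BAM files, it can be useful to confirm that they really do belong to one sample. The sketch below is one possible check, assuming the sample name is recorded in the `SM` field of the `@RG` header lines, as the SAM/BAM format allows. The helper names `sample_names` and `same_sample` are hypothetical, and this is not a description of any validation the pipeline itself performs. The header text can be obtained, for example, with `samtools view -H`.

```python
from typing import Iterable, Set


def sample_names(header_text: str) -> Set[str]:
    """Collect the SM (sample) values from the @RG lines of a SAM/BAM header."""
    names: Set[str] = set()
    for line in header_text.splitlines():
        if not line.startswith("@RG"):
            continue
        # Header fields are tab-separated TAG:VALUE pairs.
        for field in line.split("\t")[1:]:
            if field.startswith("SM:"):
                names.add(field[3:])
    return names


def same_sample(headers: Iterable[str]) -> bool:
    """Return True if all headers together name exactly one sample."""
    all_names: Set[str] = set()
    for header_text in headers:
        all_names |= sample_names(header_text)
    return len(all_names) == 1


if __name__ == "__main__":
    # Toy headers standing in for the output of `samtools view -H`.
    h1 = "@HD\tVN:1.5\tSO:unknown\n@RG\tID:a1\tPL:PACBIO\tSM:sampleA\n"
    h2 = "@HD\tVN:1.5\tSO:unknown\n@RG\tID:b2\tPL:PACBIO\tSM:sampleA\n"
    print(same_sample([h1, h2]))
```

Headers without any `SM` field are treated as inconclusive here, so the check returns False rather than guessing.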

Output files generated via the BlueBee de Novo Assembly Pipeline:

  • pre-assembly process statistics (.tsv)
  • draft assembly read reports (.tsv, .png)
  • draft assembly coverage (.tsv, .png)
  • polished assembly reports (.png, .tsv)
  • list of homozygous variants (.gff/.bed/.vcf)
  • draft assembly (.fasta) and final genome assembly (.fasta/.fastq).