Stage 1 pipeline
svCapture data analysis first entails execution of a pipeline with the following primary actions:
- align = align input fastq files to the reference genome
- collate = assemble read groups, make consensuses, and re-align to genome
- extract = scan name-sorted reads for anomalous molecules with alignment discontinuities
- find = scan anomalous molecules from one or more samples to make SV junction calls
- assemble = filter and tabulate SVs across a series of previously executed find actions
This stepwise implementation:
- allows users to perform alignment inside or outside of the svCapture pipeline
- supports simultaneous SV finding across multiple related samples
- permits code-sharing with other pipelines in the svx-mdi-tools code suite
As an example, a user might apply svCapture to multiple pre-existing bam files and then combine the extracted data in a single find operation, as sketched below.
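The following is a minimal sketch of that scenario, assuming extract has already been run once per sample into sub-folders of a shared project directory; all paths and names are placeholders, and other required options are omitted (list them with mdi svCapture find --help):

```sh
# hypothetical sketch: call SV junctions across two previously extracted
# samples in a single merged find operation; paths and names are placeholders
mdi svCapture find \
    --output-dir /path/to/project \
    --data-name  merged_samples
```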
Pipeline inputs
In addition to the genome files described under Installation, you need the following:
Unaligned or pre-aligned reads
Most users will provide two FASTQ format read files per sample, obtained from a paired-end, short-read sequencing platform. The paths to these read files are specified using options --input-dir and --input-name.
Alternatively, you can skip the align action and provide pre-aligned reads as a name-sorted bam file using option --bam-file.
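For example, a FASTQ-based run of the first action might be launched as follows. This is a hypothetical sketch: all paths and names are placeholders, the exact file-matching behavior of --input-name is not shown, and other required options are omitted (list them with mdi svCapture align --help).

```sh
# hypothetical sketch: point the align action at one sample's paired-end
# FASTQ files; remaining required options are omitted for brevity
mdi svCapture align \
    --input-dir  /path/to/fastq \
    --input-name my_sample \
    --output-dir /path/to/project \
    --data-name  my_sample
```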
Your capture target regions
svCapture requires that you provide a BED4-format file via option --targets-bed that lists all genome spans that were targeted - typically captured - for sequencing. These are used to categorize SV types, calculate coverage, etc.
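A BED4 file is tab-delimited with four columns per region: chromosome, 0-based start position, end position, and a region name. A minimal hypothetical example:

```
chr1	10000000	10100000	targetRegion1
chr7	55000000	55200000	targetRegion2
```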
OPTIONAL: Unique molecular identifiers
If your library strategy included unique molecular identifiers, you list and describe them using options --umi-file and --umi-skip-bases. The first column of each row in the UMI file should be the sequence of one known UMI value present at the end of a read sequence. At present, random UMIs are not supported.
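For illustration only, a UMI file might look like the following hypothetical sketch, with one known UMI sequence per row in the first column (any additional descriptive columns are omitted here):

```
ACGTTACC
TGCAATGG
GATCCGTA
```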
Parallel SV analysis across multiple samples
Actions align, collate, and extract are executed per sample. The find action can also be executed per sample, but it is often helpful to merge extracted anomalous reads so that SVs are discovered using combined information from more than one sample, either to gain more evidence for the existence of an SV or to discover that a candidate SV is not unique to one sample.
The two modes of SV finding are communicated by how you set options --output-dir and --data-name. If the output directory already contains extracted data files, a single-sample find is executed. If the directory contains a set of sub-folders, each with extracted read files, a merged-sample find is executed.
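One way to picture the distinction, using a purely hypothetical layout in which the actual folder and file names will differ:

```
# single-sample find: the examined output directory directly holds one
# sample's extracted data files
/path/to/project/sample1/      <- contains sample1's extracted files

# merged-sample find: the examined directory holds one sub-folder of
# extracted files per sample
/path/to/project/              <- contains sub-folders sample1/ and sample2/
```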
Finally, the assemble action is used for the highest-order data integration across many samples from many find actions. Typically, find is applied to samples that were sequenced together in a batch, whereas assemble allows different experimental batches to be plotted together.
Pipeline outputs
The svCapture pipeline performs extensive grouping and consensus making to yield sets of output molecules that are deemed likely to correspond to single, independent source DNA molecules.
The most important pipeline outputs are the lists of characterized SV junction calls (a brief inspection sketch follows these lists):
- an R-compatible RDS file included in a data package for use in the svCapture app, *.find.structural_variants.rds
- a gzipped flat file, *.find.structural_variants.gz
- a VCF format file, *.find.structural_variants.vcf.bgz, for use with other SV analysis tools
Additional important output files are:
- the app data packages for interactive visualization, *.mdi.package.zip
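As a quick check of the tabular outputs, the flat file and VCF can be inspected with standard command-line tools; the file names below are hypothetical placeholders:

```sh
# peek at the gzipped flat file of SV junction calls
zcat my_sample.find.structural_variants.gz | head

# view the bgzip-compressed VCF with bcftools (or any VCF-aware tool)
bcftools view my_sample.find.structural_variants.vcf.bgz | head
```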
Additional pipeline options
Other options for the different pipeline actions can be listed as follows:
mdi svCapture <action> --help