svDJ
The svDJ pipeline and app analyzes long-read amplicon sequencing libraries for structural variant (SV) junctions relative to a reference genome.
Early pipeline actions support file conversion and offline basecalling of Oxford Nanopore long-read data.
Later actions support alignment to a reference genome and SV junction discovery, collation, and inter-molecule comparison. These actions might reasonably be applied to any long read data with minor modifications, but at present have only been applied to nanopore data.
Expectations
svDJ assumes that each input library of molecules represents amplification of a single amplicon family subjected to adapter ligation and sequencing. A set of such libraries may be barcoded and sequenced together on a single flow cell, but upstream processing (e.g., by MinKNOW) must have already demultiplexed the libraries into separate file sets. Thus, there should be separate folders that each hold one or more sequence files with reads from one or more PCR reactions performed with one forward and one reverse primer. It is acceptable, in fact expected, that those primers will have amplified more than more one type of source molecule. In the future, it could be possible to have mixed amplicons, i.e., primer sets, but this is not presently supported.
As its name implies, svDJ was developed to analyze V(D)J recombination junctions. However, it could be applied equally well to any amplicon configuration that matches the expectations described above.
Sequence read configurations
The following diagram illustrates the typical expected molecule structure:
Single-junction reads
outerNode1 junctionNode1 junctionNode2 outerNode2 | | | | *=========>-------------*//*-------------<=========* primer1 junction primer2
However, aberrant and more complex molecules are also invariably present, such as:
Multi-junction reads, such as might arise by novel biological insertions of other genomic segments, or by PCR artifact.
outerNode1 junctionNode1 junctionNode2 junctionNode3 junctionNode4 outerNode2 | | | | | | *=========>-------------*//*-------------------------*//*-------------<=========*
Chimeric amplicon fusions, resulting from either intermolecular ligation or miscalling of two DNA molecules in a single read:
outerNode1 junctionNode1 junctionNode2 falseNode1 falseNode2 junctionNode3 junctionNode4 outerNode2 | | | | | | | | *=========>-------------*//*-------------<=========*//*=========>-------------*//*-------------<=========*
Duplex reads, a special case of chimeric amplicon fusion in which the two strands of a single source molecule were both sequenced and fused (note the inverted repetition of the same junction nodes):
outerNode1 junctionNode1 junctionNode2 falseNode1 falseNode2 junctionNode2 junctionNode1 outerNode2 | | | | | | | | *=========>-------------*//*-------------<=========*//*=========>-------------*//*-------------<=========*
Truncated reads, resulting from strand breaks, etc.:
outerNode1 junctionNode1 junctionNode2 outerNode2 | | | | *=========>-------------*//*-------------<=*
svDJ seeks to:
- keep and characterize single-junction and multi-junction reads
- split chimeric amplicon fusions into their respective parts, which are each kept and analyzed as indepent molecules
- keep just one half of duplex reads, to prevent double-counting of non-independent source molecules
- discard truncated and off-target reads as untrustworthy
Read processing
To facilitate analysis and characterization of all molecules types, svDJ uses a node-edge graph structure. Each aligned segment of a read has two node positions, one at each end (see above). These nodes define two kinds of edges between them:
- alignments (A), the expected reference genome contiguity found by the aligner
- junctions (J), the unexpected fusion of two distant genomic positions, called as a split read by the aligner
Alignments have properties such as the CIGAR strings that describe their exact base matches to the reference genome. Junctions have properties such as the nature of any microhomologies and inserted bases.
During processing by action extract
, each read is converted to a set of rows in a table of edges
, as follows:
readI | edgeN | edgeType | node1 | node2 | additional columns |
---|---|---|---|---|---|
1 | 1 | A | 12 | 34 | alignment properties |
1 | 2 | J | 34 | -78 | junction properties |
1 | 3 | A | -78 | -56 | etc. |
2 | 1 | A | 56 | 78 | etc. |
2 | 2 | J | 78 | -34 | etc. |
2 | 3 | A | -34 | -12 | etc. |
3 | 1 | etc. |
where:
- a read always has an odd number of edges of the form A-J[-A-J…]-A
- node positions are 64-bit base coordinates along the entire concatenated genome
- negative node positions denote reads corresponding to the bottom reference strand
- most reads are expected to have one of the two possible strandedness configurations of the same outer node positions, as fixed by the amplification primers (shown in bold)
After extracting the complete edges table, svDJ parse
:
- performs ab initio discovery of the two primer outer node positions
- splits chimeric reads at internal junctions that match abutted primer nodes, to yield one or more
segments
- discards segments where the outer nodes do not match the expected primer1/primer2 amplicon
- discards one split segment when a duplex chimeric read repeats the same SV junction(s) in opposite orientation
- performs exact junction matching between segments to highlight biological or technical replicates
- assembles unique junction sequences into networks of similar junction sequences (see below)
Matching nodes and junctions
Importantly, sequencing errors lead to imprecision in the assigned values of node positions. svDJ therefore sometimes uses an adjancency tolerance when matching nodes. For example, molecules with outer nodes 12/-56 and 12/-57 would both be considered a match to primer nodes 12/-56.
In its first layer of junction matching, parse
uses exact matching based on the actual node positions and any novel inserted bases. This yields the following table of junctions
.
jxnKey | nMatchingSegments | node1 | node2 | insertSize | jxnSeq | etc. |
---|---|---|---|---|---|---|
xxx | 23 | 34 | -78 | 4 | CGTA | |
yyy | 1 | 33 | -78 | 5 | ACGTA | |
etc. |
However, this might undercall true replicates that diverged due to PCR or sequencing errors and are thus still found in different junction rows. The following examples illustrate some of the ways that even a single basecalling error can lead to a substantial change in the apparent junction call:
correct junction, 1-base microhomology: alignment 1: -----* read sequence: ~~~~~A~~~~~ alignment 2: *----- junction with A to T base error, called as 1-base insertion: alignment 1: ----* read sequence: ~~~~~t~~~~~ alignment 2: *----
correct junction, 3-base insertion: alignment 1: ----* read sequence: ~~~ATacg~~~~~ alignment 2: *---- junction with missing base "A", called as 4-base insertion: alignment 1: --* read sequence: ~~~Tacg~~~~~ alignment 2: *----
To help match junctions impacted by these factors, a second round of matching is performed based on the edit distance between junctions, irrespective of the node positions and insert size. The logic is that highly abundant junctions (based on exact matching) are likely to have spawned some reads that diverged from the parent molecule due to a method error at the junction.
The parse
action builds adjacency networks by analyzing a graph where all unique junctions above a user-defined coverage threshold are the nodes and edge weights are the edit distance between them. The graph is directed, such that lower frequency junctions point toward higher frequency junctions.
The most abundant junction is declared as the parent node of a first subgraph expected to have less frequent derivative junctions that arose by error processes. Each less frequent junction node is then analyzed to assess whether it is best modeled as a new parent node representing a distinct junction or as a child, i.e., derivative, of an existing parent node. This decision is controlled by user parameters of the maximum edit distance from a parent to a child and how much to weight the junction coverage. Larger edit distances and higher junction coverage favor calling a query node as the parent of a new junction subgraph.
Once all candidate junction nodes have been evaluated, parse
can continue on, if instructed, to add the lowest coverage nodes to existing junction networks, but they will not create new junction calls. Because this can entail processing large numbers of singleton sequences, the default behavior is to skip this process as it is slow and often has little impact on the final result.
Importantly, only the bases at the junction are used when comparing junctions to each other. Base differences in the amplicon “stems” on each side of the junction are assumed to be sequencing errors and are disregarded by assembling a “fake” representative sequence for each read segment, where the flanks are replaced with the corresponding amount of reference genome sequence (i.e., from outer primer node to the junction node).
Also important is the fact that a given read might correspond to either strand of a source DNA amplicon molecule, which must be accounted for during junction matching. svDJ achieves this by defining a canonical orientation for every junction, i.e., for every possible combination of node pairs. A read segment is on the canonical strand if it begins at the primer1 node, as defined by the primers table. Noncanonical segments are reverse complemented prior to junction comparison.
The resulting final level of aggregation is thus the table of junction networks
, as follows. All tables (primers, edges, junctions, and networks) are returned as RDS files for further exploration.
networkKey | nMatchingJunctions | nMatchingSegments | node1 | node2 | insertSize | jxnSeq | etc. |
---|---|---|---|---|---|---|---|
xxx | 2 | 24 | 34 | -78 | 4 | CGTA | |
etc. |
</body> </html>