Table Formats

The following are detailed descriptions of the table formats used to encode amplicon alignments and SVs. Columns are ordered for ease of understanding the information they encode; the actual order of the columns in R data objects or text files may differ. Tables are produced as data.tables in R and saved as RDS files.

Primers

The primers table has exactly two rows with metadata about the two primers inferred to have generated the read data. If the two primers form a forward/reverse (F/R) pair relative to the genome, the forward primer is always primer 1 and the reverse primer is primer 2. Otherwise, primers are sorted by position along the concatenated genome.

column	data_type	description
primer	integer	the primer number, either 1 or 2; primer 1 defines side 1 (e.g., chrom1) in other tables
node	integer64	the 64-bit signed integer identifying the genome position and strand of the 5’-most primer base
chrom	character	the chromosome name
chromIndex	integer	the chromosome index
refPos	integer	the unsigned reference position on `chrom`
strand	character	the strand of the primer, either “+” or “-”
sequence	character	the genome reference sequence starting from (if `strand` == “+”) or ending at (if `strand` == “-”) `refPos`
rcSequence	character	reverse complement of `sequence`
seqLength	integer	length of `sequence`, equivalent to the longest amplicon sequence from refPos
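
The relationship between `sequence`, `rcSequence`, and `seqLength` can be illustrated with a minimal sketch. The helper name `reverse_complement` and the example primer row are hypothetical and are not part of svDJ; only the column names come from the table above.

```python
# Hypothetical helper for illustration; not part of the svDJ code base.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(sequence: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T, N)."""
    return sequence.translate(_COMPLEMENT)[::-1]


# A made-up primer row that uses the column names from the primers table.
primer = {"primer": 1, "strand": "+", "sequence": "AACGTG"}
primer["rcSequence"] = reverse_complement(primer["sequence"])
primer["seqLength"] = len(primer["sequence"])
```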

Edges

The edges table has one row per edge, i.e., a connection between two nodes, with an odd number of edge rows per sequenced read segment, in format A-[J-A…]-A. The columns are listed here as found in the final edges table, i.e., after splitting of reads into segments and other filtering actions. The edges table carries the most comprehensive information about each read and is accordingly the most complex table.

column	data_type	description
source identifiers
sample	character	the name of the source sample, obtained from option `--data-name`
channel	integer	the nanopore channel on the ONT flow cell where the read was obtained
pod5File	character	path to the pod5/fast5 file that carries the nanopore sampling data
qName	character	name of the read as assigned by nanopore code
segmentName	character	extension of `qName` with `segmentN` to provide a unique identifier after chimeras are split
edge numbering
readI	integer	unique number of the read over all sample data
segmentN	integer	sequential unique number of the segment within a given read
blockN	integer	sequential unique number of a contiguity block within a given segment; blocks may be disrupted by low quality, unused, and untrusted junctions
edgeN	integer	sequential unique number of this edge within a given segment
reference alignment
qStart	integer	start position of the edge in the original read, i.e., query sequence
qEnd	integer	end position of the edge in the original read
node1	integer64	stranded start position of the edge in the reference genome; negative values are on the bottom strand
node2	integer64	stranded end position of the edge in the reference genome
cigar	character	for alignments, the CIGAR string for the alignment of the query read to the reference genome
edge metadata
isCanonical	logical	TRUE if the read corresponds to the canonical orientation of a junction
edgeType	character	the kind of edge this is, one of A:alignment, D:deletion, U:duplication, V:inversion, T:translocation (all values other than “A” are junction edges)
eventSize	integer	the length of an alignment on the reference genome, or the size of the SV associated with a junction
insertSize	integer	if positive, the number of non-reference bases found at the junction; if negative, the number of bases of microhomology; if 0, it is a blunt joint
jxnSeq	character	the base sequence of the inserted or microhomologous bases
cJxnSeq	character	the same as `jxnSeq`, but reverse-complemented if `isCanonical` is FALSE
quality metrics
mapQ	integer	alignment MAPQ, i.e., mapping quality, as determined by the aligner; for junctions, the minimum value of the two flanking alignments
gapCompressedIdentity	double	the fraction of reference bases that matched the query as an indicator of sequence quality; for junctions, the minimum value of the two flanking alignments
baseQual	double	the average base quality score over all bases in the edge
alnBaseQual	double	for junctions, the minimum `baseQual` of the two flanking alignments
alnSize	integer	for junctions, the minimum `eventSize` of the two flanking alignments
passedBandwidth	logical	whether a series of junctions including this junction displaced the continuity of the read on the reference genome by more than `--min-sv-size`
read summaries
nTotalJunctions	integer	the number of junctions found in the read containing this edge
nKeptJunctions	integer	the number of high quality junctions in the read containing this edge that were used for junction analysis
nKeptSegments	integer	the number of segments contributed by this read after adapter splitting and quality filtering
junction comparison
jxnKey	character	a unique identifier for this distinct junction sequence, shared with all other junction edges with the same sequence
nSegments	integer	the number of edges with the same `jxnKey`
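
The A-[J-A…]-A layout described for the edges table can be checked with a short sketch. It assumes the table has already been exported from R into a list of plain dicts; the function names and the example rows are hypothetical, and only the column names `segmentName`, `edgeN`, and `edgeType` come from the table above.

```python
from collections import defaultdict


def edge_types_by_segment(edges):
    """Collect edgeType values per segmentName, ordered by edgeN."""
    segments = defaultdict(list)
    for edge in sorted(edges, key=lambda e: (e["segmentName"], e["edgeN"])):
        segments[edge["segmentName"]].append(edge["edgeType"])
    return dict(segments)


def follows_edge_pattern(edge_types):
    """True if the edge types alternate alignment/junction as A-[J-A...]-A."""
    if len(edge_types) % 2 == 0:
        return False  # a segment must carry an odd number of edges
    for i, edge_type in enumerate(edge_types):
        # Even positions (0-based) must be alignments, odd positions junctions.
        if (edge_type == "A") != (i % 2 == 0):
            return False
    return True


# Made-up rows for illustration only.
edges = [
    {"segmentName": "read1-1", "edgeN": 1, "edgeType": "A"},
    {"segmentName": "read1-1", "edgeN": 2, "edgeType": "D"},
    {"segmentName": "read1-1", "edgeN": 3, "edgeType": "A"},
    {"segmentName": "read2-1", "edgeN": 1, "edgeType": "A"},
]
pattern_ok = {
    name: follows_edge_pattern(types)
    for name, types in edge_types_by_segment(edges).items()
}
```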

Edge columns clip5/3, score5/3, nBases5/3, start5/3, end5/3, and hasAdapter5/3 are used in the svPore pipeline for finding adapter sequences in reads. They should be ignored in svDJ tables. Edge columns node1/2IsPrimer and isChimeric were used for filtering the edges table prior to the final output. They have constant values and can also be ignored. Finally, column nStrands is not consistently used by svDJ and should be ignored.
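
As a worked illustration of the `insertSize` sign convention used in the edges table, the following sketch turns a value into a readable label. The function name `describe_insert_size` is hypothetical and not part of svDJ.

```python
def describe_insert_size(insert_size: int) -> str:
    """Translate an insertSize value into a human-readable junction label."""
    if insert_size > 0:
        # Positive values count non-reference bases found at the junction.
        return f"insertion of {insert_size} non-reference base(s)"
    if insert_size < 0:
        # Negative values count bases of microhomology.
        return f"microhomology of {-insert_size} base(s)"
    return "blunt joint"


labels = [describe_insert_size(n) for n in (3, -2, 0)]
```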

Junctions

The junctions table has one row for every distinct junction sequence found in the edges table, after applying junction quality filters. Junction edges aggregated into the same junction row did not necessarily have the same sequence throughout the entire read (where base differences are most likely errors), but did have the same sequence at the junction itself, as defined by the combination of node1, node2, insertSize and, if insertSize > 0, the inserted bases (jxnSeq).

column	data_type	description
junction identifiers
jxnKey	character	a unique identifier for this distinct junction sequence, matching the same column in `edges`
junction metadata
nMatchingSegments	integer	the number of read segments that contained this junction
nCanonical	integer	`nMatchingSegments` where `isCanonical` is TRUE
nNonCanonical	integer	`nMatchingSegments` where `isCanonical` is FALSE
junction structure
node1	integer64	stranded start position of the junction in the reference genome; negative values are on the bottom strand
node2	integer64	stranded end position of the junction in the reference genome
insertSize	integer	if positive, the number of non-reference bases found at the junction; if negative, the number of bases of microhomology; if 0, it is a blunt joint
jxnSeq	character	the base sequence of the inserted or microhomologous bases, on the canonical strand
chrom1	character	the chromosome name on side 1 of the junction
chromIndex1	integer	the chromosome index on side 1 of the junction
refPos1	integer	the unsigned reference position on `chrom1`
strand1	character	the strand on side 1 of the junction, either “+” or “-”
chrom2	character	the chromosome name on side 2 of the junction
chromIndex2	integer	the chromosome index on side 2 of the junction
refPos2	integer	the unsigned reference position on `chrom2`
strand2	character	the strand on side 2 of the junction, either “+” or “-”
fakeSeq	character	an idealized sequence for this read where flanking alignments are replaced with reference genome bases but the junction itself is as it was sequenced
fakeLength	integer	number of bases in fakeSeq, best estimate of the original source molecule length
network membership
junctionI	integer	sequential number of this junction, sorted by decreasing value of `nMatchingSegments`
parentJunctionI	integer	`junctionI` of the parent junction of the network to which this junction belongs; NA for low frequency junctions not added to any network
parentDistance	integer	the number of edits from this junction to the parent junction of the network; NA for low frequency junctions not added to any network
junctionOnParent	character	map of this junction sequence onto its network parent, a string of same length as parent `fakeSeq`
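
The counting columns of the junctions table can be approximated from junction edges with a sketch like the one below. It is not the svDJ implementation: it assumes edges are available as plain dicts, counts one matching segment per junction edge row, numbers `junctionI` from 1, and breaks ties by first appearance. The function name `summarize_junctions` and the example rows are hypothetical.

```python
from collections import defaultdict


def summarize_junctions(edges):
    """Count junction edges per jxnKey and rank them by decreasing coverage."""
    counts = defaultdict(
        lambda: {"nMatchingSegments": 0, "nCanonical": 0, "nNonCanonical": 0}
    )
    for edge in edges:
        if edge["edgeType"] == "A":
            continue  # alignment edges are not junctions
        row = counts[edge["jxnKey"]]
        row["nMatchingSegments"] += 1
        if edge["isCanonical"]:
            row["nCanonical"] += 1
        else:
            row["nNonCanonical"] += 1
    # Assumption: junctionI starts at 1; ties keep first-seen order (stable sort).
    ranked = sorted(counts.items(), key=lambda kv: -kv[1]["nMatchingSegments"])
    return [
        {"junctionI": i, "jxnKey": key, **row}
        for i, (key, row) in enumerate(ranked, start=1)
    ]


# Made-up junction and alignment edges for illustration only.
edges = [
    {"edgeType": "D", "jxnKey": "k1", "isCanonical": True},
    {"edgeType": "A", "jxnKey": None, "isCanonical": None},
    {"edgeType": "T", "jxnKey": "k2", "isCanonical": True},
    {"edgeType": "T", "jxnKey": "k2", "isCanonical": False},
    {"edgeType": "T", "jxnKey": "k2", "isCanonical": True},
]
junctions = summarize_junctions(edges)
```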

Networks

The networks table further aggregates the junctions table after the construction of junction networks. The junction rows aggregated into a given network are inferred to have arisen from the same source molecule but diverged due to PCR or sequencing errors. There may be some “collisions” where two independent source molecules fortuitously had a small edit distance and were inappropriately placed into the same network. This outcome is infrequent due to the consideration of junction coverage during network analysis, but when it happens, the networks table will underestimate the true molecular complexity of the source sample.

However, especially in the case of nanopore sequencing with its higher error rate, it is also likely that some reads diverged from their parent source molecules because they acquired multiple errors. Such reads fail to be merged into their true network, falsely increasing the apparent molecular complexity of the source sample. Importantly, these often result in junction networks with low coverage that can be filtered against. They also likely arise from lower quality reads, and can be filtered against by option --min-alignment-identity, which demands that the alignments flanking a junction had a high percentage of bases that matched the reference genome (implying that the junction sequence itself was also high quality).

It is up to the user to decide whether junctions or networks give the most reliable quantification of the data, whether to trust low-coverage networks, how aggressively to set quality filters, etc. In general, we think the network analysis is robust such that the networks table provides the fairest assessment of detected junctions and their relative abundance.
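
One way to explore the choices above is to drop low-coverage networks and recompute relative abundance over what remains. The sketch below assumes the networks table has been exported to a list of dicts; the function name `filter_networks`, the threshold `min_segments`, the added `fraction` field, and the example rows are all hypothetical and do not correspond to any svDJ option.

```python
def filter_networks(networks, min_segments=3):
    """Keep networks with enough coverage and add their relative abundance."""
    kept = [n for n in networks if n["nMatchingSegments"] >= min_segments]
    total = sum(n["nMatchingSegments"] for n in kept)
    # 'fraction' is an illustrative field, not a column of the networks table.
    return [{**n, "fraction": n["nMatchingSegments"] / total} for n in kept]


# Made-up network rows for illustration only.
networks = [
    {"networkKey": "a", "nMatchingSegments": 15},
    {"networkKey": "b", "nMatchingSegments": 5},
    {"networkKey": "c", "nMatchingSegments": 1},
]
kept = filter_networks(networks, min_segments=3)
```

Raising `min_segments` trades sensitivity for confidence that each remaining network reflects a real source molecule.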

column	data_type	description
networkKey	character	the `jxnKey` of the parent junction of the network
parentJunctionI	integer	the `junctionI` of the parent junction of the network
maxDistance	integer	max(`parentDistance`) over all junctions in the network
nMatchingJunctions	integer	number of `junctions` rows aggregated into this network
parentNMatchingSegments	integer	`nMatchingSegments` of the parent junction
nextNMatchingSegments	integer	`nMatchingSegments` of the first non-parent junction added to the network
nMatchingSegments	integer	sum(`nMatchingSegments`) over all contributing junctions
nCanonical to fakeLength	various	as defined for the `junctions` table
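
The aggregate columns above can be sketched as a grouping of junction rows by `parentJunctionI`. This is an illustration, not the svDJ implementation. It assumes that a parent junction lists itself as its own parent with a `parentDistance` of 0, and that NA values are represented as `None`. `nextNMatchingSegments` is omitted because it depends on the order in which junctions were added to the network, which cannot be recovered from the junctions table alone. The function name `summarize_networks` and the example rows are hypothetical.

```python
from collections import defaultdict


def summarize_networks(junctions):
    """Aggregate junction rows into one summary row per network."""
    by_parent = defaultdict(list)
    for junction in junctions:
        if junction["parentJunctionI"] is None:
            continue  # low-frequency junction not added to any network
        by_parent[junction["parentJunctionI"]].append(junction)

    networks = []
    for parent_i, members in sorted(by_parent.items()):
        # Assumption: the parent row is itself a member of its network.
        parent = next(m for m in members if m["junctionI"] == parent_i)
        networks.append({
            "networkKey": parent["jxnKey"],
            "parentJunctionI": parent_i,
            "maxDistance": max(m["parentDistance"] for m in members),
            "nMatchingJunctions": len(members),
            "parentNMatchingSegments": parent["nMatchingSegments"],
            "nMatchingSegments": sum(m["nMatchingSegments"] for m in members),
        })
    return networks


# Made-up junction rows for illustration only.
junctions = [
    {"junctionI": 1, "jxnKey": "a", "nMatchingSegments": 10,
     "parentJunctionI": 1, "parentDistance": 0},
    {"junctionI": 2, "jxnKey": "b", "nMatchingSegments": 4,
     "parentJunctionI": 1, "parentDistance": 2},
    {"junctionI": 3, "jxnKey": "c", "nMatchingSegments": 1,
     "parentJunctionI": None, "parentDistance": None},
    {"junctionI": 4, "jxnKey": "d", "nMatchingSegments": 1,
     "parentJunctionI": 1, "parentDistance": 1},
]
networks = summarize_networks(junctions)
```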