
Table Formats

The following are detailed descriptions of the table formats used to encode amplicon alignments and SVs. Columns are ordered for ease of understanding the information they encode; the actual order of the columns in R data objects or text files may differ. Tables are produced as data.tables in R and saved as RDS files.

Primers

The primers table has exactly two rows with metadata about the two primers inferred to have generated the read data. If the two primers form an F/R pair relative to the genome, the forward (F) primer is always primer 1 and the reverse (R) primer is primer 2. Otherwise, primers are sorted according to position along the concatenated genome.

column data_type description
primer integer the primer number, either 1 or 2; primer 1 defines side 1 (e.g., chrom1) in other tables
node integer64 the 64-bit signed integer identifying the genome position and strand of the 5’-most primer base
chrom character the chromosome name
chromIndex integer the chromosome index
refPos integer the unsigned reference position on chrom
strand character the strand of the primer, either “+” or “-”
sequence character the genome reference sequence starting from (if strand == “+”) or ending at (if strand == “-”) refPos
rcSequence character reverse complement of sequence
seqLength integer length of sequence, equivalent to the longest amplicon sequence from refPos
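
As a minimal sketch of how rcSequence and seqLength relate to sequence, the Python snippet below derives both from a primer row held as a plain dict. The row values and the helper name `reverse_complement` are hypothetical and only illustrate the relationship between the columns; this is not the pipeline's own code.

```python
# Complement map for A/C/G/T plus N (the alphabet is an assumption of this sketch).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(sequence: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return sequence.translate(_COMPLEMENT)[::-1]


# Hypothetical primer row; the values are invented for illustration only.
primer = {"primer": 1, "strand": "+", "sequence": "AACGT"}
primer["rcSequence"] = reverse_complement(primer["sequence"])
primer["seqLength"] = len(primer["sequence"])
```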

Edges

The edges table has one row per edge, i.e., a connection between two nodes, with an odd number of edge rows per sequenced read segment, in format A-[J-A…]-A. The columns are listed here as found in the final edges table, i.e., after splitting of reads into segments and other filtering actions. The edges table carries the most comprehensive information about each read and is accordingly the most complex table.
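
The A-[J-A…]-A layout can be checked with a short sketch like the one below. It assumes the pattern means that alignment edges (edgeType "A") and junction edges (any other edgeType) strictly alternate, and that a segment begins and ends with an alignment. The function name `is_valid_segment` is hypothetical.

```python
def is_valid_segment(edge_types):
    """Check that a segment's edge types alternate A, junction, A, ..., A."""
    # An even count (including zero edges) cannot start and end with an alignment.
    if len(edge_types) % 2 == 0:
        return False
    for i, edge_type in enumerate(edge_types):
        expect_alignment = i % 2 == 0  # even positions hold alignments
        if (edge_type == "A") != expect_alignment:
            return False
    return True


# Edge types of one hypothetical segment with a deletion and a translocation.
example_segment = ["A", "D", "A", "T", "A"]
segment_ok = is_valid_segment(example_segment)
```

A segment without any junction is a single alignment edge, which also satisfies the odd-count rule.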

column data_type description
source identifiers    
sample character the name of the source sample, obtained from option --data-name
channel integer the nanopore channel on the ONT flow cell where the read was obtained
pod5File character path to the pod5/fast5 file that carries the nanopore sampling data
qName character name of the read as assigned by nanopore code
segmentName character extension of qName with segmentN to provide a unique identifier after chimeras are split
edge numbering    
readI integer unique number of the read over all sample data
segmentN integer sequential unique number of the segment within a given read
blockN integer sequential unique number of a contiguity block within a given segment; blocks may be disrupted by low quality, unused, and untrusted junctions
edgeN integer sequential unique number of this edge within a given segment
reference alignment    
qStart integer start position of the edge in the original read, i.e., query sequence
qEnd integer end position of the edge in the original read
node1 integer64 stranded start position of the edge in the reference genome; negative values are on the bottom strand
node2 integer64 stranded end position of the edge in the reference genome
cigar character for alignments, the CIGAR string for the alignment of the query read to the reference genome
edge metadata    
isCanonical logical TRUE if the read corresponds to the canonical orientation of a junction
edgeType character the kind of edge this is, one of A:alignment, D:deletion, U:duplication, V:inversion, T:translocation (all values other than “A” are junction edges)
eventSize integer the length of an alignment on the reference genome, or the size of the SV associated with a junction
insertSize integer if positive, the number of non-reference bases found at the junction; if negative, the number of bases of microhomology; if 0, it is a blunt joint
jxnSeq character the base sequence of the inserted or microhomologous bases
cJxnSeq character the same as jxnSeq, but reverse-complemented if isCanonical is FALSE
quality metrics    
mapQ integer alignment MAPQ, i.e., mapping quality, as determined by the aligner; for junctions, the minimum value of the two flanking alignments
gapCompressedIdentity double the fraction of reference bases that matched the query as an indicator of sequence quality; for junctions, the minimum value of the two flanking alignments
baseQual double the average base quality score over all bases in the edge
alnBaseQual double for junctions, the minimum baseQual of the two flanking alignments
alnSize integer for junctions, the minimum eventSize of the two flanking alignments
passedBandwidth logical whether a series of junctions including this junction displaced the continuity of the read on the reference genome by more than --min-sv-size
read summaries    
nTotalJunctions integer the number of junctions found in the read containing this edge
nKeptJunctions integer the number of high quality junctions in the read containing this edge that were used for junction analysis
nKeptSegments integer the number of segments contributed by this read after adapter splitting and quality filtering
junction comparison    
jxnKey character a unique identifier for this distinct junction sequence, shared with all other junction edges with the same sequence
nSegments integer the number of edges with the same jxnKey

Edge columns clip5/3, score5/3, nBases5/3, start5/3, end5/3, and hasAdapter5/3 are used in the svPore pipeline for finding adapter sequences in reads. They should be ignored in svDJ tables. Edge columns node1/2IsPrimer and isChimeric were used for filtering the edges table prior to the final output. They have constant values and can also be ignored. Finally, column nStrands is not consistently used by svDJ and should be ignored.
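
The sign conventions used by insertSize and by node1/node2 can be decoded as in the sketch below. The function names are hypothetical. Only the sign of a node is interpreted here, because how a node maps to a chromosome and position is not described in this section.

```python
def junction_joint_type(insert_size: int) -> str:
    """Describe a junction from the sign of its insertSize value."""
    if insert_size > 0:
        return f"{insert_size} inserted non-reference base(s)"
    if insert_size < 0:
        return f"{-insert_size} base(s) of microhomology"
    return "blunt joint"


def node_strand(node: int) -> str:
    """Return the strand implied by a node's sign (negative = bottom strand)."""
    return "-" if node < 0 else "+"


# Hypothetical junction edge; the values are invented for illustration only.
edge = {"edgeType": "D", "node1": 123456, "node2": -654321, "insertSize": -3}
summary = (
    junction_joint_type(edge["insertSize"]),
    node_strand(edge["node1"]),
    node_strand(edge["node2"]),
)
```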

Junctions

The junctions table has one row for every distinct junction sequence found in the edges table, after applying junction quality filters. Junction edges aggregated into the same junction row did not necessarily have the same sequence throughout the entire read (where base differences are most likely errors), but did have the same sequence at the junction itself, as defined by the combination of node1, node2, insertSize and, if insertSize > 0, jxnSeq.
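
The grouping rule described above can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's implementation: it assumes junction edges are already expressed in canonical orientation, uses cJxnSeq for the inserted bases, and assumes each segment contributes at most one edge per junction. The actual format of jxnKey is not shown here, and the function names are hypothetical.

```python
from collections import defaultdict


def junction_identity(edge):
    """Build the identity tuple that defines one distinct junction."""
    # Inserted bases only contribute to the identity when insertSize > 0.
    inserted = edge["cJxnSeq"] if edge["insertSize"] > 0 else ""
    return (edge["node1"], edge["node2"], edge["insertSize"], inserted)


def tally_junctions(junction_edges):
    """Count matching segments per distinct junction, split by orientation."""
    counts = defaultdict(
        lambda: {"nMatchingSegments": 0, "nCanonical": 0, "nNonCanonical": 0}
    )
    for edge in junction_edges:
        row = counts[junction_identity(edge)]
        row["nMatchingSegments"] += 1
        if edge["isCanonical"]:
            row["nCanonical"] += 1
        else:
            row["nNonCanonical"] += 1
    return dict(counts)


# Hypothetical junction edges; the values are invented for illustration only.
edges = [
    {"node1": 1000, "node2": -5000, "insertSize": -2, "cJxnSeq": "AG", "isCanonical": True},
    {"node1": 1000, "node2": -5000, "insertSize": -2, "cJxnSeq": "AG", "isCanonical": False},
    {"node1": 1000, "node2": -5000, "insertSize": 3, "cJxnSeq": "TTT", "isCanonical": True},
]
junctions = tally_junctions(edges)
```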

column data_type description
junction identifiers    
jxnKey character a unique identifier for this distinct junction sequence, matching the same column in edges
junction metadata    
nMatchingSegments integer the number of read segments that contained this junction
nCanonical integer nMatchingSegments where isCanonical is TRUE
nNonCanonical integer nMatchingSegments where isCanonical is FALSE
junction structure    
node1 integer64 stranded start position of the junction in the reference genome; negative values are on the bottom strand
node2 integer64 stranded end position of the junction in the reference genome
insertSize integer if positive, the number of non-reference bases found at the junction; if negative, the number of bases of microhomology; if 0, it is a blunt joint
jxnSeq character the base sequence of the inserted or microhomologous bases, on the canonical strand
chrom1 character the chromosome name on side 1 of the junction
chromIndex1 integer the chromosome index on side 1 of the junction
refPos1 integer the unsigned reference position on chrom1
strand1 character the strand on side 1 of the junction, either “+” or “-”
chrom2 character the chromosome name on side 2 of the junction
chromIndex2 integer the chromosome index on side 2 of the junction
refPos2 integer the unsigned reference position on chrom2
strand2 character the strand on side 2 of the junction, either “+” or “-”
fakeSeq character an idealized sequence for this read where flanking alignments are replaced with reference genome bases but the junction itself is as it was sequenced
fakeLength integer number of bases in fakeSeq, best estimate of the original source molecule length
network membership    
junctionI integer sequential number of this junction, sorted by decreasing value of nMatchingSegments
parentJunctionI integer junctionI of the parent junction of the network to which this junction belongs; NA for low frequency junctions not added to any network
parentDistance integer the number of edits from this junction to the parent junction of the network; NA for low frequency junctions not added to any network
junctionOnParent character map of this junction sequence onto its network parent, a string of same length as parent fakeSeq

Networks

The networks table further aggregates the junctions table after the construction of junction networks. The junction rows aggregated into a given network are inferred to have arisen from the same source molecule but diverged due to PCR or sequencing errors. There may be some “collisions” where two independent source molecules fortuitously had a small edit distance and were inappropriately placed into the same network. This outcome is infrequent because junction coverage is considered during network analysis, but when it happens the networks table underestimates the true molecular complexity of the source sample.

However, especially in the case of nanopore sequencing with its higher error rate, it is also likely that some reads diverged from their parent source molecules because they acquired multiple errors. Such reads fail to be merged into their true network, falsely increasing the apparent molecular complexity of the source sample. Importantly, these often result in junction networks with low coverage that can be filtered against. They also likely arise from lower quality reads, and can be filtered against by option --min-alignment-identity, which demands that the alignments flanking a junction had a high percentage of bases that matched the reference genome (implying that the junction sequence itself was also high quality).

It is up to the user to decide whether junctions or networks give the most reliable quantification of the data, whether to trust low-coverage networks, how aggressively to set quality filters, etc. In general, we think the network analysis is robust such that the networks table provides the fairest assessment of detected junctions and their relative abundance.
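
One way a user might act on these considerations is to drop low-coverage networks after loading the table. The sketch below assumes network rows are held as plain dicts. The threshold value and the function name `filter_networks` are hypothetical choices for illustration, not recommendations from the pipeline.

```python
def filter_networks(networks, min_segments):
    """Keep networks supported by at least min_segments read segments."""
    return [n for n in networks if n["nMatchingSegments"] >= min_segments]


# Hypothetical network rows; the values are invented for illustration only.
networks = [
    {"networkKey": "k1", "nMatchingSegments": 13},
    {"networkKey": "k2", "nMatchingSegments": 1},
    {"networkKey": "k3", "nMatchingSegments": 4},
]
kept = filter_networks(networks, min_segments=3)
kept_keys = [n["networkKey"] for n in kept]
```

Raising the threshold trades sensitivity for fewer spurious networks created by error-rich reads.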

column data_type description
networkKey character the jxnKey of the parent junction of the network
parentJunctionI integer the junctionI of the parent junction of the network
maxDistance integer max(parentDistance) over all junctions in the network
nMatchingJunctions integer number of junction rows aggregated into this network
parentNMatchingSegments integer nMatchingSegments of the parent junction
nextNMatchingSegments integer nMatchingSegments of the first non-parent junction added to the network
nMatchingSegments integer sum(nMatchingSegments) over all contributing junctions
nCanonical to fakeLength various as defined for the junctions table
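
The aggregate columns above can be reproduced from junction rows roughly as in the sketch below. It assumes that a parent junction lists its own junctionI as parentJunctionI with a parentDistance of 0, and that NA is represented as None. nextNMatchingSegments is omitted because it depends on the order in which junctions were added to the network, which is not recorded in these rows. The function name `summarize_networks` is hypothetical.

```python
def summarize_networks(junctions):
    """Aggregate junction rows into one summary row per network."""
    by_index = {j["junctionI"]: j for j in junctions}
    networks = {}
    for j in junctions:
        parent_i = j["parentJunctionI"]
        if parent_i is None:  # low-frequency junction not added to any network
            continue
        parent = by_index[parent_i]
        net = networks.setdefault(parent_i, {
            "networkKey": parent["jxnKey"],
            "parentJunctionI": parent_i,
            "maxDistance": 0,
            "nMatchingJunctions": 0,
            "parentNMatchingSegments": parent["nMatchingSegments"],
            "nMatchingSegments": 0,
        })
        net["maxDistance"] = max(net["maxDistance"], j["parentDistance"])
        net["nMatchingJunctions"] += 1
        net["nMatchingSegments"] += j["nMatchingSegments"]
    return list(networks.values())


# Hypothetical junction rows; the values are invented for illustration only.
junction_rows = [
    {"junctionI": 1, "jxnKey": "k1", "parentJunctionI": 1, "parentDistance": 0, "nMatchingSegments": 10},
    {"junctionI": 2, "jxnKey": "k2", "parentJunctionI": 1, "parentDistance": 1, "nMatchingSegments": 3},
    {"junctionI": 3, "jxnKey": "k3", "parentJunctionI": None, "parentDistance": None, "nMatchingSegments": 1},
]
network_rows = summarize_networks(junction_rows)
```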