The Graphical Fragment Assembly (GFA) Format

Initially designed and widely used for sequence assembly graphs, GFA can also encode arbitrary bidirected sequence graphs. GFA primarily consists of two types of lines: segment lines (S-lines) and link lines (L-lines). An S-line represents a sequence; an L-line represents a connection between two oriented segments. The core of GFA can be loosly described by the following grammar:

<gfa>    <- <line>+
<line>   <- <S-line> | <L-line>
<S-line> <- 'S' <segId> <seq> <tag>*
<L-line> <- 'L' <segId> <strand> <segId> <strand> <cigar> <tag>*
<strand> <- '+' | '-'

where fields on each line are TAB delimited; <cigar> and <tag> follow the same format as in SAM. The figure below shows an example. For details, please see the GFA spec.

The Reference GFA (rGFA) Format

In GFA, each base can be indexed by a segment ID and an offset on the segment. This gives us the segment coordinate of the base. The segment coordinate is unstable -- when we split a segment in half, the coordinate is changed. As a pan-genome graph demands long-term stability, an ordinary GFA is not a good fit. rGFA address this issue.

rGFA is a strict subset of GFA. It disallows overlaps between segments and requires three additional tags on each segment. These tags trace the origin of the segment:

Tag	Type	Description
`SN`	`Z`	Name of stable sequence from which the segment is derived
`SO`	`i`	Offset on the stable sequence
`SR`	`i`	Rank. `0` if on a linear reference genome; `>0` otherwise

When segments don't overlap on stable sequences, each base in the graph is uniquely indexed by the stable sequence name and the offset on the stable sequence. This is called the stable coordinate of the base. The stable coordinate never changes as long as bases remain in the graph.

The figure on the right shows an example rGFA. In this example, thick arrows denote segments and thin gray lines denote links. Colors indicate ranks. We can pinpoint a postion such as chr1:9 and map existing notations onto the graph. We can also denote a walk or path in the stable coordinate. For example, path v1->v2->v3->v4 corresponds to chr1:0-17 and path v1->v2->v5->v6 corresponds to chr1:0-8=>foo:8-16. rGFA inherits the coordinate system of the current linear reference and smoothly extends it to the graph representation.

The Graph Alignment Format (GAF)

GAF is a TAB delimited format for sequence-to-graph alignments. It is a strict superset of the PAF format. Each GAF line consists of 12 mandatory fields:

Col	Type	Description
1	string	Query sequence name
2	int	Query sequence length
3	int	Query start (0-based; closed)
4	int	Query end (0-based; open)
5	char	Strand relative to the path: "+" or "-"
6	string	Path matching `/([><][^\s><]+(:\d+-\d+)?)+\|([^\s><]+)/`
7	int	Path length
8	int	Start position on the path (0-based)
9	int	End position on the path (0-based)
10	int	Number of residue matches
11	int	Alignment block length
12	int	Mapping quality (0-255; 255 for missing)

A path on column 6 is defined by the following grammar:

<path>       <- <stableId> | <orientIntv>+
<orientIntv> <- ('>' | '<') (<segId> | <stableIntv>)
<stableIntv> <- <stableId> ':' <start> '-' <end>

where <segId> is the second column on an S-line in GFA and <stableId> is an identifier on an SN tag in rGFA. With <segId>, GAF can encode alignments against an ordinary GFA. <stableId> is preferred for alignments against rGFA. Notably, if a path consists of a single interval on a stable sequence, we can simplify the entire path to <stableId> without <start> and <end>, and assume the orientation to be forward >. Such a GAF line is reduced to PAF.

The following example shows the read mapping for GTGGCT and CGTTTCC against the example rGFA above. In the unstable segment coordinate, the GAF is:

read1  6  0  6  +  >s2>s3>s4           12  2   8  6  6  60  cg:Z:6M
read2  7  0  7  +  >s2>s5>s6           11  1   8  7  7  60  cg:Z:7M

In the stable coordinate, the GAF becomes:

read1  6  0  6  +  chr1                17  7  13  6  6  60  cg:Z:6M
read2  7  0  7  +  >chr1:5-8>foo:8-16  11  1   8  7  7  60  cg:Z:7M

In the stable form, GAF remains the same as long as bases are not removed from rGFA.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

rGFA.md

rGFA.md

The Graphical Fragment Assembly (GFA) Format

The Reference GFA (rGFA) Format

The Graph Alignment Format (GAF)

Files

rGFA.md

Latest commit

History

rGFA.md

File metadata and controls

The Graphical Fragment Assembly (GFA) Format

The Reference GFA (rGFA) Format

The Graph Alignment Format (GAF)