GFF 1, 2 and 3
The essence of the GFF file format is to avoid databases
Content
- General Feature Format
- General Feature Format Version 1
- General Feature Format Version 2
- General Feature Format Version 3
- Gene Transfer Format
- How to parse a GFF/GTF file
General Feature Format
A 'feature' can be a complete gene, an RNA transcript, a protein structure, or pretty much anything else
- Plain text file format
- The format was proposed as a means to transfer feature information
- 'We do not intend GFF format to be used for complete data management of the analysis and annotation of genomic sequence'
- The original goal was a file format that is easily parsable by Unix tools such as grep, sort and awk
General Feature Format Version 1 (GFF1)
- I have never seen one
- I think it is obsolete and was superseded by GFF2
The main change from version 1 to 2 is the requirement for a tag-value type structure
General Feature Format Version 2 (GFF2)
- 9 fields
- Tab separated
<SeqName>\t<Source>\t<Feature>\t<start>\t<end>\t<score>\t<strand>\t<frame>\t[Attributes]\t[Comments]
-
The attribute field is somewhat free form:
- It must use Tag-Value pairing, where each pair is separated by a semicolon
- The Tag name can be anything matching [A-Za-z][A-Za-z0-9_]*
- The value can be anything. Free text must be surrounded by double quotes
- All 'special' Unix characters must be properly escaped, e.g. newline as \n and tab as \t
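The rules above can be sketched in Python. This is a minimal illustration rather than a complete GFF2 parser: the names `Gff2Record`, `parse_gff2_attributes` and `parse_gff2_line` are made up for this example, the sample line is invented, trailing `#` comments are not handled, and each tag is assumed to carry a single value.

```python
import re
from collections import namedtuple

# Hypothetical container for the eight fixed columns plus the attribute column.
Gff2Record = namedtuple(
    "Gff2Record",
    "seqname source feature start end score strand frame attributes",
)

# A tag ([A-Za-z][A-Za-z0-9_]*) followed by either a double-quoted value
# (which may itself contain semicolons) or an unquoted value up to the next ';'.
PAIR_RE = re.compile(r'([A-Za-z][A-Za-z0-9_]*)\s+("(?:[^"\\]|\\.)*"|[^;"]+)')


def parse_gff2_attributes(text):
    """Turn 'Tag value ; Tag "free text"' into a dictionary."""
    attributes = {}
    for tag, value in PAIR_RE.findall(text):
        value = value.strip()
        if value.startswith('"'):
            # Drop the surrounding quotes and undo the \n and \t escapes.
            value = value[1:-1].replace("\\n", "\n").replace("\\t", "\t")
        attributes[tag] = value
    return attributes


def parse_gff2_line(line):
    """Split one tab-separated GFF2 line into a Gff2Record."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError("expected at least 8 tab-separated fields")
    seqname, source, feature, start, end, score, strand, frame = fields[:8]
    attributes = parse_gff2_attributes(fields[8]) if len(fields) > 8 else {}
    return Gff2Record(
        seqname, source, feature, int(start), int(end),
        None if score == "." else float(score), strand, frame, attributes,
    )


# Invented example line, not taken from a real annotation file.
example = 'chrI\tcurated\texon\t5506900\t5506996\t.\t+\t.\tTranscript "B0273.1" ; Note "first; exon"'
record = parse_gff2_line(example)
print(record.feature, record.start, record.end, record.attributes)
```

Note that the quoted value is matched as a unit, so a semicolon inside free text does not split the pair.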
All other flavours of GFF and GTF diverge from GFF2
General Feature Format Version 3 (GFF3)
- The reason for the 'new' GFF3 format is that GFF2 had become insufficient for bioinformaticians
-
Key aspects about GFF3:
- Adds a mechanism for representing more than one level of hierarchical grouping of features and subfeatures.
- Separates the ideas of group membership and feature name/id.
- Constrains the feature type field to be taken from a controlled vocabulary.
- Allows a single feature, such as an exon, to belong to more than one group at a time.
- Provides an explicit convention for pairwise alignments.
- Provides an explicit convention for features that occupy disjunct regions.
A Tag and its Value must now be separated by an = sign
- Tag-Value pairs are still separated by ;
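As a rough sketch of how the `=` and `;` separators translate into code, the function below splits a GFF3 attribute column into a dictionary. The name `parse_gff3_attributes` and the sample string are invented for illustration. It assumes the GFF3 conventions that multiple values of one tag are comma-separated and that reserved characters inside values are percent-encoded, so every value is returned as a list.

```python
from urllib.parse import unquote


def parse_gff3_attributes(column9):
    """Split 'tag=value;tag=value1,value2' into {tag: [values]}."""
    attributes = {}
    column9 = column9.strip()
    if column9 in ("", "."):
        return attributes
    for pair in column9.split(";"):
        if not pair:
            continue  # tolerate a trailing semicolon
        tag, separator, value = pair.partition("=")
        if not separator:
            raise ValueError("attribute without '=': %r" % pair)
        # Split on raw commas first, then decode escapes such as %3B (';').
        attributes[unquote(tag)] = [unquote(item) for item in value.split(",")]
    return attributes


# Invented attribute string for illustration.
parsed = parse_gff3_attributes(
    "ID=exon00003;Parent=mRNA00001,mRNA00002;Note=Contains%3B a semicolon"
)
print(parsed)
```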
-
Predefined meaning for some tags
- ID Indicates the ID of the feature. IDs for each feature must be unique within the scope of the GFF file. In the case of discontinuous features (i.e. a single feature that exists over multiple genomic locations) the same ID may appear on multiple lines. All lines that share an ID collectively represent a single feature.
- Name Display name for the feature. This is the name to be displayed to the user. Unlike IDs, there is no requirement that the Name be unique within the file.
- Alias A secondary name for the feature. It is suggested that this tag be used whenever a secondary identifier for the feature is needed, such as locus names and accession numbers. Unlike ID, there is no requirement that Alias be unique within the file.
- Parent Indicates the parent of the feature. A parent ID can be used to group exons into transcripts, transcripts into genes, and so forth. A feature may have multiple parents. Parent can only be used to indicate a part-of relationship.
- Target Indicates the target of a nucleotide-to-nucleotide or protein-to-nucleotide alignment. The format of the value is "target_id start end [strand]", where strand is optional and may be "+" or "-". If the target_id contains spaces, they must be escaped as hex escape %20.
- Gap The alignment of the feature to the target if the two are not collinear (e.g. contain gaps). The alignment format is taken from the CIGAR format described in the Exonerate documentation. See "THE GAP ATTRIBUTE" for a description of this format.
- Derives_from Used to disambiguate the relationship between one feature and another when the relationship is a temporal one rather than a purely structural "part of" one. This is needed for polycistronic genes. See "PATHOLOGICAL CASES" for further discussion.
- Note A free text note.
- Dbxref A database cross reference. See the section "Ontology Associations and Db Cross References" for details on the format.
- Ontology_term A cross reference to an ontology term. See the section "Ontology Associations and Db Cross References" for details.
- Is_circular A flag to indicate whether a feature is circular. See the GFF3 specification for an extended discussion.
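To make the ID/Parent relationship concrete, here is a small sketch that records which features hang off which parent. The annotation lines are invented (the gene name is borrowed from the grep example elsewhere in this post), `children_by_parent` is a hypothetical helper name, and the attribute handling is deliberately naive: it ignores percent-encoding and assumes every line has nine columns.

```python
from collections import defaultdict

# Invented GFF3 records: one gene, two transcripts, two exons.
GFF3_TEXT = """\
##gff-version 3
chr1\texample\tgene\t1000\t9000\t.\t+\t.\tID=gene1;Name=Grhl1
chr1\texample\tmRNA\t1000\t9000\t.\t+\t.\tID=mRNA1;Parent=gene1
chr1\texample\tmRNA\t1000\t8000\t.\t+\t.\tID=mRNA2;Parent=gene1
chr1\texample\texon\t1000\t1200\t.\t+\t.\tID=exon1;Parent=mRNA1,mRNA2
chr1\texample\texon\t7000\t9000\t.\t+\t.\tID=exon2;Parent=mRNA1
"""


def children_by_parent(lines):
    """Map each parent ID to the IDs of the features that name it as Parent."""
    children = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and directives/comments
        fields = line.rstrip("\n").split("\t")
        attrs = dict(pair.split("=", 1) for pair in fields[8].split(";") if pair)
        feature_id = attrs.get("ID")
        # A feature may list several parents, separated by commas.
        for parent_id in filter(None, attrs.get("Parent", "").split(",")):
            children[parent_id].append(feature_id)
    return dict(children)


tree = children_by_parent(GFF3_TEXT.splitlines())
print(tree)
```

Because `exon1` lists two parents, it shows up under both transcripts, which is the "belongs to more than one group" behaviour described above.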
Sequence can be added to the GFF3 file. Use ##FASTA as a separator line between annotation and sequence
- Generic Feature Format Version 3 (GFF3) specifications
- Tag-Value pair separated by =
- Tag-Value pairs separated by ;
- You can now associate features together through the ID/Parent tags
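For the embedded-sequence case, a sketch along these lines could separate the annotation part from the sequence part at the `##FASTA` line. The function name `split_gff3` and the tiny example file are invented here, and the FASTA handling only keeps the first word of each header as the sequence name.

```python
def split_gff3(lines):
    """Return (annotation_lines, {sequence_id: sequence}) for a GFF3 file."""
    annotation = []
    sequences = {}
    current_id = None
    in_fasta = False
    for raw in lines:
        line = raw.rstrip("\n")
        if not in_fasta:
            if line.strip() == "##FASTA":
                in_fasta = True  # everything after this line is sequence
            else:
                annotation.append(line)
        elif line.startswith(">"):
            # FASTA header: keep the first word as the sequence name.
            current_id = line[1:].split()[0]
            sequences[current_id] = []
        elif line and current_id is not None:
            sequences[current_id].append(line.strip())
    return annotation, {name: "".join(parts) for name, parts in sequences.items()}


# Invented example file content.
GFF3_WITH_FASTA = """\
##gff-version 3
chr1\texample\tgene\t1\t12\t.\t+\t.\tID=gene1
##FASTA
>chr1 made-up sequence
ACGTAC
GTACGT
"""

annotation, sequences = split_gff3(GFF3_WITH_FASTA.splitlines())
print(len(annotation), sequences)
```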
Gene Transfer Format (GTF)
-
GTF is a refinement of GFF2 and is sometimes referred to as GFF2.5
Be careful about this assumption! Some tools might produce or convert your other GFF file to GFF2.5, but this new GFF2.5 file might not be compatible with your specific tool that expects a GTF file
-
A GTF file is somewhat a subclass of GFF
-
The original GTF specification said that all features must have two mandatory attributes:
- gene_id value
- transcript_id value
-
This is to handle different transcripts from the same gene
- However, GENCODE has more mandatory fields
- Tag and Value are separated by a space!
- Tag-Value pairs are separated by ;
- Gene Transfer Format (GTF) specification
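A GTF attribute column could be handled roughly as below. The helper names `parse_gtf_attributes` and `missing_mandatory` are hypothetical and the IDs are made up. Values are collected in lists because some producers repeat a tag on one line; the sketch assumes values do not contain escaped double quotes.

```python
import re
from collections import defaultdict

# tag, then whitespace, then a quoted or bare value, then ';' (or end of text).
GTF_PAIR_RE = re.compile(
    r'([A-Za-z_][A-Za-z0-9_]*)\s+(?:"([^"]*)"|([^;\s]+))\s*(?:;|$)'
)


def parse_gtf_attributes(column9):
    """Turn 'gene_id "x"; transcript_id "y";' into {tag: [values]}."""
    attributes = defaultdict(list)
    for tag, quoted, bare in GTF_PAIR_RE.findall(column9):
        attributes[tag].append(quoted if quoted else bare)
    return dict(attributes)


def missing_mandatory(attributes, mandatory=("gene_id", "transcript_id")):
    """List the mandatory tags that a record does not carry."""
    return [tag for tag in mandatory if tag not in attributes]


# Invented attribute column for illustration.
column9 = 'gene_id "geneA"; transcript_id "geneA.1"; exon_number 1; tag "basic"; tag "CCDS";'
parsed_gtf = parse_gtf_attributes(column9)
print(parsed_gtf)
print(missing_mandatory(parse_gtf_attributes('gene_id "geneA";')))
```

Passing a longer `mandatory` tuple is one way to check the stricter sets of required fields that some annotation providers use.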
To me at least, GTF is the most established and predefined gene annotation format
How to parse a GFF/GTF file?
-
Unix tools
grep Grhl1 yourAnnotationFile.gff | less
-
grep -w gene yourAnnotationFile.gff | less
grep -w exon yourAnnotationFile.gff | less
cut -f1,3,4,5,7 yourAnnotationFile.gff | less
-
Text Editor or spreadsheet tools
- Vim
- Gedit
- Sublime Text
- LibreOffice
- MS Office
-
Programmatically
- Python
- R
- Perl
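As a first programmatic step, the grep and cut one-liners from the Unix tools item can be mirrored in a few lines of Python. This is only a sketch: `select_columns` is a made-up name and the sample lines are invented. Unlike `grep -w exon`, which matches the word anywhere on the line, it compares the feature type column (column 3) only.

```python
def select_columns(lines, feature_type=None, columns=(0, 2, 3, 4, 6)):
    """Keep rows of one feature type and return the chosen columns.

    The default columns are the 0-based equivalent of `cut -f1,3,4,5,7`.
    """
    rows = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.rstrip("\n").split("\t")
        if feature_type is not None and fields[2] != feature_type:
            continue
        rows.append(tuple(fields[i] for i in columns))
    return rows


# Invented annotation lines for illustration.
sample_lines = [
    "chr1\texample\tgene\t1000\t9000\t.\t+\t.\tID=gene1",
    "chr1\texample\texon\t1000\t1200\t.\t+\t.\tID=exon1;Parent=gene1",
]
exon_rows = select_columns(sample_lines, feature_type="exon")
print(exon_rows)
```

For a real file the same function can be fed an open file handle, since it only needs an iterable of lines.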
Parse GFF/GTF with Python
- Tricky to associate features together, e.g. exon to transcript and transcript to gene
- Need to 'look ahead and look behind', but how far?
- Different genes will have different numbers of features (lines) associated with them
- A GFF/GTF file is rather big, around 1.0 - 1.5 GB (the size really depends on the species)
- Can make one big hash (dictionary) and keep it all in memory
- Best, I think, to write your personalised hash (dictionary) to a file
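One possible shape for such a personalised dictionary is sketched below. It assumes GTF input, where every exon line carries both `gene_id` and `transcript_id`, so exons can be grouped line by line without looking ahead or behind. The function names, the JSON file name and the sample lines are all invented for this example, and JSON is just one convenient on-disk format.

```python
import json
import os
import re
import tempfile
from collections import defaultdict

ID_RE = re.compile(r'\b(gene_id|transcript_id) "([^"]*)"')


def index_exons(gtf_lines):
    """Build {gene_id: {transcript_id: [(start, end), ...]}} from exon lines."""
    index = defaultdict(lambda: defaultdict(list))
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue
        ids = dict(ID_RE.findall(fields[8]))
        index[ids["gene_id"]][ids["transcript_id"]].append(
            (int(fields[3]), int(fields[4]))
        )
    return index


def save_index(index, path):
    with open(path, "w") as handle:
        json.dump(index, handle)  # tuples are stored as JSON lists


def load_index(path):
    with open(path) as handle:
        return json.load(handle)


# Invented GTF lines for illustration.
gtf_lines = [
    'chr1\texample\tgene\t100\t400\t.\t+\t.\tgene_id "geneA";',
    'chr1\texample\texon\t100\t200\t.\t+\t.\tgene_id "geneA"; transcript_id "geneA.1";',
    'chr1\texample\texon\t300\t400\t.\t+\t.\tgene_id "geneA"; transcript_id "geneA.1";',
    'chr1\texample\texon\t100\t200\t.\t+\t.\tgene_id "geneA"; transcript_id "geneA.2";',
]

with tempfile.TemporaryDirectory() as folder:
    index_path = os.path.join(folder, "exon_index.json")
    save_index(index_exons(gtf_lines), index_path)
    loaded = load_index(index_path)

print(loaded)
```

The index is still built in memory once, but later scripts only need to load the much smaller JSON file instead of re-reading the whole annotation.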
Python packages for dealing with GFF/GTF
-
Nice bunch of functions from Gist
- Found this too slow and didn't want to keep it all in memory
-
- Also found it to be slow and confusing
-
Really liked using it and now use it all the time
-
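Two step process:
- Make a database file .db from your GFF/GTF file
- Parse anything you want from your .db file
- Open the database file (gffutils.FeatureDB connects to an existing .db file; the file itself is built beforehand with gffutils.create_db)
import gffutils
# dbFile is the path to the .db file built from your GFF/GTF file
db = gffutils.FeatureDB(dbFile, keep_order=True)
features = db.all_features()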
-
features is now your generator object, meaning you can loop over it
import gffutils
# dbFile is the path to the .db file built from your GFF/GTF file
db = gffutils.FeatureDB(dbFile, keep_order=True)
features = db.all_features()
for feature in features:
    print(feature)
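To show why the build-once, query-many-times idea works, here is a toy version of the two steps using only `sqlite3` from the standard library. It is not how gffutils lays out its database; the table layout, the function names `build_db` and `select_features`, and the sample lines are assumptions made for this sketch.

```python
import sqlite3


def build_db(gff_lines, db_path):
    """Step 1: load the nine GFF columns into a single SQLite table."""
    connection = sqlite3.connect(db_path)
    connection.execute("DROP TABLE IF EXISTS features")
    connection.execute(
        "CREATE TABLE features (seqid TEXT, source TEXT, featuretype TEXT, "
        "feature_start INTEGER, feature_end INTEGER, score TEXT, strand TEXT, "
        "frame TEXT, attributes TEXT)"
    )
    rows = (
        line.rstrip("\n").split("\t")[:9]
        for line in gff_lines
        if line.strip() and not line.startswith("#")
    )
    connection.executemany(
        "INSERT INTO features VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)", rows
    )
    connection.commit()
    return connection


def select_features(connection, featuretype):
    """Step 2: ask the database for what you need, sorted by start position."""
    cursor = connection.execute(
        "SELECT seqid, feature_start, feature_end FROM features "
        "WHERE featuretype = ? ORDER BY feature_start",
        (featuretype,),
    )
    return cursor.fetchall()


# Invented annotation lines; ":memory:" keeps the toy database off disk.
# A real path such as "annotation.db" would persist between runs.
toy_lines = [
    "##gff-version 3",
    "chr1\texample\tgene\t1000\t9000\t.\t+\t.\tID=gene1",
    "chr1\texample\texon\t7000\t9000\t.\t+\t.\tID=exon2;Parent=gene1",
    "chr1\texample\texon\t1000\t1200\t.\t+\t.\tID=exon1;Parent=gene1",
]
connection = build_db(toy_lines, ":memory:")
exons = select_features(connection, "exon")
print(exons)
```

The expensive pass over the annotation file happens once in step 1; after that, each question is a cheap query rather than another scan of the text file.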
- And now you can get all of these things for a single feature (GFF line)
astuple
attributes
bin
calc_bin
chrom
dialect
end
extra
featuretype
file_order
frame
id
keep_order
score
seqid
sequence
sort_attribute_values
source
start
stop
strand
By trying very hard to move away from databases, we ended up with a database as the best solution