IsoAnnotLite
IsoAnnotLite is a Python script that allows you to transfer isoform-level functional feature annotations from an existing tappAS-like GFF3 file to the SQANTI3 output (i.e. a set of long read-defined transcripts). The resulting file has the proper structure to be loaded into tappAS for isoform-level functional analysis. If the reference GFF3 annotation file is not provided, only structural information will be included in the IsoAnnotLite-generated GFF3 file. The transference is done at the genomic position level for transcript and protein annotations, while all gene-level annotations (e.g. Gene Ontology terms) are automatically transferred by gene ID.
Mandatory: the basic input required to run IsoAnnotLite consists of three of the output files generated by SQANTI3 QC:
- GTF file: output as `*_corrected.gtf`. Contains the long read-defined transcript models in GTF format.
- Classification file: output as `*_classification.txt`. Tab-delimited file containing one entry per transcript and a large number of columns containing SQANTI3 attributes.
- Junction file: output as `*_junctions.txt`. Tab-delimited file containing junction-level information for all the transcripts included in the classification file.
Optional: the `-gff3` argument can be used to supply an already-annotated GFF3 file containing transcript-level functional features. When this argument is not used, a tappAS-compatible GFF3 will still be generated for the user to load into tappAS; however, it will only contain structural information (transcript length, exons, UTR, etc.). Pre-computed tappAS annotation files for human, mouse, Drosophila, Arabidopsis and maize are available here.
All of the following parameters are optional:
- `-o`: name for the resulting GFF3 file.
- `-stdout`: name for the statistics results file. Only applies when a reference GFF3 is used. If not used, no file will be generated.
- `-novel`: if used, all transcripts will be treated as novel transcripts, meaning that each transcript will be annotated using functional information from all the transcripts that belong to the same gene in the reference GFF3, instead of transferring annotations from transcripts with matching IDs (default).
- `-nointronic`: if provided, intronic features will not be annotated (e.g. RBPs).
- `-statistics`: deprecated (IsoAnnotLite now always shows the statistics results).
- `-saveTranscriptIDs`: when supplied to IsoAnnotLite, five additional output files will be created to report the IDs of the transcripts that caused problems during the feature transfer process:
  - Transcripts with a reference `associated_transcript` not annotated by positional transference: `file_trans_not_annot_by_PF.txt`.
  - Novel transcripts not annotated by positional transference: `file_novel_not_annot_by_by_PF.txt`.
  - SQ3 reference gene was not found in the GFF3 annotation: `file_reference_gene_not_annot.txt`.
  - Transcripts not annotated because no features in the GFF3 annotation could be matched by position: `file_reference_transcript_not_annot.txt`.
  - SQ3 reference gene (`associated_gene`) field was empty: `file_transcript_wo_gene_ID.txt`.
As described in the SQANTI3 QC documentation, IsoAnnotLite is run internally by SQANTI3 QC if the `--isoAnnotLite` flag is supplied. In addition, users may add a tappAS reference GFF3 file via the `--gff3` argument to perform functional feature transfer. Within SQANTI3, IsoAnnotLite is run with the following parameters by default:
    python path_to_utilities/IsoAnnotLite_SQ3.py \
        *_corrected.gtf *_classification.txt *_junctions.txt \
        -gff3 *.gff3 \
        -o out_prefix -novel -stdout out_dir
Note that this behavior cannot be modified when running SQANTI3: instead, if you wish for IsoAnnotLite to behave differently (i.e. supply flags other than `-novel`, which is activated by default in the SQANTI3 QC run), you will need to run it independently after obtaining the output of SQANTI3 QC.
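An independent run can be scripted from Python. The following is a minimal sketch, not part of IsoAnnotLite: `build_command` is a hypothetical helper, all file names are placeholders, and only the `-gff3`, `-o` and `-novel` arguments shown in the default command above are used.

```python
import shlex
from typing import List, Optional


def build_command(script: str, gtf: str, classification: str, junctions: str,
                  gff3: Optional[str] = None, out_name: Optional[str] = None,
                  novel: bool = False) -> List[str]:
    """Assemble an IsoAnnotLite command line (hypothetical helper)."""
    cmd = ["python", script, gtf, classification, junctions]
    if gff3 is not None:
        cmd += ["-gff3", gff3]   # reference tappAS GFF3 (optional)
    if out_name is not None:
        cmd += ["-o", out_name]  # name for the resulting GFF3 file
    if novel:
        cmd.append("-novel")     # treat every transcript as novel
    return cmd


# Placeholder paths: replace them with your own SQANTI3 QC output files.
cmd = build_command(
    "path_to_utilities/IsoAnnotLite_SQ3.py",
    "sample_corrected.gtf", "sample_classification.txt", "sample_junctions.txt",
    gff3="reference.gff3", out_name="sample_tappAS",
)
print(shlex.join(cmd))
# To launch the run: import subprocess; subprocess.run(cmd, check=True)
```

Leaving `novel=False` here reproduces the behavior that SQANTI3 QC does not expose, i.e. annotation from matching reference transcripts only.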
The `IsoAnnotLite_SQ3.py` script included in SQ3's `utilities` folder is consistently updated as new versions of IsoAnnotLite are released. If you are running IsoAnnotLite independently, please make sure you use the latest version. To download the current version of IsoAnnotLite (v2.7.3), click here.
If used, the reference annotation file (GFF3) is read. IsoAnnotLite creates several Python dictionaries to store all the relevant transcript-level information.
IsoAnnotLite next reads the three input files generated by SQ3 and creates an auxiliary GFF3 (to prevent memory overload). This auxiliary GFF3 will later be used to create the final GFF3 file.
Using the SQ3 input files and GFF3 information, IsoAnnotLite transforms all CDS local positions to genome positions using exon information.
Using the reference GFF3 file, transcript-level functional feature positions are transformed to genomic positions using the same method.
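The local-to-genomic conversion can be illustrated as follows. This is a minimal sketch of the general technique, not IsoAnnotLite's actual code: `local_to_genomic` is a hypothetical name, and coordinates are assumed to be 1-based and inclusive. The exons are taken from the `PB.3189.4` example in the GFF3 format section of this page.

```python
from typing import List, Tuple

Exon = Tuple[int, int]  # 1-based, inclusive genomic (start, end)


def local_to_genomic(pos: int, exons: List[Exon], strand: str) -> int:
    """Map a 1-based transcript-local position to a genomic coordinate."""
    if pos < 1:
        raise ValueError("transcript positions are 1-based")
    # Walk the exons in transcription order: ascending on '+', descending on '-'.
    ordered = sorted(exons, reverse=(strand == "-"))
    remaining = pos
    for start, end in ordered:
        length = end - start + 1
        if remaining <= length:
            if strand == "-":
                return end - remaining + 1
            return start + remaining - 1
        remaining -= length
    raise ValueError("position lies beyond the end of the transcript")


exons = [
    (79052257, 79052388), (79070673, 79070951), (79077482, 79077658),
    (79079467, 79079566), (79081747, 79081863), (79089623, 79090216),
]
# CDS of the example transcript spans local positions 10..951.
print(local_to_genomic(10, exons, "+"))   # 79052266
print(local_to_genomic(951, exons, "+"))  # 79089768
```

On the minus strand the same walk runs from the last exon towards the first, so local position 1 maps to the end coordinate of the highest exon.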
A dictionary is then created to store gene, feature and genomic position information.
Features are transferred by matching genomic positions: features that are positionally defined within a long read-defined transcript in the SQ3 output will be annotated as belonging to that transcript.
In the process of transferring features between the GFF3 and the SQ3 transcripts, IsoAnnotLite behaves differently depending on the type of feature that is being handled:
- Transfer a feature located in the 3'/5' UTR region.
- Transfer a feature located in the CDS.
- Transfer protein features.
- Transfer gene-level characteristics.
In addition, two different feature transfer methodologies are implemented depending on whether the SQ3 transcript has a match in the GFF3 file (this will be the case for FSM and ISM transcripts) or needs to be treated as a novel transcript (i.e. NIC and NNC).
- In case a matching reference transcript is found (and the `-novel` flag is not used), only the matching reference transcript will be used to transfer features at all the levels mentioned above.
- For novel transcripts (or for all transcripts if the `-novel` flag is used), IsoAnnotLite will iterate over all the same-gene transcripts found in the GFF3 (using the `associated_gene` column in the SQANTI3 classification file) to retrieve positionally-defined features. The procedure is the same, but it is run once for each reference transcript that the gene has.
For a UTR feature to be transferred, its genomic position must lie inside the transcript's exons but outside the CDS region.
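A minimal sketch of this UTR check is shown below. It assumes 1-based inclusive genomic intervals, a coding transcript, and a feature that fits within a single exon; `is_utr_feature_transferable` is a hypothetical name, and the coordinates are illustrative values derived from the `PB.3189.4` example (shortened exon list).

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # 1-based, inclusive genomic (start, end)


def is_utr_feature_transferable(feature: Interval, exons: List[Interval],
                                cds: Interval) -> bool:
    """True if the feature sits inside an exon and does not touch the CDS."""
    f_start, f_end = feature
    cds_start, cds_end = cds
    inside_exon = any(s <= f_start and f_end <= e for s, e in exons)
    outside_cds = f_end < cds_start or f_start > cds_end
    return inside_exon and outside_cds


exons = [(79052257, 79052388), (79070673, 79070951), (79089623, 79090216)]
cds = (79052266, 79089768)  # genomic span of the CDS
print(is_utr_feature_transferable((79090105, 79090112), exons, cds))  # True
print(is_utr_feature_transferable((79070700, 79070710), exons, cds))  # False
```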
For CDS transcript features, two requirements must be met: 1) the feature must be contained within the transcript's exons as well as inside the CDS region; and 2) if a feature has start and end positions situated in different exons, the end and the start of the exons for both transcripts must be the same in order for IsoAnnotLite to transfer the feature.
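The two requirements can be sketched as follows. This is one possible reading of the rule, not the script's implementation: the feature must fall within the target CDS, both of its ends must land in exons, and every exon boundary it crosses must be identical in the reference and target transcripts. All function names are hypothetical and coordinates are assumed 1-based and inclusive.

```python
from typing import List, Optional, Tuple

Interval = Tuple[int, int]  # 1-based, inclusive genomic (start, end)


def exon_index(pos: int, exons: List[Interval]) -> Optional[int]:
    """Index of the exon containing pos, or None if pos is not exonic."""
    for i, (start, end) in enumerate(exons):
        if start <= pos <= end:
            return i
    return None


def crossed_boundaries(feature: Interval,
                       exons: List[Interval]) -> Optional[List[int]]:
    """Exon end/start coordinates crossed by the feature (None if not exonic)."""
    ordered = sorted(exons)
    first = exon_index(feature[0], ordered)
    last = exon_index(feature[1], ordered)
    if first is None or last is None:
        return None
    bounds: List[int] = []
    for k in range(first, last):
        bounds += [ordered[k][1], ordered[k + 1][0]]
    return bounds


def is_cds_feature_transferable(feature: Interval, target_exons: List[Interval],
                                target_cds: Interval,
                                ref_exons: List[Interval]) -> bool:
    if not (target_cds[0] <= feature[0] and feature[1] <= target_cds[1]):
        return False  # requirement 1: inside the CDS
    target = crossed_boundaries(feature, target_exons)
    # Requirement 2: same crossed boundaries (empty list for one-exon features).
    return target is not None and target == crossed_boundaries(feature, ref_exons)
```

For example, with reference exons `[(100, 200), (300, 400)]`, a feature at `(150, 350)` is transferable to a target with the same exons, but not to one whose first exon ends at 210.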
Furthermore, for protein features, we check that both transcripts are coding and have a similar CDS. If all CDS exons are the same for both transcripts, all protein features are transferred. Otherwise, at least one CDS exon must be a partial match, i.e. at least one CDS genomic region must overlap between both transcripts. In that case, IsoAnnotLite checks whether each protein feature can be transferred by genomic position.
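The decision between these cases can be sketched as a three-way classification. This is an illustrative reading only: `protein_transfer_mode` and its return labels (`"all"`, `"by_position"`, `"none"`) are hypothetical, and CDS exons are assumed to be given as 1-based inclusive genomic intervals.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # 1-based, inclusive genomic (start, end)


def protein_transfer_mode(ref_cds_exons: List[Interval],
                          target_cds_exons: List[Interval]) -> str:
    """Decide how protein features may be transferred between two transcripts."""
    if not ref_cds_exons or not target_cds_exons:
        return "none"  # at least one transcript is non-coding
    if sorted(ref_cds_exons) == sorted(target_cds_exons):
        return "all"   # identical CDS exons: transfer every protein feature
    overlaps = any(
        r_start <= t_end and t_start <= r_end
        for r_start, r_end in ref_cds_exons
        for t_start, t_end in target_cds_exons
    )
    # Partial match: each feature still has to pass the positional check.
    return "by_position" if overlaps else "none"


print(protein_transfer_mode([(100, 200), (300, 400)], [(100, 200), (300, 400)]))  # all
print(protein_transfer_mode([(100, 200), (300, 400)], [(150, 200), (300, 400)]))  # by_position
```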
For gene-level characteristics (such as GO terms), we decided to always transfer the information across matching gene IDs. The procedure for novel transcripts is equivalent, but annotations can come from multiple reference transcripts. Duplicated annotations are then removed.
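Merging gene-level annotations from several reference transcripts and dropping duplicates can be done in a few lines. The sketch below is an assumption about the data layout, not the script's code: `merge_gene_annotations` is a hypothetical name, and the transcript IDs and term labels are placeholders.

```python
from typing import Dict, List


def merge_gene_annotations(reference_transcripts: List[str],
                           annotations: Dict[str, List[str]]) -> List[str]:
    """Collect gene-level annotations from all sources, keeping first occurrences."""
    merged: List[str] = []
    for transcript_id in reference_transcripts:
        merged.extend(annotations.get(transcript_id, []))
    # dict.fromkeys removes duplicates while preserving insertion order.
    return list(dict.fromkeys(merged))


annotations = {
    "ref_tx_1": ["term_A", "term_B"],
    "ref_tx_2": ["term_B", "term_C"],
}
print(merge_gene_annotations(["ref_tx_1", "ref_tx_2"], annotations))
# ['term_A', 'term_B', 'term_C']
```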
During this step, a feature type flag is added to the GFF3 columns. This flag is required by some tappAS analyses and indicates whether the annotation is defined at the transcript, protein or gene level.
Currently, this is a hand-written method that cannot detect the feature type automatically: we have pre-defined the feature type for a long list of functional feature categories that we generally work with in the tappAS framework. Therefore, if you are working with a different kind of functional feature, IsoAnnotLite will display a warning. In this case, we recommend manually editing the `updateGTF()` function inside IsoAnnotLite to add your new feature, following the same structure as in the code.
The feature types supported by tappAS are:
- T for transcript-level features.
- P for protein-level features.
- N for gene-level features.
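A pre-defined lookup with a warning for unknown categories could look like the sketch below. It does not reproduce `updateGTF()`: the `FEATURE_TYPES` table and `pos_type` function are hypothetical, the two entries come from the example GFF3 snippet on this page, and returning `None` for unknown categories is an assumption.

```python
import warnings
from typing import Dict, Optional

# Tiny illustrative table; the real list of categories is much longer.
FEATURE_TYPES: Dict[str, str] = {
    "PAS": "T",    # transcript-level (see the example snippet)
    "miRNA": "T",  # transcript-level (see the example snippet)
}


def pos_type(feature: str, table: Dict[str, str] = FEATURE_TYPES) -> Optional[str]:
    """Return 'T', 'P' or 'N' for a known category, else warn and return None."""
    try:
        return table[feature]
    except KeyError:
        warnings.warn(f"Unknown feature category {feature!r}: add it to the table")
        return None


print(pos_type("PAS"))  # T
```

Supporting a new category then amounts to adding one entry to the table, which mirrors the manual edit recommended above.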
The GFF3 is then sorted by transcript ID, so that all entries corresponding to the same transcript are grouped together in the same block of the file.
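Grouping entries by transcript can be achieved with a stable sort on the first column. This is a minimal sketch that assumes tab-delimited lines with the transcript ID in the first field; `sort_gff3_by_transcript` is a hypothetical name, and the whole file is held in memory, which may not suit very large annotations.

```python
from typing import Iterable, List


def sort_gff3_by_transcript(lines: Iterable[str]) -> List[str]:
    """Group GFF3 records by transcript ID (first tab-delimited column)."""
    records = [line for line in lines if line.strip() and not line.startswith("#")]
    # sorted() is stable, so the order of entries within a transcript is kept.
    return sorted(records, key=lambda line: line.split("\t", 1)[0])


lines = [
    "PB.2.1\ttappAS\ttranscript",
    "PB.1.1\ttappAS\ttranscript",
    "PB.2.1\ttappAS\tgene",
    "PB.1.1\ttappAS\tgene",
]
for record in sort_gff3_by_transcript(lines):
    print(record)
```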
If missing from the SQANTI3 GTF, gene descriptions are updated using the information in the reference GFF3 file. After this step, the final GFF3 is ready.
Statistics are divided into three main sections:
1. Transcript-level summary. This section will display several entries with the number of transcripts that have or have not been successfully annotated, and the reason. The information is shown independently for FSM/ISM and novel transcripts. Users will also see:
   - No. of total transcripts that have been annotated by genomic positional feature transference.
   - No. of transcripts not annotated because the reference transcript has no annotations to transfer.
   - No. of transcripts that were not annotated because no features matched by positional transference.
   - No. of transcripts that were not annotated because the gene that they are assigned to by the `associated_gene` SQ3 column was not found in the reference GFF3.
2. Feature transfer summary. For each of the functional feature categories included in the reference GFF3, this section will show how many transcripts have been annotated with at least one feature from that category.
3. Feature-level summary. This section contains the same information as in (2), but in this case, the information is displayed by number of features instead of by number of transcripts. This summarizes how many features have been transferred from the reference GFF3 into the new GFF3 annotation file.
At the end of the statistics file, a summary line is shown including the percentage (%) of features that have been transferred in total. However, keep in mind that this count is only exact when no novel transcripts are annotated: otherwise, since a novel transcript can receive annotations from multiple reference transcripts (all of which are counted, even if they are collapsed afterwards to remove redundancies), annotated features can be counted several times. Therefore, the result will correspond to the total number of features that have been annotated against the total number of feature transfer "events" that have been attempted by IsoAnnotLite.
The tappAS functional annotation GFF3 file follows the basic Generic Feature Format 3 (GFF3). However, it has been slightly modified to suit the application: the “score” and “phase” columns are not used and some of the attributes may not fully abide by the formal specifications. The file consists of a set of annotation features for each transcript.
The file is divided into blocks, each corresponding to a transcript. Within each block, the set of features is divided into sections as follows:
- Transcript 1
  - Transcript Level Feature Annotations – basic transcript information, UTR motifs, microRNAs, etc.
  - Genomic Level Feature Annotations – exons, splice junctions, etc.
  - Protein Level Feature Annotations – gene ontology features, domains, phosphorylation sites, etc.
- Transcript 2
  - …
- Transcript 3
  - …
The annotation features must be named as expected by tappAS:
Source | Feature | Description |
---|---|---|
tappAS | transcript | Start of transcript features |
tappAS | gene | Gene information |
tappAS | CDS | CDS information |
tappAS | genomic | Start of genomic features |
tappAS | exon | Exon |
tappAS | splice_junction | Splice junction |
tappAS | protein | Start of protein features |
In addition, the remaining attributes must be named as follows:
Attribute | Description |
---|---|
ID | Feature ID |
Name | Feature name |
Desc | Feature description |
Chr | Feature chromosome |
For reference, here is a snippet of a tappAS-formatted GFF3 file (header should NOT be included):
SeqName | Source | Feature | Start | End | Score | Strand | Phase | Attributes |
---|---|---|---|---|---|---|---|---|
PB.3189.4 | tappAS | transcript | 1 | 1399 | . | + | . | ID=XM_006524897.1; primary_class=full_splice_match; PosType=T |
PB.3189.4 | tappAS | gene | 1 | 1399 | . | + | . | ID=Qpct; Name=Qpct; Desc=glutaminyl-peptide cyclotransferase (glutaminyl cyclase); PosType=T |
PB.3189.4 | tappAS | CDS | 10 | 951 | . | + | . | ID=XP_006524960.1; PosType=T |
PB.3189.4 | UTRsite | 3’UTRmotif | 1288 | 1295 | . | + | . | ID=U0023; Name=K-BOX; Desc=K-Box; PosType=T |
PB.3189.4 | UTRsite | PAS | 1380 | 1399 | . | + | . | ID=U0043; Name=PAS; Desc=Polyadenylation Signal; PosType=T |
PB.3189.4 | mirWalk | miRNA | 986 | 993 | . | + | . | ID=mmu-miR-495-5p; Name=mmu-miR-495-5p; Desc=UTR3; PosType=T |
PB.3189.4 | tappAS | genomic | 1 | 1 | . | + | . | Chr=chr17; PosType=G |
PB.3189.4 | tappAS | exon | 79052257 | 79052388 | . | + | . | Chr=chr17; PosType=G |
PB.3189.4 | tappAS | exon | 79070673 | 79070951 | . | + | . | Chr=chr17; PosType=G |
PB.3189.4 | tappAS | exon | 79077482 | 79077658 | . | + | . | Chr=chr17; PosType=G |
PB.3189.4 | tappAS | exon | 79079467 | 79079566 | . | + | . | Chr=chr17; PosType=G |
PB.3189.4 | tappAS | exon | 79081747 | 79081863 | . | + | . | Chr=chr17; PosType=G |
PB.3189.4 | tappAS | exon | 79089623 | 79090216 | . | + | . | Chr=chr17; PosType=G |
PB.3189.4 | tappAS | splice_junction | 79052388 | 79070673 | . | + | . | ID=known_canonical; Chr=chr17; PosType=G |
PB.3189.4 | tappAS | splice_junction | 79070951 | 79077482 | . | + | . | ID=known_canonical; Chr=chr17; PosType=G |
PB.3189.4 | tappAS | splice_junction | 79077658 | 79079467 | . | + | . | ID=known_canonical; Chr=chr17; PosType=G |
… | … | … | … | … | … | … | … | … |
PB.3189.4 | tappAS | protein | 1 | 313 | . | + | . | ID=NP_001303658.1; PosType=P |
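To inspect such a file programmatically, each record can be split into its nine columns and its attribute string. The sketch below assumes the file on disk is tab-delimited (the table above is only a rendering) and that attributes are `key=value` pairs separated by semicolons with no semicolons inside values; `parse_attributes` and `parse_record` are hypothetical names.

```python
from typing import Dict

COLUMNS = ["seqname", "source", "feature", "start", "end",
           "score", "strand", "phase", "attributes"]


def parse_attributes(field: str) -> Dict[str, str]:
    """Split 'ID=x; Name=y' style attributes into a dictionary."""
    attrs: Dict[str, str] = {}
    for item in field.split(";"):
        item = item.strip()
        if item:
            key, _, value = item.partition("=")
            attrs[key] = value
    return attrs


def parse_record(line: str) -> dict:
    """Parse one tab-delimited tappAS GFF3 line into a dictionary."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} columns, found {len(values)}")
    record = dict(zip(COLUMNS, values))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    record["attributes"] = parse_attributes(record["attributes"])
    return record


line = ("PB.3189.4\ttappAS\tgene\t1\t1399\t.\t+\t.\t"
        "ID=Qpct; Name=Qpct; Desc=glutaminyl-peptide cyclotransferase "
        "(glutaminyl cyclase); PosType=T")
record = parse_record(line)
print(record["feature"], record["attributes"]["PosType"])  # gene T
```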
Note that generating an annotation file is not a trivial task and it’s not recommended unless you have a good programming background and knowledge of annotation features. IsoAnnotLite has been specifically developed to assist with the file formatting and functional feature transference task.