
IsoAnnotLite

Ángeles Arzalluz-Luque edited this page Oct 2, 2023 · 5 revisions


Summary

IsoAnnotLite is a Python script that allows you to transfer isoform-level functional feature annotations from an existing tappAS-like GFF3 file to the SQANTI3 output (i.e. a set of long read-defined transcripts). The resulting file has the proper structure to be loaded into tappAS for isoform-level functional analysis. If the reference GFF3 annotation file is not provided, only structural information will be included in the IsoAnnotLite-generated GFF3 file. The transfer is done at the genomic position level for transcript and protein annotations, while all gene-level annotations (e.g. Gene Ontology terms) are automatically transferred by gene ID.

Arguments and parameters in IsoAnnotLite

Input files:

Mandatory: the basic input required to run IsoAnnotLite consists of three of the output files generated by SQANTI3 QC:

  • GTF file: output as *_corrected.gtf. Contains the long read-defined transcript sequences in GTF format.
  • Classification file: output as *_classification.txt. Tab-delimited file containing one entry per transcript and a large number of columns containing SQANTI3 attributes.
  • Junction file: output as *_junctions.txt. Tab-delimited file containing junction-level information for all the transcripts included in the classification file.

Optional: the -gff3 argument can be used to supply an already-annotated GFF3 file containing transcript-level functional features. When this argument is not used, a tappAS-compatible GFF3 will still be generated for the user to load into tappAS; however, it will only contain structural information (transcript length, exons, UTR, etc.). Pre-computed tappAS annotation files for human, mouse, Drosophila, Arabidopsis and maize are available here.

Parameters:

All of the following parameters are optional:

  • -o: name for the resulting GFF3 file.
  • -stdout: name for the statistics results file. Only applies when a reference GFF3 is supplied. If not used, no file will be generated.
  • -novel: if used, all transcripts will be treated as novel transcripts, meaning that each transcript will be annotated using functional information from all the transcripts that belong to the same gene in the reference GFF3, instead of transferring annotations from transcripts with matching IDs (default).
  • -nointronic: if provided, intronic features will not be annotated (e.g. RBPs).
  • -statistics: deprecated (IsoAnnotLite now always shows the statistics results).
  • -saveTranscriptIDs: when supplied to IsoAnnotLite, five additional output files will be created to report the IDs of the transcripts that generated problems during the feature transfer process:
    • Transcripts with a reference associated_transcript not annotated by positional transference: file_trans_not_annot_by_PF.txt.
    • Novel transcripts not annotated by positional transference: file_novel_not_annot_by_by_PF.txt.
    • SQ3 reference gene was not found in the GFF3 annotation: file_reference_gene_not_annot.txt.
    • Transcripts not annotated because no features in the GFF3 annotation could be matched by position: file_reference_transcript_not_annot.txt.
    • SQ3 reference gene (associated_gene) field was empty: file_transcript_wo_gene_ID.txt.

IsoAnnotLite integration within SQANTI3 QC

As described in the SQANTI3 QC documentation, IsoAnnotLite is run internally by SQANTI3 QC if the --isoAnnotLite flag is supplied. In addition, users may add a tappAS reference GFF3 file via the --gff3 argument to perform functional feature transfer. Within SQANTI3, IsoAnnotLite is run with the following parameters by default:

python path_to_utilities/IsoAnnotLite_SQ3.py \
    *_corrected.gtf *_classification.txt *_junctions.txt \
    -gff3 *.gff3 \
    -o out_prefix -novel -stdout out_dir

Note that this behavior cannot be modified when running SQANTI3: if you wish for IsoAnnotLite to behave differently (i.e. supply flags other than -novel, which is activated by default in the SQANTI3 QC run), you will need to run it independently after obtaining the output of SQANTI3 QC.
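
For a standalone run, the command line can be assembled programmatically. The following is a minimal sketch that only uses the arguments shown in the command above (-gff3, -o, -novel, -stdout). The helper name build_isoannotlite_cmd and all file names are hypothetical and serve only as an illustration.

```python
import shlex


def build_isoannotlite_cmd(script, gtf, classification, junctions,
                           gff3=None, out=None, novel=False, stdout=None):
    """Assemble an IsoAnnotLite call as a list of command-line arguments."""
    # The three SQANTI3 QC output files are positional and mandatory.
    cmd = ["python", script, gtf, classification, junctions]
    if gff3 is not None:
        cmd += ["-gff3", gff3]
    if out is not None:
        cmd += ["-o", out]
    if novel:
        cmd.append("-novel")
    if stdout is not None:
        cmd += ["-stdout", stdout]
    return cmd


# Hypothetical file names; -novel is left out to transfer by matching IDs.
cmd = build_isoannotlite_cmd(
    "path_to_utilities/IsoAnnotLite_SQ3.py",
    "sample_corrected.gtf",
    "sample_classification.txt",
    "sample_junctions.txt",
    gff3="reference.gff3",
    out="sample_tappAS.gff3",
)
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) once the paths point to real files.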

The IsoAnnotLite_SQ3.py script included in SQ3's utilities folder is consistently updated as new versions of IsoAnnotLite are released. If you are running IsoAnnotLite independently, please make sure you use the latest version. To download the current version of IsoAnnotLite (v2.7.3), click here.

Appendix: how does IsoAnnotLite work?

1. Reading reference annotation file and creating data variables

If used, the reference annotation file (GFF3) is read. The function creates different Python dictionaries to save all the relevant transcript-level information.

2. Reading SQANTI3 files and creating an auxiliary GFF3

IsoAnnotLite next reads the three input files generated by SQ3 and creates an auxiliary GFF3 (to prevent memory overload). This auxiliary GFF3 will later be used to create the final GFF3 file.

3. Transforming CDS local positions to genomic positions

Using the SQ3 input files and GFF3 information, IsoAnnotLite transforms all CDS local positions to genomic positions using exon information.
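
The conversion from transcript-local to genomic coordinates can be sketched as follows. This is not IsoAnnotLite's own code: the function local_to_genomic is a hypothetical helper that assumes 1-based inclusive coordinates and a list of (start, end) exon intervals. The exon coordinates and the CDS positions (10 and 951) are those of the example transcript PB.3189.4 in Appendix III.

```python
def local_to_genomic(local_pos, exons, strand="+"):
    """Map a 1-based transcript position to a 1-based genomic coordinate.

    exons: list of (start, end) genomic intervals, inclusive.
    """
    if local_pos < 1:
        raise ValueError("local positions are 1-based")
    # Walk exons in transcription order: ascending on +, descending on -.
    ordered = sorted(exons, reverse=(strand == "-"))
    remaining = local_pos
    for start, end in ordered:
        length = end - start + 1
        if remaining <= length:
            if strand == "+":
                return start + remaining - 1
            return end - remaining + 1
        # Position lies downstream of this exon: consume its length.
        remaining -= length
    raise ValueError("position lies beyond the end of the transcript")


# Exons of PB.3189.4 (chr17, + strand), as listed in Appendix III.
exons = [(79052257, 79052388), (79070673, 79070951), (79077482, 79077658),
         (79079467, 79079566), (79081747, 79081863), (79089623, 79090216)]

cds_start = local_to_genomic(10, exons)   # falls in the first exon
cds_end = local_to_genomic(951, exons)    # falls in the last exon
print(cds_start, cds_end)
```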

4. Transforming feature local positions to genomic positions

Using the reference GFF3 file, transcript-level functional feature positions are transformed to genomic positions using the same methodology.

5. Generating transcript-level information per gene

A dictionary is created in which gene, feature and genomic position information is stored.

6. Mapping transcript features between GFF3s

Features are transferred by matching genomic positions: features that are positionally defined within a long read-defined transcript in the SQ3 output will be annotated as belonging to that transcript.

In the process of transferring features between the GFF3 and the SQ3 transcripts, IsoAnnotLite behaves differently depending on the type of feature that is being handled:

  • Transfer a feature located in the 3' or 5' UTR region.
  • Transfer a feature located in the CDS.
  • Transfer protein features.
  • Transfer gene-level characteristics.

In addition, two different feature transfer methodologies are implemented depending on whether the SQ3 transcript has a match in the GFF3 file (this will be the case for FSM and ISM transcripts) or needs to be treated as a novel transcript (i.e. NIC and NNC).

  • In case a matching reference transcript is found (and the -novel flag is not used), only the matching reference transcript will be used to transfer features at all the levels mentioned above.
  • For novel transcripts (or for all transcripts if the -novel flag is used), IsoAnnotLite will iterate over all the same-gene transcripts found in the GFF3 (using the associated_gene column in the SQANTI3 classification file) to retrieve positionally-defined features. The procedure is the same, but it is run once for each reference transcript of the gene.

For a UTR feature to be transferred, its genomic position must lie inside the transcript's exons but outside the CDS region.

For CDS transcript features, two requirements must be met: 1) the feature must be contained within the transcript's exons as well as inside the CDS region; and 2) if a feature has start and end positions situated in different exons, the end and the start of the exons for both transcripts must be the same in order for IsoAnnotLite to transfer the feature.
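
The UTR and CDS rules can be illustrated with a small sketch. It is one possible reading of the conditions described here, not the code used by IsoAnnotLite: all function names are hypothetical, coordinates are genomic and inclusive, exons are sorted by start, and "the end and the start of the exons must be the same" is interpreted as "the exon boundaries crossed by the feature are identical in both transcripts".

```python
def exon_index(pos, exons):
    """Return the index of the exon that contains pos, or None."""
    for i, (start, end) in enumerate(exons):
        if start <= pos <= end:
            return i
    return None


def crossed_boundaries(exons, first, last):
    """(exon end, next exon start) pairs crossed between two exon indices."""
    return [(exons[k][1], exons[k + 1][0]) for k in range(first, last)]


def can_transfer_utr(feature, exons, cds):
    """UTR rule: both ends inside exons, feature entirely outside the CDS."""
    f_start, f_end = feature
    in_exons = (exon_index(f_start, exons) is not None
                and exon_index(f_end, exons) is not None)
    outside_cds = f_end < cds[0] or f_start > cds[1]
    return in_exons and outside_cds


def can_transfer_cds(feature, target_exons, ref_exons, cds):
    """CDS rule: inside exons and CDS; shared boundaries if exons are spanned."""
    f_start, f_end = feature
    i, j = exon_index(f_start, target_exons), exon_index(f_end, target_exons)
    if i is None or j is None:
        return False
    if not (cds[0] <= f_start and f_end <= cds[1]):
        return False
    if i == j:
        return True  # feature sits within a single exon
    ri, rj = exon_index(f_start, ref_exons), exon_index(f_end, ref_exons)
    if ri is None or rj is None:
        return False
    # Assumed interpretation: crossed exon boundaries must match exactly.
    return (crossed_boundaries(target_exons, i, j)
            == crossed_boundaries(ref_exons, ri, rj))


# Toy coordinates, chosen only for illustration.
target = [(100, 200), (300, 400), (500, 600)]
same_junction_ref = [(100, 200), (300, 400)]
shifted_junction_ref = [(100, 205), (300, 400)]
cds = (150, 550)

print(can_transfer_utr((110, 120), target, cds))
print(can_transfer_cds((190, 310), target, same_junction_ref, cds))
print(can_transfer_cds((190, 310), target, shifted_junction_ref, cds))
```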

Furthermore, for protein features, IsoAnnotLite checks that both transcripts are coding and have similar CDS. If all CDS exons are the same for both transcripts, all protein features are transferred. Otherwise, at least one CDS exon must be a partial match, i.e. at least one CDS genomic region must overlap between both transcripts. In that case, IsoAnnotLite checks whether any protein feature can be transferred by genomic position.
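
The decision for protein features can be summarized as a three-way choice. The sketch below is an illustration under simple assumptions: protein_transfer_mode is a hypothetical helper, CDS exons are given as inclusive genomic (start, end) intervals, and an empty list stands for a non-coding transcript.

```python
def protein_transfer_mode(target_cds_exons, ref_cds_exons):
    """Return 'all', 'positional' or 'none' for a pair of transcripts."""
    # Both transcripts must be coding.
    if not target_cds_exons or not ref_cds_exons:
        return "none"
    # Identical CDS exons: every protein feature can be transferred.
    if sorted(target_cds_exons) == sorted(ref_cds_exons):
        return "all"
    # Otherwise at least one CDS region has to overlap.
    overlaps = any(t_start <= r_end and r_start <= t_end
                   for t_start, t_end in target_cds_exons
                   for r_start, r_end in ref_cds_exons)
    return "positional" if overlaps else "none"


# Toy coordinates, chosen only for illustration.
print(protein_transfer_mode([(150, 200), (300, 400)], [(150, 200), (300, 400)]))
print(protein_transfer_mode([(150, 200), (300, 400)], [(180, 200), (300, 350)]))
print(protein_transfer_mode([(150, 200)], [(300, 400)]))
```

In the "positional" case, each protein feature would still have to pass a position-based check before being transferred.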

For gene-level characteristics (such as GOTerms), we decided to always transfer the information across matching gene IDs. The procedure for novel transcripts is equivalent, but annotations can come from multiple transcripts. Duplicated annotations are then removed.
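
Collapsing duplicated gene-level annotations that come from several reference transcripts could look like the sketch below. The helper collapse_gene_annotations, the transcript names and the GO identifiers are placeholders invented for this example.

```python
def collapse_gene_annotations(per_transcript_annotations):
    """Merge gene-level annotations from several reference transcripts.

    per_transcript_annotations: iterable of lists of hashable annotation IDs.
    Returns the unique annotations in first-seen order.
    """
    merged = [annot
              for annots in per_transcript_annotations
              for annot in annots]
    # dict.fromkeys keeps insertion order while dropping repeats.
    return list(dict.fromkeys(merged))


# Placeholder IDs: two reference transcripts of the same gene.
ref_annotations = {
    "ref_tx_1": ["GO:0000001", "GO:0000002"],
    "ref_tx_2": ["GO:0000002", "GO:0000003"],
}
novel_annotations = collapse_gene_annotations(ref_annotations.values())
print(novel_annotations)
```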

7. Adding extra information to the GFF3 columns

During this step, a feature type flag is added to the GFF3 columns. This flag is required in some of tappAS' analyses and indicates whether the annotation is defined at the transcript, protein or gene level.

Currently, feature types are assigned from a hand-written list and cannot be detected automatically: we have pre-defined the feature type for a long list of functional feature categories that we generally work with in the tappAS framework. Therefore, if you are working with a different kind of functional feature, IsoAnnotLite will display a warning. In this case, we recommend manually editing the updateGTF() function inside IsoAnnotLite to add your new feature, following the same structure as in the code.

The feature types supported by tappAS are:

  • T for transcript-level features.
  • P for protein-level features.
  • N for gene-level features.
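
The idea of a pre-defined lookup with a warning for unknown categories can be sketched as below. This does not reproduce updateGTF(): the table POS_TYPE and the function pos_type_for are hypothetical. Only the miRNA and PAS entries are taken from the example file in Appendix III; the DOMAIN and GO_term category names are assumptions made for illustration.

```python
import warnings

# Hand-written mapping from feature category to tappAS feature type.
POS_TYPE = {
    "miRNA": "T",     # transcript-level (as in the Appendix III example)
    "PAS": "T",       # transcript-level (as in the Appendix III example)
    "DOMAIN": "P",    # protein-level (assumed category name)
    "GO_term": "N",   # gene-level (assumed category name)
}


def pos_type_for(feature_category):
    """Look up the feature type, warning when the category is not listed."""
    try:
        return POS_TYPE[feature_category]
    except KeyError:
        warnings.warn(
            f"unknown feature category {feature_category!r}: "
            "add it to the mapping to assign a feature type"
        )
        return None


print(pos_type_for("miRNA"))
```

Supporting a new feature category then amounts to adding one entry to the mapping.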

8. Reading and sorting the GFF3 file

The GFF3 is sorted by transcript ID so that all entries corresponding to the same transcript are grouped together in the same block of the file.
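
A minimal sketch of this grouping, assuming tab-separated records with the transcript ID in the first column (the function sort_by_transcript and the toy records are hypothetical). Python's sort is stable, so the order of entries within each transcript is preserved.

```python
def sort_by_transcript(lines):
    """Group GFF3 records by transcript ID, keeping within-transcript order."""
    return sorted(lines, key=lambda line: line.split("\t", 1)[0])


# Toy records with interleaved transcripts.
records = [
    "PB.2.1\ttappAS\ttranscript\t1\t900\t.\t+\t.\tID=tx2; PosType=T",
    "PB.1.1\ttappAS\ttranscript\t1\t500\t.\t+\t.\tID=tx1; PosType=T",
    "PB.2.1\ttappAS\tgenomic\t1\t1\t.\t+\t.\tChr=chr1; PosType=G",
    "PB.1.1\ttappAS\tgenomic\t1\t1\t.\t+\t.\tChr=chr1; PosType=G",
]
sorted_records = sort_by_transcript(records)
for record in sorted_records:
    print(record)
```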

9. Updating gene descriptions

If missing from the SQANTI3 GTF, gene descriptions are updated using the information in the reference GFF3 file. After this step, the final GFF3 is ready.

Appendix II: IsoAnnotLite statistics

Statistics are divided into three main sections:

  1. Transcript-level summary. This section will display several entries with the number of transcripts that have been successfully annotated or not, and the reason. The information is shown independently for FSM/ISM and novel transcripts. Users will also see:

    • No. of total transcripts that have been annotated by genomic positional feature transference.
    • No. of transcripts not annotated because the reference transcript has no annotations to transfer.
    • No. of transcripts that were not annotated because no features matched by positional transference.
    • No. of transcripts that were not annotated because the gene that they are assigned to by the associated_gene SQ3 column was not found in the reference GFF3.
  2. Feature transfer summary. For each of the functional feature categories included in the reference GFF3, this section will show how many transcripts have been annotated with at least one feature from that category.

  3. Feature-level summary. This section contains the same information as in (2), but in this case, the information is displayed by number of features instead of by number of transcripts. This summarizes how many features have been transferred from the reference GFF3 into the new GFF3 annotation file.

At the end of the statistics file, a summary line is shown including the percentage (%) of features that have been transferred in total. However, keep in mind that this count is only exact when no novel transcripts are annotated: otherwise, since a novel transcript can receive annotations from multiple reference transcripts (all of which are counted, even if they are collapsed afterwards to remove redundancies), annotated features can be counted several times. Therefore, the result will correspond to the total number of features that have been annotated against the total number of feature transfer "events" that have been tried by IsoAnnotLite.

Appendix III: tappAS GFF3 file format

The tappAS functional annotation GFF3 file follows the basic Generic Feature Format 3 (GFF3). However, it has been slightly modified to suit the application: the “score” and “phase” columns are not used and some of the attributes may not fully abide by the formal specifications. The file consists of a set of annotation features for each transcript.

The file is divided into blocks, each corresponding to a transcript. Within each block, the set of features is divided into sections as follows:

  • Transcript 1
    • Transcript Level Feature Annotations – basic transcript information, UTR motifs, microRNAs, etc.
    • Genomic Level Feature Annotations – exons, splice junctions, etc.
    • Protein Level Feature Annotations – gene ontology features, domains, phosphorylation sites, etc.
  • Transcript 2
    • ...
  • Transcript 3

The annotation features must be named as expected by tappAS:

Source   Feature           Description
tappAS   transcript        Start of transcript features
tappAS   gene              Gene information
tappAS   CDS               CDS information
tappAS   genomic           Start of genomic features
tappAS   exon              Exon
tappAS   splice_junction   Splice junction
tappAS   protein           Start of protein features

In addition, the remaining attributes must be named as follows:

Attribute   Description
ID          Feature ID
Name        Feature name
Desc        Feature description
Chr         Feature chromosome

For reference, here is a snippet of a tappAS-formatted GFF3 file (header should NOT be included):

SeqName Source Feature Start End Score Strand Phase Attributes
PB.3189.4 tappAS transcript 1 1399 . + . ID=XM_006524897.1; primary_class=full_splice_match; PosType=T
PB.3189.4 tappAS gene 1 1399 . + . ID=Qpct; Name=Qpct; Desc=glutaminyl-peptide cyclotransferase (glutaminyl cyclase); PosType=T
PB.3189.4 tappAS CDS 10 951 . + . ID=XP_006524960.1; PosType=T
PB.3189.4 UTRsite 3'UTRmotif 1288 1295 . + . ID=U0023; Name=K-BOX; Desc=K-Box; PosType=T
PB.3189.4 UTRsite PAS 1380 1399 . + . ID=U0043; Name=PAS; Desc=Polyadenylation Signal; PosType=T
PB.3189.4 mirWalk miRNA 986 993 . + . ID=mmu-miR-495-5p; Name=mmu-miR-495-5p; Desc=UTR3; PosType=T
PB.3189.4 tappAS genomic 1 1 . + . Chr=chr17; PosType=G
PB.3189.4 tappAS exon 79052257 79052388 . + . Chr=chr17; PosType=G
PB.3189.4 tappAS exon 79070673 79070951 . + . Chr=chr17; PosType=G
PB.3189.4 tappAS exon 79077482 79077658 . + . Chr=chr17; PosType=G
PB.3189.4 tappAS exon 79079467 79079566 . + . Chr=chr17; PosType=G
PB.3189.4 tappAS exon 79081747 79081863 . + . Chr=chr17; PosType=G
PB.3189.4 tappAS exon 79089623 79090216 . + . Chr=chr17; PosType=G
PB.3189.4 tappAS splice_junction 79052388 79070673 . + . ID=known_canonical; Chr=chr17; PosType=G
PB.3189.4 tappAS splice_junction 79070951 79077482 . + . ID=known_canonical; Chr=chr17; PosType=G
PB.3189.4 tappAS splice_junction 79077658 79079467 . + . ID=known_canonical; Chr=chr17; PosType=G
PB.3189.4 tappAS protein 1 313 . + . ID=NP_001303658.1; PosType=P
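
To inspect a file in this format, a record can be split into its nine columns and its attributes. The sketch below is a hypothetical helper (parse_tappas_line), not part of IsoAnnotLite or tappAS. It assumes tab-separated columns, as in standard GFF3, and attributes separated by ";" as in the snippet above.

```python
def parse_tappas_line(line):
    """Split one tab-separated tappAS GFF3 record into a dictionary."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 columns, found {len(cols)}")
    seqname, source, feature, start, end, _score, strand, _phase, attrs = cols
    attributes = {}
    for item in attrs.split(";"):
        item = item.strip()
        if item:
            # Split on the first "=" only, values may contain "=".
            key, _, value = item.partition("=")
            attributes[key] = value
    return {
        "seqname": seqname,
        "source": source,
        "feature": feature,
        "start": int(start),
        "end": int(end),
        "strand": strand,
        "attributes": attributes,
    }


# The gene record from the snippet above, joined with tabs.
line = "\t".join([
    "PB.3189.4", "tappAS", "gene", "1", "1399", ".", "+", ".",
    "ID=Qpct; Name=Qpct; Desc=glutaminyl-peptide cyclotransferase "
    "(glutaminyl cyclase); PosType=T",
])
record = parse_tappas_line(line)
print(record["feature"], record["attributes"]["PosType"])
```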

Note that generating an annotation file is not a trivial task and it’s not recommended unless you have a good programming background and knowledge of annotation features. IsoAnnotLite has been specifically developed to assist with the file formatting and functional feature transference task.
