Immuannot is a pipeline built on minimap2 to detect and annotate immunological genes for human genome assembly. By taking advantage of gene sequences from IPD-IMGT/HLA, IPD-KIR, and [RefSeq](accession number NG_011638.1) as references, it is able to annotate HLA and KIR allele at full precision (if exists in the reference data set) and to report novel allele by locating new mutations that do not exist in the reference allele set.
Last update date : 04/09/2024
An HLA gene is detected if any reference allele sequences of that gene are well mapped to the target assembly (overlapping rate > 90%). In a region with multiple mapping allele sequences, the one with the smallest edit-distance and longest matching length is chosen as the template allele, which is futher used for gene structure annotation.
If the edit-distance of the chosen template allele is zero, which means a perfect matching, the contig then is reproted to carry that allele. Otherwise, CDS is extracted for calling the allele type.
C4 gene is detected through split alignment. Taking the exons from one gene sequence as the query, the boundaries of each exon, the length of the 9th intron, and key pipetides in the 26th exon can be estimated through aligning the query to the target contig. Those information are used for gene structure annotation and gene typing.
Because each reference gene sequence could be mapped to different regions of the target genome, copy number of a particular gene is reported naturally with the number of mapping clusters.
See our manuscript for more details:
Full resolution HLA and KIR genes annotation for human genome assemblies (doi: 10.1101/gr. 278985.124)
HLA-A,HLA-B,HLA-C,HLA-DMA,HLA-DMB,HLA-DOA,HLA-DOB,HLA-DPA1,HLA-DPA2,HLA-DPB1,HLA-DPB2,HLA-DQA1,HLA-DQA2,HLA-DQB1,HLA-DQB2,HLA-DRA,HLA-DRB1,HLA-DRB3,HLA-DRB4,HLA-DRB5,HLA-E,HLA-F,HLA-G,HLA-HFE,HLA-H,HLA-J,HLA-K,HLA-L,HLA-N,HLA-P,HLA-S,HLA-T,HLA-U,HLA-V,HLA-W,HLA-Y,MICA,MICB,TAP1,TAP2,C4A,C4B
KIR2DL1,KIR2DL2,KIR2DL3,KIR2DL4,KIR2DL5A,KIR2DL5B,KIR2DP1,KIR2DS1,KIR2DS2,KIR2DS3,KIR2DS4,KIR2DS5,KIR3DL1,KIR3DL2,KIR3DL3,KIR3DP1,KIR3DS1
This pipeline is developped and desgined under linux environment, following programs are required to be pre-intalled and added in the system searching PATH:
- minimap2 (test version 2.17-r941)
- python3 (test version 3.9.13)
- bash (test version 4.4.20)
Download the reference data set 'Data-2024Feb02.tar.gz' from zenodo:
git clone https://github.com/YingZhou001/Immuannot.git
cd Immuannot
# put Data-2024Feb02.tar.gz here and
tar xvf Data-2024Feb02.tar.gz
Testing :
>> bash scripts/immuannot.sh
Error: target contig seq is required.
Usage: bash scripts/immuannot.sh [OPTION] value
[ -c | --contig target assembly (.fa, .fa.gz) ]
[ -r | --refdir references ]
[ -o | --outpref output prefix (optional) ]
[ -t | --thread num of thread (optional, default 3) ]
[ --overlaprate OVERLAP (optional, default 0.9) ]
[ --diff DIFF (optional, default 0.03) ]
Immuannot takes genome assembly (-c/--contig, fasta/fastq format) and gene reference sequences data directory (-r/--refdir along this software) as inputs. It is better to set your own output prefix, otherwise all the results will be prefixed with "immuannot-out".
Optically, the number of threads ( -t/--thread), the proportion of the query sequence mapped to the target (--overlaprate), and the differential ratio cutoff of the matched region (--diff) can all be customized as need.
The output annotation is in gzip-compressed gtf format.
Immuannot outputs several pieces of related information for HLA/KIR allele calling in the final gtf output.
- The feature "gene" row includes "template_allele" that used for gene structure annotation and "template_distance" that indicates the edit distance between the target sequence and the template allele.
- The feature "transcript" row includes "consensus" call as a summary of the closest alleles from the IMGT database. The "new" tag at the rightest field indicates a novel gene sequence and the effect of the undocumented mutation. Closest alleles from IMGT are also included in this row.
Please do use "consensus" or "allele" field but not "template_allele" as the allele typing result.
Additionally, CDS sequence is also examined during the template search, typically, 'template_warning' would give warnings as "no-start_codon", "no-stop_codon", "incomplete_CDS", and "inframe_stop" for gene structure annotation.
Intermediate results are also saved in the output folder.
A running example is included along this pipeline. The target file 'test.fa.gz' include two contigs, one for HLA region and the other for KIR region.
User can test the pipeline by running the script bellow (a few mins):
## check the file path before running, take a few minutes
ctg=example/test.fa.gz
script=scripts/immuannot.sh
refdir=Data-2024Feb02
outpref=test-run
bash ${script} -c ${ctg} -r ${refdir} -o ${outpref}
Immuannot would output file "test-run.gtf.gz" for annotation and a folder named "test-run" for intermediate results.
Because Immuannot is mainly based on gene sequence alignment, except for C4 gene, a novel gene may not be reported if it is largely different from the reference data sequences.
Immuannot takes one set of genome as the input for each run. Annotating multiple-sample sequences in one run could lead to lots of missing calls.
- Combine split-alignment and gene sequence aligment to get a better annotation for large novel indels
use zenodo link to replace data larger than 1Mb