DeminTR - de novo mutations in Tandem Repeats (pronounced di-MEN-ter)
This repository contains tools and scripts for analysing de novo mutations in tandem repeat loci.
Run the main script with your input file:
python demintr.py -i K1463.CHM13v2.dnms_with_parental_haplotype_information.tsv
demintr decomposes simple and complex tandem repeat loci in an allele using ribbit, a repeat identification tool. It can also compare any two alleles, identify the mutation (the sequence difference between them) and, for a complex tandem repeat locus, assign the mutation to one of its tandem repeats.
demintr runs ribbit on an allele sequence to decompose it into simple tandem repeats of one or more motifs. The decomposition reports the position ranges identified as tandem repeats of each motif.
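To make the shape of a decomposition concrete, here is a toy sketch. It is not ribbit's algorithm, and the function name `decompose` and the `min_copies` parameter are assumptions made for illustration. It only looks for exact copies of motifs that are already known, and reports 0-based, end-exclusive position ranges.

```python
def decompose(sequence, motifs, min_copies=2):
    """Report (start, end, motif) for each run of exact motif copies.

    Illustrative only: ribbit discovers motifs itself and tolerates
    impure repeats, which this exact-match scan does not.
    """
    runs = []
    for motif in motifs:
        i = 0
        while i <= len(sequence) - len(motif):
            copies = 0
            # Count back-to-back exact copies of the motif starting at i.
            while sequence.startswith(motif, i + copies * len(motif)):
                copies += 1
            if copies >= min_copies:
                end = i + copies * len(motif)
                runs.append((i, end, motif))
                i = end
            else:
                i += 1
    return sorted(runs)


allele = "GTGTGTGTCCCCGTCCCC"
print(decompose(allele, ["GT", "GTCCCC"]))
```

Each motif is scanned independently, so the reported ranges may overlap. In this example the `GT` run and the `GTCCCC` run share the `GT` at positions 6-8, which is the situation a complex locus presents.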
In the comparison module, demintr compares the alleles of two different samples for a given tandem repeat (TR) locus. This module is commonly used to resolve de novo mutations by comparing the alleles of a child against those of a parent.
First, the repeat structure in each individual allele is decomposed. One allele is selected as the reference, typically based on the relationship between the individuals. The two alleles are then aligned, with the goal of identifying mutations—particularly insertions and deletions (indels).
Next, the positions of the identified indels are cross-referenced with the repeat positions in the reference allele to determine the specific repeat locus where the mutation occurred.
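The alignment and cross-referencing steps can be sketched as follows. This is a minimal illustration that assumes the two alleles are already aligned and given as equal-length gapped strings, and that repeats are `(start, end, motif)` tuples in 0-based, end-exclusive reference coordinates. The function names are hypothetical and are not part of demintr.

```python
def indels_from_alignment(ref_aln, qry_aln):
    """Return (type, ref_pos, sequence) for every gap run in an alignment."""
    events = []
    ref_pos = 0  # position in the ungapped reference allele
    i, n = 0, len(ref_aln)
    while i < n:
        if ref_aln[i] == "-":  # gap in reference: insertion in the query
            j = i
            while j < n and ref_aln[j] == "-":
                j += 1
            events.append(("I", ref_pos, qry_aln[i:j]))
            i = j
        elif qry_aln[i] == "-":  # gap in query: deletion from the reference
            j = i
            while j < n and qry_aln[j] == "-":
                j += 1
            events.append(("D", ref_pos, ref_aln[i:j]))
            ref_pos += j - i
            i = j
        else:  # aligned base (match or mismatch)
            ref_pos += 1
            i += 1
    return events


def repeats_at(event, repeats):
    """Return the repeats whose reference range the indel touches."""
    kind, pos, seq = event
    # A deletion spans reference bases; an insertion sits between two of them.
    end = pos + len(seq) if kind == "D" else pos
    hits = []
    for start, stop, motif in repeats:
        if kind == "D":
            overlaps = pos < stop and end > start
        else:
            overlaps = start <= pos <= stop
        if overlaps:
            hits.append((start, stop, motif))
    return hits


parent = "GTGTGTGTCCCCGTCCCC"
child = "----GTGTCCCCGTCCCC"
repeats = [(0, 8, "GT"), (6, 18, "GTCCCC")]
for event in indels_from_alignment(parent, child):
    print(event, repeats_at(event, repeats))
```

When an indel touches more than one repeat, as an insertion at position 7 would here, a further rule is needed to choose between them.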
For complex repeats—where repeat loci overlap and share a stretch of sequence that can be interpreted with different repeat motifs—the mutation sequence (i.e., the inserted or deleted segment) is compared against the motif sequences of both repeats. The mutation is assigned to the locus whose motif has the highest similarity to the mutation sequence.
Additionally, the method gives priority to a repeat locus when the mutation length is an exact multiple of the repeat motif length, supporting the interpretation that the mutation likely arose through slippage, a common mutation mechanism in tandem repeats.
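One way the two criteria could be combined is sketched below. The similarity measure (fraction of positions matching the motif tiled to the mutation length, maximised over motif rotations) and the rule that an exact length multiple always outranks similarity are assumptions chosen for illustration; demintr's actual scoring may differ. The names `tiled_similarity` and `assign_repeat` are hypothetical.

```python
def tiled_similarity(mutation, motif):
    """Best fraction of positions matching the motif tiled over the mutation."""
    if not mutation:
        return 0.0
    best = 0.0
    for shift in range(len(motif)):
        rotated = motif[shift:] + motif[:shift]
        # Repeat the rotated motif until it covers the mutation, then trim.
        tiled = (rotated * (len(mutation) // len(motif) + 1))[: len(mutation)]
        matches = sum(a == b for a, b in zip(mutation, tiled))
        best = max(best, matches / len(mutation))
    return best


def assign_repeat(mutation, candidates):
    """Pick one (start, end, motif) candidate for an indel sequence.

    Assumed ranking: exact multiples of the motif length first,
    then the higher motif similarity.
    """
    def rank(repeat):
        motif = repeat[2]
        is_multiple = len(mutation) % len(motif) == 0
        return (is_multiple, tiled_similarity(mutation, motif))

    return max(candidates, key=rank)


candidates = [(0, 8, "GT"), (6, 18, "GTCCCC")]
print(assign_repeat("GTGT", candidates))
print(assign_repeat("GTCCCC", candidates))
```

In the second call both motif lengths divide the mutation length, so the choice falls to similarity alone.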
demintr takes a genotype file, which holds the allele information of the samples to be compared, and a pedigree file, which holds the relationships between the samples.
The genotype file can take one of two forms. The first is a joint VCF file generated by a tandem repeat genotyping tool such as ExpansionHunter, TRGT, LongTR, ATaRVa and others. From such a file demintr takes the allele sequences for decomposition and then comparison.
The second is a tab-separated table with a proper header, together with information on which columns hold the parent allele and the child alleles to be compared to identify de novo mutations.
Example of the file currently used:
trid | sample_id | genotype | index | motifs | parent_of_origin | denovo_allele_sequence | precursor_sequence_in_parent | untransmitted_sequence_in_parent | precursor_AP | untransmitted_AP |
---|---|---|---|---|---|---|---|---|---|---|
chr1_9369_9380_trsolve | 2212 | 2 | 1 | TA | 289 | TATATATATATATATATATATATATATATATATATATATATATAT | TATATATATATATATATATATATATATATATATATATATATATATAT | TATATATATATATATATATAT | 0.9791669845581056 | 0.9545450210571288 |
chr2_1211163_12122_trsolve | 2212 | 1 | 0 | GT,GTCCCC | 252 | GTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTCCCCGTCCCC | GTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTCCCCGTCCCC | GTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTCCCCGTCCCC | 1.0 | 1.0 |
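As a rough illustration of how such a table can be consumed, the sketch below reads the parent and child allele columns with Python's standard `csv` module. The column names are taken from the example header above; the function name `read_allele_pairs` and the shortened sample rows are assumptions, and this is not demintr's own loader.

```python
import csv
import io


def read_allele_pairs(handle):
    """Yield (trid, sample_id, parent_allele, child_allele) per row."""
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        yield (
            row["trid"],
            row["sample_id"],
            row["precursor_sequence_in_parent"],  # allele in the parent
            row["denovo_allele_sequence"],        # de novo allele in the child
        )


# Shortened, made-up rows that follow the header shown above.
columns = ["trid", "sample_id", "motifs",
           "denovo_allele_sequence", "precursor_sequence_in_parent"]
rows = [
    ["locus_a", "2212", "TA", "TATATAT", "TATATATAT"],
    ["locus_b", "2212", "GT,GTCCCC", "GTGTCCCCGTCCCC", "GTGTGTGTCCCCGTCCCC"],
]
text = "\n".join("\t".join(r) for r in [columns] + rows) + "\n"

for record in read_allele_pairs(io.StringIO(text)):
    print(record)
```

With a real file, pass an open file handle, for example `open(path, newline="")`, instead of the `io.StringIO` object.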
demintr creates two output files.
File name: K1463.CHM13v2.dnms.tsv
This is a tab-separated file with the following fields:
- `locus_id`: The ID of the tandem repeat locus.
- `child_sample`: The sample name of the child.
- `parent_sample`: The sample name of the parent.
- `repeat_motifs`: The motifs that make up the tandem repeat.
- `info`: Semicolon-separated key=value pairs with information on the mutation.
  - `MutationPosition=`: The range of positions the mutation affects in the reference. Example: "549-552"
  - `MutationType=`: Type of mutation, `I` for insertion and `D` for deletion.
  - `MutationLength=`: Length of the mutation.
  - `MutationSequence=`: The sequence of the mutation, i.e., the sequence of the indel.
  - `ParentRepeat=`: The repeat locus in the parent that the mutation is part of, written as colon-separated fields. Example: `538-564:GGGGT:0.828`
  - `MUTATIONS=`: Indicates whether the locus has `SINGLE` or `MULTIPLE` mutations.
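Because the info field is a string of semicolon-separated key=value pairs, it can be split into a dictionary for downstream analysis. The sketch below is a minimal example; the function name `parse_info` is hypothetical, and the sample string is assembled from the per-key examples above rather than copied from real output, so the exact value formats are an assumption.

```python
def parse_info(info):
    """Split 'key=value;key=value' into a dict of strings."""
    fields = {}
    for item in info.split(";"):
        if not item:
            continue  # tolerate a trailing semicolon
        # Split on the first '=' only, so values keep any later '=' or ':'.
        key, _, value = item.partition("=")
        fields[key] = value
    return fields


# Made-up info string for illustration.
example = (
    "MutationPosition=549-552;MutationType=D;MutationLength=4;"
    "MutationSequence=GGGT;ParentRepeat=538-564:GGGGT:0.828;MUTATIONS=SINGLE"
)
info = parse_info(example)
start, end = (int(x) for x in info["MutationPosition"].split("-"))
print(info["MutationType"], start, end, info["ParentRepeat"].split(":"))
```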
The second output file is a plain-text file containing the alignment of the de novo allele with the parent allele for each locus.
Example:
chr2_121116093_121116213_trsolve|2212|mom:252
. . . . . . . .
--------GTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
GTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT
. . .
GTGTGTGTGTGTGTGTGTGTGTGTGTCCCCGTCCCC
||||||||||||||||||||||||||||||||||||
GTGTGTGTGTGTGTGTGTGTGTGTGTCCCCGTCCCC
Contributions are welcome! Please open issues or submit pull requests.
This project is licensed under the MIT License.