dashnowlab/demintr

DeMinTR

DeMinTR - de novo mutations in Tandem Repeats (pronounced di-MEN-ter)

This repository contains tools and scripts for analysing de novo mutations in tandem repeat loci.

Usage

Run the main script with your input file:

python demintr.py -i K1463.CHM13v2.dnms_with_parental_haplotype_information.tsv

How it works

demintr decomposes the simple or complex tandem repeat locus in an allele using ribbit (a repeat identification tool). It can also compare any two alleles, identify the mutation (i.e., the sequence difference between the two alleles) and, for a complex tandem repeat locus, assign the mutation to one of its tandem repeats.

Decomposition

demintr runs ribbit, a repeat identification tool, on an allele sequence to decompose it into simple tandem repeats of one or more motifs. The decomposition reports the position ranges that were identified as tandem repeats, together with the motif of each range.
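To make the idea of a decomposition concrete, here is a minimal sketch that scans an allele for runs of known motifs with a regular expression. It is a naive stand-in and not how ribbit works; the function name `decompose`, the `min_copies` parameter, and the 0-based, end-exclusive coordinates are assumptions made for illustration.

```python
import re

def decompose(sequence, motifs, min_copies=2):
    """Return (start, end, motif, copies) for each run of a motif in `sequence`.

    Hypothetical helper: a plain regex scan, NOT ribbit.
    Positions are 0-based and end-exclusive.
    """
    runs = []
    for motif in motifs:
        # Match at least `min_copies` back-to-back copies of the motif.
        pattern = re.compile("(?:%s){%d,}" % (re.escape(motif), min_copies))
        for match in pattern.finditer(sequence):
            copies = (match.end() - match.start()) // len(motif)
            runs.append((match.start(), match.end(), motif, copies))
    return sorted(runs)

allele = "GTGTGTGTCCCCGTCCCC"  # toy allele, not taken from real data
for run in decompose(allele, ["GT", "GTCCCC"]):
    print(run)
```

In this toy allele the GT run and the GTCCCC run share the bases at positions 6 to 8, which is the kind of overlap that makes a locus complex.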

Comparison

In the comparison module, demintr compares the alleles of two different samples for a given tandem repeat (TR) locus. This module is commonly used to resolve de novo mutations by comparing the alleles of a child against those of a parent.

First, the repeat structure in each individual allele is decomposed. One allele is selected as the reference, typically based on the relationship between the individuals. The two alleles are then aligned, with the goal of identifying mutations—particularly insertions and deletions (indels).

Next, the positions of the identified indels are cross-referenced with the repeat positions in the reference allele to determine the specific repeat locus where the mutation occurred.
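The two steps above can be sketched as follows. This is a simplified illustration under stated assumptions: instead of a full alignment it trims the shared prefix and suffix, so it only handles alleles that differ by a single indel. The names `find_indel` and `repeats_hit`, the tuple layouts, and the 0-based, end-exclusive coordinates are hypothetical and not part of demintr.

```python
def find_indel(reference, query):
    """Locate a single indel by trimming the shared prefix and suffix.

    Returns (type, ref_start, sequence), or None when the sequences are
    identical or differ by something other than one indel.
    Stand-in for a real alignment; hypothetical helper.
    """
    limit = min(len(reference), len(query))
    prefix = 0
    while prefix < limit and reference[prefix] == query[prefix]:
        prefix += 1
    suffix = 0
    while suffix < limit - prefix and reference[-1 - suffix] == query[-1 - suffix]:
        suffix += 1
    ref_mid = reference[prefix:len(reference) - suffix]
    qry_mid = query[prefix:len(query) - suffix]
    if ref_mid and not qry_mid:
        return ("D", prefix, ref_mid)   # bases missing from the query
    if qry_mid and not ref_mid:
        return ("I", prefix, qry_mid)   # extra bases in the query
    return None

def repeats_hit(indel, repeats):
    """Return the (start, end, motif) repeats in the reference touched by the indel."""
    kind, start, seq = indel
    if kind == "I":
        # An insertion sits between two reference bases, so a repeat that
        # starts or ends exactly there is also counted.
        return [r for r in repeats if r[0] <= start <= r[1]]
    end = start + len(seq)
    return [r for r in repeats if r[0] < end and start < r[1]]

parent = "GTGTGTGTGTCCCCGTCCCC"   # toy reference allele
child = "GTGTGTCCCCGTCCCC"        # toy allele with two GT copies fewer
parent_repeats = [(0, 10, "GT"), (8, 20, "GTCCCC")]

indel = find_indel(parent, child)
print(indel)
print(repeats_hit(indel, parent_repeats))
```

Here the deletion overlaps both repeats of the toy parent allele, so position alone does not decide which repeat it belongs to.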

For complex repeats—where repeat loci overlap and share a stretch of sequence that can be interpreted with different repeat motifs—the mutation sequence (i.e., the inserted or deleted segment) is compared against the motif sequences of both repeats. The mutation is assigned to the locus whose motif has the highest similarity to the mutation sequence.

Additionally, the method gives priority to a repeat locus when the mutation length is an exact multiple of the repeat motif length, supporting the interpretation that the mutation likely arose through slippage, a common mutation mechanism in tandem repeats.
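One way to picture this assignment rule is the sketch below. The similarity measure (fraction of matching bases against the motif tiled to the mutation length, taking the best rotation) and the way the length rule is ranked ahead of similarity are assumptions chosen for illustration; the README does not specify how demintr combines the two. The function names are hypothetical, and the mutation sequence is assumed to be non-empty.

```python
def motif_similarity(mutation, motif):
    """Best fraction of matching bases between `mutation` and the motif
    tiled to the same length, over all rotations of the motif."""
    best = 0.0
    for shift in range(len(motif)):
        rotated = motif[shift:] + motif[:shift]
        tiled = (rotated * (len(mutation) // len(motif) + 1))[:len(mutation)]
        matches = sum(a == b for a, b in zip(mutation, tiled))
        best = max(best, matches / len(mutation))
    return best

def assign_motif(mutation, motifs):
    """Pick the motif for a mutation: motifs whose length divides the
    mutation length rank first, then higher similarity wins."""
    def rank(motif):
        is_multiple = len(mutation) % len(motif) == 0
        return (is_multiple, motif_similarity(mutation, motif))
    return max(motifs, key=rank)

print(assign_motif("GTGT", ["GT", "GTCCCC"]))
print(assign_motif("GTCCCC", ["GT", "GTCCCC"]))
```

A four-base GTGT deletion is an exact multiple of GT but not of GTCCCC, so it goes to the GT repeat. A GTCCCC deletion is a multiple of both motif lengths, so similarity breaks the tie.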

Input

demintr works with a genotype file, which holds the allele information of the samples to be compared, and a pedigree file, which holds the relationship information.

The genotype file can be a joint VCF file generated by a tandem repeat genotyping tool such as ExpansionHunter, TRGT, LongTR, ATaRVa, or others. From such a file, demintr uses the allele sequences for decomposition and then comparison.

Alternatively, it can be a tab-separated table with a proper header, together with information on which columns hold the parent allele and the child alleles to be compared to identify de novo mutations.

Example of the file currently used:

trid sample_id genotype index motifs parent_of_origin denovo_allele_sequence precursor_sequence_in_parent untransmitted_sequence_in_parent precursor_AP untransmitted_AP
chr1_9369_9380_trsolve 2212 2 1 TA 289 TATATATATATATATATATATATATATATATATATATATATATAT TATATATATATATATATATATATATATATATATATATATATATATAT TATATATATATATATATATAT 0.9791669845581056 0.9545450210571288
chr2_1211163_12122_trsolve 2212 1 0 GT,GTCCCC 252 GTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTCCCCGTCCCC GTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTCCCCGTCCCC GTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTCCCCGTCCCC 1.0 1.0
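A table in this layout can be read with Python's standard `csv` module. The sketch below is illustrative only and is not demintr's own loader: the function name `read_dnm_table` is hypothetical, and the inline example uses a reduced set of the columns shown above with toy values.

```python
import csv
import io

def read_dnm_table(handle):
    """Read a tab-separated table with a header row into a list of dicts.

    Hypothetical helper; the comma-separated `motifs` column is split into a list.
    """
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        row["motifs"] = row["motifs"].split(",")
        rows.append(row)
    return rows

# Toy input with a subset of the columns; values are made up for illustration.
example = (
    "trid\tsample_id\tmotifs\tdenovo_allele_sequence\tprecursor_sequence_in_parent\n"
    "locus1\t2212\tGT,GTCCCC\tGTGTGTCCCCGTCCCC\tGTGTGTGTGTCCCCGTCCCC\n"
)
rows = read_dnm_table(io.StringIO(example))
print(rows[0]["trid"], rows[0]["motifs"])
```

For a file on disk, the same function can be given an open file handle, for example `open(path, newline="")`.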

Output

demintr creates two output files:

Mutations file

File name: K1463.CHM13v2.dnms.tsv

This is a tab-separated tabular file with the following fields:

  1. locus_id: The id of the tandem repeat locus
  2. child_sample: The sample name of the child
  3. parent_sample: The sample name of the parent
  4. repeat_motifs: The motifs that make up the tandem repeat
  5. info: Semicolon-separated key=value pairs describing the mutation

Key=value pairs in the info field:

  • MutationPosition= Range of positions in the reference that the mutation affects. Example: "549-552"
  • MutationType= Type of mutation, I for insertion or D for deletion
  • MutationLength= Length of the mutation
  • MutationSequence= Sequence of the mutation, i.e., the sequence of the indel
  • ParentRepeat= Repeat locus in the parent that the mutation is part of. Example: 538-564:GGGGT:0.828
  • MUTATIONS= Indicates whether the locus has SINGLE or MULTIPLE mutations
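Because the info field is a flat list of key=value pairs, it can be unpacked with a few string operations. The sketch below is a hypothetical helper, not part of demintr, and the example string is assembled from the example values above, so the combination itself is illustrative.

```python
def parse_info(info):
    """Split a semicolon-separated key=value string into a dict (hypothetical helper)."""
    fields = {}
    for item in info.strip().split(";"):
        if not item:
            continue  # tolerate a trailing semicolon
        key, _, value = item.partition("=")
        fields[key] = value
    return fields

# Illustrative string built from the example values in this README.
info = "MutationPosition=549-552;MutationType=D;ParentRepeat=538-564:GGGGT:0.828;MUTATIONS=SINGLE"
fields = parse_info(info)
print(fields["MutationType"])

# ParentRepeat is colon-separated; the meaning of the last part is not
# documented here, so it is kept as a plain string.
repeat_parts = fields["ParentRepeat"].split(":")
print(repeat_parts)
```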

Alignments file

This text file contains the alignment of the de novo allele with the parent allele for each locus.

Example

chr2_121116093_121116213_trsolve|2212|mom:252
         .         .         .         .         .         .         .         .
--------GTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT
        ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
GTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT
         .         .         .      
GTGTGTGTGTGTGTGTGTGTGTGTGTCCCCGTCCCC
||||||||||||||||||||||||||||||||||||
GTGTGTGTGTGTGTGTGTGTGTGTGTCCCCGTCCCC

Contributing

Contributions are welcome! Please open issues or submit pull requests.

License

This project is licensed under the MIT License.
