
[mpileup2] How does mpileup choose the reads for later calling, and is the -d flag behaving strangely? #2234

@goeckeritz

Description


Hi bcftools team,

I have a conceptual question about mpileup, and I think the -d flag is either behaving oddly or I'm misunderstanding how it works. I'm using bcftools v1.18, and the BAM file in question contains a single sample.

I have reads piling up at some problematic genomic regions (10-20X the average depth), and I don't want to trust the variants called in those areas. For example, at such a site 95% of the reads will support the REF allele and 5% will support an ALT allele, so even if all or most of the reads were included, I'd expect the site to be called homozygous REF. I should note that almost all of the reads at this site are primary alignments with MAPQ > 30.
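To make that expectation concrete, here is a rough sketch (with hypothetical counts matching the 95/5 split above and an assumed 1% sequencing-error rate; this is not bcftools' actual model) of why a simple binomial genotype-likelihood comparison should strongly favor homozygous REF:

```python
from math import comb, log10

ref, alt = 570, 30   # hypothetical counts: ~600 reads, 5% supporting ALT
eps = 0.01           # assumed per-base sequencing error rate

def log10_lik(p_alt):
    # log10 binomial likelihood of seeing `alt` ALT reads out of n,
    # given the expected ALT-read fraction p_alt for a genotype
    n = ref + alt
    return log10(comb(n, alt)) + alt * log10(p_alt) + ref * log10(1 - p_alt)

hom_ref = log10_lik(eps)   # 0/0: ALT reads arise only from errors
het     = log10_lik(0.5)   # 0/1: about half the reads should be ALT
print(hom_ref > het)       # hom-REF is overwhelmingly more likely at 95/5
```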

However, even with -d set to 250, only ~30 of the ~600 reads at a site are chosen for later variant calling and contribute to DP, so the site doesn't get caught by the max-depth filter later in my pipeline. Worse, those 30 reads include almost ALL of the ALT-supporting reads, and the site ends up called heterozygous REF/ALT.

So my first question is: how does mpileup choose which reads to include for later calling -- i.e., how does it decide how many and which reads contribute to DP? It definitely doesn't appear to be random, since in the example above it grabbed almost all of the ALT-supporting reads, despite them being only 5% of the high-quality mappings at that site.
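For comparison, this is what I'd expect from plain uniform subsampling: drawing 250 of 600 reads at random should keep the ALT fraction at ~5% in expectation, nowhere near "almost all" of the ALT reads. A quick simulation with made-up counts matching my example:

```python
import random

random.seed(0)
total, alt_total, cap = 600, 30, 250   # hypothetical: ~600 reads, 5% ALT, -d 250
reads = ["ALT"] * alt_total + ["REF"] * (total - alt_total)

# Repeat uniform subsampling to the cap and average the ALT count kept
trials = 10_000
alt_kept = [sum(r == "ALT" for r in random.sample(reads, cap))
            for _ in range(trials)]
mean_alt = sum(alt_kept) / trials
print(round(mean_alt, 1))   # ~12.5, i.e. 5% of the 250-read cap
```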

My second question is: why isn't DP at this site 250, if that's my maximum number of reads to give to mpileup? When I set -d to 1000 or 2000, mpileup grabs 166 reads for this site (DP=166). It's not until I set -d to 10000 that it grabs most of the reads at the site for later variant calling. It looks like this issue was touched on in #1694 -- but in the opposite direction, and I'm not sure what was meant by "Hence the filtering isn't as strict as it may sound, and is perhaps closer to an average maximum depth threshold rather than something to take as a literal limit." Also, I'm not actually trying to filter with the -d flag. I just want more of the data represented in the genotype-likelihood calculations, especially when the reads are relatively high quality -- that way the DP will tell me this is a problematic region with lots of reads piling up, so I CAN filter it out later.

Since -d or --max-depth is documented as 'At a position, read maximally INT reads per input file', this behavior doesn't seem intuitive to me. But it sounds like there is more to the -d flag than I'm grasping, and it isn't necessarily a bug, right?

Thanks in advance for any time you might have for answering these questions. If it's easier to just point to some reading material, I'm happy to do some more legwork to make your life easier!

Kindly,
Charity
