@lccra, good questions. I'll do my best to answer them...
There is no objective threshold. But of course, when you are calculating a proportion from counts, the confidence intervals will get wider as the sample size gets lower. One way to look at this is to estimate the confidence intervals on the count data. I have written a short tutorial on that here (it will be on the actual docs site soon, but for now it's just on the wiki until I've finished editing it): https://ecoevorxiv.org/repository/view/6484/ The bootstrap I use there falls over a bit at very low sample sizes (more info on that in the recipe); in that case you could switch to other methods like the Wilson score interval (described nicely here: https://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval). In general, though, you should just remain aware that if the sample size is so low that you can't do a good bootstrap (e.g. <1 decisive site associated with a tree, and a low total sample size), then it really means that you don't know much about that particular proportion.
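As a rough sketch of the Wilson score interval mentioned above (my own quick implementation of the textbook formula, not code from the tutorial or from IQ-TREE), applied to a concordance factor treated as a proportion of decisive sites:

```python
import math

def wilson_interval(successes, n, z=1.96):
    """95% Wilson score interval for a proportion estimated from counts.

    A rough way to put confidence intervals on a concordance factor
    when the number of decisive sites (n) is small and a bootstrap
    becomes unreliable. Illustrative sketch only.
    """
    if n == 0:
        return (0.0, 1.0)  # no decisive sites: the proportion is unconstrained
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (max(0.0, centre - half), min(1.0, centre + half))

# e.g. 7 concordant sites out of 10 decisive sites: the interval is very wide
lo, hi = wilson_interval(7, 10)      # roughly (0.40, 0.89)
lo2, hi2 = wilson_interval(70, 100)  # same proportion, much tighter interval
```

The point of the second call is exactly the one made above: the same sCF value of 0.7 is far less informative when it comes from 10 decisive sites than from 100.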
Yes. I expect the sCFl values to change with the model. The method uses ML to calculate the partial likelihoods at each of the four important internal nodes for a branch. So it's certainly expected that changing the model will change those partial likelihoods, and that this will then change the sCFl values. When sample sizes are low I wouldn't be surprised if these differences are large. In terms of knowing what the best estimates are, I'd just pick the best model via BIC or AIC, and rely mostly on that. However, if the different models are all sensible, and completely change your conclusions, then I would just conclude that you are probably unable to reject either conclusion. The sN could also change with the model for similar reasons: the model determines the partial likelihoods, and this will in turn determine the number of decisive sites. For example, imagine a single site with 4 taxa, where taxa 1-3 have bases
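To illustrate the "pick the best model via BIC or AIC" advice above, here is a minimal sketch of the standard information-criterion formulas. The model names, log-likelihoods, and parameter counts are hypothetical numbers I've made up for illustration, not output from any real analysis:

```python
import math

def aic(lnL, k):
    """Akaike information criterion: 2k - 2 lnL (lower is better)."""
    return 2 * k - 2 * lnL

def bic(lnL, k, n):
    """Bayesian information criterion: k ln(n) - 2 lnL (lower is better)."""
    return k * math.log(n) - 2 * lnL

# Hypothetical fits of two substitution models to the same alignment.
n_sites = 1000
models = {
    "JC":    {"lnL": -5230.4, "k": 0},  # no free substitution parameters
    "GTR+G": {"lnL": -5170.9, "k": 9},  # 5 rates + 3 frequencies + alpha
}

best = min(models, key=lambda m: bic(models[m]["lnL"], models[m]["k"], n_sites))
```

With these made-up numbers the richer model wins despite the BIC penalty, because the log-likelihood gain (~60 units) dwarfs the cost of 9 extra parameters. In practice IQ-TREE's ModelFinder does this comparison for you.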
Thanks for the reminder here. In fact, we had written up a short post to describe the change, and then life got busy and we completely dropped the ball. For now I will post the important parts of that below, so that you can refer to it. W.r.t. future modifications: it's possible, yes. All methods can be improved, and when we figure out ways to improve them we usually try to do that. In this case, we improved the method but didn't communicate that well. @bqminh feel free to add anything I've forgotten.

The problem

A number of IQ-TREE users recently pointed out (on the IQ-TREE user group and in emails to us) that they were getting some odd-looking results when applying the new likelihood-based estimates of site concordance factors (sCFl; https://pubmed.ncbi.nlm.nih.gov/36383168/) to their datasets. Thanks to these users, we've been able to identify a shortcoming in our original implementation, and to address it. The problem was in how the original implementation of the sCFl treated gaps. In describing the problem, it will help to also describe how the original sCF based on parsimony (which I'll call sCFp here) handles gaps. In the parsimony-based sCF, we repeatedly pick four taxa around the split of interest, and then just look for all the decisive sites from those four taxa. A decisive site in parsimony with four taxa is simply one where there are two of one base and two of another. Anything else, including a site where one or more of the bases is a gap or an ambiguous base (like an R, Y, or N), is simply ignored. In other words, in each quartet of taxa the parsimony-based sCF just ignores sites with gaps altogether. However, because the parsimony-based sCF repeatedly samples quartets of taxa, it doesn't ignore every column in the alignment that has a gap.
For example, if there are 100 taxa in the alignment, and for a particular column only taxon 1 has a gap, then that column of the alignment will only be ignored when taxon 1 is included in one of the randomly sampled quartets used to calculate the parsimony-based sCF. So, although sCFp has its own issues (e.g. dealing with homoplasy, which is why we introduced the likelihood-based version in the first place), it does a quite sensible job of coping with gaps by, for the most part, ignoring them only in the taxa in which the gaps occur.

The likelihood-based sCF we developed does not ignore sites with gaps. Indeed, it includes every site with gaps. In most cases, this is just fine — the way that the likelihood calculations work means that the sCF is usually perfectly sensible. But the case that we didn't think about is the case where gaps have distributions on the tree that can give signals that conflict with the tree itself. In the four-taxon case, this is simple to describe. Imagine we have a tree of four taxa, where taxa A and B are sisters, and taxa C and D are also sisters; i.e. the tree is ((A,B),(C,D)). Now imagine that we have an alignment of constant sites, except that taxa A and C have a lot of gaps in common. In this situation, the likelihood-based sCF can be misled into thinking that there is a lot of evidence for taxa A and C being more closely related to each other than they are to either of the other two taxa. This then causes the sCF of the central branch of the tree to go down, and the sDF (site discordance factor) of the tree that groups taxa A and C to go up. This, of course, is not what we want at all. And the problem is wider than that — it can affect likelihood-based sCF values whenever any subtree of a bigger tree has all gaps for a particular column of the alignment.
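The parsimony-side rule described above (decisive only when a quartet column shows exactly two states with two taxa each; gaps and ambiguity codes make the column uninformative for that quartet) can be sketched like this. This is an illustrative toy, not IQ-TREE's actual sCFp code:

```python
def quartet_site_status(bases):
    """Classify one alignment column restricted to a quartet of taxa,
    following the parsimony-based sCF description above.

    Returns "ignored" if any taxon has a gap or ambiguity code,
    "decisive" for a 2+2 split of two states, else "uninformative".
    Illustrative sketch only, not IQ-TREE source code.
    """
    assert len(bases) == 4
    bases = [b.upper() for b in bases]
    # Gaps ('-') and ambiguity codes (N, R, Y, ...) rule the column out
    # for this quartet — but only for quartets that include those taxa.
    if any(b not in "ACGT" for b in bases):
        return "ignored"
    counts = {b: bases.count(b) for b in set(bases)}
    if sorted(counts.values()) == [2, 2]:
        return "decisive"
    return "uninformative"

quartet_site_status(["A", "A", "C", "C"])  # "decisive"
quartet_site_status(["A", "A", "A", "C"])  # "uninformative"
quartet_site_status(["A", "-", "C", "C"])  # "ignored"
```

The key property this illustrates is the one in the 100-taxon example above: a column with a gap in taxon 1 is only discarded for quartets that happen to include taxon 1.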
The issue here is that we simply failed to realise this when developing the likelihood-based sCF, and so in alignments that have gaps (and certainly in alignments with a lot of gaps), the likelihood-based sCFs can be very different from the parsimony estimates, and very wrong.

The solution

Citation

This is still fundamentally the method implemented in the sCFl paper, so please cite that: https://pubmed.ncbi.nlm.nih.gov/36383168/
Hi, I have some follow-up questions w.r.t. Issue #128 and its discussion.
I'm writing up some work I did around the time v2.2.2.7 was released, where I'd initially run sCFL (10k replicates) on an alignment in v2.2.2.5 and then re-ran it with v2.2.2.7 as well.
This was mostly exploratory given the lack of any existing hypotheses for most of the relationships in my tree. I was using a concatenated nucleotide alignment where the alignment stats from v2.2.2.7 were:
I've noticed that the values of sN are very important: the respective assignments of the concordance/discordance-1/discordance-2 topologies become seemingly random below a certain threshold (which I guess makes sense based on sampling and rounding), but it's not clear to me how you would decide what this threshold should be objectively?
Secondly, (at least for my alignment) the quartet accumulation process of the current sCFL implementation seems to be closely tied to the sizes of partitions. E.g. there were dramatic differences between corresponding branch sCFL values for my alignment if I included (with -p) a 490-metapartition scheme (from -m MFP+MERGE; largest sN values) vs. a 200-metapartition scheme vs. omitting a partition file (smallest sN values). Is this the expected behaviour?
Lastly, I just wanted to check whether the current sCFL implementation will be formally published or further modified in the near future?
Cheers