@lccra, good questions. I'll do my best to answer them...
There is no objective threshold. But of course, when you are calculating a proportion from counts, the confidence intervals will get wider as the sample size gets lower. One way to look at this is to estimate the confidence intervals on the count data. I have written a short tutorial on that here (it will be on the actual docs site soon, but for now it's just on the wiki until I've finished editing it): https://ecoevorxiv.org/repository/view/6484/ The bootstrap I use there falls over a bit at very low sample sizes (more info on that in the recipe); in that case you could switch to other methods like the Wilson score interval (described nicely here: https://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval). In general, though, you should just remain aware that if the sample size is so low that you can't do a good bootstrap (e.g. <1 decisive site associated with a tree, and a low total sample size), then it really means that you don't know much about that particular proportion.
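As a rough sketch of the Wilson score interval mentioned above (my own quick implementation of the textbook formula, not code from the tutorial or from IQ-TREE), applied to a concordance factor treated as a proportion of decisive sites:

```python
import math

def wilson_interval(successes, n, z=1.96):
    """95% Wilson score interval for a proportion estimated from counts.

    A rough way to put confidence intervals on a concordance factor
    when the number of decisive sites (n) is small and a bootstrap
    becomes unreliable. Illustrative sketch only.
    """
    if n == 0:
        return (0.0, 1.0)  # no decisive sites: the proportion is unconstrained
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (max(0.0, centre - half), min(1.0, centre + half))

# e.g. 7 concordant sites out of 10 decisive sites: the interval is very wide
lo, hi = wilson_interval(7, 10)      # roughly (0.40, 0.89)
lo2, hi2 = wilson_interval(70, 100)  # same proportion, much tighter interval
```

The point of the second call is exactly the one made above: the same sCF value of 0.7 is far less informative when it comes from 10 decisive sites than from 100.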
Yes. I expect the sCFl values to change with the model. The method uses ML to calculate the partial likelihoods at each of the four important internal nodes for a branch. So it's certainly expected that changing the model will change those partial likelihoods, and that this will then change the sCFl values. When sample sizes are low I wouldn't be surprised if these differences are large. In terms of knowing what the best estimates are, I'd just pick the best model via BIC or AIC, and rely mostly on that. However, if the different models are all sensible, and completely change your conclusions, then I would just conclude that you are probably unable to reject either conclusion. The sN could also change with the model for similar reasons: the model determines the partial likelihoods, and this will in turn determine the number of decisive sites. For example, imagine a single site with 4 taxa, where taxa 1-3 have bases
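To illustrate the "pick the best model via BIC or AIC" advice above, here is a minimal sketch of the standard information-criterion formulas. The model names, log-likelihoods, and parameter counts are hypothetical numbers I've made up for illustration, not output from any real analysis:

```python
import math

def aic(lnL, k):
    """Akaike information criterion: 2k - 2 lnL (lower is better)."""
    return 2 * k - 2 * lnL

def bic(lnL, k, n):
    """Bayesian information criterion: k ln(n) - 2 lnL (lower is better)."""
    return k * math.log(n) - 2 * lnL

# Hypothetical fits of two substitution models to the same alignment.
n_sites = 1000
models = {
    "JC":    {"lnL": -5230.4, "k": 0},  # no free substitution parameters
    "GTR+G": {"lnL": -5170.9, "k": 9},  # 5 rates + 3 frequencies + alpha
}

best = min(models, key=lambda m: bic(models[m]["lnL"], models[m]["k"], n_sites))
```

With these made-up numbers the richer model wins despite the BIC penalty, because the log-likelihood gain (~60 units) dwarfs the cost of 9 extra parameters. In practice IQ-TREE's ModelFinder does this comparison for you.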
Thanks for the reminder here. In fact, we had written up a short post to describe the change, and then life got busy and we completely dropped the ball. For now I will post the important parts of that below, so that you can refer to it. W.r.t. future modifications: it's possible, yes. All methods can be improved, and when we figure out ways to improve them we usually try to do that. In this case, we improved the method but didn't communicate that well. @bqminh feel free to add anything I've forgotten.

The problem

A number of IQ-TREE users recently pointed out (on the IQ-TREE user group and in emails to us) that they were getting some odd-looking results when applying the new likelihood-based estimates of site concordance factors (sCFl; https://pubmed.ncbi.nlm.nih.gov/36383168/) to their datasets. Thanks to these users, we've been able to identify a shortcoming in our original implementation, and to address it. The problem was in how the original implementation of the sCFl treated gaps. In describing the problem, it will help to also describe how the original sCF based on parsimony (which I'll call sCFp here) handles gaps. In the parsimony-based sCF, we repeatedly pick four taxa around the split of interest, and then just look for all the decisive sites from those four taxa. A decisive site in parsimony with four taxa is simply one where there are two of one base and two of another. Anything else, including a site where one or more of the bases is a gap or an ambiguous base (like an R, Y, or N), is simply ignored. In other words, in each quartet of taxa the parsimony-based sCF just ignores sites with gaps altogether. However, because the parsimony-based sCF repeatedly samples quartets of taxa, it doesn't ignore every column in the alignment that has a gap.
For example, if there are 100 taxa in the alignment, and for a particular column only taxon 1 has a gap, then that column of the alignment will only be ignored when taxon 1 is included in one of the randomly sampled quartets used to calculate the parsimony-based sCF. So, although sCFp has its own issues (e.g. dealing with homoplasy, which is why we introduced the likelihood-based version in the first place), it does a quite sensible job of coping with gaps by, for the most part, ignoring them only in the taxa in which the gaps occur.

The likelihood-based sCF we developed does not ignore sites with gaps. Indeed, it includes every site with gaps. In most cases, this is just fine — the way that the likelihood calculations work means that the sCF is usually perfectly sensible. But the case that we didn't think about is the case where gaps have distributions on the tree that can give signals that conflict with the tree itself. In the four-taxon case, this is simple to describe. Imagine we have a tree of four taxa, where taxa A and B are sisters, and taxa C and D are also sisters; i.e. the tree is ((A,B),(C,D)). Now imagine that we have an alignment of constant sites, except that taxa A and C have a lot of gaps in common. In this situation, the likelihood-based sCF can be misled into thinking that there is a lot of evidence for taxa A and C being more closely related to each other than they are to either of the other two taxa. This then causes the sCF of the central branch of the tree to go down, and the sDF (site discordance factor) of the tree that groups taxa A and C to go up. This, of course, is not what we want at all. And the problem is wider than that — it can affect likelihood-based sCF values whenever any subtree of a bigger tree has all gaps for a particular column of the alignment.
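The parsimony-side rule described above (decisive only when a quartet column shows exactly two states with two taxa each; gaps and ambiguity codes make the column uninformative for that quartet) can be sketched like this. This is an illustrative toy, not IQ-TREE's actual sCFp code:

```python
def quartet_site_status(bases):
    """Classify one alignment column restricted to a quartet of taxa,
    following the parsimony-based sCF description above.

    Returns "ignored" if any taxon has a gap or ambiguity code,
    "decisive" for a 2+2 split of two states, else "uninformative".
    Illustrative sketch only, not IQ-TREE source code.
    """
    assert len(bases) == 4
    bases = [b.upper() for b in bases]
    # Gaps ('-') and ambiguity codes (N, R, Y, ...) rule the column out
    # for this quartet — but only for quartets that include those taxa.
    if any(b not in "ACGT" for b in bases):
        return "ignored"
    counts = {b: bases.count(b) for b in set(bases)}
    if sorted(counts.values()) == [2, 2]:
        return "decisive"
    return "uninformative"

quartet_site_status(["A", "A", "C", "C"])  # "decisive"
quartet_site_status(["A", "A", "A", "C"])  # "uninformative"
quartet_site_status(["A", "-", "C", "C"])  # "ignored"
```

The key property this illustrates is the one in the 100-taxon example above: a column with a gap in taxon 1 is only discarded for quartets that happen to include taxon 1.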
The issue here is that we simply failed to realise this when developing the likelihood-based sCF, and so in alignments that have gaps (and certainly in alignments with a lot of gaps), the likelihood-based sCFs can be very different from the parsimony estimates, and very wrong.

The solution

Citation

This is still fundamentally the method implemented in the sCFl paper, so please cite that: https://pubmed.ncbi.nlm.nih.gov/36383168/
Hi, I have some follow-up questions w.r.t. Issue #128 and its discussion.
I'm writing up some work I did around the time v2.2.2.7 was released, where I'd initially run sCFL (10k replicates) on an alignment in v2.2.2.5 and then re-ran it with v2.2.2.7 as well.
This was mostly exploratory given the lack of any existing hypotheses for most of the relationships in my tree. I was using a concatenated nucleotide alignment where the alignment stats from v2.2.2.7 were:
I've noticed that the values of sN are very important: the respective assignments of the concordance/discordance-1/discordance-2 topologies become seemingly random below a certain threshold (which I guess makes sense based on sampling and rounding), but it's not clear to me how you would decide what this threshold should be objectively?
Secondly, (at least for my alignment) the quartet accumulation process of the current sCFL implementation seems to be closely tied to the sizes of partitions. E.g. there were dramatic differences between corresponding branch sCFL values for my alignment if I included (with -p) a 490-metapartition scheme (from -m MFP+MERGE; largest sN values) vs. a 200-metapartition scheme vs. omitting a partition file (smallest sN values). Is this the expected behaviour?
Lastly, I just wanted to check whether the current sCFL implementation will be formally published or further modified in the near future?
Cheers