Beginner's issue: Trade-off between model overfitting and LBA + taxa underrepresentation #480

LH21Ancestor · 2025-08-01T03:47:36Z

Dear IQ-TREE community,

I am new to phylogenetics and want to perform ancestral sequence reconstruction (ASR) on proteins that are only around 350 aa long. To that end, I built a phylogeny with ~450 taxa to span my divergent protein family. However, after running ModelFinder during phylogenetic inference, I received a warning that I am at risk of overfitting because my sample size is small compared to the number of model parameters. I used CD-HIT to see how many sequences I would have to remove just to reach the tipping point of this trade-off, but it would be a substantial fraction of my dataset (>400 sequences), which would remove a lot of taxonomic diversity and produce many long branches.
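For reference, here is a rough sketch of the arithmetic I believe is behind this warning. It assumes that the free parameters are the branch lengths of an unrooted bifurcating tree (2n - 3 for n taxa) plus the substitution model parameters, and that the warning appears once this count reaches the number of alignment sites. The function names and the exact threshold are my own assumptions, not taken from the IQ-TREE source.

```python
def free_parameters(n_taxa: int, n_model_params: int = 0) -> int:
    """Branch lengths of an unrooted bifurcating tree (2n - 3) plus model parameters."""
    return 2 * n_taxa - 3 + n_model_params


def max_taxa(sample_size: int, n_model_params: int = 0) -> int:
    """Largest number of taxa whose free-parameter count stays below the sample size.

    Assumption: the sample size is the number of alignment sites.
    """
    # 2n - 3 + m < s  <=>  n < (s + 3 - m) / 2
    n = (sample_size + 3 - n_model_params - 1) // 2
    return max(n, 0)


if __name__ == "__main__":
    # Branch-length parameters alone for 450 taxa
    print(free_parameters(450))
    # Taxa that fit under 350 sites when model parameters are ignored
    print(max_taxa(350))
```

Model parameters (for example, 19 extra for estimated amino acid frequencies, or rate heterogeneity categories) and a shorter trimmed alignment would lower this threshold further, so the real tipping point can differ from what this sketch gives.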

How much of a problem would overfitting be in my case? I am also unsure how to reduce the number of parameters by fixing certain model parameters in advance without biasing the analysis through that choice. I tried to find more information about similar issues but could not find much. My plan was to reduce the number of taxa as far as I still can and to fix some model parameters in advance, but I am not sure exactly how.
Thus, I would be very grateful for any suggestions / ideas on this matter.
Thank you in advance!
