Dear iqtree community,
I am new to phylogenetics and want to conduct ancestral sequence reconstruction (ASR) on proteins only, each around 350 aa long. To that end, I built a phylogeny with ~450 taxa to span my divergent protein family. However, after running ModelFinder during phylogenetic inference, I received a warning that I am at risk of overfitting my data because my sample size is small compared to the number of model parameters. I checked how many sequences I would have to remove with CD-HIT just to reach the tipping point of this trade-off, but it would be a substantial fraction of my dataset (>400 sequences), which would remove a lot of taxonomic diversity and create many long branches.
How much of a problem would overfitting be in my case? Furthermore, I am not sure how to reduce the number of parameters by fixing certain model parameters in advance without biasing the results through that choice. I looked for more information on similar issues but could not find much. My plan is to reduce the number of taxa as far as I still can and to fix some model parameters in advance, but I am not sure exactly how.
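To make the trade-off concrete, here is a rough sketch of the parameter count as I understand it. It assumes that the sample size is the number of alignment columns and that the free parameters are the branch lengths of an unrooted, fully bifurcating tree (2n - 3 for n taxa) plus the substitution-model parameters. The function names and the `n_model_params` value are hypothetical, and IQ-TREE's own accounting may differ.

```python
def branch_length_params(n_taxa: int) -> int:
    """Branch lengths in an unrooted, fully bifurcating tree with n_taxa tips."""
    if n_taxa < 3:
        raise ValueError("an unrooted bifurcating tree needs at least 3 taxa")
    return 2 * n_taxa - 3


def free_params(n_taxa: int, n_model_params: int = 0) -> int:
    """Total free parameters: branch lengths plus model parameters.

    n_model_params is an illustrative placeholder (e.g. 1 for a single
    rate-heterogeneity shape parameter); it is an assumption, not a value
    reported by IQ-TREE.
    """
    return branch_length_params(n_taxa) + n_model_params


def params_per_site(n_taxa: int, n_sites: int, n_model_params: int = 0) -> float:
    """Ratio of free parameters to alignment columns (above 1 means more parameters than sites)."""
    if n_sites <= 0:
        raise ValueError("n_sites must be positive")
    return free_params(n_taxa, n_model_params) / n_sites


if __name__ == "__main__":
    n_taxa, n_sites = 450, 350  # approximate figures from my dataset
    print("branch lengths:", branch_length_params(n_taxa))
    print("free parameters:", free_params(n_taxa, n_model_params=1))
    print("parameters per site:", round(params_per_site(n_taxa, n_sites, 1), 2))
```

Under these assumptions, the branch lengths alone (897 for 450 taxa) already outnumber the ~350 alignment columns, so the tree size rather than the substitution model dominates the count.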
Thus, I would be very grateful for any suggestions / ideas on this matter.
Thank you in advance!