Protein Engineer • Machine Learning Scientist • Computational Biologist
Creator of EmbedDiff-ESM 🧬 and EmbedDiff-Dayhoff 🔄
🔬 I’m a hybrid protein engineer and ML scientist with deep experience in both wet-lab experimentation and machine learning for protein design.
I bridge experimental biochemistry with generative AI, building next-gen tools to accelerate biologics discovery.
- 🧠 Currently developing: EmbedDiff-ESM (ESM-2 backbone) and EmbedDiff-Dayhoff (Dayhoff ablation) — exploring how protein LMs affect generative design
- 🧬 Passionate about generative AI in biotech & synthetic biology
- 🧪 Experienced in sequence modeling, folding, and structure–function pipelines
- Languages: Python, PyTorch, R, SQL, Bash
- Tools: Git, Docker, VS Code, Conda, Jupyter, SnapGene, PyMOL, Prism, ELN, Tableau
- ML/BioAI: ESM-2, Dayhoff Atlas, AlphaFold, Transformers, Diffusion Models, t-SNE, BLAST
- Enzyme characterization: Km, Vmax, kcat
- Thermal stability: Prometheus Panta, residual activity assays
- Protein visualization: SDS-PAGE, Western blot
- Molecular biology: PCR, qPCR, SDM, Golden Gate, high-throughput cloning
- PPIs: FRET assays
- Automation: Tecan, Echo, LabChip, ZAG
- Purification: SEC-MALS, IEX, affinity (FPLC/ÄKTA)
- Biophysics: DLS, BLI (Octet® RH96), FT-IR, TGA
- Quantification: MS, analytical SEC (HPLC)
- Microscopy: confocal, SEM, EDS
- Crystallization & genotype screening, Agrobacterium methods
Complementary pipelines for de novo protein design with diffusion models, probing how the choice between ESM-2 and Microsoft's Dayhoff-3B shapes generative outcomes.
👉 EmbedDiff-ESM report
👉 EmbedDiff-Dayhoff report
I developed and compared two parallel latent diffusion pipelines for de novo protein design, each conditioned on a different pretrained embedding backbone: EmbedDiff-ESM, which leverages Meta's ESM-2 protein language model trained at evolutionary scale, and EmbedDiff-Dayhoff, which uses Microsoft's Dayhoff-3B model trained on clustered UniRef with substitution-aware geometry. Both pipelines share the same workflow: embed natural protein sequences into latent space, train a denoising diffusion model to learn biologically meaningful manifolds, and decode embeddings back into amino acid sequences with a Transformer-based decoder, followed by rigorous multi-metric evaluation. Unlike traditional structure-based or template-driven design approaches, EmbedDiff explores protein sequence space without structural supervision, making it possible to test how different embedding backbones influence novelty, plausibility, and functional diversity.

To benchmark generated sequences, I combined perplexity scoring with ProtT5, t-SNE domain clustering, logistic regression probes, entropy vs identity trade-offs, cosine similarity distributions, and domain overlays, providing a holistic view of backbone performance.

The results show that both models produce very high-perplexity sequences, confirming that diffusion pushes into novel sequence space beyond the immediate training manifold, while global plausibility remains comparable between ESM-2 and Dayhoff. At the local level, however, differences emerge: ESM-2 tends to generate more conservative, higher-identity outputs that preserve natural priors, whereas Dayhoff explores higher-entropy, more divergent solutions. Together, these findings demonstrate that embedding choice directly steers generative exploration of protein space: ESM-2 offers stability and conservation, while Dayhoff drives evolutionary exploration. The two are complementary strategies for advancing generative protein engineering.
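To make the embedding stage concrete, here is a minimal sketch of mean-pooled embedding extraction with ESM-2 via Hugging Face `transformers`. The checkpoint name and mean-pooling choice are illustrative assumptions, not necessarily what the pipelines use:

```python
# Minimal sketch: embed protein sequences with ESM-2 (Hugging Face transformers).
# Assumptions: the facebook/esm2_t33_650M_UR50D checkpoint and mean pooling over
# tokens; the actual EmbedDiff pipelines may use a different layer or pooling.
import torch
from transformers import AutoTokenizer, EsmModel

tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t33_650M_UR50D")
model = EsmModel.from_pretrained("facebook/esm2_t33_650M_UR50D").eval()

@torch.no_grad()
def embed(sequences: list[str]) -> torch.Tensor:
    """Return one fixed-size latent vector per sequence (mean over tokens)."""
    batch = tokenizer(sequences, padding=True, return_tensors="pt")
    hidden = model(**batch).last_hidden_state          # (B, L, D)
    mask = batch["attention_mask"].unsqueeze(-1)       # ignore padding tokens
    return (hidden * mask).sum(1) / mask.sum(1)        # (B, D) mean-pooled

z = embed(["MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"])
print(z.shape)  # torch.Size([1, 1280]) for the 650M ESM-2 checkpoint
```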
Takeaway: ESM-2 embeddings produce slightly tighter domain separation, while Dayhoff preserves broader evolutionary diversity in latent space.
| ESM-2 | Dayhoff |
| :---: | :---: |
| ![]() | ![]() |
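A minimal sketch of the t-SNE projection behind plots like these, assuming precomputed embeddings and domain labels (the file names are hypothetical):

```python
# Minimal sketch: project latent embeddings to 2-D with t-SNE, colored by domain.
# Assumes an (N, D) embedding array and length-N label array saved to disk;
# both file names are illustrative, not the pipeline's actual artifacts.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

embeddings = np.load("esm2_embeddings.npy")     # hypothetical (N, D) array
domain_labels = np.load("domain_labels.npy")    # hypothetical length-N labels

xy = TSNE(n_components=2, perplexity=30, init="pca",
          random_state=0).fit_transform(embeddings)

for dom in np.unique(domain_labels):
    pts = xy[domain_labels == dom]
    plt.scatter(pts[:, 0], pts[:, 1], s=5, label=str(dom))
plt.legend(markerscale=3, fontsize=6)
plt.title("t-SNE of latent embeddings by domain")
plt.show()
```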
Takeaway: Both backbones retain strong class separability, validating that embeddings encode sufficient biological signal for downstream classifiers.
| ESM-2 | Dayhoff |
| :---: | :---: |
| ![]() ![]() | ![]() ![]() |
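A minimal sketch of the linear-probe setup: if a plain logistic regression recovers domain labels from frozen embeddings, the latent space preserves class structure. Array file names are hypothetical, as above:

```python
# Minimal sketch of a linear probe on frozen latent embeddings. High held-out
# accuracy indicates the embeddings encode domain-level biological signal.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

embeddings = np.load("esm2_embeddings.npy")     # hypothetical (N, D) array
domain_labels = np.load("domain_labels.npy")    # hypothetical length-N labels

X_tr, X_te, y_tr, y_te = train_test_split(
    embeddings, domain_labels, test_size=0.2,
    stratify=domain_labels, random_state=0,
)
probe = LogisticRegression(max_iter=2000).fit(X_tr, y_tr)
print(f"Probe accuracy: {accuracy_score(y_te, probe.predict(X_te)):.3f}")
```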
Takeaway: Training dynamics are comparable across backbones, with both models converging steadily under diffusion noise scheduling.
| ESM-2 | Dayhoff |
| :---: | :---: |
| ![]() | ![]() |
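For reference, a toy sketch of the DDPM-style objective a latent diffusion stage like this optimizes. The linear beta schedule, timestep count, and tiny MLP denoiser are simplifying assumptions, not the actual EmbedDiff models:

```python
# Toy sketch of latent diffusion training: epsilon prediction on embedding
# vectors. Schedule, timestep count, and denoiser are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

T, D = 1000, 1280                                   # timesteps, embedding dim
betas = torch.linspace(1e-4, 0.02, T)               # linear noise schedule
alpha_bar = torch.cumprod(1.0 - betas, dim=0)       # cumulative signal fraction

denoiser = nn.Sequential(nn.Linear(D + 1, 512), nn.SiLU(), nn.Linear(512, D))

def diffusion_loss(z0: torch.Tensor) -> torch.Tensor:
    """Noise clean embeddings z0 to a random timestep, predict the noise."""
    t = torch.randint(0, T, (z0.shape[0],))
    eps = torch.randn_like(z0)
    ab = alpha_bar[t].unsqueeze(-1)
    zt = ab.sqrt() * z0 + (1 - ab).sqrt() * eps     # forward (noising) process
    t_feat = (t.float() / T).unsqueeze(-1)          # crude timestep conditioning
    return F.mse_loss(denoiser(torch.cat([zt, t_feat], dim=-1)), eps)

loss = diffusion_loss(torch.randn(8, D))            # one toy training step
loss.backward()
```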
Takeaway: Both backbones show a comparable global entropy–identity distribution.
| ESM-2 | Dayhoff |
| :---: | :---: |
| ![]() | ![]() |
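A minimal sketch of how one entropy-identity point per generated sequence could be computed. The ungapped identity here is a simple proxy, not necessarily the alignment used in the reports:

```python
# Minimal sketch of the entropy-vs-identity diagnostic: per-sequence Shannon
# entropy of amino-acid composition vs identity to the nearest natural sequence.
import math
from collections import Counter

def shannon_entropy(seq: str) -> float:
    """Shannon entropy (bits) of the amino-acid composition of one sequence."""
    counts = Counter(seq)
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def identity(a: str, b: str) -> float:
    """Fraction of matching positions over the shorter length (ungapped proxy)."""
    m = min(len(a), len(b))
    return sum(x == y for x, y in zip(a[:m], b[:m])) / m

gen = "MKTAYIAKQRQISFVKSHFSRQ"                       # hypothetical generated seq
naturals = ["MKTAYIAKQRQISFVKSHFARQ", "MSTNPKPQRK"]  # hypothetical reference set
best_id = max(identity(gen, nat) for nat in naturals)
print(shannon_entropy(gen), best_id)
```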
Takeaway: Identity and cosine similarity histograms reveal overlapping regimes for both ESM-2 and Dayhoff.
| ESM-2 — Identity | Dayhoff — Identity |
| :---: | :---: |
| ![]() | ![]() |

| ESM-2 — All cosine histograms | Dayhoff — All cosine histograms |
| :---: | :---: |
| ![]() | ![]() |
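A minimal sketch of the cosine-similarity diagnostic: score each generated embedding against the natural set and histogram the nearest-neighbor similarity (array files are hypothetical):

```python
# Minimal sketch: cosine similarity of each generated embedding to its nearest
# natural embedding. Array names/files are illustrative assumptions.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics.pairwise import cosine_similarity

gen_emb = np.load("generated_embeddings.npy")   # hypothetical (G, D) array
nat_emb = np.load("natural_embeddings.npy")     # hypothetical (N, D) array

sims = cosine_similarity(gen_emb, nat_emb)      # (G, N) similarity matrix
nearest = sims.max(axis=1)                      # best match per generated seq

plt.hist(nearest, bins=50)
plt.xlabel("max cosine similarity to natural set")
plt.ylabel("count")
plt.show()
```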
Takeaway: Generated sequences cluster near real domains but backbone choice shifts how tightly generated points adhere to natural evolutionary space.
| ESM-2 | Dayhoff |
| :---: | :---: |
| ![]() | ![]() |
Takeaway: Although absolute perplexity is high for both backbones, the distributions overlap strongly, suggesting that backbone choice does not dramatically alter global plausibility.
| ESM-2 vs Dayhoff |
| :---: |
| ![]() |
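For intuition, perplexity here is exp(mean per-token negative log-likelihood). The reports score with ProtT5; the stand-in below uses a causal protein language model (ProtGPT2) instead, since autoregressive models make the computation most direct:

```python
# Minimal sketch of sequence perplexity as exp(mean token NLL). Illustrative
# stand-in: nferruz/ProtGPT2 (a causal protein LM), not the ProtT5 scoring
# actually used in the reports.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("nferruz/ProtGPT2")
lm = AutoModelForCausalLM.from_pretrained("nferruz/ProtGPT2").eval()

@torch.no_grad()
def perplexity(seq: str) -> float:
    ids = tok(seq, return_tensors="pt").input_ids
    loss = lm(ids, labels=ids).loss          # mean token cross-entropy
    return float(torch.exp(loss))

print(perplexity("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"))
```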
These side-by-side comparisons reveal how the embedding backbone steers generative design — domain separation, entropy/identity trade-offs, and similarity structure all shift with the latent geometry learned by ESM-2 vs Dayhoff.