mgarsamo/README.md

Hi there 👋, I am Melaku!

🔬 Melaku Garsamo

Protein Engineer • Machine Learning Scientist • Computational Biologist
Creator of EmbedDiff-ESM 🧬 and EmbedDiff-Dayhoff 🔄


👋 About Me

🔬 I’m a hybrid protein engineer and ML scientist with deep experience in both wet-lab experimentation and machine learning for protein design.
I bridge experimental biochemistry with generative AI, building next-gen tools to accelerate biologics discovery.

  • 🧠 Currently developing: EmbedDiff-ESM (ESM-2 backbone) and EmbedDiff-Dayhoff (Dayhoff ablation) — exploring how protein LMs affect generative design
  • 🧬 Passionate about generative AI in biotech & synthetic biology
  • 🧪 Experienced in sequence modeling, folding, and structure–function pipelines

🔧 Skills & Tools

  • Languages: Python, PyTorch, R, SQL, Bash
  • Tools: Git, Docker, VS Code, Conda, Jupyter, SnapGene, PyMOL, Prism, ELN, Tableau
  • ML/BioAI: ESM-2, Dayhoff Atlas, AlphaFold, Transformers, Diffusion Models, t-SNE, BLAST

🧪 Wet Lab Expertise

  • Enzyme characterization: Km, Vmax, kcat
  • Thermal stability: Prometheus Panta, residual activity assays
  • Protein visualization: SDS-PAGE, Western blot
  • Molecular biology: PCR, qPCR, SDM, Golden Gate, high-throughput cloning
  • PPIs: FRET assays
  • Automation: Tecan, Echo, LabChip, ZAG
  • Purification: SEC-MALS, IEX, affinity (FPLC/ÄKTA)
  • Biophysics: DLS, BLI (Octet® RH96), FT-IR, TGA
  • Quantification: MS, analytical SEC (HPLC)
  • Microscopy: confocal, SEM, EDS
  • Crystallization & genotype screening, Agrobacterium methods

🚀 Featured Projects

Complementary pipelines for de novo protein design with diffusion models, probing how ESM-2 vs Microsoft Dayhoff-3B shape generative outcomes.

👉 EmbedDiff-ESM report
👉 EmbedDiff-Dayhoff report


📑 Comparative Benchmark: EmbedDiff-ESM2 vs EmbedDiff-Dayhoff

I developed and compared two parallel latent diffusion pipelines for de novo protein design, each conditioned on a different pretrained embedding backbone. EmbedDiff-ESM2 uses Meta’s ESM-2 protein language model, trained at evolutionary scale. EmbedDiff-Dayhoff uses Microsoft’s Dayhoff-3B model, trained on clustered UniRef with substitution-aware geometry.

Both pipelines share the same workflow: embed natural protein sequences into latent space, train a denoising diffusion model to learn biologically meaningful manifolds, decode embeddings into amino acid sequences with a Transformer-based decoder, and evaluate the outputs across multiple metrics. Unlike traditional structure-based or template-driven design approaches, EmbedDiff explores protein sequence space without structural supervision, which makes it possible to test how different embedding backbones influence novelty, plausibility, and functional diversity.

To benchmark generated sequences, I combined perplexity scoring with ProtT5, t-SNE domain clustering, logistic regression probes, entropy vs identity trade-offs, cosine similarity distributions, and domain overlays.

Both models produce very high-perplexity sequences, which suggests that diffusion pushes into novel sequence space beyond the immediate training manifold, while global plausibility remains comparable between ESM-2 and Dayhoff. At the local level, however, differences emerge: ESM-2 tends to generate more conservative, higher-identity outputs that preserve natural priors, whereas Dayhoff explores higher-entropy, more divergent solutions.

Together, these findings indicate that embedding choice steers generative exploration of protein space. ESM-2 offers stability and conservation, and Dayhoff drives evolutionary exploration: two complementary strategies for advancing generative protein engineering.


🧭 Domain-Colored t-SNE (Overview)

Takeaway: ESM-2 embeddings produce slightly tighter domain separation, while Dayhoff preserves broader evolutionary diversity in latent space.

Figures: ESM-2 | Dayhoff


✅ Logistic Regression Backbone Check (Classification Sanity)

Takeaway: Both backbones retain strong class separability, validating that embeddings encode sufficient biological signal for downstream classifiers.

Figures: ESM-2 | Dayhoff
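
A linear probe of this kind can be sketched in plain Python. This is a minimal illustration, not the code used for the benchmark: the toy 2-D "embeddings", the labels, and the helper names `train_probe` and `predict` are all hypothetical, and hand-rolled gradient descent stands in for a library classifier such as scikit-learn's `LogisticRegression`.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_probe(X, y, lr=0.5, epochs=200):
    """Fit a binary logistic regression probe with batch gradient descent."""
    dim = len(X[0])
    w = [0.0] * dim
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for xi, yi in zip(X, y):
            # Prediction error for this sample under the current weights.
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(dim):
                grad_w[j] += err * xi[j]
            grad_b += err
        # Step along the averaged negative gradient of the log loss.
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict(X, w, b):
    """Assign class 1 when the linear score is positive, else class 0."""
    return [1 if sum(wj * xj for wj, xj in zip(w, xi)) + b > 0 else 0 for xi in X]

# Toy stand-ins for embeddings of two domain classes (hypothetical data).
X = [[-1.0, -0.8], [-1.2, -1.1], [-0.9, -1.3], [-0.7, -0.9],
     [1.0, 0.9], [1.1, 1.2], [0.8, 1.0], [1.3, 0.7]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

w, b = train_probe(X, y)
preds = predict(X, w, b)
accuracy = sum(p == t for p, t in zip(preds, y)) / len(y)
print(f"probe accuracy: {accuracy:.2f}")
```

In a real check, the rows of `X` would be per-sequence embeddings from each backbone, and accuracy would be measured on held-out sequences rather than on the training set.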



🔬 Latent Diffusion Training — Cross-Entropy Loss

Takeaway: Training dynamics are comparable across backbones, with both models converging steadily under diffusion noise scheduling.

Figures: ESM-2 | Dayhoff
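
For readers unfamiliar with diffusion noise scheduling, the forward (noising) step can be sketched as below. This assumes a DDPM-style linear beta schedule, which the README does not confirm. The function names, the schedule endpoints, and the three-dimensional toy latent are illustrative assumptions, and the training loss itself is not shown.

```python
import math
import random

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances; assumes num_steps >= 2."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out

def add_noise(x0, t, alpha_bars, rng):
    """Sample x_t from N(sqrt(alpha_bar_t) * x0, (1 - alpha_bar_t) * I)."""
    a = alpha_bars[t]
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, noise)]
    return xt, noise

betas = linear_beta_schedule(1000)
alpha_bars = cumulative_alphas(betas)
rng = random.Random(0)
latent = [0.5, -1.2, 0.3]  # hypothetical embedding vector
noisy, eps = add_noise(latent, t=500, alpha_bars=alpha_bars, rng=rng)
print(noisy)
```

A denoiser is then trained to recover `eps` (or the clean latent) from `noisy` and `t`; larger `t` means a smaller `alpha_bars[t]` and therefore a noisier input.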

🧮 Entropy vs Sequence Identity

Takeaway: Both backbones show a comparable global entropy–identity distribution.

Figures: ESM-2 | Dayhoff
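
The two axes of this comparison can be illustrated with a short sketch. It assumes entropy means the Shannon entropy of a sequence's residue composition and identity means the fraction of matching positions between pre-aligned, equal-length sequences. The benchmark may define either quantity differently (for example, per-position entropy over an alignment), and the example sequences are made up.

```python
import math
from collections import Counter

def shannon_entropy(seq: str) -> float:
    """Shannon entropy (in bits) of the residue composition of one sequence."""
    counts = Counter(seq)
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def identity_fraction(a: str, b: str) -> float:
    """Fraction of matching positions for two pre-aligned, equal-length sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and aligned to the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)

generated = "MKTAYIAKQR"  # hypothetical generated sequence
reference = "MKTAYLAKQR"  # hypothetical natural sequence
print(shannon_entropy(generated), identity_fraction(generated, reference))
```

Plotting one point per generated sequence, with identity to its closest natural sequence on one axis and entropy on the other, gives the kind of trade-off view described above.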

📊 Identity & Similarity Distributions

Takeaway: Identity and cosine similarity histograms reveal overlapping regimes for both ESM-2 and Dayhoff.

Figures: ESM-2 identity | Dayhoff identity
Figures: ESM-2 cosine histograms (all) | Dayhoff cosine histograms (all)
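
As a reference for how a cosine similarity distribution could be built, here is a minimal sketch. The vectors are toy values, the names `generated` and `real` are hypothetical, and comparing every generated embedding against every real one is an assumption about the setup.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)

# Toy stand-ins for embedding vectors (hypothetical data).
generated = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1]]
real = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

# One similarity per (generated, real) pair; these values feed the histogram.
similarities = [cosine_similarity(g, r) for g in generated for r in real]
print(similarities)
```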

🧩 Domain Overlay — Real vs Generated (t-SNE)

Takeaway: Generated sequences cluster near real domains, but backbone choice shifts how tightly generated points adhere to natural evolutionary space.

Figures: ESM-2 | Dayhoff

🧮 Perplexity (ESM-2 vs Dayhoff Results)

Takeaway: Despite absolute perplexity being high for both, distributions overlap strongly—suggesting backbone choice does not dramatically alter global plausibility.

Figures: Perplexity, ESM-2 vs Dayhoff | Perplexity comparison (Dayhoff)
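
Perplexity here can be read as the exponential of the mean negative log-likelihood per residue. The sketch below shows that arithmetic only. It assumes per-residue natural-log probabilities have already been obtained from a scoring model, and the `perplexity` helper and its inputs are illustrative rather than the benchmark's actual scoring code.

```python
import math

def perplexity(token_log_probs):
    """exp of the mean negative natural-log probability per token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# A model that spreads probability uniformly over the 20 standard amino acids
# assigns log(1/20) to every residue, so its perplexity is about 20.
uniform = [math.log(1 / 20)] * 5
print(perplexity(uniform))
```

Under this definition, a perplexity near 20 corresponds to a scorer that is no better than uniform guessing over the standard amino acids, which gives a rough yardstick for what "high" means.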

These side-by-side comparisons reveal how the embedding backbone steers generative design — domain separation, entropy/identity trade-offs, and similarity structure all shift with the latent geometry learned by ESM-2 vs Dayhoff.


🌍 Let’s Connect


Popular repositories

  1. domain-boundary-parser-- (Public)

    Detects and visualizes confident structural domains from AlphaFold2 models using pLDDT scores.

    Python 1

  2. Unique-DNA-Barcodes-Generator (Public)

    Generate diverse and unique DNA barcodes for sample identification and genetic tracking with this open-source Python tool.

    Jupyter Notebook

  3. geez-biotech (Public)

    HTML

  4. EmbedDiff (Public)

    🧬 EmbedDiff: A modular machine learning pipeline combining ESM2 embeddings, latent diffusion, and transformer-based decoding for de novo protein design

    Jupyter Notebook

  5. mgarsamo (Public)

    Hybrid protein engineer & ML scientist building generative AI tools for protein design.

  6. ereft (Public)

    Python