Optimize phasing pipeline for scalability.

Now that we've reconfigured and refactored the phasing pipeline to improve SV performance on 10 Mbp regions, we probably need additional changes to make it scale to whole-chromosome and whole-genome runs.

Here are a few submissions to reference:

- old pipeline, chr6, ~$170, HiPhase 16G, Shapeit4 96G (SV) / 128G (scaffold), scaffold took **~120 hours**: https://app.terra.bio/#workspaces/allofus-drc-wgs-lr-prod/AoU_DRC_WGS_LongReads_PacBio/job_history/ebfa6196-d25c-4830-acc3-23bd77dd691b
- new pipeline, chr6, **~$400** + ~$30 + $???, HiPhase 16G (1st run, few samples failed) -> 32G (2nd run), Shapeit4 32G (2nd run, failed) -> 128G (3rd run, still running after 13 hours): https://app.terra.bio/#workspaces/allofus-drc-wgs-lr-prod/AoU_DRC_WGS_LongReads_PacBio%20PAPER%20COPY/job_history/9cd3ef6f-1f4c-4ba7-8f67-ecf76f91c391, https://app.terra.bio/#workspaces/allofus-drc-wgs-lr-prod/AoU_DRC_WGS_LongReads_PacBio%20PAPER%20COPY/job_history/72113057-45f0-4e02-b978-316ab261d2d0, and https://app.terra.bio/#workspaces/allofus-drc-wgs-lr-prod/AoU_DRC_WGS_LongReads_PacBio%20PAPER%20COPY/job_history/dea0b0fd-fa39-400e-8a10-655be8c99eb2 (still running)

I think I will leave this last run going, but we should probably make some of the following improvements and rerun:

- Reexamine the necessity of BAM/VCF subsetting. BAM subsetting by region was added to the new workflow to facilitate running on 10 Mbp regions. I believe VCF subsetting by sample was also added, but I'm not sure this was necessary, as it looks like HiPhase can take a multisample VCF as input and will only retrieve the samples needed. I think **we still need to be able to run the workflow on arbitrary genomic regions, but let's take out subsetting and reduce the number of per-sample jobs where possible**. I hope this will also improve call caching: I was getting a lot of misses and/or slow caches, and I suspect having a lot of per-sample jobs exacerbated this.
- Examine possibility of batching HiPhase. It looks like HiPhase can run on multiple samples in a single invocation. It also looks like a single chromosome takes only ~10min, so perhaps we should just run over the whole genome (or analysis region). This is probably lower priority than removing extraneous per-sample subsetting jobs.
- Do not log HiPhase verbosely. On the chr6 runs, it looks like this generates 2x 100Mb logs (stdout + the *.log copy) per sample.
- Shard Shapeit4. This is a **must do**, IMO. Several days is too long for any job, and the tool is explicitly designed to be run on genomic chunks that are subsequently ligated.
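
As a starting point for the Shapeit4 sharding item, here is a rough sketch of how a contig could be cut into overlapping windows, one per shard, so that neighboring chunks share variants for ligation. The function name `make_chunks` and the default window and overlap sizes are placeholders I made up for illustration, not tuned or recommended values, and the actual scatter would live in the WDL rather than in Python.

```python
from typing import List


def make_chunks(contig: str, length: int,
                window: int = 5_000_000, overlap: int = 500_000) -> List[str]:
    """Return 1-based inclusive region strings ("contig:start-end") that tile
    [1, length] with consecutive windows sharing `overlap` bases.

    `window` and `overlap` are illustrative defaults (assumptions), not
    recommended values.
    """
    if length <= 0:
        raise ValueError("length must be positive")
    if overlap < 0 or window <= overlap:
        raise ValueError("window must be larger than a non-negative overlap")

    regions = []
    step = window - overlap  # distance between consecutive window starts
    start = 1
    while True:
        end = min(start + window - 1, length)  # clip the last window
        regions.append(f"{contig}:{start}-{end}")
        if end == length:
            break
        start += step
    return regions


if __name__ == "__main__":
    # Toy 12 Mbp contig; a real run would take lengths from the reference index.
    for region in make_chunks("chr6", 12_000_000):
        print(region)
```

Each region string would become one shard, with the phased per-chunk outputs ligated afterwards. The overlap has to be large enough to contain shared heterozygous sites between neighboring chunks, so the right value probably depends on variant density and needs some experimentation.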

I think some of the slowness and inefficiency is due to Cromwell/Terra (e.g., slow or missed caching), but we can certainly clean things up on our end.

@hangsuUNC @SHuang-Broad @kvg take note. In particular, I'm curious if @SHuang-Broad has more insight about making things more cache friendly.


Optimize phasing pipeline for scalability. #38
