Now that we've reconfigured and refactored the phasing pipeline to improve SV performance on 10 Mbp regions, we probably need to make additional changes to improve scalability for whole-chromosome and whole-genome runs.
Here are a few submissions to reference:
- old pipeline, chr6, ~$170, HiPhase 16G, Shapeit4 96G (SV) / 128G (scaffold), scaffold took ~120 hours: https://app.terra.bio/#workspaces/allofus-drc-wgs-lr-prod/AoU_DRC_WGS_LongReads_PacBio/job_history/ebfa6196-d25c-4830-acc3-23bd77dd691b
- new pipeline, chr6, ~$400 + ~$30 + $???, HiPhase 16G (1st run, few samples failed) -> 32G (2nd run), Shapeit4 32G (2nd run, failed) -> 128G (3rd run, still running after 13 hours): https://app.terra.bio/#workspaces/allofus-drc-wgs-lr-prod/AoU_DRC_WGS_LongReads_PacBio%20PAPER%20COPY/job_history/9cd3ef6f-1f4c-4ba7-8f67-ecf76f91c391, https://app.terra.bio/#workspaces/allofus-drc-wgs-lr-prod/AoU_DRC_WGS_LongReads_PacBio%20PAPER%20COPY/job_history/72113057-45f0-4e02-b978-316ab261d2d0, and https://app.terra.bio/#workspaces/allofus-drc-wgs-lr-prod/AoU_DRC_WGS_LongReads_PacBio%20PAPER%20COPY/job_history/dea0b0fd-fa39-400e-8a10-655be8c99eb2 (still running)
I think I will leave this last run going, but we should probably make some of the following improvements and rerun:
- Reexamine the necessity of BAM/VCF subsetting. BAM subsetting by region was added to the new workflow to facilitate running on 10 Mbp. I believe VCF subsetting by sample was also added, but I'm not sure it was necessary, as it looks like HiPhase can take a multisample VCF as input and will only retrieve the samples it needs. I think we still need to be able to run the workflow on arbitrary genomic regions, but let's take out subsetting and reduce the number of per-sample jobs where possible. I hope this will also improve call caching: I was getting a lot of misses and/or slow caches, and I suspect the large number of per-sample jobs exacerbated this.
- Examine the possibility of batching HiPhase. It looks like HiPhase can run on multiple samples in a single invocation. It also looks like a single chromosome takes only ~10 min, so perhaps we should just run over the whole genome (or analysis region). This is probably lower priority than removing extraneous per-sample subsetting jobs.
- Do not log HiPhase verbosely. On the chr6 runs, it looks like this generates two 100 MB logs (stdout plus the *.log copy) per sample.
- Shard Shapeit4. This is a must-do, IMO. Several days is too long for any job, and the tool is explicitly designed to be run on genomic chunks that are subsequently ligated.
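For the Shapeit4 sharding item, here is a rough Python sketch of how overlapping shard regions might be generated. It only produces region strings; it does not invoke Shapeit4 or do the ligation. The function name, the `Chunk` type, and the chunk size and overlap values are all placeholders I made up for illustration, and the contig length in the usage example is hypothetical (in practice it would come from the reference index).

```python
from typing import Iterator, NamedTuple


class Chunk(NamedTuple):
    """One shard region; coordinates are 1-based and inclusive."""
    contig: str
    start: int
    end: int

    def region(self) -> str:
        # Format as a conventional "contig:start-end" region string.
        return f"{self.contig}:{self.start}-{self.end}"


def overlapping_chunks(
    contig: str, length: int, chunk_size: int, overlap: int
) -> Iterator[Chunk]:
    """Yield windows of at most chunk_size bp that cover 1..length.

    Consecutive windows share `overlap` bp, so that a downstream
    ligation step has common sites to match haplotypes on.
    """
    if length <= 0 or chunk_size <= 0:
        raise ValueError("length and chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    start = 1
    while True:
        end = min(start + chunk_size - 1, length)
        yield Chunk(contig, start, end)
        if end == length:
            break
        start += step


if __name__ == "__main__":
    # Hypothetical 25 Mbp contig, 10 Mbp chunks, 1 Mbp overlap (placeholder values).
    for chunk in overlapping_chunks("chr6", 25_000_000, 10_000_000, 1_000_000):
        print(chunk.region())
```

Each region would then become one scatter shard in the workflow. Note that the last window can be shorter than the others; if that turns out to be a problem for phasing quality, it could be merged into the previous window instead.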
I think some of the slowness and inefficiency is due to Cromwell/Terra (e.g., slow or missed caching), but we can certainly clean things up on our end.
@hangsuUNC @SHuang-Broad @kvg take note. In particular, I'm curious if @SHuang-Broad has more insight about making things more cache friendly.