Optimize phasing pipeline for scalability. #38

@samuelklee

Now that we've reconfigured and refactored the phasing pipeline to improve SV performance on 10Mbp, we probably need to make additional changes to improve scalability for whole-chromosome and whole-genome runs.

Here are a few submissions to reference:

I think I will leave this last run going, but probably we should make some of the following improvements and rerun:

  • Reexamine necessity of BAM/VCF subsetting. BAM subsetting by region was added to the new workflow to facilitate running on 10Mbp. I believe VCF subsetting by sample was also added, but I'm not sure this was necessary, as it looks like HiPhase can take a multisample VCF as input and will only retrieve the samples needed. I think we still need to be able to run the workflow on arbitrary genomic regions, but let's take out subsetting and reduce the number of per-sample jobs where possible. I hope this will also improve call caching: I was getting a lot of misses and/or slow caches, and I suspect having a lot of per-sample jobs exacerbated this.
  • Examine possibility of batching HiPhase. It looks like HiPhase can run on multiple samples in a single invocation. It also looks like a single chromosome takes only ~10min, so perhaps we should just run over the whole genome (or analysis region). This is probably lower priority than removing extraneous per-sample subsetting jobs.
  • Do not log HiPhase verbosely. On the chr6 runs, it looks like this generates 2x 100Mb logs (stdout + the *.log copy) per sample.
  • Shard Shapeit4. This is a must-do, IMO. Several days is too long for any job, and the tool is explicitly designed to be run on genomic chunks that are subsequently ligated.
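
To make the sharding idea concrete, here is a minimal sketch of how overlapping chunk intervals could be generated for a contig before scattering. The function names (`make_chunks`, `to_region`), the chunk size, and the overlap are all hypothetical placeholders, not values taken from the pipeline; the overlap is included on the assumption that ligation needs sites shared between adjacent chunks.

```python
from typing import List, Tuple

Chunk = Tuple[str, int, int]  # (contig, 1-based start, inclusive end)


def make_chunks(contig: str, length: int, chunk_size: int, overlap: int) -> List[Chunk]:
    """Split [1, length] into windows of at most chunk_size bases.

    Consecutive windows share `overlap` bases so that adjacent chunks
    have sites in common for a later ligation step.
    """
    if length <= 0:
        raise ValueError("length must be positive")
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")

    chunks: List[Chunk] = []
    start = 1
    while True:
        end = min(start + chunk_size - 1, length)
        chunks.append((contig, start, end))
        if end == length:
            break
        # Step back by `overlap` bases so the next window overlaps this one.
        start = end - overlap + 1
    return chunks


def to_region(chunk: Chunk) -> str:
    """Format a chunk as a contig:start-end region string."""
    contig, start, end = chunk
    return f"{contig}:{start}-{end}"


if __name__ == "__main__":
    # Toy numbers for illustration only; real chunk sizes would be in Mbp.
    for chunk in make_chunks("chr6", 100, 40, 10):
        print(to_region(chunk))
```

Each resulting region string could then drive one scattered phasing job, with the per-chunk outputs ligated afterwards. Using fixed, deterministic intervals should also keep the job inputs stable between reruns, which may help with call caching.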

I think some of the slowness and inefficiency is due to Cromwell/Terra (e.g., slow or missed caching), but we can certainly clean things up on our end.

@hangsuUNC @SHuang-Broad @kvg take note. In particular, I'm curious if @SHuang-Broad has more insight about making things more cache friendly.
