Conversation

@holmeso holmeso commented Oct 31, 2024

Description

The ByteArrayStopCodec class in the htsjdk fork has been updated to use a PushbackInputStream, which speeds up reading data from the input stream. qpicard uses a jar file built from the htsjdk fork, so this change affects all code that reads and writes BAM/SAM/CRAM files.
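
For readers unfamiliar with the technique, here is a minimal sketch of why a PushbackInputStream can help when reading up to a stop byte. It is not the code from the htsjdk fork: the class name `StopByteReadSketch`, the method `readUntilStop`, and the `CHUNK` size are hypothetical and exist only for illustration. The idea shown is to read a chunk at a time rather than one byte at a time, and to push back whatever was read past the stop byte so the next read sees it.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not the htsjdk ByteArrayStopCodec.
public class StopByteReadSketch {
    // Assumed chunk size; the pushback buffer must be at least this large.
    static final int CHUNK = 64;

    // Reads bytes up to (and consuming) the stop byte, or to end of stream.
    static byte[] readUntilStop(PushbackInputStream in, byte stop) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[CHUNK];
        int n;
        while ((n = in.read(buf, 0, buf.length)) > 0) {
            for (int i = 0; i < n; i++) {
                if (buf[i] == stop) {
                    out.write(buf, 0, i);
                    int leftover = n - i - 1;
                    if (leftover > 0) {
                        // Return the bytes read past the stop byte to the stream.
                        in.unread(buf, i + 1, leftover);
                    }
                    return out.toByteArray();
                }
            }
            // No stop byte in this chunk; keep everything and read on.
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "ACGT\tNM\tXY".getBytes(StandardCharsets.US_ASCII);
        PushbackInputStream in =
                new PushbackInputStream(new ByteArrayInputStream(data), CHUNK);
        for (int field = 0; field < 3; field++) {
            byte[] value = readUntilStop(in, (byte) '\t');
            System.out.println(new String(value, StandardCharsets.US_ASCII));
        }
    }
}
```

The pushback buffer is sized to `CHUNK` so that any leftover from a single chunk read always fits when it is unread.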

Also created copies of ValidateSamFile and SamFileValidator in qmule. Minor changes to these copies allow them to use the AsyncCRAMReader class (also in qmule), which reads CRAMRecords into a queue.
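
As a rough illustration of the "read records into a queue" pattern, the sketch below decouples reading from validation with a bounded BlockingQueue. It is not the AsyncCRAMReader code from qmule: the class `AsyncQueueReaderSketch`, the method `countValid`, and the `END` sentinel are hypothetical, and plain strings stand in for CRAMRecords.

```java
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Predicate;

// Hypothetical illustration only; not qmule's AsyncCRAMReader.
public class AsyncQueueReaderSketch {
    // Sentinel object that tells the consumer there are no more records.
    private static final Object END = new Object();

    // A producer thread fills the queue while the caller validates records.
    @SuppressWarnings("unchecked")
    static <T> long countValid(Iterator<T> source, int capacity, Predicate<T> isValid)
            throws InterruptedException {
        BlockingQueue<Object> queue = new ArrayBlockingQueue<>(capacity);
        Thread producer = new Thread(() -> {
            try {
                while (source.hasNext()) {
                    queue.put(source.next()); // blocks while the queue is full
                }
                queue.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();

        long valid = 0;
        for (Object item = queue.take(); item != END; item = queue.take()) {
            if (isValid.test((T) item)) {
                valid++;
            }
        }
        producer.join();
        return valid;
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> records = List.of("read1", "", "read3");
        long valid = countValid(records.iterator(), 2, s -> !s.isEmpty());
        System.out.println("valid records: " + valid);
    }
}
```

The bounded capacity keeps memory use flat if reading outpaces validation, since the producer blocks whenever the queue is full.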

The motivation for this change was to reduce the time that the ValidateCRAM process takes in the FTUB_WGGSS WDL. It was taking over 7 hours to validate a large CRAM file (~2 billion reads).
The included changes reduce this time by around half.

A downside to this approach is that when desired updates appear in htsjdk, we will need to update the fork and release an updated jar file. We don't do this often, so I am happy to see how it goes.
Another option is to raise a PR with htsjdk directly to see if they are interested in the change, but I am not sure whether it is the structure of our CRAMs that benefits so much from the change, or whether all CRAMs would.

Type of change

  • Performance enhancement

How Has This Been Tested?

Existing unit tests pass, and the code has been run and its output compared against the output of the existing tools.

Are WDL Updates Required?

Not required, but once released, FTUB_WGGSS should be updated to use qmule's ValidateSamFile rather than Picard's.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • My changes generate no new warnings
  • New and existing unit tests pass locally with my changes

… for CRAM file reading

The ByteArrayStopCodec class in htsjdk (fork) has been updated to use a PushbackInputStream, which
means that reading data from the input stream can be made quicker. A jar file from the htsjdk fork
is used by qpicard which will impact all code that reads and writes BAM/SAM/CRAM files.
@holmeso holmeso merged commit 101167e into master Oct 31, 2024
1 check passed
@holmeso holmeso deleted the forked_htsjdk branch October 31, 2024 05:26
3 participants