-
Notifications
You must be signed in to change notification settings - Fork 5
feat(qmule): new classes for creating fastqs from BAMs and vice versa #337
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
These new classes preserve trimmed bases and additional header information in the fastq file through to (unmapped) BAM and back to fastq
| if (MAX_Q > SAMUtils.MAX_PHRED_SCORE) return new String[]{"MAX_Q must be <= " + SAMUtils.MAX_PHRED_SCORE}; | ||
| return null; | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
|
|
||
| System.exit(exitStatus); | ||
| } | ||
| } No newline at end of file |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
| assertEquals("aaaaAAAA+????????", samRec.getAttribute(FastqToSamWithHeaders.ZT_ATTRIBUTE)); | ||
| assertEquals(" 1.2.ACGTTGCA/1", samRec.getAttribute(FastqToSamWithHeaders.ZH_ATTRIBUTE)); | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it worth replicating the Picard FastqToSam tests here just so we're happy it's still doing all the usual things correctly? ("NO" is a reasonable answer!)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added extra tests to try and replicate the tests in FastqToSam
| /* | ||
| If this contains the trimmed bases flag (TB:) then put that in a separate tag | ||
| */ | ||
| int tbIndex = additionalHeader.indexOf(" TB:"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
" TB:" as a static final ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agree
|
|
||
| @Test | ||
| public void createSAMRecordAdditionalHeaderNoTrimming() { | ||
| FastqRecord fqRec = new FastqRecord("basename 1.2.ACGTTGCA/1", "ACGT", "", "????"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might I suggest two small changes to this and the ones below...
- make the read id
basename/1 1.2.ACGTTGCAso the/1is in a more typical position for paired reads - Construct the read id out of named pieces e.g.
readId = basename + additionalHeader + " TB:" + trimmedBasesso that it's a bit clearer in theassertEquals-es that the expected pieces are being put into the expected places
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tick and tick
| assertEquals("AAATGAGGGAAGAAAAGAGTTAAATGCATGTTGATTCCAAGCCCCCGCCTGCCGGGGGGACAGCGGGAGGTTGGAGCACGCAGCCCTGGTGCCTGGT" + "GCGA", fq.getReadString()); | ||
| assertEquals("???????????????????5???????????????????????????????????????&5?????????+??55?????????????????????????'", fq.getBaseQualityString()); | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto, is it worth replicating the Picard unit tests here, or not?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added extra tests to try and replicate the tests in SamToFastq
| assertEquals("AAATGAGGGAAGAAAAGAGTTAAATGCATGTTGATTCCAAGCCCCCGCCTGCCGGGGGGACAGCGGGAGGTTGGAGCACGCAGCCCTGGTGCCTGGT" + "GCGA", fq.getReadString()); | ||
| assertEquals("???????????????????5???????????????????????????????????????&5?????????+??55?????????????????????????'", fq.getBaseQualityString()); | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to demonstrate reconstructing a read id with a space in it like @ERR194147.1758538/1 ACGTACGT ? I mean it's pretty obvious that it works no matter what is in the ZH attribute but could be good to be explicit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good idea - I'll flesh out the ZH attribute
Description
This PR contains 2 new classes to create unmapped BAMs from fastq files, and to create fastq files from mapped BAM/CRAM files.
They are effectively clones of picard's classes (
FastqToSamandSamToFastq).What sets them apart from the picard classes is that they will attempt to preserve any additional information in the
FastqRecordheader in theSAMRecord(FastqToSamWithHeaders).When writing back to fastq (
SamToFastqWithHeaders), these additional headers (if present) will be returned to theFastqRecordheader. This means that it should be possible to accurately recreate the fastq files that were used to make the BAM/CRAM, which means that it should be possible to delete the original fastq files, which means that disk space should be saved.FastqToSamWithHeaderswill write any additional header information into 2 user defined tags:ZH- for any extra Header informationZT- for any bases and qualities that have been Trimmed from the read (a separate process is responsible for trimming the bases and adding them to theFastqRecordheader)These tags are then either added back to the header (
ZH) or added back to the read and quality (ZT) whenSamToFastqWithHeadersis called.Type of change
How Has This Been Tested?
New unit test classes have been included as part of this PR.
These classes have been included as part of a modified
FTUB_WGGSSwfl, with the same number of hard filtered vcf records produced (on the GS NA12878 dataset) as the existingFTUB_WGGSSwfl. The CRAM/BAM can be converted back to a fastq file with the same records as the original.Are WDL Updates Required?
No wdl updates are required, although the expectation is that once this is in production, the
FTUB_WGGSSwfl (or a new one based on that) will be update to call these new classes.Checklist: