Skip to content

Conversation

@holmeso
Copy link
Contributor

@holmeso holmeso commented Aug 28, 2023

Description

This PR contains 2 new classes to create unmapped BAMs from fastq files, and to create fastq files from mapped BAM/CRAM files.
They are effectively clones of picard's classes (FastqToSam and SamToFastq).

What sets them apart from the picard classes is that they will attempt to preserve any additional information in the FastqRecord header in the SAMRecord (FastqToSamWithHeaders).
When writing back to fastq (SamToFastqWithHeaders), these additional headers (if present) will be returned to the FastqRecord header. This means that it should be possible to accurately recreate the fastq files that were used to make the BAM/CRAM, which means that it should be possible to delete the original fastq files, which means that disk space should be saved.

FastqToSamWithHeaders will write any additional header information into 2 user defined tags:

  • ZH - for any extra Header information
  • ZT - for any bases and qualities that have been Trimmed from the read (a separate process is responsible for trimming the bases and adding them to the FastqRecord header)

These tags are then either added back to the header (ZH) or added back to the read and quality (ZT) when SamToFastqWithHeaders is called.

Type of change

  • New feature (non-breaking change which adds functionality)

How Has This Been Tested?

New unit test classes have been included as part of this PR.
These classes have been included as part of a modified FTUB_WGGSS wfl, with the same number of hard filtered vcf records produced (on the GS NA12878 dataset) as the existing FTUB_WGGSS wfl. The CRAM/BAM can be converted back to a fastq file with the same records as the original.

Are WDL Updates Required?

No wdl updates are required, although the expectation is that once this is in production, the FTUB_WGGSS wfl (or a new one based on that) will be update to call these new classes.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

These new classes preserve trimmed bases and additional header information in the fastq file through
to (unmapped) BAM and back to fastq
@delocalizer delocalizer self-assigned this Aug 28, 2023
if (MAX_Q > SAMUtils.MAX_PHRED_SCORE) return new String[]{"MAX_Q must be <= " + SAMUtils.MAX_PHRED_SCORE};
return null;
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM


System.exit(exitStatus);
}
} No newline at end of file
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

assertEquals("aaaaAAAA+????????", samRec.getAttribute(FastqToSamWithHeaders.ZT_ATTRIBUTE));
assertEquals(" 1.2.ACGTTGCA/1", samRec.getAttribute(FastqToSamWithHeaders.ZH_ATTRIBUTE));
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it worth replicating the Picard FastqToSam tests here just so we're happy it's still doing all the usual things correctly? ("NO" is a reasonable answer!)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added extra tests to try and replicate the tests in FastqToSam

/*
If this contains the trimmed bases flag (TB:) then put that in a separate tag
*/
int tbIndex = additionalHeader.indexOf(" TB:");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

" TB:" as a static final ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree


@Test
public void createSAMRecordAdditionalHeaderNoTrimming() {
FastqRecord fqRec = new FastqRecord("basename 1.2.ACGTTGCA/1", "ACGT", "", "????");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might I suggest two small changes to this and the ones below...

  1. make the read id basename/1 1.2.ACGTTGCA so the /1 is in a more typical position for paired reads
  2. Construct the read id out of named pieces e.g. readId = basename + additionalHeader + " TB:" + trimmedBases so that it's a bit clearer in the assertEquals-es that the expected pieces are being put into the expected places

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tick and tick

assertEquals("AAATGAGGGAAGAAAAGAGTTAAATGCATGTTGATTCCAAGCCCCCGCCTGCCGGGGGGACAGCGGGAGGTTGGAGCACGCAGCCCTGGTGCCTGGT" + "GCGA", fq.getReadString());
assertEquals("???????????????????5???????????????????????????????????????&5?????????+??55?????????????????????????'", fq.getBaseQualityString());
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto, is it worth replicating the Picard unit tests here, or not?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added extra tests to try and replicate the tests in SamToFastq

assertEquals("AAATGAGGGAAGAAAAGAGTTAAATGCATGTTGATTCCAAGCCCCCGCCTGCCGGGGGGACAGCGGGAGGTTGGAGCACGCAGCCCTGGTGCCTGGT" + "GCGA", fq.getReadString());
assertEquals("???????????????????5???????????????????????????????????????&5?????????+??55?????????????????????????'", fq.getBaseQualityString());
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to demonstrate reconstructing a read id with a space in it like @ERR194147.1758538/1 ACGTACGT ? I mean it's pretty obvious that it works no matter what is in the ZH attribute but could be good to be explicit.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good idea - I'll flesh out the ZH attribute

@holmeso holmeso merged commit e1d1cb8 into master Sep 4, 2023
@holmeso holmeso deleted the fastqToSamHeaders branch September 4, 2023 05:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants