-
Notifications
You must be signed in to change notification settings - Fork 459
Restore support for QNAMEs longer than 251 characters #900
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
So that CIGAR data is 32-bit aligned, there are extra padding NULs in memory after qname (see 5d114eb). For simplicity in the bam_get_cigar() et al macros, these NULs are counted in both l_qname and l_extranul -- but this means that the maximum value for l_qname is 256 (namely a QNAME of length >= 252, plus one NUL terminator, plus more NULs to get to a multiple of 4). Make l_qname a uint16_t and (to retain the simplicity) leave the l_qname/l_extranul representation as it is. Rearrange the two fields so that the larger field takes over the previous unused1's space, so that sizeof(bam1_core_t) is unchanged. Revert the checks added by 079c7b8, which are no longer needed. In bam_construct_seq(), just trust the length obtained from the CRAM record; in bam_read1(), the l_qname value has come from a 1-byte raw field, so there is no longer any overflow risk; in sam_parse1(), revert the check to check against the SAM spec's exact 254 maximum length (+ 1 for the following TAB). Add test cases with long read names. Fixes samtools#520.
| uint8_t l_extranul; | ||
| uint16_t flag; | ||
| uint16_t l_qname; | ||
| uint32_t n_cigar; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This diff is pretty unhelpfully displayed. Equivalently and more clearly:
int32_t pos;
uint16_t bin;
uint8_t qual;
- uint8_t l_qname;
+ uint8_t l_extranul;
uint16_t flag;
- uint8_t unused1;
- uint8_t l_extranul;
+ uint16_t l_qname;
uint32_t n_cigar;
int32_t l_qseq;
int32_t mtid;|
I suppose, to be precisely accurate, it needs to represent 4…256 inclusive. So we could put #define bam_get_cigar(b) ((uint32_t*)((b)->data + (b)->core.l_qname + 4)
// etcand Edited to add: Certainly just expanding the field to |
|
I think Possibly we need some checks in |
It's possible to programmatically make a longer QNAME, which would lead to outputting a corrupted record in BAM.
Bumped TWO_TO_THREE_TRANSITION_COUNT due to bam1_core_t changes.
|
Thanks, long QNAMEs are now restored. |
Another ABI breaker that was in danger of being forgotten about while the ABI change window is open (leastwise I for one had forgotten about it!): change
l_qnametouint16_tso QNAME length can be returned to the spec's 254 from 079c7b8's 251 — see #520.As the soversion is being bumped, we have the luxury of fixing this by simply enlarging the
l_qnamefield. It needs to represent the values 0…256 so most of the new bits are wasted, but ¯\_(ツ)_/¯ and maybe one day we may have a format wanting longer QNAMEs…Adds test cases with long read names. I wish I had though of such test cases when I first implemented the align-the-CIGAR-data thing in 5d114eb!
Fixes #520. Supersedes and closes #523.