Skip to content

Conversation

@ctsa
Copy link
Contributor

@ctsa ctsa commented Oct 3, 2015

Always set binary mode on the BGZF* file descriptor for
all bgzf_*open calls on windows. Note the occurance of
'b' in the mode string is ignored, because this file
type must be binary mode to work correctly on windows.

This design will correct all bgzf file pointers (stdin/stdout,
other cases of pre-specified fd...) instead of just
those using bgzf_open with the current O_BINARY define
in hfile_oflags.

@ctsa ctsa mentioned this pull request Oct 3, 2015
@jmarshall
Copy link
Member

hFILE_fd *fp = (hFILE_fd *)fpv;

Not all hFILE pointers are hFILE_fd pointers, so this won't fly. And I'd really rather not expose an API function like hsetbinary().

We would be more likely to accept a PR that just implements the existing TODO re Set binary mode (for Windows). This will fix the problem for stdin/stdout, and in other cases of pre-specified fds I think it's reasonable to expect the calling code that opens the fd to open it appropriately.

@ctsa
Copy link
Contributor Author

ctsa commented Oct 4, 2015

Thanks for clarifying the hFILE/hFILE_fd issue, I did not know about this. The design constraint I'm trying to work around is this:

I understand that all BGZF* are dealing with binary formats, and therefore we can set binary mode in this case without consulting the mode string, because otherwise on windows we write a corrupt bgzip/gzip file.

On the other hand, the hFILE/hopen() interface seems to be generalized to include non-compressed formats as well (SAM...), so I wasn't sure that it would be valid to assume we could set binary mode on windows in this case -- this is what confuses me about the TODO -- I would be fine to submit a revised PR which sets binary mode on a wider set of hopen calls, but I was trying to stay away from any issues of binary vs text mode for uncompressed SAM and VCF files.

@jmarshall
Copy link
Member

The general policy in htslib is to open everything in binary mode, and accept both LF and CR-LF as line terminators when reading files as text. (See for example kseq.h which is currently where this happens for SAM and VCF files.) Implications are:

  • Tools on both Unix and Windows can read both LF-style and CR-LF-style text files without faff;
  • Tools on Windows write text output in the non-native LF-style (but most of the opinions we've heard think this is preferable);
  • We don't do any special ctrl-Z handling on Windows, which may or may not affect the ability to provide standard input to these tools on the command line by typing into the terminal window.

@jkbonfield
Copy link
Contributor

On Mon, Oct 05, 2015 at 04:16:43AM -0700, John Marshall wrote:

  • We don't do any special ctrl-Z handling on Windows, which may or may not affect the ability to provide standard input to these tools on the command line by typing into the terminal window.

This is quite easy to add.

You can already see this sort of thing in O_BINARY for open in htslib
knetfile.c:

#ifdef _WIN32
/* In windows, O_BINARY is necessary. In Linux/Mac, O_BINARY may
* be undefined on some systems, although it is defined on my
* Mac and the Linux I have tested on. */
int fd = open(fn, O_RDONLY | O_BINARY);
#else
int fd = open(fn, O_RDONLY);
#endif

or O_BINARY on an explicit set_mode in cram/mFILE.c:

#ifdef _WIN32
if (binary)
_setmode(_fileno(fp), _O_BINARY);
else
_setmode(_fileno(fp), _O_TEXT);
#endif

James

James Bonfield ([email protected]) | Hora aderat briligi. Nunc et Slythia Tova
| Plurima gyrabant gymbolitare vabo;
A Staden Package developer: | Et Borogovorum mimzebant undique formae,
https://sf.net/projects/staden/ | Momiferique omnes exgrabure Rathi.

The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.

@jmarshall
Copy link
Member

There is a reason why the general policy in htslib is to open everything in binary mode.

The point is that when a data file is opened, we don't know whether it will be a binary file or a text file. This is not just a question of adding a flag to hts_open()'s arguments — when opening a file for reading, we don't know which it is until we've read a few (tens of) bytes in hts_detect_format(). So we'd have to change the mode after we've already read from the file, which is painful at best.

Since the majority of users are on platforms where all this is immaterial, the claim is that the always-binary approach is a simple maintainable implementation that is ideal for most and workable for the platforms that distinguish text and binary (i.e., Windows). Perhaps @ctsa can comment on whether the ctrl-Z handling or LF-style output is problematic on Windows.

@ctsa
Copy link
Contributor Author

ctsa commented Oct 5, 2015

Thanks for clarifying the intent -- 'always binary' is a very easy policy to follow. I don't have any insight on the ctrl-Z issue and am not familiar with any cases where this would come up. Will submit revised PR based on this policy once I have a moment to test.

@jkbonfield
Copy link
Contributor

John: your TODO comment does indeed look like a good starting point for many cases. Sorry I didn't notice it was a link to some actual source earlier, I was reading in plain old mutt. Although it may be better to do the actual setmode in the neighbouring hdopen as some functions call hdopen direct and don't go via the stdin/stdout version. An example of this is bgzf_dopen(), which is called by the bgzf_fdopen macro, in turn called in bgzf.c's main function directly on fileno(stdin).

Chris: the control-Z issue is one of how Windows handles reading a binary file when opened in non-binary mode (the default for stdin/stdout). If it sees a control-Z it interpets it as EOF and promptly ceases reading, despite potentially megabases of data on the stream after it. Basically the same things that solve cr/lf typically also solve control-Z.

Hence using _setmode(_fileno(fp), _O_BINARY) or an equivalent solves it too.

This edit, together with the existing O_BINARY setting
in hfile_oflags, ensures that nearly all hFILE open calls
are set to binary mode on windows.
@ctsa
Copy link
Contributor Author

ctsa commented Oct 5, 2015

Ok, revised the PR down to this minimal edit -- this solves all of my motivating failure cases on windows (things like "sortVcf.py | bgzip.exe -c > file.vcf.gz", and a few similar uses). A more general change to make bgzip & tabix uniformly apply the 'always binary' policy should probably be the subject of another PR.

@jmarshall
Copy link
Member

All that stuff with bgzf_dopen(fileno(stdin), "r") in bgzip.c and tabix.c is pre-hFILE code from the days when bgzf_open() did not treat a - filename specially. It should be recoded to use bgzf_open("-", …) instead.

(The ctrl-Z issue I had in mind was the low-priority one in which the user tries to type a SAM file on the terminal:

$ samtools view -bo foo.bam -
@SQ  SN:foo  LN:200
etc etc
^Z
$

and is unable to terminate giving input. However it turns out (at least on Windows 7) that even with stdin in binary mode typing ctrl-Z still works to signal EOF, so there is probably no issue here.)

I would prefer not to call setmode() in hdopen() because:

  1. hdopen() can be called with file descriptors that are not files. In particular, on Windows sockets will not respond to setmode() particularly well;
  2. file descriptors given to hdopen() may already have had I/O done on them, so it may by then be too late to call setmode().

bzgip.c will need its own overhaul to work fully on Windows anyway. bgzip -d opens its output file with raw open(2) without any concern for binary mode, so surely doesn't work even with a hacked hdopen().

@jmarshall
Copy link
Member

Please give current develop a go. b8204c1 avoids bgzf_dopen() in bgzip.c and tabix.c, meaning that cases like your sortVcf.py | bgzip.exe -c > file.vcf.gz go via hopen_fd_stdinout(), which since fe8b210 sets it to binary.

Bgzip still opens its non-BGZF files, i.e., the file to be compressed or the output from decompression, in non-binary mode. However as bgzip is probably mostly used with text files, that's perhaps the right thing to do. No change from previous versions, in any case.

@ctsa
Copy link
Contributor Author

ctsa commented Dec 16, 2015

Thanks John! I haven't been on a windows iteration in some time. I see this made it into the 1.3 release so I will test when we do the htslib bump.

@jmarshall
Copy link
Member

Should have been fixed by fe8b210. Please reopen if there is still a problem here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants