Skip to content

Conversation

@anderskaplan
Copy link
Contributor

The purpose of this pull request is to make libhts and the test suite work better with some Windows quirks: the CRLF line endings and the distinction between binary and text files.

Plus a few other minor improvements -- see the individual commits.

-let hfile call setmode(O_BINARY) on _all_ file handles, not only stdin and stdout.
-stop git from converting line endings on the reference fasta files.
-make the test scripts handle CRLF line endings, too. Perl should take care of it, according to the docs, but it doesn't.
Copy link
Contributor

@jkbonfield jkbonfield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Many thanks - it looks like a useful change and I approve of it bar a couple minor tidy-ups to avoid unnecessary differences.

Have you been testing Samtools too? Windows support is on our agenda, but finding the time is hard and offers for help are very welcome!

test/hfile.c Outdated
if (text == NULL) fail("malloc(text)");

if (fread(text, 1, filesize, f) != filesize) fail("fread");
readsize = fread(text, 1, filesize, f);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I've no major objections to this style (use of a variable over inline checking), I don't particularly see the need for changing it and such non-changes make it harder to evaluate the PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason for splitting the line in the first case was that I wanted to disable the check on Windows. But that wasn't necessary as long as the file is opened in binary mode. Reverting this one.

while ((c = getopt(argc, argv, "IbDCSl:t:i:o:N:BZ:@:")) >= 0) {
switch (c) {
case 'S': flag |= 1; break;
case 'I': ignore_sam_err = 1; break;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding usage is most welcomed, thanks, but as for prior comment, why have the options changed order? It doesn't look to be for consistency with the usage statement.

Copy link
Contributor Author

@anderskaplan anderskaplan Mar 27, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed the order in the switch to be consistent with the string passed to getopt, right above the switch statement. For the usage instructions I aimed at what would make most sense for the user. Ideally they should all be consistent, but I didn't touch the getopt string because I'm not familiar with how it's used. I can revert the change of order, or change all of them to match the usage, whichever you prefer.

hfile.c Outdated
{
#if defined HAVE_SETMODE && defined O_BINARY
if (setmode(fd, O_BINARY) < 0) return NULL;
#endif
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is incorrect. See commentary in #283.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without this patch, the test cases in test_bgzf.c which use the function try_bgzf_dopen fail on Windows, because the files are opened in text mode, and then the file descriptors are passed into bgzf_dopen and hdopen.

I considered changing the test cases instead, but couldn't really see why it would be a bad thing to always set binary mode on the file descriptors. Are there any cases where it would be a problem?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jmarshall, can you please clarify the intention here for binary files. You state "The general policy in htslib is to open everything in binary mode", but that is precisely what this change does. Do you mean that all opens should be "rb", "wb" or similar and specifying it at that point? (Mostly we do that already of course.) That only covers fopen style APIs though and not the Unix open function used in the test case. The bgzf_dopen function should perhaps add "b" to the mode string before calling hdopen, but this doesn't really seem to fit the "everything is binary" philosophy if we're only doing it for bgzf and not for SAM.

Stdin/stdout as you note are the exceptions which clearly need a setmode call to change, but again we're turning stdin/stdout to an hfile via the dopen call once more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Relevant part from #283 (quoting @jmarshall):

I would prefer not to call setmode() in hdopen() because:

  1. hdopen() can be called with file descriptors that are not files. In particular, on Windows sockets will not respond to setmode() particularly well;
  2. file descriptors given to hdopen() may already have had I/O done on them, so it may by then be too late to call setmode().

So we should fix test_bgzf.c, and note in the documentation for hdopen() that the file descriptor passed in must already be in binary mode.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[TL;DR: what Rob said.]

As noted in the #283 commentary that I already pointed you both to, there are at least two problematic cases: hdopen() can be too late to set binary mode, and it is invalid to call setmode() on some types of file descriptor given to hdopen().

If you are going to call hdopen() or bgzf_dopen() then it is your responsibility to have opened the file descriptor in binary mode.

The bug is in test_bgzf.c's try_bgzf_dopen() which should be setting O_BINARY as hfile_oflags() does, and probably (being part of htslib so kosher to include hfile_internal.h) it should just call hfile_oflags() instead of duplicating that code.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, "open" vs "dopen" in binary mode; agreed fixing the original open is correct.

We should probably document this somewhere too in the doxygen comments before the hdopen function as it's currently intuitive to believe mode should contain "b" and handle it there.

.gitignore Outdated
*.dSYM
*.exe
*.dll
*.obj
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.gitignore already has appropriate patterns for .dll, no? To date, Windows support has been aiming at Cygwin and GCC so *.o has sufficed, but it would be good to put .obj alongside .o/.pico.

Copy link
Contributor Author

@anderskaplan anderskaplan Mar 27, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are "hfile_*.dll", "cyg*.dll", and "lib*.dll" patterns but nothing to catch pthreadVC2.dll, the pthreads library for Windows. I don't see why one would ever want to check in dll's in git, so I suggest that we keep the more generic *.dll pattern instead of the more specific ones.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this pthreadVC2.dll of which you speak? Why would it appear in the htslib source directory rather than some compiler directory? (If there's some larger pull request you're working towards, showing the draft end product would be illuminating…)

.gitignore Outdated
/test/thrash_threads[1-6]
/test/*.tmp
/test/*.tmp.*
/test/*.new
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

*.new files are output files representing failed test cases and are not cleaned by make clean. They need to be visible in git status so that you can see what tests have failed and can see them to remove them by hand. Hence they are intentionally not in .gitignore. (If that policy were to change, further changes would be necessary — perhaps just adding them to make clean too.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, removing

# Remove the text attribute from reference files, so that git doesn't convert
# line separators on Windows machines. It causes the index files to become out
# of sync with the fasta files.
*.fa* -text
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Put this in the top-level .gitattributes file. But wouldn't it be better to regenerate the .fai files instead?

Copy link
Contributor

@jkbonfield jkbonfield Mar 27, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the problem here is that git clone itself changes these files. We can't really regenerate the .fai files on a git clone, unless we have the Makefile always regenerate .fai from .fa. Even this is problematic as there is no htslib command-line interface to fai_build: our only calls to it are over in samtools faidx instead.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clearly by regenerate I'm referring to having test.pl generate the derived .fai files (and removing them from source control), just as samtools's test.pl's test_index() generates some .bai indices used by later tests. Additions to test/test_view.c or a new test/test_fai.c would indeed be needed to exercise fai_build(), but this would increase HTSlib's standalone test coverage so would obviously be a good thing…

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem with regenerating them ourselves is that then we are testing our own output, which can lead to bugs cancelling each other out. On the whole, I'd prefer to keep the .fai files in version control for now. I agree that the change would be best in the top-level file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moving to top-level .gitattributes.

…or the failing test case. Added comment that the file descriptor must be binary in hopen and bgzf_dopen.
…iew to match the usage instructions. Also replaced magic numbers with constants to make the code more readable.
@anderskaplan
Copy link
Contributor Author

Regarding your question above: no, I haven't gotten to samtools yet. The Windows build was a bit of a detour, as I planned to work on some bugs/feature request for a start, but it turned out to be an interesting one. At this point I have a working MSVC build of htslib which passes the test suite in my fork of the repo, it's the "msvc-build-2" branch if you want to take a look. But for some reason the compiled code is much slower than on linux, so still some investigation to do before I'm comfortable sharing it.

@jkbonfield
Copy link
Contributor

I've started looking into this, specifically as a starting point for Mingw64 support. So I'll probably include your commits as part of a larger pull request.

Thanks.

@anderskaplan
Copy link
Contributor Author

ok, thanks for the update!

@jkbonfield
Copy link
Contributor

It's ongoing work, but see https://github.com/jkbonfield/htslib/tree/jkb_win, which includes a rebased and squashed version of your commits.

This also has an appveyor config, which is the Travis-like for windows. An example build is seen here:

https://ci.appveyor.com/project/jkbonfield/htslib

I'm going to move over to samtools next. Who knows what horrors that will bring! :-)
Once I'm happy with both I'll create PRs.

@jkbonfield
Copy link
Contributor

Both htslib and samtools are now passing their test harnesses on Windows with mingw/msys. Some caveats:

  1. You need make check TEST_OPTS="-t c:/msys64/tmp/_" as the default msys /tmp/... dir ends up in files of filenames and hence not being automatically modified by the msys command line hackery. Ideally we'd use a raw mingw build, but we need the rest of msys style environment for tests.

  2. The isatty checks in some samtools commands (eg samtools dict) to report usage when run on a terminal otherwise read from stdin if via a pipe or file, fail on msys. This is a terminal bug I think. It works when the executable is run under DOS and PowerShell. It's a known issue with known fixes (https://fossies.org/linux/vim/src/iscygpty.c) but I am unsure if it's worth it. I disabled those tests under Windows.

  3. There are a lot of checks for msys "OS" in the perl test harness. I don't know which of these are generic windows vs mingw vs msys, but for now it's enough to support one windows environment. Similarly I'm not sure all the #ifdef _WIN32 checks in the source are actually win32 issues or just mingw/msys. Do they happen on other compilers? Again - it's sufficient for now.

  4. I had all sorts of nasty issues with curl and linking, but this is an ongoing investigation.

  5. I'm not convinced adding drand48 (and co) to htslib is the correct solution, but it's a starting point. Ideally we'd just stop depending on them!

Anyway, progress has been made and I'll investigate more on Tuesday when I get back from the long weekend. :-)

@anderskaplan
Copy link
Contributor Author

anderskaplan commented Apr 28, 2017

I worked around problem #1 above by wrapping the tempdir function as follows in the test scripts:

sub safe_tempdir
{
    my $dir = tempdir(CLEANUP=>1);
    if ($^O eq 'msys')
    {
        $dir = `cygpath -m $dir`;
        $dir =~ s/\015?\012//;
    }
    return $dir;
}

Other perl distributions on Windows will probably set $^O to other values. I don't know if there's a better way to know if we're running in a cygwin environment; I didn't find anything when googling anyway.

@anderskaplan
Copy link
Contributor Author

The super annoying #defines ERROR and MF_APPEND in Windows.h can be avoided by defining NOUSER
and NOGDI before including Windows.h.

@anderskaplan
Copy link
Contributor Author

For #3, #ifdef _WIN32 tests for platform and not compiler. It is defined also with MSVC.

@jkbonfield
Copy link
Contributor

Thanks for the suggestions. Redefining tempdir is a nice idea. Agreed $^O is a bit messy, but I don't know the right solution here.

The point about _WIN32 is I know it spots the platform rather than the compiler, but I am using it at least in part for compiler issues as well as platform ones. It can be hard to tell which is which short of trying different compilers. Mostly it's OK I think.

@jkbonfield
Copy link
Contributor

jkbonfield commented May 8, 2017

I forgot to add a fixes #commit comment to my new PR, but please see #531 which incorporates this PR (rebased and squashed) plus my own changes.

Thanks.

@anderskaplan
Copy link
Contributor Author

Replaced by another pull request, so closing this one.

@anderskaplan anderskaplan deleted the msvc-prereqs branch May 13, 2017 19:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants