Orc 17 #134
Conversation
c++/src/OrcHdfsFile.cc
Outdated
char input_buffer[BUF_SIZE];
do{
//Reading file chunks
status = file->PositionRead(input_buffer, sizeof(input_buffer), offset, &last_bytes_read);
Why do we need an intermediate buffer here? Can we just pass "buf" and "length" into PositionRead()? That way we don't need the following memcpy() and can also simplify the logic here.
I used an intermediate buffer because one was used with PositionRead in a similar case. I just tried it without the buffer and it worked.
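For reference, a minimal sketch of the direct call being discussed, assuming the PositionRead(buffer, length, offset, &bytes_read) signature shown in the diff above and the surrounding read() context (buf, length, offset, file):

```cpp
// Sketch only: read straight into the caller's buffer, so no intermediate
// copy and no memcpy() afterwards.
size_t last_bytes_read = 0;
hdfs::Status status = file->PositionRead(
    buf, static_cast<size_t>(length), static_cast<off_t>(offset),
    &last_bytes_read);
if (!status.ok()) {
  throw ParseError("Error reading the file: " + status.ToString());
}
```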
c++/src/OrcHdfsFile.cc
Outdated
std::unique_ptr<hdfs::FileHandle> file_handle(file_raw);
file = std::move(file_handle);
//transferring the ownership of FileSystem to HdfsFileInputStream to avoid premature deallocation
file_system = fs;
I don't quite understand why we need a local variable "fs". You could just initialize this->file_system directly. Also, "fs" should have been a unique_ptr, as it doesn't share ownership with anyone.
This makes sense, we don't really need a shared pointer here. Just fixed it.
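As applied, the change reads roughly like this. A sketch of the direct member initialization, matching the assignment that appears in the later revision of the diff; the error path here is illustrative:

```cpp
// Initialize the unique_ptr member directly; no local shared_ptr is needed,
// since nothing else shares ownership of the FileSystem.
file_system = std::unique_ptr<hdfs::FileSystem>(
    hdfs::FileSystem::New(io_service, "", options));
if (file_system.get() == nullptr) {
  // Illustrative error handling; the real message may differ.
  throw ParseError("Cannot create the FileSystem object.");
}
```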
c++/src/OrcHdfsFile.cc
Outdated
}
//wrapping file_raw into a unique pointer to guarantee deletion
std::unique_ptr<hdfs::FileHandle> file_handle(file_raw);
file = std::move(file_handle);
You could combine these two lines into file.reset(file_raw);
Yep, good catch.
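The before/after of that suggestion, for clarity:

```cpp
// Before: wrap the raw pointer, then move it into the member.
std::unique_ptr<hdfs::FileHandle> file_handle(file_raw);
file = std::move(file_handle);

// After: reset() transfers ownership of file_raw in one step.
file.reset(file_raw);
```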
c++/src/OrcHdfsFile.cc
Outdated
throw ParseError("Error reading the file: " + status.ToString());
}
} while (true);
status = file->PositionRead(buf, static_cast<size_t>(length), static_cast<off_t>(offset), &last_bytes_read);
You still need to put this in a loop in case last_bytes_read is less than length.
This makes sense. I just added a loop here.
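A sketch of such a loop, based on the revised diff shown later in the thread; the explicit buffer-pointer arithmetic here is illustrative:

```cpp
// Keep reading until the requested length has arrived, since a single
// PositionRead() call may return fewer bytes than asked for.
hdfs::Status status;
char* position = static_cast<char*>(buf);
size_t total_bytes_read = 0;
size_t last_bytes_read = 0;
do {
  status = file->PositionRead(position,
      static_cast<size_t>(length) - total_bytes_read,
      static_cast<off_t>(offset + total_bytes_read),
      &last_bytes_read);
  if (!status.ok()) {
    throw ParseError("Error reading the file: " + status.ToString());
  }
  total_bytes_read += last_bytes_read;
  position += last_bytes_read;
} while (total_bytes_read < static_cast<size_t>(length));
```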
c++/src/OrcHdfsFile.cc
Outdated
if (!fs) {
//Wrapping file_system into a unique pointer to guarantee deletion
file_system = std::unique_ptr<hdfs::FileSystem>(hdfs::FileSystem::New(io_service, "", options));
if (!file_system) {
are you sure this would work? Should it be if (file_system.get() != nullptr) ?
Yes, unique_ptr actually supports this notation (thanks to operator bool).
Unfortunately unique_ptr can be redefined as auto_ptr if the platform doesn't support unique_ptr (see orc_config.hh). We talked about removing this redefine, but it hasn't been done yet. So I'd suggest we stick to file_system.get() != nullptr for now.
I changed it here; however, the '!fs' notation is also used in many other parts of the libhdfs++ library, such as examples, tests, tools, and C bindings, so hopefully this does not create too much trouble.
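For readers following along, the two spellings under discussion are equivalent when a real unique_ptr is in play; the explicit form is just the safer choice given the possible auto_ptr fallback:

```cpp
// std::unique_ptr provides an explicit operator bool, so this compiles:
if (!file_system) { /* handle null */ }

// std::auto_ptr does not, so the portable spelling under orc_config.hh is:
if (file_system.get() == nullptr) { /* handle null */ }
```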
namespace orc {

class HdfsFileInputStream : public InputStream {
Do you have corresponding UTs for this new class?
I don't currently have UTs for this class because testing it requires a Hadoop cluster, and in order to fire one up we would need to bring the entire Hadoop tree (which is actually much larger than ORC) into ORC, which would make things very heavy.
Does HDFS C++ provide a mocked service interface? My concern is the testability of HdfsFileInputStream.
Currently we don't actually have a mocked service interface in libhdfs++.
For a mock that does real network stuff, libhdfs++ uses the C++/JVM bindings over the minidfscluster (Java namenode and datanode in a single process) for its unit tests. That seems like a very large dependency to bring into this project. It would be possible to write a mock that goes over TCP in C/C++; I can think of two ways:
- Capture the network traffic of a read with tcpdump/wireshark for a given request and send the response back. This would be really brittle if IO request sizes changed, and it would be pretty ugly to store a bunch of binary data.
- Implement at least some of the server-side HDFS RPC protocol and DataTransferProtocol to make a mock namenode and datanode. This is possible but would be a considerable amount of work that wouldn't be as useful for libhdfs++ since it already has the minidfscluster.
c++/src/OrcHdfsFile.cc
Outdated
size_t last_bytes_read = 0;

do {
status = file->PositionRead(buf, static_cast<size_t>(length) - total_bytes_read, static_cast<off_t>(offset + total_bytes_read), &last_bytes_read);
Make sure to wrap on 80 characters.
Just fixed the whole file to wrap on 80 chars.
c++/libs/libhdfspp/CMakeLists.txt
Outdated
find_package(OpenSSL REQUIRED)
find_package(Protobuf REQUIRED)
find_package(CyrusSASL)
find_package(GSasl)
I think you're going to need a flag to make sure it doesn't attempt to find and link against GSasl since the Apache and GPL licenses aren't compatible. Eventually GSasl will be removed from libhdfs++ for the same reason.
Can we add libhdfspp as a tarball and add a line to untar it?

Sorry, this patch is really out of date with the master. Can you rebase it?

Hi @omalley,

Rebased the branch and resolved the protobuf issues, but I still need to move Libhdfspp out of the libs and fix other small issues. Testing whether it passes the CI tests now.

Unfortunately, this doesn't build on MacOS. :( I'll take some time this week to figure out what needs to be fixed. I've gone ahead and rebased it to the current master and squashed the commits in https://github.com/omalley/orc/tree/pr/134

Hi @omalley,

It looks like it is a "long" vs "long long" problem.

Those errors should go away after HDFS-10787 is finished (or applied to the PR temporarily), since it exposes an API to get at the configuration without using std::tr2::optional. We had to suppress a bunch of errors coming from optional when it was added to libhdfs++; eventually I'd like to get rid of the remaining uses of it in the implementation as well.

I am adding a Travis test for OS X. It should make catching OS X issues easier.

Can you check if
@majetideepak Is there a way to get the OSX CI to use a generic LLVM/Clang install rather than the one that gets shipped with OSX? Apparently they apply a patch to Clang to break thread_local; my understanding is they don't want to break the ABI if they ever get around to making builtin thread local specialized for their OS. Otherwise this won't get coverage on OSX CI builds. Edit: Once I have some time I'll write a shim on top of pthread TLS, which will make this a non-issue, but it might be a while before I have a chance. https://issues.apache.org/jira/browse/HDFS-10355
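To make the idea concrete, a hypothetical shim of the kind mentioned above might look like this; the names and structure are illustrative, not taken from libhdfs++:

```cpp
#include <pthread.h>

// Illustrative per-thread storage built on pthread keys, for toolchains
// where C++11 thread_local is unavailable (e.g. older Apple clang).
template <typename T>
class ThreadLocal {
 public:
  ThreadLocal() { pthread_key_create(&key_, &ThreadLocal::destroy); }
  ~ThreadLocal() { pthread_key_delete(key_); }

  // Return this thread's instance, creating it lazily on first access.
  T& get() {
    T* value = static_cast<T*>(pthread_getspecific(key_));
    if (value == nullptr) {
      value = new T();
      pthread_setspecific(key_, value);
    }
    return *value;
  }

 private:
  // Called by pthreads when a thread exits, so per-thread values are freed.
  static void destroy(void* value) { delete static_cast<T*>(value); }
  pthread_key_t key_;
};
```

Usage would then be a static ThreadLocal<State> instance queried via get() wherever a thread_local variable would otherwise be declared.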
.. Owen
On Aug 18, 2017, at 07:14, jamesclampffer ***@***.***> wrote:
@majetideepak Is there a way to get the OSX CI to use a generic LLVM/Clang install rather than the one that gets shipped with OSX?
Don't do that. If it doesn't compile with the "real" Mac OS compiler, you'll break my dev platform.
… Apparently they apply a patch to Clang to break thread_local, my understanding is they don't want to break the ABI if they ever get around to making builtin thread local specialized for their OS.
.Case("cxx_thread_local",
- LangOpts.CPlusPlus11 && PP.getTargetInfo().isTLSSupported() &&
- !PP.getTargetInfo().getTriple().isOSDarwin())
+ LangOpts.CPlusPlus11 && PP.getTargetInfo().isTLSSupported())
Otherwise this won't get coverage on OSX CI builds.
Hi all! It looks like the thread_local feature is supported by default in Xcode 8.0 (and above), which runs on OS X El Capitan (10.11.5) or newer. Is it possible to update the osx_image of the CI run from "xcode6.4" to at least "xcode8.0"? For older versions we can just display a warning message and not build libhdfspp.

Looks like THIS is the list of available CI images for OS X. I will try osx_image: xcode8.3 now and see what happens.

There are a lot of warnings in libhdfs that clang reports. It is currently failing with link errors about sasl and protobuf.

Hi @omalley, the linking should be fixed with the new commit, and we will fix the warnings shortly.

Hi guys, I have now fixed all the build errors and warnings for OS X (with Xcode 8+), and I would like some additional feedback on this pull request if you have any. Thanks!
}

hdfs::FileHandle *file_raw = nullptr;
status = file_system->Open(uri.get_path(), &file_raw);
nit: you may want to cache "status" within the class.
I am not quite sure why we will want to cache "status", since it is just used for the return values of HDFS operations (success or error message). Could you please explain?
My feeling is that you may want to return other meta info in the future. But it's your call.
c++/src/OrcHdfsFile.cc
Outdated
}

uint64_t getNaturalReadSize() const override {
return 1048576; //1 MB
nit: Use const variable?
This sounds good, I will update this.
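For example, the magic number could become a named constant at class scope; the constant name here is illustrative:

```cpp
// Named constant instead of a magic number for the preferred read size.
static const uint64_t NATURAL_READ_SIZE = 1048576;  // 1 MB

uint64_t getNaturalReadSize() const override {
  return NATURAL_READ_SIZE;
}
```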
/**
* Create a stream to a local file.
* Create a stream to a local file or HDFS file if path begins with "hdfs://"
Why do you want readLocalFile() to handle HDFS files? You already have readHdfsFile(), which does exactly that.
Yes, I guess this is a little confusing. Currently I am calling readHdfsFile from readLocalFile if the path starts with "hdfs://". To fix this I will add a new function, readFile, which will call either readHdfsFile or readLocalFile.
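A minimal sketch of that dispatching factory, assuming the existing readLocalFile()/readHdfsFile() signatures that return a unique_ptr<InputStream> in the orc namespace; the prefix check is the essence:

```cpp
// Dispatch on the path prefix: "hdfs://" goes to the HDFS-backed stream,
// everything else to the local-file stream.
std::unique_ptr<InputStream> readFile(const std::string& path) {
  if (path.compare(0, 7, "hdfs://") == 0) {
    return readHdfsFile(path);
  } else {
    return readLocalFile(path);
  }
}
```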
LGTM

Now ORC can be built without LIBHDFSPP like this:

Ok, this patch breaks the docker scripts on at least centos7, debian8, and ubuntu16 by adding new dependencies for the C++ build. I'm going through and fixing them.

My HDFS path looks like this: hdfs://yycluster02/hive_warehouse/rs_hago_client.db/peta/test/orc_test/my-file.orc but I got an error when using it like this:
### What changes were proposed in this pull request?
Mark readHdfsFile as deprecated.

### Why are the changes needed?
Reading ORC on HDFS was introduced in #134 without any test. It has not been updated for 7 years, and updating libhdfspp would pull in an extra dependency like the boost library. Staying at an old version of libhdfspp will also prohibit us from updating other libraries like protobuf.

### How was this patch tested?
It does not need a test.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #1885 from wgtmac/ORC-1669.

Authored-by: Gang Wu <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
In this pull request I added the LIBHDFS++ library to the ORC project for reading files from HDFS.
Libhdfs++ is located in orc/c++/libs/libhdfspp and by default builds as a light-weight library without examples, tests, and tools, thereby avoiding dependencies on JDK, valgrind, and gmock. However, if the flag -DHDFSPP_LIBRARY_ONLY=FALSE is passed to cmake, it will build the examples, tests, and tools as well.
Libhdfs++ depends on the protobuf libraries in orc/c++/libs/protobuf-2.6.0 and searches the system for the packages Doxygen, OpenSSL, CyrusSASL, GSasl, and Threads (of which only OpenSSL and Threads are required).
The libhdfspp folder also includes a script, pull_hdfs.sh, which pulls the latest changes from the Libhdfs++ Hadoop branch into ORC and generates the file 'imported_timestamp' with the timestamp and information about the latest commit.
I also updated all the ORC tools to automatically use Libhdfs++ to read ORC files on HDFS if their path begins with 'hdfs://'.
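For illustration, a hedged usage sketch of how a tool can open a reader over HDFS once the dispatching readFile() is in place; the HDFS URI below is a placeholder, and the reader API calls follow the ORC C++ interface:

```cpp
#include "orc/OrcFile.hh"
#include <iostream>
#include <memory>

int main() {
  orc::ReaderOptions options;
  // readFile() returns an HDFS-backed stream for "hdfs://" paths and a
  // local-file stream otherwise; the reader does not care which.
  std::unique_ptr<orc::Reader> reader = orc::createReader(
      orc::readFile("hdfs://namenode:8020/path/to/file.orc"), options);
  std::cout << "Rows: " << reader->getNumberOfRows() << std::endl;
  return 0;
}
```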
Please review.