Conversation

@AnatoliShein

In this pull request I added the LIBHDFS++ library to the ORC project for reading files from HDFS.

Libhdfs++ is located in orc/c++/lib/libhdfspp and by default builds as a light-weight library without examples, tests, and tools (thereby avoiding dependencies on the JDK, valgrind, and gmock). However, if the flag -DHDFSPP_LIBRARY_ONLY=FALSE is passed to cmake, it will build the examples, tests, and tools as well.

Libhdfs++ depends on the protobuf libraries in orc/c++/libs/protobuf-2.6.0 and searches the system for the Doxygen, OpenSSL, CyrusSASL, GSasl, and Threads packages (only OpenSSL and Threads are required).

The libhdfspp folder also includes a script, pull_hdfs.sh, which pulls the latest changes from the Libhdfs++ Hadoop branch into ORC and generates a file 'imported_timestamp' with the timestamp and information about the latest commit.

I also updated all the ORC tools to automatically use Libhdfs++ to read ORC files on HDFS if their path begins with 'hdfs://'.

Please review.

char input_buffer[BUF_SIZE];
do {
  // Reading file chunks
  status = file->PositionRead(input_buffer, sizeof(input_buffer), offset, &last_bytes_read);
Contributor

Why do we need an intermediate buffer here? Can we just pass "buf" and "length" into PositionRead()? This way we don't need the following memcpy() and can also simplify the logic here.

Author

I used an intermediate buffer because one was used with PositionRead in a similar case. I just tried it without this buffer and it worked.

std::unique_ptr<hdfs::FileHandle> file_handle(file_raw);
file = std::move(file_handle);
// Transferring the ownership of FileSystem to HdfsFileInputStream to avoid premature deallocation
file_system = fs;
Contributor

I don't quite understand why we need a local variable "fs". You could just initialize this->file_system directly. Also, "fs" should be a unique_ptr, as it doesn't share ownership with anyone.

Author

This makes sense, we don't really need a shared pointer here. Just fixed it.

}
// Wrapping file_raw into a unique pointer to guarantee deletion
std::unique_ptr<hdfs::FileHandle> file_handle(file_raw);
file = std::move(file_handle);
Contributor

You could combine these two lines into file.reset(file_raw);

Author

Yep, good catch.

throw ParseError("Error reading the file: " + status.ToString());
}
} while (true);
status = file->PositionRead(buf, static_cast<size_t>(length), static_cast<off_t>(offset), &last_bytes_read);
Contributor

You still need to put this in a loop in case last_bytes_read is less than length.

Author

This makes sense. I just added a loop here.
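
For reference, a minimal sketch of such a retry loop, assuming the PositionRead signature used elsewhere in this PR (the readFully helper name is hypothetical):

void readFully(hdfs::FileHandle* file, char* buf, uint64_t length,
               uint64_t offset) {
  size_t total_bytes_read = 0;
  size_t last_bytes_read = 0;
  do {
    // Read into the not-yet-filled tail of the buffer.
    hdfs::Status status = file->PositionRead(
        buf + total_bytes_read,
        static_cast<size_t>(length) - total_bytes_read,
        static_cast<off_t>(offset + total_bytes_read),
        &last_bytes_read);
    if (!status.ok()) {
      throw ParseError("Error reading the file: " + status.ToString());
    }
    total_bytes_read += last_bytes_read;
    // Stop once the request is satisfied or the file ends early.
  } while (total_bytes_read < static_cast<size_t>(length) &&
           last_bytes_read > 0);
}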

if (!fs) {
  // Wrapping file_system into a unique pointer to guarantee deletion
  file_system = std::unique_ptr<hdfs::FileSystem>(
      hdfs::FileSystem::New(io_service, "", options));
  if (!file_system) {
Contributor

Are you sure this would work? Should it be if (file_system.get() != nullptr)?

Author

Yes, unique_ptr actually supports this notation (thanks to operator bool).

Contributor

Unfortunately, unique_ptr can be redefined as auto_ptr if the platform doesn't support unique_ptr (see orc_config.hh). We talked about removing this redefine, but it hasn't been done yet. So I'd suggest we stick to file_system.get() != nullptr for now.
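
For illustration, the difference only matters if the unique_ptr alias falls back to auto_ptr, which has no operator bool (a minimal sketch):

// Relies on explicit operator bool; fine for std::unique_ptr only:
if (!file_system) { /* handle null */ }

// Works for both std::unique_ptr and std::auto_ptr:
if (file_system.get() == nullptr) { /* handle null */ }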

Author

I changed it here; however, the '!fs' notation is also used in many other parts of the libhdfs++ library, such as the examples, tests, tools, and C bindings, so hopefully this does not create too much trouble.


namespace orc {

class HdfsFileInputStream : public InputStream {
Contributor

Do you have corresponding UTs for this new class?

Author

I don't currently have UTs for this class because testing it requires a Hadoop cluster, and firing one up would mean bringing the entire Hadoop tree into ORC (which is actually much larger than ORC itself), which can make things very heavy.

Contributor

Does HDFS C++ provide a mocked service interface? My concern is the testability of HdfsFileInputStream.

Author

Currently we don't actually have a mocked service interface in libhdfs++.


For a mock that does real network I/O, libhdfs++ uses the C++/JVM bindings over the minidfscluster (a Java namenode and datanode in a single process) for its unit tests. That seems like a very large dependency to bring into this project. It'd be possible to write a mock that goes over TCP in C/C++; I can think of two ways:

  1. Capture network traffic of a read with tcpdump/wireshark for a given request and send the response back. This would be really brittle if IO request sizes changed, and pretty ugly since we'd have to store a bunch of binary data.
  2. Implement at least some of the server side HDFS RPC protocol and DataTransferProtocol to make a mock namenode and datanode. This is possible but would be a considerable amount of work that wouldn't be as useful for libhdfs++ since it already has the minidfscluster.
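
A lighter third option (purely a sketch, not an existing libhdfs++ interface) would be to route HdfsFileInputStream's reads through a narrow seam so a unit test can substitute an in-memory fake that also exercises short reads:

#include <algorithm>
#include <cstring>
#include <string>
#include <sys/types.h>

// Hypothetical test seam; all names here are illustrative only.
struct PositionalReader {
  virtual ~PositionalReader() {}
  virtual size_t pread(char* buf, size_t nbytes, off_t offset) = 0;
};

struct InMemoryReader : PositionalReader {
  std::string data;
  explicit InMemoryReader(std::string d) : data(std::move(d)) {}
  size_t pread(char* buf, size_t nbytes, off_t offset) override {
    if (static_cast<size_t>(offset) >= data.size()) return 0;  // EOF
    // Cap each call at 3 bytes to force callers to loop over short reads.
    size_t n = std::min(
        std::min(nbytes, data.size() - static_cast<size_t>(offset)),
        static_cast<size_t>(3));
    std::memcpy(buf, data.data() + offset, n);
    return n;
  }
};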

size_t last_bytes_read = 0;

do {
status = file->PositionRead(buf, static_cast<size_t>(length) - total_bytes_read, static_cast<off_t>(offset + total_bytes_read), &last_bytes_read);
Contributor

Make sure to wrap on 80 characters.

Author

Just fixed the whole file to wrap on 80 chars.

find_package(OpenSSL REQUIRED)
find_package(Protobuf REQUIRED)
find_package(CyrusSASL)
find_package(GSasl)


I think you're going to need a flag to make sure it doesn't attempt to find and link against GSasl since the Apache and GPL licenses aren't compatible. Eventually GSasl will be removed from libhdfs++ for the same reason.

@majetideepak
Contributor

Can we add libhdfspp as a tarball and add a line to untar it?
I am going to make these changes for other libraries in the lib folder as well.

@omalley
Contributor

omalley commented Aug 7, 2017

Sorry, this patch is really out of date with the master. Can you rebase it?

@AnatoliShein
Author

Hi @omalley,
Yes, I am working on rebasing it right now. Running into some protobuf problems, but it should not be too bad.

@AnatoliShein
Author

Rebased the branch and resolved the protobuf issues, but I still need to move Libhdfspp out of the libs and fix other small issues. Testing whether it passes the CI tests now.

@omalley
Contributor

omalley commented Aug 14, 2017

Unfortunately, this doesn't build on MacOS. :(

I'll take some time this week to figure out what needs to be fixed. I've gone ahead and rebased it to the current master and squashed the commits in https://github.com/omalley/orc/tree/pr/134

@AnatoliShein
Author

Hi @omalley,
What error messages do you get when trying to build on MacOS?

@omalley
Contributor

omalley commented Aug 15, 2017

It looks like it is a "long" vs "long long" problem.

/Users/owen/work/code/orc/build/libhdfspp_ep-prefix/src/libhdfspp_ep/lib/common/configuration.cc:85:12: error: no viable conversion from returned value of type 'optional<long>' to function return type 'optional<long long>'
    return result;
           ^~~~~~
/Users/owen/work/code/orc/build/libhdfspp_ep-prefix/src/libhdfspp_ep/third_party/tr2/optional.hpp:427:13: note: candidate constructor not viable: no known conversion from 'std::experimental::optional<long>' to 'std::experimental::nullopt_t' for 1st argument
  constexpr optional(nullopt_t) noexcept : OptionalBase<T>() {};
            ^
/Users/owen/work/code/orc/build/libhdfspp_ep-prefix/src/libhdfspp_ep/third_party/tr2/optional.hpp:429:3: note: candidate constructor not viable: no known conversion from 'std::experimental::optional<long>' to 'const std::experimental::optional<long long> &' for 1st argument
  optional(const optional& rhs)
  ^
/Users/owen/work/code/orc/build/libhdfspp_ep-prefix/src/libhdfspp_ep/third_party/tr2/optional.hpp:438:3: note: candidate constructor not viable: no known conversion from 'std::experimental::optional<long>' to 'std::experimental::optional<long long> &&' for 1st argument
  optional(optional&& rhs) noexcept(is_nothrow_move_constructible<T>::value)
  ^
/Users/owen/work/code/orc/build/libhdfspp_ep-prefix/src/libhdfspp_ep/third_party/tr2/optional.hpp:447:13: note: candidate constructor not viable: no known conversion from 'std::experimental::optional<long>' to 'const long long &' for 1st argument
  constexpr optional(const T& v) : OptionalBase<T>(v) {}
            ^
/Users/owen/work/code/orc/build/libhdfspp_ep-prefix/src/libhdfspp_ep/third_party/tr2/optional.hpp:449:13: note: candidate constructor not viable: no known conversion from 'std::experimental::optional<long>' to 'long long &&' for 1st argument
  constexpr optional(T&& v) : OptionalBase<T>(constexpr_move(v)) {}
            ^
1 error generated.
make[5]: *** [lib/common/CMakeFiles/common_obj.dir/configuration.cc.o] Error 1
make[4]: *** [lib/common/CMakeFiles/common_obj.dir/all] Error 2
make[3]: *** [all] Error 2

@jamesclampffer

Those errors should go away after HDFS-10787 is finished (or applied to the PR temporarily) since it exposes an API to get at the configuration without using std::tr2::optional. We had to suppress a bunch of errors coming from optional when it was added to libhdfs++; eventually I'd like to get rid of the remaining uses of it in the implementation as well.

@majetideepak
Contributor

I am adding a Travis test for OS X. It should make catching OS X issues easier.

@majetideepak
Contributor

Can you check whether thread_local is supported by the platform, and disable libhdfspp on platforms where it isn't?

@jamesclampffer

jamesclampffer commented Aug 18, 2017

@majetideepak Is there a way to get the OSX CI to use a generic LLVM/Clang install rather than the one that gets shipped with OSX?

Apparently they apply a patch to Clang to break thread_local; my understanding is they don't want to break the ABI if they ever get around to making builtin thread locals specialized for their OS.

        .Case("cxx_thread_local",
-                 LangOpts.CPlusPlus11 && PP.getTargetInfo().isTLSSupported() &&
-                 !PP.getTargetInfo().getTriple().isOSDarwin())
+                 LangOpts.CPlusPlus11 && PP.getTargetInfo().isTLSSupported())

Otherwise this won't get coverage on OSX CI builds.

Edit: Once I have some time I'll write a shim on top of pthread TLS, which will make this a non-issue, but it might be a while before I have a chance. https://issues.apache.org/jira/browse/HDFS-10355
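
Such a shim could look roughly like the following (a hypothetical sketch, not the actual HDFS-10355 implementation):

#include <pthread.h>

// Lazily creates one T per thread over a pthread key; the value is
// destroyed automatically when its owning thread exits.
template <typename T>
class ThreadLocal {
 public:
  ThreadLocal() { pthread_key_create(&key_, &ThreadLocal::destroy); }
  ~ThreadLocal() { pthread_key_delete(key_); }

  T& get() {
    T* value = static_cast<T*>(pthread_getspecific(key_));
    if (value == nullptr) {
      value = new T();
      pthread_setspecific(key_, value);
    }
    return *value;
  }

 private:
  static void destroy(void* value) { delete static_cast<T*>(value); }
  pthread_key_t key_;
};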

@omalley
Contributor

omalley commented Aug 18, 2017 via email

@AnatoliShein
Author

Hi all! It looks like the thread_local feature is supported by default in Xcode 8.0 (and above), which runs on OS X El Capitan (10.11.5) or newer. Is it possible to update the osx_image of the CI run from "xcode6.4" to at least "xcode8.0"? For older versions we can just display a warning message and not build libhdfspp.

@AnatoliShein
Author

Looks like THIS is the list of available CI images for OS X. I will try osx_image: xcode8.3 now and see what happens.

@omalley
Contributor

omalley commented Aug 29, 2017

There are a lot of warnings in libhdfs that clang reports.

It is currently failing in:
[ 91%] Linking CXX shared library libhdfspp.dylib

with link errors about sasl and protobuf.

@AnatoliShein
Author

Hi @omalley, the linking should be fixed with the new commit, and we will fix the warnings shortly.

@AnatoliShein
Author

Hi guys, I now fixed all the build errors and warnings for OS X (with Xcode 8+), and I would like some additional feedback on this pull request if you have any. Thanks!
(@omalley , @majetideepak , @xndai, @jamesclampffer )

}

hdfs::FileHandle *file_raw = nullptr;
status = file_system->Open(uri.get_path(), &file_raw);
Contributor

nit: you may want to cache "status" within the class.

Author

I am not quite sure why we would want to cache "status", since it is just used for the return values of HDFS operations (success or an error message). Could you please explain?

Contributor

My feeling is that you may want to return other meta info in the future. But it's your call.

}

uint64_t getNaturalReadSize() const override {
return 1048576; //1 MB
Contributor

nit: Use const variable?

Author

This sounds good, I will update this.
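
The change might look like this (a sketch; the constant name is an assumption):

// Natural read size used when reading ORC files over HDFS.
static const uint64_t NATURAL_READ_SIZE = 1048576;  // 1 MB

uint64_t getNaturalReadSize() const override {
  return NATURAL_READ_SIZE;
}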


/**
* Create a stream to a local file.
* Create a stream to a local file or HDFS file if path begins with "hdfs://"
Contributor

Why do you want readLocalFile() to handle HDFS files? You already have readHdfsFile(), which does exactly that.

Author

Yes, I guess this is a little confusing. Currently I am calling readHdfsFile from readLocalFile if the path starts with "hdfs://". To fix this I will add a new function, readFile, which will call either readHdfsFile or readLocalFile.
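
A sketch of that dispatch, assuming the factory signatures match the existing readLocalFile/readHdfsFile functions:

std::unique_ptr<InputStream> readFile(const std::string& path) {
  // Route HDFS URIs to libhdfs++; everything else goes to local I/O.
  if (path.compare(0, 7, "hdfs://") == 0) {
    return readHdfsFile(path);
  }
  return readLocalFile(path);
}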

@xndai
Contributor

xndai commented Sep 13, 2017

LGTM

@AnatoliShein
Author

Now ORC can be built without LIBHDFSPP like this:
cmake -DBUILD_LIBHDFSPP=off ..

@omalley
Contributor

omalley commented Oct 19, 2017

Ok, this patch breaks the docker scripts on at least centos7, debian8, and ubuntu16 by adding new dependencies for the c++ build. I'm going through and fixing them.

@asfgit asfgit closed this in 5831033 Oct 19, 2017
@gxkevin

gxkevin commented Oct 9, 2020

My HDFS path looks like this: hdfs://yycluster02/hive_warehouse/rs_hago_client.db/peta/test/orc_test/my-file.orc

But I got this error:
'''
[ERROR ][Async Runtime ][Fri Oct 9 17:05:09 2020][Thread id = 140438677890880].
[/home/workspace/online_code/orc/build/libhdfspp_ep-prefix/src/libhdfspp_ep/lib/common/namenode_info.cc:170] Unabled to resolve endpoints for host: yycluster02 port: -1
[ERROR ][RPC ][Fri Oct 9 17:05:09 2020][Thread id = 140438677890880].
[/home/workspace/online_code/orc/build/libhdfspp_ep-prefix/src/libhdfspp_ep/lib/rpc/rpc_engine.cc:191] RpcEngine::AsyncRpcCommsError called; status="No endpoints provided" conn=0xf07340 reqs=1
[WARN ][RPC ][Fri Oct 9 17:05:09 2020][Thread id = 140438651348736].
[/home/workspace/online_code/orc/build/libhdfspp_ep-prefix/src/libhdfspp_ep/lib/rpc/rpc_engine.cc:202] RpcEngine::RpcCommsError called; status="No endpoints provided" conn=0xf07340 reqs=1
Can't connect to yycluster02:. No endpoints provided
'''

I use it like this:
ORC_UNIQUE_PTR<Reader> reader;
ReaderOptions options;
reader = createReader(readFile(file_path), options);

dongjoon-hyun pushed a commit that referenced this pull request Apr 13, 2024
### What changes were proposed in this pull request?
Mark readHdfsFile as deprecated.

### Why are the changes needed?
Reading ORC on HDFS was introduced in #134 without any tests. It has not been updated for 7 years, and updating libhdfspp would result in extra dependencies like the boost library. Staying at an old version of libhdfspp will also prohibit us from updating other libraries like protobuf.

### How was this patch tested?
It does not need test.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #1885 from wgtmac/ORC-1669.

Authored-by: Gang Wu <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
dongjoon-hyun pushed a commit that referenced this pull request Apr 13, 2024
(cherry picked from commit cea0629)
Signed-off-by: Dongjoon Hyun <[email protected]>