Put HDFS in Kubernetes #1

@kimoonkim

Description

Cc @foxish @ssuchter @ash211

As I mentioned in our weekly SIG meetings, I was able to throw together a prototype that runs HDFS daemons inside a Kubernetes cluster. I am about to send a PR with the helm chart files.

The prototype does the following:

  1. Runs a datanode daemon on every cluster node of your Kubernetes cluster, using a DaemonSet.
  2. Runs a single namenode daemon inside the cluster as a size-one StatefulSet.

Both the datanode and the namenode use hostPath local disk volumes. The Docker images use Hadoop version 2.7 and come from a public Docker Hub repo.
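To give a feel for the shape of this, here is a rough sketch of the datanode DaemonSet with a hostPath volume. All names, the image, and the paths are placeholders I made up for illustration; the helm chart files in the PR are authoritative:

```yaml
# Hypothetical sketch of the datanode DaemonSet, not the actual chart.
# A DaemonSet schedules one datanode pod onto every cluster node.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: hdfs-datanode
spec:
  selector:
    matchLabels:
      app: hdfs-datanode
  template:
    metadata:
      labels:
        app: hdfs-datanode
    spec:
      containers:
        - name: datanode
          image: some-repo/hadoop:2.7     # placeholder image name
          volumeMounts:
            - name: hdfs-data
              mountPath: /hadoop/dfs/data
      volumes:
        - name: hdfs-data
          hostPath:
            path: /hdfs-data              # local disk path on each cluster node
```

Because the data lives on a hostPath volume, block data survives pod restarts as long as the replacement pod lands on the same node.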

Using a StatefulSet gives the namenode a persistent host name, so a namenode address like hdfs://hdfs-namenode-0.hdfs-namenode.kube-system.svc.cluster.local:8020 remains the same across restarts. The datanode config relies on this address.
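One way the datanode config can rely on that stable address is via fs.defaultFS in core-site.xml, delivered through a ConfigMap. This is a hypothetical sketch (the ConfigMap name is made up; the value is the stable DNS name above):

```yaml
# Hypothetical ConfigMap embedding core-site.xml for the datanodes.
# fs.defaultFS points at the namenode's stable StatefulSet DNS name,
# which survives namenode pod restarts.
apiVersion: v1
kind: ConfigMap
metadata:
  name: hdfs-config   # made-up name
data:
  core-site.xml: |
    <configuration>
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://hdfs-namenode-0.hdfs-namenode.kube-system.svc.cluster.local:8020</value>
      </property>
    </configuration>
```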

There is no HA support for the namenode.

The namenode is pinned to a particular node using a k8s node label. This allows the namenode pod to return to the same node after a restart and keep the same filesystem data on its local disk.
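The pinning boils down to a nodeSelector in the namenode pod spec matching a label applied to one node. A sketch, with a made-up label key (the chart may use a different one):

```yaml
# Hypothetical fragment of the namenode StatefulSet pod spec.
# The target node must first be labeled, e.g.:
#   kubectl label nodes <node-name> hdfs-namenode-selector=hdfs-namenode-0
spec:
  nodeSelector:
    hdfs-namenode-selector: hdfs-namenode-0   # made-up label key/value
```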

For more details, please see the PR.

This is clearly not ready for production use. Our main focus at the moment is studying data locality.

The datanode daemons use the physical IP addresses of the cluster nodes via the k8s hostNetwork setting, whereas other application pods on those nodes, such as Spark executors, use pod-specific virtual IPs. The data locality optimization relies on matching executor host names with datanode host names, and these will never match in k8s as is, so we expect data locality to be broken. (I'm writing a detailed doc on this.) Hopefully, this prototype will serve as a good testbed for studying how we can fix it.
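The hostNetwork setup mentioned above amounts to a couple of fields on the datanode pod spec, roughly:

```yaml
# Fragment of the datanode DaemonSet pod spec. With hostNetwork the pod
# shares the node's network namespace, so the datanode registers with the
# node's physical IP and host name rather than a pod-specific virtual IP.
spec:
  hostNetwork: true
  dnsPolicy: ClusterFirstWithHostNet   # keep cluster DNS resolution working
```

This is exactly why the host names diverge: the datanode reports the node's identity, while a Spark executor pod on the same node reports its own pod identity.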
