Put HDFS in Kubernetes #1

@kimoonkim

Description

Cc @foxish @ssuchter @ash211

As I mentioned in our weekly SIG meetings, I was able to throw together a prototype that runs HDFS daemons inside a Kubernetes cluster. I am about to send a PR with the helm chart files.

The prototype does the following:

  1. Runs a datanode daemon on every cluster node of your Kubernetes cluster, using a DaemonSet.
  2. Runs a single namenode daemon inside the cluster as a size-one StatefulSet.

Both the datanode and the namenode use hostPath local disk volumes. The Docker images use Hadoop version 2.7 and come from a public Docker Hub repo.
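To give a feel for the shape of this, here is a rough sketch of the datanode DaemonSet with a hostPath volume. All names, the image, and the paths are placeholders I made up for illustration; the helm chart files in the PR are authoritative:

```yaml
# Hypothetical sketch of the datanode DaemonSet, not the actual chart.
# A DaemonSet schedules one datanode pod onto every cluster node.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: hdfs-datanode
spec:
  selector:
    matchLabels:
      app: hdfs-datanode
  template:
    metadata:
      labels:
        app: hdfs-datanode
    spec:
      containers:
        - name: datanode
          image: some-repo/hadoop:2.7     # placeholder image name
          volumeMounts:
            - name: hdfs-data
              mountPath: /hadoop/dfs/data
      volumes:
        - name: hdfs-data
          hostPath:
            path: /hdfs-data              # local disk path on each cluster node
```

Because the data lives on a hostPath volume, block data survives pod restarts as long as the replacement pod lands on the same node.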

Using a StatefulSet gives the namenode a persistent host name, so a namenode address like hdfs://hdfs-namenode-0.hdfs-namenode.kube-system.svc.cluster.local:8020 remains the same across restarts. The datanode config relies on this address.
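One way the datanode config can rely on that stable address is via fs.defaultFS in core-site.xml, delivered through a ConfigMap. This is a hypothetical sketch (the ConfigMap name is made up; the value is the stable DNS name above):

```yaml
# Hypothetical ConfigMap embedding core-site.xml for the datanodes.
# fs.defaultFS points at the namenode's stable StatefulSet DNS name,
# which survives namenode pod restarts.
apiVersion: v1
kind: ConfigMap
metadata:
  name: hdfs-config   # made-up name
data:
  core-site.xml: |
    <configuration>
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://hdfs-namenode-0.hdfs-namenode.kube-system.svc.cluster.local:8020</value>
      </property>
    </configuration>
```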

There is no HA support for the namenode.

The namenode is pinned to a particular node using a k8s node label. This allows the namenode pod to return to the same node after a restart and keep the same filesystem data on its local disk.
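The pinning boils down to a nodeSelector in the namenode pod spec matching a label applied to one node. A sketch, with a made-up label key (the chart may use a different one):

```yaml
# Hypothetical fragment of the namenode StatefulSet pod spec.
# The target node must first be labeled, e.g.:
#   kubectl label nodes <node-name> hdfs-namenode-selector=hdfs-namenode-0
spec:
  nodeSelector:
    hdfs-namenode-selector: hdfs-namenode-0   # made-up label key/value
```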

For more details, please see the PR.

This is clearly not ready for production use. Our main focus at the moment is studying data locality.

The datanode daemons use the physical IP addresses of the cluster nodes via the k8s hostNetwork setting, whereas other application pods on those nodes, such as Spark executors, use pod-specific virtual IPs. The data locality optimization relies on matching executor host names with datanode host names, and these will never match in k8s as is, so we expect data locality to be broken. (I'm writing a detailed doc on this.) Hopefully, this prototype will serve as a good testbed for studying how we can fix it.
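The hostNetwork setup mentioned above amounts to a couple of fields on the datanode pod spec, roughly:

```yaml
# Fragment of the datanode DaemonSet pod spec. With hostNetwork the pod
# shares the node's network namespace, so the datanode registers with the
# node's physical IP and host name rather than a pod-specific virtual IP.
spec:
  hostNetwork: true
  dnsPolicy: ClusterFirstWithHostNet   # keep cluster DNS resolution working
```

This is exactly why the host names diverge: the datanode reports the node's identity, while a Spark executor pod on the same node reports its own pod identity.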
