Description
As I mentioned in our weekly SIG meetings, I was able to throw together a prototype that runs HDFS daemons inside a Kubernetes cluster. I am about to send a PR with the Helm chart files.
The prototype does the following (a minimal manifest sketch follows the list):
- Runs the `datanode` daemons on all cluster nodes of your Kubernetes cluster, using a `DaemonSet`.
- Runs a single `namenode` daemon inside the cluster as a size-one `StatefulSet`.
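For illustration, here is a minimal sketch of what the `datanode` `DaemonSet` could look like; the image name and other values are placeholders, not necessarily what the PR files use:

```yaml
# Hypothetical sketch of the datanode DaemonSet. The image name is a
# placeholder for the public Docker Hub image mentioned below.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: hdfs-datanode
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: hdfs-datanode
  template:
    metadata:
      labels:
        app: hdfs-datanode
    spec:
      # A DaemonSet schedules one datanode pod on every cluster node.
      containers:
        - name: datanode
          image: example/hadoop:2.7  # placeholder image
```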
Both the datanode and the namenode use hostPath local disk volumes. The Docker images use Hadoop version 2.7 and come from a public Docker Hub repo.
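The hostPath part would look roughly like this fragment of the pod spec; the host directory and mount path are illustrative placeholders:

```yaml
# Illustrative hostPath fragment: HDFS block data lives on the node's
# local disk and survives pod restarts on that node.
    spec:
      containers:
        - name: datanode
          image: example/hadoop:2.7  # placeholder image
          volumeMounts:
            - name: hdfs-data
              mountPath: /hadoop/dfs/data  # placeholder data dir inside the container
      volumes:
        - name: hdfs-data
          hostPath:
            path: /mnt/hdfs-data  # placeholder directory on the node's local disk
```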
Using a StatefulSet gives the namenode a persistent host name, so the namenode address, `hdfs://hdfs-namenode-0.hdfs-namenode.kube-system.svc.cluster.local:8020`, will remain the same across restarts. The datanode config uses this address.
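The stable DNS name comes from the headless `Service` that governs the `StatefulSet`. A sketch, assuming the names in the address above:

```yaml
# Headless Service (clusterIP: None) that gives the StatefulSet pod the
# stable per-pod DNS name
# hdfs-namenode-0.hdfs-namenode.kube-system.svc.cluster.local.
apiVersion: v1
kind: Service
metadata:
  name: hdfs-namenode
  namespace: kube-system
spec:
  clusterIP: None  # headless: per-pod DNS records instead of a virtual IP
  selector:
    app: hdfs-namenode
  ports:
    - name: fs
      port: 8020  # namenode filesystem port used in the address above
```

The datanode config would then point `fs.defaultFS` at that `hdfs://...:8020` address.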
There is no HA support for the namenode.
The namenode is pinned to a node using a k8s node label. This allows the namenode pod to go back to the same node upon restart and keep the same filesystem data on the local disk.
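The pinning is a plain `nodeSelector` on the namenode pod; a sketch, with a hypothetical label key:

```yaml
# Hypothetical node pinning: label one node first, e.g.
#   kubectl label nodes <node-name> hdfs-namenode-selector=hdfs-namenode-0
# then have the namenode pod spec select it, so a restarted pod lands on
# the same node and finds its filesystem data on the local disk.
    spec:
      nodeSelector:
        hdfs-namenode-selector: hdfs-namenode-0  # placeholder label key/value
```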
For more details, please see the PR.
This is clearly not ready for production use. Our main focus at the moment is studying data locality.
The datanode daemons use the physical IP addresses of cluster nodes, thanks to the k8s hostNetwork setup, whereas other application pods on cluster nodes, like Spark executors, use pod-specific virtual IPs. The data locality optimization relies on matching executor host names with datanode host names, which will never match in k8s as is. So we expect data locality to be broken. (I'm writing a detailed doc on this.) Hopefully, this prototype will serve as a good testbed for studying how we can possibly fix it.
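The hostNetwork part is just a couple of pod spec fields; a sketch of the relevant fragment:

```yaml
# Fragment of the datanode pod spec. With hostNetwork the pod shares the
# node's network namespace, so the datanode binds the node's physical IP
# and reports the node's hostname to the namenode.
    spec:
      hostNetwork: true
      dnsPolicy: ClusterFirstWithHostNet  # keep cluster DNS so the namenode address still resolves
```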