This repository was archived by the owner on Jan 9, 2020. It is now read-only.

Discuss HDFS data locality in Kubernetes #206

@kimoonkim

Description


A few weeks ago, I prototyped a way to run the HDFS namenode and datanodes in our Kubernetes cluster. (See the PR on apache-spark-on-k8s/Kubernetes-HDFS.)

I am hoping the setup can help us find out exactly how HDFS data locality is broken in k8s. I wrote a Google doc with the research: http://tiny.pepperdata.com/k8s-hdfs-locality. PTAL. I am posting the summary below as well for a quick read. The doc has diagrams and more details. Please share your comments in the Google doc, since it's much easier to have in-line discussions there.


HDFS locality, broken in Kubernetes

HDFS data locality relies on matching executor host names against datanode host names (see the sketch after this list):

  1. Node locality: If the host name of an executor matches that of a given datanode, the executor can read the data from the local datanode.
  2. Rack locality: The namenode has topology information, which is a list of host names with rack names as satellite values. If the namenode can retrieve an entry in its topology information using the executor host name as the key, we can determine the rack that the executor resides in. This means rack locality will work, i.e. the executor can read data from datanodes in the same rack.
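
To make the matching concrete, here is a minimal sketch (plain Scala, not actual Spark or HDFS code; the host names and the rack table are made up) of how both checks reduce to a host-name comparison plus a host-to-rack lookup:

```scala
object LocalityCheckSketch {
  // Hypothetical topology table as the namenode might hold it:
  // physical datanode host names mapped to rack names.
  val topology: Map[String, String] = Map(
    "datanode-1.example.com" -> "/rack-a",
    "datanode-2.example.com" -> "/rack-b"
  )

  sealed trait Locality
  case object NodeLocal extends Locality
  case object RackLocal extends Locality
  case object NoLocality extends Locality

  def localityFor(executorHost: String, datanodeHost: String): Locality =
    if (executorHost == datanodeHost) NodeLocal            // check (1): exact host-name match
    else (topology.get(executorHost), topology.get(datanodeHost)) match {
      case (Some(r1), Some(r2)) if r1 == r2 => RackLocal   // check (2): same rack in the topology
      case _                                => NoLocality
    }

  def main(args: Array[String]): Unit = {
    // Physical host name: both checks can succeed.
    println(localityFor("datanode-1.example.com", "datanode-1.example.com")) // NodeLocal
    // Virtual pod host name: neither the host match nor the topology lookup succeeds.
    println(localityFor("spark-exec-1-abc12", "datanode-1.example.com"))     // NoLocality
  }
}
```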

Node locality (1) is broken in Kubernetes. Each executor pod is assigned a virtual IP and thus a virtual host name, which will not match the datanodes’ physical host names.

Similarly, rack locality (2) is broken. The executor host name will fail to match any topology entry in the namenode, so we’ll fail to determine the rack name.

Locality-aware layers

Here, we look at existing locality-aware layers and figure out how to fix the locality.

When Spark reads data from HDFS, it can increase the read throughput by sending tasks to executors that can access the needed disk data on the same node, or on another node in the same rack. This locality-aware execution is implemented in three different layers (a sketch of the block-location metadata they all start from follows the list):

  1. Executor scheduling: When the Spark Driver launches executors, it may ask the cluster scheduler to consider a list of candidate hosts and racks that it prefers. The Driver gets this list by asking the namenode which datanode hosts have the input data of the application. (In YARN, the optimization in this layer is triggered only when Spark dynamic allocation is enabled.) See Appendix A for the detailed code snippets.
  2. Task dispatching: Once it gets executors, the Spark Driver will dispatch tasks to them. For each task, the Driver tries to send it to one of the right executors, i.e. one that has the task's partition on a local disk or in the local rack. For this, the Driver builds hosts-to-tasks and racks-to-tasks mappings from the datanode info returned by the namenode, and later looks up the maps using executor host names. See Appendix B for details.
  3. Reading HDFS partitions: When a task actually runs on an executor, it will read its partition block from a datanode host. The HDFS read library asks the namenode to return multiple candidate datanode hosts, each of which has a copy of the data. The namenode sorts the result in order of proximity to the client host, and the client picks the first one in the returned list. See Appendix C for details.
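
For context, all three layers start from the same block-location metadata the namenode hands out. Below is a minimal sketch using the standard Hadoop client API (the input path is hypothetical):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object BlockLocationsSketch {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    val fs   = FileSystem.get(conf)

    // Hypothetical input file; in Spark this would be the application's input data.
    val status = fs.getFileStatus(new Path("/data/input.parquet"))

    // The namenode returns, per block, the datanode hosts holding a replica
    // and their rack-qualified topology paths.
    val locations = fs.getFileBlockLocations(status, 0, status.getLen)
    locations.foreach { loc =>
      println(s"offset=${loc.getOffset} hosts=${loc.getHosts.mkString(",")} " +
        s"racks=${loc.getTopologyPaths.mkString(",")}")
    }

    // Layers (1)-(3) all compare these host and rack names against executor
    // host names, which is exactly where virtual pod host names break the match.
    fs.close()
  }
}
```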

Layer (1) is not necessarily broken for k8s, because we can probably use k8s node selection to express the preference, as sketched below. However, the correct implementation only exists in YARN-related code; we'll need to generalize and reuse that code in the right way for k8s.
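
One possible direction, sketched below as an assumption rather than a working implementation: translate the namenode's preferred hosts into a soft node-affinity term on the executor pods. Plain nested maps stand in for the real Kubernetes client objects; only the field names mirror the Kubernetes API.

```scala
object ExecutorAffinitySketch {
  // Preferred datanode hosts, as the namenode might report them for the input data (made up).
  val preferredHosts: Seq[String] = Seq("datanode-1.example.com", "datanode-2.example.com")

  // Build the nodeAffinity stanza an executor pod spec could carry, expressed
  // here as nested maps mirroring the Kubernetes API field names.
  def nodeAffinity(hosts: Seq[String]): Map[String, Any] = Map(
    "nodeAffinity" -> Map(
      // "preferred" (soft) rather than "required": fall back to any node
      // if none of the preferred hosts has free capacity.
      "preferredDuringSchedulingIgnoredDuringExecution" -> Seq(
        Map(
          "weight" -> 100,
          "preference" -> Map(
            "matchExpressions" -> Seq(
              Map(
                "key"      -> "kubernetes.io/hostname",
                "operator" -> "In",
                "values"   -> hosts
              )
            )
          )
        )
      )
    )
  )

  def main(args: Array[String]): Unit =
    println(nodeAffinity(preferredHosts))
}
```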

Layer (2) is broken. The Spark Driver will build the mappings using the datanode host names, and then look up the maps using the executor pod host names, which will never match.
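
A toy illustration of the failing lookup (made-up hosts and task indices, loosely modeled on the pending-task maps the Driver keeps per host):

```scala
object TaskDispatchSketch {
  // Host -> pending task indices, built from the datanode hosts the namenode
  // reported for each partition (made-up values).
  val pendingTasksForHost: Map[String, Seq[Int]] = Map(
    "datanode-1.example.com" -> Seq(0, 2),
    "datanode-2.example.com" -> Seq(1, 3)
  )

  // When an executor has free slots, the Driver looks up node-local tasks
  // by the executor's host name.
  def tasksFor(executorHost: String): Seq[Int] =
    pendingTasksForHost.getOrElse(executorHost, Seq.empty)

  def main(args: Array[String]): Unit = {
    println(tasksFor("datanode-1.example.com")) // List(0, 2): node-local tasks found
    println(tasksFor("spark-exec-1-abc12"))     // List(): pod host name never matches
  }
}
```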

Layer (3) is also broken. The namenode will see the executor pod's host name as the client host name and compare it against datanode host names to sort the datanode list. The resulting list will not be in the correct order.
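
A rough model of that sort (not the actual HDFS topology code; the distance values only mimic the usual 0 = same host, 2 = same rack, 4 = off-rack convention):

```scala
object ReplicaSortSketch {
  // Host -> rack, as the namenode's topology would resolve it (made up).
  val rackOf: Map[String, String] = Map(
    "datanode-1.example.com"  -> "/rack-a",
    "datanode-2.example.com"  -> "/rack-b",
    "client-host.example.com" -> "/rack-a"
  )

  // Rough distance model: 0 = same host, 2 = same rack, 4 = off-rack or unknown.
  def distance(client: String, datanode: String): Int =
    if (client == datanode) 0
    else (rackOf.get(client), rackOf.get(datanode)) match {
      case (Some(a), Some(b)) if a == b => 2
      case _                            => 4
    }

  // The namenode sorts the replica list by proximity to the client host.
  def sortReplicas(client: String, replicas: Seq[String]): Seq[String] =
    replicas.sortBy(distance(client, _))

  def main(args: Array[String]): Unit = {
    val replicas = Seq("datanode-2.example.com", "datanode-1.example.com")
    // Known client host: the same-rack replica moves to the front.
    println(sortReplicas("client-host.example.com", replicas))
    // Pod host name unknown to the topology: every distance is 4, so the
    // order is effectively arbitrary and may not favor the close replica.
    println(sortReplicas("spark-exec-1-abc12", replicas))
  }
}
```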
