
Commit bd2cb26

Arthur Rand authored
[SPARK-468] kerberos docs round2 (apache#225)

* wip
* wip
* unchange Dockerfile
* unchange Dockerfile
* add to TOC
* addressed comments
* remove references to starting secure clusters
* add nobody limitation to doc
* add note about root

1 parent da5174d commit bd2cb26

File tree

8 files changed: +176 −83 lines changed


docs/hdfs.md

Lines changed: 10 additions & 79 deletions
@@ -1,13 +1,15 @@
 ---
-post_title: Configure Spark for HDFS
+post_title: Integration with HDFS
 nav_title: HDFS
 menu_order: 20
 enterprise: 'no'
 ---
 
-You can configure Spark for a specific HDFS cluster.
 
-To configure `hdfs.config-url` to be a URL that serves your `hdfs-site.xml` and `core-site.xml`, use this example where `http://mydomain.com/hdfs-config/hdfs-site.xml` and `http://mydomain.com/hdfs-config/core-site.xml` are valid URLs:
+If you plan to read and write from HDFS using Spark, two Hadoop configuration files should be included on Spark's classpath: `hdfs-site.xml`, which provides default behaviors for the HDFS client, and `core-site.xml`, which sets the default filesystem name. You can specify the location of these files at install time or for each job.
+
+# Spark Installation
+Within the Spark service configuration, set `hdfs.config-url` to a URL that serves your `hdfs-site.xml` and `core-site.xml`, as in this example, where `http://mydomain.com/hdfs-config/hdfs-site.xml` and `http://mydomain.com/hdfs-config/core-site.xml` are valid URLs:
 
 ```json
 {
@@ -16,12 +18,12 @@ To configure `hdfs.config-url` to be a URL that serves your `hdfs-site.xml` and
   }
 }
 ```
+This can also be done through the UI. If you are using the default installation of HDFS from Mesosphere, this is probably `http://api.hdfs.marathon.l4lb.thisdcos.directory/v1/endpoints`.
 
-For more information, see [Inheriting Hadoop Cluster Configuration][8].
-
-For DC/OS HDFS, these configuration files are served at `http://<hdfs.framework-name>.marathon.mesos:<port>/v1/endpoints`, where `<hdfs.framework-name>` is a configuration variable set in the HDFS package, and `<port>` is the port of its marathon app.
+# Adding HDFS files per-job
+To add the configuration files manually for a job, use `--conf spark.mesos.uris=<location_of_hdfs-site.xml>,<location_of_core-site.xml>`. This will download the files to the sandbox of the Spark driver, and DC/OS Spark will automatically load them into the correct location. **Note:** It is important that these files are named `hdfs-site.xml` and `core-site.xml`. A sketch of such a submission follows.
 
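For illustration, a per-job submission using these URIs might look like the following sketch (the URLs are the placeholder addresses from the install example above; the job class, URL, and arguments are placeholders):

    dcos spark run --submit-args="\
    --conf spark.mesos.uris=http://mydomain.com/hdfs-config/hdfs-site.xml,http://mydomain.com/hdfs-config/core-site.xml \
    --class MySparkJob <url> <args>"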
-### Spark Checkpointing
+## Spark Checkpointing
 
 In order to use Spark with checkpointing, make sure you follow the instructions [here](https://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing) and use an HDFS directory as the checkpointing directory. For example:
 ```
@@ -31,77 +33,6 @@ ssc.checkpoint(checkpointDirectory)
 ```
 That HDFS directory will be automatically created on HDFS, and the Spark streaming app will work from checkpointed data even in the presence of application restarts/failures. An expanded sketch follows.
 
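As a minimal sketch of what that setup might look like in a Scala streaming application (the app name, batch interval, and HDFS path are illustrative):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Illustrative HDFS path; any HDFS directory the job can write to will do.
    val checkpointDirectory = "hdfs:///checkpoints/my-streaming-app"

    // getOrCreate either restores the context from checkpointed data after a
    // restart, or builds a fresh one on the first run.
    val ssc = StreamingContext.getOrCreate(checkpointDirectory, () => {
      val conf = new SparkConf().setAppName("CheckpointedApp")
      val newSsc = new StreamingContext(conf, Seconds(10))
      newSsc.checkpoint(checkpointDirectory) // enable checkpointing to HDFS
      newSsc
    })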
-# HDFS Kerberos
-
-You can access external (i.e. non-DC/OS) Kerberos-secured HDFS clusters from Spark on Mesos.
-
-## HDFS Configuration
-
-After you've set up a Kerberos-enabled HDFS cluster, configure Spark to connect to it. See instructions [here](#hdfs).
-
-## Installation
-
-1. A `krb5.conf` file tells Spark how to connect to your KDC. Base64 encode this file:
-
-        cat krb5.conf | base64
-
-1. Add the following to your JSON configuration file to enable Kerberos in Spark:
-
-        {
-            "security": {
-                "kerberos": {
-                    "krb5conf": "<base64 encoding>"
-                }
-            }
-        }
-
-1. If you've enabled the history server via `history-server.enabled`, you must also configure the principal and keytab for the history server. **WARNING**: The keytab contains secrets, so you should ensure you have SSL enabled while installing DC/OS Apache Spark.
-
-    Base64 encode your keytab:
-
-        cat spark.keytab | base64
-
-    And add the following to your configuration file:
-
-        {
-            "history-server": {
-                "kerberos": {
-                    "principal": "spark@REALM",
-                    "keytab": "<base64 encoding>"
-                }
-            }
-        }
-
-1. Install Spark with your custom configuration, here called `options.json`:
-
-        dcos package install --options=options.json spark
-
-## Job Submission
-
-To authenticate to a Kerberos KDC, DC/OS Apache Spark supports keytab files as well as ticket-granting tickets (TGTs).
-
-Keytabs are valid infinitely, while tickets can expire. Especially for long-running streaming jobs, keytabs are recommended.
-
-### Keytab Authentication
-
-Submit the job with the keytab:
-
-    dcos spark run --submit-args="\
-    --kerberos-principal user@REALM \
-    --keytab-secret-path /__dcos_base64__hdfs-keytab \
-    --conf ... --class MySparkJob <url> <args>"
-
-### TGT Authentication
-
-Submit the job with the ticket:
-```$bash
-dcos spark run --submit-args="\
---kerberos-principal hdfs/name-0-node.hdfs.autoip.dcos.thisdcos.directory@LOCAL \
---tgt-secret-path /__dcos_base64__tgt \
---conf ... --class MySparkJob <url> <args>"
-```
-
-**Note:** These credentials are security-critical. The DC/OS Secret Store requires you to base64 encode binary secrets (such as the Kerberos keytab) before adding them. If they are uploaded with the `__dcos_base64__` prefix, they are automatically decoded when the secret is made available to your Spark job. If the secret name **doesn't** have this prefix, the keytab will be decoded and written to a file in the sandbox. This leaves the secret exposed and is not recommended. We also highly recommend configuring SSL encryption between the Spark components when accessing Kerberos-secured HDFS clusters. See the Security section for information on how to do this.
-
 
 [8]: http://spark.apache.org/docs/latest/configuration.html#inheriting-hadoop-cluster-configuration
+[9]: https://docs.mesosphere.com/service-docs/spark/2.1.0-2.2.0-1/limitations/

docs/kerberos.md

Lines changed: 126 additions & 0 deletions
@@ -0,0 +1,126 @@
+---
+post_title: Kerberos
+nav_title: Kerberos
+menu_order: 120
+enterprise: 'no'
+---
+
+
+# HDFS Kerberos
+
+Kerberos is an authentication system that allows Spark to retrieve and write data securely to a Kerberos-enabled HDFS cluster. As of Mesosphere Spark `2.2.0-2`, long-running jobs will renew their delegation tokens (authentication credentials). This section assumes you have previously set up a Kerberos-enabled HDFS cluster. **Note:** Depending on your OS, Spark may need to be run as `root` in order to authenticate with your Kerberos-enabled service. This can be done by setting `--conf spark.mesos.driverEnv.SPARK_USER=root` when submitting your job, as in the sketch below.
+
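For illustration, such a submission might look like this minimal sketch (the principal, class, URL, and arguments are placeholders):

    dcos spark run --submit-args="\
    --conf spark.mesos.driverEnv.SPARK_USER=root \
    --kerberos-principal user@REALM \
    --conf ... --class MySparkJob <url> <args>"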
+## Spark Installation
+
+Spark (and all Kerberos-enabled components) needs a valid `krb5.conf` file. You can set up the Spark service to use a single `krb5.conf` file for all of its drivers.
+
+1. A `krb5.conf` file tells Spark how to connect to your KDC. Base64 encode this file:
+
+        cat krb5.conf | base64 -w 0
+
+1. Put the encoded file (as a string) into your JSON configuration file:
+
+        {
+            "security": {
+                "kerberos": {
+                    "krb5conf": "<base64 encoding>"
+                }
+            }
+        }
+
+    Your configuration will probably also have the `hdfs` parameters from above:
+
+        {
+            "service": {
+                "name": "kerberized-spark",
+                "user": "nobody"
+            },
+            "hdfs": {
+                "config-url": "http://api.hdfs.marathon.l4lb.thisdcos.directory/v1/endpoints"
+            },
+            "security": {
+                "kerberos": {
+                    "krb5conf": "<base64_encoding>"
+                }
+            }
+        }
+
+1. Install Spark with your custom configuration, here called `options.json`:
+
+        dcos package install --options=/path/to/options.json spark
+
+1. Make sure your keytab is accessible from the DC/OS [Secret Store](https://docs.mesosphere.com/latest/security/secrets/) (a sketch of uploading one follows these steps).
+
+1. If you've enabled the history server via `history-server.enabled`, you must also configure the principal and keytab for the history server. **WARNING**: The keytab contains secrets; in the current history server package the keytab is not stored securely. See [Limitations](https://docs.mesosphere.com/service-docs/spark/2.1.0-2.2.0-1/limitations/).
+
+    Base64 encode your keytab:
+
+        cat spark.keytab | base64
+
+    And add the following to your configuration file:
+
+        {
+            "history-server": {
+                "kerberos": {
+                    "principal": "spark@REALM",
+                    "keytab": "<base64 encoding>"
+                }
+            }
+        }
+
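As a sketch of uploading a keytab to the Secret Store, assuming the Enterprise DC/OS CLI with its `security` subcommand is installed (exact flag names may vary between CLI versions, and the secret path is illustrative):

    # Binary secrets must be base64 encoded before upload; the __dcos_base64__
    # prefix tells DC/OS to decode the secret when surfacing it to your job.
    base64 -w 0 spark.keytab > spark.keytab.base64
    dcos security secrets create --value-file=spark.keytab.base64 /__dcos_base64__hdfs-keytab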
+## Job Submission
+
+To authenticate to a Kerberos KDC, Spark on Mesos supports keytab files as well as ticket-granting tickets (TGTs). Keytabs are valid indefinitely, while tickets can expire; keytabs are therefore recommended, especially for long-running streaming jobs.
+
+### Keytab Authentication
+
+Submit the job with the keytab:
+
+    dcos spark run --submit-args="\
+    --kerberos-principal user@REALM \
+    --keytab-secret-path /__dcos_base64__hdfs-keytab \
+    --conf ... --class MySparkJob <url> <args>"
+
+### TGT Authentication
+
+Submit the job with the ticket:
+
+    dcos spark run --submit-args="\
+    --kerberos-principal hdfs/name-0-node.hdfs.autoip.dcos.thisdcos.directory@LOCAL \
+    --tgt-secret-path /__dcos_base64__tgt \
+    --conf ... --class MySparkJob <url> <args>"
+
+**Note:** You can access external (i.e. non-DC/OS) Kerberos-secured HDFS clusters from Spark on Mesos.
+
+**Note:** These credentials are security-critical. The DC/OS Secret Store requires you to base64 encode binary secrets (such as the Kerberos keytab) before adding them. If they are uploaded with the `__dcos_base64__` prefix, they are automatically decoded when the secret is made available to your Spark job. If the secret name **doesn't** have this prefix, the keytab will be decoded and written to a file in the sandbox. This leaves the secret exposed and is not recommended.
+
+# Kafka Kerberos
+
+Spark can consume data from a Kerberos-enabled Kafka cluster. Connecting Spark to secure Kafka does not require special installation parameters; however, it does require that the Spark driver _and_ the Spark executors can access the following files:
+
+* Client JAAS (Java Authentication and Authorization Service) file. This is provided using Mesos URIs with `--conf spark.mesos.uris=<location_of_jaas>`.
+* `krb5.conf` for your Kerberos setup. Similarly to HDFS, this is provided using a base64 encoding of the file:
+
+        cat krb5.conf | base64 -w 0
+
+    Then assign this value to the `KRB5_CONFIG_BASE64` environment variable for the driver and the executors:
+
+        --conf spark.mesos.driverEnv.KRB5_CONFIG_BASE64=<base64_encoded_string>
+        --conf spark.executorEnv.KRB5_CONFIG_BASE64=<base64_encoded_string>
+
+* The `keytab` containing the credentials for accessing the Kafka cluster:
+
+        --conf spark.mesos.driver.secret.names=<secret_name>            # e.g. __dcos_base64__kafka_keytab
+        --conf spark.mesos.driver.secret.filenames=<keytab_file_name>   # e.g. kafka.keytab
+        --conf spark.mesos.executor.secret.names=<secret_name>          # e.g. __dcos_base64__kafka_keytab
+        --conf spark.mesos.executor.secret.filenames=<keytab_file_name> # e.g. kafka.keytab
+
+Finally, you'll likely need to tell Spark to use the JAAS file:
+
+    --conf spark.driver.extraJavaOptions=-Djava.security.auth.login.config=/mnt/mesos/sandbox/<jaas_file>
+    --conf spark.executor.extraJavaOptions=-Djava.security.auth.login.config=/mnt/mesos/sandbox/<jaas_file>
+
+It is important that the filename is the same for the driver and executor keytab file (`<keytab_file_name>` above) and that this file is properly addressed in your JAAS file. For a worked example of a Spark consumer reading from secure Kafka, see the [advanced examples](https://docs.mesosphere.com/service-docs/spark/2.1.1-2.2.0-2/usage-examples/).

docs/limitations.md

Lines changed: 5 additions & 1 deletion
@@ -1,6 +1,6 @@
 ---
 post_title: Limitations
-menu_order: 130
+menu_order: 135
 feature_maturity: ""
 enterprise: 'no'
 ---
@@ -15,3 +15,7 @@ enterprise: 'no'
 if you specify environment-based secrets with `spark.mesos.[driver|executor].secret.envkeys`,
 the keystore and truststore secrets will also show up as environment-based secrets,
 due to the way secrets are implemented. You can ignore these extra environment variables.
+
+* When using Kerberos and HDFS, the Spark driver generates delegation tokens and distributes them to its executors via RPC. Authentication of the executors with the driver is done with a [shared secret](https://spark.apache.org/docs/latest/security.html#spark-security). Without authentication, it is possible for executor containers to register with the driver and retrieve the delegation tokens. Currently, for Spark on Mesos this requires manually configuring Spark to use authentication and setting the secret (see the sketch after this list). Mesosphere is actively working to make this an automated and secure process in future releases.
+
+* Spark runs all of its components in Docker containers. Since the Docker image contains a full Linux userspace with its own `/etc/passwd` file, it is possible for the default service user `nobody` to have a different UID inside the container than on the host system. Although user `nobody` has UID 65534 by convention on many systems, this is not always the case. As Mesos does not perform UID mapping between Linux user namespaces, specifying a service user of `nobody` in this case will cause access failures when the container user attempts to open or execute a filesystem resource owned by a user with a different UID, preventing the service from launching. If the hosts in your cluster have a UID for `nobody` other than 65534, you will need to specify a service user of `root` to run DC/OS Spark successfully.
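As a sketch of the manual setup referenced in the delegation-token limitation above, Spark's standard RPC authentication settings could be passed at submission time (assuming these settings behave the same on Mesos; the secret value is a placeholder):

    dcos spark run --submit-args="\
    --conf spark.authenticate=true \
    --conf spark.authenticate.secret=<shared_secret> \
    --conf ... --class MySparkJob <url> <args>"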

docs/security.md

Lines changed: 1 addition & 1 deletion
@@ -15,7 +15,7 @@ Follow these instructions to [authenticate in strict mode](https://docs.mesosphe
 SSL support in DC/OS Apache Spark encrypts the following channels:
 
 * From the [DC/OS admin router][11] to the dispatcher.
-* From the drivers to their executors.
+* Files served from the drivers to their executors.
 
 To enable SSL, a Java keystore (and, optionally, truststore) must be provided, along
 with their passwords. The first three settings below are **required** during job

docs/table-of-contents.md

Lines changed: 2 additions & 0 deletions
@@ -14,6 +14,8 @@
 - [Interactive Spark Shell](spark-shell.md)
 - [Fault Tolerance](fault-tolerance.md)
 - [Job Scheduling](job-scheduling.md)
+- [Kerberos](kerberos.md)
+- [Usage Examples](usage-examples.md)
 - [Troubleshooting](troubleshooting.md)
 - [Spark Versions](spark-versions.md)
 - [Version Policy](version-policy.md)

docs/troubleshooting.md

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 ---
 post_title: Troubleshooting
-menu_order: 120
+menu_order: 125
 enterprise: 'no'
 ---
 

docs/usage-examples.md

Lines changed: 30 additions & 0 deletions
@@ -22,3 +22,33 @@ enterprise: 'no'
 1. View your job:
 
     Visit the Spark cluster dispatcher at `http://<dcos-url>/service/spark/` to view the status of your job. Also visit the Mesos UI at `http://<dcos-url>/mesos/` to see job logs.
+
+## Advanced
+
+* Run a Spark Streaming job with Kafka: Examples of Spark Streaming applications that connect to a secure Kafka cluster can be found at [spark-build](https://github.com/mesosphere/spark-build/blob/beta-2.1.1-2.2.0-2/tests/jobs/scala/src/main/scala/KafkaJobs.scala). As mentioned in the [kerberos](https://docs.mesosphere.com/service-docs/spark/2.1.0-2.2.0-2/kerberos/) section, Spark requires a JAAS file, the `krb5.conf`, and the keytab. An example of the JAAS file is:
+
+        KafkaClient {
+            com.sun.security.auth.module.Krb5LoginModule required
+            useKeyTab=true
+            storeKey=true
+            keyTab="/mnt/mesos/sandbox/kafka-client.keytab"
+            useTicketCache=false
+            serviceName="kafka"
+            principal="client@LOCAL";
+        };
+
+    The corresponding `dcos spark` command would be:
+
+        dcos spark run --submit-args="\
+        --conf spark.mesos.containerizer=mesos \  # required for secrets
+        --conf spark.mesos.uris=<URI_of_jaas.conf> \
+        --conf spark.mesos.driver.secret.names=__dcos_base64___keytab \  # __dcos_base64__ prefix required for decoding base64-encoded binary secrets
+        --conf spark.mesos.driver.secret.filenames=kafka-client.keytab \
+        --conf spark.mesos.executor.secret.names=__dcos_base64___keytab \
+        --conf spark.mesos.executor.secret.filenames=kafka-client.keytab \
+        --conf spark.mesos.task.labels=DCOS_SPACE:/spark \
+        --conf spark.scheduler.minRegisteredResourcesRatio=1.0 \
+        --conf spark.executorEnv.KRB5_CONFIG_BASE64=W2xpYmRlZmF1bHRzXQpkZWZhdWx0X3JlYWxtID0gTE9DQUwKCltyZWFsbXNdCiAgTE9DQUwgPSB7CiAgICBrZGMgPSBrZGMubWFyYXRob24uYXV0b2lwLmRjb3MudGhpc2Rjb3MuZGlyZWN0b3J5OjI1MDAKICB9Cg== \
+        --conf spark.mesos.driverEnv.KRB5_CONFIG_BASE64=W2xpYmRlZmF1bHRzXQpkZWZhdWx0X3JlYWxtID0gTE9DQUwKCltyZWFsbXNdCiAgTE9DQUwgPSB7CiAgICBrZGMgPSBrZGMubWFyYXRob24uYXV0b2lwLmRjb3MudGhpc2Rjb3MuZGlyZWN0b3J5OjI1MDAKICB9Cg== \
+        --class MyAppClass <URL_of_jar> [application args]"

docs/version-policy.md

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 ---
 post_title: Version Policy
-menu_order: 125
+menu_order: 130
 feature_maturity: ""
 enterprise: 'no'
 ---
