
Commit bd2cb26

Arthur Rand authored
[SPARK-468] kerberos docs round2 (apache#225)

* wip
* wip
* unchange Dockerfile
* unchange Dockerfile
* add to TOC
* addressed comments
* remove references to starting secure clusters
* add nobody limitation to doc
* add note about root

1 parent da5174d commit bd2cb26

File tree

8 files changed: +176 −83 lines changed


docs/hdfs.md

Lines changed: 10 additions & 79 deletions
@@ -1,13 +1,15 @@
 ---
-post_title: Configure Spark for HDFS
+post_title: Integration with HDFS
 nav_title: HDFS
 menu_order: 20
 enterprise: 'no'
 ---
 
-You can configure Spark for a specific HDFS cluster.
 
-To configure `hdfs.config-url` to be a URL that serves your `hdfs-site.xml` and `core-site.xml`, use this example where `http://mydomain.com/hdfs-config/hdfs-site.xml` and `http://mydomain.com/hdfs-config/core-site.xml` are valid URLs:
+If you plan to read and write from HDFS using Spark, two Hadoop configuration files should be included on Spark's classpath: `hdfs-site.xml`, which provides default behaviors for the HDFS client, and `core-site.xml`, which sets the default filesystem name. You can specify the location of these files at install time or for each job.
+
+# Spark Installation
+Within the Spark service configuration, set `hdfs.config-url` to a URL that serves your `hdfs-site.xml` and `core-site.xml`, as in this example, where `http://mydomain.com/hdfs-config/hdfs-site.xml` and `http://mydomain.com/hdfs-config/core-site.xml` are valid URLs:
 
 ```json
 {
@@ -16,12 +18,12 @@ To configure `hdfs.config-url` to be a URL that serves your `hdfs-site.xml` and
   }
 }
 ```
+This can also be done through the UI. If you are using the default installation of HDFS from Mesosphere, this is probably `http://api.hdfs.marathon.l4lb.thisdcos.directory/v1/endpoints`.
 
-For more information, see [Inheriting Hadoop Cluster Configuration][8].
-
-For DC/OS HDFS, these configuration files are served at `http://<hdfs.framework-name>.marathon.mesos:<port>/v1/endpoints`, where `<hdfs.framework-name>` is a configuration variable set in the HDFS package, and `<port>` is the port of its marathon app.
+# Adding HDFS files per-job
+To add the configuration files manually for a job, use `--conf spark.mesos.uris=<location_of_hdfs-site.xml>,<location_of_core-site.xml>`. This will download the files to the sandbox of the Spark driver, and DC/OS Spark will automatically load them into the correct location. **Note:** It is important that these files are named `hdfs-site.xml` and `core-site.xml`. A sketch of such a submission follows.
 
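For illustration, a per-job submission using these URIs might look like the following sketch (the URLs are the placeholder addresses from the install example above; the job class, URL, and arguments are placeholders):

    dcos spark run --submit-args="\
    --conf spark.mesos.uris=http://mydomain.com/hdfs-config/hdfs-site.xml,http://mydomain.com/hdfs-config/core-site.xml \
    --class MySparkJob <url> <args>"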
-### Spark Checkpointing
+## Spark Checkpointing
 
 In order to use Spark with checkpointing, make sure you follow the instructions [here](https://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing) and use an HDFS directory as the checkpointing directory. For example:
 ```
@@ -31,77 +33,6 @@ ssc.checkpoint(checkpointDirectory)
 ```
 That HDFS directory will be automatically created on HDFS, and the Spark streaming app will work from checkpointed data even in the presence of application restarts/failures. An expanded sketch follows.
 
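As a minimal sketch of what that setup might look like in a Scala streaming application (the app name, batch interval, and HDFS path are illustrative):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Illustrative HDFS path; any HDFS directory the job can write to will do.
    val checkpointDirectory = "hdfs:///checkpoints/my-streaming-app"

    // getOrCreate either restores the context from checkpointed data after a
    // restart, or builds a fresh one on the first run.
    val ssc = StreamingContext.getOrCreate(checkpointDirectory, () => {
      val conf = new SparkConf().setAppName("CheckpointedApp")
      val newSsc = new StreamingContext(conf, Seconds(10))
      newSsc.checkpoint(checkpointDirectory) // enable checkpointing to HDFS
      newSsc
    })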
-# HDFS Kerberos
-
-You can access external (i.e. non-DC/OS) Kerberos-secured HDFS clusters from Spark on Mesos.
-
-## HDFS Configuration
-
-After you've set up a Kerberos-enabled HDFS cluster, configure Spark to connect to it. See instructions [here](#hdfs).
-
-## Installation
-
-1. A `krb5.conf` file tells Spark how to connect to your KDC. Base64 encode this file:
-
-        cat krb5.conf | base64
-
-1. Add the following to your JSON configuration file to enable Kerberos in Spark:
-
-        {
-            "security": {
-                "kerberos": {
-                    "krb5conf": "<base64 encoding>"
-                }
-            }
-        }
-
-1. If you've enabled the history server via `history-server.enabled`, you must also configure the principal and keytab for the history server. **WARNING**: The keytab contains secrets, so you should ensure you have SSL enabled while installing DC/OS Apache Spark.
-
-    Base64 encode your keytab:
-
-        cat spark.keytab | base64
-
-    And add the following to your configuration file:
-
-        {
-            "history-server": {
-                "kerberos": {
-                    "principal": "spark@REALM",
-                    "keytab": "<base64 encoding>"
-                }
-            }
-        }
-
-1. Install Spark with your custom configuration, here called `options.json`:
-
-        dcos package install --options=options.json spark
-
-## Job Submission
-
-To authenticate to a Kerberos KDC, DC/OS Apache Spark supports keytab files as well as ticket-granting tickets (TGTs).
-
-Keytabs are valid infinitely, while tickets can expire. Especially for long-running streaming jobs, keytabs are recommended.
-
-### Keytab Authentication
-
-Submit the job with the keytab:
-
-    dcos spark run --submit-args="\
-    --kerberos-principal user@REALM \
-    --keytab-secret-path /__dcos_base64__hdfs-keytab \
-    --conf ... --class MySparkJob <url> <args>"
-
-### TGT Authentication
-
-Submit the job with the ticket:
-```$bash
-dcos spark run --submit-args="\
---kerberos-principal hdfs/name-0-node.hdfs.autoip.dcos.thisdcos.directory@LOCAL \
---tgt-secret-path /__dcos_base64__tgt \
---conf ... --class MySparkJob <url> <args>"
-```
-
-**Note:** These credentials are security-critical. The DC/OS Secret Store requires you to base64 encode binary secrets (such as the Kerberos keytab) before adding them. If they are uploaded with the `__dcos_base64__` prefix, they are automatically decoded when the secret is made available to your Spark job. If the secret name **doesn't** have this prefix, the keytab will be decoded and written to a file in the sandbox. This leaves the secret exposed and is not recommended. We also highly recommend configuring SSL encryption between the Spark components when accessing Kerberos-secured HDFS clusters. See the Security section for information on how to do this.
-
 
 [8]: http://spark.apache.org/docs/latest/configuration.html#inheriting-hadoop-cluster-configuration
+[9]: https://docs.mesosphere.com/service-docs/spark/2.1.0-2.2.0-1/limitations/

docs/kerberos.md

Lines changed: 126 additions & 0 deletions
@@ -0,0 +1,126 @@
+---
+post_title: Kerberos
+nav_title: Kerberos
+menu_order: 120
+enterprise: 'no'
+---
+
+
+# HDFS Kerberos
+
+Kerberos is an authentication system that allows Spark to retrieve and write data securely to a Kerberos-enabled HDFS cluster. As of Mesosphere Spark `2.2.0-2`, long-running jobs will renew their delegation tokens (authentication credentials). This section assumes you have previously set up a Kerberos-enabled HDFS cluster. **Note:** Depending on your OS, Spark may need to be run as `root` in order to authenticate with your Kerberos-enabled service. This can be done by setting `--conf spark.mesos.driverEnv.SPARK_USER=root` when submitting your job, as in the sketch below.
+
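For illustration, such a submission might look like this minimal sketch (the principal, class, URL, and arguments are placeholders):

    dcos spark run --submit-args="\
    --conf spark.mesos.driverEnv.SPARK_USER=root \
    --kerberos-principal user@REALM \
    --conf ... --class MySparkJob <url> <args>"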
+## Spark Installation
+
+Spark (and all Kerberos-enabled components) needs a valid `krb5.conf` file. You can set up the Spark service to use a single `krb5.conf` file for all of its drivers.
+
+1. A `krb5.conf` file tells Spark how to connect to your KDC. Base64 encode this file:
+
+        cat krb5.conf | base64 -w 0
+
+1. Put the encoded file (as a string) into your JSON configuration file:
+
+        {
+            "security": {
+                "kerberos": {
+                    "krb5conf": "<base64 encoding>"
+                }
+            }
+        }
+
+    Your configuration will probably also have the `hdfs` parameters from above:
+
+        {
+            "service": {
+                "name": "kerberized-spark",
+                "user": "nobody"
+            },
+            "hdfs": {
+                "config-url": "http://api.hdfs.marathon.l4lb.thisdcos.directory/v1/endpoints"
+            },
+            "security": {
+                "kerberos": {
+                    "krb5conf": "<base64_encoding>"
+                }
+            }
+        }
+
+1. Install Spark with your custom configuration, here called `options.json`:
+
+        dcos package install --options=/path/to/options.json spark
+
+1. Make sure your keytab is accessible from the DC/OS [Secret Store](https://docs.mesosphere.com/latest/security/secrets/) (a sketch of uploading one follows these steps).
+
+1. If you've enabled the history server via `history-server.enabled`, you must also configure the principal and keytab for the history server. **WARNING**: The keytab contains secrets; in the current history server package the keytab is not stored securely. See [Limitations](https://docs.mesosphere.com/service-docs/spark/2.1.0-2.2.0-1/limitations/).
+
+    Base64 encode your keytab:
+
+        cat spark.keytab | base64
+
+    And add the following to your configuration file:
+
+        {
+            "history-server": {
+                "kerberos": {
+                    "principal": "spark@REALM",
+                    "keytab": "<base64 encoding>"
+                }
+            }
+        }
+
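As a sketch of uploading a keytab to the Secret Store, assuming the Enterprise DC/OS CLI with its `security` subcommand is installed (exact flag names may vary between CLI versions, and the secret path is illustrative):

    # Binary secrets must be base64 encoded before upload; the __dcos_base64__
    # prefix tells DC/OS to decode the secret when surfacing it to your job.
    base64 -w 0 spark.keytab > spark.keytab.base64
    dcos security secrets create --value-file=spark.keytab.base64 /__dcos_base64__hdfs-keytab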
+## Job Submission
+
+To authenticate to a Kerberos KDC, Spark on Mesos supports keytab files as well as ticket-granting tickets (TGTs). Keytabs are valid indefinitely, while tickets can expire; keytabs are therefore recommended, especially for long-running streaming jobs.
+
+### Keytab Authentication
+
+Submit the job with the keytab:
+
+    dcos spark run --submit-args="\
+    --kerberos-principal user@REALM \
+    --keytab-secret-path /__dcos_base64__hdfs-keytab \
+    --conf ... --class MySparkJob <url> <args>"
+
+### TGT Authentication
+
+Submit the job with the ticket:
+
+    dcos spark run --submit-args="\
+    --kerberos-principal hdfs/name-0-node.hdfs.autoip.dcos.thisdcos.directory@LOCAL \
+    --tgt-secret-path /__dcos_base64__tgt \
+    --conf ... --class MySparkJob <url> <args>"
+
+**Note:** You can access external (i.e. non-DC/OS) Kerberos-secured HDFS clusters from Spark on Mesos.
+
+**Note:** These credentials are security-critical. The DC/OS Secret Store requires you to base64 encode binary secrets (such as the Kerberos keytab) before adding them. If they are uploaded with the `__dcos_base64__` prefix, they are automatically decoded when the secret is made available to your Spark job. If the secret name **doesn't** have this prefix, the keytab will be decoded and written to a file in the sandbox. This leaves the secret exposed and is not recommended.
+
+# Kafka Kerberos
+
+Spark can consume data from a Kerberos-enabled Kafka cluster. Connecting Spark to secure Kafka does not require special installation parameters; however, it does require that the Spark driver _and_ the Spark executors can access the following files:
+
+* Client JAAS (Java Authentication and Authorization Service) file. This is provided using Mesos URIs with `--conf spark.mesos.uris=<location_of_jaas>`.
+* `krb5.conf` for your Kerberos setup. Similarly to HDFS, this is provided using a base64 encoding of the file:
+
+        cat krb5.conf | base64 -w 0
+
+    Then assign this value to the `KRB5_CONFIG_BASE64` environment variable for the driver and the executors:
+
+        --conf spark.mesos.driverEnv.KRB5_CONFIG_BASE64=<base64_encoded_string>
+        --conf spark.executorEnv.KRB5_CONFIG_BASE64=<base64_encoded_string>
+
+* The `keytab` containing the credentials for accessing the Kafka cluster:
+
+        --conf spark.mesos.driver.secret.names=<secret_name>            # e.g. __dcos_base64__kafka_keytab
+        --conf spark.mesos.driver.secret.filenames=<keytab_file_name>   # e.g. kafka.keytab
+        --conf spark.mesos.executor.secret.names=<secret_name>          # e.g. __dcos_base64__kafka_keytab
+        --conf spark.mesos.executor.secret.filenames=<keytab_file_name> # e.g. kafka.keytab
+
+Finally, you'll likely need to tell Spark to use the JAAS file:
+
+    --conf spark.driver.extraJavaOptions=-Djava.security.auth.login.config=/mnt/mesos/sandbox/<jaas_file>
+    --conf spark.executor.extraJavaOptions=-Djava.security.auth.login.config=/mnt/mesos/sandbox/<jaas_file>
+
+It is important that the filename is the same for the driver and executor keytab file (`<keytab_file_name>` above) and that this file is properly addressed in your JAAS file. For a worked example of a Spark consumer reading from secure Kafka, see the [advanced examples](https://docs.mesosphere.com/service-docs/spark/2.1.1-2.2.0-2/usage-examples/).

docs/limitations.md

Lines changed: 5 additions & 1 deletion
@@ -1,6 +1,6 @@
 ---
 post_title: Limitations
-menu_order: 130
+menu_order: 135
 feature_maturity: ""
 enterprise: 'no'
 ---
@@ -15,3 +15,7 @@ enterprise: 'no'
 if you specify environment-based secrets with `spark.mesos.[driver|executor].secret.envkeys`,
 the keystore and truststore secrets will also show up as environment-based secrets,
 due to the way secrets are implemented. You can ignore these extra environment variables.
+
+* When using Kerberos and HDFS, the Spark driver generates delegation tokens and distributes them to its executors via RPC. Authentication of the executors with the driver is done with a [shared secret](https://spark.apache.org/docs/latest/security.html#spark-security). Without authentication, it is possible for executor containers to register with the driver and retrieve the delegation tokens. Currently, for Spark on Mesos this requires manually configuring Spark to use authentication and setting the secret (see the sketch after this list). Mesosphere is actively working to make this an automated and secure process in future releases.
+
+* Spark runs all of its components in Docker containers. Since the Docker image contains a full Linux userspace with its own `/etc/passwd` file, it is possible for the default service user `nobody` to have a different UID inside the container than on the host system. Although user `nobody` has UID 65534 by convention on many systems, this is not always the case. As Mesos does not perform UID mapping between Linux user namespaces, specifying a service user of `nobody` in this case will cause access failures when the container user attempts to open or execute a filesystem resource owned by a user with a different UID, preventing the service from launching. If the hosts in your cluster have a UID for `nobody` other than 65534, you will need to specify a service user of `root` to run DC/OS Spark successfully.
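As a sketch of the manual setup referenced in the delegation-token limitation above, Spark's standard RPC authentication settings could be passed at submission time (assuming these settings behave the same on Mesos; the secret value is a placeholder):

    dcos spark run --submit-args="\
    --conf spark.authenticate=true \
    --conf spark.authenticate.secret=<shared_secret> \
    --conf ... --class MySparkJob <url> <args>"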

docs/security.md

Lines changed: 1 addition & 1 deletion
@@ -15,7 +15,7 @@ Follow these instructions to [authenticate in strict mode](https://docs.mesosphe
 SSL support in DC/OS Apache Spark encrypts the following channels:
 
 * From the [DC/OS admin router][11] to the dispatcher.
-* From the drivers to their executors.
+* Files served from the drivers to their executors.
 
 To enable SSL, a Java keystore (and, optionally, truststore) must be provided, along
 with their passwords. The first three settings below are **required** during job

docs/table-of-contents.md

Lines changed: 2 additions & 0 deletions
@@ -14,6 +14,8 @@
 - [Interactive Spark Shell](spark-shell.md)
 - [Fault Tolerance](fault-tolerance.md)
 - [Job Scheduling](job-scheduling.md)
+- [Kerberos](kerberos.md)
+- [Usage Examples](usage-examples.md)
 - [Troubleshooting](troubleshooting.md)
 - [Spark Versions](spark-versions.md)
 - [Version Policy](version-policy.md)

docs/troubleshooting.md

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 ---
 post_title: Troubleshooting
-menu_order: 120
+menu_order: 125
 enterprise: 'no'
 ---
 

docs/usage-examples.md

Lines changed: 30 additions & 0 deletions
@@ -22,3 +22,33 @@ enterprise: 'no'
 1. View your job:
 
     Visit the Spark cluster dispatcher at `http://<dcos-url>/service/spark/` to view the status of your job. Also visit the Mesos UI at `http://<dcos-url>/mesos/` to see job logs.
+
+## Advanced
+
+* Run a Spark Streaming job with Kafka: Examples of Spark Streaming applications that connect to a secure Kafka cluster can be found at [spark-build](https://github.com/mesosphere/spark-build/blob/beta-2.1.1-2.2.0-2/tests/jobs/scala/src/main/scala/KafkaJobs.scala). As mentioned in the [kerberos](https://docs.mesosphere.com/service-docs/spark/2.1.0-2.2.0-2/kerberos/) section, Spark requires a JAAS file, the `krb5.conf`, and the keytab. An example of the JAAS file is:
+
+        KafkaClient {
+            com.sun.security.auth.module.Krb5LoginModule required
+            useKeyTab=true
+            storeKey=true
+            keyTab="/mnt/mesos/sandbox/kafka-client.keytab"
+            useTicketCache=false
+            serviceName="kafka"
+            principal="client@LOCAL";
+        };
+
+    The corresponding `dcos spark` command would be:
+
+        dcos spark run --submit-args="\
+        --conf spark.mesos.containerizer=mesos \  # required for secrets
+        --conf spark.mesos.uris=<URI_of_jaas.conf> \
+        --conf spark.mesos.driver.secret.names=__dcos_base64___keytab \  # __dcos_base64__ prefix required for decoding base64-encoded binary secrets
+        --conf spark.mesos.driver.secret.filenames=kafka-client.keytab \
+        --conf spark.mesos.executor.secret.names=__dcos_base64___keytab \
+        --conf spark.mesos.executor.secret.filenames=kafka-client.keytab \
+        --conf spark.mesos.task.labels=DCOS_SPACE:/spark \
+        --conf spark.scheduler.minRegisteredResourcesRatio=1.0 \
+        --conf spark.executorEnv.KRB5_CONFIG_BASE64=W2xpYmRlZmF1bHRzXQpkZWZhdWx0X3JlYWxtID0gTE9DQUwKCltyZWFsbXNdCiAgTE9DQUwgPSB7CiAgICBrZGMgPSBrZGMubWFyYXRob24uYXV0b2lwLmRjb3MudGhpc2Rjb3MuZGlyZWN0b3J5OjI1MDAKICB9Cg== \
+        --conf spark.mesos.driverEnv.KRB5_CONFIG_BASE64=W2xpYmRlZmF1bHRzXQpkZWZhdWx0X3JlYWxtID0gTE9DQUwKCltyZWFsbXNdCiAgTE9DQUwgPSB7CiAgICBrZGMgPSBrZGMubWFyYXRob24uYXV0b2lwLmRjb3MudGhpc2Rjb3MuZGlyZWN0b3J5OjI1MDAKICB9Cg== \
+        --class MyAppClass <URL_of_jar> [application args]"

docs/version-policy.md

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 ---
 post_title: Version Policy
-menu_order: 125
+menu_order: 130
 feature_maturity: ""
 enterprise: 'no'
 ---
