
Commit 9b625b5 (2 parents: eff538d + ce483d7)
Merge branch 'master' of https://github.com/gilv/spark
Conflicts: docs/openstack-integration.md

1 file changed: docs/openstack-integration.md (64 additions, 15 deletions)
---
layout: global
title: Accessing OpenStack Swift storage from Spark
---

# Accessing OpenStack Swift storage from Spark

Spark's file interface allows it to process data in OpenStack Swift using the same URI formats that are supported for Hadoop. You can specify a path in Swift as input through a URI of the form `swift://<container.service_provider>/path`. You will also need to set your Swift security credentials through `SparkContext.hadoopConfiguration`.
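For instance, the credentials can be set programmatically before reading any data. The sketch below assumes a provider named `PROVIDER` and illustrative Keystone endpoint and container names; none of these placeholder values come from this document:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Minimal sketch: set Swift credentials on the SparkContext's Hadoop
// configuration. "PROVIDER", the Keystone URL and the container name
// are placeholders to be replaced with your deployment's values.
val sc = new SparkContext(new SparkConf().setAppName("swift-example"))
val hc = sc.hadoopConfiguration
hc.set("fs.swift.service.PROVIDER.auth.url", "http://<keystone-host>:5000/v2.0/tokens")
hc.set("fs.swift.service.PROVIDER.tenant",   "test")
hc.set("fs.swift.service.PROVIDER.username", "tester")
hc.set("fs.swift.service.PROVIDER.password", "testing")

// swift:// URIs can then be used as input paths:
val data = sc.textFile("swift://<container>.PROVIDER/data.log")
```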

# Configuring Hadoop to use OpenStack Swift

The OpenStack Swift driver was merged into Hadoop version 2.3.0 ([Swift driver](https://issues.apache.org/jira/browse/HADOOP-8545)). Users who wish to use earlier Hadoop versions will need to configure the Swift driver manually. The current Swift driver requires Swift to use the Keystone authentication method. There are also recent efforts to support temp auth ([HADOOP-10420](https://issues.apache.org/jira/browse/HADOOP-10420)).
To configure Hadoop to work with Swift, one needs to modify Hadoop's core-site.xml and set up the Swift filesystem:

    <configuration>
      <property>
        <name>fs.swift.impl</name>
        <value>org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystem</value>
      </property>
    </configuration>

# Configuring Swift

The Swift proxy server should include the `list_endpoints` middleware. More information is available [here](https://github.com/openstack/swift/blob/master/swift/common/middleware/list_endpoints.py).
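As a hedged sketch of what enabling the middleware might look like in `proxy-server.conf` (the exact pipeline composition depends on your deployment; the filter name and egg entry point follow upstream Swift conventions):

```ini
# Sketch: enable list_endpoints in the Swift proxy pipeline.
# The surrounding pipeline entries are illustrative only.
[pipeline:main]
pipeline = catch_errors healthcheck cache list_endpoints authtoken keystoneauth proxy-server

[filter:list_endpoints]
use = egg:swift#list_endpoints
```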

# Configuring Spark

To use the Swift driver, Spark needs to be compiled with the `hadoop-openstack-2.3.0.jar` distributed with Hadoop 2.3.0.
For the Maven builds, Spark's main pom.xml should include

    <swift.version>2.3.0</swift.version>

In addition, the pom.xml of the `core` and `yarn` projects should include

        ...
        </dependency>
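The body of that dependency is elided in this hunk. As an assumption (the Maven coordinates below are the conventional ones for Hadoop's Swift module and are not taken from this document), it might look like:

```xml
    <dependency>
      <!-- Assumed coordinates for the Hadoop OpenStack/Swift module -->
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-openstack</artifactId>
      <version>${swift.version}</version>
    </dependency>
```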

Additional parameters have to be provided to the Swift driver. The Swift driver will use these parameters to perform authentication in Keystone prior to accessing Swift. The list of mandatory parameters is: `fs.swift.service.<PROVIDER>.auth.url`, `fs.swift.service.<PROVIDER>.auth.endpoint.prefix`, `fs.swift.service.<PROVIDER>.tenant`, `fs.swift.service.<PROVIDER>.username`, `fs.swift.service.<PROVIDER>.password`, `fs.swift.service.<PROVIDER>.http.port`, and `fs.swift.service.<PROVIDER>.public`, where `PROVIDER` is any name. `fs.swift.service.<PROVIDER>.auth.url` should point to the Keystone authentication URL.

Create core-site.xml with the mandatory parameters and place it under the /spark/conf directory. For example:

    <property>
      ...
      <value>true</value>
    </property>

We are left with `fs.swift.service.<PROVIDER>.tenant`, `fs.swift.service.<PROVIDER>.username`, and `fs.swift.service.<PROVIDER>.password`. The best way would be to provide these parameters to the SparkContext at run time, which does not seem to be possible yet.
Another approach is to adapt the Swift driver to obtain those values from system environment variables. For now, we provide them via core-site.xml.
Assume a tenant `test` with user `tester` was defined in Keystone; then core-site.xml should include:

    <property>
      <name>fs.swift.service.<PROVIDER>.tenant</name>
      ...
      <value>testing</value>
    </property>
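The environment-variable approach mentioned above could be sketched as follows (the variable names `SWIFT_TENANT`, `SWIFT_USERNAME` and `SWIFT_PASSWORD` are hypothetical, as is the provider name):

```scala
// Illustrative sketch: read Swift credentials from environment variables at
// runtime and inject them into the Hadoop configuration. The variable names
// and the provider name "PROVIDER" are placeholders, not part of the driver.
val hc = sc.hadoopConfiguration
hc.set("fs.swift.service.PROVIDER.tenant",   sys.env("SWIFT_TENANT"))
hc.set("fs.swift.service.PROVIDER.username", sys.env("SWIFT_USERNAME"))
hc.set("fs.swift.service.PROVIDER.password", sys.env("SWIFT_PASSWORD"))
```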
# Usage

Assume there exists a Swift container `logs` with an object `data.log`. To access `data.log` from Spark, the `swift://` scheme should be used.
For example:

    val sfdata = sc.textFile("swift://logs.<PROVIDER>/data.log")
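As a brief continuation (standard RDD operations; the output path is illustrative), the loaded data can be processed and written back to Swift:

```scala
// sfdata is an RDD[String]; filter it and write the result back to Swift.
// The container "logs" and "<PROVIDER>" are the placeholders from above.
val errors = sfdata.filter(_.contains("ERROR"))
println(s"error lines: ${errors.count()}")
errors.saveAsTextFile("swift://logs.<PROVIDER>/errors")
```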
