---
layout: global
title: Accessing OpenStack Swift from Spark
---

# Accessing OpenStack Swift from Spark

Spark's file interface allows it to process data in OpenStack Swift using the same URI
formats that are supported for Hadoop. You can specify a path in Swift as input through a
URI of the form `swift://<container>.<PROVIDER>/<path>`. You will also need to set your
Swift security credentials, through `core-site.xml` or via `SparkContext.hadoopConfiguration`.
The OpenStack Swift driver was merged into Hadoop version 2.3.0 ([Swift driver](https://issues.apache.org/jira/browse/HADOOP-8545)). Users who wish to use earlier Hadoop versions will need to configure the Swift driver manually. The current Swift driver
requires Swift to use the Keystone authentication method. There are recent efforts to support
temp auth ([HADOOP-10420](https://issues.apache.org/jira/browse/HADOOP-10420)).

# Configuring Swift
The Swift proxy server should include the `list_endpoints` middleware. More information is
available [here](https://github.com/openstack/swift/blob/master/swift/common/middleware/list_endpoints.py).

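For example, a minimal sketch of the relevant `proxy-server.conf` entries (illustrative only; the section name and the other middleware in your pipeline will vary by deployment):

	[pipeline:main]
	pipeline = catch_errors healthcheck cache list-endpoints proxy-server

	[filter:list-endpoints]
	use = egg:swift#list_endpoints
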
# Compilation of Spark
Spark should be compiled with `hadoop-openstack-2.3.0.jar`, which is distributed with Hadoop 2.3.0.
For Maven builds, the `dependencyManagement` section of Spark's main `pom.xml` should include

	<dependencyManagement>
		<!-- ... -->
		<dependency>
			<groupId>org.apache.hadoop</groupId>
			<artifactId>hadoop-openstack</artifactId>
			<version>2.3.0</version>
		</dependency>
		<!-- ... -->
	</dependencyManagement>

In addition, both the `core` and `yarn` projects should add `hadoop-openstack` to the `dependencies` section of their `pom.xml`:

	<dependencies>
		<!-- ... -->
		<dependency>
			<groupId>org.apache.hadoop</groupId>
			<artifactId>hadoop-openstack</artifactId>
		</dependency>
		<!-- ... -->
	</dependencies>
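
After these changes, Spark can be rebuilt against Hadoop 2.3.0, for example (a sketch, using the standard Spark Maven build flags):

	mvn -Dhadoop.version=2.3.0 -DskipTests clean package
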
# Configuration of Spark
Create `core-site.xml` and place it inside Spark's `conf` directory. There are two main categories of parameters that should be
configured: the declaration of the Swift driver, and the parameters required by Keystone.

Hadoop is configured to use the Swift filesystem via the following property:

<table class="table">
<tr><th>Property Name</th><th>Value</th></tr>
<tr>
  <td>fs.swift.impl</td>
  <td>org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystem</td>
</tr>
</table>

Additional parameters are required by Keystone and should be provided to the Swift driver. These
parameters are used to perform authentication against Keystone in order to access Swift. The following
table lists the Keystone parameters; `PROVIDER` can be any name.

<table class="table">
<tr><th>Property Name</th><th>Meaning</th><th>Required</th></tr>
<tr>
  <td>fs.swift.service.PROVIDER.auth.url</td>
  <td>Keystone Authentication URL</td>
  <td>Mandatory</td>
</tr>
<tr>
  <td>fs.swift.service.PROVIDER.auth.endpoint.prefix</td>
  <td>Keystone endpoints prefix</td>
  <td>Optional</td>
</tr>
<tr>
  <td>fs.swift.service.PROVIDER.tenant</td>
  <td>Tenant</td>
  <td>Mandatory</td>
</tr>
<tr>
  <td>fs.swift.service.PROVIDER.username</td>
  <td>Username</td>
  <td>Mandatory</td>
</tr>
<tr>
  <td>fs.swift.service.PROVIDER.password</td>
  <td>Password</td>
  <td>Mandatory</td>
</tr>
<tr>
  <td>fs.swift.service.PROVIDER.http.port</td>
  <td>HTTP port</td>
  <td>Mandatory</td>
</tr>
<tr>
  <td>fs.swift.service.PROVIDER.region</td>
  <td>Keystone region</td>
  <td>Mandatory</td>
</tr>
<tr>
  <td>fs.swift.service.PROVIDER.public</td>
  <td>Indicates if all URLs are public</td>
  <td>Mandatory</td>
</tr>
</table>

For example, assume `PROVIDER=SparkTest` and Keystone contains user `tester` with password `testing` defined for tenant `test`.
Then `core-site.xml` should include:

	<configuration>
		<property>
			<name>fs.swift.impl</name>
			<value>org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystem</value>
		</property>
		<property>
			<name>fs.swift.service.SparkTest.auth.url</name>
			<value>http://127.0.0.1:5000/v2.0/tokens</value>
		</property>
		<property>
			<name>fs.swift.service.SparkTest.auth.endpoint.prefix</name>
			<value>endpoints</value>
		</property>
		<property>
			<name>fs.swift.service.SparkTest.http.port</name>
			<value>8080</value>
		</property>
		<property>
			<name>fs.swift.service.SparkTest.region</name>
			<value>RegionOne</value>
		</property>
		<property>
			<name>fs.swift.service.SparkTest.public</name>
			<value>true</value>
		</property>
		<property>
			<name>fs.swift.service.SparkTest.tenant</name>
			<value>test</value>
		</property>
		<property>
			<name>fs.swift.service.SparkTest.username</name>
			<value>tester</value>
		</property>
		<property>
			<name>fs.swift.service.SparkTest.password</name>
			<value>testing</value>
		</property>
	</configuration>

Notice that `fs.swift.service.PROVIDER.tenant`, `fs.swift.service.PROVIDER.username`, and
`fs.swift.service.PROVIDER.password` contain sensitive information, so keeping them in `core-site.xml` is not always a good approach.
We suggest keeping those parameters in `core-site.xml` for testing purposes when running Spark via `spark-shell`. For job submissions they should be provided via `SparkContext.hadoopConfiguration`.

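For example, a minimal Scala sketch (using the same `SparkTest` provider and test credentials as above; in a real job these values would come from a secure source rather than being hard-coded):

	import org.apache.spark.{SparkConf, SparkContext}

	val sc = new SparkContext(new SparkConf().setAppName("SwiftExample"))
	// Provide the sensitive Keystone parameters at runtime instead of core-site.xml
	sc.hadoopConfiguration.set("fs.swift.service.SparkTest.tenant", "test")
	sc.hadoopConfiguration.set("fs.swift.service.SparkTest.username", "tester")
	sc.hadoopConfiguration.set("fs.swift.service.SparkTest.password", "testing")
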
# Usage examples
Assume Keystone's authentication URL is `http://127.0.0.1:5000/v2.0/tokens` and Keystone contains the tenant `test` and user `tester` with password `testing`. In our example we define `PROVIDER=SparkTest`. Assume that Swift contains a container `logs` with an object `data.log`. To access `data.log`
from Spark, the `swift://` scheme should be used.

## Running Spark via spark-shell
Make sure that `core-site.xml` contains `fs.swift.service.SparkTest.tenant`, `fs.swift.service.SparkTest.username`, and
`fs.swift.service.SparkTest.password`. Run Spark via `spark-shell` and access Swift via the `swift://` scheme.

	val sfdata = sc.textFile("swift://logs.SparkTest/data.log")
	sfdata.count()

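Writing back to Swift uses the same scheme; for example (a sketch, assuming the `logs` container is writable and the hypothetical `errors` path does not already exist):

	// Save the lines containing "error" as a new object tree under the logs container
	sfdata.filter(line => line.contains("error")).saveAsTextFile("swift://logs.SparkTest/errors")
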
## Job submission via spark-submit
In this case `core-site.xml` need not contain `fs.swift.service.SparkTest.tenant`, `fs.swift.service.SparkTest.username`, and
`fs.swift.service.SparkTest.password`. Example of Java usage:

	/* SimpleApp.java */
	import org.apache.spark.api.java.*;
	import org.apache.spark.SparkConf;

	public class SimpleApp {
	  public static void main(String[] args) {
	    String logFile = "swift://logs.SparkTest/data.log";
	    SparkConf conf = new SparkConf().setAppName("Simple Application");
	    JavaSparkContext sc = new JavaSparkContext(conf);
	    // Provide the sensitive Keystone parameters for the SparkTest provider at runtime
	    sc.hadoopConfiguration().set("fs.swift.service.SparkTest.tenant", "test");
	    sc.hadoopConfiguration().set("fs.swift.service.SparkTest.username", "tester");
	    sc.hadoopConfiguration().set("fs.swift.service.SparkTest.password", "testing");

	    JavaRDD<String> logData = sc.textFile(logFile).cache();

	    long num = logData.count();

	    System.out.println("Total number of lines: " + num);
	  }
	}

The directory structure is

	find .
	./src
	./src/main
	./src/main/java
	./src/main/java/SimpleApp.java

The Maven `pom.xml` is

	<project>
		<groupId>edu.berkeley</groupId>
		<artifactId>simple-project</artifactId>
		<modelVersion>4.0.0</modelVersion>
		<name>Simple Project</name>
		<packaging>jar</packaging>
		<version>1.0</version>
		<repositories>
			<repository>
				<id>Akka repository</id>
				<url>http://repo.akka.io/releases</url>
			</repository>
		</repositories>
		<build>
			<plugins>
				<plugin>
					<groupId>org.apache.maven.plugins</groupId>
					<artifactId>maven-compiler-plugin</artifactId>
					<version>2.3</version>
					<configuration>
						<source>1.6</source>
						<target>1.6</target>
					</configuration>
				</plugin>
			</plugins>
		</build>
		<dependencies>
			<dependency> <!-- Spark dependency -->
				<groupId>org.apache.spark</groupId>
				<artifactId>spark-core_2.10</artifactId>
				<version>1.0.0</version>
			</dependency>
		</dependencies>
	</project>

Compile and execute

	mvn package
	$SPARK_HOME/bin/spark-submit --class "SimpleApp" --master local[4] target/simple-project-1.0.jar