Sample command-line programs for interacting with the Cloud Dataproc API.

Please see [the tutorial on using the Dataproc API with the Python client
library](https://cloud.google.com/dataproc/docs/tutorials/python-library-example)
for more information.
Note that while this sample demonstrates interacting with Dataproc via the API, the functionality
demonstrated here could also be accomplished using the Cloud Console or the gcloud CLI.

`list_clusters.py` is a simple command-line program to demonstrate connecting to the
Dataproc API and listing the clusters in a region.
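
As a rough sketch (not the sample itself), listing clusters with the
`google-cloud-dataproc` client library looks something like this; the project and
region values are placeholders, and the exact client API depends on the library
version you have installed:

    # Hypothetical sketch: list Dataproc clusters in a region using the
    # google-cloud-dataproc client library (values below are placeholders).
    from google.cloud import dataproc_v1

    project_id = 'your-project-id'
    region = 'us-central1'

    client = dataproc_v1.ClusterControllerClient(
        client_options={'api_endpoint': f'{region}-dataproc.googleapis.com:443'})

    for cluster in client.list_clusters(
            request={'project_id': project_id, 'region': region}):
        print(cluster.cluster_name)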

`submit_job_to_cluster.py` demonstrates how to create a cluster, submit the
`pyspark_sort.py` job, download the output from Google Cloud Storage, and print the result.

`pyspark_sort.py_gcs` is the same as `pyspark_sort.py` but demonstrates
reading from a GCS bucket.
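
Roughly, the GCS-reading variant amounts to something like the following PySpark
snippet; the `gs://` path is a placeholder, not a file shipped with these samples:

    # Sketch of a PySpark job that reads lines from a GCS object and sorts them.
    # The bucket path below is a placeholder; point it at your own object.
    import pyspark

    sc = pyspark.SparkContext()
    lines = sc.textFile('gs://your-input-bucket/input.txt')
    print(sorted(lines.collect()))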

## Prerequisites to run locally:

* [pip](https://pypi.python.org/pypi/pip)

Go to the [Google Cloud Console](https://console.cloud.google.com).

Under API Manager, search for the Google Cloud Dataproc API and enable it.

## Set Up Your Local Dev Environment

To install, run the following commands. If you want to use [virtualenv](https://virtualenv.readthedocs.org/en/latest/)
(recommended), run the commands within a virtualenv.

    pip install -r requirements.txt

## Authentication

Please see the [Google Cloud authentication guide](https://cloud.google.com/docs/authentication/).
The recommended approach to running these samples is a Service Account with a JSON key.
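
If you use a service account key, the standard way to point the client libraries at it
is the `GOOGLE_APPLICATION_CREDENTIALS` environment variable (the path below is a
placeholder):

    export GOOGLE_APPLICATION_CREDENTIALS=/path/to/your-service-account-key.json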

## Environment Variables

Set the following environment variables:

    GOOGLE_CLOUD_PROJECT=your-project-id
    REGION=us-central1  # or your region
    CLUSTER_NAME=waprin-spark7
    ZONE=us-central1-b

## Running the samples

To run list_clusters.py:

    python list_clusters.py $GOOGLE_CLOUD_PROJECT --region=$REGION

`submit_job_to_cluster.py` can create the Dataproc cluster, or use an existing one.
If you'd like to create a cluster ahead of time, either use the
[Cloud Console](https://console.cloud.google.com) or run:

    gcloud dataproc clusters create your-cluster-name

To run submit_job_to_cluster.py, first create a GCS bucket for Dataproc to stage files,
either from the Cloud Console or with gsutil:

    gsutil mb gs://<your-staging-bucket-name>

Then set the following environment variables:

    BUCKET=your-staging-bucket
    CLUSTER=your-cluster-name

Then, if you want to rely on an existing cluster, run:

    python submit_job_to_cluster.py --project_id=$GOOGLE_CLOUD_PROJECT --zone=us-central1-b --cluster_name=$CLUSTER --gcs_bucket=$BUCKET

Otherwise, if you want the script to create a new cluster for you:

    python submit_job_to_cluster.py --project_id=$GOOGLE_CLOUD_PROJECT --zone=us-central1-b --cluster_name=$CLUSTER --gcs_bucket=$BUCKET --create_new_cluster

This will set up a cluster, upload the PySpark file, submit the job, print the result, then
delete the cluster.
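
For a sense of what the script automates, a hedged sketch of submitting a PySpark job
with the `google-cloud-dataproc` client library is shown below; all names and the
staged file URI are placeholders, and the script itself may use a different client:

    # Hypothetical sketch: submit a PySpark job to an existing Dataproc cluster
    # and wait for it to finish (all values are placeholders).
    from google.cloud import dataproc_v1

    project_id = 'your-project-id'
    region = 'us-central1'
    cluster_name = 'your-cluster-name'

    job_client = dataproc_v1.JobControllerClient(
        client_options={'api_endpoint': f'{region}-dataproc.googleapis.com:443'})

    job = {
        'placement': {'cluster_name': cluster_name},
        'pyspark_job': {
            # The script stages pyspark_sort.py in the GCS bucket first.
            'main_python_file_uri': 'gs://your-staging-bucket/pyspark_sort.py',
        },
    }

    operation = job_client.submit_job_as_operation(
        request={'project_id': project_id, 'region': region, 'job': job})
    finished_job = operation.result()  # blocks until the job completes
    print(finished_job.driver_output_resource_uri)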

You can optionally pass a `--pyspark_file` argument to submit your own script instead of
the default `pyspark_sort.py` included with this sample.
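
For example (the script path is a placeholder):

    python submit_job_to_cluster.py --project_id=$GOOGLE_CLOUD_PROJECT --zone=us-central1-b --cluster_name=$CLUSTER --gcs_bucket=$BUCKET --pyspark_file=path/to/your_script.py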