Sample command-line programs for interacting with the Cloud Dataproc API.

Please see [the tutorial on using the Dataproc API with the Python client
library](https://cloud.google.com/dataproc/docs/tutorials/python-library-example)
for more information.
Note that while this sample demonstrates interacting with Dataproc via the API, the functionality
demonstrated here could also be accomplished using the Cloud Console or the gcloud CLI.

`list_clusters.py` is a simple command-line program to demonstrate connecting to the
Dataproc API and listing the clusters in a region.
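
As a rough sketch (not the sample itself), listing clusters with the
`google-cloud-dataproc` client library looks something like this; the project and
region values are placeholders, and the exact client API depends on the library
version you have installed:

    # Hypothetical sketch: list Dataproc clusters in a region using the
    # google-cloud-dataproc client library (values below are placeholders).
    from google.cloud import dataproc_v1

    project_id = 'your-project-id'
    region = 'us-central1'

    client = dataproc_v1.ClusterControllerClient(
        client_options={'api_endpoint': f'{region}-dataproc.googleapis.com:443'})

    for cluster in client.list_clusters(
            request={'project_id': project_id, 'region': region}):
        print(cluster.cluster_name)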

`submit_job_to_cluster.py` demonstrates how to create a cluster, submit the
`pyspark_sort.py` job, download the output from Google Cloud Storage, and print the result.

`pyspark_sort.py_gcs` is the same as `pyspark_sort.py` but demonstrates
reading from a GCS bucket.
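
Roughly, the GCS-reading variant amounts to something like the following PySpark
snippet; the `gs://` path is a placeholder, not a file shipped with these samples:

    # Sketch of a PySpark job that reads lines from a GCS object and sorts them.
    # The bucket path below is a placeholder; point it at your own object.
    import pyspark

    sc = pyspark.SparkContext()
    lines = sc.textFile('gs://your-input-bucket/input.txt')
    print(sorted(lines.collect()))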

## Prerequisites to run locally:

* [pip](https://pypi.python.org/pypi/pip)

Go to the [Google Cloud Console](https://console.cloud.google.com).

Under API Manager, search for the Google Cloud Dataproc API and enable it.

## Set Up Your Local Dev Environment

To install, run the following commands. If you want to use [virtualenv](https://virtualenv.readthedocs.org/en/latest/)
(recommended), run the commands within a virtualenv.

    pip install -r requirements.txt

## Authentication

Please see the [Google Cloud authentication guide](https://cloud.google.com/docs/authentication/).
The recommended approach to running these samples is a Service Account with a JSON key.
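
If you use a service account key, the standard way to point the client libraries at it
is the `GOOGLE_APPLICATION_CREDENTIALS` environment variable (the path below is a
placeholder):

    export GOOGLE_APPLICATION_CREDENTIALS=/path/to/your-service-account-key.json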

## Environment Variables

Set the following environment variables:

    GOOGLE_CLOUD_PROJECT=your-project-id
    REGION=us-central1  # or your region
    CLUSTER_NAME=waprin-spark7
    ZONE=us-central1-b

## Running the samples

To run list_clusters.py:

    python list_clusters.py $GOOGLE_CLOUD_PROJECT --region=$REGION

`submit_job_to_cluster.py` can create the Dataproc cluster, or use an existing one.
If you'd like to create a cluster ahead of time, either use the
[Cloud Console](https://console.cloud.google.com) or run:

    gcloud dataproc clusters create your-cluster-name

To run submit_job_to_cluster.py, first create a GCS bucket for Dataproc to stage files,
either from the Cloud Console or with gsutil:

    gsutil mb gs://<your-staging-bucket-name>

Then set the following environment variables:

    BUCKET=your-staging-bucket
    CLUSTER=your-cluster-name

Then, if you want to rely on an existing cluster, run:

    python submit_job_to_cluster.py --project_id=$GOOGLE_CLOUD_PROJECT --zone=us-central1-b --cluster_name=$CLUSTER --gcs_bucket=$BUCKET

Otherwise, if you want the script to create a new cluster for you:

    python submit_job_to_cluster.py --project_id=$GOOGLE_CLOUD_PROJECT --zone=us-central1-b --cluster_name=$CLUSTER --gcs_bucket=$BUCKET --create_new_cluster

This will set up a cluster, upload the PySpark file, submit the job, print the result, then
delete the cluster.
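
For a sense of what the script automates, a hedged sketch of submitting a PySpark job
with the `google-cloud-dataproc` client library is shown below; all names and the
staged file URI are placeholders, and the script itself may use a different client:

    # Hypothetical sketch: submit a PySpark job to an existing Dataproc cluster
    # and wait for it to finish (all values are placeholders).
    from google.cloud import dataproc_v1

    project_id = 'your-project-id'
    region = 'us-central1'
    cluster_name = 'your-cluster-name'

    job_client = dataproc_v1.JobControllerClient(
        client_options={'api_endpoint': f'{region}-dataproc.googleapis.com:443'})

    job = {
        'placement': {'cluster_name': cluster_name},
        'pyspark_job': {
            # The script stages pyspark_sort.py in the GCS bucket first.
            'main_python_file_uri': 'gs://your-staging-bucket/pyspark_sort.py',
        },
    }

    operation = job_client.submit_job_as_operation(
        request={'project_id': project_id, 'region': region, 'job': job})
    finished_job = operation.result()  # blocks until the job completes
    print(finished_job.driver_output_resource_uri)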

You can optionally pass a `--pyspark_file` argument to submit your own script instead of
the default `pyspark_sort.py` included with this sample.
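
For example (the script path is a placeholder):

    python submit_job_to_cluster.py --project_id=$GOOGLE_CLOUD_PROJECT --zone=us-central1-b --cluster_name=$CLUSTER --gcs_bucket=$BUCKET --pyspark_file=path/to/your_script.py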