Real-time stream processing project using Apache Kafka and Spark Streaming on Google Cloud Dataproc. Includes Python producers/consumers, Spark DStream word count, and full deployment with screenshots.


📡 Real-Time Stream Processing with Kafka & Spark Streaming (GCP Dataproc)

This project demonstrates real-time streaming using:

  • Apache Kafka for messaging (Exercise 1)
  • Apache Spark Streaming for real-time data processing (Exercise 2)

Created and executed using Google Cloud Dataproc.


🔁 Exercise 1 – Kafka with Python

✅ Steps Completed

  • Created a Dataproc cluster with ZooKeeper
  • Downloaded Kafka via wget
  • Extracted Kafka, started ZooKeeper and Kafka server
  • Opened three terminals for Kafka (broker, producer, consumer)
  • Created a topic named sample
  • Implemented producer and consumer in Python using kafka-python
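The setup steps above can be sketched as shell commands. This is a minimal sketch, assuming a single-node setup with default config files; the Kafka version and download URL are assumptions — use whichever release matches your cluster:

```shell
# Download and extract Kafka (version is an assumption; pick your release)
wget https://archive.apache.org/dist/kafka/2.8.2/kafka_2.12-2.8.2.tgz
tar -xzf kafka_2.12-2.8.2.tgz
cd kafka_2.12-2.8.2

# Terminal 1: start ZooKeeper
bin/zookeeper-server-start.sh config/zookeeper.properties

# Terminal 2: start the Kafka broker
bin/kafka-server-start.sh config/server.properties

# Terminal 3: create the topic named "sample"
bin/kafka-topics.sh --create --topic sample \
  --bootstrap-server localhost:9092 --partitions 1 --replication-factor 1
```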

🗂 Files

  • exercise-1-kafka/put.py — sends three key/value messages to the sample topic
  • exercise-1-kafka/get.py — reads those messages back and prints them
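A minimal sketch of what put.py and get.py might contain, using kafka-python. The broker address, consumer timeout, and the `build_messages` helper are assumptions added here for illustration; the key/value pairs mirror the sample output shown later in this README:

```python
def build_messages():
    """Key/value pairs matching the sample output (values are placeholders)."""
    records = [("MYID", "a2054xxxx"), ("MYNAME", "Naveed"), ("MYEYECOLOR", "Brown")]
    return [(k.encode("utf-8"), v.encode("utf-8")) for k, v in records]

def produce(topic="sample", bootstrap="localhost:9092"):
    """Sketch of put.py: send the three messages to the Kafka topic."""
    # Import kept local so build_messages() is usable without a running broker.
    from kafka import KafkaProducer
    producer = KafkaProducer(bootstrap_servers=bootstrap)
    for key, value in build_messages():
        producer.send(topic, key=key, value=value)
    producer.flush()
    producer.close()

def consume(topic="sample", bootstrap="localhost:9092"):
    """Sketch of get.py: read the messages from the beginning and print them."""
    from kafka import KafkaConsumer
    consumer = KafkaConsumer(
        topic,
        bootstrap_servers=bootstrap,
        auto_offset_reset="earliest",  # read existing messages, not just new ones
        consumer_timeout_ms=10000,     # stop iterating after 10 s of silence
    )
    for msg in consumer:
        print(f"Key={msg.key.decode()}, Value={msg.value.decode()}")
    consumer.close()

if __name__ == "__main__":
    produce()
    consume()
```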

🔁 Exercise 2 – Spark Streaming Word Count

✅ Steps Completed

  • Opened a listening TCP socket on port 3333 with nc -lk 3333
  • Wrote and ran a PySpark streaming script that consumes the socket data
  • Produced word-count output in real time over 10-second batch intervals

🗂 Files

  • exercise-2-spark-streaming/consume.py
  • exercise-2-spark-streaming/log4j.properties
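The core of consume.py might look like the DStream sketch below. The host, port, and 10-second batch interval follow the exercise; the pure `word_counts` helper is an addition here, only to make the counting logic explicit:

```python
def word_counts(lines):
    """Pure core of the job: split lines on whitespace and count words."""
    from collections import Counter
    counts = Counter(word for line in lines for word in line.split())
    return sorted(counts.items())

def main():
    # PySpark imports kept local so word_counts() works without Spark installed.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="SocketWordCount")
    ssc = StreamingContext(sc, 10)  # 10-second batch intervals, as in the exercise

    # Consume lines from the netcat socket opened with `nc -lk 3333`
    lines = ssc.socketTextStream("localhost", 3333)
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))
    counts.pprint()

    ssc.start()
    ssc.awaitTermination()

if __name__ == "__main__":
    main()
```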

📷 Screenshot Highlights

Exercise - 1

  • Dataproc cluster creation (screenshot)
  • Kafka terminal setup (screenshots)
  • Creating the Kafka topic (screenshot)
  • Running the producer and consumer code (screenshot)
  • Kafka consumer output (screenshot)

Exercise - 2

  • Running the command to update Spark's logging behavior (screenshot)
  • Netcat socket input
  • Opening a TCP socket connection on port 3333 on the master node (screenshot)
  • Spark Streaming word count output (screenshots)


🚀 How to Run

Kafka:

```shell
cd exercise-1-kafka
python3 put.py   # Terminal 1
python3 get.py   # Terminal 2
```

Spark Streaming:

```shell
cd exercise-2-spark-streaming
nc -lk 3333              # Terminal 1
spark-submit consume.py  # Terminal 2
```

✅ Sample Outputs

Kafka Output:

```
Key=MYID, Value=a2054xxxx
Key=MYNAME, Value=Naveed
Key=MYEYECOLOR, Value=Brown
```

Spark Streaming Output:

```
('hello', 1)
('spark', 1)
('streaming', 1)
('works', 1)
```

📍 Author

Naveed
🎓 MS CS @ Illinois Institute of Technology
💼 Aspiring Software Engineer
🌐 GitHub
