This project demonstrates real-time streaming using:
- Apache Kafka for messaging (Exercise 1)
- Apache Spark Streaming for real-time data processing (Exercise 2)
Created and executed using Google Cloud Dataproc.

## Exercise 1: Kafka

Setup steps:
- Created a Dataproc cluster with ZooKeeper
- Downloaded Kafka via `wget`
- Extracted Kafka, then started ZooKeeper and the Kafka server
- Opened three terminals for Kafka (broker, producer, consumer)
- Created a topic named `sample` (a programmatic alternative is sketched below)
- Implemented the producer and consumer in Python using `kafka-python`
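The topic was presumably created with Kafka's command-line tools on the cluster; as a reference, `kafka-python` can also create it programmatically. A minimal sketch, not part of the original repo, assuming a single broker listening on `localhost:9092`:

```python
from kafka.admin import KafkaAdminClient, NewTopic

# Connect to the local broker started earlier (the address is an assumption).
admin = KafkaAdminClient(bootstrap_servers="localhost:9092")

# One partition and replication factor 1 suffice for a single-broker setup.
admin.create_topics(
    new_topics=[NewTopic(name="sample", num_partitions=1, replication_factor=1)]
)
admin.close()
```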
Files:
- `exercise-1-kafka/put.py` — sends 3 messages to the `sample` topic
- `exercise-1-kafka/get.py` — reads those messages and prints them
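The actual scripts live in `exercise-1-kafka/`; the sketches below only show the general shape, assuming `kafka-python`, a broker on `localhost:9092`, and the three key/value pairs from the sample output further down. A minimal `put.py`-style producer:

```python
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")

# The three messages; keys and values match the sample consumer output below.
messages = [
    ("MYID", "a2054xxxx"),
    ("MYNAME", "Naveed"),
    ("MYEYECOLOR", "Brown"),
]
for key, value in messages:
    producer.send("sample", key=key.encode("utf-8"), value=value.encode("utf-8"))

producer.flush()  # block until all three messages are delivered
producer.close()
```

And a matching `get.py`-style consumer:

```python
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "sample",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",  # start from the beginning of the topic
    consumer_timeout_ms=10000,     # stop iterating after 10 s with no messages
)
for msg in consumer:
    print(f"Key={msg.key.decode('utf-8')}, Value={msg.value.decode('utf-8')}")
consumer.close()
```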
## Exercise 2: Spark Streaming

- Opened a TCP socket on port 3333 using `nc -lk 3333`
- Wrote and ran a PySpark streaming script to consume the socket data
- Produced word-count output in real-time 10-second windows (see the sketch below)

Files:
- `exercise-2-spark-streaming/consume.py`
- `exercise-2-spark-streaming/log4j.properties`
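The real `consume.py` is in the repo; the sketch below illustrates the approach it describes — the DStream API reading from the `nc` socket with a 10-second batch interval — assuming the socket is on `localhost:3333` (the bundled `log4j.properties` presumably quiets Spark's default logging):

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="SocketWordCount")
sc.setLogLevel("ERROR")         # keep the console output readable
ssc = StreamingContext(sc, 10)  # 10-second batch interval = 10-second windows

# Read lines from the socket opened with `nc -lk 3333`.
lines = ssc.socketTextStream("localhost", 3333)

# Classic streaming word count over each 10-second batch.
counts = (
    lines.flatMap(lambda line: line.split())
         .map(lambda word: (word, 1))
         .reduceByKey(lambda a, b: a + b)
)
counts.pprint()  # prints tuples like ('hello', 1), as in the sample output

ssc.start()
ssc.awaitTermination()
```

Run it with `spark-submit consume.py` while `nc -lk 3333` is running in another terminal, then type words into the `nc` terminal; counts appear every 10 seconds.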
## Screenshots

- Dataproc cluster creation
- Kafka terminal setup
- Kafka topic creation
- Running the code
- Kafka consumer output
## How to Run

Exercise 1 (Kafka):

```bash
cd exercise-1-kafka
python3 put.py   # Terminal 1
python3 get.py   # Terminal 2
```

Exercise 2 (Spark Streaming):

```bash
cd exercise-2-spark-streaming
nc -lk 3333              # Terminal 1
spark-submit consume.py  # Terminal 2
```
## Sample Output

Kafka consumer:

```text
Key=MYID, Value=a2054xxxx
Key=MYNAME, Value=Naveed
Key=MYEYECOLOR, Value=Brown
```

Spark Streaming word count:

```text
('hello', 1)
('spark', 1)
('streaming', 1)
('works', 1)
```
## Author

Naveed
🎓 MS CS @ Illinois Institute of Technology
💼 Aspiring Software Engineer
🌐 GitHub