This project demonstrates real-time streaming using:
- Apache Kafka for messaging (Exercise 1)
- Apache Spark Streaming for real-time data processing (Exercise 2)
Created and executed using Google Cloud Dataproc.

## Exercise 1: Kafka

Setup steps:
- Created a Dataproc cluster with ZooKeeper
- Downloaded Kafka via `wget`
- Extracted Kafka, then started ZooKeeper and the Kafka server
- Opened three terminals for Kafka (broker, producer, consumer)
- Created a topic named `sample` (a programmatic alternative is sketched below)
- Implemented the producer and consumer in Python using `kafka-python`
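The topic was presumably created with Kafka's command-line tools on the cluster; as a reference, `kafka-python` can also create it programmatically. A minimal sketch, not part of the original repo, assuming a single broker listening on `localhost:9092`:

```python
from kafka.admin import KafkaAdminClient, NewTopic

# Connect to the local broker started earlier (the address is an assumption).
admin = KafkaAdminClient(bootstrap_servers="localhost:9092")

# One partition and replication factor 1 suffice for a single-broker setup.
admin.create_topics(
    new_topics=[NewTopic(name="sample", num_partitions=1, replication_factor=1)]
)
admin.close()
```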
Files:
- `exercise-1-kafka/put.py` — sends 3 messages to the `sample` topic
- `exercise-1-kafka/get.py` — reads those messages and prints them
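The actual scripts live in `exercise-1-kafka/`; the sketches below only show the general shape, assuming `kafka-python`, a broker on `localhost:9092`, and the three key/value pairs from the sample output further down. A minimal `put.py`-style producer:

```python
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")

# The three messages; keys and values match the sample consumer output below.
messages = [
    ("MYID", "a2054xxxx"),
    ("MYNAME", "Naveed"),
    ("MYEYECOLOR", "Brown"),
]
for key, value in messages:
    producer.send("sample", key=key.encode("utf-8"), value=value.encode("utf-8"))

producer.flush()  # block until all three messages are delivered
producer.close()
```

And a matching `get.py`-style consumer:

```python
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "sample",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",  # start from the beginning of the topic
    consumer_timeout_ms=10000,     # stop iterating after 10 s with no messages
)
for msg in consumer:
    print(f"Key={msg.key.decode('utf-8')}, Value={msg.value.decode('utf-8')}")
consumer.close()
```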
## Exercise 2: Spark Streaming

- Opened a TCP socket on port 3333 using `nc -lk 3333`
- Wrote and ran a PySpark streaming script to consume the socket data
- Produced word-count output in real-time 10-second windows (see the sketch below)

Files:
- `exercise-2-spark-streaming/consume.py`
- `exercise-2-spark-streaming/log4j.properties`
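The real `consume.py` is in the repo; the sketch below illustrates the approach it describes — the DStream API reading from the `nc` socket with a 10-second batch interval — assuming the socket is on `localhost:3333` (the bundled `log4j.properties` presumably quiets Spark's default logging):

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="SocketWordCount")
sc.setLogLevel("ERROR")         # keep the console output readable
ssc = StreamingContext(sc, 10)  # 10-second batch interval = 10-second windows

# Read lines from the socket opened with `nc -lk 3333`.
lines = ssc.socketTextStream("localhost", 3333)

# Classic streaming word count over each 10-second batch.
counts = (
    lines.flatMap(lambda line: line.split())
         .map(lambda word: (word, 1))
         .reduceByKey(lambda a, b: a + b)
)
counts.pprint()  # prints tuples like ('hello', 1), as in the sample output

ssc.start()
ssc.awaitTermination()
```

Run it with `spark-submit consume.py` while `nc -lk 3333` is running in another terminal, then type words into the `nc` terminal; counts appear every 10 seconds.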
## Screenshots

- Dataproc cluster creation
- Kafka terminal setup
- Kafka topic creation
- Running the code
- Kafka consumer output
## How to Run

Exercise 1 (Kafka):

```bash
cd exercise-1-kafka
python3 put.py   # Terminal 1
python3 get.py   # Terminal 2
```

Exercise 2 (Spark Streaming):

```bash
cd exercise-2-spark-streaming
nc -lk 3333              # Terminal 1
spark-submit consume.py  # Terminal 2
```
## Sample Output

Kafka consumer:

```text
Key=MYID, Value=a2054xxxx
Key=MYNAME, Value=Naveed
Key=MYEYECOLOR, Value=Brown
```

Spark Streaming word count:

```text
('hello', 1)
('spark', 1)
('streaming', 1)
('works', 1)
```
## Author

Naveed
🎓 MS CS @ Illinois Institute of Technology
💼 Aspiring Software Engineer
🌐 GitHub