Welcome to the Data Engineering Internship Assignment! This task is designed to evaluate your problem-solving skills, understanding of data pipelines, and ability to work with web crawling and data extraction. Please read the instructions carefully and submit your solution as per the guidelines provided.
You are tasked with building a basic web crawling pipeline to extract and process data from a target website. The goal is to:
- Crawl a given webpage to extract specific information.
- Clean and process the extracted data.
- Store the processed data into a MongoDB database.
You will be working with the Books to Scrape website (http://books.toscrape.com/) or any other publicly accessible e-commerce website containing product information. Ensure that your crawler abides by the website's robots.txt policy.
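Scrapy can enforce this through its built-in settings. Below is a minimal sketch of a polite crawl configuration; the project name, contact URL, and delay values are placeholder assumptions, not requirements of this assignment.

```python
# settings.py -- sketch of polite-crawling settings (values are placeholders)

BOT_NAME = "bookscraper"  # hypothetical project name

# Respect the target site's robots.txt rules (handled by Scrapy's RobotsTxtMiddleware).
ROBOTSTXT_OBEY = True

# Throttle requests so the crawl stays gentle on the target server.
DOWNLOAD_DELAY = 1.0                 # seconds between requests to the same domain
CONCURRENT_REQUESTS_PER_DOMAIN = 2

# Identify the crawler honestly.
USER_AGENT = "bookscraper (+https://example.com/contact)"  # placeholder contact URL
```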
- Use the `Scrapy` framework to:
  - Fetch the HTML content of the target webpage.
  - Extract product details such as:
    - Product Name
    - Price
    - Rating
    - Availability Status
- Clean the extracted data (e.g., remove extra whitespace, convert prices to float, handle missing ratings).
- Standardize the data (e.g., convert availability status to `In Stock` or `Out of Stock`).
- Store the processed data into a MongoDB database.
- Use a collection named `products` with the following schema (a sketch of the spider and storage pipeline follows this list):
  - `product_name` (string)
  - `price` (float)
  - `rating` (float)
  - `availability` (string)
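To make the expected shape of the crawler concrete, here is one possible sketch of a spider that extracts and cleans the four fields. The CSS selectors reflect the current Books to Scrape markup, and the spider name, fallback values, and rating handling are assumptions you are free to change; treat this as a starting point, not a reference solution.

```python
# spiders/books.py -- sketch of a spider for books.toscrape.com (names and selectors are assumptions)
import re

import scrapy

# The site encodes ratings as CSS classes such as "star-rating Three".
RATING_WORDS = {"One": 1.0, "Two": 2.0, "Three": 3.0, "Four": 4.0, "Five": 5.0}


class BooksSpider(scrapy.Spider):
    name = "books"  # hypothetical spider name
    start_urls = ["http://books.toscrape.com/"]

    def parse(self, response):
        for product in response.css("article.product_pod"):
            rating_class = product.css("p.star-rating::attr(class)").get(default="")
            availability = " ".join(product.css("p.availability::text").getall())
            availability = " ".join(availability.split())  # collapse stray whitespace

            yield {
                "product_name": (product.css("h3 a::attr(title)").get() or "").strip(),
                # Strip the currency symbol and convert to float (None if malformed).
                "price": self.parse_price(product.css("p.price_color::text").get(default="")),
                # Missing or unrecognised ratings become None rather than a fake value.
                "rating": RATING_WORDS.get(rating_class.replace("star-rating", "").strip()),
                # Standardise free-text availability into the two required values.
                "availability": "In Stock" if "in stock" in availability.lower() else "Out of Stock",
            }

        # Follow pagination so the whole catalogue is covered.
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

    @staticmethod
    def parse_price(raw):
        """Turn a price string such as '£51.77' into a float, or None if it cannot be parsed."""
        match = re.search(r"\d+(?:\.\d+)?", raw)
        return float(match.group()) if match else None
```

Storage could then live in a standard Scrapy item pipeline. The sketch below assumes the `pymongo` driver and two custom settings, `MONGO_URI` and `MONGO_DATABASE`, which are placeholders you would adapt to your own environment.

```python
# pipelines.py -- sketch of an item pipeline that writes items to the `products` collection
import pymongo


class MongoPipeline:
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # MONGO_URI and MONGO_DATABASE are assumed custom settings, e.g. in settings.py:
        #   MONGO_URI = "mongodb://localhost:27017"
        #   MONGO_DATABASE = "bookstore"
        return cls(
            mongo_uri=crawler.settings.get("MONGO_URI", "mongodb://localhost:27017"),
            mongo_db=crawler.settings.get("MONGO_DATABASE", "bookstore"),
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # Each item already matches the required schema: product_name, price, rating, availability.
        self.db["products"].insert_one(dict(item))
        return item
```

The pipeline still needs to be registered via the `ITEM_PIPELINES` setting before Scrapy will call it. Whether the cleaning logic lives in the spider (as above) or in its own pipeline stage is a design choice worth explaining in your README.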
Prepare a `README.md` file that includes:
- An overview of your solution.
- Steps to set up and run your crawler.
- Dependencies and setup instructions.
In addition, follow these general guidelines:
- Use meaningful commit messages.
- Follow a proper branch naming convention (e.g., `feature/<your_name>`).
- Ensure your code is clean, modular, and well-commented.
To submit your solution:
- Fork this repository.
- Create a new branch named `submission/<your_name>`.
- Commit your code and push it to your forked repository.
- Create a Pull Request (PR) to the `main` branch of this repository.
- Include your `README.md` and ensure your code is well-documented.
Your submission will be evaluated on:
- Correctness: Does your solution meet the requirements?
- Code Quality: Is your code clean, modular, and well-documented?
- Efficiency: Are the crawling and transformations optimized?
- Git Practices: Are proper git guidelines followed?
Good luck! We look forward to reviewing your submission.