Welcome to the Data Engineering Internship Assignment! This task is designed to evaluate your problem-solving skills, understanding of data pipelines, and ability to work with web crawling and data extraction. Please read the instructions carefully and submit your solution as per the guidelines provided.
You are tasked with building a basic web crawling pipeline to extract and process data from a target website. The goal is to:
- Crawl a given webpage to extract specific information.
- Clean and process the extracted data.
- Store the processed data into a MongoDB database.
You will be working with the Books to Scrape website (http://books.toscrape.com/) or any other publicly accessible e-commerce website containing product information. Ensure that your crawler abides by the website's robots.txt policy.
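Scrapy can enforce this through its built-in settings. Below is a minimal sketch of a polite crawl configuration; the project name, contact URL, and delay values are placeholder assumptions, not requirements of this assignment.

```python
# settings.py -- sketch of polite-crawling settings (values are placeholders)

BOT_NAME = "bookscraper"  # hypothetical project name

# Respect the target site's robots.txt rules (handled by Scrapy's RobotsTxtMiddleware).
ROBOTSTXT_OBEY = True

# Throttle requests so the crawl stays gentle on the target server.
DOWNLOAD_DELAY = 1.0                 # seconds between requests to the same domain
CONCURRENT_REQUESTS_PER_DOMAIN = 2

# Identify the crawler honestly.
USER_AGENT = "bookscraper (+https://example.com/contact)"  # placeholder contact URL
```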
- Use the `Scrapy` framework to:
  - Fetch the HTML content of the target webpage.
  - Extract product details such as:
    - Product Name
    - Price
    - Rating
    - Availability Status
- Clean the extracted data (e.g., remove extra whitespace, convert prices to float, handle missing ratings).
- Standardize the data (e.g., convert availability status to `In Stock` or `Out of Stock`).
- Store the processed data into a MongoDB database.
- Use a collection named `products` with the following schema (a sketch of the spider and storage pipeline follows this list):
  - `product_name` (string)
  - `price` (float)
  - `rating` (float)
  - `availability` (string)
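To make the expected shape of the crawler concrete, here is one possible sketch of a spider that extracts and cleans the four fields. The CSS selectors reflect the current Books to Scrape markup, and the spider name, fallback values, and rating handling are assumptions you are free to change; treat this as a starting point, not a reference solution.

```python
# spiders/books.py -- sketch of a spider for books.toscrape.com (names and selectors are assumptions)
import re

import scrapy

# The site encodes ratings as CSS classes such as "star-rating Three".
RATING_WORDS = {"One": 1.0, "Two": 2.0, "Three": 3.0, "Four": 4.0, "Five": 5.0}


class BooksSpider(scrapy.Spider):
    name = "books"  # hypothetical spider name
    start_urls = ["http://books.toscrape.com/"]

    def parse(self, response):
        for product in response.css("article.product_pod"):
            rating_class = product.css("p.star-rating::attr(class)").get(default="")
            availability = " ".join(product.css("p.availability::text").getall())
            availability = " ".join(availability.split())  # collapse stray whitespace

            yield {
                "product_name": (product.css("h3 a::attr(title)").get() or "").strip(),
                # Strip the currency symbol and convert to float (None if malformed).
                "price": self.parse_price(product.css("p.price_color::text").get(default="")),
                # Missing or unrecognised ratings become None rather than a fake value.
                "rating": RATING_WORDS.get(rating_class.replace("star-rating", "").strip()),
                # Standardise free-text availability into the two required values.
                "availability": "In Stock" if "in stock" in availability.lower() else "Out of Stock",
            }

        # Follow pagination so the whole catalogue is covered.
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

    @staticmethod
    def parse_price(raw):
        """Turn a price string such as '£51.77' into a float, or None if it cannot be parsed."""
        match = re.search(r"\d+(?:\.\d+)?", raw)
        return float(match.group()) if match else None
```

Storage could then live in a standard Scrapy item pipeline. The sketch below assumes the `pymongo` driver and two custom settings, `MONGO_URI` and `MONGO_DATABASE`, which are placeholders you would adapt to your own environment.

```python
# pipelines.py -- sketch of an item pipeline that writes items to the `products` collection
import pymongo


class MongoPipeline:
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # MONGO_URI and MONGO_DATABASE are assumed custom settings, e.g. in settings.py:
        #   MONGO_URI = "mongodb://localhost:27017"
        #   MONGO_DATABASE = "bookstore"
        return cls(
            mongo_uri=crawler.settings.get("MONGO_URI", "mongodb://localhost:27017"),
            mongo_db=crawler.settings.get("MONGO_DATABASE", "bookstore"),
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # Each item already matches the required schema: product_name, price, rating, availability.
        self.db["products"].insert_one(dict(item))
        return item
```

The pipeline still needs to be registered via the `ITEM_PIPELINES` setting before Scrapy will call it. Whether the cleaning logic lives in the spider (as above) or in its own pipeline stage is a design choice worth explaining in your README.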
Prepare a `README.md` file that includes:
- An overview of your solution.
- Steps to set up and run your crawler.
- Dependencies and setup instructions.
In addition, follow these general guidelines:
- Use meaningful commit messages.
- Follow a proper branch naming convention (e.g., `feature/<your_name>`).
- Ensure your code is clean, modular, and well-commented.
To submit your solution:
- Fork this repository.
- Create a new branch named `submission/<your_name>`.
- Commit your code and push it to your forked repository.
- Create a Pull Request (PR) to the `main` branch of this repository.
- Include your `README.md` and ensure your code is well-documented.
Your submission will be evaluated on:
- Correctness: Does your solution meet the requirements?
- Code Quality: Is your code clean, modular, and well-documented?
- Efficiency: Are the crawling and transformations optimized?
- Git Practices: Are proper git guidelines followed?
Good luck! We look forward to reviewing your submission.