- Scrapes TheHackerNews.com and stores the results (Description, Image, Title, URL)
- Maintains two relations: one mapping each post's URL to its title, and another mapping the URL to its metadata (Description, Image, Title, Author); see the sketch below
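For orientation, here is a minimal sketch of the idea in Python. The database name (`hackernews`) and collection names (`url-title`, `url-others`) match what the notebook produces; the CSS selectors and field extraction are illustrative assumptions about the site's markup, not the notebook's exact code.

```python
# Minimal sketch of the scraping idea, not the notebook's exact code.
# The CSS selectors are assumptions about thehackernews.com's markup and
# may need adjusting; database and collection names match the notebook.
import requests
from bs4 import BeautifulSoup
from pymongo import MongoClient

resp = requests.get("https://thehackernews.com/")
soup = BeautifulSoup(resp.text, "html.parser")

client = MongoClient("mongodb://localhost:27017/")  # replace with your URI
db = client["hackernews"]

for post in soup.select("div.body-post"):  # hypothetical post container
    link = post.find("a")
    title_tag = post.select_one("h2")            # hypothetical title element
    desc_tag = post.select_one("div.home-desc")  # hypothetical description
    img_tag = post.find("img")
    if link is None or title_tag is None:
        continue
    url = link.get("href")
    title = title_tag.get_text(strip=True)
    # Relation 1: URL -> title
    db["url-title"].insert_one({"url": url, "title": title})
    # Relation 2: URL -> metadata (Author extraction omitted in this sketch)
    db["url-others"].insert_one({
        "url": url,
        "title": title,
        "description": desc_tag.get_text(strip=True) if desc_tag else None,
        "image": img_tag.get("src") if img_tag else None,
    })
```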
Storage options:

- MongoDB
- JSON
- MySQL (WIP)
Prerequisites:

- python3
- pip
- Python libraries: requests, BeautifulSoup4, pymongo, jupyterlab, notebook
- MongoDB
- git
Type the following in your terminal:

- Clone the repository:

  ```sh
  git clone https://github.com/pushp1997/Hackernews-Scraping.git
  ```

- Change into the repository directory:

  ```sh
  cd ./Hackernews-Scraping
  ```

- Create a Python virtual environment:

  ```sh
  python3 -m venv ./scrapeVenv
  ```

- Activate the virtual environment:

  - On Linux / macOS:

    ```sh
    source ./scrapeVenv/bin/activate
    ```

  - On Windows (cmd):

    ```bat
    .\scrapeVenv\Scripts\activate.bat
    ```

  - On Windows (PowerShell):

    ```powershell
    .\scrapeVenv\Scripts\Activate.ps1
    ```

- Install the Python requirements:

  ```sh
  pip install -r requirements.txt
  ```
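  The requirements file should list roughly the libraries named in the prerequisites; an unpinned sketch of its contents (the actual file may pin versions):

  ```
  requests
  beautifulsoup4
  pymongo
  jupyterlab
  notebook
  ```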
- Open the notebook with Jupyter:

  ```sh
  jupyter notebook "Hackernews Scraper.ipynb"
  ```
- Run the notebook; you will be prompted for the number of pages to scrape and the MongoDB URI where the scraped posts should be stored.
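  The prompts look roughly like the following (a sketch; the notebook's exact wording may differ):

  ```python
  # Illustrative input cell; prompt wording is an assumption.
  no_of_pages = int(input("Number of pages to scrape: "))
  mongo_uri = input("MongoDB URI: ")
  ```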
- To verify the data, open a MongoDB shell connected to the same URI you provided to the notebook while running it.
- Switch to the database:

  ```
  use hackernews
  ```

- Print the documents in the `url-title` collection:

  ```
  db["url-title"].find().pretty()
  ```

- Print the documents in the `url-others` collection:

  ```
  db["url-others"].find().pretty()
  ```
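Equivalently, you can inspect both collections from Python with pymongo (a sketch; substitute the URI you used):

```python
# Query the same collections from Python instead of the mongo shell.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")  # the URI you provided
db = client["hackernews"]

for doc in db["url-title"].find():
    print(doc)
for doc in db["url-others"].find():
    print(doc)
```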