Fetch the HTML and assets of webpages and save them to disk for later browsing and retrieval.
- Install python3
- Install the required packages:
pip install -r requirements.txt
Inside the program folder, run main.py directly from the command line:
$ cd program
$ python3 main.py https://www.google.com https://www.github.com --metadata --assets
Inside the program folder, run using Docker:
$ cd program
$ docker image build -t fetch-webpages:latest .
$ docker container run --rm -v ${PWD}:/fetch fetch-webpages:latest https://www.google.com https://www.github.com --metadata --assets
In the root folder, run using the bash script:
$ ./fetch https://www.google.com https://www.github.com --metadata --assets
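The commands above all take one or more URLs followed by optional flags. As a rough illustration only, a command line of that shape could be parsed with argparse as sketched below. The function name build_parser and the help texts are hypothetical; the actual main.py may define its arguments differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser mirroring the commands shown above;
    # not taken from the project's main.py.
    parser = argparse.ArgumentParser(
        prog="fetch",
        description="Fetch webpages and save them to disk.",
    )
    # One or more URLs, e.g. https://www.google.com https://www.github.com
    parser.add_argument("urls", nargs="+", help="webpage URLs to fetch")
    # Optional switches; both default to False when omitted.
    parser.add_argument("--metadata", action="store_true",
                        help="include statistics about each loaded webpage")
    parser.add_argument("--assets", action="store_true",
                        help="download page assets next to the HTML")
    return parser
```

Calling build_parser().parse_args() on the example command would yield a list of two URLs with both flags set to True.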
- Include --metadata to include statistics about the loaded webpage.
- Include --assets to download assets (img, css, js, etc.) to the same folder. (Note: currently only img assets are downloaded, due to lack of time.)
- Each webpage is stored as a separate folder inside output in the current directory.
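To make the options above more concrete, here is a minimal sketch of how a page might be mapped to its folder under output and how simple statistics could be gathered. Everything in it is an assumption for illustration: naming the folder after the URL's host, the file name index.html, the helper names folder_for, page_stats and save_page, and the choice of statistics (link and image counts) are not confirmed by this README.

```python
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import urlparse


def folder_for(url: str, root: str = "output") -> Path:
    # Assumption: one folder per webpage, named after the URL's host.
    return Path(root) / urlparse(url).netloc


class _TagCounter(HTMLParser):
    """Counts <a> and <img> start tags while the HTML is parsed."""

    def __init__(self) -> None:
        super().__init__()
        self.links = 0
        self.images = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links += 1
        elif tag == "img":
            self.images += 1


def page_stats(html: str) -> dict:
    # Hypothetical metadata: the real --metadata output may differ.
    counter = _TagCounter()
    counter.feed(html)
    return {"num_links": counter.links, "num_images": counter.images}


def save_page(url: str, html: str, root: str = "output") -> Path:
    # Create the page's folder if needed and write the HTML into it.
    folder = folder_for(url, root)
    folder.mkdir(parents=True, exist_ok=True)
    (folder / "index.html").write_text(html, encoding="utf-8")
    return folder
```

Under these assumptions, save_page("https://www.google.com", html) would write output/www.google.com/index.html, and any downloaded img files could be placed in that same folder.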