A java web crawler for listing all internal domain adress of a web site
A simple pure java batch, taking a main url as input,
crawls through its internal domains using htmlUnit that simulates a webClient,
and output the list of internal web pages found. Project is also creating a map of parent->child web pages.
Can evolve easily to a web api giving the site map through web service.
-download sources
-run: mvn clean package
execute the jar by giving it a base url
ex: java -jar target/webCrawler-0.0.1-SNAPSHOT-jar-with-dependencies.jar https://babylonhealth.com
run the jar with a config file in the same root folder containing the webClient configs.
Example file is in the src/main/ressource folder.
isCssEnabled : false
isJSEnabled : false
jsBGTimeout : 0ms
isRedirectEnabled : true