This is a web crawler that uses HTML parsing to detect links on a website. When the crawl finishes, it lists every link found on the site and how many times each one is linked to in total.
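To give a rough idea of what "detect links and count them" means, here is a minimal sketch. It is not the project's actual code: `extractLinks` and `countLinks` are hypothetical names, and the regular expression is a simplified stand-in for a real HTML parser, used only so the example runs without extra packages.

```javascript
// Simplified stand-in for real HTML parsing: pull href values out of <a> tags
// with a regular expression. A real crawler would use a proper HTML parser.
function extractLinks(html, baseUrl) {
  const links = [];
  const anchorHref = /<a\s[^>]*?href\s*=\s*["']([^"']+)["']/gi;
  for (const match of html.matchAll(anchorHref)) {
    try {
      // Resolve relative hrefs such as "/about" against the page URL.
      links.push(new URL(match[1], baseUrl).href);
    } catch {
      // Skip hrefs that cannot be parsed as a URL.
    }
  }
  return links;
}

// Tally how many times each link is linked to.
function countLinks(links, counts = new Map()) {
  for (const link of links) {
    counts.set(link, (counts.get(link) ?? 0) + 1);
  }
  return counts;
}

// Example page (made up for this sketch).
const html = `
  <a href="/about">About</a>
  <a class="nav" href="https://example.com/about">About again</a>
  <a href="/contact">Contact</a>
`;

const counts = countLinks(extractLinks(html, "https://example.com/"));
for (const [link, count] of counts) {
  console.log(`${count} x ${link}`);
}
// prints:
// 2 x https://example.com/about
// 1 x https://example.com/contact
```

Passing the same `counts` map into `countLinks` for every crawled page would give a site-wide total.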
Our spider can be run on any machine, as long as Node.js is installed.
To get a local copy up and running, follow these simple steps.
- Find a website that does NOT use Cloudflare (for example, Google)
- Clone the repo (or download it manually)
  git clone https://github.com/abusedev/crawler.git
- Install NPM packages
  npm install
  If "npm" is not recognized as a command and you just installed Node.js, try restarting your device
- Create config type
  npm run settings silent
- Run the program
  npm start https://google.com
- Normalize URLs
- Ignore pages with status codes above 399 (client/server errors)
- Custom user agent
- Ignore non-HTML content
- Automatically go to the next linked page
- Ignore already loaded pages
- Count how many times each link is linked to
- Visually pleasing and readable log messages
- Cloudflare bypass (probably requires switching from fetch to curl)
- Rate limit respecter
- Save results to file
- User settings
- Color-coded logs
- Executable build
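Several items in the list above (URL normalization, skipping error responses and non-HTML content, skipping already loaded pages, and link counting) fit together in one loop. The sketch below shows one way that could look. It is not taken from the project: `normalizeUrl`, `isCrawlable`, `crawl`, and `loadPage` are hypothetical names, the normalization rules (drop the fragment, trim a trailing slash) and the same-origin restriction are assumptions, and the "website" is an in-memory map so the example runs without network access.

```javascript
// Assumed normalization rules: drop the #fragment and trim a trailing slash.
// The URL class already lowercases the protocol and hostname.
function normalizeUrl(absoluteUrl) {
  const url = new URL(absoluteUrl);
  url.hash = "";
  const href = url.href;
  return href.endsWith("/") ? href.slice(0, -1) : href;
}

// Skip pages with status codes above 399 and anything that is not HTML.
function isCrawlable(page) {
  return page.status <= 399 && page.contentType.includes("text/html");
}

// loadPage(url) is a hypothetical async function returning
// { status, contentType, links }, where links are absolute URLs.
async function crawl(startUrl, loadPage) {
  const origin = new URL(startUrl).origin;
  const visited = new Set(); // pages already loaded
  const counts = new Map(); // link -> number of times it is linked to
  const queue = [normalizeUrl(startUrl)];

  while (queue.length > 0) {
    const url = queue.shift();
    if (visited.has(url)) continue; // ignore already loaded pages
    visited.add(url);

    const page = await loadPage(url);
    if (!isCrawlable(page)) continue;

    for (const href of page.links) {
      const link = normalizeUrl(href);
      counts.set(link, (counts.get(link) ?? 0) + 1);
      // Assumption: only follow links that stay on the starting site.
      if (new URL(link).origin === origin && !visited.has(link)) {
        queue.push(link);
      }
    }
  }
  return counts;
}

// In-memory stand-in for a website (made up for this sketch).
const fakeSite = new Map([
  ["https://example.com", {
    status: 200,
    contentType: "text/html; charset=utf-8",
    links: ["https://example.com/a", "https://example.com/b#top"],
  }],
  ["https://example.com/a", {
    status: 200,
    contentType: "text/html",
    links: ["https://example.com/", "https://example.com/b"],
  }],
  ["https://example.com/b", {
    status: 404, // error page: its links are never read
    contentType: "text/html",
    links: ["https://example.com/a"],
  }],
]);
const missingPage = { status: 404, contentType: "", links: [] };
const loadFakePage = async (url) => fakeSite.get(url) ?? missingPage;

crawl("https://example.com/", loadFakePage).then((counts) => {
  for (const [link, count] of counts) {
    console.log(`${count} x ${link}`);
  }
});
```

In a real run, `loadPage` would wrap an HTTP request (where a custom `User-Agent` header could be set) and an HTML parser. Injecting it as a parameter keeps the loop itself easy to test.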
Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated. You can make a pull request on the GitHub repository.
If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement". Don't forget to give the project a star!
Are you experiencing a bug? Open an issue on the GitHub repository.
Permissions
- ✔️ Commercial use
- ✔️ Modification
- ✔️ Distribution
- ✔️ Private use
Release conditions
- ❕ License and copyright notice
- ❕ State changes
- ❕ Disclose source
- ❕ Same license
Limitations
- ❌ Liability
- ❌ Warranty
License: GNU General Public License v2.0