@buse crawler

This is a web crawler that uses HTML parsing to detect links on a website. When the crawl finishes, it lists every link found on the site and the total number of times each link is linked to.
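
The counting step described above could look roughly like the sketch below. It assumes the hrefs have already been pulled out of the page's HTML; `countLinks` is a hypothetical name used for illustration and is not taken from this repo's source.

```javascript
// Hypothetical helper, not this project's actual implementation.
// Resolves each href against the page URL and tallies how often each target appears.
function countLinks(hrefs, pageURL, counts = new Map()) {
  for (const href of hrefs) {
    let target;
    try {
      // The built-in URL class resolves relative links such as "/about".
      target = new URL(href, pageURL).href;
    } catch {
      continue; // skip hrefs that cannot be parsed as URLs
    }
    counts.set(target, (counts.get(target) ?? 0) + 1);
  }
  return counts;
}

const counts = countLinks(
  ['/about', 'https://example.com/about', '/contact'],
  'https://example.com/'
);
for (const [url, n] of counts) {
  console.log(`${n} x ${url}`);
}
// prints "2 x https://example.com/about"
// prints "1 x https://example.com/contact"
```

Passing the same `counts` map into later calls lets the totals accumulate across every page visited.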

Built With

Our spider can be run on any machine, as long as Node.js is installed.

Getting Started

To get a local copy up and running follow these simple steps.

Installation

Find a website that does NOT use Cloudflare (for example, Google)

Clone the repo (or download manually)
```
git clone https://github.com/abusedev/crawler.git
```

Install NPM packages
```
npm install
```
If "npm" is not recognized as a command and you just installed Node.js, try restarting your device.
Create config type
```
npm run settings silent
```
Run the program
```
npm start https://google.com
```
Roadmap
- Normalize urls
- Ignore pages with status codes above 399 (server/client rejection)
- Custom user agent
- Ignore non HTML related content
- Automatically go to next linked page
- Ignore already loaded page
- Count times link is linked to
- Visually pleasing and readable log messages
- Cloudflare bypass (probably requires switching from fetch to curl)
- Rate limit respecter
- Save results to file
- User settings
- Color-coded logs
- Executable build
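
To make a few of the roadmap items concrete, here is a minimal sketch of URL normalization and of the status-code and content-type checks. The names `normalizeURL` and `shouldParse` are hypothetical, and the normalization rules shown (drop the protocol, fragment, and trailing slash) are one possible choice, not necessarily what this project does.

```javascript
// Hypothetical sketches of roadmap items; the repo's own rules may differ.

// Normalize a URL so the same page is not counted twice:
// drops the protocol and fragment, strips trailing slashes, keeps the query string.
function normalizeURL(input) {
  const url = new URL(input); // throws a TypeError on invalid input
  const path = url.pathname.replace(/\/+$/, '');
  return `${url.host}${path}${url.search}`;
}

// Decide whether a fetch() response is worth parsing for links:
// skip status codes above 399 and anything that is not HTML.
function shouldParse(response) {
  if (response.status > 399) return false;
  const type = response.headers.get('content-type') ?? '';
  return type.includes('text/html');
}

console.log(normalizeURL('https://Example.com/path/'));    // prints "example.com/path"
console.log(normalizeURL('http://example.com/path#top'));  // prints "example.com/path"
```

Keeping a `Set` of normalized URLs would then be enough to skip pages that were already loaded.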

Contribution

Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated. You can open a pull request on the GitHub repository.

If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement". Don't forget to give the project a star!

Issues

Are you experiencing a bug? Open an issue on the GitHub repository.

Acknowledgments

Node.js - runs JavaScript on the CLI
Jest - handles our function testing
jsdom - handles HTML parsing

Licensing

Permissions

✔️ Commercial use
✔️ Modification
✔️ Distribution
✔️ Private use

Release conditions

❕ License and copyright notice
❕ State changes
❕ Disclose source
❕ Same license

Limitations

❌ Liability
❌ Warranty

License being used: GNU General Public License v2.0
