@buse crawler

This is a web crawler that uses HTML parsing to detect links on a website. When it finishes, it lists every link found on the site and how many times each link is linked to in total.
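
The counting step described above could look roughly like the sketch below. It assumes the href values have already been pulled out of the page's HTML (the project lists jsdom for that part), and `countLinks` with its parameters is a hypothetical name for illustration, not a function taken from this repository.

```javascript
// Hypothetical helper: tally how often each link target appears.
// `hrefs` are raw href values found on a page; `baseUrl` is the page they came from.
function countLinks(hrefs, baseUrl) {
  const counts = new Map();
  for (const href of hrefs) {
    let absolute;
    try {
      // Resolve relative links such as "/about" against the page URL.
      absolute = new URL(href, baseUrl).href;
    } catch {
      continue; // skip values that cannot be parsed as a URL
    }
    counts.set(absolute, (counts.get(absolute) ?? 0) + 1);
  }
  return counts;
}

// Example input: two different spellings of the same page plus one other link.
const counts = countLinks(
  ["/about", "https://example.com/about", "/contact"],
  "https://example.com/"
);
for (const [url, n] of counts) {
  console.log(`${url} is linked to ${n} time(s)`);
}
```

Resolving every href against the page URL first means relative and absolute links to the same page share a single counter.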

Built With

Our spider can be run on any machine, as long as Node.js is installed.

Getting Started

To get a local copy up and running, follow these simple steps.

Installation

  1. Find a website that does NOT use Cloudflare (for example, Google)

  2. Clone the repo (or download manually)

    git clone https://github.com/abusedev/crawler.git
  3. Install NPM packages

    npm install

    If "npm" is not recognized as a command and you just installed Node.js, try restarting your device.

  4. Create config type

    npm run settings silent
  5. Run the program

    npm start https://google.com

Roadmap

  • Normalize URLs
  • Ignore pages with status codes above 399 (server/client rejection)
  • Custom user agent
  • Ignore non-HTML content
  • Automatically go to the next linked page
  • Ignore already loaded pages
  • Count the times each link is linked to
  • Visually pleasing and readable log messages
  • Cloudflare bypass (probably requires changing from fetch to curl)
  • Rate limit respecter
  • Save results to file
  • User settings
  • Color-coded logs
  • Executable build
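
Three of the roadmap items (URL normalization, skipping status codes above 399, and skipping non-HTML content) can be illustrated with a minimal sketch. The helper names `normalizeUrl` and `isCrawlable` are hypothetical, and the exact normalization rules (dropping the fragment and a trailing slash) are assumptions for illustration; the repository's own rules may differ.

```javascript
// Hypothetical helper: reduce a URL to a comparable form so that
// "https://Example.com/path/" and "https://example.com/path" count as one page.
function normalizeUrl(input) {
  const url = new URL(input); // the URL parser lowercases the host
  let path = url.pathname;
  // Treat a trailing slash as insignificant (an assumption; some sites differ).
  if (path.length > 1 && path.endsWith("/")) {
    path = path.slice(0, -1);
  }
  // The fragment (#...) is left out because it does not change which page is fetched.
  return `${url.protocol}//${url.host}${path}${url.search}`;
}

// Hypothetical helper: decide whether a response is worth parsing for links.
function isCrawlable(status, contentType) {
  if (status > 399) return false; // client or server rejection
  return typeof contentType === "string" && contentType.includes("text/html");
}

console.log(normalizeUrl("https://Example.com/path/#top"));
console.log(isCrawlable(200, "text/html; charset=utf-8"));
console.log(isCrawlable(404, "text/html"));
```

With a fetch response, the check could be called as `isCrawlable(response.status, response.headers.get("content-type"))`.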

Contribution

Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.

If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement". Don't forget to give the project a star!

Issues

Are you experiencing a bug? Open an issue on the GitHub repository.

Acknowledgments

Node.js - runs JavaScript on the CLI
Jest - handles our function testing
jsdom - handles HTML

Licensing

Permissions

  • ✔️ Commercial use
  • ✔️ Modification
  • ✔️ Distribution
  • ✔️ Private use

Release conditions

  • ❕ License and copyright notice
  • ❕ State changes
  • ❕ Disclose source
  • ❕ Same license

Limitations

  • ❌ Liability
  • ❌ Warranty

License being used: GNU General Public License v2.0
