Basic web crawler written in Python using BeautifulSoup

It has the following functions:

hTagExtraction: finds all H1 through H6 tags on a given site and appends them to a list.
pTagExtraction: finds all paragraph tags on a given site and appends them to a list.
remove_stopwords: removes all stopwords from the pTagList array. More information on stopwords can be found here: https://www.nltk.org/book/ch02.html
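The two extraction functions can be sketched roughly as follows. This is a minimal illustration, not the code in extractor.py: the names extract_headings, extract_paragraphs, and SAMPLE_HTML are hypothetical, and the HTML is parsed from an inline string so the example runs without a network request.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page; the real script would fetch HTML for the url variable.
SAMPLE_HTML = """
<html><body>
<h1>Main title</h1>
<p>The first paragraph of the page.</p>
<h3>A subsection</h3>
<p>Another paragraph.</p>
</body></html>
"""

HEADING_TAGS = ["h1", "h2", "h3", "h4", "h5", "h6"]


def extract_headings(html):
    """Return the text of every H1-H6 tag, in document order."""
    soup = BeautifulSoup(html, "html.parser")
    # Passing a list to find_all matches any of the listed tag names.
    return [tag.get_text(strip=True) for tag in soup.find_all(HEADING_TAGS)]


def extract_paragraphs(html):
    """Return the text of every <p> tag, in document order."""
    soup = BeautifulSoup(html, "html.parser")
    return [tag.get_text(strip=True) for tag in soup.find_all("p")]


h_tag_list = extract_headings(SAMPLE_HTML)
p_tag_list = extract_paragraphs(SAMPLE_HTML)
print(h_tag_list)  # ['Main title', 'A subsection']
print(p_tag_list)  # ['The first paragraph of the page.', 'Another paragraph.']
```

Returning the lists instead of appending to a shared one is a choice made here for clarity; the project itself may keep module-level lists such as pTagList.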

Usage

Install the prerequisites

pip install -r requirements.txt

Add your URL to the url variable
Call the method you would like to invoke

Run either the H tag extractor or the P tag extractor before calling the stopword removal function; otherwise it will search for stopwords in an empty list.
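The ordering requirement above can be illustrated with a small sketch. Everything here is an assumption made for the example: the tiny STOPWORDS set stands in for a full stopword list such as the one shipped with NLTK, and the empty-list check is an illustrative guard, not something the project is known to implement.

```python
# Tiny illustrative stopword set (an assumption); a real run would likely
# use a full English stopword list instead.
STOPWORDS = {"the", "of", "a", "an", "and", "is", "in", "to"}


def remove_stopwords(paragraphs, stopwords=STOPWORDS):
    """Return each paragraph with its stopwords removed."""
    if not paragraphs:
        # Hypothetical guard: fail loudly when no extractor has run yet.
        raise ValueError("Nothing to clean: run an extractor first")
    cleaned = []
    for text in paragraphs:
        # Compare case-insensitively so "The" is dropped as well as "the".
        kept = [word for word in text.split() if word.lower() not in stopwords]
        cleaned.append(" ".join(kept))
    return cleaned


# Stand-in for the list produced by the P tag extractor.
p_tag_list = ["The first paragraph of the page", "Another paragraph in a list"]
print(remove_stopwords(p_tag_list))
# ['first paragraph page', 'Another paragraph list']
```

Calling remove_stopwords([]) in this sketch raises ValueError, which makes the "extract first" rule explicit instead of silently returning nothing.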
