It has the following functions:
- `hTagExtraction`: finds all H1 through H6 tags in a given site and appends them to a list.
- `pTagExtraction`: finds all paragraph tags in a given site and appends them to a list.
- `remove_stopwords`: removes all the stopwords within the `pTagList` array. More information on stopwords can be found here: https://www.nltk.org/book/ch02.html
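
As a rough illustration of what the two extraction functions do, here is a minimal sketch that uses only the Python standard library. It is not the project's own code: the `TagTextCollector` class, the `extract_headings` and `extract_paragraphs` helpers, and `SAMPLE_HTML` are hypothetical names, and the sketch parses an inline HTML string rather than fetching a URL. It also assumes well-formed markup with explicit closing tags.

```python
from html.parser import HTMLParser


class TagTextCollector(HTMLParser):
    """Collect the text inside selected tags (hypothetical helper, not the project's code)."""

    def __init__(self, wanted_tags):
        super().__init__()
        self.wanted_tags = set(wanted_tags)
        self.collected = []   # one string per matching element
        self._depth = 0       # greater than 0 while inside a wanted tag
        self._buffer = []     # text fragments of the element being read

    def handle_starttag(self, tag, attrs):
        if tag in self.wanted_tags:
            if self._depth == 0:
                self._buffer = []
            self._depth += 1

    def handle_endtag(self, tag):
        if tag in self.wanted_tags and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                # Join the fragments and collapse runs of whitespace.
                text = " ".join("".join(self._buffer).split())
                if text:
                    self.collected.append(text)

    def handle_data(self, data):
        if self._depth > 0:
            self._buffer.append(data)


def extract_headings(html):
    """Return the text of every h1 through h6 element, in document order."""
    collector = TagTextCollector(["h1", "h2", "h3", "h4", "h5", "h6"])
    collector.feed(html)
    collector.close()
    return collector.collected


def extract_paragraphs(html):
    """Return the text of every p element, in document order."""
    collector = TagTextCollector(["p"])
    collector.feed(html)
    collector.close()
    return collector.collected


# Inline sample page (assumption) so the sketch runs without network access.
SAMPLE_HTML = (
    "<html><body><h1>Main title</h1><p>This is the first paragraph.</p>"
    "<h2>Sub  heading</h2><p>Another <b>bold</b> paragraph</p></body></html>"
)

headings = extract_headings(SAMPLE_HTML)
paragraphs = extract_paragraphs(SAMPLE_HTML)
```

With a real site, the HTML string would come from an HTTP request to the configured URL instead of `SAMPLE_HTML`.
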
- Install the prerequisites: `pip install -r requirements.txt`
- Add your URL to the `url` variable
- Call the method you would like to invoke
You need to run either the H tag extractor or the P tag extractor before the stopword removal function; otherwise the function will try to find stopwords in an empty list.
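
The ordering requirement can be seen in a small sketch. Everything here is an assumption made for illustration: `remove_stopwords_sketch` and `p_tag_list` are hypothetical names, and a short hard-coded stopword set stands in for a full list such as the one NLTK provides.

```python
# Small stand-in stopword set (assumption). NLTK's English list could be used
# instead via nltk.corpus.stopwords.words("english") after nltk.download("stopwords").
STOPWORDS = {"the", "is", "a", "an", "of", "and", "this", "in", "to"}


def remove_stopwords_sketch(texts):
    """Return each text with stopwords dropped; an empty input gives an empty list."""
    cleaned = []
    for text in texts:
        kept = [word for word in text.split() if word.lower() not in STOPWORDS]
        cleaned.append(" ".join(kept))
    return cleaned


# Called before any extraction: the list is empty, so there is nothing to filter.
p_tag_list = []
before = remove_stopwords_sketch(p_tag_list)

# Called after extraction: the list holds text, so stopwords are removed.
p_tag_list = ["This is the first paragraph of the page"]  # stand-in for extractor output
after = remove_stopwords_sketch(p_tag_list)
```

In this sketch the early call does not fail; it simply returns an empty result, which is why the extraction step has to come first.
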