GitHub - Mr-Nnobody/Crawling

A web crawler that crawls university domains to collect useful information about each university.

NB: This crawler is the first step of a bigger project, AI for education. The script here is for data gathering only.

WORK FLOW
-- Reads the university domains from the UK_Universities file.
-- Grabs 300 URLs from within each domain, ensuring that no link is broken and that every link belongs to the domain.
-- Downloads the text data and page titles from all 300 URLs as txt files while logging activities.
-- Uses multithreading to perform these tasks.
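The link-gathering and multithreading steps of the workflow could look roughly like the sketch below. It is not the code in bot.thread.py; the names `in_domain_links`, `crawl`, and `fetch` are hypothetical, and `fetch` is assumed to be any callable that returns a page's HTML as a string, or `None` when the link is broken.

```python
from concurrent.futures import ThreadPoolExecutor
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse

MAX_URLS = 300  # per-domain limit described in the workflow


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def in_domain_links(html, page_url, domain):
    """Return absolute, de-duplicated http(s) links that belong to `domain`."""
    parser = LinkParser()
    parser.feed(html)
    seen, links = set(), []
    for href in parser.hrefs:
        # Resolve relative links and drop "#fragment" parts.
        url, _fragment = urldefrag(urljoin(page_url, href))
        parts = urlparse(url)
        host = parts.hostname or ""
        if parts.scheme not in ("http", "https"):
            continue  # skips mailto:, javascript:, etc.
        if (host == domain or host.endswith("." + domain)) and url not in seen:
            seen.add(url)
            links.append(url)
    return links


def crawl(start_url, domain, fetch, max_urls=MAX_URLS, workers=8):
    """Breadth-first crawl that fetches each batch of URLs in a thread pool."""
    collected, queued, frontier = {}, {start_url}, [start_url]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while frontier and len(collected) < max_urls:
            # Never fetch more pages than are still needed.
            batch = frontier[: max_urls - len(collected)]
            frontier = frontier[len(batch):]
            for url, html in zip(batch, pool.map(fetch, batch)):
                if html is None:
                    continue  # broken link: leave it out of the results
                collected[url] = html
                for link in in_domain_links(html, url, domain):
                    if link not in queued:
                        queued.add(link)
                        frontier.append(link)
    return collected
```

In real use, `fetch` might wrap `urllib.request.urlopen` and return `None` on any error or non-200 response. Passing it in as a parameter keeps the crawl logic testable without network access.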

UK_Universities
-- File contains a list of universities (university domains) across the United Kingdom.
-- This list can be changed, or another list passed into the crawler.
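Because the list can be swapped for another, loading it can be kept separate from the crawl itself. The sketch below assumes one domain (or URL) per line, which is a guess about the file layout and not something this README confirms; `load_domains` is a hypothetical helper name.

```python
from pathlib import Path
from urllib.parse import urlparse


def load_domains(path):
    """Read one domain per line, skipping blanks and '#' comments (assumed format)."""
    domains = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        entry = line.strip()
        if not entry or entry.startswith("#"):
            continue
        # Accept either a bare domain ("ox.ac.uk") or a full URL.
        host = urlparse(entry if "://" in entry else "//" + entry).hostname
        if host and host not in domains:
            domains.append(host)
    return domains
```

A call such as `load_domains("UK_Universities")` would then return the host names in file order, with duplicates removed.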

OUTPUT
-- Downloaded data is stored as text files in folders.
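One way to produce this kind of output is sketched below, assuming one folder per domain and one txt file per page, with the page title on the first line. The folder layout, the file naming, and the names `TextExtractor` and `save_page` are illustrative assumptions, not the repository's actual scheme.

```python
import re
from html.parser import HTMLParser
from pathlib import Path


class TextExtractor(HTMLParser):
    """Collects the page title and visible text, ignoring script/style content."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.title, self.chunks = "", []
        self._stack = []  # open <title>/<script>/<style> tags

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP or tag == "title":
            self._stack.append(tag)

    def handle_endtag(self, tag):
        if self._stack and self._stack[-1] == tag:
            self._stack.pop()

    def handle_data(self, data):
        current = self._stack[-1] if self._stack else None
        if current == "title":
            self.title += data.strip()
        elif current is None and data.strip():
            self.chunks.append(data.strip())


def save_page(html, url, domain, out_dir="output"):
    """Write the title and text of one page to <out_dir>/<domain>/<name>.txt."""
    parser = TextExtractor()
    parser.feed(html)
    folder = Path(out_dir) / domain
    folder.mkdir(parents=True, exist_ok=True)
    # Build a filesystem-safe file name from the URL (hypothetical scheme).
    name = re.sub(r"[^A-Za-z0-9]+", "_", url).strip("_")[:100] + ".txt"
    path = folder / name
    path.write_text(parser.title + "\n\n" + "\n".join(parser.chunks), encoding="utf-8")
    return path
```

The function returns the path it wrote, so a caller can log each saved file with the standard `logging` module.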

FILES (16 commits)
-- .devcontainer
-- .gitignore
-- Links-Titles.csv
-- README.md
-- UK_Universities
-- bot.thread.py
-- requirements.txt
