Web scraping is a simple Selenium project to scrape data from a local provincial government web site.
This repo is free to fork or download; it is a Selenium and Beautiful Soup learning project. Make sure to follow all usage conditions posted by the owner of the web site.
- req.py contains python code to extract company information from the corresponding public web site (only returns data from the first page)
- This script controls a Chrome Browser on your local machine
- Save extracted data to CSV, JSON, and Excel files
- Extract owners and managers from detailed information
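The export feature can be sketched with pandas as follows. This is a minimal illustration, not the code from this repo: the function name `save_companies`, the `companies` file stem, and the record fields are hypothetical.

```python
from pathlib import Path

import pandas as pd


def save_companies(records, out_dir, stem="companies", formats=("csv", "json", "xlsx")):
    """Write the scraped records to one file per requested format.

    `records` is a list of dicts, one per company (field names are hypothetical).
    Returns the list of paths that were written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    frame = pd.DataFrame(records)

    written = []
    for fmt in formats:
        target = out_dir / f"{stem}.{fmt}"
        if fmt == "csv":
            frame.to_csv(target, index=False)
        elif fmt == "json":
            # orient="records" produces a JSON array of objects
            frame.to_json(target, orient="records", indent=2)
        elif fmt == "xlsx":
            # pandas relies on openpyxl to write .xlsx files
            frame.to_excel(target, index=False)
        else:
            raise ValueError(f"unsupported format: {fmt}")
        written.append(target)
    return written
```

For example, `save_companies([{"name": "Acme Inc.", "id": "QC-001"}], "output")` would write `companies.csv`, `companies.json`, and `companies.xlsx` to the `output` folder.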
git clone https://github.com/poivronjaune/web_scraping.git : Clone this repo to your workspace
python -m venv env : Create a virtual environment
env\Scripts\Activate : Activate the virtual environment (Windows)
python -m pip install --upgrade pip : Upgrade your pip tool
pip install -r requirements.txt : Install Python packages
.env : Create a .env file and insert the following line: `CHROME_DRIVER_LOCATION = "C:<path>\chromedriver.exe"`. Replace `<path>` with the location of your chromedriver.exe file.
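As a rough sketch of how the `CHROME_DRIVER_LOCATION` setting could be consumed, the code below loads the .env file with python-dotenv and passes the path to Selenium. It assumes Selenium 4 (where the driver path goes through a `Service` object); the function names `get_driver_location` and `create_driver` are hypothetical and may not match the code in this repo.

```python
import os


def get_driver_location() -> str:
    """Return the chromedriver path stored in CHROME_DRIVER_LOCATION."""
    location = os.getenv("CHROME_DRIVER_LOCATION")
    if not location:
        raise RuntimeError("CHROME_DRIVER_LOCATION is not set; check your .env file")
    return location


def create_driver():
    """Start a Chrome browser controlled by Selenium (Selenium 4 style)."""
    # Third-party imports are kept local so get_driver_location() can be
    # tried on its own, without selenium or python-dotenv installed.
    from dotenv import load_dotenv
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    load_dotenv()  # reads .env and adds its entries to os.environ
    service = Service(executable_path=get_driver_location())
    return webdriver.Chrome(service=service)
```

Raising an error early when the variable is missing gives a clearer message than letting Selenium fail on an empty path.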
The selenium package controls a web browser installed on your local machine. Please follow the instructions on the PyPI installation page (https://pypi.org/project/selenium/) to set it up correctly. In short, you need a separate program called "chromedriver.exe" that your app can access.
The ChromeDriver program is produced and distributed by Google.
Once installation is complete, run `python app.py <search_str>`. The script will:
- open a Chrome Web Browser
- enter the `<search_str>` in the web site form
- automatically submit the form
- extract all company information found (loops through all pages available)
- display results on command line
- wait for user to press ENTER then close the controlled browser
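The steps above can be sketched roughly as follows. This is an illustration under assumptions, not the actual contents of app.py: `SEARCH_URL`, `SEARCH_FIELD_NAME`, and `NEXT_LINK_TEXT` are hypothetical placeholders, since the real URL and page markup depend on the target site. Turning each page into company records is left out for the same reason.

```python
import argparse

# Hypothetical values: the real URL, form field name, and pagination link
# depend on the target web site.
SEARCH_URL = "https://example.invalid/search"
SEARCH_FIELD_NAME = "q"
NEXT_LINK_TEXT = "Next"
MAX_PAGES = 50  # safety cap so a stuck "next" link cannot loop forever


def parse_args(argv=None):
    """Read <search_str> from the command line."""
    parser = argparse.ArgumentParser(description="Search companies by name")
    parser.add_argument("search_str", help="text to type into the search form")
    return parser.parse_args(argv)


def collect_pages(driver, search_str):
    """Submit the search and return the HTML source of every result page."""
    # Imported here so parse_args() can be tried without selenium installed.
    from selenium.webdriver.common.by import By

    driver.get(SEARCH_URL)
    field = driver.find_element(By.NAME, SEARCH_FIELD_NAME)
    field.send_keys(search_str)
    field.submit()

    pages = []
    for _ in range(MAX_PAGES):
        pages.append(driver.page_source)
        next_links = driver.find_elements(By.LINK_TEXT, NEXT_LINK_TEXT)
        if not next_links:
            break  # no further result pages
        next_links[0].click()
    return pages


def main(driver, argv=None):
    """Run one search with an already created Selenium driver."""
    args = parse_args(argv)
    try:
        pages = collect_pages(driver, args.search_str)
        print(f"Collected {len(pages)} result page(s)")
        input("Press ENTER to close the browser...")
    finally:
        driver.quit()  # always close the controlled browser
```

On a real site, an explicit wait after each click may be needed before reading `driver.page_source`; the sketch omits it for brevity.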
python-dotenv : Python package to manage environment variables
selenium : Python package to extract data from web sites
beautifulsoup4 : Python package to extract information from HTML web pages
requests : Python package offering a simple HTTP library
lxml : Python package offering an XML processing library
pandas : Python library to manipulate structured data
openpyxl : Python library used by pandas to save to excel