AI Company Information Crawler

This Python-based crawler uses AI to extract company information from websites. It processes a CSV file containing company names and websites, and uses an AI server to extract detailed information about each company.

Features

Asynchronous web crawling for better performance
AI-powered information extraction
Configurable AI server connection
Progress tracking and logging
CSV input/output handling
Error handling and retry mechanisms
User agent rotation for better crawling success

Requirements

Python 3.8 or higher
Windows or Linux operating system (tested on Ubuntu 20.04 LTS)
Access to an AI server (configurable)

Installation

Clone the repository:

git clone <repository-url>
cd ArkAIScrape

Create a virtual environment (recommended):

python -m venv venv
source venv/bin/activate  # On Linux
venv\Scripts\activate     # On Windows

Install dependencies:

pip install -r requirements.txt

Follow the install video

Install Video:

https://youtu.be/LXT6GQ3yXRo

Directory Structure

ArkAIScrape/
├── ai_server.py       # AI Server Configuration
├── run.py             # Main crawler script
├── webapp.py          # Web app script
├── requirements.txt   # Python dependencies
└── output/            # Generated output files

Example usage video

https://youtu.be/5htH5nEfXe4

Usage

Prepare your input CSV file with at least these columns:

- Company Name (one of 'Company', 'Company Name', 'company', 'company_name', 'CompanyName', 'COMPANY')
- Website (one of 'Website', 'website', 'URL', 'url', 'Web Site', 'Company Website', 'WEBSITE')

Run the crawler:

python run.py <your_input_file.csv>

Replace <your_input_file.csv> with the path to your input CSV file. The base name of this file will be used in the output filenames.
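
If you want to check an input file before running the crawler, the header matching described above can be approximated as follows. This is a minimal sketch, not the code in run.py: the function names find_column and read_companies are hypothetical, and only the accepted header names come from this README.

```python
import csv
import io

# Accepted header names, copied from the Usage section of this README.
COMPANY_HEADERS = ["Company", "Company Name", "company", "company_name", "CompanyName", "COMPANY"]
WEBSITE_HEADERS = ["Website", "website", "URL", "url", "Web Site", "Company Website", "WEBSITE"]


def find_column(fieldnames, accepted):
    """Return the first header that matches an accepted name, or None."""
    for name in fieldnames:
        if name.strip() in accepted:
            return name
    return None


def read_companies(csv_file):
    """Yield (company, website) pairs from an open CSV file object.

    Hypothetical helper: raises ValueError when a required column is missing.
    """
    reader = csv.DictReader(csv_file)
    fieldnames = reader.fieldnames or []
    company_col = find_column(fieldnames, COMPANY_HEADERS)
    website_col = find_column(fieldnames, WEBSITE_HEADERS)
    if company_col is None or website_col is None:
        raise ValueError("Input CSV needs a company name column and a website column")
    # Short rows yield None for missing cells, so fall back to an empty string.
    return [
        ((row[company_col] or "").strip(), (row[website_col] or "").strip())
        for row in reader
    ]


if __name__ == "__main__":
    sample = io.StringIO("Company Name,URL\nAcme,https://acme.example\n")
    print(read_companies(sample))
```

Whether the real script trims whitespace or matches headers case-insensitively is not stated here, so treat those details as assumptions.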

The script will:

Verify the AI server connection
Process each company in the input CSV
Generate two new CSVs (success and failure) with additional information
Create logs in the logs directory

Output

After running the script, two CSV files will be generated in the output/ directory:

Success File: Contains records where information was successfully extracted or a website was found via search. Named as yourInputFileName-success-YYYYMMDDHHMMSS.csv.
Failure File: Contains records where the website could not be fetched or AI processing failed. Named as yourInputFileName-failure-YYYYMMDDHHMMSS.csv.
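
The naming pattern above can be reproduced with the standard library, for example to locate the files from another script. This is a sketch under assumptions: the helper name output_paths is hypothetical, and the use of local time (rather than UTC) for the timestamp is a guess, not something this README states.

```python
from datetime import datetime
from pathlib import Path


def output_paths(input_csv, output_dir="output", now=None):
    """Build the success and failure file paths for a given input CSV.

    Hypothetical helper; `now` can be passed in to make the result reproducible.
    """
    now = now or datetime.now()  # assumption: local time
    stamp = now.strftime("%Y%m%d%H%M%S")  # YYYYMMDDHHMMSS
    base = Path(input_csv).stem  # base name of the input file, without extension
    out = Path(output_dir)
    return (
        out / f"{base}-success-{stamp}.csv",
        out / f"{base}-failure-{stamp}.csv",
    )


if __name__ == "__main__":
    success, failure = output_paths("data/leads.csv")
    print(success)
    print(failure)
```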

Both output CSVs will contain the following columns:

Company Name
Website
Phone Number
Street Address
City
State
Zip Code
Facebook Page
Facebook Page Name
Facebook Likes
Facebook About
LinkedIn Page
Public Email
Contact Person
Processing Time (seconds)
Status
Last Updated
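
For downstream tooling, the column layout above can be written with csv.DictWriter. The following is an illustrative sketch only: write_rows is a hypothetical name, and leaving missing fields blank is an assumption about how empty values should look.

```python
import csv
import io

# Column order as listed in the Output section of this README.
OUTPUT_COLUMNS = [
    "Company Name", "Website", "Phone Number", "Street Address", "City",
    "State", "Zip Code", "Facebook Page", "Facebook Page Name",
    "Facebook Likes", "Facebook About", "LinkedIn Page", "Public Email",
    "Contact Person", "Processing Time (seconds)", "Status", "Last Updated",
]


def write_rows(csv_file, rows):
    """Write dict rows using the documented column order.

    Missing keys become empty cells; unknown keys are ignored.
    """
    writer = csv.DictWriter(
        csv_file, fieldnames=OUTPUT_COLUMNS, restval="", extrasaction="ignore"
    )
    writer.writeheader()
    writer.writerows(rows)


if __name__ == "__main__":
    buf = io.StringIO()
    write_rows(buf, [{"Company Name": "Acme", "Status": "Success"}])
    print(buf.getvalue())
```

When writing to a real file, open it with newline="" as the csv module documentation recommends.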

Status Codes

Success: Information successfully extracted
No Website: Company has no website listed
Failed to Fetch: Could not access the website
AI Processing Failed: AI server could not process the data
Error: Other errors occurred during processing
Pending: Initial state
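
When post-processing the output, it can help to treat these status values as an enumeration. The sketch below is an assumption-laden illustration: the Status class and split_by_status function are hypothetical names, and routing every non-Success row to the failure list is a simplification that may not match how run.py splits records.

```python
from enum import Enum


class Status(str, Enum):
    """Status values as listed in this README."""
    SUCCESS = "Success"
    NO_WEBSITE = "No Website"
    FAILED_TO_FETCH = "Failed to Fetch"
    AI_PROCESSING_FAILED = "AI Processing Failed"
    ERROR = "Error"
    PENDING = "Pending"


def split_by_status(rows):
    """Split dict rows into (success, failure) lists by their Status column.

    Assumption: only rows marked Success count as successful.
    """
    success, failure = [], []
    for row in rows:
        if row.get("Status") == Status.SUCCESS.value:
            success.append(row)
        else:
            failure.append(row)
    return success, failure


if __name__ == "__main__":
    rows = [
        {"Company Name": "Acme", "Status": "Success"},
        {"Company Name": "Globex", "Status": "Failed to Fetch"},
    ]
    ok, failed = split_by_status(rows)
    print(len(ok), len(failed))
```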

Contributing

Feel free to submit issues and enhancement requests!

License

[Your chosen license]
