🕸️ Network Topology Mapper (Internet Cartographer)


Map the Internet from your living room. An open-source research tool for analyzing network infrastructure, technological dependencies, and visualizing the topology of the Web using physics-based graphs.


📸 Demo (Visualization)

*Network graph preview (demo page visualization).*

⚡ About the Project

This isn't just a standard crawler. It is a research tool designed with a Master-Worker architecture, allowing you to map the "invisible web" (background `src` dependencies vs. navigation `href` links) with minimal resource usage.

The system is highly optimized to run on IoT devices (e.g., Orange Pi Zero 2W, Raspberry Pi) by utilizing probabilistic data structures (Bloom Filters) and a lightweight SQLite database.

How it works:

  1. Crawl: An autonomous "spider" traverses the web, making intelligent decisions on where to go next.
  2. Analyze: Distinguishes between navigational links (`href`) and infrastructural dependencies (`src` - trackers, CDNs, scripts); see the sketch below.
  3. Map: Builds a connection graph, grouping domains into thematic clusters (e.g., the Wikipedia family, the Google ecosystem, WordPress sites).
  4. Visualize: Generates an interactive HTML file where nodes behave like celestial bodies using physics simulation.
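
A minimal sketch of steps 2 and 3, built on the libraries listed in `requirements.txt` below; the function names and exact parsing rules here are illustrative, not the crawler's actual internals:

```python
import requests
import tldextract
from bs4 import BeautifulSoup

def extract_edges(page_url: str):
    """Split a page's outgoing references into navigation links (href)
    and infrastructure dependencies (src)."""
    html = requests.get(page_url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    hrefs = [a["href"] for a in soup.find_all("a", href=True)]  # navigation
    srcs = [tag["src"] for tag in soup.find_all(src=True)]      # scripts, trackers, CDNs
    return hrefs, srcs

def domain_family(url: str) -> str:
    """Collapse subdomains into one cluster: en.wikipedia.org and
    de.wikipedia.org both belong to the wikipedia.org family."""
    return tldextract.extract(url).registered_domain
```

Grouping by registered domain is what lets the visualizer color an entire family of hosts as a single thematic cluster.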

🚀 Key Features

  • High Performance: Uses Bloom Filters to store millions of visited URLs in just a few MB of RAM (see the sketch after this list).
  • Distributed Architecture:
    • Worker (Orange Pi): Silent data gathering 24/7 (power consumption ~2 W).
    • Master (PC): Rendering complex visualizations from the collected data.
  • Smart Crawler:
    • Domain Throttling: Prevents getting stuck in a single domain loop (e.g., Wikipedia rabbit hole).
    • Binary Skip: Detects and ignores PDFs, ZIPs, EXEs via HTTP headers to save bandwidth.
    • Auto-Save & Resume: Full state persistence. Resumes exactly where it left off after a restart or power failure.
  • Advanced Visualization (PyVis):
    • Clustering: Automatic coloring and grouping of domain "families".
    • Physics: Uses ForceAtlas2Based simulation to create organic "archipelagos" of the web.
    • Inspection: Node IDs, directional arrows, and type filtering (Links vs. Resources).
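
Two of these features fit in a few lines. The sketch below shows a Bloom-filter visited-set and a header-based binary skip using `pybloom-live` and `requests` from the requirements; the capacity, error rate, and MIME list are illustrative guesses, not the project's actual values:

```python
import requests
from pybloom_live import ScalableBloomFilter

# Probabilistic visited-set: millions of URLs in a few MB of RAM, at the
# price of a small false-positive rate (a rare unvisited URL may be
# wrongly reported as already seen).
visited = ScalableBloomFilter(initial_capacity=100_000, error_rate=0.001)

BINARY_TYPES = ("application/pdf", "application/zip",
                "application/octet-stream", "application/x-msdownload")

def should_visit(url: str) -> bool:
    """True the first time a URL is seen (probabilistically)."""
    if url in visited:
        return False
    visited.add(url)
    return True

def is_binary(url: str) -> bool:
    """HEAD request only: inspect Content-Type without downloading the body."""
    try:
        resp = requests.head(url, timeout=5, allow_redirects=True)
    except requests.RequestException:
        return True  # unreachable; skip it either way
    return resp.headers.get("Content-Type", "").startswith(BINARY_TYPES)
```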

🛠️ Installation

Requires Python 3.8+.

```bash
# 1. Clone the repository
git clone https://github.com/your-username/network-topology-mapper.git
cd network-topology-mapper

# 2. Install dependencies
pip install -r requirements.txt
```

`requirements.txt`:

```text
requests
beautifulsoup4
pybloom-live
tldextract
pyvis
```

🕹️ Usage

1. Start Mapping (Crawler)

Run the data collector. By default, it collects technical dependencies (SRC) to build an infrastructure map.

```bash
python crawler.py
```

Creates/updates: `network_map.db`, `crawler_queue.json`, `crawler_visited.bin`.

Tip: You can safely stop the process using Ctrl+C. The state will be saved automatically.
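
The resume behavior boils down to serializing the frontier and the visited filter on exit and reloading them on start. A sketch, assuming the queue is a plain list and the filter comes from `pybloom-live` (whose filters support `tofile`/`fromfile`); the seed URL is a placeholder:

```python
import json
from pybloom_live import ScalableBloomFilter

def save_state(queue, visited):
    """Persist the frontier and the visited-set for a clean resume."""
    with open("crawler_queue.json", "w") as f:
        json.dump(queue, f)
    with open("crawler_visited.bin", "wb") as f:
        visited.tofile(f)

def load_state():
    """Reload saved state, or start fresh on the first run."""
    try:
        with open("crawler_queue.json") as f:
            queue = json.load(f)
        with open("crawler_visited.bin", "rb") as f:
            visited = ScalableBloomFilter.fromfile(f)
    except FileNotFoundError:
        queue, visited = ["https://example.org"], ScalableBloomFilter()
    return queue, visited

queue, visited = load_state()
try:
    pass  # ... crawl loop ...
except KeyboardInterrupt:
    save_state(queue, visited)  # Ctrl+C lands here; state hits disk first
```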

2. Database Cleaning (Optional)

If you have an old database with mixed data and want to keep only technical dependencies:

```bash
python cleaner.py
```

3. Generate Map (Visualizer)

Run this on a more powerful machine to process the SQLite data into an interactive graph.

```bash
python visualizer.py
```

Generates an .html file and automatically opens it in your default browser.


⚙️ Configuration

In visualizer.py, you can tweak rendering parameters:

```python
SHOW_LINKS = False       # Show navigation links (Blue, solid lines)
SHOW_RESOURCES = True    # Show scripts/trackers (Red, dashed lines)
MAX_NODES = 400          # Node limit (to prevent browser lag)
MIN_CONNECTIONS = 2      # Noise filter (hides single/isolated nodes)
```

In crawler.py, you can adjust the spider's behavior:

```python
MAX_LINKS_PER_ROOT_DOMAIN = 50  # Depth limit per domain family
BATCH_SIZE = 20                 # Disk write frequency
```
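
As a rough illustration of how such flags could drive PyVis: the snippet below uses the real `force_atlas_2based` helper from the `pyvis` API, but the nodes, the edge, and the styling are stand-ins for what `visualizer.py` reads out of `network_map.db`:

```python
from pyvis.network import Network

SHOW_RESOURCES = True  # mirrors the flag above

net = Network(height="100vh", width="100%", directed=True)

# ForceAtlas2-based physics pulls densely linked domain families
# together into the organic "archipelagos" mentioned above.
net.force_atlas_2based(gravity=-50, central_gravity=0.01,
                       spring_length=100, spring_strength=0.08)

# Placeholder nodes; real IDs and hostnames come from the hosts table.
net.add_node(1, label="example.org", group="example.org")
net.add_node(2, label="cdn.example.net", group="example.net")

if SHOW_RESOURCES:
    # Type 2 (src) dependency: dashed red edge, per the schema below.
    net.add_edge(1, 2, color="red", dashes=True)

net.show("network_map.html")  # writes the HTML and opens the browser
```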

🧠 Data Architecture

The project uses an optimized SQLite schema:

| Table   | Description |
|---------|-------------|
| `hosts` | Domain dictionary (ID <-> Hostname). Unique entries. |
| `edges` | Lightweight relationship table (`source_id`, `target_id`, `type`). |

Edge types:

  • Type 1: Navigation link (`href`), drawn as a solid blue line.
  • Type 2: Resource/dependency (`src`), drawn as a dashed red line.
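
The two-table layout might look like this in SQLite; the DDL below follows the columns named above but is an assumed reconstruction, not the project's actual schema:

```python
import sqlite3

conn = sqlite3.connect("network_map.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS hosts (
    id       INTEGER PRIMARY KEY,
    hostname TEXT NOT NULL UNIQUE      -- domain dictionary: ID <-> hostname
);
CREATE TABLE IF NOT EXISTS edges (
    source_id INTEGER NOT NULL REFERENCES hosts(id),
    target_id INTEGER NOT NULL REFERENCES hosts(id),
    type      INTEGER NOT NULL         -- 1 = href (link), 2 = src (resource)
);
""")
conn.commit()
```

Storing integer IDs in `edges` instead of repeating hostnames keeps the relationship table small even when the graph grows to millions of edges.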

🔮 Roadmap

  • Core Crawler & Visualizer
  • SQLite integration & RAM optimization
  • Domain clustering & Graph physics
  • Web panel for real-time statistics
  • Technology detection (e.g., "This site runs on WordPress")
  • Ranking system for "Most intrusive tracking domains"

🤝 Contributing

Pull requests are welcome! If you have ideas for optimizing the crawling algorithm or improving the D3.js/PyVis visualization, feel free to contribute.

📜 License

The project is available under the MIT License. Map responsibly. Do not use this tool for DDoS attacks.
