Web interface and tool to call various web and data crawlers.
Make sure you have the proper system dependencies:
- Install neo4j
- Install Ruby on Rails
- On Debian, install dependencies:
sudo apt-get install libcurl3 libcurl3-gnutls libcurl4-openssl-dev libmagickcore-dev libmagickwand-dev
- Get the DocumentLoader code
git clone https://github.com/TransparencyToolkit/DocumentLoader
- Install Ruby dependencies from within the repo
cd DocumentLoader
bundle install
- Download and install CrawlerManager
- Download and install LookingGlass and its dependencies
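Before moving on, it can help to confirm that the tools these steps rely on are actually on the PATH. The following is a minimal sketch and not part of DocumentLoader itself: the function name `missing_commands` is hypothetical, and the list of commands checked (`git`, `ruby`, `bundle`, `rails`, `curl`) is an assumption based on the commands used in this README.

```shell
#!/bin/sh
# Print the name of every command given as an argument that is not on the PATH.
# missing_commands is a hypothetical helper, not part of DocumentLoader.
missing_commands() {
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      printf '%s\n' "$cmd"
    fi
  done
}

# Example: report which of the tools used in this README are still missing.
# The list below is an assumption; adjust it to your setup.
missing=$(missing_commands git ruby bundle rails curl)
if [ -n "$missing" ]; then
  echo "Missing commands:"
  echo "$missing"
else
  echo "All checked commands were found."
fi
```

An empty result only means the commands exist; it does not check their versions.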
By default, document conversion (pdf, docs, etc.) is handled by GiveMeText. This approach sends your documents over the open internet unencrypted. DO NOT USE THIS with sensitive documents; instead, install Tika and Tesseract.
- Install the Java package manager
apt-get install default-jdk maven unzip
- Download Tika from GitHub and unzip it
mkdir install
curl https://codeload.github.com/apache/tika/zip/trunk -o trunk.zip
unzip trunk.zip
- Go into the tika-trunk directory created during the last step and install
cd tika-trunk
mvn -DskipTests=true clean install
cp tika-server/target/tika-server-1.*-SNAPSHOT.jar /srv/
- Now install Tesseract with the following
apt-get -y -q install tesseract-ocr tesseract-ocr-deu tesseract-ocr-eng
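The `cp` step in the Tika install relies on a shell wildcard to pick up the jar that Maven built. As a hedged sketch, the lookup can be made explicit so that a failed build is noticed before copying. The helper name `find_tika_jar` is hypothetical; the `tika-server/target` path and the `tika-server-1.*-SNAPSHOT.jar` pattern are taken from the commands in this README.

```shell
#!/bin/sh
# Print the path of the first tika-server snapshot jar under the given
# Tika source directory, or return 1 if the build did not produce one.
# find_tika_jar is a hypothetical helper name.
find_tika_jar() {
  for jar in "$1"/tika-server/target/tika-server-1.*-SNAPSHOT.jar; do
    # If the pattern matched nothing, the loop sees the literal pattern,
    # which is not an existing file, so the check below skips it.
    if [ -f "$jar" ]; then
      printf '%s\n' "$jar"
      return 0
    fi
  done
  return 1
}

# Example usage, run from the directory that contains tika-trunk:
if jar=$(find_tika_jar tika-trunk); then
  echo "Found $jar"
  # cp "$jar" /srv/
else
  echo "No tika-server jar found; did the Maven build succeed?"
fi
```

The copy itself is left commented out so the sketch has no side effects until you have reviewed the path it prints.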
Start DocumentLoader
Note: make sure that neo4j is not running when starting DocumentLoader
- Start CrawlerManager in that directory
rails server -p 9506
Start Tika
- Start Tika with
java -jar tika-server/target/tika-server-*.jar
- If you need Tika to have a custom URL or port, append
--host=localhost --port=1234
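To show how the optional flags attach to the start command, here is a small sketch that assembles the command line. The function name `tika_command` is hypothetical; the `--host` and `--port` flags and the jar path are the ones given in this README, and the values are examples only.

```shell
#!/bin/sh
# Build the Tika start command, adding --host/--port only when provided.
# tika_command is a hypothetical helper name.
tika_command() {
  jar=$1
  host=${2:-}
  port=${3:-}
  cmd="java -jar $jar"
  if [ -n "$host" ]; then
    cmd="$cmd --host=$host"
  fi
  if [ -n "$port" ]; then
    cmd="$cmd --port=$port"
  fi
  printf '%s\n' "$cmd"
}

# Print the command rather than running it, so it can be reviewed first.
tika_command 'tika-server/target/tika-server-*.jar' localhost 1234
```

Calling it with only the jar path yields the plain start command, which matches the default case described in this section.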
Start Harvester
- Start neo4j
rake neo4j:start
- Then run Harvester with
rails server -p 3333
- Go to 0.0.0.0:3000 in a browser
Start LookingGlass
- Start elasticsearch as it was installed on your server
- Start LookingGlass from that directory
rails server
Add CAPTCHA Solving
Crawling some sites through tools like Tor or VPNs sometimes requires solving CAPTCHAs. Harvester can support this; you just need to do the following:
..........To be filled out..................
When you're done running DocumentLoader
- From within the Harvester repo, stop neo4j with
rake neo4j:stop