Our gold standard dataset for figure extraction from scanned ETDs can be downloaded using this link.
The original readme (the readme of the repository this fork is based on) can be found here.
Use the requirements.txt from the repository's root to create your python environment.
Another quick way to set up the environment using Anaconda is:
```bash
ENV_NAME=deepfigures_3 && \
  conda remove --name $ENV_NAME --all -y && \
  conda create --name $ENV_NAME python=3.6 -y && \
  source activate $ENV_NAME && \
  pip install -r /home/sampanna/deepfigures-open/requirements.txt --no-cache-dir

cd /home/sampanna/deepfigures-open/vendor/tensorboxresnet/tensorboxresnet/utils && make
```
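As a quick sanity check of the new environment (a minimal sketch; the exact versions come from requirements.txt, which pins scipy to 1.1.0), you can confirm that the core dependencies import:

```bash
# Verify that the pinned dependencies are importable in the new environment.
python -c "import tensorflow as tf; print(tf.__version__)"
python -c "import scipy; print(scipy.__version__)"   # expected: 1.1.0
```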
Next, install the required LaTeX packages. If you have sudo access, run:

```bash
sudo apt-get install texlive-latex-base \
    texlive-fonts-recommended \
    texlive-fonts-extra \
    texlive-latex-extra \
    texlive-font-utils
```

If you do not have sudo access, run:

```bash
wget http://mirror.ctan.org/systems/texlive/tlnet/install-tl-unx.tar.gz
tar -xvf install-tl-unx.tar.gz > untar.log
cd `head -1 untar.log`
./install-tl -profile texlive.profile
```
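The install-tl step expects a texlive.profile file in the working directory. Its exact contents depend on your setup; a minimal, illustrative sketch (the scheme and paths are assumptions, adjust them to your home directory) could look like:

```text
selected_scheme scheme-basic
TEXDIR /home/sampanna/texlive/2019
TEXMFLOCAL /home/sampanna/texlive/texmf-local
TEXMFSYSCONFIG /home/sampanna/texlive/2019/texmf-config
TEXMFSYSVAR /home/sampanna/texlive/2019/texmf-var
TEXMFCONFIG /home/sampanna/.texlive2019/texmf-config
TEXMFHOME /home/sampanna/texmf
TEXMFVAR /home/sampanna/.texlive2019/texmf-var
tlpdbopt_install_docfiles 0
tlpdbopt_install_srcfiles 0
```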
If you need to download data from AWS, please add your credentials to the credentials file. A sample of this file should look like:

```ini
[default]
aws_access_key_id=dummy_sample_credentials
aws_secret_access_key=dummy_sample_credentials_dummy_sample_credentials
aws_session_token=dummy_sample_credentials_dummy_sample_credentials_dummy_sample_credentials_dummy_sample_credentials_dummy_sample_credentials_dummy_sample_credentials_dummy_sample_credentials_
```

Also, don't forget to set the ARXIV_DATA_TMP_DIR and ARXIV_DATA_OUTPUT_DIR variables as mentioned in the README.md.
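For reference, in the upstream deepfigures-open code these variables live in deepfigures/settings.py; a sketch of how they might be set (the paths are illustrative and should point at large, writable directories) is:

```python
# deepfigures/settings.py (illustrative values)
ARXIV_DATA_TMP_DIR = '/home/sampanna/deepfigures-results/arxiv_data_temp'
ARXIV_DATA_OUTPUT_DIR = '/home/sampanna/deepfigures-results/arxiv_data_output'
```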
```bash
sudo docker run --gpus all -it \
  --volume /home/sampanna/deepfigures-results:/work/host-output \
  --volume /home/sampanna/deepfigures-results/31219:/work/host-input \
  sampyash/vt_cs_6604_digital_libraries:deepfigures_gpu_0.0.5 /bin/bash
```

This command will pull the sampyash/vt_cs_6604_digital_libraries:deepfigures_gpu_0.0.5 docker image from Docker Hub, run it, and give us bash access inside the container.
If this image is already pulled, this command will simply run it.
sampyash/vt_cs_6604_digital_libraries:deepfigures_cpu_0.0.5 is also available for CPU use-cases.
Note: Please check the latest version before pulling.
In the above command, the first '--volume' argument connects the local output directory with the docker output directory. The second '--volume' argument does the same for the input directory. Please modify the local file paths as per your local host system. More info here.
Further, the --gpus all option tells docker to use all the GPUs available on the system.
Try running nvidia-smi once inside the container to check if GPUs are accessible.
The --gpus all option is not required when running the CPU docker image.
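For the CPU image, the same command should work without the GPU flag; for example (host paths as above, adjust them to your system):

```bash
sudo docker run -it \
  --volume /home/sampanna/deepfigures-results:/work/host-output \
  --volume /home/sampanna/deepfigures-results/31219:/work/host-input \
  sampyash/vt_cs_6604_digital_libraries:deepfigures_cpu_0.0.5 /bin/bash
```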
```bash
docker run --gpus all -it \
  --volume /home/sampanna/deepfigures-results:/work/host-output \
  --volume /home/sampanna/deepfigures-results:/work/host-input \
  sampyash/vt_cs_6604_digital_libraries:deepfigures_gpu_0.0.6 \
  python deepfigures/data_generation/arxiv_pipeline.py
```

This command will run the deepfigures/data_generation/arxiv_pipeline.py script from the source code, which will:
- Download data from AWS's requester-pays buckets using the credentials set above.
- Cache this data in the directory /work/host-output/download_cache.
- Unzip the data and generate the relevant training data.
```bash
docker run --gpus all -it \
  --volume /home/sampanna/deepfigures-results:/work/host-output \
  --volume /home/sampanna/deepfigures-results:/work/host-input \
  sampyash/vt_cs_6604_digital_libraries:deepfigures_cpu_0.0.6 \
  python figure_json_transformer.py
```

```bash
docker run --gpus all -it \
  --volume /home/sampanna/deepfigures-results:/work/host-output \
  --volume /home/sampanna/deepfigures-results:/work/host-input \
  sampyash/vt_cs_6604_digital_libraries:deepfigures_cpu_0.0.6 \
  python figure_boundaries_train_test_split.py
```

The data generated by arxiv_pipeline.py is not in the format needed by TensorBox for training. Hence, the first command transforms it. The second command splits the data into train and test sets.
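For context, TensorBox consumes a JSON list of images and their figure bounding boxes. Roughly (the field names follow the standard TensorBox convention; the values below are purely illustrative):

```json
[
  {
    "image_path": "figures/1801.01234-page03.png",
    "rects": [
      {"x1": 102.0, "y1": 388.5, "x2": 471.2, "y2": 652.0}
    ]
  }
]
```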
```bash
python manage.py train /work/host-input/weights/hypes.json /home/sampanna/deepfigures-results /home/sampanna/deepfigures-results
```

Here, the Python environment created in one of the steps above should be activated.
- The first argument to `manage.py` is the `train` command.
- `/work/host-input/weights/hypes.json` is the path to the hyper-parameters file as visible from inside the docker container.
- `/home/sampanna/deepfigures-results` is the host's input directory for the container. This will be linked to `/work/host-input`.
- `/home/sampanna/deepfigures-results` is the host's output directory for the container. This will be linked to `/work/host-output`.
```bash
python manage.py detectfigures '/home/sampanna/workspace/bdts2/deepfigures-results' '/home/sampanna/workspace/bdts2/deepfigures-results/LD5655.V855_1935.C555.pdf'
```

Here, the Python environment created in one of the steps above should be activated.
- The first argument to `manage.py` is the `detectfigures` command.
- `/home/sampanna/workspace/bdts2/deepfigures-results` is the host path to the output directory to put the detection results in.
- `/home/sampanna/workspace/bdts2/deepfigures-results/LD5655.V855_1935.C555.pdf` is the host path to the PDF file to be processed.
Instructions to run on ARC using Singularity:
Docker is not available on Virginia Tech's Advanced Research Computing (ARC) HPC cluster. However, Singularity can be used to run pre-built Docker images on ARC.
Each time you SSH into either the login node or any of the compute nodes, please load the Singularity module using:

```bash
module load singularity/3.3.0
```

Make the directory required for Singularity:

```bash
mkdir /work/cascades/${USER}/singularity
```
```bash
singularity pull docker://sampyash/vt_cs_6604_digital_libraries:deepfigures_gpu_0.0.6
```

- This command will pull the given image from Docker Hub.
- This command needs internet access and hence needs to be run on the login node.
- This command will take some time.
```bash
singularity run --nv \
  -B /home/sampanna/deepfigures-results:/work/host-output \
  -B /home/sampanna/deepfigures-results:/work/host-input \
  /work/cascades/sampanna/singularity/vt_cs_6604_digital_libraries_deepfigures_cpu_0.0.6.sif /bin/bash
```

- This command will run the pulled Docker image and give the user shell access inside the container.
- The `--nv` flag is analogous to the `--gpus all` option of Docker.
- The `-B` flag is analogous to the `--volume` option of Docker.
Running the remaining commands with Singularity is straightforward and is left as an exercise to the reader; a sketch of one such command is shown below.
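For instance, the data-generation step from above might be run under Singularity like this (assuming the GPU image was pulled into the same directory, yielding a .sif file with the name below):

```bash
singularity run --nv \
  -B /home/sampanna/deepfigures-results:/work/host-output \
  -B /home/sampanna/deepfigures-results:/work/host-input \
  /work/cascades/sampanna/singularity/vt_cs_6604_digital_libraries_deepfigures_gpu_0.0.6.sif \
  python deepfigures/data_generation/arxiv_pipeline.py
```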
The master branch of the original repository was not working for me, so I debugged it and made this fork. The following changes were made.
The Dockerfiles (both CPU and GPU) were not building; there was an error related to 'libjasper1 libjasper-dev not found'.
Hence, the corresponding changes were made to the Dockerfiles to make them buildable.
Have also pushed the built images to Docker Hub. Link here.
You can simply fetch the two images and re-tag them as deepfigures-cpu:0.0.6 and deepfigures-gpu:0.0.6.
Further, added the functionality to read AWS credentials from the ./credentials file (see the sketch below).
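A minimal sketch of how such credentials could be read (the actual helper name and its location in this repository may differ) is:

```python
import configparser

def load_aws_credentials(path='./credentials'):
    """Read AWS credentials from an ini-style credentials file (illustrative helper)."""
    config = configparser.ConfigParser()
    config.read(path)
    section = config['default']
    return {
        'aws_access_key_id': section['aws_access_key_id'],
        'aws_secret_access_key': section['aws_secret_access_key'],
        'aws_session_token': section.get('aws_session_token'),
    }
```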
The pdffigures JAR has been built and committed to the bin folder in this repository. Hence, you should not need to build it. Please have Java 8 on your system to make it work.
Version 1.3.0 of scipy does not have imread and imsave in scipy.misc. As a result, the import statement `from scipy.misc import imread, imsave` in detections.py was not working. Hence, the version of scipy was downgraded to 1.1.0 in requirements.txt, and the import worked as a result.
Additionally, `scipy.optimize` is not imported by `import scipy` alone, so it was imported separately using `from scipy import optimize`, after which calls through `scipy.optimize` work.
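A sketch of the resulting imports (assuming scipy==1.1.0 and Pillow are installed in the environment) looks like:

```python
# With scipy pinned to 1.1.0, imread/imsave are still available in scipy.misc;
# they were removed from later releases, which is why scipy 1.3.0 broke the import.
from scipy.misc import imread, imsave

# `import scipy` alone does not import the optimize submodule, so import it explicitly;
# afterwards both `optimize.<fn>` and `scipy.optimize.<fn>` resolve.
import scipy
from scipy import optimize

# Example: a scipy.optimize call (the Hungarian algorithm) now works.
cost = [[4, 1], [2, 3]]
row_ind, col_ind = scipy.optimize.linear_sum_assignment(cost)
```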