This repo tests, in an isolated environment, the extraction of data from various file types in preparation for text and image embeddings for AI-Studio.
For testing purposes, it also serves as a CLI tool that transforms data into Markdown whenever possible:
Usage: AIStudioDataPreparation.exe [OPTIONS]
Options:
-p <PATH>, --path <PATH> The path to the file containing the data to be extracted.
-h, --help Print help information.
Examples:
AIStudioDataPreparation.exe --path "/absolute/path/to/your/file"
AIStudioDataPreparation.exe -p "/absolute/path/to/your/file"
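The option handling above can be sketched with the standard library alone. This is a minimal illustration, not the parser this repo actually uses: the names `Cli` and `parse_args` are hypothetical, and treating `--path` as required is an assumption, since the usage text only lists it under `[OPTIONS]`.

```rust
use std::path::PathBuf;

/// Hypothetical result type: either a file to extract or a help request.
#[derive(Debug, PartialEq)]
enum Cli {
    Extract(PathBuf),
    Help,
}

/// Hypothetical parser for the two documented options.
/// `args` must not include the program name.
fn parse_args<I: IntoIterator<Item = String>>(args: I) -> Result<Cli, String> {
    let mut args = args.into_iter();
    let mut path = None;

    while let Some(arg) = args.next() {
        match arg.as_str() {
            // -h / --help wins over everything else.
            "-h" | "--help" => return Ok(Cli::Help),
            // -p / --path consumes the next argument as its value.
            "-p" | "--path" => {
                let value = args
                    .next()
                    .ok_or_else(|| format!("missing value for {arg}"))?;
                path = Some(PathBuf::from(value));
            }
            other => return Err(format!("unexpected argument: {other}")),
        }
    }

    // Assumption: a path is mandatory unless help was requested.
    path.map(Cli::Extract)
        .ok_or_else(|| "missing required option --path <PATH>".to_string())
}
```

In a binary, this would typically be called as `parse_args(std::env::args().skip(1))`, so that the executable name is dropped before parsing.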
Extracting PDF content relies on Pdfium, so Pdfium must be available on your local machine.
The easiest way to set this up is to download the latest binaries from a project such as bblanchon/pdfium-binaries and place pdfium.dll next to the Rust executable.
For a Cargo build, that directory is target/release/ or target/debug/.
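To verify that the library sits where the executable expects it, a small check using only the standard library can help. This is a sketch under assumptions: the helper names `pdfium_path_next_to` and `check_pdfium` are hypothetical and not part of this repo, and the file name pdfium.dll is taken from the instructions above.

```rust
use std::env;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical helper: the path where pdfium.dll is expected,
/// i.e. in the same directory as the given executable.
fn pdfium_path_next_to(exe: &Path) -> Option<PathBuf> {
    exe.parent().map(|dir| dir.join("pdfium.dll"))
}

/// Hypothetical helper: returns the expected pdfium.dll location for the
/// currently running binary and whether a file exists there.
fn check_pdfium() -> io::Result<(PathBuf, bool)> {
    let exe = env::current_exe()?;
    let candidate = pdfium_path_next_to(&exe).ok_or_else(|| {
        io::Error::new(io::ErrorKind::NotFound, "executable has no parent directory")
    })?;
    let found = candidate.is_file();
    Ok((candidate, found))
}
```

Calling `check_pdfium()` at startup and printing the returned path when the flag is `false` gives a clearer message than a failed library load, since it names the exact directory (for example target/debug/) where the file is missing.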
More information about binding the library can be found here.