Ankabut

An automated LLM prompting framework for:

preprocessing journal articles into a uniform structure;
extracting content and metadata from preprocessed articles;
indexing and searching with LLM embeddings;
generating new article structure plans;
generating article content for a specified plan.

Preprocessing journal articles

python . "<src>.rdf" "<dst dir>"

"<src>.rdf" - path to source Zotero RDF file containing links to PDF attachments
"<dst dir>" - destination directory where preprocessed article files shall be stored

Parses Zotero RDF file containing article citations.
Parses linked PDF files via marker-pdf.
Stores parsed article content (without images) in the following Markdown structure:

<article id (DOI, URL, etc)>
...
<article id (DOI, URL, etc)>

<year of publication> <journal name> <issue> <volume> etc

<author name> <author title> <institution> etc
...
<author name> <author title> <institution> etc

# <article title>

Tags: <from Zotero> ...

[Abstract. ...]

[Keywords: ...]

[## <section title>]

<section content>
...
<section content>

[
## (References|Literature|etc)

[1. ] <Author>, <Title> etc

...

[N. ] <Author>, <Title> etc
]
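The layout above can be illustrated with a minimal rendering sketch. It is not the project's own code: the function name `render_article`, the dictionary keys (`publication`, `tags`, `references`, and so on), and the separators used for tags and keywords are all assumptions made for illustration.

```python
def render_article(article: dict) -> str:
    """Render one parsed article into the Markdown layout described above.

    All dictionary keys are hypothetical; the real field names may differ.
    """
    lines = list(article["ids"])                     # one identifier per line
    lines += ["", article["publication"], ""]        # year, journal, issue, ...
    lines += list(article["authors"])                # one author per line
    lines += ["", f"# {article['title']}", ""]
    # Assumption: tags are separated by spaces.
    lines += ["Tags: " + " ".join(article.get("tags", [])), ""]
    if article.get("abstract"):                      # optional block
        lines += [f"Abstract. {article['abstract']}", ""]
    if article.get("keywords"):                      # optional block
        # Assumption: keywords are separated by commas.
        lines += ["Keywords: " + ", ".join(article["keywords"]), ""]
    for section in article.get("sections", []):
        if section.get("title"):                     # section titles are optional
            lines += [f"## {section['title']}", ""]
        for paragraph in section["content"]:
            lines += [paragraph, ""]
    if article.get("references"):                    # optional reference list
        lines += ["## References", ""]
        for number, reference in enumerate(article["references"], start=1):
            lines += [f"{number}. {reference}", ""]
    return "\n".join(lines).rstrip() + "\n"
```

Optional parts (shown in square brackets in the layout) are simply skipped when the corresponding key is missing or empty.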

Extracting content and metadata from articles

python . "<src dir>" "<dst>.json"

"<src dir>" - source directory containing Markdown article files
"<dst>.json" - destination JSON file where article content and metadata shall be stored

Parses articles in Markdown format described above.
Stores all articles in the following JSON format:

{
    "articles": [
        {
            "ids": [ "DOI", "URL", "etc" ],
            "year": "2025",
            "authors": [
                "Author Name, Title, Institution", // ...
            ],
            "title": "<article title>",
            "abstract": "<...>",
            "keywords": [ "keyword", /* ... */ ],
            "sections": [
                {
                    [ "title": "<section title>", ]
                    "content": [
                        "<paragraph>", // ...
                    ]
                }, // ...
            ],
            "citations": [
                "Author, Title ...", // ...
            ]
        }
    ],
    "authors": {
        "<author name>": [
            "DOI", "DOI", "DOI", // ...
        ]
    },
    "keywords": {
        "keyword": 1, // ... occurrence count
    },
    "citations": [
        "Author, Title ...", // ...
    ],
    "paragraph_ids": [
        "<DOI|URL|etc>:<section index>:<paragraph index>", // ...
    ]
}
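The top-level `authors`, `keywords`, `citations` and `paragraph_ids` entries can be derived from the `articles` list. The sketch below shows one way to do that. It assumes that the first entry of `ids` is the primary identifier, that the author name is the text before the first comma, that indices are zero-based, and that duplicate citations are dropped; none of this is confirmed by the format description. The function name `build_indices` is hypothetical.

```python
from collections import Counter, defaultdict


def build_indices(articles: list[dict]) -> dict:
    """Derive the aggregate indices from a list of article dictionaries."""
    authors = defaultdict(list)
    keywords = Counter()
    citations = []
    paragraph_ids = []
    for article in articles:
        article_id = article["ids"][0]  # assumption: first id is the primary one
        for author in article.get("authors", []):
            # Entries look like "Author Name, Title, Institution".
            name = author.split(",")[0].strip()
            authors[name].append(article_id)
        keywords.update(article.get("keywords", []))  # occurrence counts
        for citation in article.get("citations", []):
            if citation not in citations:  # assumption: keep each citation once
                citations.append(citation)
        for s, section in enumerate(article.get("sections", [])):
            for p, _ in enumerate(section.get("content", [])):
                paragraph_ids.append(f"{article_id}:{s}:{p}")
    return {
        "articles": articles,
        "authors": dict(authors),
        "keywords": dict(keywords),
        "citations": citations,
        "paragraph_ids": paragraph_ids,
    }
```

Because an identifier may itself contain colons (a URL, for example), code that splits a paragraph id back into its parts should split from the right.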

Indexing and searching with LLM embeddings

python . "<src>.json" "<dst>.json" configs/embed.yaml

"<src>.json" - source JSON file containing article content and metadata
"<dst>.json" - destination JSON file to contain both articles and embedding vectors

Retrieves an embedding for every paragraph of every section of every article in the source JSON file, using the LLM configured in configs/embed.yaml.
Stores the embeddings along with article contents and metadata in the destination JSON file.
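The indexing pass amounts to walking every paragraph and asking an embedding model for a vector. A minimal sketch follows, with the model abstracted as a plain callable. The `embed` parameter, the `embeddings` key and the function name are assumptions for illustration; the actual storage layout in the destination file may differ.

```python
from typing import Callable, Sequence

# Hypothetical interface: any function that maps text to a vector.
Embedder = Callable[[str], Sequence[float]]


def embed_paragraphs(database: dict, embed: Embedder) -> dict:
    """Return a copy of the database with one vector per paragraph.

    Vectors are collected in article/section/paragraph order, so they line up
    with "paragraph_ids" if that list was built in the same order (assumption).
    """
    vectors = []
    for article in database["articles"]:
        for section in article.get("sections", []):
            for paragraph in section.get("content", []):
                vectors.append(list(embed(paragraph)))
    # "embeddings" is a hypothetical key name.
    return {**database, "embeddings": vectors}
```

Passing the embedder in as a callable keeps the traversal independent of whichever model configs/embed.yaml selects.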

echo 'search query' | python . "<dst>.json" . configs/embed.yaml

"<dst>.json" - JSON file containing both articles and embedding vectors

Reads article contents and metadata along with paragraph embeddings from source JSON file.
Retrieves the embedding of the search query using the LLM configured in configs/embed.yaml.
Runs a vector search using the index built from the source paragraph embeddings.
Retrieves the content of corresponding paragraphs from the source JSON file.
Writes the relevant search results to standard output, grouped by source (the literal prefix "Из источника" is Russian for "From source"):

{
Из источника <n>. <author> (<year>) <title>

<relevant section>

<relevant section>
}
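The search steps above can be sketched as a brute-force cosine-similarity scan followed by grouping hits by source. This is a simplified stand-in, not necessarily how the project builds its index. The function names and the `top_k` parameter are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_vector, paragraph_ids, embeddings, top_k=3):
    """Return {source id: [(section index, paragraph index), ...]} for the best hits."""
    scored = sorted(
        zip(paragraph_ids, embeddings),
        key=lambda pair: cosine_similarity(query_vector, pair[1]),
        reverse=True,
    )
    grouped = {}
    for paragraph_id, _ in scored[:top_k]:
        # Split from the right: the source id (e.g. a URL) may contain colons.
        source_id, section, paragraph = paragraph_id.rsplit(":", 2)
        grouped.setdefault(source_id, []).append((int(section), int(paragraph)))
    return grouped
```

The returned indices can then be used to look up paragraph text in the source JSON file and print it under each source heading.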

python . "<src>.json" "<dst>.html" configs/analyze.yaml

"<src>.json" - source JSON file containing both articles and embedding vectors
"<dst>.html" - destination HTML file to contain article graph and similarity report

Builds a graph of articles (represented by year), author names and keywords as configured in configs/analyze.yaml.
Prints the max_samples pairs of paragraphs from the source documents with the lowest embedding similarity.
Plots a heatmap of cosine distance between pairs of paragraph embeddings.
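For reference, the quantities behind the report can be computed as in the sketch below: cosine distance is one minus cosine similarity, the heatmap data is the full pairwise distance matrix, and the least similar pairs are those with the largest distance. The function names are hypothetical, and the plotting itself is left out.

```python
import math
from itertools import combinations


def cosine_distance(a, b):
    """1 - cosine similarity; treated as 1.0 when either vector is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0


def distance_matrix(embeddings):
    """Square matrix of pairwise cosine distances (the heatmap data)."""
    return [[cosine_distance(a, b) for b in embeddings] for a in embeddings]


def least_similar_pairs(embeddings, max_samples):
    """Index pairs (i, j), i < j, ordered from least to most similar."""
    pairs = combinations(range(len(embeddings)), 2)
    ranked = sorted(
        pairs,
        key=lambda ij: cosine_distance(embeddings[ij[0]], embeddings[ij[1]]),
        reverse=True,
    )
    return ranked[:max_samples]
```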

Generating new article structure plans

python . "<src>.json" "<dst>.json" configs/plan.yaml

"<src>.json" - source JSON file containing source article content and metadata
"<dst>.json" - target JSON file to contain new article structure plan

Reads <dst>.json (you can copy data/template.json as an example).
Starts with the article structure plan specified in the last document of <dst>.json.
Starts from the first article in <src>.json that is not in the citation list of the last document of <dst>.json.
For each article in <src>.json, uses the LLM specified in configs/plan.yaml to refine the plan:
1. Collects article content paragraphs in each section.
2. If the collected content exceeds the context size of the specified LLM, summarises the collected content.
3. Requests from the specified LLM a new version of the plan based on the current version of the plan and the current article content.
4. Stores the new plan as a new document of <dst>.json, with all processed articles in the new document's citation list.
5. Prints the updated plan to standard output.
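The refinement loop above can be sketched as follows. The LLM and the summariser are passed in as plain callables, the context size is measured in characters, already cited articles are simply skipped, and the document keys `plan` and `citations` are invented for illustration; all of these are assumptions rather than the project's actual design.

```python
def refine_plan(sources, documents, ask_llm, summarise, context_size):
    """Fold each uncited source article into the plan, one new document per article.

    sources      -- list of article dicts with "ids" and "sections"
    documents    -- list of {"plan": ..., "citations": [...]} (hypothetical keys)
    ask_llm      -- callable(plan, content) -> new plan (stand-in for the LLM)
    summarise    -- callable(text) -> shorter text
    context_size -- assumption: a character budget, not a token count
    """
    last = documents[-1]
    plan, cited = last["plan"], list(last["citations"])
    for article in sources:
        article_id = article["ids"][0]
        if article_id in cited:
            continue  # already reflected in the current plan
        # Step 1: collect the paragraphs of every section.
        content = "\n\n".join(
            paragraph for section in article["sections"] for paragraph in section["content"]
        )
        # Step 2: shrink the content if it does not fit.
        if len(content) > context_size:
            content = summarise(content)
        # Step 3: ask for a new version of the plan.
        plan = ask_llm(plan, content)
        # Step 4: store it as a new document with the updated citation list.
        cited.append(article_id)
        documents.append({"plan": plan, "citations": list(cited)})
        # Step 5: print the updated plan.
        print(plan)
    return documents
```

Storing a new document after every article means an interrupted run can resume from the last stored plan.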

Generating article content for specified plan

python . "<src>.json" "<dst>.json" configs/write.yaml

"<src>.json" - source JSON file containing source article content, embeddings and metadata
"<dst>.json" - target JSON file to contain new article structure plan and generated content

Starts with an article structure plan specified in last document of <dst>.json.
For each section of the plan uses the LLM specified in configs/write.yaml to generate section content:
1. Collects section headers on a path from structure top to current section.
2. Uses the search index to retrieve the paragraphs from source articles relevant to collected section headers.
3. Collects all article section headers along with their content if present.
4. Requests from the specified LLM new content for the current section based on the collected paragraphs and the present article content.
5. Stores the resulting section in last document of <dst>.json along with all source articles in the new document's citation list.
6. Prints the generated section to standard output.
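Step 1, collecting the headers on the path from the top of the structure to the current section, can be sketched as a depth-first walk. The nested `{"title": ..., "sections": [...]}` shape of the plan and both function names are assumptions made for illustration.

```python
def header_paths(section, parents=()):
    """Yield (path, section) for every section of a nested plan, depth first.

    `path` is the tuple of titles from the top of the structure down to `section`.
    """
    path = parents + (section["title"],)
    yield path, section
    for child in section.get("sections", []):
        yield from header_paths(child, path)


def build_query(path):
    """Join the collected headers into one search query string (assumed format)."""
    return " / ".join(path)
```

For example, with a hypothetical plan:

```python
plan = {"title": "Survey", "sections": [{"title": "Methods", "sections": [{"title": "Data"}]}]}
for path, _ in header_paths(plan):
    print(build_query(path))
```

Each resulting query string can then be embedded and passed to the vector search to retrieve relevant source paragraphs.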

Rewriting article content

python . "<src>.json" "<dst>.json" configs/rewrite.yaml

"<src>.json" - source JSON file containing source article content, embeddings and metadata
"<dst>.json" - target JSON file to contain rewritten article content

Starts with an article specified in last document of <dst>.json.
For each section of the article, uses the LLM specified in configs/rewrite.yaml to generate new section content:
1. Collects section headers on a path from structure top to current section, then appends current section content.
2. Uses the search index to retrieve the paragraphs from source articles relevant to collected section headers and content.
3. Requests from the specified LLM new content for the current section based on the collected paragraphs and the present section content.
4. Stores the resulting section in last document of <dst>.json along with all source articles in the new document's citation list.
5. Prints the generated section to standard output.
