An automated LLM prompting framework for:
- preprocessing journal articles into a uniform structure;
- extracting content and metadata from preprocessed articles;
- indexing and searching with LLM embeddings;
- generating new article structure plans;
- generating article content for a specified plan.
python . "<src>.rdf" "<dst dir>"
"<src>.rdf" - path to source Zotero RDF file containing links to PDF attachments
"<dst dir>" - destination directory where preprocessed article files shall be stored
- Parses the Zotero RDF file containing article citations.
- Parses linked PDF files via marker-pdf.
- Stores parsed article content (without images) in the following Markdown structure:
<article id (DOI, URL, etc)>
...
<article id (DOI, URL, etc)>
<year of publication> <journal name> <issue> <volume> etc
<author name> <author title> <institution> etc
...
<author name> <author title> <institution> etc
# <article title>
Tags: <from Zotero> ...
[Abstract. ...]
[Keywords: ...]
[## <section title>]
<section content>
...
<section content>
[
## (References|Literature|etc)
[1. ] <Author>, <Title> etc
...
[N. ] <Author>, <Title> etc
]

python . "<src dir>" "<dst>.json"
"<src dir>" - source directory containing Markdown article files
"<dst>.json" - destination JSON file where article content and metadata shall be stored
- Parses articles in the Markdown format described above.
- Stores all articles in the following JSON format:
{
  "articles": [
    {
      "ids": [ "DOI", "URL", "etc" ],
      "year": "2025",
      "authors": [
        "Author Name, Title, Institution", // ...
      ],
      "title": "<article title>",
      "abstract": "<...>",
      "keywords": [ "keyword", /* ... */ ],
      "sections": [
        {
          [ "title": "<section title>", ]
          "content": [
            "<paragraph>", // ...
          ]
        }, // ...
      ],
      "citations": [
        "Author, Title ...", // ...
      ]
    }
  ],
  "authors": {
    "<author name>": [
      "DOI", "DOI", "DOI", // ...
    ]
  },
  "keywords": {
    "keyword": 1, // ... occurrence count
  },
  "citations": [
    "Author, Title ...", // ...
  ],
  "paragraph_ids": [
    "<DOI|URL|etc>:<section index>:<paragraph index>", // ...
  ]
}

python . "<src>.json" "<dst>.json" configs/embed.yaml
"<src>.json" - source JSON file containing article content and metadata
"<dst>.json" - destination JSON file to contain both articles and embedding vectors
- Retrieves embeddings for each paragraph of each section of each article in the source JSON file from the LLM configured in configs/embed.yaml.
- Stores the embeddings along with article contents and metadata in the destination JSON file.
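
The embedding step above can be pictured with the following minimal sketch. It walks the articles, builds paragraph IDs in the `<id>:<section index>:<paragraph index>` form, and pairs each ID with a vector. The `embed` callable, the use of the first entry of `ids` as the article ID, the zero-based indices and the returned mapping are all assumptions made for illustration; the actual model and output layout are determined by the tool and configs/embed.yaml.

```python
from typing import Callable, Dict, List

Vector = List[float]


def embed_paragraphs(data: dict, embed: Callable[[str], Vector]) -> Dict[str, Vector]:
    """Return a mapping from paragraph ID to embedding vector."""
    vectors: Dict[str, Vector] = {}
    for article in data["articles"]:
        article_id = article["ids"][0]  # assumption: the first ID identifies the article
        for s_idx, section in enumerate(article["sections"]):
            for p_idx, paragraph in enumerate(section["content"]):
                # assumption: section and paragraph indices are zero-based
                vectors[f"{article_id}:{s_idx}:{p_idx}"] = embed(paragraph)
    return vectors


def toy_embed(text: str) -> Vector:
    # Hypothetical stand-in for a real embedding model: two crude text statistics.
    return [float(len(text)), float(text.count(" "))]


sample = {
    "articles": [
        {"ids": ["10.1000/x"], "sections": [{"content": ["One two.", "Three."]}]}
    ]
}
sample_vectors = embed_paragraphs(sample, toy_embed)
```

Passing the embedding function as a parameter keeps the traversal independent of any particular LLM provider.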
echo 'search query' | python . "<dst>.json" . configs/embed.yaml
"<dst>.json" - source JSON file containing both articles and embedding vectors
- Reads article contents and metadata along with paragraph embeddings from the source JSON file.
- Retrieves the embedding of the search query from the LLM configured in configs/embed.yaml.
- Runs a vector search using the index built from the source paragraph embeddings.
- Retrieves the content of corresponding paragraphs from the source JSON file.
- Writes to standard output relevant search results grouped by source:
{
From source <n>. <author> (<year>) <title>
<relevant section>
<relevant section>
}

python . "<src>.json" "<dst>.html" configs/analyze.yaml
"<src>.json" - source JSON file containing both articles and embedding vectors
"<dst>.html" - destination HTML file to contain article graph and similarity report
- Builds a graph of articles (represented by year), author names and keywords as configured in configs/analyze.yaml.
- Prints max_samples pairs of paragraphs from source documents having the least embedding similarity.
- Plots a heatmap of cosine distance between pairs of paragraph embeddings.
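
To make the similarity report more concrete, here is a minimal sketch of cosine distance and of selecting the least similar paragraph pairs. The function names and the dictionary keyed by paragraph ID are assumptions for illustration, and the sketch compares all pairs rather than only pairs from different documents; the tool itself may compute and filter these differently.

```python
import math
from itertools import combinations
from typing import Dict, List, Tuple


def cosine_distance(a: List[float], b: List[float]) -> float:
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def least_similar_pairs(
    vectors: Dict[str, List[float]], max_samples: int
) -> List[Tuple[str, str, float]]:
    """Return up to max_samples pairs with the largest cosine distance."""
    pairs = [
        (i, j, cosine_distance(vectors[i], vectors[j]))
        for i, j in combinations(vectors, 2)
    ]
    pairs.sort(key=lambda pair: pair[2], reverse=True)
    return pairs[:max_samples]


demo = {"a:0:0": [1.0, 0.0], "a:0:1": [0.0, 1.0], "b:0:0": [1.0, 1.0]}
top = least_similar_pairs(demo, max_samples=1)
```

The same pairwise distances, arranged as a square matrix, are what a heatmap would display.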
python . "<src>.json" "<dst>.json" configs/plan.yaml
"<src>.json" - source JSON file containing source article content and metadata
"<dst>.json" - target JSON file to contain new article structure plan
- Reads <dst>.json (you can copy data/template.json as an example).
- Starts with the article structure plan specified in the last document of <dst>.json.
- Starts with the first article in <src>.json that is not in the citation list of the last document of <dst>.json.
- For each article in <src>.json, uses the LLM specified in configs/plan.yaml to refine the plan:
  - Collects article content paragraphs in each section.
  - If the collected content exceeds the context size of the specified LLM, summarises the collected content.
  - Asks the specified LLM for a new version of the plan based on the current version of the plan and the current article content.
- Stores the new plan as a new document of <dst>.json along with all processed articles in the new document's citation list.
- Prints the updated plan to standard output.
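
The refinement loop above could look roughly like the following sketch. The `llm` callable, the prompt wording and the character-based size check are hypothetical; a real implementation would count tokens against the context size of the model named in configs/plan.yaml. Storing the plan as a new document is omitted for brevity.

```python
from typing import Callable, List


def refine_plan(
    plan: str, articles: List[dict], llm: Callable[[str], str], max_chars: int
) -> str:
    """Feed articles to the LLM one by one, replacing the plan after each."""
    for article in articles:
        content = "\n".join(p for s in article["sections"] for p in s["content"])
        if len(content) > max_chars:  # crude stand-in for a token-based check
            content = llm("Summarise:\n" + content)
        plan = llm(
            f"Current plan:\n{plan}\n\nArticle:\n{content}\n\nReturn an improved plan."
        )
    return plan


calls: List[str] = []


def fake_llm(prompt: str) -> str:
    # Hypothetical stand-in that records prompts instead of calling a model.
    calls.append(prompt)
    if prompt.startswith("Summarise:"):
        return "short summary"
    return "plan v%d" % sum(1 for c in calls if c.startswith("Current plan:"))


articles = [
    {"sections": [{"content": ["short text"]}]},
    {"sections": [{"content": ["x" * 50]}]},
]
final_plan = refine_plan("plan v0", articles, fake_llm, max_chars=20)
```

In this run the second article exceeds the limit, so it is summarised before the plan request is made.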
python . "<src>.json" "<dst>.json" configs/write.yaml
"<src>.json" - source JSON file containing source article content, embeddings and metadata
"<dst>.json" - target JSON file to contain new article structure plan and generated content
- Starts with the article structure plan specified in the last document of <dst>.json.
- For each section of the plan, uses the LLM specified in configs/write.yaml to generate section content:
  - Collects section headers on the path from the top of the structure to the current section.
  - Uses the search index to retrieve the paragraphs from source articles relevant to the collected section headers.
  - Collects all article section headers along with their content, if present.
  - Asks the specified LLM for new content of the current section based on the collected paragraphs and the present article content.
- Stores the resulting section in the last document of <dst>.json along with all source articles in the new document's citation list.
- Prints the generated section to standard output.
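
The two retrieval steps above, collecting the header path and querying the search index, are sketched below. The nested plan layout with `title` and `sections` keys is an assumption, and a brute-force cosine ranking stands in for whatever vector index the tool actually builds. In practice the query vector would be the embedding of the joined header path.

```python
import math
from typing import Dict, List, Optional


def header_path(
    plan: dict, target: str, trail: Optional[List[str]] = None
) -> Optional[List[str]]:
    """Depth-first search returning titles from the root down to `target`."""
    trail = (trail or []) + [plan["title"]]
    if plan["title"] == target:
        return trail
    for child in plan.get("sections", []):
        found = header_path(child, target, trail)
        if found:
            return found
    return None


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def top_k(query: List[float], index: Dict[str, List[float]], k: int) -> List[str]:
    """Return the IDs of the k paragraphs most similar to the query vector."""
    return sorted(index, key=lambda pid: cosine(query, index[pid]), reverse=True)[:k]


demo_plan = {
    "title": "Review",
    "sections": [
        {"title": "Methods", "sections": [{"title": "Sampling"}]},
        {"title": "Results"},
    ],
}
demo_index = {"a:0:0": [1.0, 0.0], "a:1:0": [0.0, 1.0], "b:0:0": [0.9, 0.1]}

path = header_path(demo_plan, "Sampling")
hits = top_k([1.0, 0.0], demo_index, k=2)
```

Including the parent headers in the query gives the search more context than the section title alone.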
python . "<src>.json" "<dst>.json" configs/rewrite.yaml
"<src>.json" - source JSON file containing source article content, embeddings and metadata
"<dst>.json" - target JSON file to contain rewritten article content
- Starts with the article specified in the last document of <dst>.json.
- For each section of the article, uses the LLM specified in configs/rewrite.yaml to generate new section content:
  - Collects section headers on the path from the top of the structure to the current section, then appends the current section content.
  - Uses the search index to retrieve the paragraphs from source articles relevant to the collected section headers and content.
  - Asks the specified LLM for new content of the current section based on the collected paragraphs and the present section content.
- Stores the resulting section in the last document of <dst>.json along with all source articles in the new document's citation list.
- Prints the generated section to standard output.