Home
- Introduction
- Requirements
- Container 1: Metadata Splitting and Cleaning
- Container 2: Benchmark Annotation via Interactive Interface
- Container 3: GPT Interaction
- Container 4: GPT Performance Evaluation
This repository contains a modular, containerized pipeline for processing and annotating environmental sample metadata as part of the MicrobeAtlasProject. Each container encapsulates a specific set of tasks, as depicted below - from preprocessing metadata (C1 - 🔴 red), creating a benchmark dataset (C2 - 🟠 orange), interacting with GPT models and generating embeddings (C3 - 🟣 purple), to GPT output evaluation (C4 - 🟢 green).

Let's start with the requirements!
- Clone the repository:
cd ~
git clone link_to_clone_repo
- Make a directory:
cd ~
mkdir MicrobeAtlasProject
- Download these large files: sample.info.gz, metadata.out, ... and place them into ~/MicrobeAtlasProject/.
- You should end up with two directories, which the Docker commands below mount:
~/MicrobeAtlasProject/ (the data files)
~/github/metadata_mining/scripts/ (the pipeline scripts)
Download and install Docker Desktop:
- Download Docker Desktop (macOS/Windows)
- Install Docker Engine (Linux)
Verify the installation:
docker --version
Start Docker (on macOS):
open -a Docker
Build the Docker image (run this from the repository directory containing the Dockerfile):
docker build -t metadmin .
This Docker container is part of the MicrobeAtlasProject pipeline. It provides a consistent environment to run all scripts related to processing and cleaning environmental metadata, including coordinate parsing, ontology translation, and exploratory analysis.
docker run -it --rm \
-v ~/MicrobeAtlasProject:/MicrobeAtlasProject \
-v ~/github/metadata_mining/scripts:/app/scripts \
metadmin \
python /app/scripts/dirs.py \
--input_file sample.info_test.gz \
--output_dir sample_info_split_dirs_test \
--figure_path files_distribution_in_dirs_test.pdf
docker run -it --rm \
-v ~/MicrobeAtlasProject:/MicrobeAtlasProject \
-v ~/github/metadata_mining/scripts:/app/scripts \
metadmin \
python /app/scripts/fetch_and_join_ontologies.py \
--wanted_ontologies FOODON ENVO UBERON PO \
--output_file ontologies_dict
Increase the file descriptor limit first. By default, many operating systems limit how many files can be open at once. Since this script processes many files in parallel, you must increase the ulimit:
ulimit -n 200000
docker run -it --rm \
-v ~/MicrobeAtlasProject:/MicrobeAtlasProject \
-v ~/github/metadata_mining/scripts:/app/scripts \
metadmin \
python /app/scripts/clean_and_envo_translate.py \
--ontology_dict ontologies_dict.pkl \
--metadata_dirs sample_info_split_dirs_test \
--max_processes 8
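If you want to verify the limit from Python before launching the run, a small sketch using the Unix-only resource module can help (check_fd_limit is a hypothetical helper, not part of the pipeline):

```python
import resource  # Unix-only


def check_fd_limit(required=200_000):
    """Return True if the soft file-descriptor limit meets `required`,
    raising the soft limit up to the hard limit first if needed."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft < required:
        # The soft limit can be raised by an unprivileged process,
        # but only up to the hard limit.
        target = required if hard == resource.RLIM_INFINITY else min(required, hard)
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
        soft, _ = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft >= required
```

If this returns False, raise the hard limit in your shell (ulimit -n 200000) before starting the container.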
This script compares file sizes before and after cleaning and estimates the token-level reduction after the cleaning. It calculates token reduction using bootstrap sampling (default: 100 iterations × 100 samples).
docker run -it --rm \
-v ~/MicrobeAtlasProject:/MicrobeAtlasProject \
-v ~/github/metadata_mining/scripts:/app/scripts \
metadmin \
python /app/scripts/check_metadata_sizes.py
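The bootstrap estimate described above (resampling per-file token counts with replacement) can be sketched as follows; the function name and the use of raw per-file counts are illustrative, not the script's actual implementation:

```python
import random


def bootstrap_token_reduction(before_counts, after_counts,
                              n_iterations=100, n_samples=100, seed=0):
    """Estimate the token reduction (%) from cleaning by repeatedly
    resampling (file_before, file_after) token-count pairs."""
    rng = random.Random(seed)
    pairs = list(zip(before_counts, after_counts))
    estimates = []
    for _ in range(n_iterations):
        # Sample n_samples files with replacement.
        sample = [rng.choice(pairs) for _ in range(n_samples)]
        before = sum(b for b, _ in sample)
        after = sum(a for _, a in sample)
        estimates.append(100.0 * (before - after) / before)
    mean = sum(estimates) / len(estimates)
    return mean, min(estimates), max(estimates)
```

The spread between the minimum and maximum bootstrap estimates gives a rough uncertainty band around the mean reduction.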
This script examines in which metadata fields the benchmark sub-biome information appears. It scans the cleaned metadata files and checks whether the sub-biome (e.g. human gut, sediment, leaf) is found fully or partially in each metadata field. This helps identify the most informative fields across samples and biomes. It outputs a plot and CSV summaries with the top-matching fields, based on 1000 random files.
docker run -it --rm \
-v ~/MicrobeAtlasProject:/MicrobeAtlasProject \
-v ~/github/metadata_mining/scripts:/app/scripts \
metadmin \
python /app/scripts/field_distrib_analysis.py \
--gold_dict gold_dict.pkl
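The full-versus-partial matching logic can be sketched as below; field_matches and the example field names are hypothetical, and the real metadata fields may differ:

```python
def field_matches(sub_biome, metadata_fields):
    """Classify each metadata field as a 'full' match (the whole
    sub-biome phrase appears) or a 'partial' match (any word of
    the sub-biome appears)."""
    words = sub_biome.lower().split()
    hits = {}
    for field, value in metadata_fields.items():
        text = value.lower()
        if sub_biome.lower() in text:
            hits[field] = "full"
        elif any(w in text for w in words):
            hits[field] = "partial"
    return hits


# Example with hypothetical field names:
fields = {"env_biome": "human gut metagenome", "isolation_source": "gut"}
print(field_matches("human gut", fields))
# → {'env_biome': 'full', 'isolation_source': 'partial'}
```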
docker run --rm \
-v ~/MicrobeAtlasProject:/MicrobeAtlasProject \
-v ~/github/metadata_mining/scripts:/app/scripts \
metadmin \
python /app/scripts/parse_lat_lon_from_metadata.py \
--reversal_file samples_with_lat_lon_reversal.tsv \
--metadata_file metadata.out \
| grep '^OUTPUT:' \
| cut -f1-5 \
| tr '\t' ' ' \
| sed 's/  */ /g' \
| sed 's/ *$//' \
> ~/MicrobeAtlasProject/sample.coordinates.reparsed.filtered_fresh
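For reference, the shell post-processing in the pipe above (keep only OUTPUT: lines, take the first five tab-separated fields, turn tabs into single spaces, squeeze repeated spaces, drop trailing whitespace) is roughly equivalent to this Python sketch:

```python
def clean_line(line):
    """Mirror the shell pipeline's post-processing of one stdout line.
    Returns the cleaned line, or None for lines that grep would drop."""
    if not line.startswith("OUTPUT:"):
        return None
    # cut -f1-5: keep the first five tab-separated fields.
    fields = line.rstrip("\n").split("\t")[:5]
    # tr '\t' ' ' plus squeezing runs of whitespace and trailing spaces.
    return " ".join(" ".join(fields).split())
```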
This container supports the creation and manual curation of a benchmark (also referred to as the gold-standard dictionary, gold_dict.pkl), which maps selected sample IDs to:
- a biome (animal, plant, soil, water, other)
- a specific sample origin (sub-biome)
- geographic coordinates (latitude/longitude)
- a short geographic location description
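Concretely, one benchmark entry could look like the sketch below; the key names are hypothetical, and the actual schema inside gold_dict.pkl may differ:

```python
# Hypothetical sketch of one gold_dict entry mapping a sample ID to
# its biome, sub-biome, coordinates, and location description.
gold_dict = {
    "SRS1234567": {
        "biome": "animal",
        "sub_biome": "human gut",
        "latitude": 47.37,
        "longitude": 8.55,
        "location": "Zurich, Switzerland",
    }
}
```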
This container includes two interactive scripts:
- make_gold_dict.py: read the metadata from each sample and annotate samples yourself.
- edit_gold_dict.py: modify or correct existing entries (when you realise you made a mistake).
Launch Docker container interactively:
docker run -it --rm \
--entrypoint bash \
-v ~/MicrobeAtlasProject:/MicrobeAtlasProject \
-v ~/github/metadata_mining/scripts:/app/scripts \
metadmin
Activate the environment inside the container:
conda activate metadmin_env
To create the benchmark from scratch or to continue making it, run:
python /app/scripts/make_gold_dict.py
This starts a session where you can annotate samples one by one. Your progress is automatically saved to gold_dict.pkl.
To edit entries in the existing dictionary:
python /app/scripts/edit_gold_dict.py
To exit either session just type:
exit
💾 In both cases your changes are automatically saved to /MicrobeAtlasProject/gold_dict.pkl, which is mounted from your local system.
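A minimal sketch of this save-after-every-change pattern, using an atomic replace so an interrupted session cannot leave a corrupt pickle behind (the helper name is hypothetical, not the scripts' actual code):

```python
import os
import pickle
import tempfile


def save_gold_dict(gold_dict, path="gold_dict.pkl"):
    """Persist the annotation dictionary after every change.
    Writing to a temp file and renaming makes the save atomic."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "wb") as fh:
        pickle.dump(gold_dict, fh)
    os.replace(tmp, path)  # atomic rename over the old file
```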
This container handles all steps related to GPT-based annotation of metadata, including:
- synchronous or asynchronous interactions with the OpenAI API
- preparing and submitting batch jobs (asynchronous runs)
- fetching responses (asynchronous runs)
- generating sub-biome embeddings from GPT outputs and from benchmark data
You can run either:
- Synchronous interaction with OpenAI
- Asynchronous interaction via the batch API (two steps)
Then, generate embeddings from GPT results and from benchmark data.
This script performs end-to-end metadata annotation in a single run using synchronous OpenAI API requests.
docker run -it --rm \
-v ~/MicrobeAtlasProject:/MicrobeAtlasProject \
-v ~/github/metadata_mining/scripts:/app/scripts \
metadmin \
python /app/scripts/openai_main.py \
--work_dir . \
--input_gold_dict gold_dict.pkl \
--n_samples_per_biome 5 \
--chunking no \
--chunk_size 2000 \
--seed 22 \
--directory_with_split_metadata sample_info_split_dirs \
--system_prompt_file openai_system_better_prompt_json.txt \
--encoding_name cl100k_base \
--api_key_path my_api_key \
--model gpt-3.5-turbo-1106 \
--temperature 1.00 \
--max_tokens 4096 \
--top_p 0.75 \
--frequency_penalty 0.25 \
--presence_penalty 1.5 \
--max_requests_per_minute 3500 \
--opt_text normal \
--output_format json
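The --chunking and --chunk_size options control how long metadata texts are split before being sent to the model. A minimal sketch of token-based chunking, where whitespace tokens stand in for the cl100k_base encoding the pipeline actually uses:

```python
def chunk_text(text, chunk_size=2000):
    """Split metadata text into chunks of at most `chunk_size` tokens.
    A naive whitespace tokenizer is used here as a stand-in for the
    cl100k_base encoding (tiktoken)."""
    tokens = text.split()
    return [" ".join(tokens[i:i + chunk_size])
            for i in range(0, len(tokens), chunk_size)]
```

With --chunking no, the whole metadata text is sent as a single request instead.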
Use this if you want to take advantage of OpenAI's batch API for more efficient, large-scale requests.
- Step 1: Submit Batch Job (async requests)
This script prepares metadata and submits it as an OpenAI batch job.
docker run -it --rm \
-v ~/MicrobeAtlasProject:/MicrobeAtlasProject \
-v ~/github/metadata_mining/scripts:/app/scripts \
metadmin \
python /app/scripts/gpt_async_batch.py \
--work_dir . \
--input_gold_dict gold_dict.pkl \
--n_samples_per_biome 5 \
--chunking "no" \
--chunk_size 3000 \
--seed 22 \
--directory_with_split_metadata "sample_info_split_dirs" \
--system_prompt_file "openai_system_prompt.txt" \
--encoding_name "cl100k_base" \
--api_key_path "my_api_key" \
--model "gpt-3.5-turbo-1106" \
--temperature 1.00 \
--max_tokens 4096 \
--top_p 0.75 \
--frequency_penalty 0.25 \
--presence_penalty 1.5 \
--output_format "inline"
- Step 2: Fetch Batch Results
After your batch job is submitted, OpenAI typically processes it within a few minutes to a few hours. However, the maximum processing time is 24 hours. If your job hasn't completed within that window, it will expire, and you'll need to resubmit it.
We have successfully submitted up to 700,000 metadata samples per day and consistently received results well within 24 hours.
To fetch and save completed results locally, run:
docker run -it --rm \
-v ~/MicrobeAtlasProject:/MicrobeAtlasProject \
-v ~/github/metadata_mining/scripts:/app/scripts \
metadmin \
python /app/scripts/gpt_async_fetch_and_save.py \
--work_dir . \
--api_key_path my_api_key
This script creates text-embedding-3-small vectors from:
- GPT-generated sub-biomes (gpt_clean_output*.csv / .txt)
- Your benchmark sub-biomes (gold_dict.pkl)
docker run -it --rm \
-v ~/MicrobeAtlasProject:/MicrobeAtlasProject \
-v ~/github/metadata_mining/scripts:/app/scripts \
metadmin \
python /app/scripts/embeddings_from_sb.py \
--directory_path . \
--api_key_path my_api_key_embeddings \
--gold_dict_path gold_dict.pkl
This container evaluates GPT performance by comparing GPT biome and sub-biome annotations (.tsv files for biomes, .json files for sub-biomes) against those of the benchmark (gold_dict.pkl), separately for each GPT run.
- Biome evaluation: compares strings using either a lenient or an exact match and produces a summary CSV with per-run biome agreement metrics.
- Sub-biome evaluation: compares embeddings of GPT runs against embeddings of the benchmark - it computes cosine similarity between matched embeddings, compares the distribution of similarities against the background, performs pairwise statistical comparisons, and produces a summary CSV with per-run sub-biome similarity metrics.
- Geographic evaluation: compares GPT geographic annotations to the metadata-extracted coordinates.
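The cosine similarity used for the sub-biome comparison is, in sketch form:

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors: the dot
    product divided by the product of their norms, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

A GPT sub-biome embedding close to its benchmark counterpart scores near 1; unrelated pairs score near 0.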
Four scripts to run:
- validate_biomes_subbiomes.py: computes the biome and sub-biome evaluation metrics
- overall_analysis.py: runs the overall analysis (interactive)
- coord_to_text.py: reverse-geocodes coordinates into place names
- geo_check.py: evaluates GPT geographic annotations (interactive)
This script computes the per-run biome agreement and sub-biome similarity metrics against the benchmark:
docker run -it --rm \
-v ~/MicrobeAtlasProject:/MicrobeAtlasProject \
-v ~/github/metadata_mining/scripts:/app/scripts \
metadmin \
python /app/scripts/validate_biomes_subbiomes.py \
--work_dir . \
--map_file gpt_file_label_map.tsv
This script needs to run interactively because you may need to choose between files when several files share the same label. Launch the Docker container interactively:
docker run -it --rm \
--entrypoint bash \
-v ~/MicrobeAtlasProject:/MicrobeAtlasProject \
-v ~/github/metadata_mining/scripts:/app/scripts \
metadmin
Activate the environment inside the container:
conda activate metadmin_env
To select the files and run the overall analysis, run:
python /app/scripts/overall_analysis.py \
--work_dir . \
--metadata_dir sample_info_split_dirs \
--keyword_based_annot_file joao_biomes_parsed.csv
This starts a session where you can choose between files when they share the same label.
To exit the session just type: exit
This script performs reverse geocoding on a set of unique latitude/longitude coordinates, converting each coordinate pair into a human-readable place name (like a city, region, or country). It uses the Nominatim geocoding service from OpenStreetMap. It may take long to run since we use the free service (no API key); expect roughly 1.3 seconds per coordinate pair.
docker run -it --rm \
-e PYTHONUNBUFFERED=1 \
-v ~/MicrobeAtlasProject:/MicrobeAtlasProject \
-v ~/github/metadata_mining/scripts:/app/scripts \
metadmin \
python -u /app/scripts/coord_to_text.py \
--work_dir . \
--coordinates_file sample.coordinates.reparsed.filtered \
--output_file geocoded_coordinates.csv \
--min_delay_seconds 1.3
You can check the progress by running from another terminal:
tail -f ~/MicrobeAtlasProject/geocoding_progress.log
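The rate-limited loop behind this run can be sketched as follows; geocode_fn is a hypothetical stand-in for the actual Nominatim lookup, and the delay mirrors --min_delay_seconds:

```python
import time


def geocode_all(coordinates, geocode_fn, min_delay_seconds=1.3):
    """Reverse-geocode coordinate pairs one by one, sleeping between
    requests to respect Nominatim's usage policy for free access."""
    results = {}
    for i, (lat, lon) in enumerate(coordinates):
        if i:  # no delay before the first request
            time.sleep(min_delay_seconds)
        results[(lat, lon)] = geocode_fn(lat, lon)
    return results
```

This is why the runtime scales linearly with the number of unique coordinate pairs.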
This script needs to run interactively because it gives you the possibility to evaluate a set of GPT geographic locations versus the extracted coordinates. You will pick "who" was correct: coordinates-derived geographic location (from metadata) or GPT-derived geographic location. This will help you qualify the mismatches between the two. Start by launching the Docker container interactively:
docker run -it --rm \
--entrypoint bash \
-v ~/MicrobeAtlasProject:/MicrobeAtlasProject \
-v ~/github/metadata_mining/scripts:/app/scripts \
metadmin
Activate the environment inside the container:
conda activate metadmin_env
Then run:
python /app/scripts/geo_check.py \
--work_dir . \
--metadata_dir sample_info_split_dirs \
--api_key_file /MicrobeAtlasProject/google_maps_api_key \
--coordinates_file sample.coordinates.reparsed.filtered \
--translated_coordinates geocoded_coordinates.csv \
--random_misclassified_samples_dict random_misclassified_samples_dict.pkl \
--output_map_all_matches map_with_color_coded_points_all.html \
--output_map_all_mismatches map_with_color_coded_points_mismatches.html
You can quit at any time by typing 'QUIT'.
To exit the session just type: exit