
Learning to Retrieve for Environmental Knowledge Discovery: An Augmentation-Adaptive Self-Supervised Learning Framework

Overview

[Figure: Overview of the A2SL framework]

Requirements

System Requirements

  • Python 3.8 or higher
  • CUDA-compatible GPU (recommended for training)

Core Dependencies

torch>=2.0.0
torchvision>=0.15.0
torchaudio>=2.0.0
numpy>=1.24.0
pandas>=2.0.0
scikit-learn>=1.3.0
matplotlib>=3.8.0
scipy>=1.11.0
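
Assuming a standard pip environment, the pinned dependencies above can be installed directly from the command line:

pip install "torch>=2.0.0" "torchvision>=0.15.0" "torchaudio>=2.0.0" "numpy>=1.24.0" "pandas>=2.0.0" "scikit-learn>=1.3.0" "matplotlib>=3.8.0" "scipy>=1.11.0"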

Quick Start

Basic Usage

python train.py

With Custom Parameters

python train.py --task DO --save_path results_experiment1

Training Configuration

  • --task: Task type - 'DO' for dissolved oxygen or 'Temp' for temperature (default: 'DO')
  • --seed: Random seed for reproducibility (default: 21)
  • --save_path: Directory to save all training outputs (default: 'top_n_100'); see the parser sketch below
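
A minimal sketch of how these flags could be wired up with argparse; flag names and defaults come from the list above, everything else is an assumption and the actual train.py may define them differently:

import argparse

# Illustrative parser mirroring the documented command-line flags.
parser = argparse.ArgumentParser(description="A2SL training")
parser.add_argument("--task", choices=["DO", "Temp"], default="DO",
                    help="'DO' for dissolved oxygen, 'Temp' for temperature")
parser.add_argument("--seed", type=int, default=21,
                    help="random seed for reproducibility")
parser.add_argument("--save_path", type=str, default="top_n_100",
                    help="directory for all training outputs")
args = parser.parse_args()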

Training Pipeline

The model is trained through a five-phase pipeline:

Phase 1: Encoder Training

  • Model: BiLSTM encoder
  • Objective: Learn lake embeddings using contrastive learning
  • Loss: Contrastive loss with positive, semi-positive, and negative pairs (see the sketch after this list)
  • Output: Trained encoder + precomputed embeddings for retrieval
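
A minimal sketch of a contrastive objective over positive, semi-positive, and negative pairs, assuming cosine similarity and a down-weighted semi-positive term; the actual A2SL loss may differ:

import torch
import torch.nn.functional as F

def contrastive_loss(anchor, positive, semi_positive, negative,
                     temperature=0.1, semi_weight=0.5):
    # Cosine similarities between the anchor embedding and each pair type,
    # scaled by a temperature as in InfoNCE-style losses.
    sim_pos = F.cosine_similarity(anchor, positive, dim=-1) / temperature
    sim_semi = F.cosine_similarity(anchor, semi_positive, dim=-1) / temperature
    sim_neg = F.cosine_similarity(anchor, negative, dim=-1) / temperature

    # Positives are pulled toward the anchor, semi-positives with a reduced
    # weight; negatives only contribute to the softmax denominator.
    logits = torch.stack([sim_pos, sim_semi, sim_neg], dim=-1)
    log_prob = F.log_softmax(logits, dim=-1)
    return -(log_prob[..., 0] + semi_weight * log_prob[..., 1]).mean()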

Phase 2: Yearly Model Training

  • Model: LSTM
  • Objective: Learn annual dissolved oxygen patterns
  • Training: Two-step process (sketched below):
    1. Pretraining with simulation data
    2. Fine-tuning with real observations
  • Output: Trained yearly prediction model
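
A minimal sketch of the two-step schedule, assuming MSE regression, Adam, and hypothetical data loaders (sim_loader for simulation data, obs_loader for real observations):

import torch

def train_yearly_model(model, sim_loader, obs_loader,
                       pretrain_epochs=50, finetune_epochs=30, lr=1e-3):
    criterion = torch.nn.MSELoss()

    # Step 1: pretraining on simulated DO labels.
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(pretrain_epochs):
        for x, y_sim in sim_loader:
            optimizer.zero_grad()
            criterion(model(x), y_sim).backward()
            optimizer.step()

    # Step 2: fine-tuning on real observations with a smaller learning rate.
    optimizer = torch.optim.Adam(model.parameters(), lr=lr * 0.1)
    for _ in range(finetune_epochs):
        for x, y_obs in obs_loader:
            optimizer.zero_grad()
            criterion(model(x), y_obs).backward()
            optimizer.step()
    return model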

Phase 3: Joint Encoder-Decoder Training

  • Models: Encoder + Monthly decoders
  • Objective: Jointly optimize encoder and monthly prediction decoders
  • Features: Retrieval-augmented learning with the top-N most similar samples (see the retrieval sketch after this list)
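
A minimal sketch of the retrieval step, assuming cosine similarity over the precomputed embeddings from Phase 1; function and variable names are illustrative:

import torch
import torch.nn.functional as F

def retrieve_top_n(query_emb, bank_emb, n=100):
    # query_emb: (D,) embedding of the current sample.
    # bank_emb:  (M, D) precomputed embeddings from Phase 1.
    query = F.normalize(query_emb, dim=-1)
    bank = F.normalize(bank_emb, dim=-1)
    sims = bank @ query                     # (M,) cosine similarities
    return torch.topk(sims, k=min(n, bank.shape[0])).indices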

Phase 4: Discriminator Training

  • Models: Two discriminators, one each for the epilimnion (epi) and hypolimnion (hyp)
  • Objective: Learn to discriminate between yearly and monthly model predictions
  • Training: Binary classification on synthetic vs retrieved predictions
  • Output: Trained discriminators with optimal thresholds (see the threshold-search sketch below)
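
A minimal sketch of the threshold selection after binary (BCE) training, assuming a held-out split with 0/1 labels indicating which model should be preferred; the actual selection criterion in A2SL may differ:

import torch

def fit_threshold(discriminator, features, labels, steps=101):
    # Score the validation samples once with the trained discriminator.
    with torch.no_grad():
        probs = torch.sigmoid(discriminator(features)).squeeze(-1)

    # Sweep candidate cutoffs and keep the one with the best accuracy.
    best_acc, best_thr = 0.0, 0.5
    for thr in torch.linspace(0.0, 1.0, steps):
        acc = ((probs > thr).float() == labels).float().mean().item()
        if acc > best_acc:
            best_acc, best_thr = acc, float(thr)
    return best_thr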

Phase 5: Final Testing and Inference

  • Process: Use the discriminators to select between yearly and monthly predictions (see the selection sketch below)
  • Output: Final predictions and performance metrics
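
A minimal sketch of the final selection, assuming one discriminator and threshold per layer (epi/hyp) applied element-wise; names are illustrative:

import torch

def select_predictions(discriminator, features, yearly_pred, monthly_pred, threshold):
    # The discriminator score decides, per sample, which model's output is kept.
    with torch.no_grad():
        score = torch.sigmoid(discriminator(features)).squeeze(-1)
    use_monthly = score > threshold
    return torch.where(use_monthly, monthly_pred, yearly_pred)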

Data Description

There are 47 features in total (the datetime column serves as the time index rather than a model feature):

Core features:

  • datetime: Date and time information
  • sat_hypo: Simulated hypolimnion DO saturation concentration
  • thermocline_depth: Simulated thermocline depth
  • temperature_epi: Simulated epilimnion water temperature
  • temperature_hypo: Simulated hypolimnion water temperature
  • volume_epi: Simulated epilimnion volume
  • volume_hypo: Simulated hypolimnion volume
  • wind: Derived wind speed
  • airtemp: Derived air temperature
  • fnep: Simulated net ecosystem production flux
  • fmineral: Simulated mineralisation flux
  • fsed: Simulated net sedimentation flux
  • fatm: Simulated atmospheric exchange flux
  • fdiff: Simulated diffusion flux
  • fentr_epi: Simulated entrainment flux (epilimnion)
  • fentr_hyp: Simulated entrainment flux (hypolimnion)
  • eutro: Derived classification for eutrophic state
  • oligo: Derived classification for oligotrophic state
  • dys: Derived classification for dystrophic state
  • water: Derived classification proportion for water landuse
  • developed: Derived classification proportion for developed landuse
  • barren: Derived classification proportion for barren landuse
  • forest: Derived classification proportion for forest landuse
  • shrubland: Derived classification proportion for shrubland landuse
  • herbaceous: Derived classification proportion for herbaceous landuse
  • cultivated: Derived classification proportion for cultivated landuse
  • wetlands: Derived classification proportion for wetlands landuse
  • depth: Derived maximum lake depth
  • area: Derived maximum lake surface area
  • elev: Derived lake elevation
  • Shore_len: Shore length
  • Vol_total: Lake volume
  • Vol_res: Lake volume residual
  • Vol_src: Lake volume supplement
  • Depth_avg: Average lake depth
  • Dis_avg: Average inflow discharge
  • Res_time: Lake residence time
  • Elevation: Alternative lake elevation
  • Slope_100: Lake slope information
  • Wshd_area: Watershed area
  • ShortWave: Daily shortwave radiation
  • LongWave: Daily longwave radiation
  • RelHum: Daily relative humidity
  • Rain: Daily precipitation (rain)
  • Snow: Daily precipitation (snow)
  • ice: Ice cover indicator
  • sim_epi: Simulated epilimnion DO concentration
  • sim_hyp: Simulated hypolimnion DO concentration

Target variables:

  • obs_epi: Observed epilimnion DO concentration
  • obs_hyp: Observed hypolimnion DO concentration

Model-specific usage:

  • Encoder (BiLSTM): Uses all 47 features including sim_epi and sim_hyp
  • Yearly model (LSTM): Uses 45 features (excludes sim_epi and sim_hyp)
  • Monthly decoder: Uses 26 features (a subset of the core features for monthly predictions); see the column-selection sketch below
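
A minimal sketch of how the feature subsets could be derived from a combined table, assuming a hypothetical CSV with the columns listed above; the exact 26-column monthly subset is defined in the training code and is not reproduced here:

import pandas as pd

df = pd.read_csv("lake_data.csv", parse_dates=["datetime"])  # hypothetical file name

targets = df[["obs_epi", "obs_hyp"]]                                      # observed DO labels
encoder_features = df.drop(columns=["datetime", "obs_epi", "obs_hyp"])    # 47 features
yearly_features = encoder_features.drop(columns=["sim_epi", "sim_hyp"])   # 45 features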

Contact

For questions, issues, or collaborations related to this project, please contact:
