Learning to Retrieve for Environmental Knowledge Discovery: An Augmentation-Adaptive Self-Supervised Learning Framework
- Python 3.8 or higher
- CUDA-compatible GPU (recommended for training)
torch>=2.0.0
torchvision>=0.15.0
torchaudio>=2.0.0
numpy>=1.24.0
pandas>=2.0.0
scikit-learn>=1.3.0
matplotlib>=3.8.0
scipy>=1.11.0
To train with default settings:

    python train.py

To specify the task and output directory:

    python train.py --task DO --save_path results_experiment1

Command-line arguments:
- --task: Task type - 'DO' for dissolved oxygen or 'Temp' for temperature (default: 'DO')
- --seed: Random seed for reproducibility (default: 21)
- --save_path: Directory to save all training outputs (default: 'top_n_100')
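The flags above suggest a small argparse-based CLI. A minimal sketch of how such a parser could be reconstructed (the function name `build_parser` is illustrative, not the repository's actual code):

```python
import argparse

def build_parser():
    # Hypothetical reconstruction of train.py's CLI from the documented flags.
    parser = argparse.ArgumentParser(
        description="Train the retrieval-augmented DO/Temp model")
    parser.add_argument("--task", choices=["DO", "Temp"], default="DO",
                        help="'DO' for dissolved oxygen, 'Temp' for temperature")
    parser.add_argument("--seed", type=int, default=21,
                        help="Random seed for reproducibility")
    parser.add_argument("--save_path", default="top_n_100",
                        help="Directory to save all training outputs")
    return parser

# Parsing an explicit argument list (instead of sys.argv) keeps the example testable.
args = build_parser().parse_args(["--task", "Temp", "--save_path", "results_experiment1"])
print(args.task, args.seed, args.save_path)  # Temp 21 results_experiment1
```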
The model follows a 5-phase training pipeline:
Phase 1 - Contrastive encoder pretraining:
- Model: BiLSTM encoder
- Objective: Learn lake embeddings using contrastive learning
- Loss: Contrastive loss with positive, semi-positive, and negative pairs
- Output: Trained encoder + precomputed embeddings for retrieval
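The three-way pair structure above can be sketched as a margin-based contrastive loss, where positives are pulled close, semi-positives are kept within a small margin, and negatives are pushed apart (an illustrative formulation with made-up margin values; the repository's exact loss may differ):

```python
import torch
import torch.nn.functional as F

def three_way_contrastive_loss(anchor, positive, semi_positive, negative,
                               margin_semi=0.5, margin_neg=1.0):
    """Sketch of a contrastive loss over positive, semi-positive, and
    negative pairs, using cosine distance between embeddings."""
    d_pos = 1 - F.cosine_similarity(anchor, positive)
    d_semi = 1 - F.cosine_similarity(anchor, semi_positive)
    d_neg = 1 - F.cosine_similarity(anchor, negative)
    loss = (d_pos                                  # pull positives close
            + F.relu(d_semi - margin_semi)         # penalize semi-positives only past a margin
            + F.relu(margin_neg - d_neg))          # push negatives beyond a larger margin
    return loss.mean()

emb = torch.randn(8, 64)
loss = three_way_contrastive_loss(emb, emb + 0.01 * torch.randn(8, 64),
                                  torch.randn(8, 64), torch.randn(8, 64))
```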
Phase 2 - Yearly model training:
- Model: LSTM
- Objective: Learn annual dissolved oxygen patterns
- Training: Two-step process:
- Pretraining with simulation data
- Fine-tuning with real observations
- Output: Trained yearly prediction model
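The two-step pretrain/fine-tune process can be sketched as the same fitting loop run first on simulated targets and then on observations. The class and function names below are illustrative stand-ins, not the repository's modules:

```python
import torch
import torch.nn as nn

class YearlyLSTM(nn.Module):
    """Minimal stand-in for the yearly prediction model."""
    def __init__(self, n_features=45, hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 2)  # epilimnion and hypolimnion DO

    def forward(self, x):
        out, _ = self.lstm(x)
        return self.head(out)

def fit(model, x, y, epochs=3, lr=1e-3):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(x), y)
        loss.backward()
        opt.step()
    return loss.item()

model = YearlyLSTM()
x = torch.randn(4, 365, 45)                        # 4 lakes, 365 daily steps
fit(model, x, torch.randn(4, 365, 2))              # step 1: pretrain on simulated DO
final = fit(model, x, torch.randn(4, 365, 2))      # step 2: fine-tune on observations
```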
Phase 3 - Retrieval-augmented monthly training:
- Models: Encoder + Monthly decoders
- Objective: Jointly optimize encoder and monthly prediction decoders
- Features: Retrieval-augmented learning with top-N similar samples
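The retrieval step amounts to finding the top-N most similar precomputed lake embeddings. A minimal sketch using cosine similarity (the repository may use a different similarity measure or index):

```python
import numpy as np

def retrieve_top_n(query_emb, bank_embs, n=5):
    """Return indices of the n most similar embeddings in the bank,
    ranked by cosine similarity to the query."""
    q = query_emb / np.linalg.norm(query_emb)
    b = bank_embs / np.linalg.norm(bank_embs, axis=1, keepdims=True)
    sims = b @ q                       # cosine similarity to every bank entry
    return np.argsort(-sims)[:n]       # indices of the n highest similarities

rng = np.random.default_rng(0)
bank = rng.normal(size=(100, 64))      # precomputed embeddings from Phase 1
idx = retrieve_top_n(bank[7] + 0.01 * rng.normal(size=64), bank, n=5)
# a near-duplicate of embedding 7 should retrieve index 7 first
```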
Phase 4 - Discriminator training:
- Models: Two discriminators, one each for the epilimnion and hypolimnion
- Objective: Learn to discriminate between yearly vs monthly model predictions
- Training: Binary classification on synthetic vs retrieved predictions
- Output: Trained discriminators with optimal thresholds
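Training a binary discriminator and choosing an operating threshold can be sketched as follows. This uses a logistic-regression stand-in on synthetic data; the repository's discriminator architecture and threshold criterion (here, Youden's J) are assumptions for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve

rng = np.random.default_rng(21)
# Synthetic stand-ins for features derived from the two prediction sources.
X = np.vstack([rng.normal(0, 1, (200, 8)),     # class 0: yearly-model predictions
               rng.normal(1, 1, (200, 8))])    # class 1: monthly-model predictions
y = np.array([0] * 200 + [1] * 200)

clf = LogisticRegression(max_iter=1000).fit(X, y)
scores = clf.predict_proba(X)[:, 1]

# Pick the threshold maximizing Youden's J = TPR - FPR on the ROC curve.
fpr, tpr, thresholds = roc_curve(y, scores)
best_threshold = thresholds[np.argmax(tpr - fpr)]
```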
Phase 5 - Final prediction selection:
- Process: Use discriminators to select between yearly and monthly predictions
- Output: Final predictions and performance metrics
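The selection step reduces to a per-sample switch on the discriminator score. A minimal sketch (the gating direction and threshold are illustrative):

```python
import numpy as np

def select_predictions(disc_scores, yearly_pred, monthly_pred, threshold):
    """Take the monthly prediction where the discriminator score exceeds
    the threshold, otherwise fall back to the yearly prediction."""
    use_monthly = disc_scores > threshold
    return np.where(use_monthly, monthly_pred, yearly_pred)

scores = np.array([0.9, 0.2, 0.7, 0.4])
final = select_predictions(scores,
                           np.array([1., 2., 3., 4.]),     # yearly predictions
                           np.array([10., 20., 30., 40.]), # monthly predictions
                           threshold=0.5)
# → [10.  2. 30.  4.]
```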
The dataset contains 50 columns: 47 model features, plus datetime and the two observation columns (obs_epi, obs_hyp), which serve as prediction targets rather than inputs:
- datetime: Date and time information
- sat_hypo: Simulated hypolimnion DO saturation concentration
- thermocline_depth: Simulated thermocline depth
- temperature_epi: Simulated epilimnion water temperature
- temperature_hypo: Simulated hypolimnion water temperature
- volume_epi: Simulated epilimnion volume
- volume_hypo: Simulated hypolimnion volume
- wind: Derived wind speed
- airtemp: Derived air temperature
- fnep: Simulated net ecosystem production flux
- fmineral: Simulated mineralisation flux
- fsed: Simulated net sedimentation flux
- fatm: Simulated atmospheric exchange flux
- fdiff: Simulated diffusion flux
- fentr_epi: Simulated entrainment flux (epilimnion)
- fentr_hyp: Simulated entrainment flux (hypolimnion)
- eutro: Derived classification for eutrophic state
- oligo: Derived classification for oligotrophic state
- dys: Derived classification for dystrophic state
- water: Derived classification proportion for water landuse
- developed: Derived classification proportion for developed landuse
- barren: Derived classification proportion for barren landuse
- forest: Derived classification proportion for forest landuse
- shrubland: Derived classification proportion for shrubland landuse
- herbaceous: Derived classification proportion for herbaceous landuse
- cultivated: Derived classification proportion for cultivated landuse
- wetlands: Derived classification proportion for wetlands landuse
- depth: Derived maximum lake depth
- area: Derived maximum lake surface area
- elev: Derived lake elevation
- Shore_len: Shore length
- Vol_total: Lake volume
- Vol_res: Lake volume residual
- Vol_src: Lake volume supplement
- Depth_avg: Average lake depth
- Dis_avg: Average inflow discharge
- Res_time: Lake residence time
- Elevation: Alternative lake elevation
- Slope_100: Lake slope information
- Wshd_area: Watershed area
- ShortWave: Daily shortwave radiation
- LongWave: Daily longwave radiation
- RelHum: Daily relative humidity
- Rain: Daily precipitation (rain)
- Snow: Daily precipitation (snow)
- ice: Ice cover indicator
- sim_epi: Simulated epilimnion DO concentration
- sim_hyp: Simulated hypolimnion DO concentration
- obs_epi: Observed epilimnion DO concentration
- obs_hyp: Observed hypolimnion DO concentration
- Encoder (BiLSTM): Uses all 47 features including sim_epi and sim_hyp
- Yearly model (LSTM): Uses 45 features (excludes sim_epi and sim_hyp)
- Monthly decoder: Uses 26 features (subset of the core features for monthly predictions)
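The per-model feature subsets above amount to column selection on the feature table. A minimal sketch with a few illustrative columns (the actual column lists, in particular the monthly decoder's 26-feature subset, are defined in the repository code):

```python
import pandas as pd

# A tiny stand-in for the full 47-column feature table.
all_features = ["sat_hypo", "thermocline_depth", "wind", "sim_epi", "sim_hyp"]
df = pd.DataFrame([[0.1, 5.0, 3.2, 8.0, 6.5]], columns=all_features)

# Encoder: all features, including the simulated DO columns.
encoder_input = df[all_features]

# Yearly model: everything except the simulated DO columns.
yearly_cols = [c for c in all_features if c not in ("sim_epi", "sim_hyp")]
yearly_input = df[yearly_cols]
```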
For questions, issues, or collaborations related to this project, please contact:
- Shiyuan Luo - shl298@pitt.edu
- Runlong Yu - ryu5@ua.edu
