LakeFM: Toward a Foundation Model for Aquatic Ecosystems Using Irregular Multivariate Multi-depth Time Series Data
Understanding and forecasting lake dynamics is critical for monitoring water quality and ecosystem health across lakes and reservoirs. While machine learning methods have recently been applied to ecological time-series data, existing works assume regular sampling in time and depth, and struggle to generalize across lakes with heterogeneous variables, depths, and observation patterns. To address these limitations, we introduce LakeFM, a foundation model for aquatic systems, pre-trained on large-scale ecological datasets comprising both simulated and observed lakes. Through extensive empirical evaluation, we show that LakeFM learns meaningful representations spanning broader lake-level characteristics, and achieves competitive and often superior forecasting performance compared to existing time-series foundation and non-foundation models, while producing physically plausible predictions consistent with real-world lake dynamics.
project_root/
├── src/
│   ├── cli/
│   │   ├── conf/             # model and data config yamls
│   │   └── main.py           # driver script
│   ├── data/
│   │   ├── builder/          # dataset-specific builder classes
│   │   ├── dataset.py        # dataset class for training
│   │   ├── eval_dataset.py   # dataset class for evaluation
│   │   └── loader.py         # loader
│   ├── lakefm/
│   │   ├── model.py
│   │   ├── trainer.py
│   │   ├── evaluator.py
│   │   └── utils/
│   └── scripts/
│
└── resources/
    ├── data/                 # datasets
    └── dev/
        ├── norm_stats        # normalization stats for lakefm
        └── pretrain_ckpts
conda create -n lakefm python=3.11
conda activate lakefm
Make sure you have the requirements.txt file available in the project directory.
Then install all required packages using pip:
pip install -r requirements.txt
You can check that all necessary packages are installed:
pip list
You must install PyTorch separately according to your CUDA version. Refer to the official PyTorch guide:
👉 https://pytorch.org/get-started/previous-versions/
pip install torch==2.6.0+cu118 torchvision==0.21.0+cu118 torchaudio==2.6.0 --extra-index-url https://download.pytorch.org/whl/cu118
To use any of the datasets or checkpoints, add them to the corresponding directory under resources (listed for each item below).
- FCR Simulation dataset: Download (/resources/lakefm/data/FCR_data)
- WQHanson Simulation dataset: Download (/resources/lakefm/data/WQHanson_Simulation)
- LakeBeD dataset: Download (/resources/lakefm/data/LakeBeD-US)
LakeFM 5M Checkpoint Download (resources/lakefm/dev/pretrain_ckpts)
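Before running anything, it can help to confirm that the downloads landed where the paths above say they should. The following is a minimal sketch and not part of the repository: the function name `missing_resources` is hypothetical, and it assumes the listed paths are relative to the project root.

```python
from pathlib import Path

# Directories named in the download list above, relative to the project root
# (assumption: the leading "/" in those paths means "project root").
EXPECTED_DIRS = [
    "resources/lakefm/data/FCR_data",
    "resources/lakefm/data/WQHanson_Simulation",
    "resources/lakefm/data/LakeBeD-US",
    "resources/lakefm/dev/pretrain_ckpts",
]


def missing_resources(project_root="."):
    """Return the expected directories that are absent or empty."""
    root = Path(project_root)
    missing = []
    for rel in EXPECTED_DIRS:
        path = root / rel
        # A directory that exists but holds no files counts as missing too.
        if not path.is_dir() or not any(path.iterdir()):
            missing.append(rel)
    return missing


if __name__ == "__main__":
    for rel in missing_resources():
        print(f"missing or empty: {rel}")
```

Run from the project root, the script prints nothing when all four directories are populated. Only the datasets and checkpoints you actually intend to use need to be present.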
Navigate to the src/ directory:
cd src
Run evaluation for a lake:
bash scripts/driver.sh <run_name> <lake_name>
<run_name> is the name of the output folder where the evaluation results are stored
<lake_name> is the lake to be evaluated (e.g. AL, BARC)
Note: update the train-val-test split in LakeBeD.yaml (src/cli/conf/pretrain/data/LakeBeD.yaml) depending on whether you are evaluating an in-distribution (ID) or out-of-distribution (OOD) lake
Example:
bash scripts/driver.sh eval_BARC BARC
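To evaluate several lakes in one go, the same command can be looped. This is a minimal sketch, assuming it is run from src/ and that naming each run eval_<lake_name> (as in the example above) is acceptable; the helpers `eval_command` and `evaluate_lakes` and the `dry_run` switch are illustrative and not part of the repository.

```python
import shlex
import subprocess


def eval_command(lake_name):
    """Build the documented driver call, using eval_<lake_name> as run name."""
    return ["bash", "scripts/driver.sh", f"eval_{lake_name}", lake_name]


def evaluate_lakes(lake_names, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    commands = [eval_command(name) for name in lake_names]
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # stops at the first failing lake
    return commands


if __name__ == "__main__":
    evaluate_lakes(["AL", "BARC"])  # lake names taken from the examples above
```

With the default `dry_run=True` the commands are only printed, which makes it easy to review them before launching a long evaluation.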
- To run denormalized evaluation:
  bash scripts/driver.sh <run_name> <lake_name> --denorm
- To generate plots:
  bash scripts/driver.sh <run_name> <lake_name> <depth_m> --plot
  or
  bash scripts/driver.sh <run_name> <lake_name> --plot --depth <depth_m>
  where depth_m is the depth at which to plot.
- To plot a subset of variables:
  bash scripts/driver.sh <run_name> <lake_name> <depth_m> --plot --vars '["WaterTemp_C","Water_DO_mg_per_L"]'
  This plots Water Temp and Water DO.
- To perform variable masking:
  bash scripts/driver.sh <run_name> <lake_name> --mask-vars '["WaterTemp_C","Chla_ugL"]'
  This masks Water Temp and Chla. To generate plots, pass --plot and --vars with the list of variables to plot.
- To perform depth masking:
  bash scripts/driver.sh <run_name> <lake_name> --mask-depths '[1.0,2.0,5.0]'
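The list-valued flags above take JSON arrays as single shell arguments, which is easy to misquote by hand. The sketch below assembles such a command line in Python using only the flags documented above. The function `driver_command` and its parameter names are hypothetical, and whether the driver accepts every combination of these flags is an assumption.

```python
import json
import shlex


def _json_list(values):
    """Encode a list compactly, e.g. ["a","b"], as in the commands above."""
    return json.dumps(list(values), separators=(",", ":"))


def driver_command(run_name, lake_name, *, denorm=False, plot=False,
                   depth=None, plot_vars=None, mask_vars=None,
                   mask_depths=None):
    """Return a shell-quoted driver.sh command line as a string."""
    args = ["bash", "scripts/driver.sh", run_name, lake_name]
    if denorm:
        args.append("--denorm")
    if plot:
        args.append("--plot")
    if depth is not None:
        args += ["--depth", str(depth)]
    if plot_vars:
        args += ["--vars", _json_list(plot_vars)]
    if mask_vars:
        args += ["--mask-vars", _json_list(mask_vars)]
    if mask_depths:
        args += ["--mask-depths", _json_list(mask_depths)]
    # shlex.join single-quotes the JSON arrays so the shell passes them intact.
    return shlex.join(args)


if __name__ == "__main__":
    print(driver_command("eval_BARC", "BARC",
                         mask_vars=["WaterTemp_C", "Chla_ugL"]))
```

The printed string can be pasted into a shell from the src/ directory. Building the arguments as a list and quoting once at the end avoids nesting quotes manually around the JSON arrays.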
