Fluxnet model training¶
To train Fluxnet models efficiently, we built a workflow runner.
After specifying a few directories, the datasets you want to include, and the labels from these datasets you want to use, you can run the workflow.
We'll start by importing the necessary modules and setting up Dask:
from pathlib import Path
from dask.distributed import Client
from excited_workflow.train_fluxnet_models import FluxnetExperiment
from excited_workflow.train_fluxnet_models import calculate_era5_derived_vars
from excited_workflow.train_fluxnet_models import collect_training_data
from excited_workflow.train_fluxnet_models import run_workflow
client = Client(n_workers=2, threads_per_worker=2)
Next we have to define some directories:
- where is the pre-processed fluxnet data stored?
- where should the pre-processed ERA5 data be stored?
- where do you want the trained models to be written to?
You also have to define which additional (monthly) datasets are required:
ameriflux_file = Path("/data/volume_2/NEE_ameriflux_transcom2.nc")
preprocessed_dir = Path("/data/volume_2/preprocessed_site_data")
output_directory = Path("/data/volume_2/trained_models")
additional_datasets = [
"biomass",
"spei",
"modis",
]
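Because a wrong path typically only surfaces after the slow data loading has started, it can help to validate the paths first. The sketch below is a minimal example; `check_workflow_paths` is a hypothetical helper, not part of `excited_workflow`, and whether the workflow creates missing directories itself is not stated here.

```python
from pathlib import Path


def check_workflow_paths(input_file: Path, *directories: Path) -> None:
    """Fail early if the input file is missing; create directories if needed.

    Hypothetical helper for illustration, not part of excited_workflow.
    """
    if not input_file.is_file():
        raise FileNotFoundError(f"Input file not found: {input_file}")
    for directory in directories:
        # parents=True also creates intermediate folders; exist_ok=True
        # makes the call a no-op when the directory is already there.
        directory.mkdir(parents=True, exist_ok=True)
```

With the variables defined above, this would be called as `check_workflow_paths(ameriflux_file, preprocessed_dir, output_directory)`.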
If you want to know which variables will be available when you run this workflow, you can load the dataset that the workflow uses. Note that loading all this data takes some time, especially if the ERA5 data has not been pre-processed yet.
The collect_training_data function accepts a function that derives additional variables from the collected xarray Dataset. Here we use the calculate_era5_derived_vars function from excited_workflow.train_fluxnet_models.
ds = collect_training_data(
ameriflux_file, preprocessed_dir, additional_datasets,
variable_derivation=calculate_era5_derived_vars,
)
ds
Valid file fluxnet-sites_era5_10m_v_component_of_wind_2004.nc already exists, skipping.
<xarray.Dataset>
Dimensions: (time: 271755, site: 61)
Coordinates:
* time (time) datetime64[ns] 1991-01-01T06:00:00...
* site (site) object 'US-Rws' 'US-ARM' ... 'US-KLS'
Data variables: (12/29)
GPP_NT_VUT_REF (site, time) float64 nan nan nan ... nan nan
GPP_DT_VUT_REF (site, time) float64 nan nan nan ... nan nan
NEE_VUT_REF (site, time) float64 nan nan nan ... nan nan
latitude (site) float64 43.17 36.61 ... 45.56 38.77
longitude (site) float64 -116.7 -97.49 ... -97.57
resp (site, time) float64 nan nan nan ... nan nan
... ...
day_of_year (time) int64 1 1 1 1 1 1 1 ... 1 1 1 1 1 1 1
hour (time) int64 6 7 8 9 10 11 ... 3 4 5 6 7 8
t2m_1w_rolling (site, time) float32 dask.array<chunksize=(61, 43902), meta=np.ndarray>
mean_air_temperature (site) float32 dask.array<chunksize=(61,), meta=np.ndarray>
mean_dewpoint_depression (site) float32 dask.array<chunksize=(61,), meta=np.ndarray>
dewpoint_depression_1w_rolling (site, time) float32 dask.array<chunksize=(61, 43902), meta=np.ndarray>
Additionally, if you want to get the pandas DataFrame that is used for training the model, run the following command:
ds.to_dataframe().dropna().reset_index()
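To illustrate what this conversion does, here is a small sketch on a toy dataset. The variable names mirror the ones above, but the values are made up. Per-site variables such as latitude are broadcast along time, and every row that contains a NaN in any column is dropped.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Toy stand-in for the training dataset: 2 sites x 3 time steps.
toy = xr.Dataset(
    data_vars={
        # One missing respiration value for the first site.
        "resp": (("site", "time"), np.array([[1.0, np.nan, 3.0], [4.0, 5.0, 6.0]])),
        "t2m": (("site", "time"), np.full((2, 3), 280.0)),
        # Defined per site only; to_dataframe repeats it for every time step.
        "latitude": (("site",), np.array([43.17, 36.61])),
    },
    coords={
        "site": ["US-Rws", "US-ARM"],
        "time": pd.to_datetime(
            ["1991-01-01T06:00", "1991-01-01T07:00", "1991-01-01T08:00"]
        ),
    },
)

# Flatten to one row per (site, time), drop incomplete rows, and turn the
# site/time index levels into ordinary columns.
df = toy.to_dataframe().dropna().reset_index()
print(df)
```

Of the six (site, time) combinations, only the five complete rows remain. On the real dataset this means a site or time step is silently excluded whenever any single variable is missing there.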
Defining the experiments¶
Now you can define your model training parameters. You will need to define:
- The group_key: this is the name of the variable used for splitting up data in cross-validation (i.e. the site names).
- The predictor variables (X_keys).
- The target variable (y_key).
- The cross-validation method (see the scikit-learn documentation).
- The name of the ML model you want to use. For the available ones, see the PyCaret documentation.
# Define keys for models
group_key = "site" # for fold groups
X_keys_resp = [
"d2m", "t2m", "ssr", # era5
"biomass", "spei", "NDVI", "NIRv", # other datasets
"day_of_year", "t2m_1w_rolling", "mean_air_temperature",
"mean_dewpoint_depression", "dewpoint_depression_1w_rolling"
]
y_key_resp = "resp"
X_keys_gpp = [
"d2m", "mslhf", "msshf", "ssr", "ssr_6hr", "str", "t2m", # era5
"biomass", "spei", "NDVI", "NIRv", # other datasets
]
y_key_gpp = "GPP_NT_VUT_REF"
from sklearn.model_selection import GroupShuffleSplit
cv_method = GroupShuffleSplit(n_splits=10, test_size=0.4)
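The reason for a grouped splitter is that all samples from one site end up on the same side of each split, so the model is always validated on sites it has not seen. The sketch below demonstrates this on toy data; the site names and the `random_state` value are made up for the example.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Toy data: 5 sites with 4 samples each (site names are made up).
sites = np.repeat(["A", "B", "C", "D", "E"], 4)
X = np.arange(len(sites)).reshape(-1, 1)

splitter = GroupShuffleSplit(n_splits=3, test_size=0.4, random_state=0)
folds = list(splitter.split(X, groups=sites))
for train_idx, test_idx in folds:
    train_sites = set(sites[train_idx])
    test_sites = set(sites[test_idx])
    # A site is either in the training set or the test set, never both.
    assert train_sites.isdisjoint(test_sites)
    print(sorted(train_sites), sorted(test_sites))
```

Note that test_size refers to the fraction of groups (sites), not of samples. Without a random_state, as in the cv_method defined above, the splits differ from run to run.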
All this information has to be provided to the FluxnetExperiment 'dataclass'.
We will train three models here, two for the respiration (using different ML models) and one for GPP.
The name of the experiment is used to create the model's output directory; you are free to choose it.
models = [
FluxnetExperiment(
name="respiration",
X_keys=X_keys_resp,
y_key=y_key_resp,
ml_model_name="ridge",
cv_method=cv_method,
cv_group_key=group_key,
output_dir=output_directory
),
FluxnetExperiment(
name="respiration",
X_keys=X_keys_resp,
y_key=y_key_resp,
ml_model_name="lightgbm",
cv_method=cv_method,
cv_group_key=group_key,
output_dir=output_directory
),
FluxnetExperiment(
name="gpp",
X_keys=X_keys_gpp,
y_key=y_key_gpp,
ml_model_name="lightgbm",
cv_method=cv_method,
cv_group_key=group_key,
output_dir=output_directory
),
]
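Before launching the workflow, it can be worth checking that every requested key actually exists in the dataset, since a typo would otherwise only show up after the data has been loaded. The sketch below uses a hypothetical helper, `missing_keys`, which is not part of `excited_workflow`, and a hand-written list of available variables.

```python
def missing_keys(available, X_keys, y_key):
    """Return requested variable names absent from `available`, in order.

    Hypothetical helper for illustration, not part of excited_workflow.
    """
    available = set(available)
    return [key for key in [*X_keys, y_key] if key not in available]


# Toy example with a hand-written list of available variables.
available_vars = ["d2m", "t2m", "ssr", "biomass", "spei", "resp"]
print(missing_keys(available_vars, ["d2m", "t2m", "NDVI"], "resp"))
# prints ['NDVI']
```

With the dataset loaded earlier, the names could be taken from `ds.variables`, for example `missing_keys(ds.variables, X_keys_resp, y_key_resp)`; an empty list means all keys were found.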
Executing the workflow¶
Now you can run the workflow. For a deeper look into the specific steps of the workflow, see the file src/excited_workflow/train_fluxnet_models.py.
The workflow will create new folders inside the specified output directory (output_directory, defined above).
Each folder contains:
- a model description file (in markdown)
- validation plots
- a JSON file with the used variables (and attributes)
- the model ONNX file
This should make it easier to assess the model training and to use the model to produce a dataset.
run_workflow(
fluxnet_file=ameriflux_file,
preprocessing_dir=preprocessed_dir,
additional_datasets=additional_datasets,
models=models,
variable_derivation=calculate_era5_derived_vars,
)
Valid file fluxnet-sites_era5_10m_v_component_of_wind_2004.nc already exists, skipping.
| | Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE | TT (Sec) |
|---|---|---|---|---|---|---|---|---|
| ridge | Ridge Regression | 1.53713 | 6.32171 | 2.47449 | 0.36426 | 0.50615 | 19573.32939 | 0.73600 |
| | Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE | TT (Sec) |
|---|---|---|---|---|---|---|---|---|
| lightgbm | Light Gradient Boosting Machine | 1.39383 | 5.39083 | 2.28724 | 0.34206 | 0.46973 | 16018.56091 | 3.79500 |
The maximum opset needed by this model is only 8.
| | Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE | TT (Sec) |
|---|---|---|---|---|---|---|---|---|
| lightgbm | Light Gradient Boosting Machine | 2.40524 | 23.23889 | 4.81172 | 0.56179 | 0.62698 | 7.12475 | 3.14500 |
The maximum opset needed by this model is only 8.
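The MAPE values of the two respiration models (about 19573 and 16019) are orders of magnitude larger than the one for GPP. One plausible explanation, which is an assumption and not something the output above confirms, is that the respiration target contains values close to zero: MAPE divides each error by the true value, so a few such samples can dominate the mean. The sketch below shows the effect on made-up numbers, using the conventional definition of MAPE, which may differ in detail from the one PyCaret uses.

```python
import numpy as np


def mae(y_true, y_pred):
    """Mean absolute error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))


def mape(y_true, y_pred):
    """Mean absolute percentage error, as a fraction (conventional definition)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs((y_true - y_pred) / y_true)))


# Made-up values; the last target is close to zero.
y_true = [2.0, 4.0, 0.001]
y_pred = [2.5, 3.0, 0.5]

print("MAE :", mae(y_true, y_pred))
print("MAPE:", mape(y_true, y_pred))
print("MAPE without the near-zero sample:", mape(y_true[:2], y_pred[:2]))
```

The absolute error stays below 1, while the single near-zero target pushes MAPE above 100. For targets that cross or approach zero, MAE, RMSE and R2 are therefore the more informative columns.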