Fluxnet model training

To train Fluxnet models efficiently, we created a workflow runner.

After specifying a few directories, the datasets to include, and the labels from those datasets to use, you can run the workflow.

We'll start by importing the necessary modules and setting up Dask:

In [1]:

from pathlib import Path

from dask.distributed import Client

from excited_workflow.train_fluxnet_models import FluxnetExperiment
from excited_workflow.train_fluxnet_models import calculate_era5_derived_vars
from excited_workflow.train_fluxnet_models import collect_training_data
from excited_workflow.train_fluxnet_models import run_workflow

client = Client(n_workers=2, threads_per_worker=2)

Next we have to define some directories:

Where is the pre-processed Fluxnet data stored (ameriflux_file)?
Where should the pre-processed ERA5 data be stored (preprocessed_dir)?
Where should the trained models be written (output_directory)?

You also have to list which additional (monthly) datasets should be included:

In [2]:

ameriflux_file = Path("/data/volume_2/NEE_ameriflux_transcom2.nc")
preprocessed_dir = Path("/data/volume_2/preprocessed_site_data")
output_directory = Path("/data/volume_2/trained_models")

additional_datasets = [
    "biomass",
    "spei",
    "modis",
]
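Before starting a long run, it can help to confirm that these paths are usable. The following is a minimal sketch using only the standard library. `check_workflow_paths` is a hypothetical helper, not part of excited_workflow, and it assumes that the input file must already exist while the directories may be created on demand.

```python
from __future__ import annotations

from pathlib import Path


def check_workflow_paths(fluxnet_file: Path, *directories: Path) -> list[str]:
    """Return a list of problems found; an empty list means all paths are usable.

    Hypothetical helper (not part of excited_workflow). Assumes the input
    file must already exist and that directories may be created on demand.
    """
    problems = []
    if not Path(fluxnet_file).is_file():
        problems.append(f"input file not found: {fluxnet_file}")
    for directory in directories:
        directory = Path(directory)
        try:
            # Creates missing parents; fails if the path exists as a regular file.
            directory.mkdir(parents=True, exist_ok=True)
        except OSError as err:
            problems.append(f"cannot create directory {directory}: {err}")
    return problems
```

With the variables defined above, this could be called as check_workflow_paths(ameriflux_file, preprocessed_dir, output_directory), printing any returned messages before continuing.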

If you want to know which variables will be available when you run this workflow, you can load the dataset that the workflow uses. Note that loading all this data takes some time, especially if the ERA5 data has not been pre-processed yet.

collect_training_data can be given a function that derives extra variables from the collected xarray Dataset. Here we use the calculate_era5_derived_vars function from excited_workflow.train_fluxnet_models.

In [3]:

ds = collect_training_data(
    ameriflux_file, preprocessed_dir, additional_datasets,
    variable_derivation=calculate_era5_derived_vars,
)
ds

Valid file fluxnet-sites_era5_10m_v_component_of_wind_2004.nc already exists, skipping.

Out[3]:

<xarray.Dataset>
Dimensions:                         (time: 271755, site: 61)
Coordinates:
  * time                            (time) datetime64[ns] 1991-01-01T06:00:00...
  * site                            (site) object 'US-Rws' 'US-ARM' ... 'US-KLS'
Data variables: (12/29)
    GPP_NT_VUT_REF                  (site, time) float64 nan nan nan ... nan nan
    GPP_DT_VUT_REF                  (site, time) float64 nan nan nan ... nan nan
    NEE_VUT_REF                     (site, time) float64 nan nan nan ... nan nan
    latitude                        (site) float64 43.17 36.61 ... 45.56 38.77
    longitude                       (site) float64 -116.7 -97.49 ... -97.57
    resp                            (site, time) float64 nan nan nan ... nan nan
    ...                              ...
    day_of_year                     (time) int64 1 1 1 1 1 1 1 ... 1 1 1 1 1 1 1
    hour                            (time) int64 6 7 8 9 10 11 ... 3 4 5 6 7 8
    t2m_1w_rolling                  (site, time) float32 dask.array<chunksize=(61, 43902), meta=np.ndarray>
    mean_air_temperature            (site) float32 dask.array<chunksize=(61,), meta=np.ndarray>
    mean_dewpoint_depression        (site) float32 dask.array<chunksize=(61,), meta=np.ndarray>
    dewpoint_depression_1w_rolling  (site, time) float32 dask.array<chunksize=(61, 43902), meta=np.ndarray>
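A custom variable derivation function could look like the sketch below. It assumes that such a function receives the collected xarray Dataset and returns a Dataset with the extra variables added; the exact signature expected by collect_training_data and the formulas inside calculate_era5_derived_vars are not shown on this page. The name add_dewpoint_depression, the site names and the values are made up for illustration.

```python
import numpy as np
import xarray as xr


def add_dewpoint_depression(ds: xr.Dataset) -> xr.Dataset:
    """Hypothetical derivation: air temperature minus dewpoint temperature."""
    depression = ds["t2m"] - ds["d2m"]
    # assign() returns a new Dataset and leaves the input untouched.
    return ds.assign(
        dewpoint_depression=depression,
        # Per-site mean over time, giving a variable with only the site dimension.
        mean_dewpoint_depression=depression.mean(dim="time"),
    )


# Toy dataset with two made-up sites and two time steps.
toy = xr.Dataset(
    {
        "t2m": (("site", "time"), np.array([[280.0, 282.0], [290.0, 294.0]])),
        "d2m": (("site", "time"), np.array([[278.0, 279.0], [285.0, 287.0]])),
    },
    coords={"site": ["site-a", "site-b"], "time": [0, 1]},
)
derived = add_dewpoint_depression(toy)
```

Returning a new Dataset rather than modifying the input keeps the function free of side effects, which makes it easier to test on small toy data like this.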

To get the pandas DataFrame that is used for training the model, run: ds.to_dataframe().dropna().reset_index()
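To see what this chain of calls does, here is a small sketch on a toy dataset. The site names and values are made up; only the to_dataframe, dropna and reset_index calls come from the text above.

```python
import numpy as np
import pandas as pd
import xarray as xr

times = pd.to_datetime(["2004-01-01 06:00", "2004-01-01 07:00", "2004-01-01 08:00"])

ds_toy = xr.Dataset(
    {
        "t2m": (("site", "time"), np.array([[280.0, 281.0, 282.0], [270.0, 271.0, 272.0]])),
        # The target has gaps, as flux measurements usually do.
        "resp": (("site", "time"), np.array([[1.0, np.nan, 1.2], [np.nan, np.nan, 0.8]])),
        # Site-only variables are broadcast along time in the DataFrame.
        "latitude": (("site",), np.array([43.17, 36.61])),
    },
    coords={"site": ["US-AAA", "US-BBB"], "time": times},
)

# One row per (site, time) pair; rows with any NaN are dropped;
# reset_index turns the site/time index levels into ordinary columns.
df_toy = ds_toy.to_dataframe().dropna().reset_index()
```

Of the six (site, time) combinations in the toy data, only the three with a valid resp value remain, which is why the training table can be much shorter than the time axis of the Dataset suggests.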

Defining the experiments

Now you can define your model training parameters. You will need to define:

The group_key: this is the name of the variable used for splitting up data in cross-validation (i.e. the site names).
The predictor variables (X_keys).
The target variable (y_key).
The cross-validation method (see the scikit-learn documentation).
The name of the ML model you want to use. For the available models, see the PyCaret documentation.

In [4]:

# Define keys for models
group_key = "site"  # for fold groups
X_keys_resp = [
    "d2m", "t2m", "ssr", # era5
    "biomass", "spei", "NDVI", "NIRv", # other datasets
    "day_of_year", "t2m_1w_rolling", "mean_air_temperature",
    "mean_dewpoint_depression", "dewpoint_depression_1w_rolling"
]
y_key_resp = "resp"

X_keys_gpp = [
    "d2m", "mslhf", "msshf", "ssr", "ssr_6hr", "str", "t2m", # era5
    "biomass", "spei", "NDVI", "NIRv", # other datasets
]
y_key_gpp = "GPP_NT_VUT_REF"

from sklearn.model_selection import GroupShuffleSplit
cv_method = GroupShuffleSplit(n_splits=10, test_size=0.4)
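The reason for grouping by site is that no site should contribute data to both the training and the test part of a fold. The sketch below checks this on toy data with scikit-learn's GroupShuffleSplit; the site labels are made up, and random_state is set here only to make the example reproducible.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Ten samples from five made-up sites (two samples per site).
sites = np.array(["A", "A", "B", "B", "C", "C", "D", "D", "E", "E"])
X = np.arange(len(sites)).reshape(-1, 1)

splitter = GroupShuffleSplit(n_splits=3, test_size=0.4, random_state=0)

overlaps = []
for train_idx, test_idx in splitter.split(X, groups=sites):
    # Sites that appear on both sides of the split (should be none).
    shared = set(sites[train_idx]) & set(sites[test_idx])
    overlaps.append(shared)
```

Every entry of overlaps is an empty set: whole sites are held out, so the validation scores reflect how well a model generalises to sites it has not seen.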

All this information has to be provided to the FluxnetExperiment 'dataclass'.

We will train three models here, two for the respiration (using different ML models) and one for GPP.

The name of the experiment is used to create the model's output directory; you are free to choose it.

In [5]:

models = [
    FluxnetExperiment(
        name="respiration",
        X_keys=X_keys_resp,
        y_key=y_key_resp,
        ml_model_name="ridge",
        cv_method=cv_method,
        cv_group_key=group_key,
        output_dir=output_directory
    ),
    FluxnetExperiment(
        name="respiration",
        X_keys=X_keys_resp,
        y_key=y_key_resp,
        ml_model_name="lightgbm",
        cv_method=cv_method,
        cv_group_key=group_key,
        output_dir=output_directory
    ),
    FluxnetExperiment(
        name="gpp",
        X_keys=X_keys_gpp,
        y_key=y_key_gpp,
        ml_model_name="lightgbm",
        cv_method=cv_method,
        cv_group_key=group_key,
        output_dir=output_directory
    ),
]
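Because the three experiments share most of their settings, the list can also be built from a small table of specifications. The sketch below shows the pattern with a stand-in dataclass, ExperimentSketch, which mirrors only the keyword arguments used above. It is not the real FluxnetExperiment class, which may have other fields, defaults or validation.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Any


@dataclass
class ExperimentSketch:
    """Stand-in with only the keyword arguments used above (not the real class)."""

    name: str
    X_keys: list
    y_key: str
    ml_model_name: str
    cv_method: Any
    cv_group_key: str
    output_dir: Path


# Settings shared by all experiments. cv_method is a placeholder here;
# in the notebook it would be the GroupShuffleSplit instance.
shared = {
    "cv_method": None,
    "cv_group_key": "site",
    "output_dir": Path("trained_models"),
}

# (name, predictors, target, model) for each experiment; shortened predictor lists.
specs = [
    ("respiration", ["t2m", "d2m"], "resp", "ridge"),
    ("respiration", ["t2m", "d2m"], "resp", "lightgbm"),
    ("gpp", ["t2m", "ssr"], "GPP_NT_VUT_REF", "lightgbm"),
]

experiments = [
    ExperimentSketch(name=name, X_keys=x_keys, y_key=y_key, ml_model_name=model, **shared)
    for name, x_keys, y_key, model in specs
]
```

Assuming FluxnetExperiment accepts the same keyword arguments as in the cell above, the same comprehension should work with it in place of ExperimentSketch.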

Executing the workflow

Now you can run the workflow. For a deeper look into the specific steps of the workflow, see the file src/excited_workflow/train_fluxnet_models.py.

The workflow will create new folders inside the specified output directory (output_directory defined in cell 2).

Each folder contains:

a model description file (in markdown)
validation plots
a JSON file with the used variables (and attributes)
the model ONNX file

This should make it easier to assess the model training and to use the model to produce a dataset.

In [6]:

run_workflow(
    fluxnet_file=ameriflux_file,
    preprocessing_dir=preprocessed_dir,
    additional_datasets=additional_datasets,
    models=models,
    variable_derivation=calculate_era5_derived_vars,
)

Valid file fluxnet-sites_era5_10m_v_component_of_wind_2004.nc already exists, skipping.

       Model             MAE      MSE      RMSE     R2       RMSLE    MAPE         TT (Sec)
ridge  Ridge Regression  1.53713  6.32171  2.47449  0.36426  0.50615  19573.32939  0.73600

          Model                            MAE      MSE      RMSE     R2       RMSLE    MAPE         TT (Sec)
lightgbm  Light Gradient Boosting Machine  1.39383  5.39083  2.28724  0.34206  0.46973  16018.56091  3.79500

The maximum opset needed by this model is only 8.

          Model                            MAE      MSE       RMSE     R2       RMSLE    MAPE     TT (Sec)
lightgbm  Light Gradient Boosting Machine  2.40524  23.23889  4.81172  0.56179  0.62698  7.12475  3.14500

The maximum opset needed by this model is only 8.
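Once the workflow has finished, a quick way to check that each experiment folder holds the expected kinds of files is to list the file extensions per folder. This is a minimal sketch using only the standard library. summarize_outputs is a hypothetical helper, and the actual folder and file names produced by the workflow are not shown on this page.

```python
from pathlib import Path


def summarize_outputs(output_dir: Path) -> dict:
    """Map each sub-folder name to the sorted file suffixes found inside it.

    Hypothetical helper (not part of excited_workflow).
    """
    summary = {}
    for folder in sorted(p for p in Path(output_dir).iterdir() if p.is_dir()):
        # rglob also picks up files in nested folders, e.g. a plots directory.
        suffixes = {f.suffix for f in folder.rglob("*") if f.is_file()}
        summary[folder.name] = sorted(suffixes)
    return summary
```

Calling summarize_outputs(output_directory) returns a dictionary that can be compared against the list of expected contents above, for example to spot a folder without an ONNX file.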
