Documentation
This documentation page should give you an overview of how to get started with CODES.
The technical API documentation can be found here.
Setup
First, clone the GitHub repository with git clone ssh://git@github.com/robin-janssen/CODES-Benchmark.
Optionally, you can set up a virtual environment (recommended).
Then, install the required packages with pip install -r requirements.txt.
The installation is now complete. To run and evaluate the benchmark, you first need to set up a configuration YAML file. A default config file is provided, but it should be adapted to your setup. For more information, check the configuration page. There, we also offer an interactive Config-Generator tool with some explanations to help you set up your benchmark.
You can also add your own datasets and models to the benchmark to evaluate them against each other or some of our baseline models. For more information on how to do this, please refer to the documentation.
Run the benchmark
The first step in running the benchmark is to train all the different models specified in the configuration. As this step usually takes a lot longer than the actual benchmarking, it is executed as a separate step.
To start the training, run the run_training.py file. To pass in a config file with a filename different from the default config.yaml, use the --config argument when executing from the command line, like this:
/path/to/python3 run_training.py --config MyConfig.yaml
After the training is complete, the benchmark can be run. To start the benchmark, run the run_benchmark.py file. Remember to pass in the same config file as you used for the training.
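The two steps can also be driven from a small Python script. The following is only a sketch: the helper names build_commands and run_all are hypothetical, and it assumes that the script is started from the repository root and that both scripts accept the same --config argument, as described on this page.

```python
import subprocess
import sys
from typing import List


def build_commands(config: str = "config.yaml", python: str = sys.executable) -> List[List[str]]:
    """Return the training and benchmark commands, both using the same config file."""
    scripts = ("run_training.py", "run_benchmark.py")
    return [[python, script, "--config", config] for script in scripts]


def run_all(config: str = "config.yaml") -> None:
    """Run the training first, then the benchmark; abort if either step fails."""
    for command in build_commands(config):
        subprocess.run(command, check=True)


# Assumption: called from the root of the CODES-Benchmark repository, e.g.
# run_all("MyConfig.yaml")
```

Building both commands from one config argument makes it hard to accidentally benchmark with a different config than the one used for training.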
Configuring the benchmark
The training and evaluation of different models is mainly configured from a YAML config file in the base directory of the repository. In this file, all of the tweakable run parameters can be set. This includes:
- a name for a benchmark run (also used to create a path to store the results)
- the surrogates to train and compare
- the dataset to train and evaluate on
- training parameters like number of epochs, batch sizes, GPUs
- what evaluations to perform and their required parameters
If you don't feel like manually creating or editing your config, check our online config generator. You can configure everything and simply download the YAML config file.
The config file has the following structure (the order of parameters is not important as long as the nesting is correct):
Overall training parameters
- training_id: str
  The name of the benchmark run.
- surrogates: list[str]
  The list of surrogates to evaluate. See our surrogates for available options and how to add your own model. The name corresponds to the name of the surrogate's class.
- batch_size: int | list[int]
  Specifies the batch size for the surrogates during training. Can either be a single integer if all surrogates share a batch size, or a list of batch sizes as long as the list of surrogates.
- epochs: int | list[int]
  The number of epochs to train the surrogates for. Can be a single integer if all surrogates share the same number of epochs, or a list of epochs as long as the list of surrogates.
- dataset:
  - name: str
    The dataset to train and evaluate on. See our datasets for available options and how to add your own dataset.
  - log10_transform: bool
    Whether to take the base-10 logarithm of the dataset. This is recommended, unless the raw data in the data.hdf5 file is already on a log scale.
  - normalise: str
    How to normalise the data. Options are:
    - "minmax": applies min-max normalization to the data to rescale it to [-1, 1].
    - "standardise": applies standardization to the data to have a mean of 0 and a standard deviation of 1.
    - "disable": no normalization is applied.
    If you want to apply another form of normalization to the data, you may have to add your own dataset and normalise the data beforehand.
  - use_optimal_params: str
    Whether to use previously determined optimal hyperparameters for this dataset. This only works if these hyperparameters are stored in a corresponding surrogates_config.py (refer to this section for more details).
- seed: int
  The random seed used to initialize the random seeds for Python, PyTorch and NumPy. In some benchmarks, the seed is altered deterministically, for example to train an ensemble (where each model requires a different seed).
- losses: bool
  Whether to record the losses in the output files.
- verbose: bool
  Whether to output additional information about the current processing step to the CLI.
Benchmark parameters
- accuracy: bool
- dynamic_accuracy: bool
- timing: bool
- compute: bool
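The log10_transform and normalise options of the dataset entry can be illustrated with a short NumPy sketch. The function name transform and the use of global (dataset-wide) statistics are assumptions made for illustration; the benchmark's own preprocessing may differ in detail.

```python
import numpy as np


def transform(data: np.ndarray, log10_transform: bool = True, normalise: str = "minmax") -> np.ndarray:
    """Illustrative preprocessing only; not the benchmark's actual code."""
    if log10_transform:
        data = np.log10(data)  # requires strictly positive values
    if normalise == "minmax":
        lo, hi = data.min(), data.max()
        return 2.0 * (data - lo) / (hi - lo) - 1.0  # rescale to [-1, 1]
    if normalise == "standardise":
        return (data - data.mean()) / data.std()  # mean 0, standard deviation 1
    if normalise == "disable":
        return data
    raise ValueError(f"Unknown normalise option: {normalise!r}")


# Toy data with shape (n_trajectories, n_timesteps, n_species)
raw = np.array([[[1.0, 10.0], [100.0, 1000.0]]])
scaled = transform(raw, log10_transform=True, normalise="minmax")
```

In this toy example the values span several orders of magnitude, which is the situation where taking the logarithm before normalising is most useful.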
Once the configuration is complete, the configuration YAML file needs to be placed into the root directory of the CODES-Benchmark repository. The default filename that the training and benchmark scripts look for is config.yaml; however, you can specify any filename with the --config argument of the run_training.py and run_benchmark.py files.
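As described for batch_size and epochs, a setting can be either a single integer or a list with one entry per surrogate. The sketch below shows one way such a value could be expanded and checked once the YAML file has been loaded into a dictionary. The helper per_surrogate, the surrogate name "OtherSurrogate" and all concrete values are hypothetical and serve only as an illustration.

```python
from typing import List, Union


def per_surrogate(value: Union[int, List[int]], surrogates: List[str], key: str) -> List[int]:
    """Expand an `int | list[int]` setting to one value per surrogate."""
    if isinstance(value, int):
        return [value] * len(surrogates)
    if len(value) != len(surrogates):
        raise ValueError(
            f"'{key}' has {len(value)} entries, but {len(surrogates)} surrogates are configured."
        )
    return list(value)


# Config as it might look after loading the YAML file into a dict.
# The surrogate names and all values are placeholders.
config = {
    "training_id": "my_first_run",
    "surrogates": ["MySurrogate", "OtherSurrogate"],
    "batch_size": [256, 512],
    "epochs": 100,
    "dataset": {"name": "osu2008", "log10_transform": True, "normalise": "minmax"},
    "seed": 42,
}

batch_sizes = per_surrogate(config["batch_size"], config["surrogates"], "batch_size")
epochs = per_surrogate(config["epochs"], config["surrogates"], "epochs")
```

Here each surrogate gets its own batch size, while the single epochs value is shared by both.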
Add your own dataset
Adding your own data to the CODES repository is fairly straightforward using the create_dataset function. You can simply pass your raw (numpy) data to the function along with some additional optional data, and it will create the appropriate file(s) in the datasets directory of the repository. After this, you will not need to interact with the data again, as the benchmark handles the data automatically based on the dataset name provided in the configuration.
A note on dataset availability: The benchmark can be run on local data as soon as you have created the dataset with create_dataset (i.e., the data can be completely offline/local). The actual data.hdf5 file in your new dataset directory is ignored by git and should not be added to the repository. If you want to make your dataset available to others (which we highly encourage), you can upload it to Zenodo and provide the download link in datasets/data_sources.yaml. If you choose to do this, you can push the created dataset directory to the repository, as it will later be used to store visualisations of the data or a surrogate_config.py that contains the hyperparameters for the surrogate models.
You can import the create_dataset function from the codes package. It has the following signature:
create_dataset
- name: str
  The name of the dataset and also the directory in which it will be stored, e.g. a dataset called "MyDataset" will be stored in the datasets/mydataset directory.
- train_data: np.ndarray
  The array of training data. It should be of the shape (n_trajectories, n_timesteps, n_species).
- test_data: np.ndarray | None
  The array of test data, optional. Should follow the same shape convention as the training data.
- val_data: np.ndarray | None
  The array of validation data, optional. Should follow the same shape convention as the training data.
- split: tuple[float, float, float] | None
  If test and validation data are not provided, the training data array can be split into train, test and validation based on the split tuple provided. For example, a value of split=(0.8, 0.15, 0.05) will split the data into 80% training, 15% test and 5% validation data.
- timesteps: np.ndarray | None
  The timesteps array for the data, optional. Can be used if required in the surrogates or to set the time axis in the plots. If not provided, a [0, 1] array will be inferred from the shape of the data.
- labels: list[str] | None
  The species labels for the evaluation plots.
You can call the create_dataset function like this:
import numpy as np
from data.data_utils import create_dataset

# load your data
train_data = np.load("path/to/train_data.npy")
test_data = np.load("path/to/test_data.npy")
val_data = np.load("path/to/val_data.npy")
timesteps = np.load("path/to/timesteps.npy")
labels = ["species1", "species2", "species3"]

# create the dataset
create_dataset(
    name="MyDataset",
    train_data=train_data,
    test_data=test_data,
    val_data=val_data,
    timesteps=timesteps,
    labels=labels,
)
Alternatively, if you only have a single dataset array and want to split it into train, test and validation data, you can do this:
import numpy as np
from data.data_utils import create_dataset

# load your data
data = np.load("path/to/data.npy")
timesteps = np.load("path/to/timesteps.npy")

# create the dataset
create_dataset(
    name="MyDataset",
    train_data=data,
    split=(0.8, 0.15, 0.05),
    timesteps=timesteps,
    labels=["species1", "species2", "species3"],
)
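To give an idea of what a split tuple means in terms of trajectories, here is a minimal sketch that partitions a sequence along its first axis. The function split_trajectories is hypothetical, and it assumes a plain contiguous split without shuffling; create_dataset may handle ordering and rounding differently.

```python
from typing import Sequence, Tuple


def split_trajectories(data: Sequence, split: Tuple[float, float, float]):
    """Split along the first (trajectory) axis into train, test and validation parts."""
    if abs(sum(split) - 1.0) > 1e-9:
        raise ValueError("The split fractions must sum to 1.")
    n = len(data)
    n_train = round(n * split[0])
    n_test = round(n * split[1])
    train = data[:n_train]
    test = data[n_train:n_train + n_test]
    val = data[n_train + n_test:]  # the remainder goes to validation
    return train, test, val


# 100 placeholder trajectories; slicing works the same way for a NumPy array
trajectories = list(range(100))
train, test, val = split_trajectories(trajectories, (0.8, 0.15, 0.05))
```

With 100 trajectories and split=(0.8, 0.15, 0.05), this yields 80 training, 15 test and 5 validation trajectories.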
After calling the create_dataset function, the dataset will be stored in the datasets/mydataset directory of the repository (the dataset name is not case-sensitive; it will always be stored in lowercase). The benchmark will automatically load the data from there based on the dataset name provided in the configuration.
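The mapping from dataset name to storage location can be pictured as follows. This is a sketch only: dataset_dir is a hypothetical helper, and the paths are taken relative to the repository root.

```python
from pathlib import Path


def dataset_dir(name: str, root: Path = Path("datasets")) -> Path:
    """Directory of a dataset; the name is always lowercased."""
    return root / name.lower()


path = dataset_dir("MyDataset")
data_file = path / "data.hdf5"  # the file that is ignored by git
```

Because of the lowercasing, "MyDataset", "mydataset" and "MYDATASET" in the configuration all refer to the same directory.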
Add your own model
To be able to compare your own models to each other or to some of the baseline models provided by us, you need to add your own surrogate implementation to the repository. The AbstractSurrogateModel class offers a blueprint as well as some basic functionality like saving and loading models. Your own model needs to be implemented or wrapped in a class that inherits from the AbstractSurrogateModel class.
We recommend you structure your model in such a way that hyperparameters you might want to change and tune in the future are stored in a separate dataclass. This keeps the hyperparameters and the actual code logic separate and easily accessible, and allows you to tune your surrogate without modifying the actual code. See the Surrogate Configuration section and the tutorial with code examples below for how to do this.
For the integration into the benchmark, you need to implement four methods for your own model class:
- __init__
  The initialization method. In this method you can instantiate any objects you need during training and set attributes required later. The method should also call the superclass constructor and set the model configuration.
  Arguments:
  - self
    The required self argument for instance methods.
  - device: str
    The device the model will train/evaluate on.
  - n_chemicals: int
    The dimensionality (i.e. number of chemicals) in the dataset.
  - n_timesteps: int
    The number of timesteps in the dataset.
  - model_config: dict
    The configuration dictionary that is passed to the model upon initialization. This dictionary contains all the parameters from the configuration file that are relevant for the model.
- prepare_data
  This method serves as a helper function which creates and returns the torch dataloaders that provide the training data in a suitable format for your model.
  Arguments:
  - self
    The required self argument for instance methods.
  - dataset_train: np.ndarray
    The raw training data as a numpy array. It has the shape (n_trajectories, n_timesteps, n_species).
  - dataset_test: np.ndarray | None
    The raw test data as a numpy array (optional).
  - dataset_val: np.ndarray | None
    The raw validation data as a numpy array (optional).
  - timesteps: np.ndarray
    The array of timesteps in the training data. If your model does not explicitly use these, you can just ignore this argument.
  - batch_size: int
    The batch size your dataloader should have. This value is read from the configuration and should be passed directly to the DataLoader constructor (see example below).
  - shuffle: bool
    The shuffle argument is set by the benchmark and should be passed directly to the constructor of the DataLoader.
  Return:
  The method should return a tuple of three dataloaders in the order train, test, val. If the dataset_test or dataset_val arguments are None, the respective dataloader should also be None.
- fit
  This method's purpose is to execute the training loop and train the self.model instantiated in the __init__ method. Optionally, a test prediction can be made on the test dataset to evaluate training progress.
  Important: This method should save the training loss (and optionally the test loss and the mean absolute error on the test set) as tensors in the self.train_loss, self.test_loss and self.MAE attributes. See the example below on how to do that.
  Arguments:
  - self
    The required self argument for instance methods.
  - train_loader: torch.utils.data.DataLoader
    The training dataloader.
  - test_loader: torch.utils.data.DataLoader | None
    The test dataloader (optional).
  - epochs: int
    The number of epochs to train the model for. This value is read from the configuration and should be used to determine the number of iterations in the training loop.
  - position: int
    Position argument used for the progress bar. See the example below on how to use it.
  - description: str
    Label argument used for the progress bar. See the example below on how to use it.
- forward
  This method should simply call the forward method of the model and return the output together with the targets.
  Arguments:
  - self
    The required self argument for instance methods.
  - inputs: Any
    Whatever the dataloader outputs.
  Return:
  Returns a tuple of predictions and targets.
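The four methods listed above can be summarised as an interface. The following sketch uses a hypothetical stand-in class, SurrogateInterface, built with Python's abc module; it is not the repository's AbstractSurrogateModel, whose exact implementation may differ. It only mirrors the argument lists and return contracts described in this section.

```python
from abc import ABC, abstractmethod
from typing import Any, Optional, Tuple


class SurrogateInterface(ABC):
    """Hypothetical stand-in that mirrors the four required methods."""

    def __init__(self, device: str, n_chemicals: int, n_timesteps: int,
                 model_config: Optional[dict] = None):
        self.device = device
        self.n_chemicals = n_chemicals
        self.n_timesteps = n_timesteps
        self.config = model_config or {}
        # fit() is expected to fill these attributes
        self.train_loss = None
        self.test_loss = None
        self.MAE = None

    @abstractmethod
    def prepare_data(self, dataset_train, dataset_test, dataset_val,
                     timesteps, batch_size: int, shuffle: bool):
        """Return (train_loader, test_loader, val_loader); test/val may be None."""

    @abstractmethod
    def fit(self, train_loader, test_loader, epochs: int,
            position: int, description: str) -> None:
        """Train the model and store train_loss, test_loss and MAE."""

    @abstractmethod
    def forward(self, inputs: Any) -> Tuple[Any, Any]:
        """Return (predictions, targets)."""


class Incomplete(SurrogateInterface):
    """Implements only forward, so Python refuses to instantiate it."""

    def forward(self, inputs):
        return inputs, inputs
```

Declaring the methods as abstract means a missing method is reported when the class is instantiated, rather than failing in the middle of a benchmark run.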
Surrogate Configuration
To keep the hyperparameters of a surrogate (such as model dimensions, activation functions, learning rates, latent space dimensions, etc.) separate from the code of the actual surrogate model, and to make it easy to modify those hyperparameters later, we employ dataclasses as configurators for a surrogate model. Since the optimal parameters for a given surrogate will likely vary between datasets, our architecture enables you to define a configuration per dataset.
Each model comes with a default (or fallback) configuration which will be loaded by default.
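A minimal sketch of this idea, using only the standard library: a dataclass holds the default values, and dataset-specific values replace them when available. The names ExampleConfig, dataset_overrides and load_config, as well as all numbers, are hypothetical; how the benchmark actually locates and loads per-dataset configurations is not shown here.

```python
from dataclasses import dataclass, replace


@dataclass
class ExampleConfig:
    """Default (fallback) hyperparameters of a hypothetical surrogate."""
    hidden_layers: int = 2
    layer_width: int = 128
    learning_rate: float = 1e-3


# Hypothetical per-dataset overrides, keyed by dataset name.
dataset_overrides = {"osu2008": {"layer_width": 256, "learning_rate": 3e-4}}


def load_config(dataset: str, use_optimal_params: bool = True) -> ExampleConfig:
    """Start from the defaults and apply dataset-specific values if requested."""
    config = ExampleConfig()
    if use_optimal_params and dataset in dataset_overrides:
        config = replace(config, **dataset_overrides[dataset])
    return config


config = load_config("osu2008")
```

Fields that are not overridden for a dataset (here hidden_layers) keep their default value, so a per-dataset configuration only needs to list what differs.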
Example Implementation
This short tutorial will go over all the required steps to add your own Surrogate class to the benchmark and will provide some sample code. The Surrogate we will add is just a variant of a fully connected neural network and serves only to demonstrate the process of adding your own implementation.
To get started, add a folder in the surrogates/ directory of the repository, named after your model. For this example, the model we will add is called MySurrogate, so we create the directory surrogates/MySurrogate/.
In this directory, we create the Python file my_surrogate.py, which will contain the code for our surrogate. We will also create a second file, my_surrogate_config.py, where we can define the hyperparameters of our surrogate.
If you plan to use several datasets with your surrogate, you can also define a set of hyperparameters per dataset, as the optimal parameters might vary between datasets. Check the dataset section on how to do this.
For this demonstration, we will use the OSU2008 dataset. Our demonstration surrogate will simply take the initial abundances and make a prediction based on those.
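Before looking at the actual implementation, the data flow of such a model can be sketched with plain NumPy. The array sizes and the random linear map that stands in for the network are made up for illustration; only the shape convention (n_trajectories, n_timesteps, n_species) is taken from this page.

```python
import numpy as np

n_trajectories, n_timesteps, n_species = 4, 10, 3
batch = np.random.rand(n_trajectories, n_timesteps, n_species)

# Input to the network: abundances at the first timestep only.
initial_cond = batch[..., 0, :]  # shape (4, 3)

# Stand-in for the network: a random linear map to a flat output vector.
weights = np.random.rand(n_species, n_timesteps * n_species)
flat_prediction = initial_cond @ weights  # shape (4, 30)

# Reshape the flat output so it can be compared with the full trajectories.
prediction = flat_prediction.reshape(batch.shape)  # shape (4, 10, 3)
mse = np.mean((prediction - batch) ** 2)
```

The key point is that the model sees one timestep per trajectory but has to produce every timestep, so its flat output must be reshaped before a loss against the full trajectory can be computed.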
Before implementing the surrogate itself, we will define its configuration dataclass. For this, open the my_surrogate_config.py file you created and add the hyperparameters you might want to change in the future. For this example, we will add the width, depth, activation function and learning rate of our neural network.
from dataclasses import dataclass
from torch.nn import ReLU, Module

@dataclass
class MySurrogateConfig:
    """Model config for MySurrogate for the osu2008 dataset"""
    # field names match the attributes accessed via self.config in MySurrogate
    hidden_layers: int = 2
    layer_width: int = 128
    activation: Module = ReLU()
    learning_rate: float = 1e-3
Next, we will implement a dataset class for our surrogate. You can put this class into the my_surrogate.py file we just created, or alternatively put it in a separate file and import it into my_surrogate.py.
import torch
from torch.utils.data import Dataset

class MyDataset(Dataset):
    def __init__(self, abundances, device):
        # abundances with shape (n_samples, n_timesteps, n_species)
        # cast to float32 so the data matches the default dtype of the network weights
        self.abundances = torch.tensor(abundances, dtype=torch.float32).to(device)
        self.length = self.abundances.shape[0]

    def __getitem__(self, index):
        # one full trajectory with shape (n_timesteps, n_species)
        return self.abundances[index, :, :]

    def __len__(self):
        return self.length
Now we implement the surrogate itself. It is important that the custom surrogate class is derived from the AbstractSurrogateModel class and adheres to its method signatures in order to be compatible with the benchmark.
Let's begin by implementing the __init__ method. All we need to do here is initialize our neural network, call the superclass constructor, and initialize our model config so that its parameters are available inside our surrogate class.
from surrogates.surrogates import AbstractSurrogateModel
from torch import nn
from surrogates.MySurrogate.my_surrogate_config import MySurrogateConfig

class MySurrogate(AbstractSurrogateModel):
    def __init__(
        self,
        device: str | None,
        n_chemicals: int,
        n_timesteps: int,
        model_config: dict | None,
    ):
        super().__init__(device, n_chemicals, n_timesteps, model_config)
        model_config = model_config if model_config is not None else {}
        self.config = MySurrogateConfig(**model_config)
        # construct the model according to the parameters in the config
        modules = []
        modules.append(nn.Linear(n_chemicals, self.config.layer_width))
        modules.append(self.config.activation)
        for _ in range(self.config.hidden_layers):
            modules.append(nn.Linear(self.config.layer_width, self.config.layer_width))
            modules.append(self.config.activation)
        modules.append(nn.Linear(self.config.layer_width, n_chemicals*n_timesteps))
        self.model = nn.Sequential(*modules).to(device)
The next step is to implement the prepare_data method. There, we instantiate and return the dataloaders for our model using our custom-defined dataset.
from torch.utils.data import DataLoader
import numpy as np

class MySurrogate(AbstractSurrogateModel):
    ...
    def prepare_data(
        self,
        dataset_train: np.ndarray,
        dataset_test: np.ndarray | None,
        dataset_val: np.ndarray | None,
        timesteps: np.ndarray,
        batch_size: int,
        shuffle: bool,
    ) -> tuple[DataLoader, DataLoader | None, DataLoader | None]:
        train = MyDataset(dataset_train, self.device)
        train_loader = DataLoader(
            train, batch_size=batch_size, shuffle=shuffle
        )
        if dataset_test is not None:
            test = MyDataset(dataset_test, self.device)
            test_loader = DataLoader(
                test, batch_size=batch_size, shuffle=shuffle
            )
        else:
            test_loader = None
        if dataset_val is not None:
            val = MyDataset(dataset_val, self.device)
            val_loader = DataLoader(
                val, batch_size=batch_size, shuffle=shuffle
            )
        else:
            val_loader = None
        return train_loader, test_loader, val_loader
Finally, we implement the training loop inside the fit function and define the forward function. Note that the fit function should set the train_loss, test_loss and MAE (mean absolute error) attributes of the surrogate to ensure their availability for plotting later. To have access to training durations later on, we wrap the fit function with the time_execution function from the utils module.
from torch.optim import Adam
from utils import time_execution

class MySurrogate(AbstractSurrogateModel):
    ...
    def forward(self, inputs):
        # the full trajectories (batch, n_timesteps, n_species) serve as targets
        targets = inputs
        # only the abundances at the first timestep are fed to the network
        initial_cond = inputs[..., 0, :]
        # the network returns a flat vector per sample, so reshape it to match the targets
        outputs = self.model(initial_cond).reshape(targets.shape)
        return outputs, targets

    @time_execution
    def fit(
        self,
        train_loader: DataLoader,
        test_loader: DataLoader,
        epochs: int,
        position: int,
        description: str,
    ):
        criterion = nn.MSELoss()
        optimizer = Adam(self.model.parameters(), lr=self.config.learning_rate)
        # initialize the loss tensors
        losses = torch.empty((epochs, len(train_loader)))
        test_losses = torch.empty((epochs))
        MAEs = torch.empty((epochs))
        # setup the progress bar
        progress_bar = self.setup_progress_bar(epochs, position, description)
        # training loop as usual
        for epoch in progress_bar:
            for i, x_true in enumerate(train_loader):
                optimizer.zero_grad()
                x_pred, _ = self.forward(x_true)
                # MSELoss expects (prediction, target)
                loss = criterion(x_pred, x_true)
                loss.backward()
                optimizer.step()
                losses[epoch, i] = loss.item()
            # set the progress bar output
            clr = optimizer.param_groups[0]["lr"]
            print_loss = f"{losses[epoch, -1].item():.2e}"
            progress_bar.set_postfix({"loss": print_loss, "lr": f"{clr:.1e}"})
            # evaluate the model on the test set
            with torch.inference_mode():
                self.model.eval()
                preds, targets = self.predict(test_loader)
                self.model.train()
                loss = criterion(preds, targets)
                test_losses[epoch] = loss
                MAEs[epoch] = self.L1(preds, targets).item()
        progress_bar.close()
        # store the per-epoch mean training loss, the test loss and the MAE
        self.train_loss = torch.mean(losses, dim=1)
        self.test_loss = test_losses
        self.MAE = MAEs
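To illustrate what wrapping a method for timing can look like, here is a generic decorator sketch. It is not the repository's time_execution function, whose implementation is not shown on this page; the decorator name record_duration and the attribute last_duration are hypothetical.

```python
import time
from functools import wraps


def record_duration(func):
    """Store the wall-clock runtime of a method call on the instance (illustrative only)."""
    @wraps(func)
    def wrapper(self, *args, **kwargs):
        start = time.perf_counter()
        result = func(self, *args, **kwargs)
        # Hypothetical attribute name; the benchmark may store this differently.
        self.last_duration = time.perf_counter() - start
        return result
    return wrapper


class Toy:
    """Placeholder class with a cheap stand-in for a training loop."""

    @record_duration
    def fit(self, epochs: int) -> int:
        return sum(range(epochs))


toy = Toy()
toy.fit(1000)
```

After the call, toy.last_duration holds the elapsed time in seconds, while the decorated method still returns its original result.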
Now that your surrogate class is completely implemented, the last thing left to do is to add it to the surrogate_classes.py file in the surrogates directory of the repository to make it available for the benchmark. In our case, this looks as follows (other, already existing surrogates are omitted in the code example):
...
from surrogates.MySurrogate.my_surrogate import MySurrogate

surrogate_classes = [
    ...
    # Add any additional surrogate classes here
    MySurrogate,
]
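Since the surrogate names in the configuration correspond to class names, a list like surrogate_classes can be turned into a name-based lookup. The sketch below shows one way this could work; the helper get_surrogate and the placeholder classes are hypothetical and do not reflect how the benchmark resolves names internally.

```python
class MySurrogate:
    """Placeholder for the class defined in my_surrogate.py."""


class OtherSurrogate:
    """Placeholder for an already existing surrogate."""


surrogate_classes = [OtherSurrogate, MySurrogate]


def get_surrogate(name: str):
    """Look up a surrogate class by its class name, as used in the config."""
    by_name = {cls.__name__: cls for cls in surrogate_classes}
    if name not in by_name:
        raise ValueError(f"Unknown surrogate '{name}'. Available: {sorted(by_name)}")
    return by_name[name]


model_class = get_surrogate("MySurrogate")
```

With such a lookup, a misspelled surrogate name in the configuration produces a clear error that lists the available names.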
Now you're all set! You can use your own surrogate model in the benchmark and compare it with any of the other surrogates present.