CODES Benchmark

We introduce CODES, a benchmark for comprehensive evaluation of surrogate architectures for coupled ODE systems. Besides standard metrics like mean squared error (MSE) and inference time, CODES provides insights into surrogate behaviour across multiple dimensions like interpolation, extrapolation, sparse data, uncertainty quantification and gradient correlation. The benchmark emphasizes usability through features such as integrated parallel training, a web-based configuration generator, and pre-implemented baseline models and datasets. Extensive documentation ensures sustainability and provides the foundation for collaborative improvement. By offering a fair and multi-faceted comparison, CODES helps researchers select the most suitable surrogate for their specific dataset and application while deepening our understanding of surrogate learning behaviour.

Motivation

There are many efforts to use machine learning models ("surrogates") to replace the costly numerics involved in solving coupled ODEs. But for the end user, it is not obvious how to choose the right surrogate for a given task. Usually, the best choice depends on both the dataset and the target application.

Dataset specifics: how "complex" is the dataset?

  • How many samples are there?
  • Are the trajectories highly dynamic, or do they evolve rather slowly?
  • How dense is the distribution of initial conditions?
  • Is the data domain of interest well-covered by the domain of the training set?

Task requirements:

  • What is the required accuracy?
  • How important is inference time? Is the training time limited?
  • Are there computational constraints (memory or processing power)?
  • Is uncertainty estimation required (e.g. to replace uncertain predictions by numerics)?
  • How much predictive flexibility is required? Do we need to interpolate or extrapolate across time?

Besides these practical considerations, one overarching question is always: Does the model only learn the data, or does it "understand" something about the underlying dynamics?

Goals

This benchmark aims to aid in choosing the best surrogate model for the task at hand and additionally to shed some light on the above questions.

To achieve this, a selection of surrogate models is implemented in this repository. They can be trained on one of the included datasets or on a custom dataset and then benchmarked on the corresponding test dataset.
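
To make this concrete, here is a purely conceptual, self-contained sketch of the train-then-benchmark idea on a toy coupled ODE (a two-species linear decay chain). It is not the CODES pipeline or API; the system, the linear least-squares "surrogate" and all names are illustrative assumptions.

```python
# Conceptual sketch only: a toy coupled ODE, a deliberately simple linear
# "surrogate", and a train/test split. This is NOT the CODES pipeline or API;
# it merely illustrates the train-then-benchmark idea.
import numpy as np
from scipy.integrate import solve_ivp

rng = np.random.default_rng(0)
t_eval = np.linspace(0.0, 5.0, 50)

def rhs(t, y):
    # Two coupled quantities: y0 decays into y1, y1 decays away.
    return [-1.0 * y[0], 1.0 * y[0] - 0.5 * y[1]]

# Dataset: trajectories for randomly drawn initial conditions.
y0s = rng.uniform(0.5, 2.0, size=(200, 2))
data = np.stack([solve_ivp(rhs, (0.0, 5.0), y0, t_eval=t_eval).y.T for y0 in y0s])
train, test = data[:160], data[160:]   # shape: (samples, timesteps, quantities)

# "Surrogate": a linear map from the initial state to the full trajectory,
# fitted by least squares (a stand-in for the neural surrogates in CODES).
W, *_ = np.linalg.lstsq(train[:, 0, :], train.reshape(len(train), -1), rcond=None)

pred = (test[:, 0, :] @ W).reshape(test.shape)
print("test MSE:", np.mean((pred - test) ** 2))
```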

Some of the metrics included in the benchmark (there are many more; a small computation sketch follows the list):

  • Absolute and relative error of the models.
  • Inference time.
  • Number of trainable parameters.
  • Memory requirements (WIP).
  • Predictive uncertainty.
  • Pearson correlation coefficients.
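
To illustrate the first two of these metrics (and a simple way to time inference), here is a hedged, self-contained sketch on synthetic arrays; the array shapes, the epsilon guard and the stand-in for the model's forward pass are assumptions for illustration, not the CODES implementation.

```python
# Illustrative sketch (synthetic data, not the CODES implementation) of how
# absolute/relative error and inference time could be measured.
import time
import numpy as np

rng = np.random.default_rng(0)
y_true = rng.random((128, 100, 10))                  # (samples, timesteps, quantities)
y_pred = y_true + 0.01 * rng.standard_normal(y_true.shape)

abs_err = np.abs(y_pred - y_true)
rel_err = abs_err / (np.abs(y_true) + 1e-10)         # epsilon guards against division by zero
print("mean absolute error:", abs_err.mean())
print("mean relative error:", rel_err.mean())

start = time.perf_counter()                          # inference time: wall clock of a forward pass
_ = np.tanh(y_true)                                  # placeholder for model(x_test)
print("inference time [s]:", time.perf_counter() - start)
```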

Besides this, there are plenty of plots and visualisations providing insights into the models' behaviour:

  • Error distributions - per model, across time or per quantity.
  • Insights into interpolation and extrapolation across time.
  • Behaviour when training with sparse data or varying batch size.
  • Predictions with uncertainty and predictive uncertainty across time.
  • Correlations of the prediction error with either the predictive uncertainty or the dynamics (gradients) of the data (see the sketch below).
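
The gradient-correlation idea from the last point can be illustrated with a short, self-contained sketch: a toy "surrogate" whose error grows where the data changes quickly shows a clear Pearson correlation between the local gradient magnitude and the prediction error. Everything below is synthetic and illustrative, not the CODES code.

```python
# Sketch of the gradient-correlation idea on synthetic data (not the CODES code):
# does the prediction error track how "dynamic" the trajectory is locally?
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 200)
y_true = np.sin(8 * np.pi * t)                      # one quantity of one trajectory
grad = np.abs(np.gradient(y_true, t))               # local dynamics of the data

# Toy "surrogate" whose error grows where the trajectory changes quickly.
y_pred = y_true + 0.02 * grad * rng.standard_normal(t.shape)
err = np.abs(y_pred - y_true)

r, p = pearsonr(grad, err)
print(f"Pearson r (gradient magnitude vs. error): {r:.3f}, p = {p:.2g}")
```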

Some prime use-cases of the benchmark are:

  • Finding the best-performing surrogate on a dataset. Here, best-performing could mean high accuracy, low inference time, or any other metric of interest (e.g. the most accurate uncertainty estimates).
  • Comparing performance of a novel surrogate architecture against the implemented baseline models.
  • Gaining insights into a dataset or comparing datasets using the built-in dataset insights.

On This Website

This website should be a helpful guide to using CODES. Here is what can be found where (all pages are linked in the header):

  • Overview. You are here :)
  • Benchmark. If you want a thorough overview of the structure of CODES and how to use it, this is the page to consult. It explains the structure and configuration without going into too much technical detail.
  • Docs. This is the more detailed, technical guide to the code structure and configuration, and to adding your own dataset or model to the benchmark.
  • Results. An example evaluation of a benchmark run. This page gives an overview of what to expect as the output of the benchmark.
  • Config Maker. A JavaScript-based tool that helps you set up the configuration of the benchmark. It generates the required config.yaml, which can then be downloaded.
  • GitHub. Link to the CODES GitHub repo.
  • API Docs. Link to the technical API documentation, auto-generated with Sphinx.
  • Paper. Link to the CODES paper on arXiv, which was accepted for ML4PS@NeurIPS2024.