Exemplary Results

This page presents some exemplary evaluations performed on the osu2008 dataset to demonstrate the benchmark's capabilities and explain considerations for the different plots.

Plots

The benchmark has two kinds of outputs: metrics and plots. Both are provided on a per-surrogate basis and as a comparison between the benchmarked surrogates.
All metric output is stored in results/training_id/. The individual surrogate metrics are stored in surrogatename_metrics.yaml (e.g. fullyconnected_metrics.yaml), while the comparative parts are stored as metrics.csv and metrics_table.txt (this table is also printed to the CLI at the end of the benchmark).
All plots are stored in plots/training_id/, the individual plots for each surrogate in the respective surrogate directory and the comparative plots in the main directory.
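
For downstream analysis, these outputs can be loaded directly. Below is a minimal sketch assuming the default output layout described above; the training ID ("training_1") and the surrogate name ("fullyconnected") are placeholders for your own run.

```python
import yaml
import pandas as pd

training_id = "training_1"  # hypothetical run identifier

# Per-surrogate metrics are stored as YAML files in results/<training_id>/.
with open(f"results/{training_id}/fullyconnected_metrics.yaml") as f:
    fc_metrics = yaml.safe_load(f)

# The comparative metrics of all surrogates are collected in metrics.csv.
comparison = pd.read_csv(f"results/{training_id}/metrics.csv")
print(comparison.head())
```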

Comparative Plots

Below is a selection of the plots that can be created in the benchmark if at least two surrogate architectures were trained. These plots serve to compare characteristics between surrogates.

Accuracy Error Distributions

This smoothed histogram plot serves to compare the distribution of the relative errors of the surrogate models. This can aid in identifying long tails of the distribution or multimodality. Differences between the mean and median error of each model can also be of note.
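
To reproduce this kind of plot from raw predictions, a kernel density estimate over the relative errors is one straightforward option. The sketch below uses hypothetical preds and targets arrays; the benchmark's own smoothing and binning may differ.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

# Hypothetical predictions and ground truth, shape (samples, timesteps, quantities).
rng = np.random.default_rng(0)
preds = rng.random((100, 50, 10))
targets = rng.random((100, 50, 10)) + 1e-3

# Relative error per point; the log transform makes the long tail visible.
rel_err = np.abs(preds - targets) / np.abs(targets)
log_err = np.log10(rel_err.ravel() + 1e-12)

kde = gaussian_kde(log_err)
grid = np.linspace(log_err.min(), log_err.max(), 200)
plt.plot(10**grid, kde(grid))
plt.xscale("log")
plt.xlabel("relative error")
plt.ylabel("density")
plt.show()
```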

Accuracy Relative Error Time Models

This plot compares mean and median relative errors across time to identify common spikes or notable differences between errors. In this example, there are some spikes in the second half which are roughly shared between models (although more pronounced for LatentNeuralODE and LatentPoly). These could be explained by overall more dynamic sections of the trajectories at the corresponding time steps, which can be confirmed by looking at some plots of the trajectories or average gradients across time.
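
One way to check this explanation is to compare the per-time-step error statistics with the average gradient of the true trajectories. A minimal numpy sketch, assuming hypothetical rel_err and targets arrays of shape (samples, timesteps, quantities):

```python
import numpy as np

rng = np.random.default_rng(0)
rel_err = rng.random((100, 50, 10))   # hypothetical relative errors
targets = rng.random((100, 50, 10))   # hypothetical ground-truth trajectories

mean_err_t = rel_err.mean(axis=(0, 2))           # mean relative error per time step
median_err_t = np.median(rel_err, axis=(0, 2))   # median relative error per time step

# Average absolute gradient along the time axis: error spikes that coincide
# with peaks here point to more dynamic sections of the trajectories.
grad_t = np.abs(np.gradient(targets, axis=1)).mean(axis=(0, 2))
```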

Generalization Error Comparison

One of the main axes of comparison provided by CODES is the scaling of model errors across different modalities. This plot allows for insights into how the surrogates handle interpolating and extrapolating in the time domain and how much they are able to learn from sparse data.

Losses MAE Main Model

There is a range of loss plots in the benchmark which can help identify problems during training or potential for further improvement in case one of the models has not converged fully. From this plot, it is evident that adding a learning rate scheduler for MultiONet and LatentPoly could be beneficial, as the fluctuations in the loss are quite pronounced. This plot has the additional benefit of showing the loss not over the course of epochs, but as a function of actual training time (i.e., actual compute). This is in many cases more informative as epochs are much faster for some models than others.
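
The idea of plotting loss against compute rather than epochs can be replicated by stretching each loss history over the measured wall-clock training time. A sketch with made-up loss curves and durations:

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up per-epoch loss histories and total training durations in seconds.
histories = {
    "FullyConnected": np.exp(-np.linspace(0, 5, 200)) + 1e-3,
    "MultiONet": np.exp(-np.linspace(0, 4, 200)) + 2e-3,
}
durations = {"FullyConnected": 1200.0, "MultiONet": 3400.0}

for name, loss in histories.items():
    # Spread the epochs evenly over the wall-clock time so the x-axis
    # reflects actual compute rather than epoch count.
    t = np.linspace(0.0, durations[name], len(loss))
    plt.plot(t, loss, label=name)

plt.yscale("log")
plt.xlabel("training time [s]")
plt.ylabel("loss (MAE)")
plt.legend()
plt.show()
```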

Timing Inference

A pretty self-explanatory plot comparing the inference times of all models. The differences mainly arise from the difference in model size (trainable parameters) and, in the case of LatentNeuralODE, from the computational cost of numerical integration.

Uncertainty Over Time

The key plot regarding uncertainty quantification. It compares the mean absolute error across time with the 1σ interval of the DeepEnsemble predictions. This plot can give insights into the quality of the uncertainty estimates, showing over- or underconfidence of the model. It should be noted that the 1σ interval is a somewhat arbitrary choice, and hence it could be argued that the predicted uncertainty could be arbitrarily rescaled. To gain further insights into the quality of the uncertainty estimates, we can use the following plot as well as the Pearson correlation coefficient between uncertainty and error.
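
The Pearson correlation coefficient mentioned here is straightforward to compute from flattened per-point uncertainties and errors; a sketch with synthetic arrays:

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
# Synthetic per-point predicted uncertainty (ensemble sigma) and absolute error
# of the ensemble mean; in practice both come from the benchmark's predictions.
uncertainty = rng.random(10_000)
abs_error = uncertainty * (0.5 + rng.random(10_000))

r, p_value = pearsonr(uncertainty, abs_error)
print(f"Pearson correlation between uncertainty and error: {r:.3f}")
```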

Uncertainty Error Correlation Comparison

One of two heatmap plots visualising the correlation analyses of the benchmark. This plot is related to uncertainty quantification, visualising the correlation between predicted uncertainty and actual prediction error at each point. The white dashed line marks an ideally calibrated model, which would always accurately predict the magnitude of each error. One clearly visible effect is the overconfidence of FullyConnected, evidenced by its skewing above the diagonal. Note that the colormap is log scale, so much of the observed scattering is rather minor and a large fraction of all counts falls into the pixel in the lower left corner (very small error and uncertainty).
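
A comparable heatmap can be produced with a 2D histogram and a logarithmic colormap, with the diagonal marking perfect calibration. A sketch with synthetic data (the benchmark's exact binning and normalisation may differ):

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import LogNorm

rng = np.random.default_rng(0)
# Synthetic per-point absolute errors and predicted uncertainties.
abs_error = np.abs(rng.normal(scale=1e-2, size=50_000))
uncertainty = abs_error * np.abs(rng.normal(loc=1.0, scale=0.3, size=50_000))

plt.hist2d(abs_error, uncertainty, bins=100, cmin=1, norm=LogNorm())
lim = max(abs_error.max(), uncertainty.max())
plt.plot([0, lim], [0, lim], "w--")  # ideally calibrated model
plt.xlabel("absolute error")
plt.ylabel("predicted uncertainty")
plt.colorbar(label="counts")
plt.show()
```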

Gradient Error Correlation Comparison

This second heatmap plot serves to visualise the correlation between the prediction errors and the gradient of the trajectory at that point. This can show whether a given surrogate architecture struggles with dynamic regions of the data. LatentNeuralODE and LatentPoly make a larger fraction of their errors at points with higher gradients compared to FullyConnected and MultiONet. The colormap is again in log scale.
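
The same kind of analysis can be reduced to a single number by correlating the local gradient of the true trajectory with the absolute error at each point; a numpy sketch with synthetic arrays:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic ground-truth trajectories and absolute errors,
# shape (samples, timesteps, quantities).
targets = rng.random((100, 50, 10))
abs_error = np.abs(rng.normal(scale=1e-2, size=(100, 50, 10)))

# Absolute gradient of the true trajectory along the time axis.
grad = np.abs(np.gradient(targets, axis=1))

# Correlation between local dynamics and prediction error; the benchmark
# visualises the same pair of quantities as a log-scale heatmap.
r = np.corrcoef(grad.ravel(), abs_error.ravel())[0, 1]
print(f"gradient-error correlation: {r:.3f}")
```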

Individual Plots

Below is a selection of the plots that are created in the benchmark for every surrogate architecture that was trained. They can be used to gain deeper insights into the behaviour of single models/architectures.

Error per Quantity

This plot is similar to the smoothed-histogram plot comparing the relative errors of all benchmarked surrogates, but shows the relative errors for each quantity separately for a single surrogate model. This helps identify quantities which the model struggles with or which the model is able to predict very well.

Relative Errors Over Time

A detailed view into how the relative errors of the given architecture are distributed across time. This plot displays both mean and median error for the given surrogate architecture as well as the range of different percentiles.

Interpolation Errors Over Time

This and the two subsequent plots show how the mean absolute error (MAE) develops across time for different modalities of the benchmark (here: Interpolation). As expected, the errors increase for higher intervals, as there are fewer datapoints to train on and the remaining points may inaccurately represent the trajectories.

Extrapolation Errors Over Time

Development of the MAE across time for the extrapolation modality. This plot in particular serves as a good sanity check for whether the models behave as expected: it clearly shows the error increasing rapidly beyond the data cutoff.

Sparse Errors Over Time

Development of the MAE across time for the sparse modality. As expected, overall errors increase rather uniformly with increasing sparsity. However, the rate of this increase may vary between surrogate architectures, as shown in the comparative plot for the different modalities.

Sparse Losses

The benchmark produces loss plots for the models trained in the different modalities of each surrogate. This plot is one example, showing the loss trajectory of each FullyConnected model trained for the sparse modality.

Uncertainty DeepEnsemble Predictions

To get a feeling for the character of the uncertainty estimates, this plot shows the spread of the DeepEnsemble predictions on one sample in the dataset (i.e., one full set of trajectories). The predictions of all models in the ensemble (here 5) are averaged and the resulting 1σ, 2σ and 3σ intervals are plotted. The spread is slightly difficult to see as it is rather small, but it is recognisable e.g. for CH2+ or O2+.
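
The σ intervals shown here follow directly from the mean and standard deviation over the ensemble axis. A sketch for a single quantity with a made-up ensemble of five members:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Made-up ensemble predictions for one quantity: (n_ensemble, n_timesteps).
ens_preds = 1.0 + np.cumsum(rng.normal(scale=0.05, size=(5, 100)), axis=1)

mean = ens_preds.mean(axis=0)
std = ens_preds.std(axis=0)
t = np.arange(mean.shape[0])

plt.plot(t, mean, label="ensemble mean")
for k in (1, 2, 3):  # shade the 1-sigma, 2-sigma and 3-sigma intervals
    plt.fill_between(t, mean - k * std, mean + k * std, alpha=0.15, color="C0")
plt.xlabel("time step")
plt.ylabel("quantity")
plt.legend()
plt.show()
```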

Comparative Metrics

Individual Metrics

The surrogatename_metrics.yaml file is the most detailed result file, as it contains every metric that was calculated during the benchmark of the corresponding surrogate. Which metrics are calculated depends on the modalities that are activated in the config.

accuracy_error_distributions