Failure handling · EnsembleKalmanProcesses.jl

Handling forward model failures

In situations where the forward model $\mathcal{G}$ represents a diagnostic of a complex computational model, there might be cases where for some parameter combinations $\theta$, attempting to evaluate $\mathcal{G}(\theta)$ may result in model failure. Some examples could be

A member is numerically unstable, or throws error exceptions
A member exceeds a user-defined wall-clock time for computation
A member's outputs produce (full or partial) values that are user-defined as "failure"
Possible corruption of data or compute nodes during HPC model evaluation

In such cases, this package offers the option for users to replace entries in the evaluation with NaN values, and then updates will continue with the following procedures

Imputation: If only partial output of an ensemble member is NaN, then values may be redeemed with values derived from other ensemble members, or from user-defined values. Imputed members will be considered successful.
Failure handling: The update equations are then modified to handle/replace/resample failed members. Crucially, this is done in a way that does not break parallelism (i.e., no re-running of failed members)

EnsembleKalmanProcesses.jl implements such modifications through the FailureHandler structure, an input to the EnsembleKalmanProcess constructor. Currently, the only failsafe modification available is SampleSuccGauss(), described in Lopez-Gomez et al (2022).

These are last-resort handlers

These modifications are not a magic bullet. If large fractions of ensemble members fail during an iteration, this will degenerate the span of the ensemble, as one may not be able to replace lost information.

Implementation

When available, the failure handling and imputation is active by default, see here for the different options. Alternatively, one can use the failure_handler_method keyword

using EnsembleKalmanProcesses
using EnsembleKalmanProcesses.ParameterDistributions


# for e.g., EKI
J = 50  # number of ensemble members
initial_ensemble = construct_initial_ensemble(prior, J) # Initialize ensemble from prior
ekiobj = EnsembleKalmanProcess(
    initial_ensemble,
    y,
    obs_noise_cov,
    Inversion(),
    failure_handler_method = SampleSuccGauss())

# for e.g., UKI
ukiobj = EnsembleKalmanProcess(
    y
    obs_noise_cov,
    Unscented(prior),
    failure_handler_method = SampleSuccGauss())

When has a particle "failed"?

Full particle failure

The user determines if a ensemble member has failed and replace the output (column) of $\mathcal{G}(\theta)$ with NaNs. The FailureHandler takes care of the rest.

Partial particle failure (a.k.a Imputation)

If there is a partial failure within an output then the user sets only failed entries within a particle to NaN. The user can then use two keywords to determine the imputation approach, given to EnsembleKalmanProcess.

nan_tolerance: If the number of NaNs in any observation is below a given nan_tolerance (default 10%) then the particle will still be considered successful (not handled by the FailureHandler) and the algorithm will impute NaNs using information from other ensemble members or user-defined values.
nan_rows_value: if the same entry fails for all ensemble members, then imputing is not possible without additional information; here the user may supply a vector with the dimension of the output and this value will be imputed for all members in this row.

Example of Imputation and Failure Handling

For example imagine we have a 7-dimensional output and 4 ensemble members. The latest run produces

g = 
7×4 Matrix{Float64}:
NaN           0.992373   NaN           -0.200025
-1.37133     0.0527899  NaN           -1.15243
NaN         NaN          NaN          NaN
NaN         NaN          NaN          NaN
-2.28799   NaN            0.614214     1.54217
0.582582    0.0744582   -1.37219      0.253971
0.620447    0.441834     0.0641471    0.490588

Now, given the user-defined settings in EnsembleKalmanProcess(...)

nan_tolerance = 0.5         # failure if > 50% NaN
nan_rows_value = collect(1:7) # replace failed row k with value k

The imputation will produce:

┌ Warning: In forward map ensemble g, detected 12 NaNs. 
│ Given nan_tolerance = 0.5 to determine failed members: 
│ - Ensemble members failed:       1 
│ - NaNs in successful members:    8 
│ - rows index set for imputation: [1, 3, 4, 5] 
│ - rows index entirely NaN:       [3, 4]
[ Info: Imputed 8 NaNs
7×4 Matrix{Float64}:
0.396174   0.992373   NaN          -0.200025     # `g[1, 1]`        -> `mean(g[1, [2, 4] ])`
-1.37133    0.0527899  NaN          -1.15243
3.0        3.0        NaN           3.0          # `g[3, [1, 2, 4]]`-> `3.0`
4.0        4.0        NaN           4.0          # `g[4, [1, 2, 4]]`-> `4.0`
-2.28799   -0.0438687    0.614214    1.54217     # `g[5, 2]`        -> `mean(g[5, [1, 3, 4])`
0.582582   0.0744582   -1.37219     0.253971
0.620447   0.441834     0.0641471   0.490588

Here,

A warning is given to describe the context
member 3 was deemed a failure(>50% NaN) and will be replaced with the failure handler, SampleSuccGauss during EKP update
members 1, 2, and 4 were deemed successful, and so are imputed with values for the EKP update.

`SampleSuccGauss()` for EKI

The SampleSuccGauss() modification is based on updating all ensemble members with a distribution given by only the successful parameter ensemble. Let $\Theta_{s,n}=[ \theta^{(1)}_{s,n},\dots,\theta^{(J_s)}_{s,n}]$ be the successful ensemble, for which each evaluation $\mathcal{G}(\theta^{(j)}_{s,n})$ does not fail, and let $\theta_{f,n}^{(k)}$ be the ensemble members for which the evaluation $\mathcal{G}(\theta^{(k)}_{f,n})$ fails. The successful ensemble $\Theta_{s,n}$ is updated to $\Theta_{s,n+1}$ using expression (2), and each failed ensemble member as

\[ \theta_{f,n+1}^{(k)} \sim \mathcal{N} \left({m}_{s, {n+1}}, \Sigma_{s, n+1} \right),\]

where

\[ {m}_{s, {n+1}} = \dfrac{1}{J_s}\sum_{j=1}^{J_s} \theta_{s,n+1}^{(j)}, \qquad \Sigma_{s, n+1} = \mathrm{Cov}(\theta_{s, n+1}, \theta_{s, n+1}) + \kappa_*^{-1}\mu_{s,1}I_p.\]

Here, $\kappa_*$ is a limiting condition number, $\mu_{s,1}$ is the largest eigenvalue of the sample covariance $\mathrm{Cov}(\theta_{s, n+1}, \theta_{s, n+1})$ and $I_p$ is the identity matrix of size $p\times p$.

`SampleSuccGauss()` for UKI

The SampleSuccGauss() modification is based on performing the UKI quadratures over the successful sigma points. Consider the set of off-center sigma points $\{\hat{\theta}\} = \{\hat{\theta}_s\} \cup \{\hat{\theta}_f\}$ where $\hat{\theta}_{s}^{(j)}$, $j=1, \dots, J_s$ are successful members and $\hat{\theta}_{f}^{(k)}$ are not. For ease of notation, consider an ordering of $\{\hat{\theta}\}$ such that $\{\hat{\theta}_s\}$ are its first $J_s$ elements, and note that we deal with the central point $\hat{\theta}^{(0)}$ separately. We estimate the covariances $\mathrm{Cov}_q(\mathcal{G}_n, \mathcal{G}_n)$ and $\mathrm{Cov}_q(\theta_{n}, \mathcal{G}_n)$ from the successful ensemble,

\[ \tag{1} \mathrm{Cov}_q(\theta_n, \mathcal{G}_n) \approx \sum_{j=1}^{J_s}w_{s,j} (\hat{\theta}_{s, n}^{(j)} - \bar{\theta}_{s,n})(\mathcal{G}(\hat{\theta}_{s, n}^{(j)}) - \bar{\mathcal{G}}_{s,n})^T,\]

\[ \tag{2} \mathrm{Cov}_q(\mathcal{G}_n, \mathcal{G}_n) \approx \sum_{j=1}^{J_s}w_{s,j} (\mathcal{G}(\hat{\theta}_{s, n}^{(j)}) - \bar{\mathcal{G}}_{s,n})(\mathcal{G}(\hat{\theta}_{s, n}^{(j)}) - \bar{\mathcal{G}}_{s,n})^T,\]

where the weights at each successful sigma point are scaled up, to preserve the sum of weights,

\[ w_{s,j} = \left(\dfrac{\sum_{i=1}^{2p} w_i}{\sum_{k=1}^{J_s} w_k}\right)w_j.\]

In equations (1) and (2), the means $\bar{\theta}_{s,n}$ and $\bar{\mathcal{G}}_{s,n}$ must be modified from the original formulation if the central point $\hat{\theta}^{(0)}=m_n$ results in model failure. If this is the case, then an average is taken across the other (successful) ensemble members

\[ \bar{\theta}_{s,n} = \dfrac{1}{J_s}\sum_{j=1}^{J_s}\hat{\theta}_{s, n}^{(j)}, \qquad \bar{\mathcal{G}}_{s,n} = \dfrac{1}{J_s}\sum_{j=1}^{J_s}\mathcal{G}(\hat{\theta}_{s, n}^{(j)}).\]