# Data Processing and Dimension Reduction

## Overview
When working on high-dimensional problems with a modest number of training data pairs, the bottleneck of the CES procedure is training a competent emulator. It is often necessary to process and dimensionally reduce the data to ease the emulator's learning task. To make results as transparent and reproducible as possible, we provide a flexible (and extensible!) framework to create encoders and decoders for this purpose.
The framework works as follows:

- An `encoder_schedule` defines the type of processing to be applied to input and/or output spaces.
- The `encoder_schedule` is passed into the `Emulator`, where it is initialized and stored. Sometimes it will require additional information, passed in as `encoder_kwargs`.
- The encoder will be used automatically to encode training data, covariance matrices, and predictions within the Emulate and Sample routines.
- Users can predict to/from encoded space with `predict(...; encode=X)`, with `X` being `"in"`, `"out"`, or `"in_and_out"`, if needed.
An external API is also available via the `encode_data` and `encode_structure_matrix` methods, should you need to encode new data or new covariance-type matrices.
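As a sketch of the `encode` keyword described above (the values `"in"`, `"out"`, and `"in_and_out"` come from this page; `emulator` and `new_inputs` are assumed to exist, and the exact semantics of each value are given in the package API reference):

```julia
# Sketch only: `new_inputs` is a matrix of column-samples.
predict(emulator, new_inputs)                         # encoding/decoding handled internally
predict(emulator, new_inputs; encode = "in")          # encoded space on the input side
predict(emulator, new_inputs; encode = "in_and_out")  # encoded space on both sides
```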
## Defaults and recommendations

### Recommended schedule ("latest and greatest")
Assume we are given `ekp` and `prior` from an ensemble Kalman process run (e.g., the "Calibrate" stage), and that we have chosen our machine learning tool and obtained training data. Our recommended family of encoders, balancing efficiency and dimension reduction, uses `retain_var` and `retain_info`:
```julia
retain_var = 0.999  # reduce dimension retaining 99.9% of variance (for PCA)
retain_info = 0.999 # reduce dimension retaining 99.9% of information (for likelihood-informed)
ekp_it = 5          # number of EKI iterations (for likelihood-informed)
encoder_schedule = [
    (decorrelate_sample_cov(retain_var = retain_var), "in"),
    (decorrelate_structure_mat(retain_var = retain_var), "out"),
    (likelihood_informed(retain_info = retain_info, iters = 1:ekp_it), "in"),
    (likelihood_informed(retain_info = retain_info, iters = 1:1), "out"),
]
```

The motivation is that PCA is an efficient, but blunt, tool for reducing dimension from many hundreds (or more) down to hundreds of dimensions; the likelihood-informed tools then hone this down from hundreds to tens of dimensions. For `X > 1`, `iters = 1:X` is more performant in the input dimension than in the output dimension, so we use it only for `"in"`. This schedule is added to the emulator as follows:
```julia
emulator = Emulator(
    machine_learning_tool,
    input_output_pairs;
    encoder_schedule = encoder_schedule,
    encoder_kwargs = encoder_kwargs_from(ekp, prior), # kwargs from "Calibrate" objects
)
```

### Recommended `encoder_kwargs`
The `EnsembleKalmanProcesses` infrastructure naturally provides the information needed to build the data processors. We provide utilities to extract this into kwargs of the correct format; the recommended (but not the only) option is:
```julia
# `prior::ParameterDistribution`
# `ekp::EnsembleKalmanProcess`
encoder_kwargs = encoder_kwargs_from(ekp, prior)
```

### Default schedule (to whiten with no dimension reduction)
When the user provides no `encoder_schedule`, as in
```julia
emulator = Emulator(
    machine_learning_tool,
    input_output_pairs;
    encoder_kwargs = encoder_kwargs,
)
```

the default schedule, under the hood, is given by
```julia
schedule = [
    (decorrelate_sample_cov(), "in"),
    (decorrelate_structure_mat(), "out"), # uses the obs_noise_cov kwarg
]
```

If no kwargs are provided either, the schedule under the hood is instead
```julia
schedule = [
    (decorrelate_sample_cov(), "in_and_out"),
]
```

To ensure that no encoding happens at all, the user must pass in an empty schedule, `encoder_schedule = []`.
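In contrast to these whitening-only defaults, the `retain_var` options used earlier also truncate dimensions. A minimal sketch (not the package's internals, and `truncation_rank` is a hypothetical helper) of how such a variance-retention rule picks a rank:

```julia
using LinearAlgebra

# Keep the fewest leading eigenvalues whose cumulative share of
# the total variance reaches the retain_var threshold.
function truncation_rank(C::AbstractMatrix, retain_var::Real)
    λ = sort(eigvals(Symmetric(C)), rev = true) # eigenvalues, largest first
    cum = cumsum(λ) ./ sum(λ)                   # cumulative variance fractions
    return findfirst(>=(retain_var), cum)
end

# e.g., a covariance with a fast-decaying spectrum keeps only a few directions
C = Diagonal([100.0, 10.0, 1.0, 0.1, 0.01])
truncation_rank(C, 0.999)  # → 3
```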
## Other typical schedule: PCA dimension reduction
Robust but blunt dimension reduction can be done with PCA alone:
```julia
retain_var_in = 0.99  # reduce dimension retaining 99% of input variance
retain_var_out = 0.95 # reduce dimension retaining 95% of output variance
encoder_schedule = [
    (decorrelate_sample_cov(retain_var = retain_var_in), "in"),
    (decorrelate_structure_mat(retain_var = retain_var_out), "out"),
]
encoder_kwargs = (; obs_noise_cov = obs_noise_cov)
emulator = Emulator(
    machine_learning_tool,
    input_output_pairs;
    encoder_schedule = encoder_schedule,
    encoder_kwargs = encoder_kwargs,
)
```

### kwargs from other objects
Individual kwarg sets can also be obtained from other objects, and merged as follows:
```julia
# `prior::ParameterDistribution`
# `observation_series::ObservationSeries`
input_kwargs = get_kwargs_from(prior)
output_kwargs = get_kwargs_from(observation_series)
encoder_kwargs = merge(input_kwargs, output_kwargs)
```

Though not recommended, you can also build your own keyword arguments.
- The `obs_noise_cov` object does not need to be a constructed `AbstractMatrix`; rather, it can be any type of compactly stored matrix compatible with `EnsembleKalmanProcesses` (e.g., here).
- When an `ObservationSeries` contains multiple observations, it is assumed that the noise covariances of all observations are the same. (If they are not, only the first is taken by default in `get_kwargs_from(...)`.)
## Building and interpreting encoder schedules
The user may provide an encoder schedule to transform data in a useful way, for example, mapping each dimension of the output data to be bounded in [0,1]:

```julia
simple_schedule = (minmax_scale(), "out")
```

The unit of an encoder schedule contains a `DataProcessor` (the type of processing) and a string indicating whether to apply it to input data (`"in"`), output data (`"out"`), or both (`"in_and_out"`).
The encoder schedule can be a vector of several such units, applying multiple `DataProcessor`s in order:
```julia
complex_schedule = [
    (decorrelate_sample_cov(), "in"),
    (quartile_scale(), "in"),
    (decorrelate_structure_mat(retain_var = 0.95), "out"),
    (canonical_correlation(), "in_and_out"),
]
```

In this (rather unrealistic) chain:
- The inputs are decorrelated with their sample mean and covariance (and projected to a low-dimensional subspace if necessary), i.e., PCA.
- The scaled inputs are then subject to a "robust" univariate scaling, mapping the 1st-3rd quartiles to [0,1].
- The outputs are decorrelated using an "output structure matrix" (provided to the emulator via the `encoder_kwargs` keyword parameter, e.g., as `(; obs_noise_cov = ...)`), and a dimension reduction is applied to a space that retains 95% of the total variance.
- In the reduced input-output space, a canonical correlation analysis is performed. The data is oriented and reduced (if necessary) to maximize the joint correlation between inputs and outputs.
The current default encoder schedule applies `decorrelate_structure_mat()` if a structure matrix (input or output) is provided, and otherwise applies `decorrelate_sample_cov()`.
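Purely schematically, this default rule can be written as follows (the helper `default_processor` is hypothetical, not part of the package):

```julia
# Hypothetical helper: use the structure-matrix decorrelator when a
# structure matrix was supplied in the kwargs, else the sample-covariance one.
default_processor(encoder_kwargs::NamedTuple) =
    haskey(encoder_kwargs, :obs_noise_cov) ? decorrelate_structure_mat() :
                                             decorrelate_sample_cov()
```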
## Creating an emulator with a schedule

The schedule is then passed into the `Emulator`, along with the data and desired structure matrices:
```julia
emulator = Emulator(
    machine_learning_tool,
    input_output_pairs;
    encoder_schedule = complex_schedule,
    encoder_kwargs = (; obs_noise_cov = obs_noise_cov),
)
```

Note that due to the item `(decorrelate_structure_mat(retain_var = 0.95), "out")` in the schedule, we must provide an output structure matrix; in this case, we provide `obs_noise_cov`.
## Additional API
Once the emulator is created, the user does not need to worry about encoding/decoding. However, we provide an external API that can be used to encode both new data vectors (or matrices of columns) and structure matrices (i.e., covariances), for validation or error diagnosis:
```julia
# encoding new data: Vector, `DataContainer`, or Matrix viewed as columns
enc_in_data = encode_data(emulator, new_input_data, "in")
enc_out_data = encode_data(emulator, new_output_data, "out")
# decoding it
dec_in_data = decode_data(emulator, enc_in_data, "in")
dec_out_data = decode_data(emulator, enc_out_data, "out")

# new structure matrix: Matrix (typically acting as a covariance)
enc_in_cov = encode_structure_matrix(emulator, new_input_covariance, "in")
enc_out_cov = encode_structure_matrix(emulator, new_output_covariance, "out")
dec_in_cov = decode_structure_matrix(emulator, enc_in_cov, "in")
dec_out_cov = decode_structure_matrix(emulator, enc_out_cov, "out")
```

## Types of data processors
We currently provide two main types of data processor: the `DataContainerProcessor` and the `PairedDataContainerProcessor`.

The `DataContainerProcessor` encodes "input" data agnostic of the "output" data, and vice versa. Examples of current implementations are:

- `UnivariateAffineScaling`: such as `quartile_scale()`, `minmax_scale()`, and `zscore_scale()`, which apply basic univariate scalings to the data in each dimension.
- `Decorrelator`: such as `decorrelate_structure_mat()`, `decorrelate_sample_cov()`, or `decorrelate()`, which perform (truncated) PCA using either the sample-estimated or user-provided covariance matrices (or their sum).
The `PairedDataContainerProcessor` encodes inputs (or outputs) using information from both the inputs and outputs in pairs:

- `CanonicalCorrelation`: constructed with `canonical_correlation()`, which performs canonical correlation analysis to process the pairs. In effect, this performs PCA on the cross-correlation of the input and output samples.
- `LikelihoodInformed`: constructed with `likelihood_informed()`. We use the data or likelihood from the inverse problem at hand to build diagnostic matrices that are used to find informative directions for dimension reduction. In particular, we build generalizations of current frameworks (e.g., Cui, Zahm 2021; Baptista, Marzouk, Zahm 2022) to use the latest EKP iterations. In fact, we generalize this framework here [Preprint coming soon] for use with partially informed distributions (e.g., EKI iterations).
This is an extensible framework, and so new data processors can be added to this library.
Some effects of the data processing (with an older API) are outlined in a practical setting in the results and appendices of Howland, Dunbar, Schneider (2022).
## A note on LinearMaps and the internals for encoding large observations

To handle the heterogeneity and potentially massive size of the matrices involved in encoding observations, all `obs_noise_cov` and encoder-decoder objects are converted into linear maps using LinearMaps.jl. For high-dimensional inputs or outputs (>3000), these are used within iterative or (stochastically) approximate matrix-free methods, trading some possible additional error for better scaling with dimension. These trade-offs are controllable by the user via keywords that can be passed to certain encoders in the schedule.
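To illustrate why matrix-free representations matter at these sizes, here is a standalone LinearMaps.jl sketch (not CES code): a diagonal-plus-low-rank covariance in 10^5 dimensions can be applied to vectors without ever forming the dense matrix, which would not fit in memory.

```julia
using LinearAlgebra, LinearMaps

d = 100_000                  # dense d×d storage would need ~80 GB
U = randn(d, 5)              # low-rank factor

# Compose the covariance as a lazy sum/product of linear maps: D + U*U'
cov_map = LinearMap(Diagonal(fill(0.1, d))) + LinearMap(U) * LinearMap(U')

v = randn(d)
w = cov_map * v              # O(d) matvec; no d×d matrix is ever formed
```

Iterative solvers and randomized estimators only ever need such matvecs, which is how the encoding internals avoid dense linear algebra in high dimension.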