DataHandling

The DataHandling module is responsible for reading data from files and resampling it onto the simulation grid.

This is no trivial task. Among the challenges:

  • data can be large and cannot be read all in one go and/or held in memory,
  • regridding onto the simulation grid can be very expensive,
  • IO can be very expensive,
  • CPU/GPU communication can be a bottleneck.

The DataHandling takes the divide-and-conquer approach: the various core tasks and features and split into other independent modules (chiefly FileReaders, and Regridders regridder_module). Such modules can be developed, tested, and extended independently (as long as they maintain a consistent interface). For instance, if need arises, the DataHandler can be used (almost) directly to process files with a different format from NetCDF.

The key struct in DataHandling is the DataHandler. The DataHandler contains one or more FileReader(s), a Regridder, and other metadata necessary to perform its operations (e.g., target ClimaCore.Space). The DataHandler can be used for static or temporal data, and exposes the following key functions:

  • regridded_snapshot(time): to obtain the regridded field at the given time. time has to be available in the data.
  • available_times (available_dates): to list all the times (dates) over which the data is defined.
  • previous_time(time/date) (next_time(time/date)): to obtain the time of the snapshot before the given time or date. This can be used to compute the interpolation weight for linear interpolation, or in combination with regridded_snapshot to read a particular snapshot

Most DataHandling functions take either time or date, with the difference being that time is intended as "simulation time" and is expected to be in seconds; date is a calendar date (from Dates.DateTime). Conversion between time and date is performed using the reference date and simulation starting time provided to the DataHandler.

The DataHandler has a caching mechanism in place: once a field is read and regridded, it is stored in the local cache to be used again (if needed). This is a least-recently-used (LRU) cache implemented in DataStructures, which removes the least-recently-used data when its maximum size is reached. The default maximum size is 128.

While the reading backend could be generic, at the moment, this module uses only the NCFileReader.

This extension is loaded when loading ClimaCore and NCDatasets are loaded. In addition to this, a Regridder is needed (which might require importing additional packages) - see Regridders for more information.

It is possible to pass down keyword arguments to underlying constructors in DataHandler with the regridder_kwargs and file_reader_kwargs. These have to be a named tuple or a dictionary that maps Symbols to values.

A DataHandler can contain information about a variable that we read directly from an input file, or about a variable that is produced by composing data from multiple input variables. In the latter case, the input variables may either all come from the same input file, or may each come from a separate input file. The user must provide the composing function, which operates pointwise on each of the inputs, as well as an ordered list of the variable names to be passed to the function. Additionally, input variables that are composed together must have the same spatial and temporal dimensions. Note that, if a non-identity pre-processing function is provided as part of file_reader_kwargs, it will be applied to each input variable before they are composed. Composing multiple input variables is currently only supported with the InterpolationsRegridder, not with TempestRegridder.

Sometimes, the time development of a variable is split across multiple NetCDF files. DataHandler knows how to combine them and treat multiple files as if they were a single one. To use this feature, just pass the list of NetCDF files (while the file don't have to be sorted, it is good practice to pass them sorted in ascending order by time).

Heuristics to do-what-you-mean

DataHandler tries to interpret the files provided and identify if they are split across variables or along the time dimension. The heuristics implement are the following:

  • When just a file is passed, it is assumed that it contains everything
  • When multiple files are passed, DataHandler will assume that the files are split along variables if the number of files is the same the number of variables, otherwise, it will assume that each file contains all the variables for a portion of the total time.
  • When the above assumption is incorrect, you can pass a list of list of files that fully specifies variables and times.

For example,

data_handler = DataHandling.DataHandler(
    ["era1980.nc", "era1981.nc"],
    ["lai_hv", "lai_lv"],
    target_space;
    compose_function = (x, y) -> x + y,
)

In this case, DataHandler will incorrectly assume that lai_hv is contained in era1980.nc, and lai_lv is in era1980.nc. Instead, construct the data_handler by passing a list of lists

files = ["era1980.nc", "era1981.nc"]
data_handler = DataHandling.DataHandler(
    [files, files],
    ["lai_hv", "lai_lv"],
    target_space;
    compose_function = (x, y) -> x + y,
)

where each element of the list is the collection of files that contain the time evolution of that variable.

Example: Linear interpolation of a single data variable

As an example, let us implement a simple linear interpolation for a variable u defined in the era5_example.nc NetCDF file. The file contains monthly averages starting from the year 2000.

import ClimaUtilities.DataHandling
import ClimaCore
import NCDatasets
# Loading ClimaCore and Interpolations automatically loads DataHandling
import Interpolations
# This will load InterpolationsRegridder

import Dates

# Define pre-processing function to convert units of input
unit_conversion_func = (data) -> 1000 * data

data_handler = DataHandling.DataHandler("era5_example.nc",
                                        "u",
                                        target_space,
                                        start_date = Dates.DateTime(2000, 1, 1),
                                        regridder_type = :InterpolationsRegridder,
                                        file_reader_kwargs = (; preprocess_func = unit_conversion_func))

function linear_interpolation(data_handler, time)
    # Time is assumed to be "simulation time", ie seconds starting from start_date

    time_of_prev_snapshot = DataHandling.previous_time(data_handler, time)
    time_of_next_snapshot = DataHandling.next_time(data_handler, time)

    prev_snapshot = DataHandling.regridded_snaphsot(data_handler, time_of_prev_snapshot)
    next_snapshot = DataHandling.regridded_snaphsot(data_handler, time_of_next_snapshot)

    # prev and next snapshots are ClimaCore.Fields defined on the target_space

    return @. prev_snapshot + (next_snapshot - prev_snapshot) *
        (time - time_of_prev_snapshot) / (time_of_next_snapshot - time_of_prev_snapshot)
end

If, for example, the data was split across multiple files named era5_1980.nc, era5_1981.nc, ... (e.g., each file containing one year), we could directly pass the list to the constructor for DataHandler (instead of just passing one file path), which knows how to combine them.

Example appendix: Using multiple input data variables

Suppose that the input NetCDF file era5_example.nc contains two variables u and v, and we care about their sum u + v but not their individual values. We can provide a pointwise composing function to perform the sum, along with the InterpolationsRegridder to produce the data we want, u + v. The preprocess_func passed in file_reader_kwargs will be applied to u and to v individually, before the composing function is applied. The regridding is applied after the composing function. u and v could also come from separate NetCDF files, but they must still have the same spatial and temporal dimensions.

# Define the pointwise composing function we want, a simple sum in this case
compose_function = (x, y) -> x + y
data_handler = DataHandling.DataHandler("era5_example.nc",
                                        ["u", "v"],
                                        target_space,
                                        start_date = Dates.DateTime(2000, 1, 1),
                                        regridder_type = :InterpolationsRegridder,
                                        file_reader_kwargs = (; preprocess_func = unit_conversion_func),
                                        compose_function)

API

ClimaUtilities.DataHandling.DataHandlerFunction
DataHandler(file_paths,
            varnames,
            target_space::ClimaCore.Spaces.AbstractSpace;
            start_date::Dates.DateTime = Dates.DateTime(1979, 1, 1),
            regridder_type = nothing,
            cache_max_size::Int = 2,
            regridder_kwargs = (),
            file_reader_kwargs = ())

Create a DataHandler to read varnames from file_paths and remap them to target_space.

This function supports reading across multiple files and composing variables that are in different files.

file_paths may contain either one path for all variables or one path for each variable. In the latter case, the entries of file_paths and varnames are expected to match based on position.

The DataHandler maintains an LRU cache of Fields that were previously computed. The default size for the cache is only two fields, so if you expect to re-use the same fields often, increasing the cache size can lead to improved performances.

Creating this object results in the file being accessed (to preallocate some memory).

Positional arguments

  • file_paths: Paths of the NetCDF file(s) that contain the input data. file_paths should be as "do-what-I-mean" as possible, meaning that it should behave as you expect.

    To be specific, there are three options for file_paths:

    • It is a string that points to a single NetCDF file.
    • It is a list that points to multiple NetCDF files. In this case, we support two modes:
      1. if varnames is a vector with the number of entries as file_paths, we assume that each file contains a different variable.
      2. otherwise, we assume that each file contains all the variables and is temporal chunk.
    • It is a list of lists of paths to NetCDF files, where the inner list identifies temporal chunks of a given variable, and the outer list identifies different variables (supporting the mode where different variables live in different files and their time development is split across multiple files). In other words, file_paths[i] is the list of files that define the temporal evolution of varnames[i]
  • varnames: Names of the datasets in the NetCDF that have to be read and processed.

  • target_space: Space where the simulation is run, where the data has to be regridded to.

Keyword arguments

Time/date information will be ignored for static input files. (They are still set to make everything more type stable.)

  • start_date: Calendar date corresponding to the start of the simulation.
  • regridder_type: What type of regridding to perform. Currently, the ones implemented are :TempestRegridder (using TempestRemap) and :InterpolationsRegridder (using Interpolations.jl). TempestRemap regrids everything ahead of time and saves the result to HDF5 files. Interpolations.jl is online and GPU compatible but not conservative. If the regridder type is not specified by the user, and multiple are available, the default :InterpolationsRegridder regridder is used.
  • cache_max_size: Maximum number of regridded fields to store in the cache. If the cache is full, the least recently used field is removed.
  • regridder_kwargs: Additional keywords to be passed to the constructor of the regridder. It can be a NamedTuple, or a Dictionary that maps Symbols to values.
  • file_reader_kwargs: Additional keywords to be passed to the constructor of the file reader. It can be a NamedTuple, or a Dictionary that maps Symbols to values.
  • compose_function: Function to combine multiple input variables into a single data variable. The default, to be used in the case of one input variable, is the identity. The compose function has to take N arguments, where N is the number of variables in varnames, and return a scalar. The order of the arguments in compose_function has to match the order of varnames. This function will be broadcasted to data read from file.
source
ClimaUtilities.DataHandling.previous_timeFunction
previous_time(data_handler::DataHandler, time::AbstractFloat)
previous_time(data_handler::DataHandler, date::Dates.DateTime)

Return the time in seconds of the snapshot before the given time. If time is one of the snapshots, return itself.

If time is not in the data_handler, return an error.

source
ClimaUtilities.DataHandling.next_timeFunction
next_time(data_handler::DataHandler, time::AbstractFloat)
next_time(data_handler::DataHandler, date::Dates.DateTime)

Return the time in seconds of the snapshot after the given time. If time is one of the snapshots, return the next time.

If time is not in the data_handler, return an error.

source
ClimaUtilities.DataHandling.previous_dateFunction
DataHandling.previous_date(data_handler::DataHandler, time::Dates.TimeType)

Return the date of the snapshot before the given date. If date is one of the snapshots, return itself.

If date is not in the data_handler, return an error.

source
ClimaUtilities.DataHandling.next_dateFunction
DataHandling.next_date(data_handler::DataHandler, time::Dates.TimeType)

Return the date of the snapshot after the given time. If date is one of the snapshots, return itself.

If date is not in the data_handler, return an error.

source
ClimaUtilities.DataHandling.regridded_snapshotFunction
regridded_snapshot(data_handler::DataHandler, date::Dates.DateTime)
regridded_snapshot(data_handler::DataHandler, time::AbstractFloat)
regridded_snapshot(data_handler::DataHandler)

Return the regridded snapshot from data_handler associated to the given time (if relevant).

The time has to be available in the data_handler.

When using multiple input variables, the varnames argument determines the order of arguments to the compose_function function used to produce the data variable.

regridded_snapshot potentially modifies the internal state of data_handler and it might be a very expensive operation.

source
ClimaUtilities.DataHandling.regridded_snapshot!Function
regridded_snapshot!(dest::ClimaCore.Fields.Field, data_handler::DataHandler, date::Dates.DateTime)

Write to dest the regridded snapshot from data_handler associated to the given time.

The time has to be available in the data_handler.

regridded_snapshot! potentially modifies the internal state of data_handler and it might be a very expensive operation.

source
ClimaUtilities.DataHandling.dtFunction
dt(data_handler::DataHandler)

Return the time interval between data points for the data in data_handler.

This requires the data to be defined on a equispaced temporal mesh.

source