Split Apply Combine

Experimental

This is an initial implementation of the split-apply-combine pattern that supports a limited set of features. As a result, the public API is not covered by semantic versioning and is subject to change, although we will try to keep it stable.

If you need a feature that is not yet implemented, please let us know!

The paper "The Split-Apply-Combine Strategy for Data Analysis", written by Hadley Wickham, illustrates that most data processing follows the pattern of:

  • splitting data into groups,
  • applying a reduction, transformation, or filter on each group,
  • combining the results together via concatenation.
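To make the three steps concrete, here is a minimal sketch on a plain Julia array, without ClimaAnalysis. The array shape, the two index groups, and the variable names are assumptions chosen for illustration; they are not part of the ClimaAnalysis API.

```julia
using Statistics: mean

# Toy data laid out as (time, lat, lon), matching the shape used in the tutorial.
data = reshape(collect(1.0:84.0), 4, 3, 7)

# Split: partition the time indices into groups (two hypothetical groups here).
groups = [1:2, 3:4]

# Apply: reduce each group along the time dimension, keeping that dimension.
reduced = [mean(view(data, idx, :, :); dims = 1) for idx in groups]

# Combine: concatenate the per-group results along the time dimension.
combined = cat(reduced...; dims = 1)

size(combined)  # (2, 3, 7)
```

The time dimension of the result has one entry per group, which mirrors what the tutorial below describes for the split dimension of the returned OutputVar.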

ClimaAnalysis currently implements two split operations, GroupAll (a single group containing all the data) and SplitSeason (seasonal groups), and one apply operation, Reduce (reductions).

Tutorial

The split-apply-combine pattern is only defined for OutputVars. Assuming you already have var, an OutputVar, here is a complete example of computing a time average in a functional style.

var
Attributes:
  short_name => lwu
  start_date => 2010-01-01T00:00:00
Dimension attributes:
  time:
    units => s
  lat:
    units => degrees
  lon:
    units => degrees
Data defined over:
  time with 4 elements (0.0 to 7.776e6)
  lat  with 3 elements (-90.0 to 90.0)
  lon  with 7 elements (-180.0 to 180.0)

import ClimaAnalysis
import Statistics: mean

time_averaged_var =
    var |>
    ClimaAnalysis.GroupAll("time") |>
    ClimaAnalysis.Reduce(mean) |>
    ClimaAnalysis.combine

Attributes:
  short_name => lwu
  start_date => 2010-01-01T00:00:00
Dimension attributes:
  time:
    units => s
  lat:
    units => degrees
  lon:
    units => degrees
Data defined over:
  time with 1 element (0.0)
  lat  with 3 elements (-90.0 to 90.0)
  lon  with 7 elements (-180.0 to 180.0)

What is the difference between this example and `average_time`?

This example leaves the attributes unchanged, whereas average_time modifies the attributes to record that the operation took place. In addition, average_time squeezes out (removes) the time dimension, whereas the split-apply-combine pattern keeps it.

First, an AbstractSplitOperation is called on an OutputVar, which produces a SplitApplyVar. In this example, var is piped to GroupAll("time"). The result is a SplitApplyVar, which represents a lazy evaluation of the split-apply-combine pattern on an OutputVar.

Second, an AbstractApplyOperation is called on the SplitApplyVar, which records the apply operation. In this example, we compute a time average over every group, of which there is only one. For Reduce, the reduction function passed must accept an Array and a dims keyword argument, and must return an array with the same number of dimensions where the size along dims is 1 (i.e., the dimension is kept but collapsed to a single element). Functions such as Statistics.mean, sum, minimum, and maximum satisfy this requirement.
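The requirement on the reduction function can be checked on a plain array. The sketch below uses only Base and Statistics; range_reduce and drop_mean are hypothetical helpers written for illustration, not ClimaAnalysis functions. The first follows the calling convention described above, and the second shows the kind of function that does not.

```julia
using Statistics: mean

A = rand(4, 3, 7)

# Built-in reductions keep the reduced dimension with size 1.
size(mean(A; dims = 1))     # (1, 3, 7)
size(maximum(A; dims = 1))  # (1, 3, 7)

# A hypothetical custom reduction with the same calling convention:
# the range (maximum minus minimum) along `dims`.
range_reduce(x; dims) = maximum(x; dims) .- minimum(x; dims)
size(range_reduce(A; dims = 1))  # (1, 3, 7)

# By contrast, a function that drops the reduced dimension returns an
# array with fewer dimensions, so it does not meet the requirement.
drop_mean(x; dims) = dropdims(mean(x; dims); dims)
ndims(drop_mean(A; dims = 1))  # 2
```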

Third, combine is called on the SplitApplyVar, which runs the split, apply, and combine operations to produce the resulting OutputVar. The final result is a time-averaged OutputVar that still has a time dimension. The split dimension is still present in the returned OutputVar, but its size equals the number of groups; for GroupAll, this is always 1. The coordinate value for each group is taken from the first element of that group, so here the single value of the time dimension is the first date of the original time dimension.

julia> ClimaAnalysis.dates(var)
4-element Vector{Dates.DateTime}:
 2010-01-01T00:00:00
 2010-02-01T00:00:00
 2010-03-01T00:00:00
 2010-04-01T00:00:00

julia> ClimaAnalysis.dates(time_averaged_var)
1-element Vector{Dates.DateTime}:
 2010-01-01T00:00:00
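
The rule that each group's coordinate comes from the first element of the group can be sketched with the Dates standard library alone. The index groups and the group_coordinate helper below are assumptions made for illustration; they only mimic the behavior described above and are not how ClimaAnalysis is implemented.

```julia
using Dates

# The four monthly dates from the tutorial's time dimension.
dates = DateTime(2010, 1, 1) .+ Month.(0:3)

# Hypothetical groupings of the time indices.
one_group = [1:4]        # a single group, as with GroupAll
two_groups = [1:2, 3:4]  # illustrative only

# Each group's coordinate is the first date in that group.
group_coordinate(groups) = [first(dates[idx]) for idx in groups]

group_coordinate(one_group)   # [DateTime(2010, 1, 1)]
group_coordinate(two_groups)  # [DateTime(2010, 1, 1), DateTime(2010, 3, 1)]
```

With a single group, the result has one date, 2010-01-01, which agrees with the output of ClimaAnalysis.dates(time_averaged_var) shown above.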