DINCAE.jl

DINCAE (Data-Interpolating Convolutional Auto-Encoder) is a neural network to reconstruct missing data in satellite observations. It can work with gridded data (DINCAE.reconstruct) or a cloud of points (DINCAE.reconstruct_points). In the latter case, the data can (but need not) be organized into e.g. tracks.

The code is available at: https://github.com/gher-uliege/DINCAE.jl

The method is described in the following articles:

  • Barth, A., Alvera-Azcárate, A., Ličer, M., & Beckers, J.-M. (2020). DINCAE 1.0: a convolutional neural network with error estimates to reconstruct sea surface temperature satellite observations. Geoscientific Model Development, 13(3), 1609–1622. https://doi.org/10.5194/gmd-13-1609-2020
  • Barth, A., Alvera-Azcárate, A., Troupin, C., & Beckers, J.-M. (2022). DINCAE 2.0: multivariate convolutional neural network with error estimates to reconstruct sea surface temperature satellite and altimetry observations. Geoscientific Model Development, 15(5), 2183–2196. https://doi.org/10.5194/gmd-15-2183-2022

The neural network will be trained on the GPU. Note that convolutional neural networks can require a lot of GPU memory, depending on the domain size. Flux.jl supports NVIDIA GPUs as well as GPUs from other vendors (see https://fluxml.ai/Flux.jl/stable/gpu/ for details). Training on the CPU is possible, but prohibitively slow.
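Before starting a long training run, one can check whether CUDA and a supported NVIDIA GPU are actually usable with CUDA.jl's CUDA.functional():

```julia
using CUDA

# returns true if the CUDA driver, toolkit and a supported GPU are usable
CUDA.functional()
```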

User API

In most cases, a user only needs to interact with the function DINCAE.reconstruct or DINCAE.reconstruct_points.

DINCAE.reconstruct - Function
reconstruct(Atype,data_all,fnames_rec;...)

Train a neural network to reconstruct missing data using the training data set and periodically run the neural network on the test dataset. The data is assumed to be available on a regular longitude/latitude grid (which is the case of L3 satellite data).

Mandatory parameters

  • Atype: array type to use
  • data_all: list of named tuples. Every tuple should have the fields filename and varname. data_all[1] will be used for training (and perturbed to prevent overfitting). All other entries data_all[2:end] will be reconstructed using the trained network at the epochs defined by save_epochs.
  • fnames_rec: vector of filenames corresponding to the entries data_all[2:end]

Optional parameters:

  • epochs: the number of epochs (default 1000)
  • batch_size: the size of a mini-batch (default 50)
  • enc_nfilter_internal: number of filters of the internal encoding layers (default [16,24,36,54])
  • skipconnections: list of layers with skip connections (default 2:(length(enc_nfilter_internal)+1))
  • clip_grad: maximum allowed gradient. Elements of the gradient larger than this value will be clipped (default 5.0).
  • regularization_L2_beta: Parameter for L2 regularization (default 0, i.e. no regularization)
  • save_epochs: list of epochs where the results should be saved (default 200:10:epochs)
  • is3D: Switch to apply 2D (is3D == false) or 3D (is3D == true) convolutions (default false)
  • upsampling_method: interpolation method during upsampling which can be either :nearest or :bilinear (default :nearest)
  • ntime_win: number of time instances within the time window. This number should be odd. (default 3)
  • learning_rate: initial learning rate of the ADAM optimizer (default 0.001)
  • learning_rate_decay_epoch: the exponential decay rate of the learning rate. After learning_rate_decay_epoch the learning rate is halved. The learning rate is computed as learning_rate * 0.5^(epoch / learning_rate_decay_epoch). learning_rate_decay_epoch can be Inf for a constant learning rate (default)
  • min_std_err: minimum error standard deviation preventing a division close to zero (default exp(-5) = 0.006737946999085467)
  • loss_weights_refine: the weights of the individual refinement layers used in the cost function. If loss_weights_refine has a single element, then there is no refinement. (default (1.,))
Note

The optional parameters should also be tuned for a particular application.

Internally, the time mean is removed (by default) from the data before it is reconstructed. The time mean is added back when the file is saved. However, the mean is undefined for pixels which are marked as valid (sea) by the mask but do not have any valid data in the training dataset.

See DINCAE.load_gridded_nc for more information about the netCDF file.
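A minimal invocation might look as follows. This is only a sketch: the file avhrr.nc, the variable SST and the output file name are hypothetical placeholders, and the exact array type depends on your hardware.

```julia
using DINCAE, CUDA

# use Array{Float32} instead of CuArray{Float32} to run (slowly) on the CPU
Atype = CuArray{Float32}

data_all = [
    (filename = "avhrr.nc", varname = "SST"),  # used for training (perturbed)
    (filename = "avhrr.nc", varname = "SST"),  # reconstructed at save_epochs
]
fnames_rec = ["avhrr_rec.nc"]  # one output file per entry of data_all[2:end]

DINCAE.reconstruct(Atype, data_all, fnames_rec;
    epochs = 1000,
    batch_size = 50,
    save_epochs = 200:10:1000)
```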

DINCAE.reconstruct_points - Function
DINCAE.reconstruct_points(T,Atype,filename,varname,grid,fnames_rec)

Mandatory parameters:

  • T: Float32 or Float64: float-type used by the neural network
  • Atype: Array{T} or CuArray{T}: array type used by the neural network.
  • filename: NetCDF file in the format described below.
  • varname: name of the primary variable in the NetCDF file.
  • grid: tuple of ranges with the grid in the longitude and latitude direction e.g. (-180:1:180,-90:1:90).
  • fnames_rec: NetCDF file names of the reconstruction.

Optional parameters:

  • jitter_std_pos: standard deviation of the noise to be added to the position of the observations (default (5,5))
  • auxdata_files: gridded auxiliary data files for a multivariate reconstruction. auxdata_files is an array of named tuples with the fields filename (the file name of the NetCDF file), varname (the NetCDF name of the primary variable) and errvarname (the NetCDF name of the expected standard deviation error).
  • probability_skip_for_training: For a given time step n, every track from that time step will be skipped with this probability during training (default 0.2). This does not affect tracks from previous (n-1, n-2, ...) and following (n+1, n+2, ...) time steps. The goal of this parameter is to force the neural network to learn to interpolate the data in time.
  • paramfile: the path of the file (netCDF) where the parameter values are stored (default: nothing).

For example, a single entry of auxdata_files could be:

auxdata_files = [
  (filename = "big-sst-file.nc",
   varname = "SST",
   errvarname = "SST_error")]

The data in the file should already be interpolated on the target grid. The file structure of the NetCDF file is described in DINCAE.load_gridded_nc. The fields defined in this file should not contain any missing values (see DIVAnd.ufill).

See DINCAE.reconstruct for other optional parameters.
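As a sketch (the file names and the grid below are hypothetical placeholders), a call could look like:

```julia
using DINCAE, CUDA

T = Float32
Atype = CuArray{T}             # or Array{T} to run on the CPU
grid = (-180:1:180, -90:1:90)  # longitude and latitude ranges
fnames_rec = ["sla_rec.nc"]

DINCAE.reconstruct_points(T, Atype, "all-sla.train.nc", "sla", grid, fnames_rec;
    jitter_std_pos = (5, 5),
    probability_skip_for_training = 0.2)
```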

A minimal example of the NetCDF file is:

netcdf all-sla.train {
dimensions:
	time_instances = 9628 ;
	obs = 7445528 ;
variables:
	int64 size(time_instances) ;
		size:sample_dimension = "obs" ;
	double dates(time_instances) ;
		dates:units = "days since 1900-01-01 00:00:00" ;
	float sla(obs) ;
	float lon(obs) ;
	float lat(obs) ;
	int64 id(obs) ;
	double dtime(obs) ;
		dtime:long_name = "time of measurement" ;
		dtime:units = "days since 1900-01-01 00:00:00" ;
}

The file should contain the variables lon (longitude), lat (latitude), dtime (time of measurement), id (numeric identifier, only used by post-processing scripts) and dates (time instances of the gridded field). The file should use the contiguous ragged array representation specified by the CF convention, which allows grouping data points into "features" (e.g. tracks for altimetry). A feature can also contain a single data point.
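In this representation, the size variable gives the number of observations per feature. The following sketch (using NCDatasets.jl, which is assumed here but not required by the format) shows how to recover the observation indices of every feature from the contiguous ragged array:

```julia
using NCDatasets

ds = NCDataset("all-sla.train.nc")
nobs = ds["size"][:]    # number of observations in every feature (track)
lon = ds["lon"][:]

ends = cumsum(nobs)                 # last obs index of every feature
starts = [1; ends[1:end-1] .+ 1]    # first obs index of every feature
feature(i) = starts[i]:ends[i]      # e.g. lon[feature(3)]: longitudes of track 3
close(ds)
```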


Internal functions

DINCAE.load_gridded_nc - Function
lon,lat,time,data,missingmask,mask = load_gridded_nc(fname,varname; minfrac = 0.05)

Load the variable varname from the NetCDF file fname. The variable lon is the longitude in degrees east, lat is the latitude in degrees north, time is a DateTime vector, data is a 3-d array with the data, missingmask is a boolean mask where true means the data is missing and mask is a boolean mask where true means the data location is valid, e.g. sea points for sea surface temperature.

At the bare minimum, a NetCDF file should have the following variables and attributes:

netcdf file.nc {
dimensions:
        time = UNLIMITED ; // (5266 currently)
        lat = 112 ;
        lon = 112 ;
variables:
        double lon(lon) ;
        double lat(lat) ;
        double time(time) ;
                time:units = "days since 1900-01-01 00:00:00" ;
        int mask(lat, lon) ;
        float SST(time, lat, lon) ;
                SST:_FillValue = -9999.f ;
}

The netCDF mask is 0 for invalid pixels (e.g. land for an ocean application) and 1 for valid pixels (e.g. ocean).
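For instance, a file following this structure could be loaded as follows (a sketch; file.nc and SST are placeholders):

```julia
using DINCAE

lon, lat, time, data, missingmask, mask = DINCAE.load_gridded_nc("file.nc", "SST")
# data[:,:,n] is the field at time[n];
# missingmask[:,:,n] is true where data is missing at that time
```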

DINCAE.NCData - Type
dd = NCData(lon,lat,time,data_full,missingmask,ndims;
            train = false,
            obs_err_std = fill(1.,size(data_full,3)),
            jitter_std = fill(0.05,size(data_full,3)),
            mask = trues(size(data_full)[1:2]))

Return a structure holding the data for training (train = true) or testing (train = false) the neural network. obs_err_std is the error standard deviation of the observations. The variable lon is the longitude in degrees east, lat is the latitude in degrees north, time is a DateTime vector, data_full is a 3-d array with the data and missingmask is a boolean mask where true means the data is missing. jitter_std is the standard deviation of the noise to be added to the data during training.


Reducing GPU memory usage

Convolutional neural networks can require "a lot" of GPU memory. The following changes can reduce GPU memory usage:

  • reduce the mini-batch size
  • use fewer layers (e.g. enc_nfilter_internal = [16,24,36] or [16,24])
  • use fewer filters (reduce the values of the optional parameter enc_nfilter_internal)
  • use a smaller domain or a lower resolution
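For example, starting from the default settings of DINCAE.reconstruct, memory usage can be reduced by lowering batch_size and shrinking enc_nfilter_internal (Atype, data_all and fnames_rec are placeholders here):

```julia
DINCAE.reconstruct(Atype, data_all, fnames_rec;
    batch_size = 25,                      # default is 50
    enc_nfilter_internal = [16, 24, 36])  # default is [16,24,36,54]
```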

Troubleshooting

Installation of cuDNN

If you get the warning Package cuDNN not found in current path or the error Scalar indexing is disallowed:

julia> using DINCAE
┌ Warning: Package cuDNN not found in current path.
│ - Run `import Pkg; Pkg.add("cuDNN")` to install the cuDNN package, then restart julia.
│ - If cuDNN is not installed, some Flux functionalities will not be available when running on the GPU.

You need to install and load cuDNN before calling a function in DINCAE.jl:

using cuDNN
using DINCAE
# ...

Dependencies of DINCAE.jl

DINCAE.jl depends on Flux.jl and CUDA.jl, which will be installed automatically. If you have problems installing these packages, you may consult the documentation of Flux.jl or CUDA.jl.