deafrica_tools.datahandling

Functions for loading and handling Digital Earth Africa data.

License

The code in this notebook is licensed under the Apache License, Version 2.0 (https://www.apache.org/licenses/LICENSE-2.0). Digital Earth Africa data is licensed under the Creative Commons by Attribution 4.0 license (https://creativecommons.org/licenses/by/4.0/).

Contact

If you need assistance, please post a question on the Open Data Cube Slack channel (http://slack.opendatacube.org/) or on the GIS Stack Exchange (https://gis.stackexchange.com/questions/ask?tags=open-data-cube) using the open-data-cube tag (you can view previously asked questions here: https://gis.stackexchange.com/questions/tagged/open-data-cube).

If you would like to report an issue with this script, you can file one on GitHub: https://github.com/digitalearthafrica/deafrica-sandbox-notebooks/issues/new

Functions

array_to_geotiff(fname, data, geo_transform, …)

Create a single band GeoTIFF file with data from an array.

dilate(array[, dilation, invert])

Dilate a binary array by a specified number of pixels using a disk-like radial dilation.

download_unzip(url[, output_dir, remove_zip])

Downloads and unzips a .zip file from an external URL to a local directory.

first(array, dim[, index_name])

Finds the first occurring non-null value along the given dimension.

last(array, dim[, index_name])

Finds the last occurring non-null value along the given dimension.

load_ard(dc[, products, min_gooddata, …])

Loads analysis ready data.

mostcommon_crs(dc, product, query)

Takes a given query and returns the most common CRS for observations returned for that spatial extent.

nearest(array, dim, target[, index_name])

Finds the nearest values to a target label along the given dimension, for all other dimensions.

wofs_fuser(dest, src)

Fuse two WOfS water measurements represented as ndarray objects.

deafrica_tools.datahandling.array_to_geotiff(fname, data, geo_transform, projection, nodata_val=0, dtype=osgeo.gdal.GDT_Float32)

Create a single band GeoTIFF file with data from an array.

Because this works with simple arrays rather than xarray datasets from DEA, it requires geotransform info ((upleft_x, x_size, x_rotation, upleft_y, y_rotation, y_size)) and projection data (in “WKT” format) for the output raster. These are typically obtained from an existing raster using the following GDAL calls:

>>> from osgeo import gdal
>>> gdal_dataset = gdal.Open(raster_path)
>>> geotrans = gdal_dataset.GetGeoTransform()
>>> prj = gdal_dataset.GetProjection()

or alternatively, directly from an xarray dataset:

>>> geotrans = xarraydataset.geobox.transform.to_gdal()
>>> prj = xarraydataset.geobox.crs.wkt
Parameters
  • fname (str) – Output geotiff file path including extension

  • data (numpy array) – Input array to export as a geotiff

  • geo_transform (tuple) – Geotransform for output raster; e.g. (upleft_x, x_size, x_rotation, upleft_y, y_rotation, y_size)

  • projection (str) – Projection for output raster (in “WKT” format)

  • nodata_val (int, optional) – Value to convert to nodata in the output raster; default 0

  • dtype (gdal dtype object, optional) – Optionally set the dtype of the output raster; can be useful when exporting an array of float or integer values. Defaults to gdal.GDT_Float32
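To make the ordering of the geo_transform tuple concrete, the sketch below builds a GDAL-style geotransform for a north-up raster with square pixels and converts a pixel index to map coordinates. The helper names (make_geotransform, pixel_to_map) and the coordinate values are illustrative assumptions and are not part of deafrica_tools.

```python
def make_geotransform(upleft_x, upleft_y, pixel_size):
    """Return a GDAL-style geotransform for a north-up raster with square pixels."""
    # Order: (upleft_x, x_size, x_rotation, upleft_y, y_rotation, y_size).
    # y_size is negative because row numbers increase southwards.
    return (upleft_x, pixel_size, 0.0, upleft_y, 0.0, -pixel_size)


def pixel_to_map(geo_transform, col, row):
    """Return the x/y coordinate of the upper-left corner of pixel (col, row)."""
    upleft_x, x_size, x_rot, upleft_y, y_rot, y_size = geo_transform
    x = upleft_x + col * x_size + row * x_rot
    y = upleft_y + col * y_rot + row * y_size
    return x, y


# Hypothetical 30 m raster whose upper-left corner is at (500000, 9000000).
geotrans = make_geotransform(upleft_x=500000.0, upleft_y=9000000.0, pixel_size=30.0)
corner = pixel_to_map(geotrans, col=10, row=20)
```

A tuple built this way has the same layout as the one returned by GetGeoTransform() above, so it could be passed as geo_transform alongside a WKT projection string.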

deafrica_tools.datahandling.dilate(array, dilation=10, invert=True)

Dilate a binary array by a specified number of pixels using a disk-like radial dilation.

By default, invalid (e.g. False or 0) values are dilated. This is suitable for applications such as cloud masking (e.g. creating a buffer around cloudy or shadowed pixels). This functionality can be reversed by specifying invert=False.

Parameters
  • array (array) – The binary array to dilate.

  • dilation (int, optional) – An optional integer specifying the number of pixels to dilate by. Defaults to 10, which will dilate array by 10 pixels.

  • invert (bool, optional) – An optional boolean specifying whether to invert the binary array prior to dilation. The default is True, which dilates the invalid values in the array (e.g. False or 0 values).

Returns

An array of the same shape as array, with the selected pixels (invalid pixels by default, valid pixels if invert=False) dilated by the number of pixels specified by dilation.

Return type

array
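The effect of invert can be illustrated with a small NumPy-only sketch of a disk-shaped dilation. This is not the deafrica_tools implementation: the function name disk_dilate, the exact disk definition (offsets with dx² + dy² <= radius²) and the choice to return the mask in its original polarity are all assumptions made for illustration.

```python
import numpy as np


def disk_dilate(mask, radius=1, invert=True):
    """Grow the False pixels (invert=True) or True pixels (invert=False) of a 2-D mask."""
    mask = np.asarray(mask, dtype=bool)
    # Select which pixels are grown: invalid (False) ones by default.
    target = ~mask if invert else mask
    rows, cols = target.shape
    padded = np.pad(target, radius, mode="constant", constant_values=False)
    grown = np.zeros_like(target)
    # OR together shifted copies of the target for every offset inside the disk.
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            if dx * dx + dy * dy <= radius * radius:
                grown |= padded[radius + dy:radius + dy + rows,
                                radius + dx:radius + dx + cols]
    # Return the result in the same polarity as the input mask.
    return ~grown if invert else grown


# A 5 x 5 "good data" mask with a single invalid (e.g. cloudy) pixel in the centre.
mask = np.ones((5, 5), dtype=bool)
mask[2, 2] = False
buffered = disk_dilate(mask, radius=1)
```

Here the single invalid pixel grows into a plus-shaped buffer of five invalid pixels, which is the behaviour wanted when padding a cloud mask.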

deafrica_tools.datahandling.download_unzip(url, output_dir=None, remove_zip=True)

Downloads and unzips a .zip file from an external URL to a local directory.

Parameters
  • url (str) – A string giving a URL path to the zip file you wish to download and unzip

  • output_dir (str, optional) – An optional string giving the directory to unzip files into. Defaults to None, which will unzip files in the current working directory

  • remove_zip (bool, optional) – An optional boolean indicating whether to remove the downloaded .zip file after files are unzipped. Defaults to True, which will delete the .zip file.
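The download, extract and clean-up steps described by these parameters can be sketched with the standard library alone. The function below, fetch_and_unzip, is a hypothetical stand-in written for illustration; it is not the deafrica_tools implementation, and its return value (the list of extracted names) is an assumption.

```python
import os
import urllib.request
import zipfile


def fetch_and_unzip(url, output_dir=None, remove_zip=True):
    """Download a .zip from `url`, extract it, and optionally delete the archive."""
    if not url.endswith(".zip"):
        raise ValueError("url must point to a .zip file")
    # Fall back to the current working directory when no output_dir is given.
    output_dir = output_dir or os.getcwd()
    os.makedirs(output_dir, exist_ok=True)
    zip_path = os.path.join(output_dir, os.path.basename(url))
    # Save the archive locally, then extract its contents next to it.
    urllib.request.urlretrieve(url, zip_path)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(output_dir)
        names = archive.namelist()
    if remove_zip:
        os.remove(zip_path)
    return names
```

Passing remove_zip=False keeps the downloaded archive on disk, which can be useful when the same file will be extracted again later.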

deafrica_tools.datahandling.first(array: xarray.DataArray, dim: str, index_name: Optional[str] = None) → xarray.DataArray

Finds the first occurring non-null value along the given dimension.

Parameters
  • array (xr.DataArray) – The array to search.

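  • dim (str) – The name of the dimension to reduce by finding the first non-null value.

  • index_name (str, optional) – If given, the name of a coordinate to be added containing the index of where on the dimension the first value was found.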

Returns

reduced – An array of the first non-null values. The dim dimension will be removed, and replaced with a coord of the same name, containing the value of that dimension where the first value was found.

Return type

xr.DataArray

deafrica_tools.datahandling.last(array: xarray.DataArray, dim: str, index_name: Optional[str] = None) → xarray.DataArray

Finds the last occurring non-null value along the given dimension.

Parameters
  • array (xr.DataArray) – The array to search.

  • dim (str) – The name of the dimension to reduce by finding the last non-null value.

  • index_name (str, optional) – If given, the name of a coordinate to be added containing the index of where on the dimension the last value was found.

Returns

reduced – An array of the last non-null values. The dim dimension will be removed, and replaced with a coord of the same name, containing the value of that dimension where the last value was found.

Return type

xr.DataArray
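The reductions performed by first and last can be illustrated on a plain NumPy array, using NaN as the null value. This is a simplified sketch rather than the library code: the helper names are hypothetical, and it returns bare indices and values instead of an xarray.DataArray with coordinates.

```python
import numpy as np


def first_valid_index(values, axis=0):
    """Index of the first non-NaN value along `axis` (0 if the slice is all NaN)."""
    return np.argmax(~np.isnan(values), axis=axis)


def last_valid_index(values, axis=0):
    """Index of the last non-NaN value along `axis` (the final index if all NaN)."""
    valid = ~np.isnan(values)
    n = values.shape[axis]
    # Search the reversed axis, then convert back to an index on the original axis.
    return n - 1 - np.argmax(np.flip(valid, axis=axis), axis=axis)


def take_along(values, index, axis=0):
    """Pick the value at `index` along `axis` for every remaining position."""
    picked = np.take_along_axis(values, np.expand_dims(index, axis), axis=axis)
    return picked.squeeze(axis)


# Three time steps (rows) for two pixels (columns).
stack = np.array([[np.nan, 1.0],
                  [2.0, np.nan],
                  [3.0, np.nan]])

first_idx = first_valid_index(stack)
first_vals = take_along(stack, first_idx)
last_idx = last_valid_index(stack)
last_vals = take_along(stack, last_idx)
```

The index arrays play the role of the optional index_name coordinate: they record where along the reduced dimension each value was found.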

deafrica_tools.datahandling.load_ard(dc, products=None, min_gooddata=0.0, categories_to_mask_ls={'cloud': 'high_confidence', 'cloud_shadow': 'high_confidence'}, categories_to_mask_s2=['cloud high probability', 'cloud medium probability', 'thin cirrus', 'cloud shadows', 'saturated or defective'], categories_to_mask_s1=['invalid data'], mask_filters=None, mask_pixel_quality=True, ls7_slc_off=True, predicate=None, dtype='auto', verbose=True, **kwargs)

Loads analysis ready data.

Loads and combines USGS Landsat Collection 2, Sentinel-2, and Sentinel-1 data for multiple sensors (i.e. ls5t, ls7e and ls8c for Landsat; s2a and s2b for Sentinel-2), optionally applies pixel quality masks, and drops time steps that contain less than a minimum proportion of good quality (e.g. non-cloudy or non-shadowed) pixels.

The function supports loading the following DE Africa products:

Landsat:
  • ls5_sr (‘sr’ denotes surface reflectance)

  • ls7_sr

  • ls8_sr

  • ls5_st (‘st’ denotes surface temperature)

  • ls7_st

  • ls8_st

Sentinel-2:
  • s2_l2a

Sentinel-1:
  • s1_rtc

Last modified: August 2021

Parameters
  • dc (datacube Datacube object) – The Datacube to connect to, i.e. dc = datacube.Datacube(). This allows you to also use development datacubes if required.

  • products (list) –

    A list of product names to load data from. For example:

    • Landsat C2: [‘ls5_sr’, ‘ls7_sr’, ‘ls8_sr’]

    • Sentinel-2: [‘s2_l2a’]

    • Sentinel-1: [‘s1_rtc’]

  • min_gooddata (float, optional) – An optional float giving the minimum fraction of good quality pixels required for a satellite observation to be loaded. Defaults to 0.0, which will return all observations regardless of pixel quality (set to e.g. 0.99 to return only observations with more than 99% good quality pixels).

  • categories_to_mask_ls (dict, optional) – An optional dictionary that is used to identify poor quality pixels for masking. This mask is used for both masking out low quality pixels (e.g. cloud or shadow), and for dropping observations entirely based on the min_gooddata calculation.

  • categories_to_mask_s2 (list, optional) – An optional list of Sentinel-2 Scene Classification Layer (SCL) names that identify poor quality pixels for masking.

  • categories_to_mask_s1 (list, optional) – An optional list of Sentinel-1 mask names that identify poor quality pixels for masking.

  • mask_filters (iterable of tuples, optional) –

    Iterable of tuples of morphological operations, ("<operation>", <radius>), to apply to the mask, where:

    • operation: string, can be one of these morphological operations:

      • closing = removes small holes in cloud (morphological closing)

      • opening = shrinks away small areas of the mask

      • dilation = adds padding to the mask

      • erosion = shrinks bright regions and enlarges dark regions

    • radius: int

    e.g. mask_filters=[("erosion", 5), ("opening", 2), ("dilation", 2)]

  • mask_pixel_quality (bool, optional) – An optional boolean indicating whether to apply the poor data mask to all observations that were not filtered out for having fewer good quality pixels than min_gooddata. E.g. if min_gooddata=0.99, the filtered observations may still contain up to 1% poor quality pixels. The default of True masks these pixels and sets them to NaN using the poor data mask. This converts numeric values to floating point values, which can cause memory issues; set to False to return the observations without masking and avoid this conversion.

  • ls7_slc_off (bool, optional) – An optional boolean indicating whether to include data from after the Landsat 7 SLC failure (i.e. SLC-off). Defaults to True, which keeps all Landsat 7 observations acquired after 31 May 2003.

  • predicate (function, optional) – An optional function that can be passed in to restrict the datasets that are loaded by the function. A filter function should take a datacube.model.Dataset object as an input (i.e. as returned from dc.find_datasets), and return a boolean. For example, a filter function could be used to return True on only datasets acquired in January: dataset.time.begin.month == 1

  • dtype (string, optional) – An optional parameter that controls the data type/dtype that layers are coerced to after loading. Valid values: ‘native’, ‘auto’, ‘float{16|32|64}’. When ‘auto’ is used, the data will be converted to float32 if masking is used; otherwise the data will be returned in its native data type. Be aware that if data is loaded in its native dtype, nodata and masked pixels will be returned with the data’s native nodata value (typically -999), not NaN. NOTE: If loading Landsat, the data is automatically rescaled, so the ‘native’ dtype will raise a value error.

  • verbose (bool, optional) – If True, print progress statements during loading

  • **kwargs (dict, optional) – A set of keyword arguments to dc.load that define the spatiotemporal query used to extract data. This typically includes measurements, x, y, time, resolution, resampling, group_by and crs. Keyword arguments can either be listed directly in the load_ard call like any other parameter (e.g. measurements=[‘red’]), or by passing in a query kwarg dictionary (e.g. **query). For a list of possible options, see the dc.load documentation: https://datacube-core.readthedocs.io/en/latest/dev/api/generate/datacube.Datacube.load.html

Returns

combined_ds – An xarray dataset containing only satellite observations that contain more than the min_gooddata proportion of good quality pixels.

Return type

xarray Dataset
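How min_gooddata and mask_pixel_quality interact can be illustrated with a small NumPy sketch. It is not the load_ard implementation: the helper name keep_good_timesteps, the toy arrays and the use of >= for the threshold comparison are assumptions made for illustration.

```python
import numpy as np


def keep_good_timesteps(good_mask, min_gooddata=0.0):
    """Return one boolean per time step: True where enough pixels are good.

    good_mask is a boolean array shaped (time, y, x), with True for good pixels.
    """
    per_step = good_mask.reshape(good_mask.shape[0], -1)
    good_fraction = per_step.mean(axis=1)
    return good_fraction >= min_gooddata


good_mask = np.array([
    [[True, True], [True, True]],      # 100% good
    [[True, False], [True, False]],    # 50% good
    [[False, False], [False, False]],  # 0% good
])
data = np.arange(12, dtype="int16").reshape(3, 2, 2)

# Step 1: drop whole time steps that fall below the threshold.
keep = keep_good_timesteps(good_mask, min_gooddata=0.5)
kept_data = data[keep]
kept_mask = good_mask[keep]

# Step 2 (the mask_pixel_quality=True case): set the remaining poor pixels to NaN.
# NaN cannot be stored in an integer array, so the data must become floating point.
masked = np.where(kept_mask, kept_data.astype("float32"), np.nan)
```

The second step shows why masking changes the dtype: integer arrays cannot hold NaN, so masked output is necessarily floating point.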

deafrica_tools.datahandling.mostcommon_crs(dc, product, query)

Takes a given query and returns the most common CRS for observations returned for that spatial extent. This can be useful when your study area lies on the boundary of two UTM zones, forcing you to decide which CRS to use for your output_crs in dc.load.

Parameters
  • dc (datacube Datacube object) – The Datacube to connect to, i.e. dc = datacube.Datacube(). This allows you to also use development datacubes if required.

  • product (str) – A product name to load CRSs from

  • query (dict) – A datacube query including x, y and time range to assess for the most common CRS

Returns

An EPSG string giving the most common CRS from all datasets returned by the query above

Return type

str
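The selection step can be sketched without a datacube connection: given one CRS string per matching dataset, pick the most frequent. The function name most_common and the example CRS strings are hypothetical; how the real function gathers CRSs from the datacube is not shown here.

```python
from collections import Counter


def most_common(crs_list):
    """Return the CRS string that occurs most often in `crs_list`."""
    if not crs_list:
        raise ValueError("no datasets were returned for this query")
    return Counter(crs_list).most_common(1)[0][0]


# Hypothetical CRSs for three datasets over a study area spanning two UTM zones.
crs_per_dataset = ["epsg:32636", "epsg:32637", "epsg:32636"]
chosen = most_common(crs_per_dataset)
```

The chosen string could then be supplied as output_crs in the dc.load query.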

deafrica_tools.datahandling.nearest(array: xarray.DataArray, dim: str, target, index_name: Optional[str] = None) → xarray.DataArray

Finds the nearest values to a target label along the given dimension, for all other dimensions.

E.g. for a DataArray with dimensions ('time', 'x', 'y'),

nearest_array = nearest(array, 'time', '2017-03-12')

will return an array with the dimensions ('x', 'y'), containing for each (x, y) pixel the non-null value found closest to that location along the time dimension.

The returned array will include the 'time' coordinate at which the nearest value was found for each (x, y) pixel.

Parameters
  • array (xr.DataArray) – The array to search.

  • dim (str) – The name of the dimension to look for the target label.

  • target (same type as array[dim]) – The value to look up along the given dimension.

  • index_name (str, optional) – If given, the name of a coordinate to be added containing the index of where on the dimension the nearest value was found.

Returns

nearest_array – An array of the nearest non-null values to the target label. The dim dimension will be removed, and replaced with a coord of the same name, containing the value of that dimension closest to the given target label.

Return type

xr.DataArray
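The per-pixel search can be illustrated with a simplified NumPy sketch that uses numeric coordinates and NaN as the null value. It is not the library implementation: the function name nearest_valid is hypothetical, and a datetime dimension would first need converting to numbers for this version to apply.

```python
import numpy as np


def nearest_valid(values, coords, target):
    """Find, per pixel, the non-NaN value whose coordinate is closest to `target`.

    values has the searched dimension first, e.g. (time, ...);
    coords is a 1-D numeric array with one entry per step of that dimension.
    """
    coords = np.asarray(coords, dtype=float)
    distance = np.abs(coords - target)
    # Broadcast the distances over the other dimensions and exclude NaN entries.
    shape = (-1,) + (1,) * (values.ndim - 1)
    distance = np.where(np.isnan(values), np.inf, distance.reshape(shape))
    index = np.argmin(distance, axis=0)
    nearest = np.take_along_axis(values, index[np.newaxis], axis=0)[0]
    # Also return the coordinate at which each nearest value was found.
    return nearest, coords[index]


# Three steps along the searched dimension (rows) for two pixels (columns).
coords = [0, 10, 20]
values = np.array([[1.0, np.nan],
                   [np.nan, 5.0],
                   [3.0, np.nan]])
nearest_vals, found_at = nearest_valid(values, coords, target=12)
```

For the first pixel the step closest to the target is null, so the search falls through to the next closest valid step, which is the behaviour described above for each (x, y) pixel.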

deafrica_tools.datahandling.wofs_fuser(dest, src)

Fuse two WOfS water measurements represented as ndarray objects.

Note: this is a copy of the function located here: https://github.com/GeoscienceAustralia/digitalearthau/blob/develop/digitalearthau/utils.py