Reprojecting datacube and raster data

  • Products used: gm_s2_annual

  • Special requirements: This notebook loads data from an external raster file (madagascar_CHPclim_02.tif) from the Supplementary_data folder of this repository

Keywords data used; sentinel-2 geomedian, data used; CHIRPS, data methods; reprojecting, data methods; resampling, data format; GeoTIFF, data methods; geobox

Background

It is often valuable to combine data from the datacube with other external raster datasets. For example, we may want to use a rainfall raster to focus our analysis on satellite data to mask areas of higher or lower precipitation. However, it can be challenging to combine datasets if they have different extents, resolutions (e.g. 20 m vs. 250 m), or coordinate reference systems (e.g. WGS 84 vs. Africa Albers Equal Area Conic). To be able to combine these datasets, we need to be able to reproject them into identical spatial grids prior to analysis.

Datacube stores information about the location, resolution and coordinate reference system of a rectangular grid of data using an object called a GeoBox. This GeoBox object is added to all data loaded from the datacube, and all rasters loaded with xr.open_rasterio (provided that import datacube is run before the raster is loaded). Datacube functions can use this object to provide a template that can be used to reproject raster and datacube data - either applying this reprojection directly when new data is being loaded, or to reproject existing data that has already been loaded into memory. This makes it straightforward to reproject one dataset into the exact spatial grid of another dataset for further analysis.

Description

This notebook demonstrates how to perform three key reprojection workflows when working with datacube data and external rasters:

  1. Loading and reprojecting datacube data to match a raster file using dc.load

  2. Reprojecting existing datacube data to match a raster using xr_reproject

  3. Loading and reprojecting a raster to match datacube data using rio_slurp_xarray


Getting started

To run this analysis, run all the cells in the notebook, starting with the “Load packages” cell.

Load packages

Import Python packages that are used for the analysis.

[1]:
import datacube
import xarray as xr
import matplotlib.pyplot as plt
from odc.algo import xr_reproject
from datacube.testutils.io import rio_slurp_xarray
from datacube.utils.masking import mask_invalid_data

from deafrica_tools.plotting import rgb
/env/lib/python3.6/site-packages/geopandas/_compat.py:88: UserWarning: The Shapely GEOS version (3.7.2-CAPI-1.11.0 ) is incompatible with the GEOS version PyGEOS was compiled with (3.9.0-CAPI-1.16.2). Conversions between both will be slow.
  shapely_geos_version, geos_capi_version_string

Connect to the datacube

Connect to the datacube so we can access DE Africa data. The app parameter is a unique name for the analysis which is based on the notebook file name.

[2]:
dc = datacube.Datacube(app="Reprojecting_data")

Loading and reprojecting datacube data to match a raster

Load a raster file

First we load a GeoTIFF raster from file using xr.open_rasterio. The example below loads a long-term mean rainfall data for the month of February as according to Climate Hazards Group InfraRed Precipitation with Station (CHIRPS) for a section of Madagascar. It has a spatial resolution of ~0.05 decimal degrees (approximately equal to 5 km pixels at the equator) in the WGS 84 (EPSG:4326) coordinate reference system.

[3]:
# Path to raster file
raster_path = "../Supplementary_data/Reprojecting_data/madagascar_CHPclim_02.tif"

# Load raster, and remove redundant "band" dimension
raster = xr.open_rasterio(raster_path).squeeze("band", drop=True)

print(raster)
<xarray.DataArray (y: 20, x: 20)>
array([[351.0889 , 340.83768, 341.8932 , ..., 350.19598, 352.81683, 369.2292 ],
       [355.14917, 343.89148, 344.55936, ..., 342.1835 , 343.45105, 353.25793],
       [353.2198 , 344.27063, 345.2472 , ..., 341.32367, 342.58206, 334.54758],
       ...,
       [367.80493, 363.15268, 357.94885, ..., 311.04022, 305.26425, 299.07462],
       [351.61972, 349.06613, 345.6416 , ..., 305.8714 , 297.39603, 288.17926],
       [343.69983, 342.14966, 340.2362 , ..., 308.5356 , 297.67932, 285.16785]],
      dtype=float32)
Coordinates:
  * y        (y) float64 -18.03 -18.08 -18.13 -18.18 ... -18.88 -18.93 -18.98
  * x        (x) float64 46.03 46.08 46.13 46.18 ... 46.83 46.88 46.93 46.98
Attributes:
    transform:      (0.05000000074505806, 0.0, 46.00000336766243, 0.0, -0.050...
    crs:            +init=epsg:4326
    res:            (0.05000000074505806, 0.05000000074505806)
    is_tiled:       1
    nodatavals:     (nan,)
    scales:         (1.0,)
    offsets:        (0.0,)
    AREA_OR_POINT:  Area

If we plot our loaded raster, we can see there are areas with higher rainfall, and areas with lower rainfall.

[4]:
raster.plot(size=6)
[4]:
<matplotlib.collections.QuadMesh at 0x7f15af0bc588>
../../../_images/sandbox_notebooks_Frequently_used_code_Reprojecting_data_12_1.png

GeoBox objects

Now we have loaded our raster dataset, we can inspect its GeoBox object that we will use to allow us to reproject data. The GeoBox can be accessed using the .geobox method. It includes a set of information that together completely define the spatial grid of our data: * The width (e.g. 20) and height (e.g. 20) of our data in pixels * An Affine object which defines the spatial resolution (e.g. -0.05000000074505806 and 0.05000000074505806) and spatial position (e.g. 46.00000336766243 and -18.00000160932541) of our data * The coordinate reference system of our data (e.g. +init=epsg:4326)

[5]:
raster.geobox
/env/lib/python3.6/site-packages/pyproj/crs/crs.py:280: FutureWarning: '+init=<authority>:<code>' syntax is deprecated. '<authority>:<code>' is the preferred initialization method. When making the change, be mindful of axis order changes: https://pyproj4.github.io/pyproj/stable/gotchas.html#axis-order-changes-in-proj-6
  projstring = _prepare_from_string(projparams)
[5]:
GeoBox(20, 20, Affine(0.05000000074505806, 0.0, 46.00000336766243,
       0.0, -0.05000000074505806, -18.00000160932541), +init=epsg:4326)

The GeoBox also has a number of useful methods that can be used to view information about the spatial grid of our data. For example, we can inspect the spatial resolution of our data:

[6]:
raster.geobox.resolution
[6]:
(-0.05000000074505806, 0.05000000074505806)

Or obtain information about the data’s spatial extent:

[7]:
raster.geobox.extent.boundingbox
[7]:
BoundingBox(left=46.00000336766243, bottom=-19.00000162422657, right=47.00000338256359, top=-18.00000160932541)

Note: For more information about GeoBox objects and a complete list of their methods, refer to the datacube documentation.

Load and reproject datacube data

We can now use datacube to load and reproject satellite data to exactly match the coordinate reference system and resolution of our raster data. By specifying like=raster.geobox, we can load datacube data that will be reprojected to match the spatial grid of our raster.

[8]:
# Load and reproject data from datacube
ds = dc.load(product="gm_s2_annual",
             measurements=["red", "green", "blue"],
             time=("2018"),
             like=raster.geobox,
             resampling="nearest",
             group_by="solar_day")

When we plot the result, it should appear similarly pixelated to the raster we loaded above, which has a lower resolution.

[9]:
rgb(ds, col='time')
../../../_images/sandbox_notebooks_Frequently_used_code_Reprojecting_data_23_0.png

We can also directly compare the geobox of the two datasets to verify they share the same spatial grid:

[10]:
ds.geobox == raster.geobox
[10]:
True

Now that our two datasets share the same spatial grid, we can use our raster as a mask. For example, we can mask out all satellite pixels except those with higher rainfall (e.g. more than 350 mm/month of rain):

[11]:
# Rename raster dimensions to match datacube conventions for data
# with geographic coordinates
if raster.geobox.crs.geographic:
    raster = raster.rename({"x": "longitude", "y": "latitude"})

# Identify high rainfall areas
high_rainfall = raster > 350

# Apply mask to set lower-rainfall areas to `NaN`
ds_masked = ds.where(high_rainfall)
[12]:
# Plot the masked data
rgb(ds_masked)
../../../_images/sandbox_notebooks_Frequently_used_code_Reprojecting_data_28_0.png

Reprojecting existing datacube data to match a raster

The example above demonstrated how to load new satellite data from the datacube to match the spatial grid of a raster. However, sometimes we may have already loaded satellite data with a coordinate reference system and resolution that is different from our raster. In this case, we need to reproject this existing dataset to match our raster.

For example, we have loaded the 2018 Sentinel-2 geomedian data from the datacube with 20 m-resolution pixels in the EPSG:6933 coordinate reference system. Note that in this example we manually specify the x, y, resolution and output_crs parameters, rather than taking them directly from our raster using like=raster.geobox in the previous example.

[13]:
# Load data from datacube
ds = dc.load(product="gm_s2_annual",
             measurements=["red", "green", "blue"],
             time=("2018"),
             x=(46.00, 47.00),
             y=(-19, -18),
             resolution=(-20, 20),
             output_crs="EPSG:6933",
             group_by="solar_day")

# Load raster, and remove redundant "band" dimension
raster = xr.open_rasterio(raster_path).squeeze("band", drop=True)

If we plot our satellite data, we can see that it is much higher resolution than our pixelated CHIRPS precipitation raster:

[14]:
rgb(ds)
../../../_images/sandbox_notebooks_Frequently_used_code_Reprojecting_data_32_0.png

Reproject datacube data

We can now use the xr_reproject function to reproject our existing high resolution satellite dataset. We specify geobox=raster.geobox to request that the data gets reprojected to match the spatial grid of our low resolution raster.

To control how the data is reprojected, we can specify a custom resampling method that will control how our high resolution pixels will be transformed into lower resolution pixels. In this case, we specify "average", which will set the value of each larger pixel to the average of all smaller pixels that fall within its pixel boundary.

Note: Refer to the rasterio documentation for a full list of available resampling methods.

[15]:
# Temporary workaround for bug in `xr_reproject`
if raster.geobox.crs.geographic:
    ds = ds.rename({"x": "longitude", "y": "latitude"})

# Reproject data
ds_reprojected = xr_reproject(src=ds,
                              geobox=raster.geobox,
                              resampling="average")

#Set nodata to `NaN`
ds_reprojected = mask_invalid_data(ds_reprojected)

Now if we plot our reprojected dataset, we can see that our satellite imagery now has a similar resolution to our low resolution raster (e.g. with a pixelated appearance). Note however, that this result will appear smoother than the previous example due to the "average" resampling method used here (compared to "nearest").

[16]:
rgb(ds_reprojected)
../../../_images/sandbox_notebooks_Frequently_used_code_Reprojecting_data_36_0.png

Once again, we can also verify that the two datasets have identical spatial grids:

[17]:
ds_reprojected.geobox == raster.geobox
[17]:
True

Loading and reprojecting a raster to match datacube data

Rather than reprojecting satellite data to match the resolution and projection system of our raster, we may instead wish to reproject our raster to match the spatial grid of our satellite data. This can be particularly useful when we have a lower resolution raster file (e.g. like the ~5 km resolution CHIRPS data we are using in this example), but we don’t want to lose the better spatial resolution of our satellite data.

Load datacube data

As in the previous example, we can load in satellite data from the datacube at 20 m spatial resolution and EPSG:6933 coordinate reference system:

[18]:
# Load data from datacube
ds = dc.load(product="gm_s2_annual",
             measurements=["red", "green", "blue"],
             time=("2018"),
             x=(46.00, 47.00),
             y=(-19, -18),
             resolution=(-20, 20),
             output_crs="EPSG:6933",
             group_by="solar_day")

# Plot 20 m resolution satellite data
rgb(ds)
../../../_images/sandbox_notebooks_Frequently_used_code_Reprojecting_data_40_0.png

Load and reproject raster data

We can now use the rio_slurp_xarray function to load and reproject our raster file to match our higher resolution satellite dataset. We specify gbox=ds.geobox to request that our raster gets reprojected to match the spatial grid of ds. As in the previous xr_reproject example, we can also specify a custom resampling method which will be used during the reprojection. In this case, we specify 'bilinear', which will produce a smooth output without obvious pixel boundaries.

Note: Refer to the rasterio documentation for a full list of available resampling methods.

[19]:
# Load raster and reproject to match satellite dataset
raster_reprojected = rio_slurp_xarray(fname=raster_path,
                                      gbox=ds.geobox,
                                      resampling="bilinear")

# Set nodata to `NaN`
raster_reprojected = mask_invalid_data(raster_reprojected)

If we plot our resampled raster data, it should now appear less pixelated than the original ~250 m resolution raster.

[20]:
raster_reprojected.plot(size=6, vmin=200)
[20]:
<matplotlib.collections.QuadMesh at 0x7f15ad77c320>
../../../_images/sandbox_notebooks_Frequently_used_code_Reprojecting_data_44_1.png

The resampled raster should also match the spatial grid of our higher resolution satellite data:

[21]:
raster_reprojected.geobox == ds.geobox
[21]:
True

Now both of our datasets share the same spatial grid, we can use our resampled raster to mask our higher resolution satellite dataset as we did in the first section (e.g. mask out all areas with less than 350 mm of rainfall in February). Compared to the first example, this masked satellite dataset should appear much higher resolution, with far less obvious pixelation:

[ ]:
# Identify high rainfall areas
high_rainfall = raster_reprojected > 350

# Apply mask to set high rainfall areas to `NaN`
ds_masked = ds.where(high_rainfall)

# Plot the masked data
rgb(ds_masked)

Additional information

License: The code in this notebook is licensed under the Apache License, Version 2.0. Digital Earth Africa data is licensed under the Creative Commons by Attribution 4.0 license.

Contact: If you need assistance, please post a question on the Open Data Cube Slack channel or on the GIS Stack Exchange using the open-data-cube tag (you can view previously asked questions here). If you would like to report an issue with this notebook, you can file one on Github.

Compatible datacube version:

[ ]:
print(datacube.__version__)

Last Tested:

[ ]:
from datetime import datetime
datetime.today().strftime('%Y-%m-%d')