Integrating external data from a CSV

Keywords: data used; landsat 8, external data, data methods; interpolation, data used; csv, intertidal, coastal, water

Background

It is often useful to combine external data (e.g. from a CSV or other dataset) with data loaded from Digital Earth Africa. For example, we may want to combine data from a tide or river gauge with satellite data to determine water levels at the exact moment each satellite observation was made. This allows us to manipulate and extract satellite data and obtain additional insights using our external data source.

Description

This example notebook loads a time series of external tide modelling data from a CSV and combines it with satellite data loaded from Digital Earth Africa. This workflow could be applied to any external time series dataset (e.g. river gauges, tide gauges, rainfall measurements, etc.). The notebook demonstrates how to:

  1. Load a time series of Landsat 8 data

  2. Load in external time series data from a CSV

  3. Convert this data to an xarray.Dataset and link it to each satellite observation by interpolating values at each satellite timestep

  4. Add this new data as a variable in our satellite dataset, and use this to filter satellite imagery to high and low tide imagery

  5. Demonstrate how to swap dimensions between time and the new tide_height variable


Getting started

To run this analysis, run all the cells in the notebook, starting with the "Load packages" cell.

Load packages

[1]:
%matplotlib inline

import datacube
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
from datacube.utils.geometry import Geometry

from deafrica_tools.datahandling import load_ard, mostcommon_crs
from deafrica_tools.areaofinterest import define_area

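# Register pandas datetime converters so matplotlib can plot pandas time series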
pd.plotting.register_matplotlib_converters()

Connect to the datacube

[2]:
dc = datacube.Datacube(app='Integrating_external_data')

Loading satellite data

First we load in three years of Landsat 8 data. We will use Msasani Bay in Tanzania for this demonstration, as we have a CSV of tide height data for this location that we wish to combine with satellite data. We load a single band, nir, which clearly differentiates between water (low values) and land (higher values). This will let us verify that we can use external tide data to identify low and high tide satellite observations.

To define the area of interest, there are two methods available:

  1. By specifying the latitude, longitude, and buffer. This method requires you to input the central latitude, the central longitude, and a buffer value in degrees around the center point you want to analyze. For example, lat = 10.338, lon = -1.055, and buffer = 0.1 will select a 0.2° × 0.2° area centered on the point (10.338, -1.055).

  2. By uploading a polygon as a GeoJSON or Esri Shapefile. If you choose this option, you will need to upload the GeoJSON or Esri Shapefile into the Sandbox using the Upload Files button in the top left corner of the Jupyter Notebook interface. Esri Shapefiles must be uploaded with all of their related files (.cpg, .dbf, .shp, .shx). Once uploaded, you can use the Shapefile or GeoJSON to define the area of interest. Remember to update the code to call the file you have uploaded.

To use one of these methods, uncomment the relevant line of code and comment out the other. To comment out a line, add the "#" symbol before the code you want to disable. By default, the first option, which defines the location using latitude, longitude, and buffer, is used.

[3]:
# Set the area of interest
# Method 1: Specify the latitude, longitude, and buffer
aoi = define_area(lat=-6.745, lon=39.261, buffer=0.06)

# Method 2: Use a polygon as a GeoJSON or Esri Shapefile.
# aoi = define_area(vector_path='aoi.shp')

# Create a geopolygon and geodataframe of the area of interest
geopolygon = Geometry(aoi["features"][0]["geometry"], crs="epsg:4326")
geopolygon_gdf = gpd.GeoDataFrame(geometry=[geopolygon], crs=geopolygon.crs)

# Get the latitude and longitude range of the geopolygon
lat_range = (geopolygon_gdf.total_bounds[1], geopolygon_gdf.total_bounds[3])
lon_range = (geopolygon_gdf.total_bounds[0], geopolygon_gdf.total_bounds[2])

# Create a reusable query
query = {
    'x': lon_range,
    'y': lat_range,
    'time': ('2015-01-01', '2017-12-31'),
    'measurements': ['nir'],
    'resolution': (-30, 30)
}

# Identify the most common projection system in the input query
output_crs = mostcommon_crs(dc=dc, product='ls8_sr', query=query)

# Load Landsat 8 data
ds = load_ard(dc=dc,
              products=['ls8_sr'],
              output_crs=output_crs,
              align=(15, 15),
              group_by='solar_day',
              **query)

print(ds)

Using pixel quality parameters for USGS Collection 2
Finding datasets
    ls8_sr
Applying pixel quality/cloud mask
Re-scaling Landsat C2 data
Loading 65 time steps
<xarray.Dataset>
Dimensions:      (time: 65, y: 443, x: 443)
Coordinates:
  * time         (time) datetime64[ns] 2015-01-07T07:32:04.385652 ... 2017-12...
  * y            (y) float64 -7.389e+05 -7.39e+05 ... -7.522e+05 -7.522e+05
  * x            (x) float64 5.222e+05 5.222e+05 ... 5.354e+05 5.355e+05
    spatial_ref  int32 32637
Data variables:
    nir          (time, y, x) float32 nan nan nan ... 0.006003 0.006718 0.006443
Attributes:
    crs:           epsg:32637
    grid_mapping:  spatial_ref

Integrating external data

Load in a CSV of external data with timestamps

In the code below, we take a CSV file of external data (in this case, half-hourly tide heights for a location in Msasani Bay, Tanzania, for the five-year period between 2014 and 2018, generated using the OTPS TPXO8 tidal model) and link this data back to our satellite time series.

We can load the existing tide height data using the pandas module we imported earlier. The data has a column of time values, which we will set as the index column (roughly equivalent to a dimension in xarray).

[4]:
tide_data = pd.read_csv('../Supplementary_data/Integrating_external_data/msasani_bay_-6.749102_39.261819_2014-2018_tides.csv',
                          parse_dates=['time'],
                          index_col='time')

tide_data.head()

[4]:
                     tide_height
time
2014-01-01 00:00:00        1.798
2014-01-01 00:30:00        1.885
2014-01-01 01:00:00        1.867
2014-01-01 01:30:00        1.743
2014-01-01 02:00:00        1.521
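
The notebook's intermediate cells (converting the tide data to an xarray.Dataset, interpolating tide heights at the time of each satellite observation, and adding the result as a new variable in the satellite dataset) are not reproduced here. A minimal sketch of that linking step might look like the following; xarray's interp performs linear interpolation between the half-hourly tide measurements by default:

# Convert the pandas DataFrame to an xarray.Dataset with a 'time' dimension
tide_xr = tide_data.to_xarray()

# Estimate the tide height at the exact time of each satellite observation
tide_interp = tide_xr.interp(time=ds.time)

# Add the interpolated tide heights as a new variable in the satellite dataset
ds['tide_height'] = tide_interp.tide_height

After this step, ds.tide_height holds one interpolated tide height per satellite observation, which is used for the filtering below.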

Filtering by external data

Now that we have a new variable tide_height in our dataset, we can use xarray indexing methods to manipulate our data using tide heights (e.g. filter by tide to select low or high tide images):

[8]:
# Select a subset of low tide images (tide height less than -1.0 m)
low_tide_ds = ds.sel(time=ds.tide_height < -1.0)

# Select a subset of high tide images (tide height greater than 0.1 m)
high_tide_ds = ds.sel(time=ds.tide_height > 0.1)

We can plot the median of the filtered images for low and high tide to examine the difference between them. Taking the median allows us to view a cloud-free image.

For the low-tide images:

[9]:
# Compute the median of all low tide images to produce a single cloud-free image.
low_tide_composite = low_tide_ds.median(dim='time', skipna=True, keep_attrs=False)

# Plot the median low tide image
low_tide_composite.nir.plot(robust=True)
plt.title('Low tide')
plt.show()

[Figure: median low-tide NIR image, titled 'Low tide']

and for the high-tide images:

[10]:
# Compute the median of all high tide images to produce a single cloud-free image.
high_tide_composite = high_tide_ds.median(dim='time', skipna=True, keep_attrs=False)

# Plot the median high tide image
high_tide_composite.nir.plot(robust=True)
plt.title('High tide')
plt.show()

[Figure: median high-tide NIR image, titled 'High tide']

Note that many tidal flat areas in the low-tide image (blue-green) now appear inundated by water (dark blue/purple) in the high-tide image.

Swapping dimensions based on external data

By default, xarray uses time as one of the main dimensions of a dataset (in addition to x and y). Now that we have a new tide_height variable, we can make it an actual dimension of the dataset in place of time. This enables more advanced operations, such as calculating rolling statistics or aggregating by tide height; a sketch of both follows the output below.

In the example below, you can see that the dataset now has three dimensions (tide_height, x and y). time is no longer a dimension of the dataset, although it is retained as a non-dimension coordinate.

[11]:
print(ds.swap_dims({'time': 'tide_height'}))

<xarray.Dataset>
Dimensions:      (tide_height: 65, y: 443, x: 443)
Coordinates:
    time         (tide_height) datetime64[ns] 2015-01-07T07:32:04.385652 ... ...
  * y            (y) float64 -7.389e+05 -7.39e+05 ... -7.522e+05 -7.522e+05
  * x            (x) float64 5.222e+05 5.222e+05 ... 5.354e+05 5.355e+05
    spatial_ref  int32 32637
  * tide_height  (tide_height) float64 -1.401 -1.382 -0.7692 ... -0.2992 -0.3579
Data variables:
    nir          (tide_height, y, x) float32 nan nan nan ... 0.006718 0.006443
Attributes:
    crs:           epsg:32637
    grid_mapping:  spatial_ref
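
As a hedged illustration of the more advanced operations mentioned above (not part of the original notebook), the sketch below sorts the swapped dataset so tide_height increases monotonically, then bins NIR values by tide height and computes a rolling mean along the new dimension:

# Swap dimensions and sort so the tide_height coordinate is monotonic
ds_tide = ds.swap_dims({'time': 'tide_height'}).sortby('tide_height')

# Mean NIR image within three equal-width tide-height bins
nir_binned = ds_tide.nir.groupby_bins('tide_height', bins=3).mean()

# Rolling mean of NIR across five consecutive tide heights
nir_smoothed = ds_tide.nir.rolling(tide_height=5, center=True, min_periods=1).mean()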

Additional information

License: The code in this notebook is licensed under the Apache License, Version 2.0. Digital Earth Africa data is licensed under the Creative Commons Attribution 4.0 license.

Contact: If you need assistance, please post a question on the Open Data Cube Slack channel or on the GIS Stack Exchange using the open-data-cube tag (you can view previously asked questions here). If you would like to report an issue with this notebook, you can file one on GitHub.

Compatible datacube version:

[12]:
print(datacube.__version__)
1.8.15

Last Tested:

[13]:
from datetime import datetime
datetime.today().strftime('%Y-%m-%d')
[13]:
'2023-08-14'