Working with time in xarray

Keywords analysis; time series, data used; sentinel-2, data methods; groupby,:index:data methods; nearest, index:data methods; interpolating, data methods; resampling, data methods; compositing

Contexte

Time series data is a series of data points usually captured at successively spaced points in time. In a remote-sensing context, time series data is a sequence of discrete satellite images taken at the same area at successive times. Time series analysis uses different methods to extract meaningful statistics, patterns and other characteristics of the data. Time series data and analysis has widespread application ranging from monitoring agricultural crops, natural vegetation change detection, mineral prospectivity mapping, and tidal height modelling.

Description

The xarray Python package provides many useful techniques for dealing with time series data that can be applied to Digital Earth Africa data. This notebook demonstrates how to use xarray techniques to:

  1. Select different time periods of data (e.g. year, month, day) from an xarray.Dataset

  2. Use datetime accessors to extract additional information from a dataset’s time dimension

  3. Summarise time series data for different time periods using .groupby() and .resample()

  4. Interpolate time series data to estimate landscape conditions at a specific date that the satellite did not observe

For additional information about the techniques demonstrated below, refer to the xarray time series data guide.


Getting started

To run this analysis, run all the cells in the notebook, starting with the « Load packages » cell.

Load packages

[1]:
%matplotlib inline

import datacube
import matplotlib.pyplot as plt
import numpy as np
import geopandas as gpd
from datacube.utils.geometry import Geometry

from deafrica_tools.datahandling import load_ard, mostcommon_crs
from deafrica_tools.areaofinterest import define_area

Connect to the datacube

[2]:
dc = datacube.Datacube(app='Working_with_time')

Loading Landsat data

First, we load in around two years” of Sentinel-2 data, using the load_ard function and filtering for timesteps with at least 95% good-quality pixels.

To define the area of interest, there are two methods available:

  1. By specifying the latitude, longitude, and buffer. This method requires you to input the central latitude, central longitude, and the buffer value in square degrees around the center point you want to analyze. For example, lat = 10.338, lon = -1.055, and buffer = 0.1 will select an area with a radius of 0.1 square degrees around the point with coordinates (10.338, -1.055).

  2. By uploading a polygon as a GeoJSON or Esri Shapefile. If you choose this option, you will need to upload the geojson or ESRI shapefile into the Sandbox using Upload Files button 988450c61da3402ebfe25421cf71f28b in the top left corner of the Jupyter Notebook interface. ESRI shapefiles must be uploaded with all the related files (.cpg, .dbf, .shp, .shx). Once uploaded, you can use the shapefile or geojson to define the area of interest. Remember to update the code to call the file you have uploaded.

To use one of these methods, you can uncomment the relevant line of code and comment out the other one. To comment out a line, add the "#" symbol before the code you want to comment out. By default, the first option which defines the location using latitude, longitude, and buffer is being used.

[3]:
# Define the location
# Method 1: Specify the latitude, longitude, and buffer
aoi = define_area(lat=13.94, lon=-16.54, buffer=0.125)

# Method 2: Use a polygon as a GeoJSON or Esri Shapefile.
# aoi = define_area(vector_path='aoi.shp')

#Create a geopolygon and geodataframe of the area of interest
geopolygon = Geometry(aoi["features"][0]["geometry"], crs="epsg:4326")
geopolygon_gdf = gpd.GeoDataFrame(geometry=[geopolygon], crs=geopolygon.crs)

# Get the latitude and longitude range of the geopolygon
lat_range = (geopolygon_gdf.total_bounds[1], geopolygon_gdf.total_bounds[3])
lon_range = (geopolygon_gdf.total_bounds[0], geopolygon_gdf.total_bounds[2])


# Create a reusable query
query = {
    'x': lon_range,
    'y': lat_range,
    'time': ('2018-01', '2019-12'),
    'resolution': (-20, 20),
    'measurements':['red', 'green', 'blue', 'nir']
}

# Identify the most common projection system in the input query
output_crs = mostcommon_crs(dc=dc, product='s2_l2a', query=query)

# Load available data from Landsat 8 and filter to retain only times
# with at least 95% good data
ds = load_ard(dc=dc,
              products=['s2_l2a'],
              min_gooddata=0.95,
              output_crs=output_crs,
              align=(15, 15),
              **query)
Using pixel quality parameters for Sentinel 2
Finding datasets
    s2_l2a
Counting good quality pixels for each time step
/usr/local/lib/python3.10/dist-packages/rasterio/warp.py:344: NotGeoreferencedWarning: Dataset has no geotransform, gcps, or rpcs. The identity matrix will be returned.
  _reproject(
/usr/local/lib/python3.10/dist-packages/rasterio/warp.py:344: NotGeoreferencedWarning: Dataset has no geotransform, gcps, or rpcs. The identity matrix will be returned.
  _reproject(
Filtering to 34 out of 144 time steps with at least 95.0% good quality pixels
Applying pixel quality/cloud mask
Loading 34 time steps

Explore xarray data using time

Here we will explore several ways to utilise the time dimension within an xarray.Dataset. This section outlines selecting, summarising and interpolating data at specific times.

Indexing by time

We can select data for an entire year by passing a string to .sel():

[4]:
ds.sel(time='2018')

[4]:
<xarray.Dataset>
Dimensions:      (time: 12, y: 1392, x: 1361)
Coordinates:
  * time         (time) datetime64[ns] 2018-01-08T11:46:56 ... 2018-12-14T11:...
  * y            (y) float64 1.556e+06 1.556e+06 ... 1.528e+06 1.528e+06
  * x            (x) float64 3.2e+05 3.2e+05 3.201e+05 ... 3.472e+05 3.472e+05
    spatial_ref  int32 32628
Data variables:
    red          (time, y, x) float32 2.254e+03 1.988e+03 ... 1.747e+03
    green        (time, y, x) float32 1.502e+03 1.348e+03 ... 1.336e+03
    blue         (time, y, x) float32 815.0 742.0 ... 1.016e+03 1.002e+03
    nir          (time, y, x) float32 3.789e+03 3.519e+03 ... 2.496e+03
Attributes:
    crs:           epsg:32628
    grid_mapping:  spatial_ref

Or select a single month:

[5]:
ds.sel(time='2018-05')

[5]:
<xarray.Dataset>
Dimensions:      (time: 1, y: 1392, x: 1361)
Coordinates:
  * time         (time) datetime64[ns] 2018-05-23T11:39:00
  * y            (y) float64 1.556e+06 1.556e+06 ... 1.528e+06 1.528e+06
  * x            (x) float64 3.2e+05 3.2e+05 3.201e+05 ... 3.472e+05 3.472e+05
    spatial_ref  int32 32628
Data variables:
    red          (time, y, x) float32 2.626e+03 2.49e+03 ... 2.05e+03 1.988e+03
    green        (time, y, x) float32 1.895e+03 1.793e+03 ... 1.542e+03
    blue         (time, y, x) float32 1.112e+03 1.08e+03 ... 1.163e+03 1.132e+03
    nir          (time, y, x) float32 3.536e+03 3.472e+03 ... 2.686e+03
Attributes:
    crs:           epsg:32628
    grid_mapping:  spatial_ref

Or select a range of dates using slice(). This selects all observations between the two dates, inclusive of both the start and stop values:

[6]:
ds.sel(time=slice('2018-06', '2019-01'))

[6]:
<xarray.Dataset>
Dimensions:      (time: 6, y: 1392, x: 1361)
Coordinates:
  * time         (time) datetime64[ns] 2018-10-15T11:35:46 ... 2019-01-18T11:...
  * y            (y) float64 1.556e+06 1.556e+06 ... 1.528e+06 1.528e+06
  * x            (x) float64 3.2e+05 3.2e+05 3.201e+05 ... 3.472e+05 3.472e+05
    spatial_ref  int32 32628
Data variables:
    red          (time, y, x) float32 1.099e+03 1.051e+03 ... 1.951e+03
    green        (time, y, x) float32 1.048e+03 1.058e+03 ... 1.409e+03
    blue         (time, y, x) float32 487.0 529.0 836.0 ... 1.016e+03 976.0
    nir          (time, y, x) float32 3.683e+03 3.957e+03 ... 2.792e+03
Attributes:
    crs:           epsg:32628
    grid_mapping:  spatial_ref

To select the nearest time to a desired time value, we set it to use a nearest neighbour method, 'nearest'. We have to specify the time using a datetime object, otherwise xarray indexing assumes we are selecting a range, like the ds.sel(time='2018-05') month example above.

Here, we have picked a date at the start of December 2018. 'nearest' will find the observation closest to that date.

[7]:
target_time = np.datetime64('2018-12-01')

ds.sel(time=target_time, method='nearest')
[7]:
<xarray.Dataset>
Dimensions:      (y: 1392, x: 1361)
Coordinates:
    time         datetime64[ns] 2018-12-09T11:47:28
  * y            (y) float64 1.556e+06 1.556e+06 ... 1.528e+06 1.528e+06
  * x            (x) float64 3.2e+05 3.2e+05 3.201e+05 ... 3.472e+05 3.472e+05
    spatial_ref  int32 32628
Data variables:
    red          (y, x) float32 2.376e+03 2.035e+03 ... 1.987e+03 1.938e+03
    green        (y, x) float32 1.61e+03 1.375e+03 ... 1.487e+03 1.416e+03
    blue         (y, x) float32 777.0 654.0 970.0 ... 1.179e+03 1.051e+03 960.0
    nir          (y, x) float32 3.633e+03 3.515e+03 ... 2.866e+03 2.748e+03
Attributes:
    crs:           epsg:32628
    grid_mapping:  spatial_ref

You can select the closest time before a given time using ffill (forward-fill).

[8]:
previous_time = ds.sel(time=target_time, method='ffill')

previous_time.blue.plot();

../../../_images/sandbox_notebooks_Frequently_used_code_Working_with_time_21_0.png

To select the closest time after a given time, use bfill (back-fill).

[9]:
next_time = ds.sel(time=target_time, method='bfill')

next_time.blue.plot()
[9]:
<matplotlib.collections.QuadMesh at 0x7fdacd94bd00>
../../../_images/sandbox_notebooks_Frequently_used_code_Working_with_time_23_1.png

The same methods also work on a list of times:

[10]:
many_times = np.array([
    '2018-06-23',
    '2018-09-13',
    '2018-11-02'
], dtype=np.datetime64)

nearest = ds.sel(time=many_times, method='nearest')

nearest.blue.plot(col='time', vmin=0);

../../../_images/sandbox_notebooks_Frequently_used_code_Working_with_time_25_0.png

Using the datetime accessor

xarray allows you to easily extract additional information from the time dimension in Digital Earth Africa data. For example, we can get a list of what season each observation belongs to:

[11]:
ds.time.dt.season

[11]:
<xarray.DataArray 'season' (time: 34)>
array(['DJF', 'DJF', 'DJF', 'DJF', 'MAM', 'MAM', 'MAM', 'MAM', 'SON', 'SON',
       'DJF', 'DJF', 'DJF', 'DJF', 'DJF', 'DJF', 'MAM', 'MAM', 'MAM', 'MAM',
       'MAM', 'MAM', 'MAM', 'MAM', 'MAM', 'JJA', 'JJA', 'JJA', 'SON', 'SON',
       'SON', 'SON', 'SON', 'DJF'], dtype=object)
Coordinates:
  * time         (time) datetime64[ns] 2018-01-08T11:46:56 ... 2019-12-19T11:...
    spatial_ref  int32 32628
Attributes:
    units:    seconds since 1970-01-01 00:00:00

Or the day of the year:

[12]:
ds.time.dt.dayofyear

[12]:
<xarray.DataArray 'dayofyear' (time: 34)>
array([  8,  23,  48,  53,  63,  68,  93, 143, 288, 323, 343, 348,   3,  18,
        43,  53,  68,  73,  83,  98, 108, 123, 133, 143, 148, 158, 183, 188,
       273, 293, 303, 328, 333, 353])
Coordinates:
  * time         (time) datetime64[ns] 2018-01-08T11:46:56 ... 2019-12-19T11:...
    spatial_ref  int32 32628
Attributes:
    units:    seconds since 1970-01-01 00:00:00

Grouping and resampling by time

xarray also provides some shortcuts for aggregating data over time. In the example below, we first group our data by season, then take the median of each group. This produces a new dataset with only four observations (one per season).

[13]:
# Group the time series into seasons, and take median of each time period
ds_seasonal = ds.groupby('time.season').median(dim='time')

# Plot the output
ds_seasonal.nir.plot(col='season', col_wrap=4)
plt.show()

../../../_images/sandbox_notebooks_Frequently_used_code_Working_with_time_31_0.png

We can also use the .resample() method to summarise our dataset into larger chunks of time. In the example below, we produce a median composite for every 6 months of data in our dataset:

[14]:
# Resample to combine each 6 months of data into a median composite
ds_resampled = ds.resample(time="6m").median()

# Plot the new resampled data
ds_resampled.nir.plot(col="time")
plt.show()
../../../_images/sandbox_notebooks_Frequently_used_code_Working_with_time_33_0.png

Interpolating new timesteps

Sometimes, we want to return data for specific times/dates that weren’t observed by a satellite. To estimate what the landscape appeared like on certain dates, we can use the .interp() method to interpolate between the nearest two observations.

By default, the interp() method uses linear interpolation (method='linear'). Another useful option is method='nearest', which will return the nearest satellite observation to the specified date(s).

[15]:
# New dates to interpolate data for
new_dates = ['2018-07-25', '2018-09-01', '2018-12-05']

# Interpolate Landsat values for three new dates
ds_interp = ds.interp(time=new_dates)

# Plot the new interpolated data
ds_interp.nir.plot(col='time')
plt.show()

../../../_images/sandbox_notebooks_Frequently_used_code_Working_with_time_35_0.png

Additional information

License: The code in this notebook is licensed under the Apache License, Version 2.0. Digital Earth Africa data is licensed under the Creative Commons by Attribution 4.0 license.

Contact: If you need assistance, please post a question on the Open Data Cube Slack channel or on the GIS Stack Exchange using the open-data-cube tag (you can view previously asked questions here). If you would like to report an issue with this notebook, you can file one on Github.

Compatible datacube version:

[16]:
print(datacube.__version__)
1.8.15

Last Tested:

[17]:
from datetime import datetime
datetime.today().strftime('%Y-%m-%d')
[17]:
'2023-08-14'