Downloading and streaming data using STAC metadata

Keywords: data used; s2_l2a; Sentinel-2

Background

Digital Earth Africa stores a range of data products on Amazon Web Services' Simple Storage Service (S3) with free public access. These products can be browsed on the interactive Sandbox Explorer. To make it easier to find data in the Digital Earth Africa archive, the Sandbox Explorer also provides a SpatioTemporal Asset Catalog (STAC) endpoint for listing or searching metadata (https://explorer.digitalearth.africa/stac).

STAC is a recently developed specification that provides a common language for describing geospatial information, so that it can more easily be indexed and discovered. DE Africa's STAC metadata can be used to quickly identify all available data for a given product, location or time period. This information can then be used to efficiently download data from the cloud onto a local disk, or to stream data directly into desktop GIS software like QGIS.

Description

This notebook provides a brief introduction to accessing and using Digital Earth Africa’s STAC metadata:

  1. How to construct a STAC metadata API call

  2. How to search for STAC metadata and load the results into Python

  3. How to inspect and plot the unique STAC Items contained in the metadata

  4. How to inspect assets contained within a STAC Item

  5. How to download data using STAC metadata

  6. How to stream data into Python and QGIS using STAC metadata (without downloading it first)


Getting started

To run this analysis, run all the cells in the notebook, starting with the “Load packages” cell.

Load packages

Import Python packages that are used for the analysis.

[1]:
# Force GeoPandas to use Shapely instead of PyGEOS
# In a future release, GeoPandas will switch to using Shapely by default.
import os
os.environ['USE_PYGEOS'] = '0'

import datacube
import json
import urllib.request
import geopandas as gpd
import rioxarray
import odc.aws
from pprint import pprint
from datacube.testutils.io import rio_slurp_xarray

Connect to the datacube

[2]:
dc = datacube.Datacube(app='Downloading_data_with_STAC')

Searching STAC metadata

Construct the STAC API call

First we need to set up some analysis parameters that will be used to search for metadata. This includes product, which is the same product name used to load data directly using dc.load (see Introduction to loading data). In this example we’ll choose a small area over Madagascar.

For a full list of available products, browse the DE Africa Sandbox Explorer.

[3]:
product = 's2_l2a'
start_time = '2020-01-01'
end_time = '2020-01-31'
bbox = [45.0, -20.1, 47.0, -19.8]

We can now combine the parameters above to create a URL that will be used to query Digital Earth Africa’s STAC metadata. This metadata can be previewed in another tab by clicking the URL.

[4]:
root_url = 'https://explorer.digitalearth.africa/stac'
stac_url = f'{root_url}/search?collection={product}&time={start_time}/{end_time}&bbox={str(bbox).replace(" ", "")}&limit=6'
print(stac_url)
https://explorer.digitalearth.africa/stac/search?collection=s2_l2a&time=2020-01-01/2020-01-31&bbox=[45.0,-20.1,47.0,-19.8]&limit=6
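As an aside, the query string above can also be assembled with Python's built-in urllib.parse module rather than by hand. This is a minimal sketch that reproduces the same URL (the safe argument stops the brackets and slash from being percent-encoded):

from urllib.parse import urlencode

# Build the same query string programmatically
params = {'collection': product,
          'time': f'{start_time}/{end_time}',
          'bbox': str(bbox).replace(' ', ''),
          'limit': 6}
print(f'{root_url}/search?{urlencode(params, safe="[]/,")}')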

Load STAC metadata

We can now load metadata from the URL above into Python. STAC metadata is stored in JSON format, which we can read into nested Python dictionaries using the json Python module.

[5]:
with urllib.request.urlopen(stac_url) as url:
    data = json.loads(url.read().decode())
pprint(data, depth=1)
{'context': {...},
 'features': [...],
 'links': [...],
 'numberMatched': 72,
 'numberReturned': 6,
 'stac_version': '1.0.0',
 'type': 'FeatureCollection'}
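As an alternative to constructing URLs and parsing JSON by hand, the pystac-client library can perform the same search against the Explorer endpoint. This is an optional sketch; it assumes pystac-client is installed in your environment (it is not used elsewhere in this notebook) and that the endpoint supports the STAC API search specification:

# Optional: run the same search with pystac-client
# (assumes `pip install pystac-client` has been run)
from pystac_client import Client

catalog = Client.open(root_url)
search = catalog.search(collections=[product],
                        bbox=bbox,
                        datetime=f'{start_time}/{end_time}',
                        max_items=6)
items = list(search.items())
print(f'Found {len(items)} STAC Items')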

Inspecting STAC Items

In the output above, the numberReturned value indicates that our search returned six unique results, known as STAC Items. Each Item is an atomic collection of inseparable data and metadata, such as a single satellite dataset.

Data for each STAC Item is contained in the metadata’s list of features:

[6]:
pprint(data['features'], depth=2)
[{'assets': {...},
  'bbox': [...],
  'collection': 's2_l2a',
  'geometry': {...},
  'id': '7202d52a-0f63-5caa-b6c9-76b6c81cc5ed',
  'links': [...],
  'properties': {...},
  'stac_extensions': [...],
  'stac_version': '1.0.0',
  'type': 'Feature'},
 {'assets': {...},
  'bbox': [...],
  'collection': 's2_l2a',
  'geometry': {...},
  'id': '55c9f67a-85c9-5bd0-a643-e79803ec5f37',
  'links': [...],
  'properties': {...},
  'stac_extensions': [...],
  'stac_version': '1.0.0',
  'type': 'Feature'},
 {'assets': {...},
  'bbox': [...],
  'collection': 's2_l2a',
  'geometry': {...},
  'id': '57b95ad3-ef6f-5292-b679-2132d6839611',
  'links': [...],
  'properties': {...},
  'stac_extensions': [...],
  'stac_version': '1.0.0',
  'type': 'Feature'},
 {'assets': {...},
  'bbox': [...],
  'collection': 's2_l2a',
  'geometry': {...},
  'id': 'b72d17d3-0662-5628-a71c-6203543c897e',
  'links': [...],
  'properties': {...},
  'stac_extensions': [...],
  'stac_version': '1.0.0',
  'type': 'Feature'},
 {'assets': {...},
  'bbox': [...],
  'collection': 's2_l2a',
  'geometry': {...},
  'id': 'af3d247a-f5ce-5489-858b-8eb6ab21b6d4',
  'links': [...],
  'properties': {...},
  'stac_extensions': [...],
  'stac_version': '1.0.0',
  'type': 'Feature'},
 {'assets': {...},
  'bbox': [...],
  'collection': 's2_l2a',
  'geometry': {...},
  'id': '9dc01d64-d562-5ded-ab77-8d37b7ebceb9',
  'links': [...],
  'properties': {...},
  'stac_extensions': [...],
  'stac_version': '1.0.0',
  'type': 'Feature'}]

STAC features are stored as GeoJSON, a widely used file format for storing geospatial vector data. This means we can easily convert them to a spatial object using the geopandas Python module, allowing us to plot and inspect the spatial extents of our data:

[7]:
# Convert features to a GeoDataFrame
gdf = gpd.GeoDataFrame.from_features(data['features'])

# Plot the footprints of each dataset
gdf.plot(alpha=0.8, edgecolor='black')
[7]:
<Axes: >
../../../_images/sandbox_notebooks_Frequently_used_code_Downloading_data_with_STAC_18_1.png
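Because STAC Item geometries follow the GeoJSON convention of WGS 84 coordinates, we can also tag the GeoDataFrame with a CRS and save the footprints to disk for use in other tools. A small sketch (the output filename is arbitrary, and only scalar columns are kept because list-valued fields cannot be written to GeoJSON):

# STAC geometries are GeoJSON, i.e. WGS 84 (EPSG:4326)
gdf = gdf.set_crs('EPSG:4326')

# Save footprints and a few scalar attributes for use in a GIS
gdf[['geometry', 'datetime', 'eo:cloud_cover']].to_file(
    'stac_footprints.geojson', driver='GeoJSON')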

If we print the GeoDataFrame itself, we can see that it contains useful metadata fields that provide information about each dataset:

[8]:
gdf.head(1)
[8]:
geometry title gsd created updated proj:epsg platform view:off_nadir data_coverage instruments ... sentinel:utm_zone sentinel:product_id sentinel:grid_square sentinel:data_coverage sentinel:latitude_band sentinel:valid_cloud_cover proj:shape proj:transform datetime cubedash:region_code
0 POLYGON ((46.91100 -19.97450, 46.89992 -19.028... S2B_MSIL2A_20200101T070219_N0213_R120_T38KQD_2... 10 2020-08-19T00:49:03.181000Z 2020-08-19T00:49:03.181Z 32738 sentinel-2b 0 99.97 [msi] ... 38 S2B_MSIL2A_20200101T070219_N0213_R120_T38KQD_2... QD 99.97 K True [10980, 10980] [10.0, 0.0, 699960.0, 0.0, -10.0, 7900000.0, 0... 2020-01-01T07:05:01Z 38KQD

1 rows × 25 columns

We can use these fields to learn more about our data. For example, we can plot our datasets using the eo:cloud_cover field to show what percentage of each dataset was obscured by cloud (yellow = high cloud cover):

[9]:
# Colour features by cloud cover
gdf.plot(column='eo:cloud_cover',
         cmap='viridis',
         alpha=0.8,
         edgecolor='black',
         legend=True)
[9]:
<Axes: >
../../../_images/sandbox_notebooks_Frequently_used_code_Downloading_data_with_STAC_22_1.png
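The same field can be used to filter our search results before downloading anything. For example, here is a minimal sketch that keeps only relatively cloud-free datasets (the 20% threshold is an arbitrary choice):

# Keep only STAC Items with less than 20% cloud cover
low_cloud = gdf[gdf['eo:cloud_cover'] < 20]
print(f'{len(low_cloud)} of {len(gdf)} datasets have <20% cloud cover')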

Inspecting assets

Each STAC Item listed in features can contain multiple assets. These assets represent unique data files or layers, such as individual remote sensing bands. For Digital Earth Africa's Sentinel-2 products, these include the spectral bands (e.g. 'B05'), the scene classification layer ('SCL'), and metadata, overview and thumbnail files.

[10]:
stac_item = data['features'][0]
pprint(stac_item['assets'], depth=1)
{'AOT': {...},
 'B01': {...},
 'B02': {...},
 'B03': {...},
 'B04': {...},
 'B05': {...},
 'B06': {...},
 'B07': {...},
 'B08': {...},
 'B09': {...},
 'B11': {...},
 'B12': {...},
 'B8A': {...},
 'SCL': {...},
 'WVP': {...},
 'info': {...},
 'metadata': {...},
 'overview': {...},
 'thumbnail': {...},
 'visual': {...}}

Importantly, each asset (for example, 'B05') provides a unique URL (href) that can be used to access or download the data. In this case, the s3:// prefix indicates our data is stored in the cloud on Amazon S3.

[11]:
pprint(stac_item['assets']['B05'])
{'eo:bands': [{'name': 'B05'}],
 'href': 's3://deafrica-sentinel-2/sentinel-s2-l2a-cogs/38/K/QD/2020/1/S2B_38KQD_20200101_0_L2A/B05.tif',
 'proj:epsg': 32738,
 'proj:shape': [5490, 5490],
 'proj:transform': [20.0, 0.0, 699960.0, 0.0, -20.0, 7900000.0, 0.0, 0.0, 1.0],
 'roles': ['data'],
 'title': 'B05',
 'type': 'image/tiff; application=geotiff; profile=cloud-optimized'}
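Since every asset shares this structure, the URLs for an entire STAC Item can be gathered into a dictionary keyed by asset name. A small sketch for convenient lookup:

# Map each asset name to its access URL
asset_urls = {name: asset['href']
              for name, asset in stac_item['assets'].items()}
pprint(asset_urls)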

Downloading files using STAC

Now that we have a URL, we can use this to download data to our local disk. For example, we may want to download data for the 'B05' satellite band in our STAC Item. We can do this using the s3_download function from odc.aws:

[12]:
# Get URL then download
url = stac_item['assets']['B05']['href']
odc.aws.s3_download(url)
[12]:
'B05.tif'

To verify that this file downloaded correctly, we can load it into our notebook as an xarray.DataArray using rioxarray.open_rasterio:

[13]:
# Load data as an xarray.DataArray
downloaded_ds = rioxarray.open_rasterio('B05.tif')

# Plot a small subset of the data
downloaded_ds.isel(x=slice(2000, 3000), y=slice(2000, 3000)).plot(robust=True)
[13]:
<matplotlib.collections.QuadMesh at 0x7f75379de5c0>
../../../_images/sandbox_notebooks_Frequently_used_code_Downloading_data_with_STAC_30_1.png

Note: If this notebook is being run on the DE Africa Sandbox, the saved file will appear in the Frequently_used_code directory in the JupyterLab File Browser. To save it to your local PC, right click on the file and click Download.

Downloading multiple files

To download data from the 'B05' band for each of the six STAC Items returned by our search:

# Get list of all URLs for the 'B05' band
asset = 'B05'
urls = [stac_item['assets'][asset]['href'] for stac_item in data['features']]

# Download each URL
for url in urls:
    print(url)
    odc.aws.s3_download(url)

To download all available bands (i.e. assets) for a single STAC Item:

# Get list of all URLs for all assets in the first STAC Item
stac_item = data['features'][0]
urls = [asset['href'] for asset in stac_item['assets'].values()]

# Download each URL
for url in urls:
    print(url)
    odc.aws.s3_download(url)
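Some assets (such as 'thumbnail' and 'info') are small metadata files rather than imagery. If you only want imagery, the roles field we saw on 'B05' can be used as a filter. A sketch, assuming each imagery asset carries the 'data' role as 'B05' does:

# Download only assets tagged with the 'data' role (i.e. imagery)
data_urls = [asset['href']
             for asset in stac_item['assets'].values()
             if 'data' in asset.get('roles', [])]

for url in data_urls:
    print(url)
    odc.aws.s3_download(url)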

Streaming data without downloading

Due to the increasing size of satellite datasets, downloading data to your own local disk can be slow and consume significant storage. Sometimes it is better to stream data directly from the cloud without downloading it first. This can be particularly powerful for data stored in the Cloud Optimised GeoTIFF (COG) format, which is optimised for efficiently streaming small chunks of an image at a time.

This section demonstrates how data can be streamed directly from the cloud into both Python and QGIS (v3.14). As a first step, we need to convert our Amazon S3 URL (e.g. s3://) into HTTPS format (e.g. https://) so that it can be read by a wider range of tools:

[14]:
# Get URL
url = stac_item['assets']['B05']['href']

# Get https URL (DE Africa's buckets are hosted in the af-south-1 region)
bucket, key = odc.aws.s3_url_parse(url)
https_url = f'https://{bucket}.s3.af-south-1.amazonaws.com/{key}'
https_url
[14]:
'https://deafrica-sentinel-2.s3.af-south-1.amazonaws.com/sentinel-s2-l2a-cogs/38/K/QD/2020/1/S2B_38KQD_20200101_0_L2A/B05.tif'
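Before pointing other software at this URL, it can be worth confirming that it is reachable. A quick check using Python's standard library (a HEAD request retrieves the file's headers without downloading its contents):

# Send a HEAD request to confirm the file is publicly accessible
request = urllib.request.Request(https_url, method='HEAD')
with urllib.request.urlopen(request) as response:
    print(response.status, response.headers['Content-Length'], 'bytes')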

Streaming data into Python

To stream data directly from the cloud into an xarray.DataArray so it can be analysed in Python, we can supply the HTTPS URL above directly to the rioxarray.open_rasterio function:

[15]:
# Load data as an xarray.DataArray
streamed_ds = rioxarray.open_rasterio(https_url)

# Plot a small subset of the data
streamed_ds.isel(x=slice(2000, 3000), y=slice(2000, 3000)).plot(robust=True)
[15]:
<matplotlib.collections.QuadMesh at 0x7f75349a4fa0>
../../../_images/sandbox_notebooks_Frequently_used_code_Downloading_data_with_STAC_36_1.png
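A key advantage of the COG format is that we do not have to read the whole file. Passing a chunks argument to rioxarray.open_rasterio returns a lazy, Dask-backed array, so only the pixels we actually compute or plot are streamed from the cloud. A sketch (the 2048-pixel chunk size is an arbitrary choice):

# Open the COG lazily; nothing is streamed until values are needed
lazy_da = rioxarray.open_rasterio(https_url, chunks={'x': 2048, 'y': 2048})

# Only the chunks overlapping this subset are fetched from S3
subset = lazy_da.isel(x=slice(2000, 3000), y=slice(2000, 3000)).compute()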

Streaming and reprojecting data

The code above streams the entire dataset from the cloud. Sometimes, however, we may only want to stream a portion of a large dataset onto a specific spatial grid (i.e. resolution and coordinate reference system) that exactly matches data we have already loaded from the datacube.

For example, we may have already used dc.load to load data from the datacube into the EPSG:6933 equal-area projection at a 5000 m pixel resolution:

[16]:
# Load data from datacube
ds = dc.load(product='s2_l2a',
             time=('2020-01-01', '2020-01-31'),
             x=(46.8, 48.0),
             y=(-19.8, -19.4),
             resolution=(-5000, 5000),
             output_crs='epsg:6933',
             dask_chunks={})
ds
[16]:
<xarray.Dataset>
Dimensions:      (time: 33, y: 11, x: 24)
Coordinates:
  * time         (time) datetime64[ns] 2020-01-01T07:04:57 ... 2020-01-31T07:...
  * y            (y) float64 -2.428e+06 -2.432e+06 ... -2.472e+06 -2.478e+06
  * x            (x) float64 4.518e+06 4.522e+06 ... 4.628e+06 4.632e+06
    spatial_ref  int32 6933
Data variables: (12/15)
    B01          (time, y, x) uint16 dask.array<chunksize=(1, 11, 24), meta=np.ndarray>
    B02          (time, y, x) uint16 dask.array<chunksize=(1, 11, 24), meta=np.ndarray>
    B03          (time, y, x) uint16 dask.array<chunksize=(1, 11, 24), meta=np.ndarray>
    B04          (time, y, x) uint16 dask.array<chunksize=(1, 11, 24), meta=np.ndarray>
    B05          (time, y, x) uint16 dask.array<chunksize=(1, 11, 24), meta=np.ndarray>
    B06          (time, y, x) uint16 dask.array<chunksize=(1, 11, 24), meta=np.ndarray>
    ...           ...
    B09          (time, y, x) uint16 dask.array<chunksize=(1, 11, 24), meta=np.ndarray>
    B11          (time, y, x) uint16 dask.array<chunksize=(1, 11, 24), meta=np.ndarray>
    B12          (time, y, x) uint16 dask.array<chunksize=(1, 11, 24), meta=np.ndarray>
    SCL          (time, y, x) uint8 dask.array<chunksize=(1, 11, 24), meta=np.ndarray>
    AOT          (time, y, x) uint16 dask.array<chunksize=(1, 11, 24), meta=np.ndarray>
    WVP          (time, y, x) uint16 dask.array<chunksize=(1, 11, 24), meta=np.ndarray>
Attributes:
    crs:           epsg:6933
    grid_mapping:  spatial_ref

We can now use the rio_slurp_xarray function to stream the data we identified using STAC into a format consistent with the ds data we loaded from the datacube. The gbox=ds.geobox argument tells the function to load the data so that it matches the spatial grid of ds. Because ds was loaded at a coarse 5000 m resolution, the output should appear far more pixelated than the previous plots:

[17]:
# Stream the data onto the same spatial grid as `ds`
streamed_ds = rio_slurp_xarray(https_url, gbox=ds.geobox)

# Plot the reprojected data
streamed_ds.plot(robust=True)
[17]:
<matplotlib.collections.QuadMesh at 0x7f7537e423e0>
../../../_images/sandbox_notebooks_Frequently_used_code_Downloading_data_with_STAC_40_1.png
We can confirm that the streamed data now shares the exact spatial grid of ds:

[18]:
ds.geobox == streamed_ds.geobox
[18]:
True

Note: For more about reprojecting data, see the Reprojecting datacube and raster data notebook.
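If you prefer to stay within rioxarray, a similar result can be achieved with its reproject_match method, which reprojects one array onto the grid of another. A sketch, assuming rioxarray can interpret the CRS information that dc.load stores in ds's spatial_ref coordinate:

# Alternative: stream with rioxarray, then match the grid of `ds`
# (`ds.B05` supplies the target CRS, resolution and extent)
matched = (rioxarray.open_rasterio(https_url)
           .squeeze('band')
           .rio.reproject_match(ds.B05))
matched.plot(robust=True)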

Streaming data into QGIS

To stream data directly into GIS software like QGIS without having to download it first, select and copy the HTTPS URL above (e.g. 'https://deafrica-sentinel-2...) to your clipboard, then open QGIS. In QGIS, click Layer > Add Layer > Add Raster Layer:

[Screenshot: the QGIS Layer menu, showing Add Layer > Add Raster Layer]

In the Data Source Manager | Raster dialog, select Protocol: HTTP(S), cloud, etc. Ensure Type is set to HTTP/HTTPS/FTP, and paste the URL you copied into the URI box:

[Screenshot: the Data Source Manager | Raster dialog with the HTTPS URL pasted into the URI box]

Click Add, then Close. After a few moments, the image we identified using STAC will appear on the map. This data is being streamed directly from the cloud: no downloading required!

[Screenshot: the streamed Sentinel-2 image displayed on the QGIS map canvas]


Additional information

License: The code in this notebook is licensed under the Apache License, Version 2.0. Digital Earth Africa data is licensed under the Creative Commons by Attribution 4.0 license.

Contact: If you need assistance, please post a question on the Open Data Cube Slack channel or on the GIS Stack Exchange using the open-data-cube tag (you can view previously asked questions here). If you would like to report an issue with this notebook, you can file one on GitHub.

Compatible datacube version:

[19]:
print(datacube.__version__)
1.8.15

Last Tested:

[20]:
from datetime import datetime
datetime.today().strftime('%Y-%m-%d')
[20]:
'2023-08-14'