Urban area mapping using Sentinel 1 data

Keywords data used; sentinel 1,:index:SAR,:index:urban, :index:`analysis;


Urban areas account for a small proportion of global land cover, but support daily human life and exert a great influence on environmental and ecological changes (Xia et al. 2019). This means that constant monitoring of the built environment is crucial for sustainable development. There are different techniques used to classify urban areas using both optical and radar data, one of the simplest is k-means clustering (unsupervised machine learning).

While urban areas can be mapped using optical data, the image quality is strongly affected by weather conditions such as cloud cover. This limits the temporal avaliability of clear images in tropical areas that experience very cloudy weather and heavy rains. The image quality of radar data is independent of daylight and weather conditions, which makes it more suitable for urban area mapping. For more information, see the Sentinel-1 notebook.


This notebook uses k-means clustering to classify land as ‘urban’ then compares those results with the ESA WorldCover global land cover product for the year 2020.

The choice of the number of clusters to use for the k-means clustering and the pixel value that represents the urban land cover class can be informed by comparing the prediction images with the “ground truth” dataset.

This notebook contains the following steps:

  1. Select a location and time range for the analysis.

  2. Load Sentinel-1 backscatter data for the area of interest.

  3. Convert the digital numbers to dB values for analysis.

  4. Generate a median VH and VV polarization composite image from the Sentinel 1 data.

  5. Perform k-means clustering on the median composite image.

  6. Show the k-means clustering urbanization prediction image.

  7. Load and show the “ground truth” ESA Worldcover data for the year 2020.

  8. Compare the urbanization prediction with the “ground truth” data visually and statistically.

Getting started

To run this analysis, run all the cells in the notebook, starting with the “Load packages” cell.

Load packages

Import the Python packages that are used for the analysis.

# Load the necessary Python packages.
%matplotlib inline

# Force GeoPandas to use Shapely instead of PyGEOS
# In a future release, GeoPandas will switch to using Shapely by default.
import os
os.environ['USE_PYGEOS'] = '0'

import warnings

import datacube
import matplotlib.colors as mcolours
import matplotlib.pyplot as plt
import numpy as np
import geopandas as gpd
import xarray as xr
import matplotlib.colors as mcolors
from matplotlib.patches import Patch

# Turn off  RuntimeWarning: divide by zero or RuntimeWarning: invalid value warnings.
import numpy as np
np.seterr(divide='ignore', invalid='ignore')

from deafrica_tools.classification import sklearn_flatten, sklearn_unflatten
from deafrica_tools.dask import create_local_dask_cluster
from deafrica_tools.datahandling import load_ard
from deafrica_tools.plotting import display_map, plot_lulc
from datacube.utils.geometry import Geometry
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.preprocessing import StandardScaler
from deafrica_tools.areaofinterest import define_area

Set up a Dask cluster

Dask can be used to better manage memory use down and conduct the analysis in parallel. For an introduction to using Dask with Digital Earth Africa, see the Dask notebook.

Note: We recommend opening the Dask processing window to view the different computations that are being executed; to do this, see the Dask dashboard in DE Africa section of the Dask notebook.

To use Dask, set up the local computing cluster using the cell below.

[ ]:

Connect to the datacube

Connect to the datacube so we can access the Digital Earth Africa data. The app parameter is a unique name for the analysis which is based on the notebook file name.

dc = datacube.Datacube(app="Urban_area_mapping")

Analysis parameters

The following cell sets the parameters, which define the area of interest and the length of time to conduct the analysis. The parameters are: * lat: The central latitude of the area of interest to analyse. * lon: The central longitude of the area of interest to analyse. * buffer: The number of square degrees to load around the central latitude and longitude. For reasonable loading times, set this as 0.1 or lower. * time_range: The time range for your analysis, e.g.('2020') if you wanted data from all of the year 2020.

Select location

To define the area of interest, there are two methods available:

  1. By specifying the latitude, longitude, and buffer. This method requires you to input the central latitude, central longitude, and the buffer value in square degrees around the center point you want to analyze. For example, lat = 10.338, lon = -1.055, and buffer = 0.1 will select an area with a radius of 0.1 square degrees around the point with coordinates (10.338, -1.055).

  2. By uploading a polygon as a GeoJSON or Esri Shapefile. If you choose this option, you will need to upload the geojson or ESRI shapefile into the Sandbox using Upload Files button f77f418324ea484ea25a73bf0da8d865 in the top left corner of the Jupyter Notebook interface. ESRI shapefiles must be uploaded with all the related files (.cpg, .dbf, .shp, .shx). Once uploaded, you can use the shapefile or geojson to define the area of interest. Remember to update the code to call the file you have uploaded.

To use one of these methods, you can uncomment the relevant line of code and comment out the other one. To comment out a line, add the "#" symbol before the code you want to comment out. By default, the first option which defines the location using latitude, longitude, and buffer is being used.

If running the notebook for the first time, keep the default settings below. This will demonstrate how the analysis works and provide meaningful results. The example covers part of Nairobi County, Kenya.

# Method 1: Specify the latitude, longitude, and buffer
aoi = define_area(lat=-1.2933, lon=36.8379, buffer=0.1)

# Method 2: Use a polygon as a GeoJSON or Esri Shapefile.
#aoi = define_area(vector_path='aoi.shp')

#Create a geopolygon and geodataframe of the area of interest
geopolygon = Geometry(aoi["features"][0]["geometry"], crs="epsg:4326")
geopolygon_gdf = gpd.GeoDataFrame(geometry=[geopolygon], crs=geopolygon.crs)

# Get the latitude and longitude range of the geopolygon
lat_range = (geopolygon_gdf.total_bounds[1], geopolygon_gdf.total_bounds[3])
lon_range = (geopolygon_gdf.total_bounds[0], geopolygon_gdf.total_bounds[2])

# Time frame for the analysis.
time_range = ("2020-01", "2020-04")

View the selected location

The next cell will display the selected area on an interactive map. Feel free to zoom in and out to get a better understanding of the area you’ll be analysing. Clicking on any point of the map will reveal the latitude and longitude coordinates of that point.

# View the study area
display_map(x=lon_range, y=lat_range)
Make this Notebook Trusted to load map: File -> Trust Notebook

Load and view the Sentinel-1 data

Create a datacube query object

We will create a dictionary that will contain the parameters that will be used to load the Sentinel 1 data from the Digital Earth Africa datacube.

query = {
    "y": lat_range,
    "x": lon_range,
    "time": time_range,
    "output_crs": "EPSG:6933",
    "resolution": (-20, 20),
    "dask_chunks": dict(x=1000, y=1000),

Load the Sentinel 1 data

The first step in the analysis is to load Sentinel-1 backscatter data for the specified area of interest. This uses the pre-defined load_ard utility function. The load_ard function is used here to load an analysis ready dataset free of shadow, and missing data.

ds = load_ard(
    dc=dc, products=["s1_rtc"], measurements=["vv", "vh"], group_by="solar_day", **query

Using pixel quality parameters for Sentinel 1
Finding datasets
Applying pixel quality/cloud mask
Returning 30 time steps as a dask array
Dimensions:      (time: 30, y: 1276, x: 965)
  * time         (time) datetime64[ns] 2020-01-04T15:56:20.113636 ... 2020-04...
  * y            (y) float64 -1.522e+05 -1.522e+05 ... -1.777e+05 -1.777e+05
  * x            (x) float64 3.545e+06 3.545e+06 ... 3.564e+06 3.564e+06
    spatial_ref  int32 6933
Data variables:
    vv           (time, y, x) float32 dask.array<chunksize=(1, 1000, 965), meta=np.ndarray>
    vh           (time, y, x) float32 dask.array<chunksize=(1, 1000, 965), meta=np.ndarray>
    crs:           EPSG:6933
    grid_mapping:  spatial_ref

Convert the Digital Number (DN) values to Decibel values (dB)

The Sentinel-1 backscatter data is provided as digital number (DN), which can be converted to backscatter in decibel unit (dB) using the function:

\begin{equation} 10 * \log_{10}\left(\text{DN} \right) \end{equation}

It is often useful to convert the backscatter to decible (dB) for analysis because the backscatter in dB unit has a more symmetric noise profile and less skewed value distribution for easier statistical evaluation.

# Convert DN to db values.
ds["vv"] = 10 * np.log10(ds.vv)
ds["vh"] = 10 * np.log10(ds.vh)

View the Sentinel 1 data

# Plot the first VH and VV observation for the year 2020.
fig, ax = plt.subplots(1, 2, figsize=(14, 6), sharey=True)
ds.vh.isel(time=0).plot.imshow(cmap="Greys_r", robust=True, ax=ax[0])
ds.vv.isel(time=0).plot.imshow(cmap="Greys_r", robust=True, ax=ax[1])
ax[0].set_title("VH (db)")
ax[1].set_title("VV (db)");
/usr/local/lib/python3.10/dist-packages/dask/core.py:119: RuntimeWarning: divide by zero encountered in log10
  return func(*(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.10/dist-packages/dask/core.py:119: RuntimeWarning: divide by zero encountered in log10
  return func(*(_execute_task(a, cache) for a in args))

Generate a median value composite image

We will combine all the VH and VV observations in our Sentinel 1 ds xarray.Dataset into a single, complete (or near complete) image representing the median of the time period.

Note, here we will .compute() the median, which will bring our data into memory. This will take a couple of minutes to run.

# Obtain the median of all VH and VV observations for the time period.
ds = ds.median(dim="time").compute()

fig, ax = plt.subplots(1, 2, figsize=(14, 6), sharey=True)
ds.vh.plot.imshow(cmap="Greys_r", robust=True, ax=ax[0])
ds.vv.plot.imshow(cmap="Greys_r", robust=True, ax=ax[1])
ax[0].set_title("VH (db)")
ax[1].set_title("VV (db)");
/usr/local/lib/python3.10/dist-packages/dask/core.py:119: RuntimeWarning: divide by zero encountered in log10
  return func(*(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.10/dist-packages/dask/core.py:119: RuntimeWarning: divide by zero encountered in log10
  return func(*(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.10/dist-packages/dask/core.py:119: RuntimeWarning: divide by zero encountered in log10
  return func(*(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.10/dist-packages/dask/core.py:119: RuntimeWarning: divide by zero encountered in log10
  return func(*(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.10/dist-packages/dask/core.py:119: RuntimeWarning: divide by zero encountered in log10
  return func(*(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.10/dist-packages/rasterio/warp.py:344: NotGeoreferencedWarning: Dataset has no geotransform, gcps, or rpcs. The identity matrix will be returned.

Classification using K-means Clustering

In the next cell we will create a set of functions which will together be used to perform k-means clustering on our median value composite image. This functions are adapted from the ones used here.

# Defining functions to use for the k-means clustering.
def show_clustered(predicted_ds):
    Takes the predicted xarray DataArray and plots it.

    Last modified: October 2022
    predicted_ds : xarray DataArrray
    The xarray DataArray which is the result of the k-means clustering.

    An plot of the predicted_ds.

    # Display predicted_ds Dataset with upto 6 unique classes.
    image = predicted_ds

    # Get the number of unique classes in the image.
    class_values = np.unique(image)
    no_classes = len(class_values)

    # Define the color map to use to plot.
    cmap = plt.get_cmap(name="tab20b", lut=no_classes)

    # Define where the transition is from one color to the next.
    bounds = list(np.arange(-0.5, no_classes, 1))
    norm = mcolors.BoundaryNorm(bounds, cmap.N)

    # Plot the image.
    im = image.plot.imshow(cmap=cmap, norm=norm, add_colorbar=True, figsize=(6, 6))

    # Add a title to the plot.
    title = f"K-means Clustering Predicted Image using {no_classes} clusters"

    # Add a colour bar to the plot.
    cb = im.colorbar
    cb.set_label("Kmeans Clustering cluster/class values", labelpad=20)


def kmeans_clustering(input_xr, cluster_range):
    Perform sklearn Kmeans clustering on the input Dataset
    or Data Array.

    Last modified: November 2021

    input_xr : xarray.DataArray or xarray.Dataset
        Must have dimensions 'x' and 'y', may have dimension 'time'.

    cluster_range : list
        A list of the number of clusters to use to perform the k-means clustering
        on the input_xr Dataset.

    results : dictionary
        A dictionary with the number of clusters as keys and the predicted xarray.DataArrays
        as the values. Each predicted xarray.DataArray has the same dimensions 'x', 'y' and
        'time' as the input_xr.


    # Use the sklearn_flatten function to convert the Dataset or DataArray into a 2 dimensional numpy array.
    model_input = sklearn_flatten(input_xr)

    # Standardize the data.
    scaler = StandardScaler()
    model_input = scaler.fit_transform(model_input)

    # Dictionary to save results
    results = {}

    # Perform Kmeans clustering on the input dataset for each number of clusters
    # in the cluster_range list.
    for no_of_clusters in cluster_range:
        # Set up the kmeans classification by specifying the number of clusters
        # with initialization as k-means++.
        km = KMeans(n_clusters=no_of_clusters, init="k-means++", random_state=1)

        # Begin iteratively computing the position of the clusters.

        # Use the sklearn kmeans .predict method to assign all the pixels of the
        # model input to a unique cluster.
        flat_predictions = km.predict(model_input)

        # Use the sklearn_unflatten function to convert the flat predictions into a
        # xarray DataArray.
        predicted = sklearn_unflatten(flat_predictions, input_xr)
        predicted = predicted.transpose("y", "x")

        # Append the results to a dictionary using the number of clusters as the
        # column as an key.
        results.update({str(no_of_clusters): predicted})

    return results

In the next cell, we will use a range of clusters to perform k-means classification on our median composite dataset.

Below, enter a list of bands to include in the classification

bands = ["vv"]

Now we will run the k-means classification, and then plot the results

cluster_range = [2, 3, 4]
results = kmeans_clustering(ds[bands], cluster_range)
# Plot each of the predicted images.
for predicted_ds in results.values():

Decide which model and class to assign as ‘urban’

From the plotted predicted images above the best number of clusters to use is 3 clusters. In this image, the most likely pixel value to represent the urban/buit up land cover class is the pixel value 1.

Note: this may change depending on the area of interest and number of clusters used. In this example, the initial state has been arbitrarily defined using the random_state argument in the kmeans_clustering function.

Set the key and the pixel_value representing the model and value, respectively, that you want to assign as urban areas

# Mask the dataset to retain the pixels which are most likely to be urban/built up.
key = "3"
pixel_value = 1

clustering_predicted_ds = results[key] == pixel_value
clustering_predicted_ds.plot.imshow(figsize=(6, 6), add_colorbar=False)
plt.title("K-means Clustering Predicted image");

Validation of the k-means clustering classification

We will compare the performance of the urban area k-means clustering classification result against a built area (urban area) map for the study area derived from the ESA World Cover, global 10 m land use/land cover data from 2020.

Obtain the validation dataset

# Load the ESA land use land cover product over the same region as the Sentinel 1 dataset.
ds_esa = dc.load(product="esa_worldcover_2020", like=ds.geobox).squeeze()

Dimensions:         (y: 1276, x: 965)
    time            datetime64[ns] 2020-07-01T12:00:00
  * y               (y) float64 -1.522e+05 -1.522e+05 ... -1.777e+05 -1.777e+05
  * x               (x) float64 3.545e+06 3.545e+06 ... 3.564e+06 3.564e+06
    spatial_ref     int32 6933
Data variables:
    classification  (y, x) uint8 10 10 10 10 10 10 10 ... 30 60 50 60 60 60 60
    crs:           PROJCS["WGS 84 / NSIDC EASE-Grid 2.0 Global",GEOGCS["WGS 8...
    grid_mapping:  spatial_ref

Plot the ESA land use land cover product.

# Plot the ESA land use land cover product.
fig, ax = plt.subplots(figsize=(6, 6), sharey=True)
plot_lulc(ds_esa["classification"], product="ESA", legend=True, ax=ax)
plt.title("ESA WorldCover");

We’re only interested in the built-up areas, so we can use the flags definition on the dataset to isolate just the built-up class

{'data': {'bits': [0, 1, 2, 3, 4, 5, 6, 7],
  'values': {'0': 'no data',
   '10': 'tree cover',
   '20': 'shrubland',
   '30': 'grassland',
   '40': 'cropland',
   '50': 'built up',
   '60': 'bare/sparse vegetation',
   '70': 'snow and ice',
   '80': 'permanent water bodies',
   '90': 'herbaceous wetland',
   '95': 'mangroves',
   '100': 'moss and lichen'},
  'description': 'Land Use/Land Cover class'}}

Plot the ESA urban areas alongside the k-means estimate from S1

# Plot the built up land cover from the ESA worldcover dataset.
esa_urban_class = 50

built_up = ds_esa["classification"] == esa_urban_class

fig, ax = plt.subplots(1, 2, figsize=(14, 6), sharey=True)
built_up.plot.imshow(ax=ax[0], add_colorbar=False)
clustering_predicted_ds.plot.imshow(ax=ax[1], add_colorbar=False)
ax[1].set_title("K-means Clustering Predicted image")
ax[0].set_title("ESA WorldCover Built up Landcover");

Accuracy assessment metrics

We will use functions from the sklearn.metrics module to evaluate the k-means clustering classification. Accuracy is used when the True Positives and True negatives are more important while F1-score is used when the False Negatives and False Positives are crucial.

\[\begin{aligned} \text{Overall Accuracy (OA)} = \frac{\text{True Positive} + \text{True Negative}}{\text{Total Pixels}} \end{aligned}\]
\[\begin{aligned} \text{Producer's Accuracy (PA) or Precision} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Positive}} \end{aligned}\]
\[\begin{aligned} \text{User's accuracy (UA) or Recall} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}} \end{aligned}\]
\[\begin{aligned} \text{F1 Score} = 2*\frac{\text{Recall}*\text{Precision}}{\text{Recall} + \text{Precision}} \end{aligned}\]
# Metrics for the Kmeans Clustering.
y_true = sklearn_flatten(built_up)
y_pred_kmeans = sklearn_flatten(clustering_predicted_ds)

# Producer's Accuracies.
precision_kmeans = precision_score(y_true, y_pred_kmeans, labels=[0, 1], average=None)
urban_precision_kmeans = precision_kmeans[1] * 100

# User's Accuracies.
recall_kmeans = recall_score(y_true, y_pred_kmeans, labels=[0, 1], average=None)
urban_recall_kmeans = recall_kmeans[1] * 100

# Overall Accuracy.
accuracy_kmeans = accuracy_score(y_true, y_pred_kmeans, normalize=True)
overall_accuracy_kmeans = accuracy_kmeans * 100

# F1 score.
f1score_kmeans = f1_score(y_true, y_pred_kmeans)
    "\033[1m" + "\033[91m" + "Urban Area Mapping using k-means clustering Results"
)  # bold print and red
print("\033[0m")  # stop bold and red
print("Overall Accuracy: ", round(overall_accuracy_kmeans, 2))
print("F1 score: \t", round(f1score_kmeans, 2))
print("Producer's Accuracy: ", round(urban_precision_kmeans, 2))
    "User's Accuracy: ",
    round(urban_recall_kmeans, 2),
Urban Area Mapping using k-means clustering Results

Overall Accuracy:  79.27
F1 score:        0.53
Producer's Accuracy:  67.39
User's Accuracy:  43.2

The dstack calls provide the imshow calls with RGB array inputs. For the plotted image, the first channel (red) is the actual (ground truth, ESA Worldcover) values, and both the second and third channels (green, blue) are the predicted values (green + blue = cyan).

fig, ax = plt.subplots(figsize=(8, 8))
    [Patch(facecolor="cyan"), Patch(facecolor="red"), Patch(facecolor="white")],
    ["False Positive", "False Negative", "True Positive"],
    loc="lower left",
plt.title("Urban Area Mapping Clustering Results");

Additional information

License: The code in this notebook is licensed under the Apache License, Version 2.0. Digital Earth Africa data is licensed under the Creative Commons by Attribution 4.0 license.

Contact: If you need assistance, please post a question on the Open Data Cube Slack channel or on the GIS Stack Exchange using the open-data-cube tag (you can view previously asked questions here). If you would like to report an issue with this notebook, you can file one on Github.

Compatible datacube version:


Last Tested:

from datetime import datetime