Urbanization Index Comparisons with Global Human Settlement (GHS)

Tags: band index; NDBI, band index; ENDISI, urban


There are many different urbanization indices with different characteristics and use cases. It is often convenient to be able to compare the performance of several indicies for an area - determining which is the best for an area based on the outputs and a “ground truth” dataset of urbanization.


This notebook uses several indices to classify land as “urban” and then compares those results with the Global Human Settlement (GHS) product which shows the extent of built-up area (urban extent) through to 2014. The indices tested here are the Normalized Difference Buildup Index (NDBI) and the Enhanced Normalized Difference Impervious Surface Index (ENDISI).

  1. Load a geomedian image from the region of interest.

  2. Mask watr using the WOfS alltime summary

  3. Calulate urban indices and show histograms for the indicies.

  4. Select minimum and maximum threshold values for the indicies to determine urban pixels.

  5. Show the urbanization prediction images.

  6. Load and show the “ground truth” data (GHS) for the year 2014.

  7. Compare the urbanization predictions with the “ground truth” data visually and statistically.

The choice of threshold values can be informed by the histograms and comparing the urbanization prediction images with the “ground truth” urbanization data.

Getting started

To run this analysis, run all the cells in the notebook, starting with the “Load packages” cell.

Load packages

Import Python packages that are used for the analysis.

%matplotlib inline

import datacube
import matplotlib.pyplot as plt
plt.rcParams.update({'font.size': 14})
import numpy as np
import xarray as xr
from collections import namedtuple
from datacube.utils.geometry import assign_crs
from skimage.morphology import remove_small_objects
from skimage.morphology import remove_small_holes
from matplotlib.patches import Patch
from odc.algo import xr_reproject

from deafrica_tools.plotting import display_map, rgb
from deafrica_tools.bandindices import calculate_indices
/env/lib/python3.8/site-packages/geopandas/_compat.py:106: UserWarning: The Shapely GEOS version (3.8.0-CAPI-1.13.1 ) is incompatible with the GEOS version PyGEOS was compiled with (3.9.1-CAPI-1.14.2). Conversions between both will be slow.

Set up a Dask cluster

Dask can be used to better manage memory use and conduct the analysis in parallel. For an introduction to using Dask with Digital Earth Africa, see the Dask notebook.

Note: We recommend opening the Dask processing window to view the different computations that are being executed; to do this, see the Dask dashboard in DE Africa section of the Dask notebook.

To use Dask, set up the local computing cluster using the cell below.

Connect to the datacube

Activate the datacube database, which provides functionality for loading and displaying stored Earth observation data.

dc = datacube.Datacube(app="Urbanization_GHS_Comparison")

Analysis parameters

The following cell sets important parameters for the analysis. The parameters are:

  • lat: The central latitude to analyse (e.g. 10.338).

  • lon: The central longitude to analyse (e.g. -1.055).

  • lat_buffer: The number of degrees to load around the central latitude.

  • lon_buffer: The number of degrees to load around the central longitude.

  • time_range: The time range to analyze - in YYYY-MM-DD format (e.g. ('2016-01-01', '2016-12-31')).

If running the notebook for the first time, keep the default settings below. The default area is Dakar, Senegal.

Select location

# Define the area of interest
lat = 14.72
lon = -17.355
lat_buffer = 0.15
lon_buffer = 0.2
time_range = ("2017")

# Combine central lat,lon with buffer to get area of interest
lat_range = (lat - lat_buffer, lat + lat_buffer)
lon_range = (lon - lon_buffer, lon + lon_buffer)

View the selected location

The next cell will display the selected area on an interactive map. Feel free to zoom in and out to get a better understanding of the area you’ll be analysing. Clicking on any point of the map will reveal the latitude and longitude coordinates of that point.

# The code below renders a map that can be used to view the region.
display_map(lon_range, lat_range)

Load the data

We will below load a geoomedian from 2017

# Create the 'query' dictionary object
query = {
    "longitude": lon_range,
    "latitude": lat_range,
    "time": time_range,
    "resolution": (-20, 20),

#load geomedian
ds = dc.load(product='gm_s2_annual',
             measurements=["red", "green", "blue", "swir_1", "swir_2", "nir"],

Once the load is complete, examine the data by printing it in the next cell. The Dimensions attribute revels the number of time steps in the data set, as well as the number of pixels in the longitude and latitude dimensions.

Dimensions:      (x: 1930, y: 1854)
    time         datetime64[ns] 2017-07-02T11:59:59.999999
  * y            (y) float64 1.876e+06 1.876e+06 ... 1.839e+06 1.839e+06
  * x            (x) float64 -1.694e+06 -1.694e+06 ... -1.655e+06 -1.655e+06
    spatial_ref  int32 6933
Data variables:
    red          (y, x) uint16 438 443 435 455 471 474 ... 542 546 551 549 548
    green        (y, x) uint16 502 515 500 520 538 542 ... 652 659 667 666 664
    blue         (y, x) uint16 526 536 525 544 566 577 ... 590 605 603 601 602
    swir_1       (y, x) uint16 346 346 345 359 372 377 ... 441 445 447 448 445
    swir_2       (y, x) uint16 296 296 298 314 323 330 ... 387 388 387 384 383
    nir          (y, x) uint16 389 398 385 401 416 424 ... 499 503 502 506 506
    crs:           EPSG:6933
    grid_mapping:  spatial_ref

Filter out water pixels using WOfS

# Create a water mask using WOfS.
water = dc.load(

# Mask out water.
water_mask = water > 0.1
ds = ds.where(~water_mask.squeeze())

Land Spectral Indices

We’re going to be calculating NDBI and ENDISI as urbanization indices.


The Normalized Difference Built-Up Index (NDBI) is one of the most commonly used proxies of urbanization. Like all normalized difference indicies, it has a range of [-1,1].

\[ \begin{align}\begin{aligned}\begin{aligned} \text{NDBI} = \frac{\text{SWIR1} - \text{NIR}}{\text{SWIR1} + \text{NIR}} \end{aligned}\\ .. rubric:: ENDISI :name: endisi\\ The Enhanced Normalized Difference Impervious Surfaces Index (ENDISI) is a more recently developed urbanization proxy. Like all normalized difference indicies, it has a range of [-1,1]. Note that ``MNDWI``, ``swir_diff`` and ``alpha`` are all part of the ENDISI calculation.\end{aligned}\end{align} \]
\[\begin{aligned} \text{MNDWI} = \frac{\text{GREEN} - \text{SWIR1}}{\text{GREEN} + \text{SWIR1}} \end{aligned}\]
\[\begin{aligned} \text{swir_diff} = \frac{\text{SWIR1}}{\text{SWIR2}} \end{aligned}\]
\[\begin{aligned} \text{alpha} = \frac{\text{2*mean(BLUE)}}{\text{mean(swir_diff) + mean(MNDWI^2)}} \end{aligned}\]
\[\begin{aligned} \text{ENDISI} = \frac{\text{BLUE} - \text{alpha}*(\text{swir_diff} + \text{MNDWI^2})} {\text{BLUE} + \text{alpha}*(\text{swir_diff} + \text{MNDWI^2})} \end{aligned}\]

Calculate NDBI and ENDISI

#NDBI and ENDISI using DE Africa funciton calculate_indices
ds = calculate_indices(ds, index=['NDBI', "ENDISI"], collection='s2')

Plot the two urban indices

fig, ax = plt.subplots(1, 2, figsize=(16,6), sharey=True)
ds.NDBI.plot.imshow(ax=ax[0], vmin=-.75, vmax=.75, cmap='RdBu_r')
ds.ENDISI.plot.imshow(ax=ax[1], vmin=-.75, vmax=.75, cmap='RdBu_r')
ax[0].set_title('NDBI'), ax[0].xaxis.set_visible(False), ax[0].yaxis.set_visible(False)
ax[1].set_title('ENDISI'), ax[1].xaxis.set_visible(False), ax[1].yaxis.set_visible(False)

Determining the thresholds for urbanization

These histogram plots show the distribution of values for each product. The urban threshold values are chosen using these histograms.

If a highly urban area is being examined, there should be visible maximal values for these histograms. The ideal thresholds should usually include these values (see the x axes of the histograms) and some range of values less than and greater than these maximal values.

ds.NDBI.plot.hist(bins=1000, range=(-1,1), facecolor='gray', figsize=(10, 4))
plt.title('NDBI Histogram')

ds.ENDISI.plot.hist(bins=1000, range=(-1,1), facecolor='gray', figsize=(10, 4))
plt.title('ENDISI Histogram')


Create Threshold Plots

First we will define a minimum threshold and a maximum threshold for each index. Then we will create plots that color the threshold region a single color (e.g. red).

# NDBI (Buildup Index) = -1.0 to 1.0 (full range)
# NDBI -0.1 to 0.3 is typical for urban areas
min_ndbi_threshold = -0.1
max_ndbi_threshold = 0.15

# ENDISI = -1.0 to 1.0 (full range)
# ENDISI -0.2 to 0.4 is typical for urban areas
min_endisi_threshold = -0.3
max_endisi_threshold = 0.1

# Set up the sub-plots
fig, ax = plt.subplots(1, 2, figsize=(12, 6))

rgb(ds, ax=ax[0])
ds.NDBI.where((ds.NDBI > min_ndbi_threshold) &
                          (ds.NDBI < max_ndbi_threshold)).plot.imshow(

rgb(ds, ax=ax[1])
ds.ENDISI.where((ds.ENDISI > min_endisi_threshold) &
                            (ds.ENDISI < max_endisi_threshold)).plot.imshow(

#remove axes plotting elements
for a in ax:

#set titles
ax[0].set_title(f'NDBI Threshold ({min_ndbi_threshold} < x < {max_ndbi_threshold})')
ax[1].set_title(f'ENDISI Threshold ({min_endisi_threshold} < x < {max_endisi_threshold})')

Comparison Metrics

We will compare the performance of the urban index results against the GHS GeoTIFF product (shown below).

The GHS geotiff for the Senegal region is available by default in the Supplementary_data folder. To find and download GHS geotiffs for other regions, use the following link: https://ghsl.jrc.ec.europa.eu/download.php?ds=bu

# Senegal region
tif = '../Supplementary_data/Urban_index_comparison/GHS_BUILT_LDSMT_GLOBE_R2018A_3857_30_V2_0_12_10.tif'

Open and reproject dataset to match Landsat

#Open and assign a geobox object
ghs_ds = assign_crs(xr.open_rasterio(tif).squeeze().chunk({'x':5000, 'y':5000}))

#reproject to match our landsat composite
ghs_ds = xr_reproject(ghs_ds,

#Threshold GHS to get all the urban areas
actual = (ghs_ds >= 3) & (ghs_ds <= 6)
/env/lib/python3.8/site-packages/pyproj/crs/crs.py:280: FutureWarning: '+init=<authority>:<code>' syntax is deprecated. '<authority>:<code>' is the preferred initialization method. When making the change, be mindful of axis order changes: https://pyproj4.github.io/pyproj/stable/gotchas.html#axis-order-changes-in-proj-6
  projstring = _prepare_from_string(projparams)
actual.plot(figsize=(8,7), add_colorbar=False)
plt.title('Global Human Settlement Urban Areas');

Metric and Plotting Functions

The code below will calculate the true/false positive/negative sums and calculate values for a typical confusion matrix to assess the results. Accuracy is used when the True Positives and True negatives are more important while F1-score is used when the False Negatives and False Positives are crucial.

\[\begin{aligned} \text{Accuracy} = \frac{\text{True Positive} + \text{True Negative}}{\text{Total Pixels}} \end{aligned}\]
\[\begin{aligned} \text{Precision} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Positive}} \end{aligned}\]
\[\begin{aligned} \text{Recall} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}} \end{aligned}\]
\[\begin{aligned} \text{F1 Score} = 2*\frac{\text{Recall}*\text{Precision}}{\text{Recall} + \text{Precision}} \end{aligned}\]
def get_metrics(actual, predicted, minimum_threshold, maximum_threshold, filter_size=1):
    """ Creates performance metrics.

        actual: the data to use as truth.
        predicted: the data to predict and to compare against actual.
        minimum_threshold: the minimum threshold to apply on the predicted values for generating a boolean mask.
        maximum_threshold: the maximum threshold to apply on the predicted values for generating a boolean mask.
        filter_size: the filter size to apply on predicted to remove small object/holes with.

    Returns: A namedtuple containing the actual, predicted mask, and varying metrics for a confusion matrix.
    metrics = namedtuple('Metrics',
                         'actual predicted true_positive true_negative false_positive false_negative')

    predicted  = (predicted > minimum_threshold) & (predicted < maximum_threshold)
    predicted  = remove_small_objects(predicted, min_size=filter_size+1, connectivity=2)
    predicted  = remove_small_holes(predicted, area_threshold=filter_size+1, connectivity=2)

    true_positive=(predicted & actual).sum()
    true_negative=(~predicted & ~actual).sum()
    false_positive=(predicted & ~actual).sum()
    false_negative=(~predicted & actual).sum()

    return metrics(actual=actual,

def print_metrics(metrics):
    norm = metrics.true_positive + metrics.false_negative + metrics.false_positive + metrics.true_negative
    accuracy = (metrics.true_positive + metrics.true_negative)/norm
    ppv = metrics.true_positive/(metrics.true_positive + metrics.false_positive)
    tpr = metrics.true_positive/(metrics.true_positive + metrics.false_negative)
    f1 = (2*ppv*tpr)/(ppv+tpr)

    print('True Positive (Actual + Model = Urban): {tp}'.format(tp=round(metrics.true_positive/norm*100,3)))
    print('True Negative (Actual + Model = Non-Urban): {tn}'.format(tn=round(metrics.true_negative/norm*100,3)))
    print('False Positive (Actual=Non-Urban, Model=Urban): {fp}'.format(fp=round(metrics.false_positive/norm*100,3)))
    print('False Negative (Actual=Urban, Model=Non-Urban): {fn}'.format(fn=round(metrics.false_negative/norm*100,3)))
    print('\nAccuracy: {accuracy}'.format(accuracy=round(accuracy*100, 3)))
    print('F1 Score: {f1}\n'.format(f1=round(f1*100, 3)))
indexes =['NDBI','ENDISI']
min_thresholds = [min_ndbi_threshold, min_endisi_threshold]
max_thresholds =[max_ndbi_threshold, max_endisi_threshold]

for index, min_thresh, max_thresh in zip(indexes, min_thresholds, max_thresholds):
    print ('\033[1m' + '\033[91m' + index+' - Comparison Results') # bold print and red
    print ('\033[0m') # stop bold and red
    metrics = get_metrics(actual.values, ds[index].values, min_thresh, max_thresh)

#create a dictionary with the accuracy data in it
index_metrics = {indexes[i]: index_metrics[i] for i in range(len(indexes))}
NDBI - Comparison Results

True Positive (Actual + Model = Urban): 11.682
True Negative (Actual + Model = Non-Urban): 78.626
False Positive (Actual=Non-Urban, Model=Urban): 8.475
False Negative (Actual=Urban, Model=Non-Urban): 1.218

Accuracy: 90.308
F1 Score: 70.679

ENDISI - Comparison Results

True Positive (Actual + Model = Urban): 9.617
True Negative (Actual + Model = Non-Urban): 83.161
False Positive (Actual=Non-Urban, Model=Urban): 3.94
False Negative (Actual=Urban, Model=Non-Urban): 3.282

Accuracy: 92.778
F1 Score: 72.702

Output Comparisons

The dstack calls provide the imshow calls with RGB array inputs. For each image, the first channel (red) is the actual (ground truth, GHS) values, and both the second and third channels (green, blue) are the predicted values (green + blue = cyan).

fig, ax = plt.subplots(1, 2, figsize=(20,10))

for a, key in zip(ax, index_metrics):
        [Patch(facecolor='cyan'), Patch(facecolor='red'), Patch(facecolor='white')],
        ['False Positive', 'False Negative', 'True Positive'], loc='lower left', fontsize=10)
    a.set_title(key +' Comparison')

Next Steps

Machine learning can also be used to measure urbanisation. See this notebook for a guide on using machine learning in the context of the ODC.

Additional information

License: The code in this notebook is licensed under the Apache License, Version 2.0. Digital Earth Africa data is licensed under the Creative Commons by Attribution 4.0 license.

Contact: If you need assistance, please post a question on the Open Data Cube Slack channel or on the GIS Stack Exchange using the open-data-cube tag (you can view previously asked questions here). If you would like to report an issue with this notebook, you can file one on Github.

Compatible datacube version:


Last Tested:

from datetime import datetime