Urbanization Index Comparisons with Global Human Settlement (GHS)

Tags: band index; NDBI, band index; ENDISI, band index; ASI, urban


There are many different urbanization indices with different characteristics and use cases. It is often convenient to be able to compare the performance of several indicies for an area - determining which is the best for an area based on the outputs and a “ground truth” dataset of urbanization.


This notebook uses several indices to classify land as “urban” and then compares those results with the Global Human Settlement (GHS) product which shows the extent of built-up area (urban extent) through to 2014. The indices tested here are the Normalized Difference Buildup Index (NDBI), the Enhanced Normalized Difference Impervious Surface Index (ENDISI) and the Artificial Surface Index (ASI).

  1. Load a geomedian image from the region of interest.

  2. Mask water using the WOfS All Time Summary

  3. Calulate urban indices and show histograms for the indices.

  4. Select minimum and maximum threshold values for the indicies to determine urban pixels.

  5. Show the urbanization prediction images.

  6. Load and show the “ground truth” data (GHS) for the year 2014.

  7. Compare the urbanization predictions with the “ground truth” data visually and statistically.

The choice of threshold values can be informed by the histograms and comparing the urbanization prediction images with the “ground truth” urbanization data.

Getting started

To run this analysis, run all the cells in the notebook, starting with the “Load packages” cell.

Load packages

Import Python packages that are used for the analysis.

%matplotlib inline

import datacube
import matplotlib.pyplot as plt
plt.rcParams.update({'font.size': 14})
import numpy as np
import rioxarray
from collections import namedtuple
from datacube.utils.geometry import assign_crs
from skimage.morphology import remove_small_objects
from skimage.morphology import remove_small_holes
from matplotlib.patches import Patch
from odc.algo import xr_reproject

from deafrica_tools.plotting import display_map, rgb
from deafrica_tools.bandindices import calculate_indices

Connect to the datacube

Activate the datacube database, which provides functionality for loading and displaying stored Earth observation data.

dc = datacube.Datacube(app="Urbanization_GHS_Comparison")

Analysis parameters

The following cell sets important parameters for the analysis. The parameters are:

  • lat: The central latitude to analyse (e.g. 10.338).

  • lon: The central longitude to analyse (e.g. -1.055).

  • lat_buffer: The number of degrees to load around the central latitude.

  • lon_buffer: The number of degrees to load around the central longitude.

  • time_range: The time range to analyze - in YYYY-MM-DD format (e.g. ('2016-01-01', '2016-12-31')).

If running the notebook for the first time, keep the default settings below. The default area is Dakar, Senegal.

Select location

# Define the area of interest.
lat = 14.72
lon = -17.355
lat_buffer = 0.15
lon_buffer = 0.2
time_range = ("2017")

# Combine central lat, and central lon with their respective buffers to get the area of interest.
lat_range = (lat - lat_buffer, lat + lat_buffer)
lon_range = (lon - lon_buffer, lon + lon_buffer)

View the selected location

The next cell will display the selected area on an interactive map. Feel free to zoom in and out to get a better understanding of the area you’ll be analysing. Clicking on any point of the map will reveal the latitude and longitude coordinates of that point.

# The code below renders a map that can be used to view the region.
display_map(lon_range, lat_range)
Make this Notebook Trusted to load map: File -> Trust Notebook

Load the Landsat 8 Annual GeoMAD

We will load the Landsat 8 Annual GeoMAD from 2017.

# Create the 'query' dictionary object.
query = {
    "longitude": lon_range,
    "latitude": lat_range,
    "time": time_range,
    "resolution": (-20, 20),

# Load the geomedian.
ds = dc.load(product='gm_ls8_annual',
             measurements=["red", "green", "blue", "swir_1", "swir_2", "nir"],

Once the load is complete, examine the data by printing it in the next cell. The Dimensions attribute revels the number of time steps in the data set, as well as the number of pixels in the longitude and latitude dimensions.

Dimensions:      (y: 1854, x: 1930)
    time         datetime64[ns] 2017-07-02T11:59:59.999999
  * y            (y) float64 1.876e+06 1.876e+06 ... 1.839e+06 1.839e+06
  * x            (x) float64 -1.694e+06 -1.694e+06 ... -1.655e+06 -1.655e+06
    spatial_ref  int32 6933
Data variables:
    red          (y, x) uint16 234 237 235 235 233 229 ... 140 140 163 163 163
    green        (y, x) uint16 272 272 270 270 268 266 ... 300 300 322 322 322
    blue         (y, x) uint16 116 120 115 115 112 113 ... 132 132 148 148 148
    swir_1       (y, x) uint16 306 310 312 312 309 305 ... 228 228 248 248 248
    swir_2       (y, x) uint16 273 276 277 277 275 272 ... 199 199 213 212 212
    nir          (y, x) uint16 255 259 261 261 258 254 ... 166 166 189 188 188
    crs:           EPSG:6933
    grid_mapping:  spatial_ref

Filter out water pixels using WOfS

# Create a water mask using WOfS.
water = dc.load(

# Mask out water.
water_mask = water > 0.1
ds = ds.where(~water_mask.squeeze())

Land Spectral Indices

We’re going to be calculating NDBI, ENDISI and ASI as urbanization indices.


The Normalized Difference Built-Up Index (NDBI) is one of the most commonly used proxies of urbanization. Like all normalized difference indicies, it has a range of [-1,1].

\[\begin{aligned} \text{NDBI} = \frac{\text{SWIR1} - \text{NIR}}{\text{SWIR1} + \text{NIR}} \end{aligned}\]


The Enhanced Normalized Difference Impervious Surfaces Index (ENDISI) is a more recently developed urbanization proxy. Like all normalized difference indicies, it has a range of [-1,1]. Note that MNDWI, swir_diff and alpha are all part of the ENDISI calculation.

\[\begin{aligned} \text{MNDWI} = \frac{\text{GREEN} - \text{SWIR1}}{\text{GREEN} + \text{SWIR1}} \end{aligned}\]
\[\begin{aligned} \text{swir_diff} = \frac{\text{SWIR1}}{\text{SWIR2}} \end{aligned}\]
\[\begin{aligned} \text{alpha} = \frac{\text{2*mean(BLUE)}}{\text{mean(swir_diff) + mean(MNDWI^2)}} \end{aligned}\]
\[\begin{aligned} \text{ENDISI} = \frac{\text{BLUE} - \text{alpha}*(\text{swir_diff} + \text{MNDWI^2})} {\text{BLUE} + \text{alpha}*(\text{swir_diff} + \text{MNDWI^2})} \end{aligned}\]


The Artificial Surface Index (ASI) has a range of [0, 1] . ASI is composed of four components:

  1. The Artificial Surface Factor (AF):

\begin{aligned} AF = \frac{NIR - BLUE}{NIR + BLUE} \end{aligned}
  1. The Vegetation Suppressing Factor (VSF) which is includes the Normalized Difference Vegetation Index and the Modified Soil Adjusted Vegetation Index.

\begin{aligned} NDVI = \frac{NIR - RED}{NIR + RED} \end{aligned}
\begin{aligned} MSAVI = \frac{2*NIR + 1 - \sqrt{(2*NIR + 1)^2 − 8*(NIR - RED)}}{2} \end{aligned}
\begin{aligned} VSF = 1 − NDVI * MSAVI \end{aligned}
  1. Soil Suppressing Factor (SSF) which includes the Modified Bare soil Index (MBI), Modified Normalized Difference Water Index (MNDWI), and Enhanced Modified Bare soil Index (EMBI).

\begin{aligned} MBI = \frac{SWIR1 - SWIR2 - NIR}{SWIR1 + SWIR2 + NIR} + 0.5 \end{aligned}
\[\begin{aligned} \text{MNDWI} = \frac{\text{GREEN} - \text{SWIR1}}{\text{GREEN} + \text{SWIR1}} \end{aligned}\]
\[\begin{aligned} \text{EMBI} = \frac{MBI - MNDWI - 0.5}{MBI + MNDWI + 1.5} \end{aligned}\]
\[\begin{aligned} \text{SSF} = 1 - EMBI \end{aligned}\]
  1. Modulation Factor (MF)

\[\begin{aligned} \text{MF} = \frac{(BLUE + GREEN) - (NIR + SWIR1)}{(BLUE + GREEN) + (NIR + SWIR1)} \end{aligned}\]

The Artificial Surface Index (ASI) is finalized as:

\begin{equation} ASI = f(AF) × f(VSF) × f(SSF) × f(MF) \end{equation}

\(f\) means the min–max normalization function based on the entire image, i.e., \(y = (x − x_{min})/(x_{max} − x_{min})\)

Calculate NDBI, ENDISI and ASI

# NDBI, ENDISI and ASI using DE Africa funciton calculate_indices.
ds = calculate_indices(ds, index=['NDBI', "ENDISI", "ASI"], satellite_mission='ls')

Plot the urban indices

# Plot.
fig, ax = plt.subplots(1, 3, figsize=(16,6), sharey=True)
ds.NDBI.plot.imshow(ax=ax[0], vmin=-.75, vmax=.75, cmap='RdBu_r')
ds.ENDISI.plot.imshow(ax=ax[1], vmin=-.75, vmax=.75, cmap='RdBu_r')
ds.ASI.plot.imshow(ax=ax[2], cmap='RdBu_r')

ax[0].set_title('NDBI'), ax[0].xaxis.set_visible(False), ax[0].yaxis.set_visible(False)
ax[1].set_title('ENDISI'), ax[1].xaxis.set_visible(False), ax[1].yaxis.set_visible(False)
ax[2].set_title('ASI'), ax[2].xaxis.set_visible(False), ax[2].yaxis.set_visible(False)


Determining the thresholds for urbanization

These histogram plots show the distribution of values for each product. The urban threshold values are chosen using these histograms.

If a highly urban area is being examined, there should be visible maximal values for these histograms. The ideal thresholds should usually include these values (see the x axes of the histograms) and some range of values less than and greater than these maximal values.

ds.NDBI.plot.hist(bins=1000, range=(-1,1), facecolor='gray', figsize=(10, 4))
plt.title('NDBI Histogram')

ds.ENDISI.plot.hist(bins=1000, range=(-1,1), facecolor='gray', figsize=(10, 4))
plt.title('ENDISI Histogram')

ds.ASI.plot.hist(bins=1000, range=(0,1), facecolor='gray', figsize=(10, 4))
plt.title('ASI Histogram')


Create Threshold Plots

First we will define a minimum threshold and a maximum threshold for each index. Then we will create plots that color the threshold region a single color (e.g. red).

# NDBI = -1.0 to 1.0 (full range)
# NDBI -0.1 to 0.3 is typical for urban areas
min_ndbi_threshold = -0.1
max_ndbi_threshold = 0.15

# ENDISI = -1.0 to 1.0 (full range)
# ENDISI -0.2 to 0.4 is typical for urban areas
min_endisi_threshold = -0.3
max_endisi_threshold = 0.1

# ASI = 0 to 1.0 (full range)
# ASI -0.05 to 1 is typical for urban areas
min_asi_threshold = 0.05
max_asi_threshold = 1 #0.1
# Set up the sub-plots.
fig, ax = plt.subplots(1, 3, figsize=(12, 6))

rgb(ds, ax=ax[0])
ds.NDBI.where((ds.NDBI > min_ndbi_threshold) &
                          (ds.NDBI < max_ndbi_threshold)).plot.imshow(

rgb(ds, ax=ax[1])
ds.ENDISI.where((ds.ENDISI > min_endisi_threshold) &
                            (ds.ENDISI < max_endisi_threshold)).plot.imshow(

rgb(ds, ax=ax[2])
ds.ASI.where((ds.ASI > min_asi_threshold) &
                            (ds.ASI < max_asi_threshold)).plot.imshow(

# Remove axes plotting elements.
for a in ax:

# Set titles.
ax[0].set_title(f'NDBI Threshold ({min_ndbi_threshold} < x < {max_ndbi_threshold})')
ax[1].set_title(f'ENDISI Threshold ({min_endisi_threshold} < x < {max_endisi_threshold})')
ax[2].set_title(f'ASI Threshold ({min_asi_threshold} < x < {max_asi_threshold})')


Comparison Metrics

We will compare the performance of the urban index results against the GHS GeoTIFF product (shown below).

The GHS geotiff for the Senegal region is available by default in the Supplementary_data folder. To find and download GHS geotiffs for other regions, use the following link: https://ghsl.jrc.ec.europa.eu/download.php?ds=bu

# Senegal region.
tif = '../Supplementary_data/Urban_index_comparison/GHS_BUILT_LDSMT_GLOBE_R2018A_3857_30_V2_0_12_10.tif'

Open and reproject dataset to match Landsat

# Open and assign a geobox object.
ghs_ds = assign_crs(rioxarray.open_rasterio(tif).squeeze().chunk({'x':5000, 'y':5000}))

# Reproject to match our landsat composite.
ghs_ds = xr_reproject(ghs_ds,

# Threshold GHS to get all the urban areas.
actual = (ghs_ds >= 3) & (ghs_ds <= 6)
# Plot.
actual.plot(figsize=(8,7), add_colorbar=False)
plt.title('Global Human Settlement Urban Areas');

Metric and Plotting Functions

The code below will calculate the true/false positive/negative sums and calculate values for a typical confusion matrix to assess the results. Accuracy is used when the True Positives and True negatives are more important while F1-score is used when the False Negatives and False Positives are crucial.

\[\begin{aligned} \text{Accuracy} = \frac{\text{True Positive} + \text{True Negative}}{\text{Total Pixels}} \end{aligned}\]
\[\begin{aligned} \text{Precision} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Positive}} \end{aligned}\]
\[\begin{aligned} \text{Recall} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}} \end{aligned}\]
\[\begin{aligned} \text{F1 Score} = 2*\frac{\text{Recall}*\text{Precision}}{\text{Recall} + \text{Precision}} \end{aligned}\]
def get_metrics(actual, predicted, minimum_threshold, maximum_threshold, filter_size=1):
    """ Creates performance metrics.

        actual: the data to use as truth.
        predicted: the data to predict and to compare against actual.
        minimum_threshold: the minimum threshold to apply on the predicted values for generating a boolean mask.
        maximum_threshold: the maximum threshold to apply on the predicted values for generating a boolean mask.
        filter_size: the filter size to apply on predicted to remove small object/holes with.

    Returns: A namedtuple containing the actual, predicted mask, and varying metrics for a confusion matrix.
    metrics = namedtuple('Metrics',
                         'actual predicted true_positive true_negative false_positive false_negative')

    predicted  = (predicted > minimum_threshold) & (predicted < maximum_threshold)
    predicted  = remove_small_objects(predicted, min_size=filter_size+1, connectivity=2)
    predicted  = remove_small_holes(predicted, area_threshold=filter_size+1, connectivity=2)

    true_positive=(predicted & actual).sum()
    true_negative=(~predicted & ~actual).sum()
    false_positive=(predicted & ~actual).sum()
    false_negative=(~predicted & actual).sum()

    return metrics(actual=actual,

def print_metrics(metrics):
    norm = metrics.true_positive + metrics.false_negative + metrics.false_positive + metrics.true_negative
    accuracy = (metrics.true_positive + metrics.true_negative)/norm
    ppv = metrics.true_positive/(metrics.true_positive + metrics.false_positive)
    tpr = metrics.true_positive/(metrics.true_positive + metrics.false_negative)
    f1 = (2*ppv*tpr)/(ppv+tpr)

    print('True Positive (Actual + Model = Urban): {tp}'.format(tp=round(metrics.true_positive/norm*100,3)))
    print('True Negative (Actual + Model = Non-Urban): {tn}'.format(tn=round(metrics.true_negative/norm*100,3)))
    print('False Positive (Actual=Non-Urban, Model=Urban): {fp}'.format(fp=round(metrics.false_positive/norm*100,3)))
    print('False Negative (Actual=Urban, Model=Non-Urban): {fn}'.format(fn=round(metrics.false_negative/norm*100,3)))
    print('\nAccuracy: {accuracy}'.format(accuracy=round(accuracy*100, 3)))
    print('F1 Score: {f1}\n'.format(f1=round(f1*100, 3)))
indexes =['NDBI','ENDISI', 'ASI']
min_thresholds = [min_ndbi_threshold, min_endisi_threshold, min_asi_threshold]
max_thresholds =[max_ndbi_threshold, max_endisi_threshold, max_asi_threshold]

for index, min_thresh, max_thresh in zip(indexes, min_thresholds, max_thresholds):
    print ('\033[1m' + '\033[91m' + index+' - Comparison Results') # bold print and red
    print ('\033[0m') # stop bold and red
    metrics = get_metrics(actual.values, ds[index].values, min_thresh, max_thresh)

#create a dictionary with the accuracy data in it
index_metrics = {indexes[i]: index_metrics[i] for i in range(len(indexes))}
NDBI - Comparison Results

True Positive (Actual + Model = Urban): 12.552
True Negative (Actual + Model = Non-Urban): 74.567
False Positive (Actual=Non-Urban, Model=Urban): 12.533
False Negative (Actual=Urban, Model=Non-Urban): 0.347

Accuracy: 87.12
F1 Score: 66.092

ENDISI - Comparison Results

True Positive (Actual + Model = Urban): 9.667
True Negative (Actual + Model = Non-Urban): 82.772
False Positive (Actual=Non-Urban, Model=Urban): 4.329
False Negative (Actual=Urban, Model=Non-Urban): 3.232

Accuracy: 92.439
F1 Score: 71.888

ASI - Comparison Results

True Positive (Actual + Model = Urban): 11.276
True Negative (Actual + Model = Non-Urban): 83.989
False Positive (Actual=Non-Urban, Model=Urban): 3.112
False Negative (Actual=Urban, Model=Non-Urban): 1.623

Accuracy: 95.265
F1 Score: 82.647

Output Comparisons

The dstack calls provide the imshow calls with RGB array inputs. For each image, the first channel (red) is the actual (ground truth, GHS) values, and both the second and third channels (green, blue) are the predicted values (green + blue = cyan).

fig, ax = plt.subplots(1, 3, figsize=(20,10))

for a, key in zip(ax, index_metrics):
        [Patch(facecolor='cyan'), Patch(facecolor='red'), Patch(facecolor='white')],
        ['False Positive', 'False Negative', 'True Positive'], loc='lower left', fontsize=10)
    a.set_title(key +' Comparison')

Next Steps

Machine learning can also be used to measure urbanisation. See this notebook for a guide on using machine learning in the context of the ODC.

Additional information

License: The code in this notebook is licensed under the Apache License, Version 2.0. Digital Earth Africa data is licensed under the Creative Commons by Attribution 4.0 license.

Contact: If you need assistance, please post a question on the Open Data Cube Slack channel or on the GIS Stack Exchange using the open-data-cube tag (you can view previously asked questions here). If you would like to report an issue with this notebook, you can file one on Github.

Compatible datacube version:


Last Tested:

from datetime import datetime