Cloud and pixel quality masking for Landsat

Keywords: data methods; masking, data methods; dilation, data methods; buffering, cloud masking


In the past, remote sensing researchers would reject partly cloud-affected scenes in favour of cloud-free scenes. However, multi-temporal analysis techniques increasingly make use of every quality assured pixel within a time series of observations. The capacity to automatically exclude low quality pixels (e.g. clouds, shadows or other invalid data) is essential for enabling these new remote sensing methods.

Analysis-ready satellite data from Digital Earth Africa includes pixel quality information that can be used to easily “mask” data (i.e. keep only certain pixels in an image) to obtain a time series containing only clear or cloud-free pixels.


In this notebook, we show how to mask Digital Earth Africa Landsat satellite data using boolean masks. The notebook demonstrates how to:

  1. Load in a time series of satellite data including the pixel_quality band

  2. Inspect the band’s flags_definition attributes

  3. Create a mask where pixels are cloudy, have cloud-shadow, or no-data

  4. Apply binary morphological operators on the cloudy pixels to improve the mask

  5. Mask pixels with high aerosols

  6. Apply the masks to the satellite data so we retain only the good quality observations, and plot the results

  7. Rescale Landsat data and mask invalid data

  8. Use load_ard to mask poor quality pixels, taking advantage of all its in-built masking features

Digital Earth Africa provides wrapper functions that will automatically provide cloud-masked satellite data; more information can be found in the Using_load_ard notebook.

Getting started

First we import relevant packages and connect to the datacube. Then we define our example area of interest and load in a time series of satellite data.

%matplotlib inline

import datacube
from pprint import pprint
import geopandas as gpd
from datacube.utils.geometry import Geometry
from datacube.utils import masking
from odc.algo import mask_cleanup, erase_bad, to_f32

from deafrica_tools.plotting import rgb
from deafrica_tools.datahandling import mostcommon_crs, load_ard
from deafrica_tools.areaofinterest import define_area

Connect to the datacube

dc = datacube.Datacube(app="Masking_data")

Create a query and load satellite data

To demonstrate how to mask satellite data, we will load Landsat 8 surface reflectance RGB data along with a pixel quality classification band called pixel_quality.

To define the area of interest, there are two methods available:

  1. By specifying the latitude, longitude, and buffer. This method requires you to input the central latitude, central longitude, and a buffer in degrees around the centre point you want to analyse. For example, lat = 10.338, lon = -1.055, and buffer = 0.1 will select an area extending 0.1 degrees in each direction around the point with coordinates (10.338, -1.055).

  2. By uploading a polygon as a GeoJSON or Esri Shapefile. If you choose this option, you will need to upload the GeoJSON or Esri Shapefile into the Sandbox using the Upload Files button in the top left corner of the Jupyter Notebook interface. Esri Shapefiles must be uploaded with all their related files (.cpg, .dbf, .shp, .shx). Once uploaded, you can use the shapefile or GeoJSON to define the area of interest. Remember to update the code to call the file you have uploaded.

To use one of these methods, you can uncomment the relevant line of code and comment out the other one. To comment out a line, add the "#" symbol before the code you want to comment out. By default, the first option which defines the location using latitude, longitude, and buffer is being used.

If running the notebook for the first time, keep the default settings below. This will demonstrate how the analysis works and provide meaningful results.

# Method 1: Specify the latitude, longitude, and buffer
aoi = define_area(lat=35.7718, lon=-5.8096, buffer=0.03)

# Method 2: Use a polygon as a GeoJSON or Esri Shapefile.
# aoi = define_area(vector_path='aoi.shp')

#Create a geopolygon and geodataframe of the area of interest
geopolygon = Geometry(aoi["features"][0]["geometry"], crs="epsg:4326")
geopolygon_gdf = gpd.GeoDataFrame(geometry=[geopolygon], crs=geopolygon.crs)

# Get the latitude and longitude range of the geopolygon
lat_range = (geopolygon_gdf.total_bounds[1], geopolygon_gdf.total_bounds[3])
lon_range = (geopolygon_gdf.total_bounds[0], geopolygon_gdf.total_bounds[2])

# Create a reusable query
query = {
    'x': lon_range,
    'y': lat_range,
    'time': ('2016-09', '2016-10'),
}

# Identify the most common projection system in the input query
output_crs = mostcommon_crs(dc=dc, product='ls8_sr', query=query)

# Load data from the Landsat-8
data = dc.load(product="ls8_sr",
               measurements=["blue", "green", "red",
                             "pixel_quality", "qa_aerosol"],
               output_crs=output_crs,
               resolution=(-30, 30),
               align=(15, 15),
               group_by="solar_day",
               **query)
data
/home/jovyan/dev/deafrica-sandbox-notebooks/Tools/deafrica_tools/ UserWarning: Multiple UTM zones ['epsg:32629', 'epsg:32630'] were returned for this query. Defaulting to the most common zone: epsg:32630
<xarray.Dataset>
Dimensions:        (time: 7, y: 228, x: 188)
Coordinates:
  * time           (time) datetime64[ns] 2016-09-02T11:03:08.816501 ... 2016-...
  * y              (y) float64 3.966e+06 3.966e+06 ... 3.959e+06 3.959e+06
  * x              (x) float64 2.432e+05 2.432e+05 ... 2.488e+05 2.488e+05
    spatial_ref    int32 32630
Data variables:
    blue           (time, y, x) uint16 8398 8338 8295 8216 ... 16903 19403 21435
    green          (time, y, x) uint16 8813 8771 8707 8622 ... 20944 23082 23060
    red            (time, y, x) uint16 8235 8199 8136 8077 ... 22769 24061 24180
    pixel_quality  (time, y, x) uint16 23888 23888 23888 ... 21824 21824 21824
    qa_aerosol     (time, y, x) uint8 224 224 224 224 224 ... 194 224 224 64 96
Attributes:
    crs:           epsg:32630
    grid_mapping:  spatial_ref

The absence of a satellite observation is indicated by a “nodata” value in each band's attributes, which is listed under the Attributes category of the returned xarray.DataArray.

data.blue.attrs
{'units': '1', 'nodata': 0, 'crs': 'epsg:32630', 'grid_mapping': 'spatial_ref'}

We see that the nodata attribute reports the value 0.

We can find the classification scheme of the pixel_quality band in its flags_definition attribute.

flags_def = data.pixel_quality.attrs["flags_definition"]
pprint(flags_def)
{'cirrus': {'bits': 2,
            'values': {'0': 'not_high_confidence', '1': 'high_confidence'}},
 'cirrus_confidence': {'bits': [14, 15],
                       'values': {'0': 'none',
                                  '1': 'low',
                                  '2': 'reserved',
                                  '3': 'high'}},
 'clear': {'bits': 6, 'values': {'0': False, '1': True}},
 'cloud': {'bits': 3,
           'values': {'0': 'not_high_confidence', '1': 'high_confidence'}},
 'cloud_confidence': {'bits': [8, 9],
                      'values': {'0': 'none',
                                 '1': 'low',
                                 '2': 'medium',
                                 '3': 'high'}},
 'cloud_shadow': {'bits': 4,
                  'values': {'0': 'not_high_confidence',
                             '1': 'high_confidence'}},
 'cloud_shadow_confidence': {'bits': [10, 11],
                             'values': {'0': 'none',
                                        '1': 'low',
                                        '2': 'reserved',
                                        '3': 'high'}},
 'dilated_cloud': {'bits': 1, 'values': {'0': 'not_dilated', '1': 'dilated'}},
 'nodata': {'bits': 0, 'values': {'0': False, '1': True}},
 'snow': {'bits': 5,
          'values': {'0': 'not_high_confidence', '1': 'high_confidence'}},
 'snow_ice_confidence': {'bits': [12, 13],
                         'values': {'0': 'none',
                                    '1': 'low',
                                    '2': 'reserved',
                                    '3': 'high'}},
 'water': {'bits': 7, 'values': {'0': 'land_or_cloud', '1': 'water'}}}

We see that pixel_quality also reports nodata pixels and, along with the cloud and cloud_shadow pixels, it also picks up snow and water pixels.

Creating a cloud and pixel quality mask

We create a mask by specifying the flag conditions our pixels must satisfy. Note that only the flag labels (not the underlying bit values) are needed to create the mask.

In the example below, we want to exclude multiple categories of bad pixels: if a pixel has any of the flags (cloud, cirrus or cloud_shadow) set, the pixel will be masked.

quality_flags = dict(
    cloud="high_confidence",         # True where there is cloud
    cirrus="high_confidence",        # True where there is cirrus cloud
    cloud_shadow="high_confidence",  # True where there is cloud shadow
)

# set cloud_mask: True=cloud, False=non-cloud
mask, _ = masking.create_mask_value(flags_def, **quality_flags)

#add the cloud mask to our dataset
data['cloud_mask'] = (data['pixel_quality'] & mask) != 0
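Conceptually, create_mask_value ORs the bit positions of the requested flags into a single integer, and the & comparison above then tests whether any of those bits are set in a pixel. A standalone sketch with synthetic pixel_quality values (the bit positions come from the flags_definition printed above):

```python
import numpy as np

# Bit positions from the flags_definition: cirrus=2, cloud=3, cloud_shadow=4
CIRRUS_BIT, CLOUD_BIT, SHADOW_BIT = 2, 3, 4
bad_bits = (1 << CIRRUS_BIT) | (1 << CLOUD_BIT) | (1 << SHADOW_BIT)  # 0b11100

# Synthetic pixel_quality values: clear (bit 6), cloud (bit 3), shadow (bit 4)
pq = np.array([0b01000000, 0b00001000, 0b00010000], dtype=np.uint16)

# True wherever any of the "bad" bits are set
cloud_mask = (pq & bad_bits) != 0
print(cloud_mask)  # [False  True  True]
```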

Below, we’ll plot the pixel quality mask along with the true colour satellite images.

Does the cloud mask exactly match the clouds you see in the RGB plots? Landsat’s pixel quality algorithm has known limitations that result in bright objects, such as beaches and cities, mistakenly being classified as clouds.

#plot the locations where there are clouds and cloud shadows
data['cloud_mask'].plot(col="time", col_wrap=7, add_colorbar=False);
# Plot in True colour
rgb(data, col="time", col_wrap=7)

Applying morphological processing on the cloud mask

We can improve on the false positives detected by Landsat’s pixel quality mask by applying binary morphological image processing techniques (e.g. binary_closing, binary_erosion, etc.). The Open Data Cube library odc-algo provides a function, odc.algo.mask_cleanup, that can perform several of these operations. Below, we will try to improve the cloud mask by applying a number of these filters.

Feel free to experiment with the values in filters.

# set the filters to apply. The integers refer to the number of pixels
filters = [("opening", 2),("dilation", 2)]
# Use the mask_cleanup function to apply the filters
data['cloud_mask_filtered'] = mask_cleanup(data['cloud_mask'], mask_filters=filters)
#plot the results
data['cloud_mask_filtered'].plot(col="time", col_wrap=7, add_colorbar=False);
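The opening-then-dilation sequence used above is standard binary morphology: opening removes small, isolated false positives, and dilation buffers the edges of the features that survive. The same idea can be sketched with scipy.ndimage on a toy mask (a stand-in to illustrate the concept, not the odc-algo implementation):

```python
import numpy as np
from scipy import ndimage

# Toy cloud mask: one speckle false positive and one genuine cloud
mask = np.zeros((7, 7), dtype=bool)
mask[0, 0] = True        # isolated false positive (e.g. a bright rooftop)
mask[2:5, 2:5] = True    # a genuine cloud

structure = ndimage.generate_binary_structure(2, 1)  # 4-connected cross

# "opening" (erosion then dilation) removes features smaller than the structure
opened = ndimage.binary_opening(mask, structure=structure)
# "dilation" then grows the surviving cloud to buffer its fuzzy edges
cleaned = ndimage.binary_dilation(opened, structure=structure)

print(bool(opened[0, 0]))    # False: the speckle is removed
print(bool(cleaned[1, 3]))   # True: the cloud edge is buffered outward
```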

Masking pixels with high aerosol

The atmospheric correction algorithm applied to Landsat uses aerosol as an input. When the aerosol value is high, the surface reflectance values may be unreliable, especially over dark surfaces. Therefore, for some applications, masking data with high aerosol may help remove unreliable observations.

In the example below, we use the masking.make_mask() function to create a mask for pixels with “high aerosol” and combine it with the filtered cloud mask from above.

# find the bit mask flags and print them
aerosol_flags = data.qa_aerosol.attrs["flags_definition"]
pprint(aerosol_flags)

# create a dict of flags we want to mask
aerosol_quality_flags = dict(aerosol_level="high")

# create a boolean mask where high-aerosol = True
data['aerosol_mask'] = masking.make_mask(data['qa_aerosol'], **aerosol_quality_flags)
{'aerosol_level': {'bits': [6, 7],
                   'values': {'0': 'climatology',
                              '1': 'low',
                              '2': 'medium',
                              '3': 'high'}},
 'interpolated_aerosol': {'bits': 5, 'values': {'0': False, '1': True}},
 'nodata': {'bits': 0, 'values': {'0': False, '1': True}},
 'valid_aerosol_retrieval': {'bits': 1, 'values': {'0': False, '1': True}},
 'water': {'bits': 2, 'values': {'0': False, '1': True}}}
data['aerosol_mask'].plot(col="time", col_wrap=7);
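Unlike the single-bit cloud flags, aerosol_level is a two-bit field (bits 6 and 7), so make_mask extracts the field and compares it with the value labelled “high” (3). A bit-level sketch using qa_aerosol values that appear in the loaded data (224, 64, 96):

```python
import numpy as np

qa = np.array([224, 64, 96], dtype=np.uint8)

# aerosol_level occupies bits 6-7: shift right, then keep the two low bits
aerosol_level = (qa >> 6) & 0b11

# 3 corresponds to 'high' in the flags_definition above
high_aerosol = aerosol_level == 3
print(high_aerosol)  # [ True False False]
```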

Now we combine the aerosol mask with the filtered cloud mask to produce a mask identifying pixels that are unreliable due to either cloud or high aerosols.

data['cloud_aerosol_mask'] = data['aerosol_mask'] | data['cloud_mask_filtered']

#plot the result
data['cloud_aerosol_mask'].plot(col="time", col_wrap=7);

Applying the cloud-mask

We can now get the clear images we want by erasing the cloud, nodata, and/or high-aerosol pixels from the data. We’ll do this for the original PQ mask, the PQ mask that underwent binary morphological image processing, and the mask with additional aerosol masking.

# erase pixels with cloud
clear = erase_bad(data.drop_vars(['cloud_mask_filtered', 'cloud_mask', 'cloud_aerosol_mask', 'aerosol_mask']),
                  data['cloud_mask'])

# erase pixels with the cloud_filtering
clear_filtered = erase_bad(data.drop_vars(['cloud_mask_filtered', 'cloud_mask', 'cloud_aerosol_mask', 'aerosol_mask']),
                           data['cloud_mask_filtered'])

# erase pixels with cloud and high aerosol
clear_filtered_aerosol = erase_bad(data.drop_vars(['cloud_mask_filtered', 'cloud_mask', 'cloud_aerosol_mask', 'aerosol_mask']),
                                   data['cloud_aerosol_mask'])

Plot the results of our ‘clear’ masking

rgb(clear, col="time", col_wrap=7)

Plot the results of our ‘clear_filtered’ masking

As you can see, the morphological filtering operations have minimised the impact of the false positives in the cloud mask over the cities and along the coast.

rgb(clear_filtered, col="time", col_wrap=7)

Plot the results of our ‘clear_filtered’ and ‘aerosol’ masked data

Masking high aerosols removes many pixels over the ocean where surface reflectance values are low and unreliable. Note, however, that this might also remove useful observations, so use this mask with care.

rgb(clear_filtered_aerosol, col="time", col_wrap=7)

Rescaling Landsat data and masking invalid data

Rescale the Landsat data to ~0-1 in accordance with the scale and offset factors provided by the USGS Landsat Collection 2 documentation.

Note: USGS Landsat data has different scale and offset values depending on the measurement. The example scale and offset values shown below only apply to the surface reflectance bands.

Before we can rescale the values, we need to convert the data to float32; this ensures the no-data values in the native data type (0) are converted to NaN, so that the no-data values in the rescaled data are also NaN.

# Convert the data to `float32` and set no-data values to `NaN`:
clear_filtered = to_f32(clear_filtered)
#Apply the scale and offset factors to the Landsat data
sr_bands = ["blue", "green", "red"]

for band in sr_bands:
    clear_filtered[band] = 2.75e-5 * clear_filtered[band] - 0.2
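The scale and offset map the uint16 digital numbers onto surface reflectance, with NaN (from the masked nodata pixels) propagating unchanged. A quick arithmetic check:

```python
import numpy as np

# A digital number of 10000, and a NaN where nodata was masked
dn = np.array([10000.0, np.nan])

# USGS Collection 2 surface reflectance scale and offset
reflectance = 2.75e-5 * dn - 0.2

print(round(float(reflectance[0]), 4))  # 0.075
print(reflectance[1])                   # nan
```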

Below we replot clear_filtered in true colour. Note the land surface looks more vivid because the function rgb is no longer including the 0 values in the colour ramps, and because the surface reflectance has been rescaled to its intended values (~0-1).

rgb(clear_filtered, col="time", col_wrap=7)
/usr/local/lib/python3.10/dist-packages/matplotlib/ RuntimeWarning: invalid value encountered in cast
  xx = (xx * 255).astype(np.uint8)
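The RuntimeWarning above is harmless: rgb scales the float data to uint8 for display, and the NaN pixels introduced by masking have no defined uint8 value. A minimal reproduction of the same cast (assuming NumPy ≥ 1.24, where casting errors are reported through the errstate machinery):

```python
import numpy as np

# One valid pixel and one masked (NaN) pixel
pixels = np.array([0.3, np.nan])

# Casting NaN to uint8 is undefined and triggers the warning;
# errstate can silence it when the result for NaN pixels is irrelevant
with np.errstate(invalid="ignore"):
    scaled = (pixels * 255).astype(np.uint8)

print(scaled[0])  # 76
```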

Cloud masking, morphological filtering, and rescaling with load_ard

Most of the steps undertaken above can be achieved by using the function deafrica_tools.datahandling.load_ard. This function will create a cloud mask, apply morphological filters (if desired), and apply the Landsat scale and offset factors.

Note: There are some limitations to load_ard due to its design. An important part of its intended functionality is to provide a simple means of concatenating Landsat 5, 7, and 8 together to form a single xarray object. However, this means that load_ard can only apply pixel quality masking to pq-categories that are common across Landsat sensors. For example, Landsat 8 has a dedicated bit flag for cirrus, but Landsat 5 and 7 do not, so load_ard cannot accept 'cirrus': 'high_confidence' as a pq-category. The same issue applies to the masking of aerosols: Landsat 5 and 7 record high aerosol differently from Landsat 8, and thus load_ard does not support masking of aerosols. However, load_ard does remove negative surface reflectance values (an occasional artefact of the correction algorithms employed by USGS), which often coincide with areas prone to high aerosol values such as large waterbodies and along the coast.

# set the filters to apply
filters = [("opening", 2),("dilation", 2)]

# Load data from Landsat-8
data = load_ard(dc=dc,
                products=["ls8_sr"],
                measurements=["blue", "green", "red"],
                output_crs=output_crs,
                resolution=(-30, 30),
                align=(15, 15),
                mask_filters=filters,
                **query)

Using pixel quality parameters for USGS Collection 2
Finding datasets
Applying morphological filters to pq mask [('opening', 2), ('dilation', 2)]
Applying pixel quality/cloud mask
Re-scaling Landsat C2 data
Loading 7 time steps

# plot the data
rgb(data, col="time", col_wrap=7)
/usr/local/lib/python3.10/dist-packages/matplotlib/ RuntimeWarning: invalid value encountered in cast
  xx = (xx * 255).astype(np.uint8)

Additional information

License: The code in this notebook is licensed under the Apache License, Version 2.0. Digital Earth Africa data is licensed under the Creative Commons by Attribution 4.0 license.

Contact: If you need assistance, please post a question on the Open Data Cube Slack channel or on the GIS Stack Exchange using the open-data-cube tag (you can view previously asked questions here). If you would like to report an issue with this notebook, you can file one on Github.

Compatible datacube version:

print(datacube.__version__)

Last Tested:

from datetime import datetime
datetime.today().strftime('%Y-%m-%d')