Cloud and pixel quality masking for Landsat
Products used: ls8_sr
Keywords data methods; dilate, data methods; masking, cloud masking, data methods; buffering
Background
In the past, remote sensing researchers would reject partly cloud-affected scenes in favour of cloud-free scenes. However, multi-temporal analysis techniques increasingly make use of every quality assured pixel within a time series of observations. The capacity to automatically exclude low quality pixels (e.g. clouds, shadows or other invalid data) is essential for enabling these new remote sensing methods.
Analysis-ready satellite data from Digital Earth Africa includes pixel quality information that can be used to easily “mask” data (i.e. keep only certain pixels in an image) to obtain a time series containing only clear or cloud-free pixels.
Description
In this notebook, we show how to mask Digital Earth Africa Landsat satellite data using boolean masks. The notebook demonstrates how to:
Load in a time series of satellite data including the
pixel_qa
pixel quality bandInspect the band’s
flags_definition
attributesCreate a mask where pixels are cloudy, have cloud-shadow, or no-data
Apply binary morphological operators on the cloudy pixels to improve the mask
Masking of high aerosols
Apply the masks to the satellite data so we retain only the good quality observations, and plot the results
Rescaling of Landsat data and masking invalid data
Using
load_ard
to mask poor quality pixels, taking advantage of all its in-built features for masking
Digital Earth Africa provides wrapper functions that will automatically provide cloud-masked satellite data, more information can be found in the Using_load_ard notebook.
Getting started
First we import relevant packages and connect to the datacube. Then we define our example area of interest and load in a time series of satellite data.
[1]:
%matplotlib inline
import datacube
from pprint import pprint
import geopandas as gpd
from datacube.utils.geometry import Geometry
from datacube.utils import masking
from odc.algo import mask_cleanup, erase_bad, to_f32
from deafrica_tools.plotting import rgb
from deafrica_tools.datahandling import mostcommon_crs, load_ard
from deafrica_tools.areaofinterest import define_area
Connect to the datacube
[2]:
dc = datacube.Datacube(app="Masking_data")
Create a query and load satellite data
To demonstrate how to mask satellite data, we will load Landsat 8 surface reflectance RGB data along with a pixel quality classification band called pixel_quality
.
To define the area of interest, there are two methods available:
By specifying the latitude, longitude, and buffer. This method requires you to input the central latitude, central longitude, and the buffer value in square degrees around the center point you want to analyze. For example,
lat = 10.338
,lon = -1.055
, andbuffer = 0.1
will select an area with a radius of 0.1 square degrees around the point with coordinates (10.338, -1.055).By uploading a polygon as a
GeoJSON or Esri Shapefile
. If you choose this option, you will need to upload the geojson or ESRI shapefile into the Sandbox using Upload Files button in the top left corner of the Jupyter Notebook interface. ESRI shapefiles must be uploaded with all the related files(.cpg, .dbf, .shp, .shx)
. Once uploaded, you can use the shapefile or geojson to define the area of interest. Remember to update the code to call the file you have uploaded.
To use one of these methods, you can uncomment the relevant line of code and comment out the other one. To comment out a line, add the "#"
symbol before the code you want to comment out. By default, the first option which defines the location using latitude, longitude, and buffer is being used.
If running the notebook for the first time, keep the default settings below. This will demonstrate how the analysis works and provide meaningful results.
[3]:
# Method 1: Specify the latitude, longitude, and buffer
aoi = define_area(lat=35.7718, lon=-5.8096, buffer=0.03)
# Method 2: Use a polygon as a GeoJSON or Esri Shapefile.
# aoi = define_area(vector_path='aoi.shp')
#Create a geopolygon and geodataframe of the area of interest
geopolygon = Geometry(aoi["features"][0]["geometry"], crs="epsg:4326")
geopolygon_gdf = gpd.GeoDataFrame(geometry=[geopolygon], crs=geopolygon.crs)
# Get the latitude and longitude range of the geopolygon
lat_range = (geopolygon_gdf.total_bounds[1], geopolygon_gdf.total_bounds[3])
lon_range = (geopolygon_gdf.total_bounds[0], geopolygon_gdf.total_bounds[2])
# Create a reusable query
query = {
'x': lon_range,
'y': lat_range,
'time': ('2016-09', '2016-10'),
}
# Identify the most common projection system in the input query
output_crs = mostcommon_crs(dc=dc, product='ls8_sr', query=query)
# Load data from the Landsat-8
data = dc.load(product="ls8_sr",
measurements=["blue", "green", "red",
"pixel_quality", "qa_aerosol"],
output_crs=output_crs,
resolution=(-30, 30),
align=(15, 15),
**query)
print(data)
/home/jovyan/dev/deafrica-sandbox-notebooks/Tools/deafrica_tools/datahandling.py:733: UserWarning: Multiple UTM zones ['epsg:32629', 'epsg:32630'] were returned for this query. Defaulting to the most common zone: epsg:32630
warnings.warn(
<xarray.Dataset>
Dimensions: (time: 7, y: 228, x: 188)
Coordinates:
* time (time) datetime64[ns] 2016-09-02T11:03:08.816501 ... 2016-...
* y (y) float64 3.966e+06 3.966e+06 ... 3.959e+06 3.959e+06
* x (x) float64 2.432e+05 2.432e+05 ... 2.488e+05 2.488e+05
spatial_ref int32 32630
Data variables:
blue (time, y, x) uint16 8398 8338 8295 8216 ... 16903 19403 21435
green (time, y, x) uint16 8813 8771 8707 8622 ... 20944 23082 23060
red (time, y, x) uint16 8235 8199 8136 8077 ... 22769 24061 24180
pixel_quality (time, y, x) uint16 23888 23888 23888 ... 21824 21824 21824
qa_aerosol (time, y, x) uint8 224 224 224 224 224 ... 194 224 224 64 96
Attributes:
crs: epsg:32630
grid_mapping: spatial_ref
The absence of satellite observation is indicated by a “nodata” value for the band, which is listed under the Attributes category of the returned xarray.DataArray
.
[4]:
data.red.attrs
[4]:
{'units': '1', 'nodata': 0, 'crs': 'epsg:32630', 'grid_mapping': 'spatial_ref'}
We see that the nodata
attribute reports the value 0
.
We can find the classification scheme of the pixel_qa
band in its flags definition.
[5]:
flags_def = data.pixel_quality.attrs["flags_definition"]
pprint(flags_def)
{'cirrus': {'bits': 2,
'values': {'0': 'not_high_confidence', '1': 'high_confidence'}},
'cirrus_confidence': {'bits': [14, 15],
'values': {'0': 'none',
'1': 'low',
'2': 'reserved',
'3': 'high'}},
'clear': {'bits': 6, 'values': {'0': False, '1': True}},
'cloud': {'bits': 3,
'values': {'0': 'not_high_confidence', '1': 'high_confidence'}},
'cloud_confidence': {'bits': [8, 9],
'values': {'0': 'none',
'1': 'low',
'2': 'medium',
'3': 'high'}},
'cloud_shadow': {'bits': 4,
'values': {'0': 'not_high_confidence',
'1': 'high_confidence'}},
'cloud_shadow_confidence': {'bits': [10, 11],
'values': {'0': 'none',
'1': 'low',
'2': 'reserved',
'3': 'high'}},
'dilated_cloud': {'bits': 1, 'values': {'0': 'not_dilated', '1': 'dilated'}},
'nodata': {'bits': 0, 'values': {'0': False, '1': True}},
'snow': {'bits': 5,
'values': {'0': 'not_high_confidence', '1': 'high_confidence'}},
'snow_ice_confidence': {'bits': [12, 13],
'values': {'0': 'none',
'1': 'low',
'2': 'reserved',
'3': 'high'}},
'water': {'bits': 7, 'values': {'0': 'land_or_cloud', '1': 'water'}}}
We see that pixel_quality
also reports the nodata
pixels, and along with the cloud
and cloud_shadow
pixels, it also picks up snow
and water
pixels.
Creating a cloud and pixel quality mask
We create a mask by specifying conditions that our pixels must satisfy. But we will only need the labels (not the values) to create a mask.
In this example below, we want to exclude multiple catagories of bad pixels, i.e. if a pixel has any of the flags (cloud, cirrus or cloud_shadow) set, the pixel will be masked.
[6]:
quality_flags = dict(
cloud="high_confidence", # True where there is cloud
cirrus="high_confidence",# True where there is cirrus cloud
cloud_shadow="high_confidence",# True where there is cloud shadow
)
# set cloud_mask: True=cloud, False=non-cloud
mask, _= masking.create_mask_value(flags_def, **quality_flags)
#add the cloud mask to our dataset
data['cloud_mask'] = (data['pixel_quality'] & mask) != 0
Below, we’ll plot the pixel quality mask along with the true colour satellite images.
Does the cloud mask exactly match the clouds you see in the RGB plots? Landsat’s pixel quality algorithm has known limitations that result in bright objects, such as beaches and cities, mistakenly being classified as clouds.
[7]:
#plot the locations where there are clouds and cloud shadows
data['cloud_mask'].plot(col="time", col_wrap=7, add_colorbar=False);
[8]:
# Plot in True colour
rgb(data, col="time", col_wrap=7)
Applying morphological processing on the cloud mask
We can improve on the false positives detected by Landsat’s pixel quality mask by applying binary morphological image processing techniques (e.g. binary_closing, binary_erosion etc.). The
Open-Data-Cube library odc-algo has a function, odc.algo.mask_cleanup
that can perform a few of these operations. Below, we will try to improve the cloud mask by apply a number of these filters.
Feel free to experiment with the values in filters
[9]:
# set the filters to apply. The integers refer to the number of pixels
filters = [("opening", 2),("dilation", 2)]
[10]:
# Use the mask_cleanup function to apply the filters
data['cloud_mask_filtered'] = mask_cleanup(data['cloud_mask'], mask_filters=filters)
[11]:
#plot the results
data['cloud_mask_filtered'].plot(col="time", col_wrap=7, add_colorbar=False);
Masking pixels with high aerosol
The atmospheric correction algorithm applied to Landsat uses aerosol as an input. When the aerosol value is high, the surface reflectance value may be unreliable especially over dark surfaces. Therefore for some applications, masking data with high aerosol may help remove unreliable observations.
In the example below, we use the masking.make_mask()
function to create a mask for pixels with “high aerosol” and combine it with the filtered cloud mask from above.
[12]:
# find the bit mask flags and print them
aerosol_flags = data.qa_aerosol.attrs["flags_definition"]
pprint(aerosol_flags)
#create a dict of flags we want to mask
aerosol_quality_flags = dict(
aerosol_level="high",
)
#create a boolean mask where high-aerosol = True
data['aerosol_mask'] = masking.make_mask(data['qa_aerosol'], **aerosol_quality_flags)
{'aerosol_level': {'bits': [6, 7],
'values': {'0': 'climatology',
'1': 'low',
'2': 'medium',
'3': 'high'}},
'interpolated_aerosol': {'bits': 5, 'values': {'0': False, '1': True}},
'nodata': {'bits': 0, 'values': {'0': False, '1': True}},
'valid_aerosol_retrieval': {'bits': 1, 'values': {'0': False, '1': True}},
'water': {'bits': 2, 'values': {'0': False, '1': True}}}
[13]:
data['aerosol_mask'].plot(col="time", col_wrap=7);
Now we combine the aerosol mask with the clear_filtered
mask to produce of mask where the data is unreliable either due to cloud or high aerosols
[14]:
data['cloud_aerosol_mask'] = data['aerosol_mask'] | data['cloud_mask_filtered']
#plot the result
data['cloud_aerosol_mask'].plot(col="time", col_wrap=7);
Applying the cloud-mask
We can now get the clear images we want by erasing the cloud, non-data, and/or high-aerosol pixels from the data. We’ll do this for the origial PQ mask, the PQ mask that underwent binary morphological image processing, and the mask with additional aerosol masking.
[15]:
# erase pixels with cloud
clear = erase_bad(data.drop_vars(['cloud_mask_filtered', 'cloud_mask', 'cloud_aerosol_mask', 'aerosol_mask']),
data['cloud_mask'])
# erase pixels with the cloud_filtering
clear_filtered = erase_bad(data.drop_vars(['cloud_mask_filtered', 'cloud_mask', 'cloud_aerosol_mask','aerosol_mask']),
data['cloud_mask_filtered'])
# erase pixels with cloud and high aerosol
clear_filtered_aerosol = erase_bad(data.drop_vars(['cloud_mask_filtered', 'cloud_mask', 'cloud_aerosol_mask','aerosol_mask']),
data['cloud_aerosol_mask'])
Plot the results of our ‘clear’ masking
[16]:
rgb(clear, col="time", col_wrap=7)
Plot the results of our ‘clear_filtered’ masking
As you can see, the morphological filtering operations have minimised the impact of the false-postives in the cloud mask over the cities and along the coast
[17]:
rgb(clear_filtered, col="time", col_wrap=7)
Plot the results of our ‘clear_filtered’ and ‘aerosol’ masked data
Masking high aerosols removes many pixels over the ocean where surface reflectance values are low and unreliable. Noting, however, this might also remove useful observations. Users should use this mask with care.
[18]:
rgb(clear_filtered_aerosol, col="time", col_wrap=7)
Recaling Landsat data and masking invalid data
Rescale the Landsat data to ~0-1 in accordance with the scale and offset factors provided by the USGS Landsat Collection 2 documentation.
Note: USGS Landsat data has different scale and offset values depending on the measurement. The example scale and offset values shown below only applies to the surface reflectance bands
Before we can rescale the values, we need to ensure the data is first converted to float32
, this makes sure the no-data values in the native data-type (0
) are converted to NaN
, thus ensuring the no-data values in the rescaled data are also NaN
.
[19]:
# Convert the data to `float32` and set no-data values to `NaN`:
clear_filtered = to_f32(clear_filtered)
[20]:
#Apply the scale and offset factors to the Landsat data
sr_bands = ["blue", "green", "red"]
for band in sr_bands:
clear_filtered[band] = 2.75e-5 * clear_filtered[band] - 0.2
Below we replot clear_filtered
in true colour. Note the land surface looks more vivid because the function rgb
is no longer including the 0
values in the colour ramps, and because the surface reflectance has been rescaled to its intended values (~0-1).
[21]:
rgb(clear_filtered, col="time", col_wrap=7)
/usr/local/lib/python3.10/dist-packages/matplotlib/cm.py:478: RuntimeWarning: invalid value encountered in cast
xx = (xx * 255).astype(np.uint8)
Cloud masking, morphological filtering, and rescaling with load_ard
Most of the steps undertaken above can be achieved by using the function deafrica_tools.datahanding.load_ard
. This function will create a cloud mask, apply morphological filters (if desired), and apply the Landsat scale and offset factors.
Note: There are some limitations to
load_ard
due to its design. An important part of its intended functionality is to provide a simple means of concatenating Landsat 5, 7, & 8 together to form a single xarray object. However, this means thatload_ard
can only apply pixel quality masking to pq-categories that are common across Landsat sensors. For example, Landsat-8 has a dedicated bit flag for cirrus bands, but Landsat 5 and 7 do not. This means thatload_ard
cannot accept'cirrus': 'high_confidence'
as a pq-category. The same issue also applies for masking of aerosols, Landat 5 and 7 have different means of recording high aerosol than Landsat 8, and thusload_ard
does not support masking of aerosols. However,load_ard
does remove negative surface reflectance values (negative values are an occassional artefact of the correction algorithms employed by USGS), which often coincide with areas prone to high aerosol values such as large waterbodies and along the coast.
[22]:
# set the filters to apply
filters = [("opening", 2),("dilation", 2)]
# # Load data from the Landsat-8
data = load_ard(dc=dc,
products=["ls8_sr"],
measurements=["blue", "green", "red"],
mask_filters=filters,
output_crs=output_crs,
resolution=(-30, 30),
align=(15,15),
**query)
# plot the data
rgb(data, col="time", col_wrap=7)
Using pixel quality parameters for USGS Collection 2
Finding datasets
ls8_sr
Applying morphological filters to pq mask [('opening', 2), ('dilation', 2)]
Applying pixel quality/cloud mask
Re-scaling Landsat C2 data
Loading 7 time steps
/usr/local/lib/python3.10/dist-packages/matplotlib/cm.py:478: RuntimeWarning: invalid value encountered in cast
xx = (xx * 255).astype(np.uint8)
Additional information
License: The code in this notebook is licensed under the Apache License, Version 2.0. Digital Earth Africa data is licensed under the Creative Commons by Attribution 4.0 license.
Contact: If you need assistance, please post a question on the Open Data Cube Slack channel or on the GIS Stack Exchange using the open-data-cube
tag (you can view previously asked questions here). If you would like to report an issue with this notebook, you can file one on
Github.
Compatible datacube version:
[23]:
print(datacube.__version__)
1.8.15
Last Tested:
[24]:
from datetime import datetime
datetime.today().strftime('%Y-%m-%d')
[24]:
'2023-08-11'