Dynamic World, Near real-time global 10 m land use land cover mapping

Land Use Land Cover taxonomy

The classification schema or “taxonomy” for Dynamic World, shown in Table 1, was determined after a review of global LULC maps, including the USGS Anderson classification system18, ESA Land Use and Coverage Area frame Survey (LUCAS) land cover modalities19, MapBiomas classification20, and GlobeLand30 land cover types13. The Dynamic World taxonomy maintains a close semblance to the land use classes presented in the IPCC Good Practice Guidance (forest land, grassland, cropland, wetland, settlement, and other)21 to ensure easier application of the resulting data for estimating carbon stocks and greenhouse gas emissions. Unlike single-pixel labels, which are usually defined in terms of percent cover thresholds, the Dynamic World taxonomy was applied using “dense” polygon-based annotations such that LULC labels are applied to areas of relatively homogenous cover types with similar colors and textures.

Table 1 Dynamic World Land Use Land Cover (LULC) classification taxonomy.

Training dataset collection

Our modeling approach relies on semi-supervised deep learning and requires spatially dense (i.e., ideally wall-to-wall) annotations. To collect a diverse set of training and evaluation data, we divided the world into three regions: the Western Hemisphere (160°W to 20°W), Eastern Hemisphere-1 (20°W to 100°E), and Eastern Hemisphere-2 (100°E to 160°W). We further divided each region by the 14 RESOLVE Ecoregions biomes22. We collected a stratified sample of sites for each biome per region based on NASA MCD12Q1 land cover for 20174. Given the availability of higher-resolution LULC maps in the United States and Brazil, we used the NLCD 201610 and MapBiomas 201720 LULC products respectively in place of MODIS products for stratification in these two countries.

At each sample location, we performed an initial selection of Sentinel-2 images from 2019 scenes based on image cloudiness metadata reported in the Sentinel-2 tile’s QA60 band. We further filtered scenes to remove images with many masked pixels. We finally extracted individual tiles of 510 × 510 pixels centered on the sample sites from random dates in 2019. Tiles were sampled in the UTM projection of the source image and we selected one tile corresponding to a single Sentinel-2 ID number and single date.

Further steps were then taken to obtain an “as balanced as possible” training dataset with respect to the LULC classifications from the respective LULC products. In particular, for each Dynamic World LULC category contained within a tile, the tile was labeled to be high, medium, or low in that category. We then selected an approximately equal number of tiles with high, medium or low category labels for each category.

To achieve a large dataset of labeled Sentinel-2 scenes, we worked with two groups of annotators. The first group included 25 annotators with previous photo-interpretation and/or remote sensing experience. The expert group labeled approximately 4,000 image tiles (Fig. 1a), which were then used to train and measure the performance and accuracy of a second “non-expert” group of 45 additional annotators who labeled a second set of approximately 20,000 image tiles (Fig. 1b). A final validation set of 409 image tiles were held back from the modeling effort and used for evaluation as described in the Technical Validation section. Each image tile in the validation set was annotated by three experts and one non-expert to facilitate cross-expert and expert/non-expert QA comparisons.

Fig. 1
figure 1

Global distribution of annotated Sentinel-2 image tiles used for model training and periodic testing (neither including 409 validation tiles). (a) 4,000 tiles interpreted by a group of 25 experts (b) 20,000 tiles interpreted by a group of 45 non-experts. Hexagons represent approximately 58,500 km2 areas and shading corresponds to the count of annotated tile centroids per hexagon.

All Dynamic World annotators used the Labelbox platform23, which provides a vector drawing tool to mark the boundaries of feature classes directly over the Sentinel-2 tile (Fig. 2). We instructed both expert and non-expert annotators to use dense markup instead of single pixel labels with a minimum mapping unit of 50 × 50 m (5 × 5 pixels). For water, trees, crops, built area, bare ground, snow & ice, and cloud, this was a fairly straightforward procedure at the Sentinel-2 10 m resolution since these feature classes tend to appear in fairly homogenous agglomerations. Shrub & scrub and flooded vegetation classes proved to be more challenging as they tended not to appear as homogenous features (e.g. mix of vegetation types) and have variable appearance. Annotators used their best discretion in these situations based on the guidance provided in our training material (i.e. descriptions and examples in Table 1). In addition to the Sentinel-2 tile, annotators had access to a matching high-resolution satellite image via Google Maps and ground photography via Google Street View from the image center point. We also provided the date and center point coordinates for each annotation. All annotators were asked to label at least 70% of a tile within 20 to 60 minutes and were allowed to skip some tiles to best balance their labeling accuracy with their efficiency.

Fig. 2
figure 2

Sentinel-2 tile and example reference annotation provided as part of interpreter training. This example was used to illustrate the Flooded vegetation class, which is distinguished by small “mottled” areas of water mixed with vegetation near a riverbed. Also note that some areas of the tile are left unlabeled.

Image preprocessing

We prepared Sentinel-2 imagery in a number of ways to accommodate both annotation and training workflows. An overview of the preprocessing workflow is shown in Fig. 3.

Fig. 3
figure 3

Training inputs workflow. Annotations created using Sentinel-2 Level 2 A Surface Reflectance imagery are paired with masked and normalized Sentinel-2 Level 1 C Top of Atmosphere imagery, and inputs are augmented to create training inputs used for modeling. Cloud and shadow masking involves a three-step process that combines the Sentinel-2 Cloud Probability (S2C) product with the Cloud Displacement Index (CDI), which is used to correct over-masking of bright non-cloud targets” and directional distance transform (DDT), which is used to remove the expected path of shadows based on sun-sensor geometry.

For training data collection, we used the Sentinel-2 Level-2A (L2A) product, which provides radiometrically calibrated surface reflectance (SR) processed using the Sen2Cor software package24. This advanced level of processing was advantageous for annotation, as it attempts to remove inter-scene variability due to solar distance, zenith angle, and atmospheric conditions. However, systematically produced Sentinel-2 SR products are currently only available from 2017 onwards. Therefore, for our modeling approach, we used the Level-1C (L1C) product, which has been generated since the beginning of the Sentinel-2 program in 2015. The L1C product represents Top-of-Atmosphere (TOA) reflectance measurements and is not subject to a change in processing algorithm in the future. We note that for any L2A image, there is a corresponding L1C image, allowing us to directly map annotations performed using L2A imagery to the L1C imagery used in model training. All bands except for B1, B8A, B9, and B10 were kept, with all bands bilinearly upsampled to 10 m for both training and inference.

In addition to our preliminary cloud filtering in training image selection, we adopted and applied a novel masking solution that combines several existing products and techniques. Our procedure is to first take the 10 m Sentinel-2 Cloud Probability (S2C) product available in Earth Engine25 and join it to our working set of Sentinel-2 scenes such that each image is paired with the corresponding mask. We compute a cloud mask by thresholding S2C using a cloud probability of 65% to identify pixels that are likely obscured by cloud cover. We then apply the Cloud Displacement Index (CDI) algorithm26 and threshold the result to produce a second cloud mask, which is intersected with the S2C mask to reduce errors of commission by removing bright non-cloud targets based on Sentinel-2 parallax effects. We finally intersect this sub-cirrus mask with a threshold on the Sentinel-2 cirrus band (B10) using the thresholding constants proposed for the CDI algorithm26, and take a morphological opening of this as our cloudy pixel mask. This mask is computed at 20 m resolution.

In order to remove cloud shadows, we extend the cloudy pixel mask 5 km in the direction opposite the solar azimuthal angle using the scene level metadata “SOLAR_AZIMUTH_ANGLE” and a directional distance transform (DDT) operation in Earth Engine. The final cloud and shadow mask is resampled to 100 m to decrease both the data volume and processing time. The resulting mask is applied to Sentinel-2 images used for training and inference such that unmasked pixels represent observations that are likely to be cloud- and shadow-free.

The distribution of Sentinel-2 reflectance values are highly compressed towards the low end of the sensor range, with the remainder mostly occupied by high return phenomena like snow and ice, bare ground, and specular reflection. To combat this imbalance, we introduce a normalization scheme that better utilizes the useful range of Sentinel-2 reflectance values for each band. We first log-transform the raw reflectance values to equalize the long tail of highly reflective surfaces, then remap percentiles of the log-transformed values to points on a sigmoid function. The latter is done to bound on (0, 1) without truncation, and condenses the extreme end members of reflectances to a smaller range.

To account for an annotation skill differential between the non-expert and expert groups, we one-hot encode the labeled pixels, and smooth them according to the confidence in a binary label of the individual annotator (expert/non-expert): this is effectively linearly interpolating the distributions per-pixel from their one-hot encoding (i.e. a vector of binary variables for each class label) to uniform probability. We used 0.2 for experts, and 0.3 for non-experts (i.e. ~82% confidence on the true class for experts and ~73% confidence on the true class for the non-expert. We note that these values approximately mirror the Non-Expert to Expert Consensus agreement as discussed in the Technical Validation section). This is akin to standard label-smoothing27,28, with the addition that the degree of smoothing is associated with annotation confidence.

We generate a pair of weights for each pixel in an augmented example designed to compensate for class imbalance across the training set and weight high-frequency spatial features at the inputs during “synthesis” (discussed further in the following section). We also include a weight per pixel designed to attenuate labels in the center of labeled polygons where human annotators often missed small details using a simple edge finding kernel.

We finally perform a series of augmentations (random rotation and random per-band contrasting) to our input data to improve generalizability and performance of our model. These augmentations are applied four times to each example to yield our final training dataset of examples paired with class distributions, masks, and weights (Fig. 3).

Model training

Our broad approach to transferring the supervised label data to a system that could be applied globally was to train a Fully Convolutional Neural Network (FCNN)29. Conceptually, this approach transforms pre-processed Sentinel-2 optical bands to a discrete probability distribution of the classes in our taxonomy on the basis of spatial context. This is done per-image with the assumption that sufficient spatial and spectral context is available to recover one of our taxonomic labels at a pixel. There are a few notable benefits to such an approach: namely that given the generalizability of modern deep neural networks, it is possible, as we will show, to produce a single model that achieves acceptable agreement with hand-digitized expert annotations globally. Furthermore, since model outputs are generated from a single image and a single model, it is straightforward to scale as each Sentinel-2 L1C image need only be observed once.

Although applying CNN modeling, including FCNN, to recover LULC is not a new idea30,31,32, we introduce a number of novel innovations that achieve state-of-the-art performance on LULC globally with a neural network architecture almost 100x smaller than architectures used for semantic segmentation or regression of ground-level camera imagery (specifically compared to U-Net33 and DeepLab v3+34 architectures). Our approach also leverages weak supervision by way of a synthesis pathway: this pathway includes a replica of the labeling model architecture that learns a mapping from estimated probabilities back to the input reflectances, in a way, a reverse LULC classifier that offers both multi-tasking and a constraint to overcome deficiencies in human labeling (Fig. 4).

Fig. 4
figure 4

Training protocol used to recover the labeling model. The bottom row shows the progression from a normalized Sentinel-2 L1C image, to class probabilities, to synthesized Sentinel-2. The dashed red and blue arrows show how the labeling model is optimized with respect to both the class probability and synthesis pathway, and the synthesis model is optimized only with respect to the synthesized imagery. The example image is retrieved from Earth Engine using ee.Image(‘GOOGLE/DYNAMICWORLD/V1/20190517T083601_20190517T083604_T37UET’).

Near real-time inference

Using Earth Engine in combination with Cloud AI Platform, it is possible to handle enormous quantities of satellite data and apply custom image processing and classification methods using a simple scaling paradigm (Fig. 5). To generate our NRT products, we apply the normalization described earlier to the raw Sentinel-2 L1C imagery and pass all normalized bands except B1, B8A, B9 and B10 after bilinear upscaling to ee.Model.predictImage. This output is then masked using our cloud mask derived from the unnormalized L1C image. Creation of these images is triggered automatically when new Sentinel-2 L1C and S2C images are available. The NRT collection is continuously updated with new results. For a full Sentinel-2 tile (roughly 100 km x 100 km), predictions are completed on the order of 45 minutes. In total, we evaluate ~12,000 Sentinel-2 scenes per day, processing half on average due to a filter criteria on the CLOUDY_PIXEL_PERCENTAGE metadata of 35%. A new Dynamic World LULC image is processed approximately every 14.4 s.

Fig. 5
figure 5

Near-Real-Time (NRT) prediction workflow. Input imagery is normalized following the same protocol used in training and the trained model is applied to generate land cover predictions. Predicted results are masked to remove cloud and cloud shadow artifacts using Sentinel-2 cloud probabilities (S2C), the Cloud Displacement Index (CDI) and a directional distance transform (DDT), then added to the Dynamic World image collection.

Leave a Comment