Deep learning techniques are increasingly used to automatically derive geological maps from digital outcrop models, lessening interpretation time and (ideally) reducing bias. Such techniques are especially needed when hyperspectral images are back-projected to create data-rich ‘hypercloud’ type digital outcrop models.
However, validating these automated mapping approaches accurately is a significant challenge, due to the subjective nature of geological mapping and the difficulty of collecting quantitative validation data. This makes it exceedingly difficult to compare different machine learning approaches for geological applications. Furthermore, many state-of-the-art deep learning methods are limited to 2-D image data, making application to 3-D digital outcrops (e.g., hyperclouds) an outstanding challenge.
The Tinto dataset aims to help solve these validation issues and so foster further development of deep learning techniques in the geosciences. It comprises two representations (tinto2D and tinto3D) of three benchmark datasets: a real dataset and two synthetic twins, one noise-free and one realistic (degraded).
The real dataset (tinto.real) comprises a compilation of visible, near, short-wave and long-wave infrared hyperspectral data acquired using ground and airborne sensors at Rio Tinto, Spain. These have been corrected for atmospheric and topographic effects, and projected onto a photogrammetric point cloud to derive a set of hyperclouds (tinto3D) and corresponding 2D views (tinto2D). Ground-truth labels for every point in the hypercloud (and by projection every pixel in tinto2D) were then derived by a combination of laboratory XRD analyses, hyperspectral interpretation, and digital outcrop mapping.
Potential biases or inconsistencies in the ground-truth labels of the real dataset have been addressed by generating an entirely synthetic suite of spectral data by forward modelling (tinto.synth and tinto.degr). These share the same labels as the real dataset, as well as several latent variables and spatial relationships, but are derived using a spectral mixing model and a spatial distribution of mineral abundances simulated using spectral proxies. We suggest that these synthetic spectra are suited for comparing learning approaches, as the ground truth is known with certainty, while the real spectra can be used to evaluate performance on realistic data. tinto.synth contains perfect (noise-free) spectra, largely for testing purposes, while tinto.degr contains these spectra with added sensor noise, illumination and topographic effects to better simulate real data. Synthetic mineral abundances and pure end-member spectra have also been included in tinto.synth for testing, e.g., end-member extraction and unmixing methods.
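The relationship between mineral abundances, end-member spectra and the two synthetic variants can be illustrated with a minimal sketch. It assumes a simple linear mixing model, a single scalar illumination factor and Gaussian sensor noise. These are illustrative assumptions, not necessarily the forward model used to build tinto.synth and tinto.degr, and the end-member names and values are hypothetical.

```python
import random

def mix_spectrum(abundances, endmembers):
    """Linear mixture: each band is the abundance-weighted sum of end-member reflectances."""
    n_bands = len(next(iter(endmembers.values())))
    return [
        sum(abundances[name] * endmembers[name][b] for name in abundances)
        for b in range(n_bands)
    ]

def degrade(spectrum, illumination=0.8, noise_sd=0.01, rng=None):
    """Scale by an illumination factor and add Gaussian noise to each band (assumed model)."""
    rng = rng or random.Random(0)
    return [illumination * r + rng.gauss(0.0, noise_sd) for r in spectrum]

# Hypothetical end-member spectra (three bands, reflectance 0-1); not taken from the dataset.
endmembers = {
    "mineral_a": [0.10, 0.40, 0.70],
    "mineral_b": [0.50, 0.30, 0.20],
}
abundances = {"mineral_a": 0.25, "mineral_b": 0.75}  # fractions sum to 1

clean = mix_spectrum(abundances, endmembers)  # analogous to a noise-free spectrum
noisy = degrade(clean, illumination=0.8, noise_sd=0.01, rng=random.Random(42))  # degraded twin
```

An unmixing method tested against such data would try to recover `abundances` from `clean` or `noisy` given (or after estimating) the end-member spectra.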
Finally, we have also included two versions of the ground-truth labels for each dataset: one simplified (tinto.labels_basic), and one complete (tinto.labels_complete). While we encourage people to use the complete label set, geologically similar classes have been lumped together in the basic dataset to derive a benchmark with fewer classes for testing, e.g., unsupervised methods.
Each of the real and synthetic benchmark datasets described above contains point clouds, stored as lists of vertices in the common .ply format. Each vertex carries additional scalar attributes (properties in the .ply terminology) that hold, e.g., class labels or hyperspectral bands.
These point clouds can be opened for visualisation using the open-source CloudCompare software, where the scalar attributes will be loaded as Scalar Fields. Additionally, online previews of the various datasets can be viewed in the PoTree viewer here.
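To check which scalar attributes a given point cloud carries without loading it, the .ply header can be inspected directly, since the header is plain text even when the vertex data are binary. The following is a minimal sketch using only the Python standard library. The property names in the example header (`label`, `band_0`) are hypothetical and may not match those used in the tinto files.

```python
import io

def read_ply_header(stream):
    """Parse a .ply header from a binary stream; return (format, vertex_count, property_names)."""
    if stream.readline().strip() != b"ply":
        raise ValueError("not a .ply file")
    fmt, n_vertices, properties, element = None, 0, [], None
    for raw in iter(stream.readline, b""):
        tokens = raw.decode("ascii", errors="replace").split()
        if not tokens or tokens[0] == "comment":
            continue
        if tokens[0] == "end_header":
            break
        if tokens[0] == "format":
            fmt = tokens[1]
        elif tokens[0] == "element":
            element = tokens[1]
            if element == "vertex":
                n_vertices = int(tokens[2])
        elif tokens[0] == "property" and element == "vertex":
            properties.append(tokens[-1])  # the property name is the last token
    return fmt, n_vertices, properties

# In-memory example header with a hypothetical attribute layout.
example = io.BytesIO(
    b"ply\n"
    b"format binary_little_endian 1.0\n"
    b"comment hypothetical layout\n"
    b"element vertex 2\n"
    b"property float x\n"
    b"property float y\n"
    b"property float z\n"
    b"property int label\n"
    b"property float band_0\n"
    b"end_header\n"
)
fmt, n_vertices, properties = read_ply_header(example)
```

For a file on disk, the same function can be called on a handle opened in binary mode, e.g. `open(path, "rb")`.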
While the main aim of this benchmark is to provide a much-needed 3-D benchmark dataset (tinto3D) for point-cloud classification methods, we have also included several 2-D representations, as we acknowledge that many state-of-the-art methods cannot (yet) handle unstructured 3-D point cloud data. Tinto2D thus contains three different projections of each hypercloud and associated ground-truth labels: (1) a top-down orthomosaic, (2) an oblique perspective view, and (3) an oblique panoramic view. These are all stored in the widely used ENVI format.
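ENVI images consist of a binary data file and a plain-text header (.hdr) that records the image dimensions, interleave and band wavelengths. As a rough sketch, the header can be parsed with the standard library alone to find the image shape before reading any pixel data. The example header below is made up for illustration, and the helper name `parse_envi_header` is hypothetical rather than part of any library.

```python
def parse_envi_header(text):
    """Parse ENVI header text into a dict; brace-delimited values become lists of strings."""
    lines = text.strip().splitlines()
    if not lines or lines[0].strip() != "ENVI":
        raise ValueError("not an ENVI header")
    fields, i = {}, 1
    while i < len(lines):
        line = lines[i]
        i += 1
        if "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        if value.startswith("{"):
            # Brace-delimited values may continue over several lines.
            while "}" not in value and i < len(lines):
                value += " " + lines[i].strip()
                i += 1
            items = value.strip("{} ").split(",")
            fields[key.lower()] = [item.strip() for item in items if item.strip()]
        else:
            fields[key.lower()] = value
    return fields

# Made-up header for a tiny 3 x 4 pixel image with two bands.
example = """ENVI
samples = 4
lines = 3
bands = 2
interleave = bsq
data type = 4
byte order = 0
wavelength = {
 450.0,
 550.0}
"""
header = parse_envi_header(example)
shape = (int(header["lines"]), int(header["samples"]), int(header["bands"]))
wavelengths = [float(w) for w in header["wavelength"]]
```

In practice a dedicated reader (such as the hylite package mentioned below) handles these details, so this sketch is mainly useful for quick inspection.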
To help users get started with the tinto benchmark, we have created a basic Python tutorial that opens the dataset using hylite and trains a simple classification model using sklearn.
This tutorial can be found in GoogleColab here, and requires no installation or data download (everything runs remotely on Google's servers).
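For readers who want a feel for the workflow before opening the tutorial, the sketch below shows the general idea of per-point spectral classification: fit a model on labelled spectra, predict labels for held-out spectra, and score the result against the ground truth. It is not the tutorial code. It uses a dependency-free nearest-centroid classifier as a stand-in for an sklearn model, and the spectra and class names are hypothetical.

```python
from collections import defaultdict

def fit_centroids(spectra, labels):
    """Return the mean spectrum of each class."""
    grouped = defaultdict(list)
    for spectrum, label in zip(spectra, labels):
        grouped[label].append(spectrum)
    return {
        label: [sum(band) / len(members) for band in zip(*members)]
        for label, members in grouped.items()
    }

def predict(centroids, spectrum):
    """Assign the class whose centroid is closest in squared Euclidean distance."""
    return min(
        centroids,
        key=lambda label: sum((a - b) ** 2 for a, b in zip(spectrum, centroids[label])),
    )

def accuracy(centroids, spectra, labels):
    """Fraction of spectra whose predicted class matches the ground-truth label."""
    hits = sum(predict(centroids, s) == y for s, y in zip(spectra, labels))
    return hits / len(labels)

# Hypothetical three-band spectra and class names, for illustration only.
train_x = [[0.10, 0.40, 0.70], [0.12, 0.38, 0.72], [0.50, 0.30, 0.20], [0.48, 0.32, 0.18]]
train_y = ["class_a", "class_a", "class_b", "class_b"]
test_x = [[0.11, 0.41, 0.69], [0.49, 0.29, 0.21]]
test_y = ["class_a", "class_b"]

centroids = fit_centroids(train_x, train_y)
score = accuracy(centroids, test_x, test_y)
```

With the benchmark itself, the training and test spectra would come from the hyperclouds or 2-D views, and the labels from tinto.labels_basic or tinto.labels_complete.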
The tinto dataset can be downloaded in its entirety or in part from here.