Difference between revisions of "TorchGeo embeddings"
Revision as of 19:52, 14 June 2026
arXiv:2601.13134v1 [cs.SE] 19 Jan 2026
Earth Embeddings as Products: Taxonomy, Ecosystem, and Standardized Access is a survey that organizes existing geospatial embedding products into a three-layer taxonomy: Data, Tools, and Value. The paper provides a detailed metadata atlas (resolution, license, etc.) and proposes unified access by implementing standardized data loaders for these embeddings in [TorchGeo].
The proposed landscape comprises: a) analysis frameworks and tools; b) embedding data artifacts; c) downstream application value, specifically mapping tasks and retrieval tasks.
Embeddings are differentiated as either location-typed, patch-typed, or pixel-typed. Details of existing products are shown. "We extend TorchGeo with a unified API that standardizes the loading and querying of diverse embedding products."
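To make the three embedding types concrete, here is a minimal sketch of how location-typed, patch-typed, and pixel-typed products could sit behind one query call. Everything in it is an assumption for illustration: the class names, the `query(lon, lat)` signature, and the toy encoders are hypothetical and are not the TorchGeo API described in the paper.

```python
import math
from dataclasses import dataclass
from typing import Protocol, Sequence

Vector = Sequence[float]


class EmbeddingProduct(Protocol):
    """Hypothetical common interface (an assumption, not the TorchGeo API)."""

    kind: str

    def query(self, lon: float, lat: float) -> Vector: ...


@dataclass
class LocationEmbeddings:
    """Location-typed: the vector is a function of the coordinate itself."""

    kind: str = "location"

    def query(self, lon: float, lat: float) -> Vector:
        # Toy stand-in for a learned location encoder.
        return [lon / 180.0, lat / 90.0]


@dataclass
class PatchEmbeddings:
    """Patch-typed: one vector per fixed-size tile."""

    vectors: dict[tuple[int, int], Vector]
    tile_deg: float = 1.0
    kind: str = "patch"

    def query(self, lon: float, lat: float) -> Vector:
        # Look up the tile that contains the point.
        key = (math.floor(lon / self.tile_deg), math.floor(lat / self.tile_deg))
        return self.vectors[key]


@dataclass
class PixelEmbeddings:
    """Pixel-typed: a dense grid with one vector per pixel, indexed grid[row][col]."""

    grid: list[list[Vector]]
    west: float
    north: float
    res_deg: float
    kind: str = "pixel"

    def query(self, lon: float, lat: float) -> Vector:
        # Convert the coordinate to a row/column offset from the top-left corner.
        col = int((lon - self.west) / self.res_deg)
        row = int((self.north - lat) / self.res_deg)
        return self.grid[row][col]


products: list[EmbeddingProduct] = [
    LocationEmbeddings(),
    PatchEmbeddings(vectors={(13, 52): [0.1, 0.2]}),
    PixelEmbeddings(
        grid=[[[1.0], [2.0]], [[3.0], [4.0]]], west=13.0, north=53.0, res_deg=0.5
    ),
]

# The same call works for every product type.
results = {p.kind: p.query(13.4, 52.5) for p in products}
```

The point of the sketch is only that callers can stay agnostic about how a product is stored; a real loader would also have to handle coordinate reference systems, time, and out-of-bounds queries.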
1. Foundation Models for Earth Observation (EO)
These are the leading projects that aim to build general-purpose models capable of representing Earth from satellite imagery and other geospatial modalities.
Projects
- Clay Foundation Model – [HuggingFace] (2024)
- A multimodal foundation model for Earth using diverse data sources.
- Major TOM – [MajorTOM] A. Francis, IGARSS 2024
- Expandable datasets and models for global EO coverage.
- Earth Index Embeddings – [EarthGenome] (2025)
- A large-scale embedding system built from Earth observation data.
- Copernicus-Embed – [LINK] Zhu et al., AI4Copernicus Project
- Foundation model leveraging Copernicus Sentinel data.
- Presto Embeddings – [NASAHarvest]
- Embedding framework for satellite time series and land use analysis.
- Tessera Embeddings – [GeoTessera] Docs / [REPO]
- Pixel-based temporal-spectral embeddings for Earth representation.
- Google Satellite Embedding (AlphaEarth) – [LINK] Google Earth Engine
- An early-stage embedding model using Google's global satellite data.
- OlmoEarth – [AllenAI] (2025)
- Latent image modeling approach for multimodal Earth observation.
Key Papers
- X. X. Zhu 2025 [LINK] "On the Foundations of Earth Foundation Models" – Nature Computational Science
- C. F. Brown 2025 [LINK] "AlphaEarth Foundations"
- K. Klemmer 2023 [LINK] "SatCLIP: Global Location Embeddings with Satellite Imagery"
2. Datasets
Large-scale, open-access datasets play a central role in training and evaluating Earth foundation models.
3. Models & Methods
These include both classical and cutting-edge machine learning approaches used in building Earth foundation models.
Core Methods
- SatCLIP – [LINK] AAAI 2025
Contrastive location-image pretraining that yields global location representations.
- MMEarth – [LINK] ECCV 2024
Multimodal pretext tasks for geospatial representation learning.
- ResNet – [LINK] K. He et al., IEEE CVPR 2016
Baseline CNN architecture widely used in EO.
- ConvNeXt V2 – [LINK] Woo et al., IEEE/CVF 2023
ConvNet architecture co-designed with masked autoencoder (MAE) pretraining.
- DINO, DINOv2, DINOv3 – [LINK] Inria / Meta AI, 2021–2025
Self-supervised pretraining methods for vision transformers.
- MAE (Masked Autoencoders) – [LINK] IEEE/CVF 2021
Self-supervised learning for vision transformers.
Distillation & Advanced Approaches
- Distillation methods – Transfer knowledge from large models.
- Neural plasticity-inspired models – TorchGeo_DOFA: Inspired by biological learning mechanisms.
- Multi-label guided soft contrastive learning – Y. Wang, IEEE TGRS, 2024.
- Barlow Twins – Method for learning representations without contrastive loss.
- Continual Barlow Twins – Extends Barlow Twins to continual learning in EO segmentation.
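The Barlow Twins entry above can be illustrated with its objective: instead of contrasting positive and negative pairs, it pushes the cross-correlation matrix between two augmented views toward the identity. The following is a minimal pure-Python sketch of that loss. The function names and the `lambda_offdiag` default are illustrative choices, and a practical implementation would use a tensor library with automatic differentiation.

```python
import math


def standardize(batch, eps=1e-12):
    """Standardize each feature across the batch (zero mean, unit variance)."""
    n, d = len(batch), len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(d)]
    stds = [
        math.sqrt(sum((row[j] - means[j]) ** 2 for row in batch) / n)
        for j in range(d)
    ]
    return [[(row[j] - means[j]) / (stds[j] + eps) for j in range(d)] for row in batch]


def barlow_twins_loss(z_a, z_b, lambda_offdiag=5e-3):
    """Loss over two views z_a, z_b, each a list of N embeddings of size D."""
    n, d = len(z_a), len(z_a[0])
    a, b = standardize(z_a), standardize(z_b)
    # Cross-correlation matrix between the features of the two views.
    c = [
        [sum(a[k][i] * b[k][j] for k in range(n)) / n for j in range(d)]
        for i in range(d)
    ]
    # Invariance term: diagonal entries should be 1.
    on_diag = sum((1.0 - c[i][i]) ** 2 for i in range(d))
    # Redundancy-reduction term: off-diagonal entries should be 0.
    off_diag = sum(c[i][j] ** 2 for i in range(d) for j in range(d) if i != j)
    return on_diag + lambda_offdiag * off_diag


# Toy batch with two decorrelated features.
view = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
flipped = [[-x for x in row] for row in view]

loss_same = barlow_twins_loss(view, view)        # views agree: near zero
loss_flipped = barlow_twins_loss(view, flipped)  # views anti-correlated: large
```

Identical, decorrelated views give a loss close to zero, while sign-flipped views are penalized on the diagonal, which is the behaviour the objective is designed to have.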
4. Tools & Benchmarks
These are software systems and frameworks that support development, evaluation, or deployment of EO AI models.
Tools
- TorchGeo
PyTorch library for geospatial deep learning.
- NeuCo-Bench – [LINK] R. Vinge, arXiv 2025
Benchmarking framework for neural embeddings in Earth observation.
- GeoINRID – [LINK] GitHub: arjunarao619/GeoINRID
Geospatial inference and representation learning toolkit.
Challenges
- Embed2Scale Challenge – [LINK] C. Albrecht, CVPR 2025
Large-scale Earth vision challenge focused on scale-aware embeddings.
- TerraMind Blue-Sky Challenge
Generative modeling for Earth observation.
5. Key Themes & Trends
- Foundation Models: TorchGeo now includes data loaders for embeddings designed for search/retrieval (Clay, Major TOM, Earth Index) and for dense prediction tasks such as land cover mapping (Copernicus, Presto, Tessera, Google). This enables fair, side-by-side benchmarking of different embedding models on the same downstream tasks and forms the basis for future experiments. Projects are encouraged to improve explainability.
- Major TOM Notes: Major TOM embeddings are not (yet) really product-oriented and are aimed with a similar purpose to the MT Core datasets - to make it easier to experiment and benchmark model outputs (hence, unlike TESSERA and AEF which came a few months after, MT embeddings do not have consistent or aggregated temporal scope). We haven't had enough time to finish off the preprint, but my current plan is to provide a simple MT Embedding benchmark at this year's EGU and integrate that into the arxiv pre-print. --Miko
- Earth Index / Earth Genome: Use the Earth Index application (earthindex.ai) for non-technical users to use the embeddings we published on source.coop. Users of the web app (non-technical journalists, indigenous communities/allies, NGOs) have been our main focus. Users of the source.coop embeddings have generally been more technical folks interested in exploring/innovating in what's possible --BenStrong
- Clay: Clay and Presto offer documented tutorials on generating new embeddings with their models. In Clay, the encoder receives unmasked patches, latitude-longitude data, and timestep information. Notably, the last two embeddings from the encoder represent the latitude-longitude and timestep embeddings, respectively.
- Self-Supervised Learning (SSL):
- Multimodal Integration:
- Open Data & Tools: Open-source projects (e.g., TorchGeo, Copernicus-Embed) and public datasets (EuroSAT, EuroCrops) are crucial for reproducibility and democratization of EO AI. Projects are encouraged to increase input data diversity and to adopt cloud-native formats for geospatial data.
- Benchmarking: Projects are encouraged to standardize benchmarking. Existing benchmarks include NeuCo-Bench and Embed2Scale.
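As a rough illustration of side-by-side benchmarking on the same downstream task, the sketch below scores two embedding products with leave-one-out 1-nearest-neighbour accuracy under cosine similarity, which also doubles as a tiny retrieval example. The product names, vectors, and labels are made-up toy data, and the helper functions are hypothetical; none of this comes from TorchGeo, NeuCo-Bench, or Embed2Scale.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def retrieve(query, index, k=1):
    """Return the ids of the k entries in `index` most similar to `query`."""
    ranked = sorted(index, key=lambda i: cosine(query, index[i]), reverse=True)
    return ranked[:k]


def loo_1nn_accuracy(embeddings, labels):
    """Leave-one-out 1-NN accuracy: each sample is classified by its nearest neighbour."""
    correct = 0
    for sample_id, vector in embeddings.items():
        rest = {j: v for j, v in embeddings.items() if j != sample_id}
        nearest = retrieve(vector, rest, k=1)[0]
        correct += labels[nearest] == labels[sample_id]
    return correct / len(embeddings)


# Same locations and labels for both products, so the comparison is fair.
labels = {"a": "forest", "b": "forest", "c": "urban", "d": "urban"}
candidates = {
    "product_x": {"a": [1.0, 0.1], "b": [0.9, 0.2], "c": [0.1, 1.0], "d": [0.2, 0.9]},
    "product_y": {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 0.1], "d": [0.1, 1.0]},
}

scores = {name: loo_1nn_accuracy(emb, labels) for name, emb in candidates.items()}
```

Holding the samples, labels, and probe fixed while swapping only the embeddings is what makes the scores comparable; a real benchmark would use held-out splits and a stronger probe than 1-NN.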
Datasets Auto-Edit
Datasets
- EuroSAT – [Zenodo]
Land use classification dataset using Sentinel-2 satellite data.
- EuroCrops – [PMC_10495462]
Crop type mapping dataset for Europe.
- National Land Cover Database (NLCD) – [MRLC]
USA land cover classes.
- SSL4EO-S12 – [GitHub]
Multimodal, multitemporal dataset for self-supervised learning.
- Copernicus-Pretrain – [GitHub]
An extension of the SSL4EO-S12 dataset to all major Sentinel missions (S1-S5P).
- BigEarthNet – [BigEarthNet]
Large-scale multi-label satellite image classification dataset.
- RESISC45 – [IEEE DOI]
Remote sensing image classification dataset with 45 categories.
- UC Merced – [UC Merced]
Aerial image dataset for land use classification.
- Potsdam – [ISPRS]
Semantic segmentation dataset for urban areas from aerial imagery.
- Vaihingen – [ISPRS]
Semantic segmentation dataset for urban areas from aerial imagery.
- Inria Aerial Image Labeling – [Inria]
Aerial image segmentation dataset for building footprint extraction.
- NAIP – [USGS EROS]
National Agriculture Imagery Program data for the USA.
- Sentinel-2 – [Copernicus]
Multispectral imagery from the Sentinel-2 mission.
- Landsat – [USGS Landsat]
Long-term archive of medium-resolution satellite imagery.
- OpenStreetMap – [OpenStreetMap]
Collaborative project to create a free editable map of the world.
- GFED (Global Fire Emissions Database) – [Global Fire Data]
Global dataset of biomass burning emissions.
- GBIF – [GBIF]
Global biodiversity information facility dataset.
- Open Buildings – [Microsoft Research]
Global building footprint detection dataset.
- OpenAerialMap – [OpenAerialMap]
Open-source aerial imagery dataset.
- NASA Marine Debris – [NASA Data]
Marine debris detection dataset.
- Major TOM – [GitHub]
Expandable datasets for global EO coverage.
- Google Satellite Embedding – [Google Earth Engine]
Pre-trained embeddings for Google satellite imagery.
- DOTA – [DOTA Website]
Large-scale dataset for object detection in aerial images.
- Cropland Data Layer – [USDA NASS]
Crop-specific land cover dataset for the USA.
- CropHarvest – [GitHub]
Global crop type classification dataset using Sentinel-1 and Sentinel-2, among other inputs.
- COWC – [Microsoft Research]
Cars Overhead With Context: dataset for detecting and counting cars in aerial images.
- Copernicus-Embed – [GitHub]
Pre-trained embeddings for Copernicus data.
- Copernicus-Bench – [GitHub]
Benchmark dataset for Copernicus data.
- Cloud-Cover-Detection – [GitHub]
Cloud cover detection dataset.
- Clay-Embeddings – [GitHub]
Precomputed embeddings from the Clay foundation model.
- Chesapeake – [GitHub]
Land cover classification dataset for the Chesapeake Bay region.
- ChaBuD – [GitHub]
Change detection dataset for burned area delineation.
- CaBuAr – [GitHub]
California burned areas dataset for burned area delineation.
- BRIGHT – [GitHub]
Multimodal building damage assessment dataset for disaster response.
- BioMassters – [GitHub]
Biomass estimation dataset.
- Benin Cashew Plantations – [GitHub]
Cashew plantation mapping dataset for Benin.
- Aboveground-Woody-Biomass – [GitHub]
Aboveground woody biomass estimation dataset.
Datasets Detail
- CaFFe – [Caffe Website]
Glacier calving front segmentation dataset.
- Benchmark.csv – [Benchmark GitHub]
Benchmark dataset for remote sensing.
- ADVANCE – [GitHub]
Audio-visual aerial scene recognition dataset.
- National Land Cover Database (NLCD) – [LINK] Photogrammetric Engineering & Remote Sensing (2001)
USA land cover classes.
- SSL4EO-S12 – [LINK] IEEE Geoscience and Remote Sensing (2023)
Multimodal, multitemporal dataset for self-supervised learning.
- Copernicus-Pretrain – [LINK] IEEE Geoscience and Remote Sensing (2023)
An extension of the SSL4EO-S12 dataset to all major Sentinel missions (S1-S5P).
Research Directions
- Unified Earth Foundation Models:
- Interpretability in EO AI: Exploring how these embeddings can be interpreted by domain experts.
- Ethics and Bias: Investigating fairness and bias in global EO models trained on unevenly distributed data.
- Edge Deployment: Making these large foundation models deployable on resource-constrained platforms (e.g., for field use).