TorchGeo embeddings
Revision as of 18:59, 14 June 2026
arXiv:2601.13134v1 [cs.SE] 19 Jan 2026
"Earth Embeddings as Products: Taxonomy, Ecosystem, and Standardized Access" is a survey that organizes existing geospatial embedding products into a three-layer taxonomy: Data, Tools, and Value. The paper provides a detailed metadata atlas (resolution, license, etc.). It also proposes unified integration by implementing standardized data loaders for these embeddings in [TorchGeo].
The paper proposes an overview of the landscape comprising: a) analysis frameworks and tools, b) embedding data artifacts, and c) downstream application value, specifically mapping tasks and retrieval tasks.
Embeddings are differentiated as either location-typed, patch-typed, or pixel-typed. Details of existing products are shown. The authors write: "We extend TorchGeo with a unified API that standardizes the loading and querying of diverse embedding products."
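As a rough illustration of the three granularities named above, the sketch below models each one as a small Python data structure and adds a single point-query helper. All names here (LocationEmbedding, PatchEmbedding, PixelEmbedding, query_point) are hypothetical and are not TorchGeo's API. The bounding-box convention and the north-up grid layout are also assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Vector = List[float]
# Assumed convention: (min_lon, min_lat, max_lon, max_lat) in degrees.
BBox = Tuple[float, float, float, float]


def in_bbox(bbox: BBox, lon: float, lat: float) -> bool:
    """True if the point lies inside the bounding box (edges included)."""
    min_lon, min_lat, max_lon, max_lat = bbox
    return min_lon <= lon <= max_lon and min_lat <= lat <= max_lat


@dataclass
class LocationEmbedding:
    """Location-typed: an encoder that maps a coordinate to a vector."""
    encode: Callable[[float, float], Vector]


@dataclass
class PatchEmbedding:
    """Patch-typed: one vector summarizing a whole image patch."""
    bbox: BBox
    vector: Vector


@dataclass
class PixelEmbedding:
    """Pixel-typed: a dense grid holding one vector per pixel."""
    bbox: BBox
    grid: List[List[Vector]]  # grid[row][col]; row 0 is the northern edge

    def vector_at(self, lon: float, lat: float) -> Vector:
        min_lon, min_lat, max_lon, max_lat = self.bbox
        rows, cols = len(self.grid), len(self.grid[0])
        col = int((lon - min_lon) / (max_lon - min_lon) * cols)
        row = int((max_lat - lat) / (max_lat - min_lat) * rows)
        # Clamp so points on the far edges map to the last row/column.
        return self.grid[min(row, rows - 1)][min(col, cols - 1)]


def query_point(product, lon: float, lat: float) -> Optional[Vector]:
    """Return the vector a product offers at (lon, lat), or None if uncovered."""
    if isinstance(product, LocationEmbedding):
        return product.encode(lon, lat)
    if not in_bbox(product.bbox, lon, lat):
        return None
    if isinstance(product, PatchEmbedding):
        return product.vector
    return product.vector_at(lon, lat)
```

The point of the sketch is only that a single query function can hide the difference between an encoder evaluated on demand, a per-patch vector, and a per-pixel raster. A real loader would also have to handle coordinate reference systems and time.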
1. Foundation Models for Earth Observation (EO)
These are the leading projects that aim to build general-purpose models capable of representing Earth from satellite imagery and other geospatial modalities.
Projects
- Clay Foundation Model – [HuggingFace] (2024)
- A multimodal foundation model for Earth using diverse data sources.
- Major TOM – [MajorTOM] AFrancis IGARSS 2024
- Expandable datasets and models for global EO coverage.
- Earth Index Embeddings – [EarthGenome] (2025)
- A large-scale embedding system built from Earth observation data.
- Copernicus-Embed – [LINK] Zhu et al., AI4Copernicus Project
- Foundation model leveraging Copernicus Sentinel data.
- Presto Embeddings – [NASAHarvest]
- Embedding framework for satellite time series and land use analysis.
- Tessera Embeddings – [GeoTessera] Docs / [REPO]
- Pixel-based temporal spectral embeddings for Earth representation.
- Google Satellite Embedding (AlphaEarth) – [LINK] Google Earth Engine
- An early-stage embedding model using Google's global satellite data.
- OlmoEarth – [AllenAI] (2025)
- Latent image modeling approach for multimodal Earth observation.
Key Papers
- XXZhu 2025 [LINK] "On the Foundations of Earth Foundation Models" – Nature Computational Science
- CFBrown 2025 [LINK] "AlphaEarth Foundations"
- KKlemmer 2023 [LINK] "SatCLIP: Global Location Embeddings with Satellite Imagery"
2. Datasets
Large-scale, open-access datasets play a central role in training and evaluating Earth foundation models.
Datasets
- EuroSAT – [Zenodo]
Land use classification dataset using Sentinel-2 satellite data.
- EuroCrops – [PMC_10495462]
Crop type mapping dataset for Europe.
- National Land Cover Database (NLCD) – [NLCD_Legend]
USA land cover classes.
- SSL4EO-S12 – [GitHub]
Multimodal, multitemporal dataset for self-supervised learning.
- Copernicus-Pretrain – [GitHub]
An extension of the SSL4EO-S12 dataset to all major Sentinel missions (S1-S5P).
- BigEarthNet – [Site]
Large-scale multi-label satellite image classification dataset.
- RESISC45 – [DOI]
Remote sensing image classification dataset with 45 categories.
- UC Merced – [UCMerced_Datasets]
Aerial image dataset for land use classification.
- Potsdam – [ISPRS]
Semantic segmentation dataset for urban areas from aerial imagery.
- Inria Aerial Image Labeling – [Inria]
Aerial image segmentation dataset for building footprint extraction.
- NAIP – [USGS_NAIP]
National Agriculture Imagery Program data for the USA.
- Sentinel-2 – [Sentinel]
Multispectral imagery from the Sentinel-2 mission.
- Landsat – [Landsat_USGS]
Long-term archive of medium-resolution satellite imagery.
- OpenStreetMap – [OpenStreetMap]
Collaborative project to create a free editable map of the world.
- GFED (Global Fire Emissions Database) – [GFED]
Global dataset of biomass burning emissions.
- GBIF – [GBIF]
Global biodiversity information facility dataset.
- Open Buildings – [MSFT_Bldgs]
Global building footprint detection dataset.
- OpenAerialMap – [OpenAerialMap]
Open-source aerial imagery dataset.
- NASA Marine Debris – [NASA Data]
Marine debris detection dataset.
- Major TOM – [GitHub]
Expandable, large-scale Earth observation dataset.
- Google Satellite Embedding – [GitHub]
Pre-trained embeddings for Google satellite imagery.
- DOTA – [DOTA]
Large-scale dataset for object detection in aerial images.
- Cropland Data Layer – [USDA NASS]
Crop-specific land cover dataset for the USA.
- CropHarvest – [GitHub]
Global crop type classification dataset with Sentinel-1 and Sentinel-2 inputs.
- COWC – [GitHub]
Cars Overhead With Context: dataset for counting and detecting cars in aerial images.
- Copernicus-Embed – [GitHub]
Pre-trained embeddings for Copernicus data.
- Copernicus-Bench – [GitHub]
Benchmark dataset for Copernicus data.
- Cloud-Cover-Detection – [GitHub]
Cloud cover detection dataset.
- Clay-Embeddings – [GitHub]
Precomputed embeddings from the Clay foundation model.
- Chesapeake – [GitHub]
Land cover classification dataset for the Chesapeake Bay region.
- ChaBuD – [GitHub]
Change detection dataset for burned area delineation.
- CaFFe – [Caffe Website]
Glacier calving front segmentation dataset from SAR imagery.
- CaBuAr – [GitHub]
California burned areas delineation dataset.
- BRIGHT – [GitHub]
Multimodal (optical and SAR) building damage assessment dataset.
- BioMassters – [GitHub]
Biomass estimation dataset.
- Benin Cashew Plantations – [GitHub]
Cashew plantation mapping dataset for Benin.
- Benchmark.csv – [Benchmark GitHub]
Benchmark dataset for remote sensing.
- ADVANCE – [GitHub]
Audiovisual aerial scene recognition dataset.
- Aboveground-Woody-Biomass – [GitHub]
Aboveground woody biomass estimation dataset.
- National Land Cover Database (NLCD) – [LINK] Photogrammetric Engineering & Remote Sensing (2001)
USA land cover classes.
- SSL4EO-S12 – [LINK] IEEE Geoscience and Remote Sensing (2023)
Multimodal, multitemporal dataset for self-supervised learning.
- Copernicus-Pretrain – [LINK] IEEE Geoscience and Remote Sensing (2023)
An extension of the SSL4EO-S12 dataset to all major Sentinel missions (S1-S5P).
3. Models & Methods
These include both classical and cutting-edge machine learning approaches used in building Earth foundation models.
Core Methods
- SatCLIP – [LINK] AAAI 2025
Contrastive location-image model for global location representations.
- MMEarth – [LINK] ECCV 2024
Multimodal pretext tasks for geospatial representation learning.
- ResNet – [LINK] KHe IEEE/CVF 2016
Baseline CNN architecture widely used in EO.
- ConvNeXt V2 – [LINK] Woo et al., IEEE/CVF 2023
ConvNet architecture co-designed with masked autoencoder (MAE) pretraining.
- DINO, DINOv2, DINOv3 – [LINK] INRIA 2021–2023, META
Vision transformers with self-supervised learning capabilities.
- MAE (Masked Autoencoders) – [LINK] IEEE/CVF 2021
Self-supervised learning for vision transformers.
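The masking step that MAE-style pretraining relies on can be sketched in a few lines. This is a minimal illustration, not the reference implementation: the function name random_patch_mask is hypothetical, and the default ratio of 0.75 is the value commonly cited for MAE, assumed here rather than taken from this page.

```python
import random


def random_patch_mask(num_patches, mask_ratio=0.75, seed=None):
    """Split patch indices into a visible set and a masked set.

    The encoder would see only the visible patches; the decoder is
    trained to reconstruct the masked ones.
    """
    rng = random.Random(seed)
    order = list(range(num_patches))
    rng.shuffle(order)  # random permutation of patch indices
    num_keep = int(num_patches * (1 - mask_ratio))
    visible = sorted(order[:num_keep])
    masked = sorted(order[num_keep:])
    return visible, masked


# A 224x224 image cut into 16x16 patches gives 14 * 14 = 196 patches.
visible, masked = random_patch_mask(196, mask_ratio=0.75, seed=0)
print(len(visible), len(masked))
```

Because the encoder processes only the visible quarter of the patches in this setup, pretraining cost drops considerably, which is one reason the approach is attractive for large EO archives.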
Distillation & Advanced Approaches
- Distillation methods – Transfer knowledge from large models.
- Neural plasticity-inspired models – TorchGeo_DOFA: Inspired by biological learning mechanisms.
- Multi-label guided soft contrastive learning – YWang, IEEE TGRS, 2024.
- Barlow Twins – Method for learning representations without contrastive loss.
- Continual Barlow Twins – Extends Barlow Twins to continual learning in EO segmentation.
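To make "without contrastive loss" concrete, here is a dependency-free sketch of the Barlow Twins objective: standardize each feature over the batch, form the cross-correlation matrix between two views, push its diagonal toward 1, and push its off-diagonal entries toward 0. The function names and the default weight lam=5e-3 are assumptions for illustration. Practical implementations work on tensors in a deep learning framework.

```python
import math


def _standardize(batch, eps=1e-12):
    """Give every feature column zero mean and unit variance over the batch."""
    n, dims = len(batch), len(batch[0])
    out = [[0.0] * dims for _ in range(n)]
    for d in range(dims):
        col = [row[d] for row in batch]
        mean = sum(col) / n
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / n)
        for b in range(n):
            out[b][d] = (col[b] - mean) / (std + eps)
    return out


def barlow_twins_loss(z_a, z_b, lam=5e-3):
    """Redundancy-reduction loss between embeddings of two augmented views.

    z_a, z_b: lists of N rows, each row a list of D floats.
    """
    n = len(z_a)
    za, zb = _standardize(z_a), _standardize(z_b)
    dims = len(za[0])
    on_diag, off_diag = 0.0, 0.0
    for i in range(dims):
        for j in range(dims):
            # Entry (i, j) of the cross-correlation matrix.
            c_ij = sum(za[b][i] * zb[b][j] for b in range(n)) / n
            if i == j:
                on_diag += (1.0 - c_ij) ** 2  # invariance term
            else:
                off_diag += c_ij ** 2  # redundancy-reduction term
    return on_diag + lam * off_diag
```

Two identical views with decorrelated features give a loss near zero, while views whose features are copies of each other are penalized through the off-diagonal term. No negative pairs are needed, which is the contrast with contrastive objectives.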
4. Tools & Benchmarks
These are software systems and frameworks that support development, evaluation, or deployment of EO AI models.
Tools
- TorchGeo – [TorchGeo]
PyTorch library for geospatial deep learning.
- NeuCo-Bench – [LINK] RVinge, arXiv 2025
Benchmarking framework for neural embeddings in Earth observation.
- GeoINRID – [LINK] GitHub: arjunarao619/GeoINRID
Geospatial inference and representation learning toolkit.
Challenges
- Embed2Scale Challenge – [LINK] CVPR CAlbrecht 2025
Large-scale Earth vision challenge focused on scale-aware embeddings.
- TerraMind Blue-Sky Challenge –
Generative modeling for Earth observation.
5. Key Themes & Trends
- Foundation Models: TorchGeo now includes data loaders designed for search/retrieval (Clay, Major TOM, Earth Index) and for dense prediction tasks such as land cover mapping (Copernicus, Presto, Tessera, Google). This enables fair, side-by-side benchmarking of different embedding models on the same downstream tasks and forms the basis for future experiments. Projects are encouraged to strengthen explainability.
- Major TOM Notes: Major TOM embeddings are not (yet) really product-oriented and are aimed with a similar purpose to the MT Core datasets - to make it easier to experiment and benchmark model outputs (hence, unlike TESSERA and AEF which came a few months after, MT embeddings do not have consistent or aggregated temporal scope). We haven't had enough time to finish off the preprint, but my current plan is to provide a simple MT Embedding benchmark at this year's EGU and integrate that into the arxiv pre-print. --Miko
- Earth Index / Earth Genome: The Earth Index application (earthindex.ai) lets non-technical users work with the embeddings we published on source.coop. Users of the web app (non-technical journalists, indigenous communities/allies, NGOs) have been our main focus. Users of the source.coop embeddings have generally been more technical folks interested in exploring/innovating in what's possible. --BenStrong
- Clay: Clay and Presto offer documented tutorials on generating new embeddings with their models. In Clay, the encoder receives unmasked patches, latitude-longitude data, and timestep information. Notably, the last two embeddings from the encoder represent the latitude-longitude and timestep information.
- Self-Supervised Learning (SSL):
- Multimodal Integration:
- Open Data & Tools: Open-source projects (e.g., TorchGeo, Copernicus-Embed) and public datasets (EuroSAT, EuroCrops) are crucial for reproducibility and democratization of EO AI. Projects are encouraged to increase input data diversity and to adopt cloud-native formats for geospatial data.
- Benchmarking: Projects are encouraged to standardize benchmarking. Existing benchmarks include NeuCo-Bench and Embed2Scale.
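The search/retrieval use of patch embeddings mentioned in the list above usually reduces to nearest-neighbor lookup in embedding space. The sketch below ranks stored vectors by cosine similarity to a query vector. The function names, the dictionary-based index, and the patch identifiers are hypothetical. A production system would more likely use an approximate nearest-neighbor index over many millions of vectors.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query, index, k=3):
    """Return the k (patch_id, score) pairs most similar to the query.

    index: dict mapping a patch identifier to its embedding vector.
    """
    scored = [(cosine_similarity(query, vec), key) for key, vec in index.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [(key, score) for score, key in scored[:k]]


# Toy index with made-up two-dimensional vectors.
index = {
    "patch_a": [1.0, 0.0],
    "patch_b": [0.0, 1.0],
    "patch_c": [1.0, 1.0],
}
for patch_id, score in top_k([1.0, 0.1], index, k=2):
    print(patch_id, round(score, 3))
```

Running the same query set against indexes built from different embedding products is one simple way to compare them side by side on a retrieval task.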
Research Directions
- Unified Earth Foundation Models:
- Interpretability in EO AI: Exploring how these embeddings can be interpreted by domain experts.
- Ethics and Bias: Investigating fairness and bias in global EO models trained on unevenly distributed data.
- Edge Deployment: Making these large foundation models deployable on resource-constrained platforms (e.g., for field use).