TorchGeo embeddings
arXiv:2601.13134v1 [cs.SE] 19 Jan 2026
Earth Embeddings as Products: Taxonomy, Ecosystem, and Standardized Access is a comprehensive survey that organizes existing geospatial embedding products into a three-layer taxonomy: Data, Tools, and Value. The paper provides a detailed metadata atlas (resolution, license, etc.) and proposes unified integration by implementing standardized data loaders for these embeddings in [TorchGeo].
The paper proposes an overview of the landscape comprising: a) analysis frameworks and tools; b) embedding data artifacts; c) downstream application value, specifically mapping tasks and retrieval tasks.
Embeddings are classified as location-typed, patch-typed, or pixel-typed, and details of existing products are shown. The authors state: "We extend TorchGeo with a unified API that standardizes the loading and querying of diverse embedding products."
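As a minimal sketch of the three embedding types and the metadata atlas, the record below shows one way a catalog entry could be modeled. The class names, field names, and the `group_by_kind` helper are hypothetical and are not part of TorchGeo or the paper, and the per-type comments are an interpretation of the type names. The resolution and license fields are left unset because their values are not listed here. The two example entries rely only on what this page states: the SatCLIP paper title refers to location embeddings, and Tessera is described as pixel-based.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class EmbeddingKind(Enum):
    """The three embedding types named in the survey summary."""
    LOCATION = "location"  # interpreted as: one vector per geographic coordinate
    PATCH = "patch"        # interpreted as: one vector per image patch
    PIXEL = "pixel"        # interpreted as: one vector per raster pixel


@dataclass(frozen=True)
class EmbeddingProduct:
    """Hypothetical catalog record; field names are illustrative only."""
    name: str
    kind: EmbeddingKind
    resolution_m: Optional[float] = None  # left unset: value not given on this page
    license: Optional[str] = None         # left unset: value not given on this page


def group_by_kind(products):
    """Return a dict mapping each EmbeddingKind to a list of product names."""
    groups = defaultdict(list)
    for product in products:
        groups[product.kind].append(product.name)
    return dict(groups)


catalog = [
    EmbeddingProduct("SatCLIP", EmbeddingKind.LOCATION),
    EmbeddingProduct("Tessera", EmbeddingKind.PIXEL),
]
groups = group_by_kind(catalog)
print(groups)
```

Keeping the type as an explicit field lets a loader or benchmark choose a matching downstream task per product instead of inferring it from array shapes.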
1. Foundation Models for Earth Observation (EO)
These are the leading projects that aim to build general-purpose models capable of representing Earth from satellite imagery and other geospatial modalities.
Projects
- Clay Foundation Model – [HuggingFace] (2024)
- A multimodal foundation model for Earth using diverse data sources.
- Major TOM – [MajorTOM] AFrancis IGARSS 2024
- Expandable datasets and models for global EO coverage.
- Earth Index Embeddings – [EarthGenome] (2025)
- A large-scale embedding system built from Earth observation data.
- Copernicus-Embed – [LINK] Zhu et al., AI4Copernicus Project
- Foundation model leveraging Copernicus Sentinel data.
- Presto Embeddings – [NASAHarvest]
- Embedding framework for satellite time series and land use analysis.
- Tessera Embeddings – [GeoTessera] Docs / [REPO]
- Pixel-based temporal-spectral embeddings for Earth representation.
- Google Satellite Embedding (AlphaEarth) – [LINK] Google Earth Engine
- An early-stage embedding model using Google's global satellite data.
- OlmoEarth – [AllenAI] (2025)
- Latent image modeling approach for multimodal Earth observation.
Key Papers
- XXZhu 2025 [LINK] "On the Foundations of Earth Foundation Models" – Nature Computational Science
- CFBrown 2025 [LINK] "AlphaEarth Foundations"
- KKlemmer 2023 [LINK] "SatCLIP: Global Location Embeddings with Satellite Imagery"
2. Datasets
Large-scale, open-access datasets play a central role in training and evaluating Earth foundation models.
3. Models & Methods
These include both classical and cutting-edge machine learning approaches used in building Earth foundation models.
Core Methods
- SatCLIP – [LINK] AAAI 2025
Contrastive location-image pretraining for global location representations.
- MMEarth – [LINK] ECCV 2024
Multimodal pretext tasks for geospatial representation learning.
- ResNet – [LINK] KHe, IEEE/CVF 2016
Baseline CNN architecture widely used in EO.
- ConvNeXt V2 – [LINK] Woo et al., IEEE/CVF 2023
Efficient ConvNet architecture using masked autoencoders (MAE).
- DINO, DINOv2, DINOv3 – [LINK] INRIA, META 2021–2025
Self-supervised vision transformers.
- MAE (Masked Autoencoders) – [LINK] IEEE/CVF 2021
Self-supervised learning for vision transformers.
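To illustrate the masked-autoencoder idea listed above, the sketch below randomly hides a fraction of patch indices so that an encoder would process only the visible ones. It is a simplified stand-in, not the reference MAE code; the function name `random_patch_mask`, the 196-patch example, and the 0.75 mask ratio are assumptions chosen for illustration.

```python
import random


def random_patch_mask(num_patches, mask_ratio=0.75, seed=None):
    """Split patch indices into (visible, masked) lists by random sampling."""
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    rng = random.Random(seed)
    indices = list(range(num_patches))
    rng.shuffle(indices)
    num_masked = int(num_patches * mask_ratio)
    masked = sorted(indices[:num_masked])    # hidden from the encoder
    visible = sorted(indices[num_masked:])   # the only patches the encoder sees
    return visible, masked


# Hypothetical example: a 14 x 14 grid of patches (196 in total).
visible, masked = random_patch_mask(num_patches=196, mask_ratio=0.75, seed=0)
print(len(visible), len(masked))
```

A decoder would then be trained to reconstruct the masked patches from the encoded visible ones.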
Distillation & Advanced Approaches
- Distillation methods – Transfer knowledge from large models.
- Neural plasticity-inspired models – TorchGeo_DOFA: Inspired by biological learning mechanisms.
- Multi-label guided soft contrastive learning – YWang, IEEE TGRS, 2024.
- Barlow Twins – Self-supervised method that learns representations by redundancy reduction rather than a contrastive loss.
- Continual Barlow Twins – Extends Barlow Twins to continual learning in EO segmentation.
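As a rough illustration of how Barlow Twins works without a contrastive loss, the sketch below computes its redundancy-reduction objective in plain Python: the cross-correlation matrix between the standardized embeddings of two views is pushed toward the identity matrix. The function names, the use of the population standard deviation, and the off-diagonal weight of 5e-3 are assumptions for this sketch. Real training code would use a tensor library with automatic differentiation.

```python
from statistics import fmean, pstdev


def _standardize(batch):
    """Zero-mean, unit-variance per feature over the batch (population std).

    Assumes every feature varies across the batch; a constant feature would
    cause a division by zero.
    """
    n_features = len(batch[0])
    columns = [[row[j] for row in batch] for j in range(n_features)]
    means = [fmean(col) for col in columns]
    stds = [pstdev(col) for col in columns]
    return [
        [(row[j] - means[j]) / stds[j] for j in range(n_features)]
        for row in batch
    ]


def barlow_twins_loss(z_a, z_b, lambda_offdiag=5e-3):
    """Sum of (1 - C_ii)^2 plus lambda times the sum of C_ij^2 for i != j."""
    n = len(z_a)
    a, b = _standardize(z_a), _standardize(z_b)
    d = len(a[0])
    loss = 0.0
    for i in range(d):
        for j in range(d):
            # Cross-correlation between feature i of view A and feature j of view B.
            c_ij = sum(a[k][i] * b[k][j] for k in range(n)) / n
            if i == j:
                loss += (1.0 - c_ij) ** 2          # invariance term
            else:
                loss += lambda_offdiag * c_ij ** 2  # redundancy-reduction term
    return loss


# Toy batch of 4 samples with 2 decorrelated features.
view_a = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
loss_identical = barlow_twins_loss(view_a, view_a)
# Second view with the two feature columns swapped.
view_b = [[row[1], row[0]] for row in view_a]
loss_swapped = barlow_twins_loss(view_a, view_b)
print(loss_identical, loss_swapped)
```

Identical, decorrelated views give a cross-correlation matrix equal to the identity and therefore a loss of zero, while swapping the feature columns moves all correlation off the diagonal and is penalized by both terms.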
4. Tools & Benchmarks
These are software systems and frameworks that support development, evaluation, or deployment of EO AI models.
Tools
- TorchGeo
PyTorch library for geospatial deep learning.
- NeuCo-Bench – [LINK] RVinge, arXiv 2025
Benchmarking framework for neural embeddings in Earth observation.
- GeoINRID – [LINK] GitHub: arjunarao619/GeoINRID
Geospatial inference and representation learning toolkit.
Challenges
- Embed2Scale Challenge – [LINK] CAlbrecht, CVPR 2025
Large-scale Earth vision challenge focused on scale-aware embeddings.
- TerraMind Blue-Sky Challenge –
Generative modeling for Earth observation.
5. Key Themes & Trends
- Foundation Models: TorchGeo now includes data loaders designed for search/retrieval (Clay, Major TOM, Earth Index) and for dense prediction tasks such as land cover mapping (Copernicus, Presto, Tessera, Google). This enables fair, side-by-side benchmarking of different embedding models on the same downstream tasks and forms the basis for future experiments. Projects are encouraged to improve explainability.
- Major TOM Notes: Major TOM embeddings are not (yet) really product-oriented and serve a similar purpose to the MT Core datasets: to make it easier to experiment and benchmark model outputs (hence, unlike TESSERA and AEF, which came a few months later, MT embeddings do not have a consistent or aggregated temporal scope). We haven't had enough time to finish off the preprint, but my current plan is to provide a simple MT Embedding benchmark at this year's EGU and integrate that into the arXiv preprint. --Miko
- Earth Index / Earth Genome: The Earth Index application (earthindex.ai) lets non-technical users work with the embeddings we published on source.coop. Users of the web app (non-technical journalists, indigenous communities/allies, NGOs) have been our main focus. Users of the source.coop embeddings have generally been more technical folks interested in exploring/innovating in what's possible. --BenStrong
- Clay: Clay and Presto offer documented tutorials on generating new embeddings with their models. In Clay, the encoder receives unmasked patches, latitude-longitude data, and timestep information. Notably, the last two embeddings from the encoder represent the latitude-longitude and timestep embeddings.
- Self-Supervised Learning (SSL):
- Multimodal Integration:
- Open Data & Tools: Open-source projects (e.g., TorchGeo, Copernicus-Embed) and public datasets (EuroSAT, EuroCrops) are crucial for reproducibility and democratization of EO AI. Projects are encouraged to increase input data diversity and to adopt cloud-native formats for geospatial data.
- Benchmarking: Projects are encouraged to standardize benchmarking. Existing benchmarks include NeuCo-Bench and Embed2Scale.
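The side-by-side benchmarking idea above can be sketched without any particular library: score each embedding product on the same labeled samples with the same retrieval metric. The sketch below uses leave-one-out top-1 retrieval with cosine similarity. The product names, vectors, labels, and function names are made up for illustration and do not come from TorchGeo, NeuCo-Bench, or Embed2Scale.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def top1_retrieval_accuracy(embeddings, labels):
    """For each sample, retrieve its most similar other sample and
    count a hit when the two labels match."""
    hits = 0
    for i, query in enumerate(embeddings):
        candidates = [j for j in range(len(embeddings)) if j != i]
        best = max(candidates, key=lambda j: cosine_similarity(query, embeddings[j]))
        hits += labels[best] == labels[i]
    return hits / len(embeddings)


# Toy data: the same four labeled samples embedded by two hypothetical products.
labels = ["forest", "forest", "water", "water"]
products = {
    "product_a": [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]],
    "product_b": [[1.0, 0.0], [0.0, 1.0], [1.0, 0.1], [0.1, 1.0]],
}
scores = {
    name: top1_retrieval_accuracy(vectors, labels)
    for name, vectors in products.items()
}
print(scores)
```

Because the metric only compares vectors within one product, products with different embedding dimensions can be scored on the same samples.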
Datasets
- EuroSAT – [Zenodo]
Land use classification dataset using Sentinel-2 satellite data.
- EuroCrops – [PMC_10495462]
Crop type mapping dataset for Europe.
- National Land Cover Database (NLCD) – [MRLC]
USA land cover classes.
- SSL4EO-S12 – [GitHub]
Multimodal, multitemporal dataset for self-supervised learning.
- Copernicus-Pretrain – [GitHub]
An extension of the SSL4EO-S12 dataset to all major Sentinel missions (S1-S5P).
- BigEarthNet – [BigEarthNet]
Large-scale multi-label satellite image classification dataset.
- RESISC45 – [IEEE DOI]
Remote sensing image classification dataset with 45 categories.
- UC Merced – [UC Merced]
Aerial image dataset for land use classification.
- Potsdam – [ISPRS]
Semantic segmentation dataset for urban areas from aerial imagery.
- Vaihingen – [ISPRS]
Semantic segmentation dataset for urban areas from aerial imagery.
- Inria Aerial Image Labeling – [Inria]
Aerial image segmentation dataset for building footprint extraction.
- NAIP – [USGS EROS]
National Agriculture Imagery Program data for the USA.
- Sentinel-2 – [Copernicus]
Multispectral imagery from the Sentinel-2 mission.
- Landsat – [USGS Landsat]
Long-term archive of medium-resolution satellite imagery.
- OpenStreetMap – [OpenStreetMap]
Collaborative project to create a free editable map of the world.
- GFED (Global Fire Emissions Database) – [Global Fire Data]
Global dataset of biomass burning emissions.
- GBIF – [GBIF]
Species occurrence records from the Global Biodiversity Information Facility.
- Open Buildings – [Microsoft Research]
Global building footprint detection dataset.
- OpenAerialMap – [OpenAerialMap]
Open-source aerial imagery dataset.
- NASA Marine Debris – [NASA Data]
Marine debris detection dataset.
- Major TOM – [GitHub]
Expandable datasets for global EO coverage.
- Google Satellite Embedding – [Google Earth Engine]
Precomputed embeddings derived from Google's global satellite data.
- DOTA – [DOTA Website]
Large-scale dataset for object detection in aerial images.
- Cropland Data Layer – [USDA NASS]
Crop-specific land cover dataset for the USA.
- CropHarvest – [GitHub]
Global crop type classification dataset with Sentinel-1 and Sentinel-2 inputs.
- COWC – [Microsoft Research]
Cars Overhead With Context: dataset for counting cars in aerial images.
- Copernicus-Embed – [GitHub]
Precomputed embeddings derived from Copernicus Sentinel data.
- Copernicus-Bench – [GitHub]
Benchmark dataset for Copernicus data.
- Cloud-Cover-Detection – [GitHub]
Cloud cover detection dataset.
- Clay-Embeddings – [GitHub]
Precomputed embeddings from the Clay model.
- Chesapeake – [GitHub]
Land cover classification dataset for the Chesapeake Bay region.
- ChaBuD – [GitHub]
Change detection dataset for burned area delineation.
- CaBuAr – [GitHub]
California burned areas dataset for burned area delineation.
- BRIGHT – [GitHub]
Multimodal building damage assessment dataset for disaster response.
- BioMassters – [GitHub]
Biomass estimation dataset.
- Benin Cashew Plantations – [GitHub]
Cashew plantation mapping dataset for Benin.
- Aboveground-Woody-Biomass – [GitHub]
Aboveground woody biomass estimation dataset.
Datasets Detail
- CaFFe – [Caffe Website]
Glacier calving front segmentation dataset.
- ADVANCE – [GitHub]
Audiovisual aerial scene recognition dataset.
- National Land Cover Database (NLCD) – [LINK] Photogrammetric Engineering & Remote Sensing (2001)
USA land cover classes.
- SSL4EO-S12 – [LINK] IEEE Geoscience and Remote Sensing (2023)
Multimodal, multitemporal dataset for self-supervised learning.
- Copernicus-Pretrain – [LINK] IEEE Geoscience and Remote Sensing (2023)
An extension of the SSL4EO-S12 dataset to all major Sentinel missions (S1-S5P).
Research Directions
- Unified Earth Foundation Models:
- Interpretability in EO AI: Exploring how these embeddings can be interpreted by domain experts.
- Ethics and Bias: Investigating fairness and bias in global EO models trained on unevenly distributed data.
- Edge Deployment: Making these large foundation models deployable on resource-constrained platforms (e.g., for field use).