Difference between revisions of "TorchGeo DOFA FLORO"

From OSGeo
(init)
 
(rewrite)
Line 53: Line 53:
 
==== Purpose ====
 
==== Purpose ====
 
The dynamic MLP layers allow DOFA to handle varying sensor specifications without requiring multiple fixed architectures. When input data has 2 channels (SAR), 3 channels (RGB), or 202 channels (hyperspectral), the same model architecture can adapt through these dynamic layers rather than needing separate models for each modality.
 
The dynamic MLP layers allow DOFA to handle varying sensor specifications without requiring multiple fixed architectures. When input data has 2 channels (SAR), 3 channels (RGB), or 202 channels (hyperspectral), the same model architecture can adapt through these dynamic layers rather than needing separate models for each modality.
 
  
 
== FLORO Theory and Architecture Analysis ==
 
== FLORO Theory and Architecture Analysis ==
  
 
=== Core Design Principles ===
 
=== Core Design Principles ===
* Diversity-driven representation learning: Prioritizes sensor heterogeneity over corpus scale to enable cross-modal transferability
+
* Unified multimodal input space: Heterogeneous remote sensing inputs (optical, SAR, elevation) are concatenated into fixed-width tensors with validity/availability masks indicating presence or absence of bands/modalities — no structural adaptation per sensor type.
* Frozen-encoder evaluation protocol: Pretrained encoder remains fixed during downstream task adaptation, testing pure representational quality
+
* Frozen-encoder transfer protocol: Pretrained encoder weights remain immutable during downstream task fine-tuning; only lightweight decoders trained for specific tasks (classification, segmentation, regression).
* Masked autoencoding with modality-specific reconstruction: Learns latent structure by predicting masked tokens per input stream using lightweight decoders
+
* Modality-aware masked autoencoding: Random patch masking occurs independently across all input tokens regardless of originating modality; reconstruction supervised by separate shallow decoders per output band group — discarded after pretraining.
  
 
=== Key Technical Components ===
 
=== Key Technical Components ===
1. '''FLOROGeoEncoder''': Main Vision Transformer encoder class for latent representation learning on grouped multimodal inputs
+
1. '''FLOROGeoEncoder''': Shared Vision Transformer encoder processing unified multimodal token sequences augmented with geo-positional embeddings and validity masks.
2. '''FLOROSSDecoder''': Masked autoencoding decoder used only during pretraining phase; reconstructs missing patches from visible context
+
2. '''FLOROSSDecoder''': Lightweight modality-specific reconstruction heads used exclusively during pretraining; one decoder per output band group (e.g., multispectral, elevation) to supervise latent representation learning via self-supervised loss minimization.
3. '''FLOROLinearClassDecoder''': Task-specific linear classifier head trained under frozen-encoder protocol for downstream tasks
+
3. '''NormalizeWithValidity''' transform: Applies normalization using validity channels as weights — scales pixel values proportionally based on which spectral groups or modalities are present in each patch; enables robust handling of incomplete sensor configurations without dropping data entirely.
  
 
=== Key Classes: ===
 
=== Key Classes: ===
1. <code>FLOROGeoEncoder</code> - Main Vision Transformer encoder class for latent representation learning
+
1. <code>FLOROGeoEncoder</code> - Main Vision Transformer encoder class for latent representation learning on grouped multimodal inputs with geo-position augmentation and validity-aware processing.
2. <code>FLOROSSDecoder</code> - Masked autoencoding decoder used only during pretraining phase; reconstructs missing patches from visible context
+
2. <code>FLOROSSDecoder</code> - Shallow decoder module used only during pretraining phase; reconstructs masked patches per output band group under self-supervised objective using cross-modal context from visible tokens.
3. <code>FLOROLinearClassDecoder</code> - Task-specific linear classifier head trained under frozen-encoder protocol for downstream tasks
+
3. <code>NormalizeWithValidity</code> - Data transformation class that applies channel-wise normalization weighted by validity indicators — ensures missing bands do not dominate gradients while preserving signal strength where data exists.
  
 
=== Architectural Features: ===
 
=== Architectural Features: ===
* Single unified ViT encoder: Processes grouped multimodal inputs without modality-specific branches or fusion layers; uses standard self-attention blocks with DropPath and LayerScale regularization
+
* Single unified ViT encoder: Processes heterogeneous inputs without modality-specific branches; uses standard self-attention blocks with DropPath and LayerScale regularization applied uniformly across all token types (optical, auxiliary, validity).
* Grouped channel structure: Optical and auxiliary streams are concatenated into fixed-width input tensors with validity masks indicating presence/absence of bands/modalities; processed via PatchEmbed layer that handles variable spectral groupings
+
* Grouped channel structure + validity masking: Optical streams (Blue/Green/Red/NIR/SWIR/etc.) and auxiliary streams (Elevation/VV/VH) are concatenated into fixed-width tensors; each group has associated binary validity flags that influence normalization but do not remove tokens from sequence.
* Masked token prediction: During pretraining, random patch masking occurs independently per token regardless of its originating modality; model reconstructs them using visible context from same or other modalities through FLOROSSDecoder
+
* Hybrid geo-positional embeddings: Patch tokens augmented with projected Earth coordinates (EPSG:3857 global extent), globally normalized, then encoded via sinusoidal functions — provides spatial grounding independent of spectral content; absent in DOFA’s wavelength-conditioned design.
* Frozen-encoder transfer paradigm: Downstream tasks train only task-specific decoders (e.g., FLOROLinearClassDecoder) atop fixed encoder weights — no fine-tuning of base representation
+
* Modality-aware masking strategy: During pretraining, random patches are masked independently per token regardless of modality origin; FLOROSSDecoder restores them using visible context from same or other modalities through cross-modal attention mechanisms within transformer blocks.
  
 
=== FLORO+ Enhancement ===
 
=== FLORO+ Enhancement ===
* Curated multimodal diversity strategy: Uses ~80K samples spanning Sentinel-2, Sentinel-1, SkySat, UAV RGB/multispectral, and terrain products to maximize sensing condition variation rather than sample count
+
* Curated multimodal diversity over scale: ~80K samples spanning Sentinel-2 (13 bands), Sentinel-1 SAR, SkySat HR imagery, UAV RGB/multispectral, and terrain products — prioritizes variation in sensing conditions rather than sheer volume to maximize transferability.
* Availability-aware processing: Validity channels explicitly signal which spectral groups or modalities are present per patch; implemented via NormalizeWithValidity transform that scales data using validity masks, enabling robust handling of incomplete sensor configurations during both pretraining and transfer
+
* Availability-aware processing pipeline: BandDropping augmentation randomly removes entire spectral groups during training; NormalizeWithValidity ensures remaining valid channels contribute proportionally to gradient updates — simulates real-world sensor incompleteness without architectural modification.
  
 
=== Input Representation ===
 
=== Input Representation ===
FLORO structures input as two primary streams with internal grouping:
+
FLORO structures input as two primary streams with internal grouping and explicit validity signaling:
  
 
==== Optical Stream ====
 
==== Optical Stream ====
Contains reflectance values organized by spectral group plus validity indicators:
+
Contains reflectance values organized by spectral group plus validity indicators processed via NormalizeWithValidity transform:
* Blue/Green/Red — visible spectrum bands
+
* Blue/Green/Red — visible spectrum bands (present in UAV-RGB, HR Sat., S1/S2)
* Red Edge — vegetation stress indicator band
+
* Red Edge — vegetation stress indicator band (UAV-MS, S1/S2 only)
* NIR / NIR A — near-infrared for biomass and health assessment
+
* NIR / NIR A — near-infrared for biomass and health assessment (UAV-MS, HR Sat., S1/S2)
* SWIR 1/SWIR 2 — shortwave infrared for moisture content analysis
+
* SWIR 1/SWIR 2 — shortwave infrared for moisture content analysis (S1/S2 only)
* Validity Channels — binary flags per group indicating whether data is present or missing; processed via NormalizeWithValidity transform
+
* Validity Channels — binary flags per group indicating whether data is present or missing; directly used in NormalizeWithValidity to scale normalization parameters
  
 
==== Auxiliary Stream ====
 
==== Auxiliary Stream ====
Includes non-optical geospatial features:
+
Includes non-optical geospatial features also processed via NormalizeWithValidity:
* Elevation (DSM/DTM/DEM) — topographic structure information
+
* Elevation (DSM/DTM/DEM) — topographic structure information (all modalities except pure RGB UAVs may lack this)
* SAR VV/VH — dual-polarization radar backscatter for surface texture and moisture detection
+
* SAR VV/VH — dual-polarization radar backscatter for surface texture and moisture detection (S1/S2 only)
* Validity Channels — same mechanism as optical stream to denote availability; processed via NormalizeWithValidity transform
+
* Validity Channels — same mechanism as optical stream to denote availability; ensures missing auxiliary data does not distort learned representations
  
 
==== Implementation approach ====
 
==== Implementation approach ====
* Token grouping strategy: Each spectral or auxiliary group forms a contiguous block within the input tensor; positional embeddings are applied uniformly across all groups after PatchEmbed transformation
+
* Token grouping strategy: Each spectral or auxiliary group forms a contiguous block within the input tensor after PatchEmbed transformation; positional embeddings are applied uniformly across all groups including geo-position augmentation derived from projected coordinates.
* Masking scheme: Random patch masking occurs independently per token regardless of its originating modality, forcing cross-modal contextual inference during FLOROSSDecoder reconstruction
+
* Masking scheme: Random patch masking occurs independently per token regardless of its originating modality forcing cross-modal contextual inference during FLOROSSDecoder reconstruction via intermediate tokens optionally returned for downstream tasks.
* Decoder design: Separate lightweight MLP-based decoders map latent representations back to pixel space for each original input group these are removed after pretraining; downstream tasks use linear heads like FLOROLinearClassDecoder
+
* Decoder design: Separate lightweight MLP-based decoders map latent representations back to pixel space for each original input group (e.g., 8-band multispectral output, 3-band elevation/SAR output); these are removed after pretraining — replaced by task-specific heads like FLOROLinearClassDecoder during transfer phase.
  
 
==== Purpose ====
 
==== Purpose ====
The grouped channel architecture allows FLORO to ingest variable sensor combinations without architectural modification. When trained on full Sentinel-2 (13 bands) + SAR + elevation, it can later transfer to datasets containing only RGB or UAV-only imagery by leveraging learned cross-modal priors encoded in the shared latent space — enabled explicitly through validity masking during both pretraining and inference phase via NormalizeWithValidity transform.
+
The grouped channel architecture allows FLORO to ingest variable sensor combinations without architectural modification or dynamic weight generation (unlike DOFA’s wavelength-conditioned hypernetworks). When trained on full Sentinel-2 + SAR + elevation, it can later transfer to datasets containing only RGB or UAV-only imagery by leveraging learned cross-modal priors encoded in the shared latent space — enabled explicitly through validity masking during both pretraining and inference phase via NormalizeWithValidity transform. Geo-position embeddings further ensure spatial coherence across geographically dispersed training samples without requiring geographic metadata at test time if unavailable.

Revision as of 19:00, 31 May 2026


DOFA Theory and Architecture Analysis

Core Design Principles

  • Neuroplasticity-inspired: Based on the brain's capacity for dynamic reorganization in response to novel stimuli
  • Wavelength-conditioned dynamic hypernetwork: Uses wavelength as unifying parameter across EO modalities
  • Unified Transformer framework: Single architecture that handles diverse spectral bands and sensor modalities

Key Technical Components

1. Dynamic Hypernetwork: Generates network weights based on central wavelengths of each spectral band
2. Shared Vision Backbone: Universal feature learning module for all heterogeneous data modalities
3. Wavelength-aware Masked Image Modeling (MIM): Pretraining strategy that interpolates in weight space according to wavelength configurations

Key Classes:

1. MaskedAutoencoderViT - Main encoder class
2. Dynamic_MLP_OFA - Dynamic MLP layer for channel adaptation
3. TransformerWeightGenerator - For neuroplasticity-inspired weight generation

Architectural Features:

  • Single unified ViT: Uses standard Vision Transformer backbone with modifications
  • Dynamic MLP layers: Dynamic_MLP_OFA that adapts based on input channels
  • Wavelength-aware processing: Uses wave_lists for different spectral band handling
  • Neuroplasticity-inspired: Weight generation through transformer-based mechanism
  • Channel-flexible design: Works with 2-202+ channels through dynamic layer adaptation

DOFA+ Enhancement

  • Hierarchical Distillation Strategy: Preserves semantic priors from source model while guiding EO-specific pattern learning
  • Dual Training Strategy:
    • Wavelength-aware MIM for EO-specific spatial patterns
    • Hierarchical feature distillation for refining inherited semantic representations

MLP Layers

In the DOFA code structure, "dynamic MLP layers" refers to a specific architectural component that adapts its parameters based on input characteristics:

Dynamic MLP Layers in DOFA:

  • Dynamic_MLP_OFA - A specialized MLP (Multi-Layer Perceptron) layer that dynamically adjusts its weights and structure
  • Unlike standard fixed MLPs, these layers can modify their internal parameters based on input features

How MLP Layers work:

1. Channel-adaptive processing: The MLP adapts to different input channel counts (2-202+ channels)
2. Wavelength-conditioned: Uses wavelength information to determine the appropriate weight configuration
3. Dynamic weight generation: Instead of fixed weights, the layer generates weights based on input characteristics

Implementation approach

  • TransformerWeightGenerator: A component that dynamically generates network weights based on central wavelengths
  • Hypernetwork concept: The dynamic MLP layer acts as a hypernetwork that produces weights for other layers
  • Spectral band awareness: The layer structure changes to accommodate different spectral configurations
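
The hypernetwork idea described above can be illustrated with a toy sketch. It assumes a sinusoidal wavelength embedding followed by a fixed random linear map; `wavelength_embedding`, `ToyWeightGenerator`, and `embed_pixel` are hypothetical names for illustration and are not the DOFA API (the real model uses TransformerWeightGenerator and Dynamic_MLP_OFA). The wavelength values are illustrative placeholders only.

```python
import math
import random


def wavelength_embedding(wavelength_um, dim=8):
    """Encode a central wavelength (micrometres) as a sinusoidal vector."""
    emb = []
    for i in range(dim // 2):
        freq = 1.0 / (10000 ** (2 * i / dim))
        angle = wavelength_um * 1000 * freq  # work in nanometres (assumption)
        emb.extend([math.sin(angle), math.cos(angle)])
    return emb


class ToyWeightGenerator:
    """Hypothetical stand-in for a hypernetwork: maps each channel's
    wavelength embedding to one row of projection weights."""

    def __init__(self, emb_dim=8, out_dim=4, seed=0):
        rng = random.Random(seed)
        self.emb_dim = emb_dim
        self.proj = [[rng.uniform(-0.1, 0.1) for _ in range(emb_dim)]
                     for _ in range(out_dim)]

    def weights_for(self, wavelengths_um):
        rows = []
        for w in wavelengths_um:
            emb = wavelength_embedding(w, self.emb_dim)
            rows.append([sum(p * e for p, e in zip(row, emb))
                         for row in self.proj])
        return rows  # shape: (num_channels, out_dim)


def embed_pixel(pixel, weights):
    """Project a C-channel pixel to out_dim features with generated weights."""
    out_dim = len(weights[0])
    return [sum(pixel[c] * weights[c][j] for c in range(len(pixel)))
            for j in range(out_dim)]


gen = ToyWeightGenerator()
rgb_weights = gen.weights_for([0.665, 0.56, 0.49])  # 3 optical channels
two_ch_weights = gen.weights_for([3.75, 3.75])      # 2 channels, placeholder values

rgb_features = embed_pixel([0.1, 0.2, 0.3], rgb_weights)
two_ch_features = embed_pixel([0.5, 0.6], two_ch_weights)
```

Both inputs end up with the same feature width even though their channel counts differ, which is the property the dynamic layers are meant to provide.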

Purpose

The dynamic MLP layers allow DOFA to handle varying sensor specifications without requiring multiple fixed architectures. When input data has 2 channels (SAR), 3 channels (RGB), or 202 channels (hyperspectral), the same model architecture can adapt through these dynamic layers rather than needing separate models for each modality.

FLORO Theory and Architecture Analysis

Core Design Principles

  • Unified multimodal input space: Heterogeneous remote sensing inputs (optical, SAR, elevation) are concatenated into fixed-width tensors with validity/availability masks indicating presence or absence of bands/modalities — no structural adaptation per sensor type.
  • Frozen-encoder transfer protocol: Pretrained encoder weights remain frozen during downstream task adaptation; only lightweight decoders are trained for specific tasks (classification, segmentation, regression).
  • Modality-aware masked autoencoding: Random patch masking occurs independently across all input tokens regardless of originating modality; reconstruction supervised by separate shallow decoders per output band group — discarded after pretraining.
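
The frozen-encoder protocol can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions: the "encoder" is a fixed random projection with a tanh, the head is a logistic-regression layer trained by full-batch gradient descent, and all names (`frozen_encoder`, `train_linear_head`, `mean_loss`) are hypothetical rather than FLORO classes.

```python
import math
import random


def frozen_encoder(x, enc_weights):
    """Fixed feature extractor: tanh of a linear projection (never updated)."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x)))
            for row in enc_weights]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def mean_loss(feats, labels, head_w, head_b):
    """Mean binary cross-entropy of the linear head on precomputed features."""
    total = 0.0
    for f, y in zip(feats, labels):
        p = sigmoid(sum(w * fi for w, fi in zip(head_w, f)) + head_b)
        p = min(max(p, 1e-12), 1 - 1e-12)  # guard against log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(feats)


def train_linear_head(feats, labels, lr=0.5, steps=200):
    """Only the head's parameters are updated; encoder features are inputs."""
    head_w, head_b = [0.0] * len(feats[0]), 0.0
    n = len(feats)
    for _ in range(steps):
        grad_w, grad_b = [0.0] * len(head_w), 0.0
        for f, y in zip(feats, labels):
            err = sigmoid(sum(w * fi for w, fi in zip(head_w, f)) + head_b) - y
            grad_w = [g + err * fi for g, fi in zip(grad_w, f)]
            grad_b += err
        head_w = [w - lr * g / n for w, g in zip(head_w, grad_w)]
        head_b -= lr * grad_b / n
    return head_w, head_b


rng = random.Random(0)
enc_weights = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(3)]
enc_snapshot = [row[:] for row in enc_weights]

inputs = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(40)]
labels = [1 if x[0] > 0 else 0 for x in inputs]

# Because the encoder is frozen, features can be computed once and cached.
feats = [frozen_encoder(x, enc_weights) for x in inputs]
loss_before = mean_loss(feats, labels, [0.0] * 3, 0.0)
head_w, head_b = train_linear_head(feats, labels)
loss_after = mean_loss(feats, labels, head_w, head_b)
```

Since no gradient ever reaches `enc_weights`, any improvement in the loss reflects how linearly separable the frozen features already are, which is what this kind of protocol is designed to measure.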

Key Technical Components

1. FLOROGeoEncoder: Shared Vision Transformer encoder processing unified multimodal token sequences augmented with geo-positional embeddings and validity masks.
2. FLOROSSDecoder: Lightweight modality-specific reconstruction heads used exclusively during pretraining; one decoder per output band group (e.g., multispectral, elevation) to supervise latent representation learning via self-supervised loss minimization.
3. NormalizeWithValidity transform: Applies normalization using validity channels as weights, scaling pixel values proportionally based on which spectral groups or modalities are present in each patch; enables robust handling of incomplete sensor configurations without dropping data entirely.

Key Classes:

1. FLOROGeoEncoder - Main Vision Transformer encoder class for latent representation learning on grouped multimodal inputs with geo-position augmentation and validity-aware processing.
2. FLOROSSDecoder - Shallow decoder module used only during pretraining phase; reconstructs masked patches per output band group under self-supervised objective using cross-modal context from visible tokens.
3. NormalizeWithValidity - Data transformation class that applies channel-wise normalization weighted by validity indicators; ensures missing bands do not dominate gradients while preserving signal strength where data exists.
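
One plausible reading of validity-weighted normalization is sketched below. It assumes per-channel standardization that is applied only where the validity flag is 1, with absent channels emitted as exactly 0.0; the function name and signature are hypothetical and the actual NormalizeWithValidity transform may differ.

```python
def normalize_with_validity(values, validity, mean, std):
    """Standardize each channel where its validity flag is 1; emit 0.0 otherwise.

    values, validity, mean, and std are equal-length per-channel lists.
    """
    out = []
    for v, ok, m, s in zip(values, validity, mean, std):
        out.append((v - m) / s if ok else 0.0)
    return out


pixel = [0.30, 0.25, 0.0, 0.0]   # last two channels are absent (zero-filled)
valid = [1, 1, 0, 0]
mean = [0.20, 0.20, 0.10, 0.10]
std = [0.10, 0.05, 0.05, 0.05]

normalized = normalize_with_validity(pixel, valid, mean, std)
# Naive normalization for comparison: absent channels become spurious negatives.
naive = [(v - m) / s for v, m, s in zip(pixel, mean, std)]
```

Without the validity weighting, a zero-filled missing channel would be standardized to a large negative value that looks like real signal; gating by the flag keeps it neutral.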

Architectural Features:

  • Single unified ViT encoder: Processes heterogeneous inputs without modality-specific branches; uses standard self-attention blocks with DropPath and LayerScale regularization applied uniformly across all token types (optical, auxiliary, validity).
  • Grouped channel structure + validity masking: Optical streams (Blue/Green/Red/NIR/SWIR/etc.) and auxiliary streams (Elevation/VV/VH) are concatenated into fixed-width tensors; each group has associated binary validity flags that influence normalization but do not remove tokens from sequence.
  • Hybrid geo-positional embeddings: Patch tokens augmented with projected Earth coordinates (EPSG:3857 global extent), globally normalized, then encoded via sinusoidal functions — provides spatial grounding independent of spectral content; absent in DOFA’s wavelength-conditioned design.
  • Modality-aware masking strategy: During pretraining, random patches are masked independently per token regardless of modality origin; FLOROSSDecoder restores them using visible context from same or other modalities through cross-modal attention mechanisms within transformer blocks.
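
The geo-positional embedding step (project, normalize globally, encode sinusoidally) can be sketched as follows. The Web Mercator formulas and extent are the standard EPSG:3857 definitions; the embedding width, the `scale` factor, and the function names are assumptions made for illustration, not the FLOROGeoEncoder implementation.

```python
import math

EARTH_RADIUS_M = 6378137.0                 # WGS84 semi-major axis used by EPSG:3857
HALF_EXTENT_M = math.pi * EARTH_RADIUS_M   # about 20037508.34 m


def lonlat_to_web_mercator(lon_deg, lat_deg):
    """Project WGS84 degrees to EPSG:3857 metres (valid for |lat| < ~85.05)."""
    x = EARTH_RADIUS_M * math.radians(lon_deg)
    y = EARTH_RADIUS_M * math.log(
        math.tan(math.pi / 4 + math.radians(lat_deg) / 2))
    return x, y


def normalize_xy(x, y):
    """Map projected coordinates to [0, 1] using the global extent."""
    span = 2 * HALF_EXTENT_M
    return (x + HALF_EXTENT_M) / span, (y + HALF_EXTENT_M) / span


def sinusoidal_encoding(value, dim=8, scale=100.0, max_period=10000.0):
    """Sine/cosine pairs at geometrically spaced frequencies.

    `scale` stretches the [0, 1] input so the encoding varies noticeably;
    its value here is an arbitrary illustrative choice.
    """
    enc = []
    for i in range(dim // 2):
        freq = 1.0 / (max_period ** (2 * i / dim))
        angle = value * scale * freq
        enc.extend([math.sin(angle), math.cos(angle)])
    return enc


def geo_embedding(lon_deg, lat_deg, dim=16):
    """Concatenate encodings of normalized x and y (dim/2 values each)."""
    x, y = lonlat_to_web_mercator(lon_deg, lat_deg)
    nx, ny = normalize_xy(x, y)
    return sinusoidal_encoding(nx, dim // 2) + sinusoidal_encoding(ny, dim // 2)


origin_nx, origin_ny = normalize_xy(*lonlat_to_web_mercator(0.0, 0.0))
embedding = geo_embedding(13.4, 52.5)
```

In a real encoder a vector like this would be added to, or concatenated with, each patch token; how FLORO combines them is not specified here.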

FLORO+ Enhancement

  • Curated multimodal diversity over scale: ~80K samples spanning Sentinel-2 (13 bands), Sentinel-1 SAR, SkySat HR imagery, UAV RGB/multispectral, and terrain products — prioritizes variation in sensing conditions rather than sheer volume to maximize transferability.
  • Availability-aware processing pipeline: BandDropping augmentation randomly removes entire spectral groups during training; NormalizeWithValidity ensures remaining valid channels contribute proportionally to gradient updates — simulates real-world sensor incompleteness without architectural modification.
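
A group-level dropping augmentation of the kind described above might look like the sketch below. The drop probability, the rule that at least one present group always survives, and the function name `band_dropping` are assumptions for illustration; the source does not specify how BandDropping is parameterized.

```python
import random


def band_dropping(sample, validity, drop_prob=0.3, rng=None):
    """Randomly drop whole spectral groups and clear their validity flags.

    sample:   dict of group name -> list of band values
    validity: dict of group name -> 0/1 flag
    Returns new (sample, validity) dicts; the inputs are not modified.
    """
    rng = rng or random.Random()
    present = [g for g, ok in validity.items() if ok]
    dropped = [g for g in present if rng.random() < drop_prob]
    # Never drop everything: keep at least one originally present group.
    dropped = dropped[:max(len(present) - 1, 0)]

    new_sample = {g: list(v) for g, v in sample.items()}
    new_validity = dict(validity)
    for g in dropped:
        new_sample[g] = [0.0] * len(sample[g])
        new_validity[g] = 0
    return new_sample, new_validity


sample = {
    "rgb": [0.08, 0.11, 0.09],
    "nir": [0.35, 0.33],
    "swir": [0.21, 0.15],
    "sar": [0.0, 0.0],          # already absent in this sample
}
validity = {"rgb": 1, "nir": 1, "swir": 1, "sar": 0}

# drop_prob=1.0 forces the most aggressive case for demonstration.
aug_sample, aug_validity = band_dropping(
    sample, validity, drop_prob=1.0, rng=random.Random(0))
```

Because dropped groups are zero-filled and flagged rather than removed, the tensor width stays fixed and the model sees the same kind of input it would get from a genuinely incomplete sensor.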

Input Representation

FLORO structures input as two primary streams with internal grouping and explicit validity signaling:

Optical Stream

Contains reflectance values organized by spectral group plus validity indicators processed via NormalizeWithValidity transform:

  • Blue/Green/Red — visible spectrum bands (present in UAV-RGB, HR Sat., S1/S2)
  • Red Edge — vegetation stress indicator band (UAV-MS, S1/S2 only)
  • NIR / NIR A — near-infrared for biomass and health assessment (UAV-MS, HR Sat., S1/S2)
  • SWIR 1/SWIR 2 — shortwave infrared for moisture content analysis (S1/S2 only)
  • Validity Channels — binary flags per group indicating whether data is present or missing; directly used in NormalizeWithValidity to scale normalization parameters

Auxiliary Stream

Includes non-optical geospatial features also processed via NormalizeWithValidity:

  • Elevation (DSM/DTM/DEM) — topographic structure information (all modalities except pure RGB UAVs may lack this)
  • SAR VV/VH — dual-polarization radar backscatter for surface texture and moisture detection (S1/S2 only)
  • Validity Channels — same mechanism as optical stream to denote availability; ensures missing auxiliary data does not distort learned representations
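
The two-stream layout can be made concrete with a small sketch that zero-fills absent groups and records one validity flag per group. The group order, the band counts per group (inferred from the group names listed above), and the helper name `build_stream` are assumptions; the actual tensor layout used by FLORO is not documented here.

```python
# Hypothetical fixed layouts: (group name, number of bands).
OPTICAL_LAYOUT = [("blue_green_red", 3), ("red_edge", 1), ("nir", 2), ("swir", 2)]
AUX_LAYOUT = [("elevation", 1), ("sar_vv_vh", 2)]


def build_stream(available, layout):
    """Concatenate groups in a fixed order; absent groups are zero-filled.

    available: dict of group name -> list of band values (present groups only)
    Returns (values, validity): fixed-width band values plus one 0/1 flag
    per group.
    """
    values, validity = [], []
    for name, width in layout:
        bands = available.get(name)
        if bands is None:
            values.extend([0.0] * width)
            validity.append(0)
        else:
            if len(bands) != width:
                raise ValueError(f"{name} expects {width} bands, got {len(bands)}")
            values.extend(bands)
            validity.append(1)
    return values, validity


# An RGB-only UAV sample: no red edge, NIR, SWIR, elevation, or SAR.
uav_rgb = {"blue_green_red": [0.08, 0.11, 0.09]}
optical, optical_valid = build_stream(uav_rgb, OPTICAL_LAYOUT)
aux, aux_valid = build_stream(uav_rgb, AUX_LAYOUT)
```

Every sample, whatever its sensor, yields vectors of the same width, so the same encoder input layer can be reused without structural changes.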

Implementation approach

  • Token grouping strategy: Each spectral or auxiliary group forms a contiguous block within the input tensor after PatchEmbed transformation; positional embeddings are applied uniformly across all groups including geo-position augmentation derived from projected coordinates.
  • Masking scheme: Random patch masking is applied independently per token regardless of its originating modality, forcing cross-modal contextual inference during FLOROSSDecoder reconstruction; intermediate tokens can optionally be returned for downstream tasks.
  • Decoder design: Separate lightweight MLP-based decoders map latent representations back to pixel space for each original input group (e.g., 8-band multispectral output, 3-band elevation/SAR output); these are removed after pretraining — replaced by task-specific heads like FLOROLinearClassDecoder during transfer phase.
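
The per-token random masking step can be sketched independently of any model code. The 75% mask ratio is borrowed from common masked-autoencoder practice and is an assumption here, as is the function name; the source does not state the ratio FLORO uses.

```python
import random


def random_token_mask(num_tokens, mask_ratio=0.75, rng=None):
    """Split token indices into visible and masked sets uniformly at random.

    Masking ignores which modality a token came from, so a visible token
    from one group may have to explain a masked token from another.
    """
    rng = rng or random.Random()
    num_keep = int(num_tokens * (1 - mask_ratio))
    order = list(range(num_tokens))
    rng.shuffle(order)
    visible = sorted(order[:num_keep])
    masked = sorted(order[num_keep:])
    return visible, masked


# 16 patch tokens; the encoder would see only `visible`, and the
# pretraining decoders would be asked to reconstruct `masked`.
visible, masked = random_token_mask(16, mask_ratio=0.75, rng=random.Random(0))
```

Only the visible indices are passed through the encoder; the reconstruction loss is then typically computed on the masked positions alone.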

Purpose

The grouped channel architecture allows FLORO to ingest variable sensor combinations without architectural modification or dynamic weight generation (unlike DOFA’s wavelength-conditioned hypernetworks). When trained on full Sentinel-2 + SAR + elevation, it can later transfer to datasets containing only RGB or UAV-only imagery by leveraging learned cross-modal priors encoded in the shared latent space. This is enabled explicitly by validity masking, applied through the NormalizeWithValidity transform during both pretraining and inference. Geo-position embeddings further ensure spatial coherence across geographically dispersed training samples; geographic metadata is not required at test time if it is unavailable.