TorchGeo DOFA FLORO

From OSGeo


DOFA Theory and Architecture Analysis

Core Design Principles

  • Neuroplasticity-inspired: Based on the brain's capacity to reorganize dynamically in response to novel stimuli
  • Wavelength-conditioned dynamic hypernetwork: Uses wavelength as the unifying parameter across Earth observation (EO) modalities
  • Unified Transformer framework: A single architecture handles diverse spectral bands and sensor modalities

Key Technical Components

1. Dynamic Hypernetwork: Generates network weights based on the central wavelength of each spectral band
2. Shared Vision Backbone: Universal feature learning module for all heterogeneous data modalities
3. Wavelength-aware Masked Image Modeling (MIM): Pretraining strategy that interpolates in weight space according to wavelength configurations

Key Classes:

1. MaskedAutoencoderViT - Main encoder class
2. Dynamic_MLP_OFA - Dynamic MLP layer for channel adaptation
3. TransformerWeightGenerator - For neuroplasticity-inspired weight generation

Architectural Features:

  • Single unified ViT: Uses a standard Vision Transformer backbone with modifications
  • Dynamic MLP layers: Dynamic_MLP_OFA adapts to the number of input channels
  • Wavelength-aware processing: Uses wave_lists to handle different spectral bands
  • Neuroplasticity-inspired: Weights are generated through a transformer-based mechanism
  • Channel-flexible design: Works with 2 to 202 or more channels through dynamic layer adaptation

DOFA+ Enhancement

  • Hierarchical Distillation Strategy: Preserves semantic priors from source model while guiding EO-specific pattern learning
  • Dual Training Strategy:
    • Wavelength-aware MIM for EO-specific spatial patterns
    • Hierarchical feature distillation for refining inherited semantic representations
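
The two training signals can be pictured as a weighted sum of a masked-reconstruction term and a feature-distillation term. The following is a minimal sketch only: the source does not state the loss forms, so mean squared error for both terms, the function names, and the distill_weight parameter are all assumptions made for illustration.

```python
import numpy as np


def masked_reconstruction_loss(pred, target, mask):
    """MSE averaged over masked patches only (mask == 1 marks a masked patch)."""
    per_patch = ((pred - target) ** 2).mean(axis=-1)      # (num_patches,)
    return float((per_patch * mask).sum() / max(mask.sum(), 1))


def hierarchical_distillation_loss(student_feats, teacher_feats):
    """Average MSE between student and teacher features taken at several depths."""
    per_level = [((s - t) ** 2).mean() for s, t in zip(student_feats, teacher_feats)]
    return float(np.mean(per_level))


def dual_objective(pred, target, mask, student_feats, teacher_feats, distill_weight=1.0):
    """Hypothetical combination: MIM term plus weighted distillation term."""
    mim = masked_reconstruction_loss(pred, target, mask)
    distill = hierarchical_distillation_loss(student_feats, teacher_feats)
    return mim + distill_weight * distill
```

In this sketch, distill_weight controls how strongly the student stays close to the source model's features relative to learning EO-specific patterns from reconstruction.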

MLP Layers

In the DOFA code structure, the term dynamic MLP layers refers to a specific architectural component that adapts its parameters based on input characteristics:

Dynamic MLP Layers in DOFA:

  • Dynamic_MLP_OFA - A specialized MLP (Multi-Layer Perceptron) layer that dynamically adjusts its weights and structure
  • Unlike standard fixed MLPs, these layers can modify their internal parameters based on input features

How MLP Layers work:

1. Channel-adaptive processing: The MLP adapts to different input channel counts (2-202+ channels)
2. Wavelength-conditioned: Uses wavelength information to determine the appropriate weight configuration
3. Dynamic weight generation: Instead of fixed weights, the layer generates weights based on input characteristics

Implementation approach

  • TransformerWeightGenerator: A component that dynamically generates network weights based on central wavelengths
  • Hypernetwork concept: The dynamic MLP layer acts as a hypernetwork that produces weights for other layers
  • Spectral band awareness: The layer structure changes to accommodate different spectral configurations
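
The hypernetwork idea above can be sketched in a few lines of NumPy. This is an illustrative simplification, not the DOFA implementation: a single linear map stands in for TransformerWeightGenerator, and the class name DynamicPatchEmbed, the sinusoidal encoding details, the micrometre-to-nanometre scaling, and all wavelength values (including the SAR placeholder) are assumptions.

```python
import numpy as np


def wavelength_embedding(wavelengths_um, dim=32):
    """Sinusoidal encoding of central wavelengths, one row per band."""
    wv = np.asarray(wavelengths_um, dtype=np.float64)[:, None] * 1000.0  # assumed um -> nm scaling
    freqs = 1.0 / (10000.0 ** (np.arange(dim // 2) / (dim // 2)))
    angles = wv * freqs                                                  # (C, dim / 2)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)      # (C, dim)


class DynamicPatchEmbed:
    """Patch projection whose kernel is generated from the band wavelengths."""

    def __init__(self, patch=4, embed_dim=16, wave_dim=32, seed=0):
        rng = np.random.default_rng(seed)
        self.patch, self.embed_dim, self.wave_dim = patch, embed_dim, wave_dim
        # Hypernetwork parameters: wavelength embedding -> flattened per-band kernel.
        self.w_gen = rng.normal(0.0, 0.02, size=(wave_dim, patch * patch * embed_dim))

    def __call__(self, image, wavelengths_um):
        c, h, w = image.shape
        p = self.patch
        if len(wavelengths_um) != c:
            raise ValueError("expected one central wavelength per channel")
        # One generated kernel per band, stacked into a (C * p * p, D) projection.
        kernels = wavelength_embedding(wavelengths_um, self.wave_dim) @ self.w_gen
        kernels = kernels.reshape(c * p * p, self.embed_dim)
        # Cut the image into non-overlapping patches: (N, C * p * p).
        patches = (image.reshape(c, h // p, p, w // p, p)
                        .transpose(1, 3, 0, 2, 4)
                        .reshape(-1, c * p * p))
        return patches @ kernels                                         # (N, D)


embed = DynamicPatchEmbed()
rng = np.random.default_rng(1)
sar_tokens = embed(rng.normal(size=(2, 16, 16)), [3.75, 3.75])           # placeholder SAR values
rgb_tokens = embed(rng.normal(size=(3, 16, 16)), [0.665, 0.56, 0.49])
hsi_tokens = embed(rng.normal(size=(202, 16, 16)), np.linspace(0.4, 2.5, 202))
```

All three calls share the same parameters (w_gen) and return tokens of the same width, which is the property that lets a single backbone consume 2, 3, or 202 channels.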

Purpose

The dynamic MLP layers allow DOFA to handle varying sensor specifications without requiring multiple fixed architectures. When input data has 2 channels (SAR), 3 channels (RGB), or 202 channels (hyperspectral), the same model architecture can adapt through these dynamic layers rather than needing separate models for each modality.

FLORO Theory and Architecture Analysis

Core Design Principles

  • Unified multimodal input space: Heterogeneous remote sensing inputs (optical, SAR, elevation) are concatenated into fixed-width tensors with validity/availability masks indicating presence or absence of bands/modalities — no structural adaptation per sensor type. Validity flags influence normalization weight scaling but do not remove tokens from the sequence.
  • Frozen-encoder transfer protocol: Pretrained encoder weights remain immutable during downstream task fine-tuning; only lightweight decoders are trained for specific tasks (classification, segmentation, regression). Intermediate transformer tokens may optionally be returned to support multi-scale or hierarchical decoding.
  • Modality-aware masked autoencoding: Random patch masking is applied independently across all input tokens regardless of originating modality; reconstruction is supervised by separate shallow decoders, one per output band group, which are discarded after pretraining.
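
The frozen-encoder protocol can be illustrated with a toy example in which the encoder weights are read-only and only a linear decoder is fitted. This is a sketch under stated assumptions: the random projection stands in for a pretrained ViT encoder, and the names ENCODER_W, encode, and fit_linear_decoder are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for pretrained encoder weights (hypothetical; the real encoder is a ViT).
ENCODER_W = rng.normal(size=(8, 4))
ENCODER_W.setflags(write=False)   # frozen: in-place updates now raise an error


def encode(x):
    """Frozen feature extractor: inputs (N, 8) -> features (N, 4)."""
    return np.tanh(x @ ENCODER_W)


def fit_linear_decoder(features, targets, lr=0.1, steps=300):
    """Train only the decoder weights with plain gradient descent on MSE."""
    w = np.zeros((features.shape[1], targets.shape[1]))
    for _ in range(steps):
        residual = features @ w - targets
        w -= lr * features.T @ residual / len(features)
    return w


x = rng.normal(size=(64, 8))
features = encode(x)
targets = features @ rng.normal(size=(4, 2))   # synthetic regression targets
decoder_w = fit_linear_decoder(features, targets)
```

Because the features never change, they can be computed once and cached, which is a large part of what makes this transfer setup lightweight.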

Key Technical Components

1. FLOROGeoEncoder: Shared Vision Transformer encoder processing unified multimodal token sequences augmented with geo-positional embeddings (when available) and validity masks. Processes heterogeneous inputs without modality-specific branches; uses standard self-attention blocks with DropPath and LayerScale regularization applied uniformly across all token types.
2. FLOROSSDecoder: Lightweight modality-specific reconstruction heads used exclusively during pretraining; one decoder per output band group (e.g., multispectral, elevation/SAR) supervises latent representation learning through a self-supervised loss over masked patches restored from visible context.
3. NormalizeWithValidity transform: Applies channel-wise normalization using validity channels as weights. Pixel values are scaled according to which spectral groups or modalities are present in each patch, so missing bands do not dominate gradients while signal strength is preserved where data exists.

Key Classes:

1. FLOROGeoEncoder - Main Vision Transformer encoder class for latent representation learning on grouped multimodal inputs with geo-position augmentation and validity-aware processing.
2. FLOROSSDecoder - Shallow decoder module used only during the pretraining phase; reconstructs masked patches per output band group under a self-supervised objective, using cross-modal context from visible tokens via transformer blocks.
3. NormalizeWithValidity - Data transformation class that applies channel-wise normalization weighted by validity indicators, so missing bands do not distort learned representations while the full token sequence structure is retained.
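
One plausible reading of validity-weighted normalization is to standardize each channel and then multiply by its validity mask, so missing channels land exactly on zero. The sketch below assumes that reading; the function name normalize_with_validity, its signature, and the per-pixel validity layout are hypothetical and not taken from the FLORO code.

```python
import numpy as np


def normalize_with_validity(x, validity, mean, std, eps=1e-6):
    """Standardize channels of x (C, H, W), then zero out invalid entries.

    validity has the same shape as x, with 1 where data is present and 0 where
    it is missing. mean and std are per-channel statistics of length C.
    """
    mean = np.asarray(mean, dtype=np.float64)[:, None, None]
    std = np.asarray(std, dtype=np.float64)[:, None, None]
    normalized = (x - mean) / (std + eps)
    # Missing channels become 0, the post-normalization mean, so they add no
    # spurious signal while the tensor keeps its full channel width.
    return normalized * validity


x = np.array([[[10.0, 20.0]], [[0.0, 0.0]]])          # channel 1 is a missing band
validity = np.array([[[1.0, 1.0]], [[0.0, 0.0]]])
out = normalize_with_validity(x, validity, mean=[10.0, 5.0], std=[10.0, 1.0])
```

Without the mask, the zero-filled missing band would normalize to -5 everywhere and look like a strong, spatially uniform signal.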

Architectural Features:

  • Single unified ViT encoder: Processes heterogeneous inputs without modality-specific branches; uses standard self-attention blocks with DropPath and LayerScale regularization applied uniformly across all token types (optical, auxiliary, validity). Token sequences include positional embeddings augmented by projected geographic coordinates when metadata is available.
  • Grouped channel structure + validity masking: Optical streams (Blue/Green/Red/NIR/SWIR/etc.) and auxiliary streams (Elevation/VV/VH) are concatenated into fixed-width tensors; each group has associated binary validity flags that influence normalization but do not remove tokens from sequence — enabling robust handling of incomplete sensor configurations.
  • Hybrid geo-positional embeddings: Patch tokens augmented with projected Earth coordinates (EPSG:3857 global extent), globally normalized, then encoded via sinusoidal functions — provides spatial grounding independent of spectral content; applied conditionally when geospatial metadata is available during pretraining or inference.
  • Modality-aware masking strategy: During pretraining, random patches are masked independently per token regardless of modality origin; FLOROSSDecoder restores them using visible context from same or other modalities through cross-modal attention mechanisms within transformer blocks — intermediate tokens optionally returned for downstream task adaptation.
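
The geo-positional step (project, normalize globally, encode sinusoidally) can be sketched with the standard library alone. The frequency schedule, embedding width, and function name below are assumptions; only the EPSG:3857 half-extent constant for the conventional square Web Mercator extent is a fixed value.

```python
import math

# Half-width of the conventional square Web Mercator (EPSG:3857) extent, in metres.
WEB_MERCATOR_HALF_EXTENT = 20037508.342789244


def geo_position_embedding(x_m, y_m, dim=16):
    """Encode projected coordinates as sin/cos features of length dim."""
    if dim % 4 != 0:
        raise ValueError("dim must be divisible by 4 (sin and cos for x and y)")
    embedding = []
    for coord in (x_m, y_m):
        # Global normalization to [-1, 1], clamped for points outside the extent.
        norm = max(-1.0, min(1.0, coord / WEB_MERCATOR_HALF_EXTENT))
        for k in range(dim // 4):
            angle = math.pi * norm * (2 ** k)   # assumed geometric frequency schedule
            embedding.extend([math.sin(angle), math.cos(angle)])
    return embedding


origin = geo_position_embedding(0.0, 0.0)
somewhere = geo_position_embedding(1_000_000.0, 6_000_000.0)
```

A vector like this would typically be projected to the token width and added to the patch positional embeddings; when no geospatial metadata is available, the term is simply omitted.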

FLORO+ Enhancement

  • Curated multimodal diversity over scale: ~80K samples spanning Sentinel-2 (13 bands), Sentinel-1 SAR, SkySat HR imagery, UAV RGB/multispectral, and terrain products — prioritizes variation in sensing conditions rather than sheer volume to maximize transferability across ecological domains.
  • Availability-aware processing pipeline: BandDropping augmentation randomly removes entire spectral groups during training; NormalizeWithValidity ensures remaining valid channels contribute proportionally to gradient updates — simulates real-world sensor incompleteness without architectural modification or dynamic weight generation.
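
A band-dropping augmentation of the kind described above might look like the following sketch. The function name band_dropping, the dictionary-based group layout, the drop probability, and the rule of always keeping at least one group are assumptions for illustration, not the FLORO implementation.

```python
import numpy as np


def band_dropping(x, validity, groups, drop_prob=0.3, rng=None):
    """Randomly zero whole channel groups and clear their validity flags.

    x: array (C, H, W). groups: dict of group name -> list of channel indices.
    validity: dict of group name -> 0 or 1. Returns new copies; inputs are untouched.
    """
    if rng is None:
        rng = np.random.default_rng()
    x = x.copy()
    validity = dict(validity)
    present = [name for name in groups if validity[name]]
    dropped = [name for name in present if rng.random() < drop_prob]
    if dropped and len(dropped) == len(present):
        # Never drop everything: put one randomly chosen group back.
        dropped.pop(int(rng.integers(len(dropped))))
    for name in dropped:
        x[groups[name]] = 0.0
        validity[name] = 0
    return x, validity


groups = {"rgb": [0, 1, 2], "sar": [3, 4]}
x = np.ones((5, 2, 2))
aug_x, aug_validity = band_dropping(
    x, {"rgb": 1, "sar": 1}, groups, drop_prob=1.0, rng=np.random.default_rng(0)
)
```

Updating the validity flags together with the data is what lets a validity-weighted normalization treat an augmented sample exactly like a sample from a sensor that never had those bands.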

Input Representation

FLORO structures input as two primary streams with internal grouping and explicit validity signaling:

Optical Stream

Contains reflectance values organized by spectral group plus validity indicators processed via NormalizeWithValidity transform:

  • Blue/Green/Red — visible spectrum bands (present in UAV-RGB, HR Sat., S1/S2)
  • Red Edge — vegetation stress indicator band (UAV-MS, S1/S2 only)
  • NIR / NIR A — near-infrared for biomass and health assessment (UAV-MS, HR Sat., S1/S2)
  • SWIR 1/SWIR 2 — shortwave infrared for moisture content analysis (S1/S2 only)
  • Validity Channels — binary flags per group indicating whether data is present or missing; directly used in NormalizeWithValidity to scale normalization parameters without removing tokens

Auxiliary Stream

Includes non-optical geospatial features also processed via NormalizeWithValidity:

  • Elevation (DSM/DTM/DEM): topographic structure information (available for most modalities, although pure RGB UAV captures may lack it)
  • SAR VV/VH — dual-polarization radar backscatter for surface texture and moisture detection (S1/S2 only)
  • Validity Channels — same mechanism as optical stream to denote availability; ensures missing auxiliary data does not distort learned representations
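
The two streams can be pictured as one fixed-width tensor in which absent groups are zero-filled and flagged. The sketch below assumes a channel layout inferred from the band lists above (8 optical bands, 3 auxiliary bands, one validity channel per group); the group names, ordering, and the function assemble_input are hypothetical.

```python
import numpy as np

# Assumed layout: group name -> number of data channels, in a fixed order.
GROUP_SIZES = {"rgb": 3, "red_edge": 1, "nir": 2, "swir": 2, "elevation": 1, "sar": 2}


def assemble_input(available, height, width):
    """Stack all groups into one (C, H, W) tensor followed by validity channels."""
    data, validity = [], []
    for name, size in GROUP_SIZES.items():
        if name in available:
            arr = np.asarray(available[name], dtype=np.float32)
            if arr.shape != (size, height, width):
                raise ValueError(f"{name}: expected shape {(size, height, width)}")
            data.append(arr)
            validity.append(np.ones((1, height, width), dtype=np.float32))
        else:
            # Missing group: zero-fill the data and flag it as invalid.
            data.append(np.zeros((size, height, width), dtype=np.float32))
            validity.append(np.zeros((1, height, width), dtype=np.float32))
    return np.concatenate(data + validity, axis=0)


rgb_only = assemble_input({"rgb": np.random.default_rng(0).random((3, 4, 4))}, 4, 4)
```

With this layout an RGB-only UAV tile and a full Sentinel-2 + SAR + elevation tile have the same shape, so the encoder never needs to know which sensor produced the sample.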

Implementation approach

  • Token grouping strategy: Each spectral or auxiliary group forms a contiguous block within the input tensor after PatchEmbed transformation; positional embeddings are applied uniformly across all groups including geo-position augmentation derived from projected coordinates (when available).
  • Masking scheme: Random patch masking is applied independently per token regardless of its originating modality, which forces cross-modal contextual inference during FLOROSSDecoder reconstruction. Intermediate tokens can optionally be returned for downstream tasks.
  • Decoder design: Separate lightweight MLP-based decoders map latent representations back to pixel space for each original input group (e.g., 8-band multispectral output, 3-band elevation/SAR output); these are removed after pretraining — replaced by task-specific heads like FLOROLinearClassDecoder during transfer phase.
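
The masking scheme above amounts to a modality-agnostic random selection of tokens, as in masked autoencoders. A minimal sketch follows; the mask ratio, the function name random_masking, and the return convention are assumptions rather than values taken from FLORO.

```python
import numpy as np


def random_masking(tokens, mask_ratio=0.75, rng=None):
    """Keep a random subset of tokens, ignoring which modality each came from.

    tokens: array (N, D). Returns the visible tokens, a boolean mask of
    length N (True = masked, to be reconstructed), and the kept indices.
    """
    if rng is None:
        rng = np.random.default_rng()
    n = tokens.shape[0]
    n_keep = max(1, int(round(n * (1.0 - mask_ratio))))
    keep_idx = np.sort(rng.permutation(n)[:n_keep])
    mask = np.ones(n, dtype=bool)
    mask[keep_idx] = False
    return tokens[keep_idx], mask, keep_idx


tokens = np.arange(16 * 4, dtype=np.float64).reshape(16, 4)
visible, mask, keep_idx = random_masking(tokens, rng=np.random.default_rng(0))
```

The encoder would see only visible, while each per-group decoder is scored on the positions where mask is True, so a masked optical patch may have to be inferred from a visible SAR or elevation patch at the same location.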

Purpose

The grouped channel architecture allows FLORO to ingest variable sensor combinations without architectural modification or dynamic weight generation (unlike DOFA's wavelength-conditioned hypernetworks). After training on full Sentinel-2 + SAR + elevation inputs, the model can transfer to datasets containing only RGB or UAV imagery by leveraging the cross-modal priors encoded in the shared latent space. This is enabled by validity masking, applied through the NormalizeWithValidity transform during both pretraining and inference. Geo-position embeddings add spatial coherence across geographically dispersed training samples, and geographic metadata is not required at test time.