TorchGeo DOFA FLORO
DOFA Theory and Architecture Analysis
Core Design Principles
- Neuroplasticity-inspired: Based on brain's dynamic reorganization capacity in response to novel stimuli
- Wavelength-conditioned dynamic hypernetwork: Uses wavelength as unifying parameter across EO modalities
- Unified Transformer framework: Single architecture that handles diverse spectral bands and sensor modalities
Key Technical Components
1. Dynamic Hypernetwork: Generates network weights based on central wavelengths of each spectral band
2. Shared Vision Backbone: Universal feature learning module for all heterogeneous data modalities
3. Wavelength-aware Masked Image Modeling (MIM): Pretraining strategy that interpolates in weight space according to wavelength configurations
Key Classes:
1. MaskedAutoencoderViT - Main encoder class
2. Dynamic_MLP_OFA - Dynamic MLP layer for channel adaptation
3. TransformerWeightGenerator - For neuroplasticity-inspired weight generation
Architectural Features:
- Single unified ViT: Uses standard Vision Transformer backbone with modifications
- Dynamic MLP layers: Dynamic_MLP_OFA that adapts based on input channels
- Wavelength-aware processing: Uses wave_lists for different spectral band handling
- Neuroplasticity-inspired: Weight generation through transformer-based mechanism
- Channel-flexible design: Works with 2-202+ channels through dynamic layer adaptation
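To make the wavelength-aware processing concrete, here is a minimal sketch of how a central wavelength could be turned into a fixed-size vector before any weights are generated from it. The sin-cos form, the `scale` factor, and the example wavelengths are assumptions chosen for illustration, not values taken from the DOFA code.

```python
import math

def wavelength_embedding(wavelength_um, dim=8, scale=1000.0):
    """Sin-cos embedding of one central wavelength given in micrometres.

    Hypothetical helper: dim and scale are illustrative choices.
    """
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    half = dim // 2
    position = wavelength_um * scale
    # Geometric frequency ladder, as in standard sinusoidal position encodings.
    freqs = [1.0 / (10000.0 ** (i / half)) for i in range(half)]
    return ([math.sin(position * f) for f in freqs]
            + [math.cos(position * f) for f in freqs])

# Illustrative wave list for an RGB sensor (approximate values, in micrometres).
rgb_wave_list = [0.665, 0.560, 0.490]
embeddings = [wavelength_embedding(w) for w in rgb_wave_list]
```

Each band gets its own vector regardless of how many bands the sensor has, which is what lets a single downstream generator serve every channel count.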
DOFA+ Enhancement
- Hierarchical Distillation Strategy: Preserves semantic priors from source model while guiding EO-specific pattern learning
- Dual Training Strategy:
- Wavelength-aware MIM for EO-specific spatial patterns
- Hierarchical feature distillation for refining inherited semantic representations
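The dual training strategy can be read as a weighted sum of two losses. The sketch below assumes mean squared error for both terms and a single hypothetical `distill_weight`; the actual loss functions and weighting used by DOFA+ are not given in these notes.

```python
def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def masked_mim_loss(pred_patches, target_patches, mask):
    """Reconstruction error averaged over masked patches only (mask[i] == 1 means hidden)."""
    errors = [mse(p, t) for p, t, m in zip(pred_patches, target_patches, mask) if m]
    return sum(errors) / len(errors) if errors else 0.0

def distillation_loss(student_feats, teacher_feats):
    """Feature-matching error averaged over the selected layers of the hierarchy."""
    per_layer = [mse(s, t) for s, t in zip(student_feats, teacher_feats)]
    return sum(per_layer) / len(per_layer)

def dual_loss(pred, target, mask, student_feats, teacher_feats, distill_weight=1.0):
    """Total objective: masked reconstruction plus weighted feature distillation."""
    return (masked_mim_loss(pred, target, mask)
            + distill_weight * distillation_loss(student_feats, teacher_feats))

# Toy example: two patches (the first one masked) and one distilled layer.
total = dual_loss(
    pred=[[1.0, 1.0], [0.0, 0.0]],
    target=[[0.0, 0.0], [0.0, 0.0]],
    mask=[1, 0],
    student_feats=[[1.0, 2.0]],
    teacher_feats=[[1.0, 0.0]],
    distill_weight=0.5,
)
```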
MLP Layers
Looking at the DOFA code structure, dynamic MLP layers refers to a specific architectural component that adapts its parameters based on input characteristics:
Dynamic MLP Layers in DOFA:
- Dynamic_MLP_OFA: A specialized MLP (Multi-Layer Perceptron) layer that dynamically adjusts its weights and structure
- Unlike standard fixed MLPs, these layers can modify their internal parameters based on input features
How MLP Layers work:
1. Channel-adaptive processing: The MLP adapts to different input channel counts (2-202+ channels)
2. Wavelength-conditioned: Uses wavelength information to determine the appropriate weight configuration
3. Dynamic weight generation: Instead of fixed weights, the layer generates weights based on input characteristics
Implementation approach
- TransformerWeightGenerator: A component that dynamically generates network weights based on central wavelengths
- Hypernetwork concept: The dynamic MLP layer acts as a hypernetwork that produces weights for other layers
- Spectral band awareness: The layer structure changes to accommodate different spectral configurations
Purpose
The dynamic MLP layers allow DOFA to handle varying sensor specifications without requiring multiple fixed architectures. When input data has 2 channels (SAR), 3 channels (RGB), or 202 channels (hyperspectral), the same model architecture can adapt through these dynamic layers rather than needing separate models for each modality.
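The following toy sketch illustrates the idea of one generator serving any channel count. It deliberately swaps the transformer-based generator for a single random linear map, and every name in it (`LinearWeightGenerator`, `embed_patch`, the wavelengths) is hypothetical; it shows the mechanism, not the DOFA implementation.

```python
import math
import random

def wave_embed(wavelength_um, dim=4):
    """Small sin-cos embedding of a central wavelength (illustrative)."""
    half = dim // 2
    freqs = [1.0 / (10000.0 ** (i / half)) for i in range(half)]
    pos = wavelength_um * 1000.0
    return [math.sin(pos * f) for f in freqs] + [math.cos(pos * f) for f in freqs]

class LinearWeightGenerator:
    """Toy hypernetwork: wavelength embedding -> one channel's patch-embedding kernel."""

    def __init__(self, wave_dim=4, patch_pixels=4, embed_dim=3, seed=0):
        rng = random.Random(seed)
        self.wave_dim = wave_dim
        self.patch_pixels = patch_pixels
        self.embed_dim = embed_dim
        self.proj = [[rng.uniform(-0.1, 0.1) for _ in range(patch_pixels * embed_dim)]
                     for _ in range(wave_dim)]

    def kernel_for(self, wavelength_um):
        e = wave_embed(wavelength_um, self.wave_dim)
        flat = [sum(e[k] * self.proj[k][j] for k in range(self.wave_dim))
                for j in range(self.patch_pixels * self.embed_dim)]
        # Reshape to [patch_pixels][embed_dim].
        return [flat[p * self.embed_dim:(p + 1) * self.embed_dim]
                for p in range(self.patch_pixels)]

def embed_patch(patch, wavelengths, gen):
    """patch: one list of pixel values per channel. Returns a fixed-size token."""
    assert len(patch) == len(wavelengths)
    token = [0.0] * gen.embed_dim
    for channel, w in zip(patch, wavelengths):
        kernel = gen.kernel_for(w)  # weights generated on the fly for this band
        for p, value in enumerate(channel):
            for d in range(gen.embed_dim):
                token[d] += value * kernel[p][d]
    return token

gen = LinearWeightGenerator()
rgb_token = embed_patch([[0.2] * 4] * 3, [0.665, 0.560, 0.490], gen)
hyper_waves = [0.4 + i * (2.5 - 0.4) / 201 for i in range(202)]
hyper_token = embed_patch([[0.2] * 4] * 202, hyper_waves, gen)
```

Both inputs yield a token of the same width, so the shared backbone that follows never sees the channel count.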
FLORO Theory and Architecture Analysis
Core Design Principles
- Unified multimodal input space: Heterogeneous remote sensing inputs (optical, SAR, elevation) are concatenated into fixed-width tensors with validity/availability masks indicating presence or absence of bands/modalities — no structural adaptation per sensor type. Validity flags influence normalization weight scaling but do not remove tokens from the sequence.
- Frozen-encoder transfer protocol: Pretrained encoder weights remain immutable during downstream task fine-tuning; only lightweight decoders are trained for specific tasks (classification, segmentation, regression). Intermediate transformer tokens may optionally be returned to support multi-scale or hierarchical decoding.
- Modality-aware masked autoencoding: Random patch masking occurs independently across all input tokens regardless of originating modality; reconstruction supervised by separate shallow decoders per output band group — discarded after pretraining.
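As an illustration of masking that ignores modality, the sketch below shuffles all token indices together and hides a fixed fraction. The 0.75 ratio and the 16 + 16 token layout are assumptions for the example; the notes do not state FLORO's actual mask ratio or token layout.

```python
import random

def random_token_mask(num_tokens, mask_ratio=0.75, seed=None):
    """Split token indices into (visible, masked) without looking at modality."""
    rng = random.Random(seed)
    ids = list(range(num_tokens))
    rng.shuffle(ids)
    num_masked = int(round(num_tokens * mask_ratio))
    masked = sorted(ids[:num_masked])
    visible = sorted(ids[num_masked:])
    return visible, masked

# Hypothetical sequence: 16 optical tokens followed by 16 auxiliary tokens.
modalities = ["optical"] * 16 + ["auxiliary"] * 16
visible, masked = random_token_mask(len(modalities), seed=0)
masked_by_modality = {
    name: sum(1 for i in masked if modalities[i] == name)
    for name in ("optical", "auxiliary")
}
```

Because the draw is over the whole sequence, the number of hidden tokens per modality varies from sample to sample, which is what forces the reconstruction to draw on context from other modalities.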
Key Technical Components
1. FLOROGeoEncoder: Shared Vision Transformer encoder processing unified multimodal token sequences augmented with geo-positional embeddings (when available) and validity masks. Processes heterogeneous inputs without modality-specific branches; uses standard self-attention blocks with DropPath and LayerScale regularization applied uniformly across all token types.
2. FLOROSSDecoder: Lightweight modality-specific reconstruction heads used exclusively during pretraining; one decoder per output band group (e.g., multispectral, elevation/SAR) to supervise latent representation learning via self-supervised loss minimization over masked patches restored from visible context.
3. NormalizeWithValidity transform: Applies channel-wise normalization using validity channels as weights: pixel values are scaled according to which spectral groups or modalities are present in each patch, so that missing bands do not dominate gradients while signal strength is preserved where data exists.
Key Classes:
1. FLOROGeoEncoder - Main Vision Transformer encoder class for latent representation learning on grouped multimodal inputs with geo-position augmentation and validity-aware processing.
2. FLOROSSDecoder - Shallow decoder module used only during pretraining phase; reconstructs masked patches per output band group under self-supervised objective using cross-modal context from visible tokens via transformer blocks.
3. NormalizeWithValidity - Data transformation class that applies channel-wise normalization weighted by validity indicators — ensures missing bands do not distort learned representations while retaining full token sequence structure.
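One plausible reading of validity-weighted normalization is sketched below: each channel is standardized with its own statistics and then multiplied by its validity flag, so missing channels collapse to zero. The function name, the dictionary layout, and the statistics are hypothetical, and the real NormalizeWithValidity transform may weight channels differently.

```python
def normalize_with_validity(pixels, means, stds, validity, eps=1e-6):
    """Standardize each channel, then zero out channels flagged as missing.

    pixels:   {channel_name: [values]}
    validity: {channel_name: 0 or 1}
    """
    out = {}
    for name, values in pixels.items():
        weight = float(validity[name])
        out[name] = [weight * (v - means[name]) / (stds[name] + eps) for v in values]
    return out

# Toy patch: red is present, swir1 is missing and holds filler values.
normalized = normalize_with_validity(
    pixels={"red": [0.2, 0.4], "swir1": [9.0, 9.0]},
    means={"red": 0.3, "swir1": 0.1},
    stds={"red": 0.1, "swir1": 0.05},
    validity={"red": 1, "swir1": 0},
)
```

Under this reading a missing channel sits exactly at the normalized mean, so filler values cannot produce large activations or gradients.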
Architectural Features:
- Single unified ViT encoder: Processes heterogeneous inputs without modality-specific branches; uses standard self-attention blocks with DropPath and LayerScale regularization applied uniformly across all token types (optical, auxiliary, validity). Token sequences include positional embeddings augmented by projected geographic coordinates when metadata is available.
- Grouped channel structure + validity masking: Optical streams (Blue/Green/Red/NIR/SWIR/etc.) and auxiliary streams (Elevation/VV/VH) are concatenated into fixed-width tensors; each group has associated binary validity flags that influence normalization but do not remove tokens from sequence — enabling robust handling of incomplete sensor configurations.
- Hybrid geo-positional embeddings: Patch tokens augmented with projected Earth coordinates (EPSG:3857 global extent), globally normalized, then encoded via sinusoidal functions — provides spatial grounding independent of spectral content; applied conditionally when geospatial metadata is available during pretraining or inference.
- Modality-aware masking strategy: During pretraining, random patches are masked independently per token regardless of modality origin; FLOROSSDecoder restores them using visible context from same or other modalities through cross-modal attention mechanisms within transformer blocks — intermediate tokens optionally returned for downstream task adaptation.
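The geo-positional embedding described above (project to EPSG:3857, normalize globally, encode sinusoidally) can be sketched as follows. The projection formulas are the standard spherical Web Mercator ones; the octave-spaced frequencies, the embedding width, and the example coordinates are assumptions made for illustration.

```python
import math

EARTH_RADIUS_M = 6378137.0
HALF_EXTENT_M = math.pi * EARTH_RADIUS_M  # half-width of the EPSG:3857 square, about 20037508 m

def to_web_mercator(lon_deg, lat_deg):
    """Spherical Web Mercator projection (valid for roughly |lat| < 85.05 degrees)."""
    x = EARTH_RADIUS_M * math.radians(lon_deg)
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4 + math.radians(lat_deg) / 2))
    return x, y

def normalize_xy(x, y):
    """Map projected metres to [0, 1] using the global extent."""
    return ((x + HALF_EXTENT_M) / (2 * HALF_EXTENT_M),
            (y + HALF_EXTENT_M) / (2 * HALF_EXTENT_M))

def sinusoidal(value, dim):
    """Sin-cos features at octave-spaced frequencies (an illustrative choice)."""
    half = dim // 2
    freqs = [(2.0 ** i) * math.pi for i in range(half)]
    return [math.sin(value * f) for f in freqs] + [math.cos(value * f) for f in freqs]

def geo_embedding(lon_deg, lat_deg, dim=8):
    """Concatenate sinusoidal features of the normalized x and y coordinates."""
    x, y = to_web_mercator(lon_deg, lat_deg)
    u, v = normalize_xy(x, y)
    return sinusoidal(u, dim // 2) + sinusoidal(v, dim // 2)

origin_u, origin_v = normalize_xy(*to_web_mercator(0.0, 0.0))
example = geo_embedding(13.4, 52.5)  # hypothetical patch centre
```

In a model, a vector like `example` would be added to or concatenated with the ordinary positional embedding of a patch token, and skipped when coordinates are unavailable.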
FLORO+ Enhancement
- Curated multimodal diversity over scale: ~80K samples spanning Sentinel-2 (13 bands), Sentinel-1 SAR, SkySat HR imagery, UAV RGB/multispectral, and terrain products — prioritizes variation in sensing conditions rather than sheer volume to maximize transferability across ecological domains.
- Availability-aware processing pipeline: BandDropping augmentation randomly removes entire spectral groups during training; NormalizeWithValidity ensures remaining valid channels contribute proportionally to gradient updates — simulates real-world sensor incompleteness without architectural modification or dynamic weight generation.
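A band-dropping augmentation along these lines might look like the sketch below: valid groups are zeroed at random and their validity flags cleared, with one group always kept. The function signature, `drop_prob`, and the keep-one rule are assumptions; the notes only say that entire spectral groups are removed at random.

```python
import random

def band_dropping(sample, validity, drop_prob=0.3, rng=None):
    """Randomly blank whole groups and clear their validity flags.

    sample:   {group_name: [values]}
    validity: {group_name: 0 or 1}
    Inputs are copied, never modified in place.
    """
    rng = rng or random.Random()
    out_sample = {g: list(v) for g, v in sample.items()}
    out_validity = dict(validity)
    valid_groups = [g for g, ok in validity.items() if ok]
    if not valid_groups:
        return out_sample, out_validity
    keep = rng.choice(valid_groups)  # assumption: never drop every group
    for g in valid_groups:
        if g != keep and rng.random() < drop_prob:
            out_sample[g] = [0.0] * len(out_sample[g])
            out_validity[g] = 0
    return out_sample, out_validity

sample = {"visible": [0.1, 0.2, 0.3], "nir": [0.5, 0.6], "sar": [0.0, 0.0]}
validity = {"visible": 1, "nir": 1, "sar": 0}
dropped_sample, dropped_validity = band_dropping(
    sample, validity, drop_prob=1.0, rng=random.Random(0)
)
```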
Input Representation
FLORO structures input as two primary streams with internal grouping and explicit validity signaling:
Optical Stream
Contains reflectance values organized by spectral group plus validity indicators processed via NormalizeWithValidity transform:
- Blue/Green/Red — visible spectrum bands (present in UAV-RGB, HR Sat., S1/S2)
- Red Edge — vegetation stress indicator band (UAV-MS, S1/S2 only)
- NIR / NIR A — near-infrared for biomass and health assessment (UAV-MS, HR Sat., S1/S2)
- SWIR 1/SWIR 2 — shortwave infrared for moisture content analysis (S1/S2 only)
- Validity Channels — binary flags per group indicating whether data is present or missing; directly used in NormalizeWithValidity to scale normalization parameters without removing tokens
Auxiliary Stream
Includes non-optical geospatial features also processed via NormalizeWithValidity:
- Elevation (DSM/DTM/DEM): topographic structure information (available for most modalities; pure RGB UAV data may lack it)
- SAR VV/VH — dual-polarization radar backscatter for surface texture and moisture detection (S1/S2 only)
- Validity Channels — same mechanism as optical stream to denote availability; ensures missing auxiliary data does not distort learned representations
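To show how the two streams could keep a fixed width regardless of sensor, the sketch below fills missing groups with zeros and appends one validity flag per group. The group names and widths (8 optical bands, 3 auxiliary bands) are a hypothetical layout inferred from the band lists above, not a confirmed FLORO specification.

```python
# Hypothetical layouts: (group name, number of bands in the group).
OPTICAL_LAYOUT = [("visible", 3), ("red_edge", 1), ("nir", 2), ("swir", 2)]
AUXILIARY_LAYOUT = [("elevation", 1), ("sar", 2)]

def build_stream(available, layout):
    """Return per-pixel data channels followed by one validity flag per group.

    available: {group_name: [band values]} for the groups this sensor provides.
    """
    data, flags = [], []
    for group, width in layout:
        values = available.get(group)
        if values is None:
            data.extend([0.0] * width)  # zero-fill the missing group
            flags.append(0.0)
        else:
            if len(values) != width:
                raise ValueError(f"{group} expects {width} bands, got {len(values)}")
            data.extend(float(v) for v in values)
            flags.append(1.0)
    return data + flags

# A UAV RGB pixel: only the visible group is present, no auxiliary data.
uav_optical = build_stream({"visible": [0.1, 0.2, 0.3]}, OPTICAL_LAYOUT)
uav_auxiliary = build_stream({}, AUXILIARY_LAYOUT)
```

Every sensor yields vectors of the same length, so the encoder input shape never changes; only the flags tell it which groups carry real data.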
Implementation approach
- Token grouping strategy: Each spectral or auxiliary group forms a contiguous block within the input tensor after PatchEmbed transformation; positional embeddings are applied uniformly across all groups including geo-position augmentation derived from projected coordinates (when available).
- Masking scheme: Random patch masking occurs independently per token regardless of its originating modality — forcing cross-modal contextual inference during FLOROSSDecoder reconstruction via intermediate tokens optionally returned for downstream tasks.
- Decoder design: Separate lightweight MLP-based decoders map latent representations back to pixel space for each original input group (e.g., 8-band multispectral output, 3-band elevation/SAR output); these are removed after pretraining — replaced by task-specific heads like FLOROLinearClassDecoder during transfer phase.
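The transfer phase, where pretraining decoders are dropped and a task head is trained on a frozen encoder, can be illustrated with a toy linear probe. `FrozenEncoder` and `LinearHead` are stand-ins invented for this sketch; they do not reproduce FLOROGeoEncoder or FLOROLinearClassDecoder.

```python
import math

class FrozenEncoder:
    """Stand-in for a pretrained encoder; its weights are never updated."""

    def __init__(self):
        self.weights = [[0.5, -0.2], [0.1, 0.8]]

    def __call__(self, x):
        return [math.tanh(sum(w * v for w, v in zip(row, x))) for row in self.weights]

class LinearHead:
    """Binary logistic-regression head trained on top of frozen features."""

    def __init__(self, dim):
        self.w = [0.0] * dim
        self.b = 0.0

    def predict(self, feats):
        z = sum(w * f for w, f in zip(self.w, feats)) + self.b
        return 1.0 / (1.0 + math.exp(-z))

    def sgd_step(self, feats, label, lr=0.5):
        err = self.predict(feats) - label  # gradient of cross-entropy w.r.t. the logit
        self.w = [w - lr * err * f for w, f in zip(self.w, feats)]
        self.b -= lr * err

def train_head(encoder, head, data, epochs=200):
    """Only the head receives updates; the encoder is used as a fixed feature map."""
    for _ in range(epochs):
        for x, y in data:
            head.sgd_step(encoder(x), y)

encoder = FrozenEncoder()
snapshot = [row[:] for row in encoder.weights]
head = LinearHead(dim=2)
data = [([1.0, 0.0], 1), ([0.0, 1.0], 0), ([2.0, 0.5], 1), ([0.2, 1.5], 0)]
train_head(encoder, head, data)
predictions = [head.predict(encoder(x)) for x, _ in data]
```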
Purpose
The grouped channel architecture allows FLORO to ingest variable sensor combinations without architectural modification or dynamic weight generation (unlike DOFA’s wavelength-conditioned hypernetworks). When trained on full Sentinel-2 + SAR + elevation, it can later transfer to datasets containing only RGB or UAV-only imagery by leveraging learned cross-modal priors encoded in the shared latent space. This is enabled by validity masking, applied through the NormalizeWithValidity transform during both pretraining and inference. Geo-position embeddings further support spatial coherence across geographically dispersed training samples, and geographic metadata is not required at test time when it is unavailable.
Comprehensive Architectural Comparison
Summary of Key Differences
DOFA employs a neuroplasticity-inspired dynamic hypernetwork that generates network weights conditioned on the central wavelengths of input spectral bands, enabling continuous architectural adaptation to varying channel counts (2–202+) without modifying model structure externally. This approach treats sensor diversity as an internal parameterization problem solved via wavelength-aware weight generation (TransformerWeightGenerator, Dynamic_MLP_OFA).
FLORO adopts a fundamentally different strategy: it maintains a fixed, unified Vision Transformer encoder that processes heterogeneous inputs concatenated into fixed-width tensors with explicit validity/availability masks. Missing bands or modalities do not trigger structural changes — instead, they influence gradient contribution via the NormalizeWithValidity transform during training and are handled at inference time by leveraging learned cross-modal priors encoded in the shared latent space. FLORO also uniquely integrates hybrid geo-positional embeddings derived from projected Earth coordinates to provide spatial grounding independent of spectral content.
While both models use masked autoencoding for self-supervised pretraining, DOFA interpolates in weight space according to wavelength configurations during training, whereas FLORO performs random patch masking across all tokens regardless of modality origin and reconstructs them using separate shallow decoders per output band group (FLOROSSDecoder) — which are discarded after pretraining. FLORO’s frozen-encoder transfer protocol ensures that downstream tasks train only task-specific heads atop immutable encoder weights, emphasizing pure representational quality over architectural flexibility.
In essence:
- DOFA adapts *internally* via dynamic weight generation conditioned on spectral properties.
- FLORO adapts *externally* via data formatting (validity masks, normalization) and relies on a static architecture with cross-modal contextual inference enabled by structured input grouping and geo-positioning.
Both approaches achieve multimodal robustness but through opposing paradigms: DOFA embraces architectural plasticity; FLORO enforces structural rigidity while maximizing data-level flexibility.
| Topic | DOFA (Dynamic One-For-All) | FLORO (Foundation Model for Ecological Remote Sensing Across Sensors and Scales) | Notes |
|---|---|---|---|
| Design Philosophy | Neuroplasticity-inspired: Dynamic weight generation conditioned on spectral wavelength to adapt architecture per sensor type. | Unified multimodal input space: Fixed-width tensor concatenation with validity masks; no structural adaptation — encoder remains static regardless of input modality or channel count. | DOFA adapts internally via hypernetworks; FLORO adapts externally via data formatting and normalization strategies. |
| Flexibility Mechanism | Dynamic Hypernetwork (TransformerWeightGenerator) generates weights based on central wavelengths, enabling continuous adaptation to 2–202+ channels without architectural change. | Validity-aware processing: NormalizeWithValidity scales normalization per band group using binary availability flags; missing bands do not remove tokens but reduce gradient contribution during training. | DOFA changes internal parameters dynamically; FLORO preserves fixed architecture and handles incompleteness through data weighting and masking. |
| Adaptation Strategy | Wavelength-conditioned dynamic MLP layers (Dynamic_MLP_OFA) adjust structure based on input spectral configuration, mimicking the brain’s reorganization under novel stimuli. | Frozen-encoder transfer: Pretrained encoder weights are immutable during downstream tasks; only lightweight task-specific decoders (e.g., FLOROLinearClassDecoder) are trained for classification, segmentation, or regression. | DOFA adapts at training time via dynamic layers; FLORO freezes representation after pretraining and relies on frozen-encoder evaluation protocol. |
| Training Approach | Wavelength-aware Masked Image Modeling (MIM) + hierarchical distillation; interpolates in weight space according to wavelength configurations during self-supervised pretraining. | Modality-aware masked autoencoding: Random patch masking across all tokens regardless of modality origin; reconstruction supervised by separate shallow decoders per output band group (FLOROSSDecoder) using cross-modal context from visible patches. | DOFA uses spectral interpolation in weight space; FLORO uses spatial-temporal token-level masking with modality-specific reconstruction heads. |
| Code Implementation | Compact, single-file approach with specialized dynamic components: MaskedAutoencoderViT, Dynamic_MLP_OFA, TransformerWeightGenerator. Handles any number of input channels via hypernetwork-generated weights. | Modular multi-component design: FLOROGeoEncoder (shared ViT), FLOROSSDecoder (pretraining-only reconstruction heads), NormalizeWithValidity (data transform). Requires fixed tensor width with validity channels appended to optical/auxiliary streams. | DOFA is parameter-adaptive; FLORO is structure-static but data-flexible via masking and normalization. |
| Resolution Handling | No explicit resolution handling; adapts through channel count flexibility only. Assumes uniform spatial dimensions across inputs. | Implicitly handles variable resolutions via patch tokenization (PatchEmbed) without requiring GSD specification; positional embeddings are geo-coordinate-based, not pixel-grid-dependent. | Neither model explicitly controls feature map resolution during inference; FLORO’s geo-positioning provides geographic grounding independent of sensor scale. |
| Architecture Modularity | Unified architecture with dynamic MLP layers for adaptability; no separate encoder/decoder modules beyond standard MAE structure. | Separate encoder (FLOROGeoEncoder) and decoder (FLOROSSDecoder) components; decoders discarded after pretraining, leaving only frozen encoder for transfer tasks. | DOFA integrates adaptation into core layers; FLORO separates representation learning (encoder) from reconstruction supervision (decoders). |
| Training Flexibility | Channel count varies per sample; the model adapts via dynamic weight generation conditioned on wavelength lists (wave_lists). Supports SAR, RGB, hyperspectral within same architecture. | Band dropping augmentation randomly removes entire spectral groups during training; NormalizeWithValidity ensures remaining valid channels contribute proportionally to gradients, simulating real-world sensor incompleteness without architectural modification. | DOFA adapts per-sample via dynamic weights; FLORO trains on diverse configurations but uses fixed architecture with data-level robustness mechanisms. |
| Data Handling | Simpler data handling focused on channel count variations — assumes complete spectral coverage per sample, handled dynamically by hypernetworks. | Complex multimodal dataset structure: Optical (13 bands max), auxiliary (elevation, SAR VV/VH), validity masks for each group; supports UAV-RGB, UAV-MS, HR Sat., S1/S2 with partial overlaps in band availability. | DOFA assumes complete spectral inputs per sample; FLORO explicitly models incomplete sensor configurations via validity channels and normalization weighting. |
| Input Handling | Takes any number of channels as input — preprocessing handles different sensor specifications (SAR: 2, RGB: 3, S2: 9+) without requiring fixed tensor width or metadata injection. | Requires structured grouped inputs: Optical stream (Blue/Green/Red/NIR/SWIR/etc.) + auxiliary stream (Elevation/VV/VH) with validity flags; geo-position embeddings added when coordinates available — enables cross-modal contextual inference despite missing bands/modalities. | DOFA is channel-agnostic via dynamic weights; FLORO is modality-aware via structured grouping and availability signaling. |
| Evaluation Focus | Demonstrates capability across various tasks (segmentation, classification) with emphasis on spectral adaptability — does not emphasize resolution or geographic transferability. | Explicitly emphasizes frozen-encoder evaluation under PANGAEA benchmark protocol; tests transfer across semantic segmentation, scene classification, and regression tasks using curated multimodal diversity (~80K samples). | DOFA focuses on spectral flexibility; FLORO focuses on ecological domain generalization via sensor heterogeneity and geographic grounding. |
| Geo-Spatial Awareness | None — no explicit geospatial metadata integration; relies purely on spectral content for adaptation. | Hybrid geo-positional embeddings: Projected Earth coordinates (EPSG:3857) globally normalized and sinusoidally encoded — provides spatial context independent of spectral content; applied conditionally when metadata available. | FLORO uniquely incorporates geographic positioning into token sequence; DOFA operates solely on spectral domain without spatial grounding. |
| Decoder Usage Post-Pretraining | Standard MAE decoder retained for reconstruction during pretraining; no explicit discard mechanism described, but implied to be replaced by task-specific heads if needed. | Shallow decoders (FLOROSSDecoder) explicitly discarded after pretraining; only FLOROGeoEncoder is retained and used with frozen weights for downstream tasks via linear or convolutional classifiers. | FLORO enforces strict separation between representation learning (pretraining) and task adaptation (inference); DOFA’s decoder role less clearly defined post-pretraining. |
| Cross-Modal Inference Capability | Limited to spectral domain: adapts within optical/SAR/hyperspectral via wavelength conditioning; no explicit mechanism for fusing non-spectral modalities like elevation or SAR during inference without prior training configuration. | Trained on fused optical + auxiliary (SAR, elevation) inputs with validity masking; can infer from partial combinations at test time by leveraging learned cross-modal priors encoded in shared latent space via NormalizeWithValidity. | FLORO supports true multimodal fusion during inference; DOFA’s adaptation is primarily spectral and may not generalize to non-spectral auxiliary data without reconfiguration. |