TorchGeo DOFA FLORO

DOFA Theory and Architecture Analysis

Core Design Principles

Neuroplasticity-inspired: Based on the brain's capacity to reorganize dynamically in response to novel stimuli
Wavelength-conditioned dynamic hypernetwork: Uses wavelength as unifying parameter across EO modalities
Unified Transformer framework: Single architecture that handles diverse spectral bands and sensor modalities

Key Technical Components

1. Dynamic Hypernetwork: Generates network weights based on central wavelengths of each spectral band
2. Shared Vision Backbone: Universal feature learning module for all heterogeneous data modalities
3. Wavelength-aware Masked Image Modeling (MIM): Pretraining strategy that interpolates in weight space according to wavelength configurations

Key Classes:

1. MaskedAutoencoderViT - Main encoder class
2. Dynamic_MLP_OFA - Dynamic MLP layer for channel adaptation
3. TransformerWeightGenerator - For neuroplasticity-inspired weight generation

Architectural Features:

Single unified ViT: Uses standard Vision Transformer backbone with modifications
Dynamic MLP layers: Dynamic_MLP_OFA that adapts based on input channels
Wavelength-aware processing: Uses wave_lists for different spectral band handling
Neuroplasticity-inspired: Weight generation through transformer-based mechanism
Channel-flexible design: Works with inputs of 2 to 202 or more channels through dynamic layer adaptation

DOFA+ Enhancement

Hierarchical Distillation Strategy: Preserves semantic priors from source model while guiding EO-specific pattern learning
Dual Training Strategy:
- Wavelength-aware MIM for EO-specific spatial patterns
- Hierarchical feature distillation for refining inherited semantic representations
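
The two objectives can be read as two terms of one training loss. The sketch below is a minimal illustration in plain Python, not the DOFA+ implementation: it assumes mean squared error for both terms, and the function names and the `distill_weight` factor are hypothetical, since this page does not specify the loss form.

```python
def mse(a, b):
    """Mean squared error between two equal-length sequences of floats."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def masked_reconstruction_loss(pred, target, mask):
    """MIM term: squared error averaged over masked positions only (mask == 1)."""
    errors = [(p - t) ** 2 for p, t, m in zip(pred, target, mask) if m]
    return sum(errors) / len(errors) if errors else 0.0


def hierarchical_distillation_loss(student_feats, teacher_feats):
    """Distillation term: MSE between student and source-model features,
    averaged over several feature levels (assumed form)."""
    per_level = [mse(s, t) for s, t in zip(student_feats, teacher_feats)]
    return sum(per_level) / len(per_level)


def dual_loss(pred, target, mask, student_feats, teacher_feats, distill_weight=1.0):
    """Weighted sum of the two terms; distill_weight is a hypothetical knob."""
    mim = masked_reconstruction_loss(pred, target, mask)
    distill = hierarchical_distillation_loss(student_feats, teacher_feats)
    return mim + distill_weight * distill
```

Setting `distill_weight` to 0 reduces this to plain masked image modeling, which makes the contribution of the distillation term easy to isolate.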

MLP Layers

In the DOFA code structure, "dynamic MLP layers" refers to a specific architectural component that adapts its parameters based on input characteristics:

Dynamic MLP Layers in DOFA:

Dynamic_MLP_OFA - A specialized MLP (Multi-Layer Perceptron) layer that dynamically adjusts its weights and structure
Unlike standard fixed MLPs, these layers can modify their internal parameters based on input features

How MLP Layers work:

1. Channel-adaptive processing: The MLP adapts to different input channel counts (2-202+ channels)
2. Wavelength-conditioned: Uses wavelength information to determine the appropriate weight configuration
3. Dynamic weight generation: Instead of fixed weights, the layer generates weights based on input characteristics
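
Wavelength conditioning requires each band's central wavelength to be turned into a vector that a network can consume. One common way to do this is a sinusoidal encoding; the sketch below assumes wavelengths in micrometres, and the function name, `dim`, `scale` and `base` are illustrative choices rather than values taken from the DOFA code.

```python
import math


def wavelength_embedding(wavelength_um, dim=8, scale=1000.0, base=10000.0):
    """Encode one central wavelength (micrometres) as a sin/cos vector of length dim."""
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    position = wavelength_um * scale  # e.g. 0.665 um becomes 665.0
    half = dim // 2
    freqs = [1.0 / (base ** (i / half)) for i in range(half)]
    sines = [math.sin(position * f) for f in freqs]
    cosines = [math.cos(position * f) for f in freqs]
    return sines + cosines


# Approximate visible-band central wavelengths (red, green, blue), for illustration only.
rgb_wavelengths = [0.665, 0.560, 0.490]
rgb_embeddings = [wavelength_embedding(w) for w in rgb_wavelengths]
```

Because the encoding depends only on the wavelength, two sensors that share a band produce the same vector for it, regardless of how many other bands each sensor has.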

Implementation approach

TransformerWeightGenerator: A component that dynamically generates network weights based on central wavelengths
Hypernetwork concept: The dynamic MLP layer acts as a hypernetwork that produces weights for other layers
Spectral band awareness: The layer structure changes to accommodate different spectral configurations
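
The hypernetwork idea can be shown with a toy example: a single shared generator maps each band's wavelength to that band's kernel, so the number of generated kernels follows the number of input bands. This is a sketch under simplifying assumptions. `ToyWeightGenerator` is a hypothetical stand-in that uses a fixed random linear map, whereas the generator described above is transformer-based and learned.

```python
import math
import random


def embed_wavelength(wavelength_um, dim=4):
    """Tiny sin/cos encoding of a central wavelength given in micrometres."""
    half = dim // 2
    freqs = [1.0 / (100.0 ** (i / half)) for i in range(half)]
    pos = wavelength_um * 1000.0
    return [math.sin(pos * f) for f in freqs] + [math.cos(pos * f) for f in freqs]


class ToyWeightGenerator:
    """Shared linear map from a wavelength embedding to one band's kernel."""

    def __init__(self, embed_dim=4, kernel_size=6, seed=0):
        rng = random.Random(seed)
        self.embed_dim = embed_dim
        self.matrix = [[rng.uniform(-1.0, 1.0) for _ in range(embed_dim)]
                       for _ in range(kernel_size)]

    def kernel_for(self, wavelength_um):
        emb = embed_wavelength(wavelength_um, self.embed_dim)
        return [sum(w * e for w, e in zip(row, emb)) for row in self.matrix]

    def generate(self, wavelengths_um):
        """Return one kernel per input band."""
        return [self.kernel_for(w) for w in wavelengths_um]


def project_pixel(band_values, kernels):
    """Project one pixel (one value per band) to a fixed-size feature vector."""
    size = len(kernels[0])
    return [sum(v * k[j] for v, k in zip(band_values, kernels)) for j in range(size)]


generator = ToyWeightGenerator()
rgb_kernels = generator.generate([0.665, 0.560, 0.490])  # 3 bands
two_band_kernels = generator.generate([0.665, 0.842])    # 2 bands, same generator
```

Both inputs are projected to feature vectors of the same length, which is what lets a single shared backbone sit behind inputs with different channel counts.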

Purpose

The dynamic MLP layers allow DOFA to handle varying sensor specifications without requiring multiple fixed architectures. When input data has 2 channels (SAR), 3 channels (RGB), or 202 channels (hyperspectral), the same model architecture can adapt through these dynamic layers rather than needing separate models for each modality.

FLORO Theory and Architecture Analysis

Core Design Principles

Diversity-driven representation learning: Prioritizes sensor heterogeneity over corpus scale to enable cross-modal transferability
Frozen-encoder evaluation protocol: Pretrained encoder remains fixed during downstream task adaptation, testing pure representational quality
Masked autoencoding with modality-specific reconstruction: Learns latent structure by predicting masked tokens per input stream using lightweight decoders

Key Technical Components

1. FLOROGeoEncoder: Main Vision Transformer encoder class for latent representation learning on grouped multimodal inputs
2. FLOROSSDecoder: Masked autoencoding decoder used only during pretraining phase; reconstructs missing patches from visible context
3. FLOROLinearClassDecoder: Task-specific linear classifier head trained under frozen-encoder protocol for downstream tasks

Key Classes:

1. FLOROGeoEncoder - Main Vision Transformer encoder class for latent representation learning
2. FLOROSSDecoder - Masked autoencoding decoder used only during pretraining phase; reconstructs missing patches from visible context
3. FLOROLinearClassDecoder - Task-specific linear classifier head trained under frozen-encoder protocol for downstream tasks

Architectural Features:

Single unified ViT encoder: Processes grouped multimodal inputs without modality-specific branches or fusion layers; uses standard self-attention blocks with DropPath and LayerScale regularization
Grouped channel structure: Optical and auxiliary streams are concatenated into fixed-width input tensors with validity masks indicating presence/absence of bands/modalities; processed via PatchEmbed layer that handles variable spectral groupings
Masked token prediction: During pretraining, random patch masking occurs independently per token regardless of its originating modality; model reconstructs them using visible context from same or other modalities through FLOROSSDecoder
Frozen-encoder transfer paradigm: Downstream tasks train only task-specific decoders (e.g., FLOROLinearClassDecoder) atop fixed encoder weights — no fine-tuning of base representation
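
The frozen-encoder protocol amounts to linear probing: features are computed with fixed encoder weights and only a small head is trained. The sketch below illustrates that with a toy encoder and a logistic-regression head in plain Python. `FrozenEncoder`, `LinearHead` and the data are hypothetical stand-ins, not the FLORO classes.

```python
import math


class FrozenEncoder:
    """Stand-in for a pretrained encoder; its weights are never updated here."""

    def __init__(self):
        self.weights = [[1.0, -1.0], [0.5, 0.5]]  # fixed, illustrative values

    def encode(self, x):
        return [sum(w * v for w, v in zip(row, x)) for row in self.weights]


class LinearHead:
    """Binary logistic-regression head, the only trainable part."""

    def __init__(self, in_dim):
        self.w = [0.0] * in_dim
        self.b = 0.0

    def predict_proba(self, feats):
        z = sum(w * f for w, f in zip(self.w, feats)) + self.b
        return 1.0 / (1.0 + math.exp(-z))

    def sgd_step(self, feats, label, lr):
        err = self.predict_proba(feats) - label  # gradient of log loss w.r.t. z
        self.w = [w - lr * err * f for w, f in zip(self.w, feats)]
        self.b -= lr * err


def train_linear_probe(encoder, samples, epochs=100, lr=0.5):
    # The encoder is used for inference only; features can be computed once.
    feats = [(encoder.encode(x), y) for x, y in samples]
    head = LinearHead(in_dim=len(feats[0][0]))
    for _ in range(epochs):
        for f, y in feats:
            head.sgd_step(f, y, lr)
    return head


encoder = FrozenEncoder()
data = [([1.0, 0.0], 1), ([0.9, 0.2], 1), ([0.0, 1.0], 0), ([0.1, 0.8], 0)]
weights_before = [row[:] for row in encoder.weights]
head = train_linear_probe(encoder, data)
```

Since the encoder never changes, any difference in downstream accuracy between two pretrained encoders under this protocol reflects the features themselves rather than the adaptation procedure.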

FLORO+ Enhancement

Curated multimodal diversity strategy: Uses ~80K samples spanning Sentinel-2, Sentinel-1, SkySat, UAV RGB/multispectral, and terrain products to maximize sensing condition variation rather than sample count
Availability-aware processing: Validity channels explicitly signal which spectral groups or modalities are present per patch; implemented via NormalizeWithValidity transform that scales data using validity masks, enabling robust handling of incomplete sensor configurations during both pretraining and transfer
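
A validity-aware normalization step might look like the following. This is a guess at the general idea, not the NormalizeWithValidity implementation: the function name, the per-channel `mean` and `std` arguments, and the choice to fill invalid entries with 0 are all assumptions made for illustration.

```python
def normalize_with_validity_sketch(values, valid, mean, std, fill=0.0):
    """Standardize valid entries and replace invalid ones with a fill value.

    values: list of floats for one channel
    valid:  list of 0/1 flags of the same length (1 means data is present)
    """
    if len(values) != len(valid):
        raise ValueError("values and valid must have the same length")
    if std <= 0:
        raise ValueError("std must be positive")
    return [(v - mean) / std if ok else fill for v, ok in zip(values, valid)]


# Toy red-band values; -9999.0 marks a missing pixel in this example.
red = [0.12, 0.30, -9999.0, 0.18]
red_valid = [1, 1, 0, 1]
red_norm = normalize_with_validity_sketch(red, red_valid, mean=0.2, std=0.1)
```

Masking before scaling keeps sentinel values such as no-data markers from leaking into the normalized input as extreme outliers.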

Input Representation

FLORO structures input as two primary streams with internal grouping:

Optical Stream

Contains reflectance values organized by spectral group plus validity indicators:

Blue/Green/Red — visible spectrum bands
Red Edge — vegetation stress indicator band
NIR / NIR A — near-infrared for biomass and health assessment
SWIR 1/SWIR 2 — shortwave infrared for moisture content analysis
Validity Channels — binary flags per group indicating whether data is present or missing; processed via NormalizeWithValidity transform

Auxiliary Stream

Includes non-optical geospatial features:

Elevation (DSM/DTM/DEM) — topographic structure information
SAR VV/VH — dual-polarization radar backscatter for surface texture and moisture detection
Validity Channels — same mechanism as optical stream to denote availability; processed via NormalizeWithValidity transform
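
To make the layout concrete, the sketch below assembles one pixel of a stream as a fixed-width vector: group values in a fixed order, followed by one validity flag per group. The group names and widths are hypothetical, chosen only to mirror the lists above, and missing groups are zero-filled by assumption.

```python
# Hypothetical group order and widths for the optical stream.
OPTICAL_GROUPS = ["blue_green_red", "red_edge", "nir", "swir"]
GROUP_WIDTH = {"blue_green_red": 3, "red_edge": 1, "nir": 1, "swir": 2}


def build_stream(group_order, group_width, available):
    """Concatenate groups into one fixed-width pixel vector plus validity flags.

    available maps group name -> list of channel values for the groups present.
    Missing groups are zero-filled and flagged 0.
    """
    data, validity = [], []
    for name in group_order:
        width = group_width[name]
        if name in available:
            values = list(available[name])
            if len(values) != width:
                raise ValueError(f"{name}: expected {width} values, got {len(values)}")
            data.extend(values)
            validity.append(1.0)
        else:
            data.extend([0.0] * width)
            validity.append(0.0)
    return data + validity


rgb_only = build_stream(OPTICAL_GROUPS, GROUP_WIDTH,
                        {"blue_green_red": [0.08, 0.10, 0.07]})
full = build_stream(OPTICAL_GROUPS, GROUP_WIDTH, {
    "blue_green_red": [0.08, 0.10, 0.07],
    "red_edge": [0.15],
    "nir": [0.32],
    "swir": [0.21, 0.12],
})
```

An RGB-only sample and a full multispectral sample end up with the same width, so the same encoder input layer can accept both.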

Implementation approach

Token grouping strategy: Each spectral or auxiliary group forms a contiguous block within the input tensor; positional embeddings are applied uniformly across all groups after PatchEmbed transformation
Masking scheme: Random patch masking occurs independently per token regardless of its originating modality, forcing cross-modal contextual inference during FLOROSSDecoder reconstruction
Decoder design: Separate lightweight MLP-based decoders map latent representations back to pixel space for each original input group — these are removed after pretraining; downstream tasks use linear heads like FLOROLinearClassDecoder
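
The masking scheme can be sketched as drawing a random subset of token indices without looking at which modality each token came from. The code below is an illustration only: the function name is hypothetical, and the 0.75 mask ratio is a commonly used masked-autoencoder default, not a value stated on this page.

```python
import random


def random_token_mask(num_tokens, mask_ratio, seed=None):
    """Return a 0/1 list in which 1 marks a masked token.

    Exactly round(num_tokens * mask_ratio) tokens are masked, chosen uniformly.
    """
    if not 0.0 <= mask_ratio <= 1.0:
        raise ValueError("mask_ratio must be between 0 and 1")
    rng = random.Random(seed)
    num_masked = int(round(num_tokens * mask_ratio))
    masked = set(rng.sample(range(num_tokens), num_masked))
    return [1 if i in masked else 0 for i in range(num_tokens)]


# Each token has a modality label, but the mask ignores it.
token_modality = ["optical"] * 6 + ["sar"] * 4 + ["elevation"] * 2
mask = random_token_mask(len(token_modality), mask_ratio=0.75, seed=0)
visible = [m for m, flag in zip(token_modality, mask) if flag == 0]
```

Because the selection is modality-blind, the visible set in a given sample may contain tokens from only some modalities, which is what forces the model to reconstruct one modality from another.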

Purpose

The grouped channel architecture allows FLORO to ingest variable sensor combinations without architectural modification. A model trained on full Sentinel-2 (13 bands) + SAR + elevation can later transfer to datasets containing only RGB or UAV-only imagery by leveraging the cross-modal priors encoded in the shared latent space. Validity masking, applied through the NormalizeWithValidity transform during both pretraining and inference, is what makes this possible.
