Difference between revisions of "TorchGeo DOFA FLORO"
Revision as of 19:00, 31 May 2026
DOFA Theory and Architecture Analysis
Core Design Principles
- Neuroplasticity-inspired: Based on the brain's capacity to reorganize dynamically in response to novel stimuli
- Wavelength-conditioned dynamic hypernetwork: Uses wavelength as unifying parameter across EO modalities
- Unified Transformer framework: Single architecture that handles diverse spectral bands and sensor modalities
Key Technical Components
1. Dynamic Hypernetwork: Generates network weights based on central wavelengths of each spectral band
2. Shared Vision Backbone: Universal feature learning module for all heterogeneous data modalities
3. Wavelength-aware Masked Image Modeling (MIM): Pretraining strategy that interpolates in weight space according to wavelength configurations
Key Classes:
1. MaskedAutoencoderViT - Main encoder class
2. Dynamic_MLP_OFA - Dynamic MLP layer for channel adaptation
3. TransformerWeightGenerator - For neuroplasticity-inspired weight generation
Architectural Features:
- Single unified ViT: Uses standard Vision Transformer backbone with modifications
- Dynamic MLP layers: Dynamic_MLP_OFA that adapts based on input channels
- Wavelength-aware processing: Uses wave_lists for different spectral band handling
- Neuroplasticity-inspired: Weight generation through transformer-based mechanism
- Channel-flexible design: Works with 2-202+ channels through dynamic layer adaptation
DOFA+ Enhancement
- Hierarchical Distillation Strategy: Preserves semantic priors from source model while guiding EO-specific pattern learning
- Dual Training Strategy:
- Wavelength-aware MIM for EO-specific spatial patterns
- Hierarchical feature distillation for refining inherited semantic representations
MLP Layers
In the DOFA code structure, "dynamic MLP layers" refers to a specific architectural component that adapts its parameters based on input characteristics:
Dynamic MLP Layers in DOFA:
- Dynamic_MLP_OFA: A specialized MLP (Multi-Layer Perceptron) layer that dynamically adjusts its weights and structure
- Unlike standard fixed MLPs, these layers can modify their internal parameters based on input features
How MLP Layers work:
1. Channel-adaptive processing: The MLP adapts to different input channel counts (2-202+ channels)
2. Wavelength-conditioned: Uses wavelength information to determine the appropriate weight configuration
3. Dynamic weight generation: Instead of fixed weights, the layer generates weights based on input characteristics
Implementation approach
- TransformerWeightGenerator: A component that dynamically generates network weights based on central wavelengths
- Hypernetwork concept: The dynamic MLP layer acts as a hypernetwork that produces weights for other layers
- Spectral band awareness: The layer structure changes to accommodate different spectral configurations
Purpose
The dynamic MLP layers allow DOFA to handle varying sensor specifications without requiring multiple fixed architectures. When input data has 2 channels (SAR), 3 channels (RGB), or 202 channels (hyperspectral), the same model architecture can adapt through these dynamic layers rather than needing separate models for each modality.
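The idea of generating embedding weights from central wavelengths can be illustrated with a small NumPy sketch. This is not DOFA's implementation: `ToyWeightGenerator`, `wavelength_encoding`, and `embed_patches` are hypothetical names, a single linear projection stands in for the transformer-based generator, and the two-channel wavelengths are placeholder values. It only shows why one set of generator parameters can serve inputs with different channel counts.

```python
import numpy as np

def wavelength_encoding(wavelengths_um, dim=16):
    """Sinusoidal encoding of central wavelengths (micrometres) -> (C, dim)."""
    w = np.asarray(wavelengths_um, dtype=np.float64)[:, None]
    freqs = 1.0 / (10000.0 ** (np.arange(dim // 2) / (dim // 2)))
    angles = w * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)

class ToyWeightGenerator:
    """Hypothetical stand-in for a weight generator: one linear map."""
    def __init__(self, enc_dim=16, patch=4, embed_dim=8, seed=0):
        rng = np.random.default_rng(seed)
        self.enc_dim, self.patch, self.embed_dim = enc_dim, patch, embed_dim
        self.proj = rng.normal(0.0, 0.02, size=(enc_dim, patch * patch * embed_dim))

    def __call__(self, wavelengths_um):
        enc = wavelength_encoding(wavelengths_um, self.enc_dim)
        # One (patch*patch, embed_dim) weight slice per input channel.
        return (enc @ self.proj).reshape(-1, self.patch * self.patch, self.embed_dim)

def embed_patches(image, wavelengths_um, gen):
    """image: (C, H, W) -> tokens: (num_patches, embed_dim)."""
    c, h, w = image.shape
    p = gen.patch
    weights = gen(wavelengths_um)                       # (C, p*p, D)
    patches = (image.reshape(c, h // p, p, w // p, p)
                    .transpose(1, 3, 0, 2, 4)
                    .reshape(-1, c, p * p))             # (N, C, p*p)
    return np.einsum("nck,ckd->nd", patches, weights)   # sum over channels

gen = ToyWeightGenerator()
rgb_tokens = embed_patches(np.ones((3, 8, 8)), [0.665, 0.560, 0.490], gen)
# Placeholder wavelengths for a two-channel input (illustrative only).
two_ch_tokens = embed_patches(np.ones((2, 8, 8)), [5.4, 5.4], gen)
```

Both calls reuse the same `gen.proj` parameters and produce tokens of identical shape, which is the property the dynamic layers rely on.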
FLORO Theory and Architecture Analysis
Core Design Principles
- Unified multimodal input space: Heterogeneous remote sensing inputs (optical, SAR, elevation) are concatenated into fixed-width tensors with validity/availability masks indicating presence or absence of bands/modalities — no structural adaptation per sensor type.
- Frozen-encoder transfer protocol: Pretrained encoder weights remain immutable during downstream task fine-tuning; only lightweight decoders are trained for specific tasks (classification, segmentation, regression).
- Modality-aware masked autoencoding: Random patch masking occurs independently across all input tokens regardless of originating modality; reconstruction supervised by separate shallow decoders per output band group — discarded after pretraining.
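The frozen-encoder transfer protocol can be sketched as a linear probe: features come from an encoder that is never updated, and only a small head is fitted. The example below is a NumPy illustration under stated assumptions, not FLORO code. `W_enc` is a random stand-in for pretrained weights, and `encode` and `train_linear_probe` are hypothetical names.

```python
import numpy as np

rng = np.random.default_rng(0)
W_enc = rng.normal(size=(6, 4))   # stand-in for pretrained encoder weights
W_enc.setflags(write=False)       # frozen: any in-place update would raise

def encode(x):
    """Frozen feature extractor (a single tanh layer for illustration)."""
    return np.tanh(x @ W_enc)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce(p, y):
    """Binary cross-entropy, clipped for numerical safety."""
    p = np.clip(p, 1e-12, 1.0 - 1e-12)
    return float(-np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))

def train_linear_probe(x, y, steps=200, lr=0.5):
    """Fit only the head (w, b) by gradient descent on frozen features."""
    feats = encode(x)
    w, b = np.zeros(feats.shape[1]), 0.0
    for _ in range(steps):
        grad = sigmoid(feats @ w + b) - y
        w -= lr * feats.T @ grad / len(y)
        b -= lr * float(grad.mean())
    return w, b, bce(sigmoid(feats @ w + b), y)

x = rng.normal(size=(64, 6))
y = (x[:, 0] > 0).astype(np.float64)   # synthetic binary labels
snapshot = W_enc.copy()
w, b, final_loss = train_linear_probe(x, y)
```

In a PyTorch setting the same effect is usually obtained by disabling gradients on the encoder parameters and passing only the head's parameters to the optimizer.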
Key Technical Components
1. FLOROGeoEncoder: Shared Vision Transformer encoder processing unified multimodal token sequences augmented with geo-positional embeddings and validity masks.
2. FLOROSSDecoder: Lightweight modality-specific reconstruction heads used exclusively during pretraining; one decoder per output band group (e.g., multispectral, elevation) to supervise latent representation learning via self-supervised loss minimization.
3. NormalizeWithValidity transform: Applies normalization using validity channels as weights, scaling pixel values proportionally based on which spectral groups or modalities are present in each patch; enables robust handling of incomplete sensor configurations without dropping data entirely.
Key Classes:
1. FLOROGeoEncoder - Main Vision Transformer encoder class for latent representation learning on grouped multimodal inputs with geo-position augmentation and validity-aware processing.
2. FLOROSSDecoder - Shallow decoder module used only during pretraining phase; reconstructs masked patches per output band group under self-supervised objective using cross-modal context from visible tokens.
3. NormalizeWithValidity - Data transformation class that applies channel-wise normalization weighted by validity indicators — ensures missing bands do not dominate gradients while preserving signal strength where data exists.
Architectural Features:
- Single unified ViT encoder: Processes heterogeneous inputs without modality-specific branches; uses standard self-attention blocks with DropPath and LayerScale regularization applied uniformly across all token types (optical, auxiliary, validity).
- Grouped channel structure + validity masking: Optical streams (Blue/Green/Red/NIR/SWIR/etc.) and auxiliary streams (Elevation/VV/VH) are concatenated into fixed-width tensors; each group has associated binary validity flags that influence normalization but do not remove tokens from sequence.
- Hybrid geo-positional embeddings: Patch tokens augmented with projected Earth coordinates (EPSG:3857 global extent), globally normalized, then encoded via sinusoidal functions — provides spatial grounding independent of spectral content; absent in DOFA’s wavelength-conditioned design.
- Modality-aware masking strategy: During pretraining, random patches are masked independently per token regardless of modality origin; FLOROSSDecoder restores them using visible context from same or other modalities through cross-modal attention mechanisms within transformer blocks.
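The per-token random masking described above can be sketched in a few lines. This is a generic masked-autoencoder style selection, written under the assumption that tokens are stored as a `(num_tokens, dim)` array; `random_masking` is a hypothetical helper, not a FLORO function.

```python
import numpy as np

def random_masking(tokens, mask_ratio, rng):
    """Keep a random subset of tokens, ignoring which modality they came from.

    Returns the visible tokens, a boolean mask (True = masked), and the
    sorted indices of the kept tokens.
    """
    n = tokens.shape[0]
    n_keep = max(1, int(round(n * (1.0 - mask_ratio))))
    keep = np.sort(rng.permutation(n)[:n_keep])
    mask = np.ones(n, dtype=bool)
    mask[keep] = False
    return tokens[keep], mask, keep

rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, 8))   # 16 tokens of width 8
visible, mask, keep = random_masking(tokens, 0.75, rng)
```

Only `visible` would be passed to the encoder; the reconstruction heads are then supervised on the positions where `mask` is true.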
FLORO+ Enhancement
- Curated multimodal diversity over scale: ~80K samples spanning Sentinel-2 (13 bands), Sentinel-1 SAR, SkySat HR imagery, UAV RGB/multispectral, and terrain products — prioritizes variation in sensing conditions rather than sheer volume to maximize transferability.
- Availability-aware processing pipeline: BandDropping augmentation randomly removes entire spectral groups during training; NormalizeWithValidity ensures remaining valid channels contribute proportionally to gradient updates — simulates real-world sensor incompleteness without architectural modification.
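A band-dropping augmentation of this kind might look like the sketch below. The group layout in `GROUPS`, the `band_dropping` function, and the rule of always keeping at least one group are assumptions made for illustration; the source does not specify these details.

```python
import numpy as np

# Hypothetical channel layout: group name -> channel indices.
GROUPS = {"rgb": [0, 1, 2], "nir": [3], "swir": [4, 5]}

def band_dropping(x, valid, groups, p_drop, rng):
    """Randomly drop whole groups: zero their channels and clear validity.

    x: (C, H, W) array, valid: (C,) boolean array. Inputs are not modified.
    """
    x, valid = x.copy(), valid.copy()
    present = [g for g, idx in groups.items() if valid[idx].all()]
    dropped = [g for g in present if rng.random() < p_drop]
    if present and len(dropped) == len(present):
        # Assumption: keep at least one group so the sample is never empty.
        dropped.pop(int(rng.integers(len(dropped))))
    for g in dropped:
        x[groups[g]] = 0.0
        valid[groups[g]] = False
    return x, valid, dropped

rng = np.random.default_rng(0)
x = np.ones((6, 4, 4))
valid = np.ones(6, dtype=bool)
x_aug, valid_aug, dropped = band_dropping(x, valid, GROUPS, p_drop=1.0, rng=rng)
```

With `p_drop=1.0` every group is selected for dropping, so the keep-one rule leaves exactly one group valid.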
Input Representation
FLORO structures input as two primary streams with internal grouping and explicit validity signaling:
Optical Stream
Contains reflectance values organized by spectral group plus validity indicators processed via NormalizeWithValidity transform:
- Blue/Green/Red — visible spectrum bands (present in UAV-RGB, HR Sat., S1/S2)
- Red Edge — vegetation stress indicator band (UAV-MS, S1/S2 only)
- NIR / NIR A — near-infrared for biomass and health assessment (UAV-MS, HR Sat., S1/S2)
- SWIR 1/SWIR 2 — shortwave infrared for moisture content analysis (S1/S2 only)
- Validity Channels — binary flags per group indicating whether data is present or missing; directly used in NormalizeWithValidity to scale normalization parameters
Auxiliary Stream
Includes non-optical geospatial features also processed via NormalizeWithValidity:
- Elevation (DSM/DTM/DEM) — topographic structure information (all modalities except pure RGB UAVs may lack this)
- SAR VV/VH — dual-polarization radar backscatter for surface texture and moisture detection (S1/S2 only)
- Validity Channels — same mechanism as optical stream to denote availability; ensures missing auxiliary data does not distort learned representations
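One plausible reading of validity-aware normalization is sketched below: standardize each channel, zero out channels flagged as missing, and append the validity flags as extra planes. The function `normalize_with_validity` and its per-channel flags are assumptions for illustration (the source describes flags per group), and the actual NormalizeWithValidity transform may differ.

```python
import numpy as np

def normalize_with_validity(x, mean, std, valid):
    """Standardize valid channels, zero missing ones, append validity planes.

    x: (C, H, W); mean, std, valid: length-C sequences.
    Returns an array of shape (2*C, H, W).
    """
    mean = np.asarray(mean, dtype=np.float64)[:, None, None]
    std = np.asarray(std, dtype=np.float64)[:, None, None]
    flags = np.asarray(valid, dtype=bool)[:, None, None]
    z = (x - mean) / std
    # Missing channels (possibly NaN-filled) are replaced by zeros.
    z = np.where(flags, z, 0.0)
    planes = np.broadcast_to(flags.astype(np.float64), x.shape)
    return np.concatenate([z, planes], axis=0)

x = np.full((3, 2, 2), 10.0)
x[2] = np.nan   # third channel is missing
out = normalize_with_validity(
    x, mean=[10, 8, 0], std=[2, 2, 1], valid=[True, True, False]
)
```

Zero-filling after standardization places missing channels at the per-channel mean, and the appended planes let the model tell "missing" apart from "observed value equal to the mean".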
Implementation approach
- Token grouping strategy: Each spectral or auxiliary group forms a contiguous block within the input tensor after PatchEmbed transformation; positional embeddings are applied uniformly across all groups including geo-position augmentation derived from projected coordinates.
- Masking scheme: Random patch masking occurs independently per token regardless of its originating modality, forcing cross-modal contextual inference during FLOROSSDecoder reconstruction; the encoder can optionally return intermediate tokens for downstream tasks.
- Decoder design: Separate lightweight MLP-based decoders map latent representations back to pixel space for each original input group (e.g., 8-band multispectral output, 3-band elevation/SAR output); these are removed after pretraining — replaced by task-specific heads like FLOROLinearClassDecoder during transfer phase.
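The geo-position augmentation mentioned in the token grouping strategy can be illustrated with a small sketch: projected EPSG:3857 coordinates are normalized by the global extent and passed through sinusoidal functions. The function `geo_embedding`, the embedding width, and the choice of frequencies are assumptions; only the normalize-then-sinusoid outline comes from the text.

```python
import math

# Half-width of the EPSG:3857 (Web Mercator) global extent, in metres.
WEB_MERCATOR_HALF_EXTENT = 20037508.342789244

def geo_embedding(x_m, y_m, dim=8):
    """Sinusoidal embedding of a projected coordinate pair.

    x_m, y_m: EPSG:3857 coordinates in metres.
    dim: output length, a multiple of 4 (sin and cos for each of x and y).
    """
    if dim % 4 != 0:
        raise ValueError("dim must be a multiple of 4")
    n_freq = dim // 4
    out = []
    for value in (x_m, y_m):
        # Map [-extent, +extent] to [0, 1].
        norm = (value + WEB_MERCATOR_HALF_EXTENT) / (2.0 * WEB_MERCATOR_HALF_EXTENT)
        for k in range(n_freq):
            angle = 2.0 * math.pi * (2 ** k) * norm
            out.extend([math.sin(angle), math.cos(angle)])
    return out

origin = geo_embedding(0.0, 0.0)              # centre of the projected extent
elsewhere = geo_embedding(1_000_000.0, 6_000_000.0)
```

Such a vector would typically be projected to the token width and added to each patch token alongside the usual positional embedding.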
Purpose
The grouped channel architecture allows FLORO to ingest variable sensor combinations without architectural modification or dynamic weight generation (unlike DOFA's wavelength-conditioned hypernetworks). When pretrained on full Sentinel-2 + SAR + elevation inputs, it can later transfer to datasets containing only RGB or UAV-only imagery by leveraging cross-modal priors encoded in the shared latent space. This is enabled by validity masking, applied through the NormalizeWithValidity transform during both pretraining and inference. Geo-position embeddings further promote spatial coherence across geographically dispersed training samples, and the model does not require geographic metadata at test time when none is available.