arXiv:2509.25678v3 Announce Type: replace
Abstract: Modern applications increasingly involve dozens of heterogeneous input streams, such as clinical sensors, wearables, imaging, and text, each with distinct measurement models, sampling rates, and noise characteristics. This \textit{massively multimodal} setting, where each sensor constitutes a separate modality, fundamentally differs from conventional multimodal learning focused on two or three modalities. As the modality count grows, capturing their complex, time-varying dependencies becomes essential yet challenging. Mixture-of-Experts (MoE) architectures are naturally suited to this setting, as their sparse routing mechanism enables efficient scaling across many modalities. However, existing MoE architectures route tokens based on similarity alone, overlooking the rich temporal dependencies across modalities. We propose a framework that explicitly quantifies temporal dependencies between modality pairs across multiple time lags and uses these dependencies to guide MoE routing. A dependency-aware router dispatches tokens to specialized experts based on interaction type. This principled routing enables experts to learn generalizable dependency-processing skills. Experiments across healthcare, activity recognition, and affective computing benchmarks demonstrate substantial performance gains and interpretable routing patterns aligned with domain knowledge.
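
To make the idea concrete, here is a minimal sketch (not the authors' released implementation) of how lagged cross-modal dependency scores might bias MoE routing logits. All names (`lagged_dependency`, `DependencyAwareRouter`, the choice of lags, and the use of a simple cross-correlation statistic) are hypothetical placeholders assumed for illustration; the paper's actual dependency measure and dispatch rule may differ.

```python
# Illustrative sketch only: lagged cross-modal dependency scores used to
# bias the routing logits of a sparse MoE router. Shapes and names are
# assumptions, not the paper's released code.
import torch
import torch.nn as nn
import torch.nn.functional as F


def lagged_dependency(x_a, x_b, lags=(1, 2, 4)):
    """Average absolute lagged correlation between two modality streams.

    x_a, x_b: (batch, time, dim) token sequences for two modalities.
    Returns a (batch,) dependency score, averaged over the given time lags.
    """
    scores = []
    for lag in lags:
        # Collapse feature dim, align stream A at time t with stream B at t + lag.
        a = F.normalize(x_a[:, :-lag].mean(-1), dim=-1)   # (batch, time - lag)
        b = F.normalize(x_b[:, lag:].mean(-1), dim=-1)
        scores.append((a * b).sum(-1).abs())              # correlation at this lag
    return torch.stack(scores, dim=-1).mean(-1)           # mean over lags


class DependencyAwareRouter(nn.Module):
    """Top-k MoE router whose expert logits are biased by a dependency score."""

    def __init__(self, dim, num_experts, top_k=2):
        super().__init__()
        self.gate = nn.Linear(dim, num_experts)
        self.dep_proj = nn.Linear(1, num_experts)  # dependency score -> logit bias
        self.top_k = top_k

    def forward(self, tokens, dep_score):
        # tokens: (batch, time, dim); dep_score: (batch,) from lagged_dependency.
        logits = self.gate(tokens)                              # (B, T, E)
        bias = self.dep_proj(dep_score[:, None, None])          # (B, 1, E)
        weights = F.softmax(logits + bias, dim=-1)
        top_w, top_idx = weights.topk(self.top_k, dim=-1)       # sparse dispatch
        return top_w, top_idx
```

Under these assumptions, one would compute `lagged_dependency` for each modality pair, then pass the resulting score alongside that pair's tokens to the router, so that strongly lag-coupled pairs are steered toward experts specializing in that interaction type.
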