Abstract: Embodied navigation is a fundamental capability for intelligent robots, requiring them to follow human commands and move autonomously within physical environments. Despite significant advancements, most existing navigation approaches are tailored to a specific task, such as instruction following, object search, question answering, or person tracking. However, the growing demands of advanced embodied navigation raise the challenge of designing a practical navigation agent that can handle multiple navigation tasks and benefit from the synergy between them. To this end, we present Uni-NaVid, a video-based vision-language-action (VLA) model that unifies different navigation paradigms, taking textual instructions and RGB video streams as inputs and directly outputting discrete low-level actions. To efficiently process extensive RGB video streams, we propose an online token-merge strategy that spatially and temporally consolidates similar visual information, enabling 5 Hz inference. To train Uni-NaVid, we collected 3.6 million navigation samples across four diverse navigation tasks. Extensive experiments on diverse navigation benchmarks demonstrate that Uni-NaVid achieves state-of-the-art performance within a unified framework. Additionally, real-world experiments confirm the model's effectiveness and efficiency, demonstrating its strong generalizability.
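To make the idea of online spatiotemporal token merging concrete, the minimal sketch below greedily averages consecutive patch tokens whose cosine similarity exceeds a threshold and pools older frames into single tokens, so the visual context passed to the language model stays short. The function names, the 0.9 threshold, the mean pooling of past frames, and the toy tensor shapes are illustrative assumptions, not the paper's actual merging algorithm.

```python
import torch
import torch.nn.functional as F

def merge_similar_tokens(tokens: torch.Tensor, threshold: float = 0.9) -> torch.Tensor:
    """Greedily merge consecutive tokens whose cosine similarity exceeds
    `threshold`, averaging their features. tokens: (N, D) -> (M, D), M <= N."""
    merged = [tokens[0]]
    for tok in tokens[1:]:
        if F.cosine_similarity(merged[-1], tok, dim=0) > threshold:
            merged[-1] = (merged[-1] + tok) / 2  # consolidate redundant visual information
        else:
            merged.append(tok)
    return torch.stack(merged)

# Toy online loop (hypothetical shapes): each incoming frame's patch tokens are
# merged spatially, and older frames are pooled into one token each so the
# visual sequence length grows slowly as the video stream continues.
torch.manual_seed(0)
history = []                                   # one compact token per past frame
for _ in range(16):                            # pretend 16 frames arrive over time
    frame_tokens = torch.randn(64, 256)        # 64 patch tokens of dimension 256
    compact = merge_similar_tokens(frame_tokens)
    history.append(compact.mean(dim=0))        # temporal consolidation of past frames
context = torch.stack(history)                 # short visual context for the VLA model
print(context.shape)
```

Under this sketch, the cost of encoding the video history grows far more slowly than the raw token count, which is the property that would support real-time (around 5 Hz) inference.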