Abstract: We address the long-horizon mapless navigation problem: enabling robots to traverse novel environments without relying on high-definition maps or precise waypoints that specify exactly where to navigate. Two major challenges arise: (1) learning robust, generalizable perceptual representations of the environment—where it is impossible to pre-enumerate all relevant factors and ensure robustness to perceptual aliasing—and (2) planning human-aligned navigation paths using these learned features. Existing solutions often struggle to generalize due to their reliance on: (a) hand-curated object lists, which overlook new, unforeseen factors; (b) end-to-end learning of navigation-relevant features, which is constrained by the limited availability of real robot data; (c) large sets of expert demonstrations, which provide insufficient guidance on the most critical perceptual cues; or (d) handcrafted reward functions for learning, which are difficult to design and adapt for new scenarios. To overcome these limitations, we propose CREStE, a framework for representation and policy learning that does not require large-scale robot datasets or manually specified feature sets. First, CREStE leverages visual foundation models trained on internet-scale data to learn continuous bird’s-eye-view representations capturing elevation, semantics, and instance-level features. Second, it incorporates a counterfactual-based loss and active learning procedure to focus on the perceptual cues that matter most by querying humans for counterfactual trajectory annotations in challenging scenes. We evaluate CREStE in kilometer-scale navigation tasks across six distinct urban environments. Our experiments demonstrate that CREStE achieves more efficient navigation and requires fewer human interventions than existing approaches, showcasing its robustness and effectiveness for long-horizon mapless navigation.
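To make the counterfactual supervision mentioned above concrete, here is a minimal, hypothetical sketch of a margin-based counterfactual trajectory loss over a learned bird's-eye-view costmap. This is not the paper's implementation: the function names, tensor shapes, and the specific hinge formulation are assumptions chosen only to illustrate the idea that human-annotated counterfactual trajectories should score worse than the expert path under the learned costmap.

```python
# Illustrative sketch (not CREStE's actual loss): score trajectories by summing
# the learned per-cell BEV costs they traverse, and penalize the model whenever
# an expert path is not cheaper than a human-annotated counterfactual path by a
# margin. All names and shapes here are assumptions for illustration.
import torch


def trajectory_cost(costmap: torch.Tensor, traj: torch.Tensor) -> torch.Tensor:
    """Sum costmap values along a trajectory.

    costmap: (H, W) per-cell traversal costs predicted from BEV features.
    traj:    (T, 2) integer grid indices (row, col) visited by the trajectory.
    """
    return costmap[traj[:, 0], traj[:, 1]].sum()


def counterfactual_margin_loss(
    costmap: torch.Tensor,
    expert_traj: torch.Tensor,
    counterfactual_trajs: list[torch.Tensor],
    margin: float = 1.0,
) -> torch.Tensor:
    """Hinge loss: the expert path should cost less than each counterfactual
    path by at least `margin` per trajectory step."""
    expert_cost = trajectory_cost(costmap, expert_traj)
    losses = []
    for cf in counterfactual_trajs:
        cf_cost = trajectory_cost(costmap, cf)
        losses.append(torch.relu(expert_cost - cf_cost + margin * len(cf)))
    return torch.stack(losses).mean()


if __name__ == "__main__":
    # Toy example: a random 64x64 costmap, one expert path, two counterfactuals.
    costmap = torch.rand(64, 64, requires_grad=True)
    expert = torch.randint(0, 64, (20, 2))
    counterfactuals = [torch.randint(0, 64, (20, 2)) for _ in range(2)]
    loss = counterfactual_margin_loss(costmap, expert, counterfactuals)
    loss.backward()  # gradients would flow back into the costmap predictor
    print(float(loss))
```

In practice the costmap would be predicted by the BEV perception backbone rather than being a free parameter, but the same ranking structure applies: counterfactual annotations tell the learner which perceptual cues must raise the cost of a path, without requiring a handcrafted reward function.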