Abstract: Specifying robotic manipulation tasks in a manner that is both expressive and precise remains a central challenge. While visual goals provide a compact and unambiguous task specification, existing goal-conditioned policies often struggle with long-horizon manipulation because they rely on single-step action prediction without explicitly modeling task progress. We propose Act2Goal, a general goal-conditioned manipulation policy that integrates a goal-conditioned visual world model with multi-scale temporal control. Given a current observation and a target visual goal, the world model generates a plausible sequence of intermediate visual states that captures long-horizon structure. To translate this visual plan into robust execution, we introduce Multi-Scale Temporal Hashing, which decomposes the imagined trajectory into dense proximal frames for fine-grained closed-loop control and sparse distal frames that anchor global task consistency. The policy couples these representations with motor control through end-to-end cross-attention, enabling coherent long-horizon behavior while remaining reactive to local disturbances. Extensive experiments in both simulation and real-world environments demonstrate that Act2Goal, initialized from large-scale imitation learning, achieves strong performance across a wide range of long-horizon manipulation tasks. Beyond offline generalization, Act2Goal also supports reward-free online autonomous improvement via hindsight goal relabeling and LoRA-based finetuning, enabling rapid adaptation during deployment without external supervision. In real-world experiments, it improves success rates from 30% to 90% on challenging out-of-distribution tasks within minutes of autonomous interaction, supporting online training as a complement to offline learning.
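The split of an imagined trajectory into dense proximal frames and sparse distal frames can be illustrated with a small index-selection sketch. This is not the paper's implementation: the function name `multi_scale_indices`, its parameters, and the evenly spaced distal schedule are all assumptions made for illustration, since the abstract does not state the actual sampling rule.

```python
def multi_scale_indices(horizon, num_proximal, num_distal):
    """Choose frame indices from an imagined trajectory of `horizon` frames.

    Hypothetical schedule (an assumption, not taken from the paper):
    - proximal: the first `num_proximal` consecutive frames (dense, near-term)
    - distal: up to `num_distal` frames spread evenly over the remainder,
      always ending on the last frame, which stands in for the visual goal
    """
    if horizon <= 0:
        raise ValueError("horizon must be positive")

    num_proximal = max(0, min(num_proximal, horizon))
    proximal = list(range(num_proximal))

    base = num_proximal - 1   # last proximal index (-1 if there are none)
    last = horizon - 1        # final frame of the imagined trajectory
    span = last - base
    if span <= 0 or num_distal <= 0:
        return proximal, []

    distal = []
    for i in range(1, num_distal + 1):
        idx = base + (i * span) // num_distal
        # Skip indices that collide with proximal frames or repeat.
        if idx > base and (not distal or idx > distal[-1]):
            distal.append(idx)
    return proximal, distal


if __name__ == "__main__":
    proximal, distal = multi_scale_indices(horizon=32, num_proximal=4, num_distal=4)
    print("proximal:", proximal)
    print("distal:", distal)
```

Under this assumed schedule, the proximal indices would feed fine-grained closed-loop control while the distal indices anchor the plan to the goal; a log-spaced or learned schedule would be an equally plausible choice.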