We turn prior robot experience into an append-only memory of image–action snippets. At deployment, the current camera frame is embedded, matched to similar snippets, and the best short segment is replayed for N steps; then we re-query and repeat. This training-free "retrieval-as-control" loop gives real-time, few-shot adaptation by simply appending new episodes.
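The loop above can be sketched in a few lines. Everything here (the flat `memory` list of `(embedding, snippet)` pairs, the brute-force `retrieve` helper, and the `get_frame`/`embed`/`execute` callbacks) is illustrative, not the actual RT-Cache implementation:

```python
import numpy as np

def retrieve(embedding, memory):
    """Return the stored snippet (list of action vectors) whose key
    embedding has the highest cosine similarity to the query."""
    keys = np.stack([k for k, _ in memory])
    keys = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    q = embedding / np.linalg.norm(embedding)
    best = int(np.argmax(keys @ q))
    return memory[best][1]

def retrieval_as_control(get_frame, embed, execute, memory, total_steps, N=3):
    """Re-query every N steps; replay the retrieved snippet open-loop."""
    step = 0
    while step < total_steps:
        snippet = retrieve(embed(get_frame()), memory)
        for action in snippet[:N]:  # replay at most N steps, then re-query
            execute(action)
            step += 1
            if step >= total_steps:
                break
```

Swapping the brute-force `retrieve` for an approximate-nearest-neighbor index is what keeps lookups fast at scale; the control structure stays the same.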
Heterogeneous logs are standardized (sample rate + Cartesian EEF actions) and stored in two ways: raw data in a Trajectory DB and embeddings in a Vector DB. At test time we run a hierarchical search (dataset-centroid filter → small local index → cosine k-NN) to fetch a snippet and replay its action vectors for the next N steps. This replaces per-step model calls and keeps lookups sub-second at the million-snippet scale.
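A minimal, brute-force sketch of that two-stage lookup. The `datasets` layout, field names, and `top_datasets` parameter are assumptions for illustration; a real deployment would replace stage 2 with an ANN index rather than an exact scan:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def hierarchical_search(query, datasets, k=1, top_datasets=1):
    """Two-stage lookup: (1) rank datasets by centroid similarity,
    (2) exact cosine k-NN only inside the shortlisted datasets."""
    # Stage 1: dataset-centroid filter
    ranked = sorted(datasets, key=lambda d: cosine(query, d["centroid"]),
                    reverse=True)[:top_datasets]
    # Stage 2: cosine k-NN over the surviving small local indices
    candidates = []
    for d in ranked:
        for emb, snippet_id in d["entries"]:
            candidates.append((cosine(query, emb), snippet_id))
    candidates.sort(key=lambda t: t[0], reverse=True)
    return [sid for _, sid in candidates[:k]]
```

The centroid filter prunes most of memory before any per-snippet comparison runs, which is what keeps the final k-NN cheap.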
We evaluate RT-Cache on a bowl-reaching task across multiple camera views and placements. RT-Cache succeeds across more views and placements than VINN, while Behavior-Retrieval and OpenVLA-OFT often exit the workspace. Successful trials finish faster with RT-Cache (lower median time, tighter spread) because multi-step replay reduces corrective micro-moves. The 3D path analysis shows RT-Cache closely tracking ground truth while VINN drifts.
We conduct ablation studies to understand RT-Cache's behavior. With only a few in-domain episodes, RT-Cache is reliable across bowl, cup, and bottle objects and across views. Our horizon-sensitivity analysis shows that the snippet length N trades off control-loop speed against reactivity; N=3 is a good balance in our setup. Viewpoint-aligned replay keeps retrieval within the active camera and advances coherently over steps. Typical failures are near-misses or minor pose offsets under occlusion or viewpoint change.
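The speed side of that trade-off is simple arithmetic: one retrieval covers N executed steps, so lookups per episode shrink as N grows. A toy helper (episode lengths here are illustrative, not from our experiments):

```python
def retrievals_needed(episode_steps: int, snippet_len: int) -> int:
    """One retrieval covers `snippet_len` executed steps, so an episode
    needs ceil(episode_steps / snippet_len) lookups."""
    return -(-episode_steps // snippet_len)  # ceiling division
```

For a 30-step episode, N=1 needs 30 lookups, N=3 needs 10, and N=10 needs only 3, at the cost of reacting to new observations less often.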
We demonstrate immediate adaptation through retrieval-as-control on a pushing task. With no in-domain data, we add one short example to memory—no training, no fine-tuning. RT-Cache retrieves that exemplar and replays it to produce a trajectory that qualitatively matches ground truth, showcasing the power of our training-free approach.
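Because memory is append-only, adding that one example is just an append. A sketch under the assumption that memory stores `(embedding, action-snippet)` pairs keyed by frame; `embed` and the overlapping-window layout are illustrative:

```python
def add_episode(memory, embed, frames, actions, snippet_len=3):
    """Append one demonstration as overlapping (embedding, snippet) pairs.
    Each frame is keyed to its next `snippet_len` actions; no training step."""
    for t in range(len(frames)):
        snippet = actions[t:t + snippet_len]
        if snippet:
            memory.append((embed(frames[t]), snippet))
    return memory
```

After this call the new exemplar is immediately retrievable by the control loop; no weights change anywhere in the system.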
RT-Cache runs on an 8–10 GB GPU and shifts cost to storage (≈ 100 GB for embeddings and metadata). It performs no gradient updates at deployment (encoders stay frozen; only an ANN index is built), unlike VINN, Behavior-Retrieval, and OpenVLA-OFT, which require training or fine-tuning. This compute-for-storage tradeoff enables real-time control on modest hardware.
Overview of the RT-Cache retrieval-as-control pipeline in action, demonstrating multi-step snippet replay for robot manipulation tasks.
Comparison with baseline methods shows RT-Cache's robustness. VINN and Behavior-Retrieval often fail due to single-step retrieval and the lack of coherent multi-step action sequences.
RT-Cache enables faster task completion through multi-step replay. The coherent action sequences reduce corrective micro-movements and improve execution efficiency.
RT-Cache demonstrates strong generalization capabilities to novel objects and environments, leveraging the diverse experience stored in memory to handle unseen scenarios.
@article{kwon2025rtcache,
  title={RT-Cache: Training-Free Retrieval for Real-Time Manipulation},
  author={Kwon, Owen and George, Abraham and Bartsch, Alison and Farimani, Amir Barati},
  journal={arXiv preprint arXiv:2505.09040},
  year={2025}
}