Large language models can encode a wealth of semantic knowledge about the world. Such knowledge could in principle be extremely useful to robots aiming to act upon high-level, temporally extended instructions expressed in natural language. However, a …
We take a Bayesian approach to imitation learning from multiple sensor inputs and apply it to the task of opening office doors with a mobile manipulator. We show that using the Variational Information Bottleneck to regularize convolutional neural networks improves generalization to held-out domains, reduces the sim-to-real gap in a sensor-agnostic manner, and provides useful estimates of model uncertainty. In a real-world office environment, we achieve 96% task success.
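To make the regularizer concrete, here is a minimal PyTorch sketch of a VIB-regularized imitation head. The class name `VIBPolicy`, the layer sizes, and the `beta` weight are illustrative assumptions, not the paper's exact architecture; only the reparameterized bottleneck and the KL-to-prior term are the standard VIB ingredients.

```python
# A minimal sketch of VIB-regularized behavior cloning (assumed shapes).
import torch
import torch.nn as nn
import torch.nn.functional as F

class VIBPolicy(nn.Module):
    def __init__(self, feat_dim=256, z_dim=32, action_dim=7):
        super().__init__()
        # A conv backbone would produce feat_dim fused features per step;
        # here we assume the features are already extracted.
        self.enc_mu = nn.Linear(feat_dim, z_dim)
        self.enc_logvar = nn.Linear(feat_dim, z_dim)
        self.action_head = nn.Linear(z_dim, action_dim)

    def forward(self, feats):
        mu, logvar = self.enc_mu(feats), self.enc_logvar(feats)
        # Reparameterization trick: sample the stochastic bottleneck z.
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return self.action_head(z), mu, logvar

def vib_loss(policy, feats, expert_actions, beta=1e-3):
    pred, mu, logvar = policy(feats)
    bc = F.mse_loss(pred, expert_actions)  # imitation term
    # KL(q(z|x) || N(0, I)) -- the information-bottleneck regularizer.
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1).mean()
    return bc + beta * kl
```

The learned per-example variance (or KL) offers one plausible handle on the model-uncertainty estimates mentioned above, though the paper's exact uncertainty measure may differ.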
Task Consistency Loss (TCL) is a self-supervised loss that encourages alignment between sim and real at both the feature and action-prediction levels, building on top of RetinaGAN. We teach a mobile manipulator to autonomously approach a door, turn the handle to open the door, and enter the room. The imitation learning policy performs control from RGB and depth images and generalizes to doors not encountered in the training data. We achieve 72% success across sixteen seen and unseen scenes using only ~16.2 hours of teleoperated demonstrations in sim and real.
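A hedged sketch of what such a two-level consistency objective could look like follows. The `policy(image) -> (features, action)` interface, the MSE penalties, and the weights `w_feat`/`w_act` are assumptions for illustration, not the exact TCL formulation.

```python
# Sketch: penalize disagreement between a sim frame and its GAN-adapted
# counterpart at both the feature and the action-prediction level.
import torch.nn.functional as F

def task_consistency_loss(policy, sim_img, adapted_img, w_feat=1.0, w_act=1.0):
    feat_sim, act_sim = policy(sim_img)        # forward pass on sim frame
    feat_ad, act_ad = policy(adapted_img)      # forward pass on adapted frame
    l_feat = F.mse_loss(feat_sim, feat_ad)     # feature-level alignment
    l_act = F.mse_loss(act_sim, act_ad)        # action-prediction alignment
    return w_feat * l_feat + w_act * l_act
```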
SimGAN tackles domain adaptation by identifying a hybrid physics simulator that matches simulated trajectories to those in the target domain. It uses a learned discriminative loss to address the limitations of manual loss design. Our hybrid simulator combines neural networks with traditional physics simulation to balance expressiveness and generalizability, and alleviates the need for a carefully selected parameter set in system identification.
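The hybrid-simulator idea lends itself to a short sketch: an analytic physics step corrected by a learned residual, with a trajectory discriminator standing in for a hand-designed matching loss. The module shapes, the `analytic_step` callable, and the flattened window encoding below are assumptions, not the paper's implementation.

```python
# Sketch of a hybrid simulator plus trajectory discriminator (assumed shapes).
import torch
import torch.nn as nn

class HybridSim(nn.Module):
    def __init__(self, state_dim=12, action_dim=4):
        super().__init__()
        # Learned residual correction on top of the analytic physics step.
        self.residual = nn.Sequential(
            nn.Linear(state_dim + action_dim, 128), nn.ReLU(),
            nn.Linear(128, state_dim),
        )

    def forward(self, state, action, analytic_step):
        # analytic_step: callable wrapping the traditional physics engine.
        return analytic_step(state, action) + self.residual(
            torch.cat([state, action], dim=-1))

class TrajDiscriminator(nn.Module):
    # Scores whether a short (state, action) window looks like the target
    # domain; its output replaces a hand-designed trajectory-matching loss.
    def __init__(self, state_dim=12, action_dim=4, horizon=8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear((state_dim + action_dim) * horizon, 128), nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, window):  # window: (batch, horizon*(state+action))
        return self.net(window)
```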
Deep reinforcement learning (RL) has shown great potential in solving robot manipulation tasks. However, existing RL policies adapt poorly to environments with diverse dynamics properties, an ability that is pivotal in many contact-rich manipulation tasks. We propose Contact-aware Online COntext Inference (COCOI), a deep RL method that encodes a context embedding of dynamics properties online from contact-rich interactions. We evaluate the method on a novel and challenging non-planar pushing task.
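One plausible reading of the online context-inference component is sketched below: a recurrent encoder compresses a short history of contact-rich interactions into a dynamics embedding that conditions the policy. The GRU choice, the dimensions, and the class names are illustrative assumptions.

```python
# Sketch: recurrent context encoder feeding a context-conditioned policy.
import torch
import torch.nn as nn

class ContextEncoder(nn.Module):
    def __init__(self, obs_dim=32, action_dim=4, ctx_dim=16):
        super().__init__()
        self.gru = nn.GRU(obs_dim + action_dim, 64, batch_first=True)
        self.head = nn.Linear(64, ctx_dim)

    def forward(self, history):  # history: (batch, T, obs_dim + action_dim)
        _, h = self.gru(history)
        return self.head(h[-1])  # dynamics context embedding

class ContextConditionedPolicy(nn.Module):
    def __init__(self, obs_dim=32, ctx_dim=16, action_dim=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + ctx_dim, 128), nn.ReLU(),
            nn.Linear(128, action_dim),
        )

    def forward(self, obs, ctx):
        # The context embedding is refreshed online as new contact
        # interactions are observed, then concatenated with the observation.
        return self.net(torch.cat([obs, ctx], dim=-1))
```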
RetinaGAN is a generative adversarial network approach that adapts simulated images to realistic ones while enforcing object-detection consistency. Trained without supervision or task-loss dependencies, it preserves general object structure and texture in the adapted images. Across three real-world tasks (grasping, pushing, and door opening), RetinaGAN improves performance for RL-based object instance grasping and remains effective even in the limited-data regime.
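The object-detection consistency idea can be sketched as follows: a frozen detector is run on an image before and after GAN translation, and disagreement between its predictions is penalized. The `detector`/`generator` interfaces and the L1 penalty are simplifying assumptions; the full method also trains the usual adversarial and cycle objectives alongside this term.

```python
# Sketch: perception-consistency term for an image-translation GAN.
import torch
import torch.nn.functional as F

def detection_consistency_loss(detector, generator, image):
    translated = generator(image)
    with torch.no_grad():
        preds_src = detector(image)   # e.g. per-anchor box/class tensors
    preds_trans = detector(translated)
    # Penalize perception drift introduced by the translation, so adapted
    # images keep the object structure the (frozen) detector relies on.
    return sum(F.l1_loss(p_t, p_s)
               for p_t, p_s in zip(preds_trans, preds_src))
```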