
Self-Supervised Learning for Robots to Overcome Distractions and Improve Generalization

A robot can be trained in a simulated environment using reinforcement learning (RL) and then tested in a real environment. Unlike simulation, which is a controlled environment, the real world contains undesirable distractions and may present states that were never seen during training. A model therefore needs to generalize in order to overcome such distractions. Self-supervised models have shown significant results in the target environment even when the source environment differs to some degree. Hansen et al. show in their paper how a self-supervised model can adapt to a changing environment without access to rewards at deployment time.

Policy Adaptation during Deployment (PAD)


Hansen et al. build a network with a feature extractor e whose output feeds both an RL head and a self-supervised head. During the training phase, they sample observations from the replay buffer and jointly optimise the RL and self-supervised objectives. The self-supervised branch is trained on auxiliary tasks such as inverse dynamics and rotation prediction: inverse dynamics prediction for the control tasks and rotation prediction for the robot navigation tasks. At test time the reward is unavailable, so only the self-supervised objective is optimised. The goal is to keep updating the feature extractor during deployment using the self-supervised gradient.

Self-Supervised Policy Adaptation during Deployment. Figure from Hansen et al. (2020).
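To make the setup above concrete, here is a minimal PyTorch sketch of a shared feature extractor feeding an RL head and a self-supervised rotation-prediction head, with both objectives optimised jointly on batches from the replay buffer. The layer sizes, action dimensionality, learning rate, and the placeholder rl_loss_fn are illustrative assumptions, not the exact architecture or losses from Hansen et al.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Shared feature extractor e: both the RL head and the self-supervised
# head read its output, so gradients from either objective update it.
class Encoder(nn.Module):
    def __init__(self, in_channels=3, feat_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, stride=2), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feat_dim), nn.ReLU(),
        )

    def forward(self, obs):
        return self.net(obs)

encoder = Encoder()
policy_head = nn.Linear(64, 6)   # action outputs (dimensionality assumed)
ssl_head = nn.Linear(64, 4)      # predicts one of four image rotations

optimizer = torch.optim.Adam(
    list(encoder.parameters()) + list(policy_head.parameters())
    + list(ssl_head.parameters()), lr=3e-4)

def rotate_batch(obs):
    """Rotate each (square) image by a random multiple of 90 degrees and
    return the rotation labels for the self-supervised task."""
    labels = torch.randint(0, 4, (obs.size(0),))
    rotated = torch.stack([torch.rot90(img, int(k), dims=(1, 2))
                           for img, k in zip(obs, labels)])
    return rotated, labels

def training_step(obs, rl_loss_fn):
    # RL objective on a batch sampled from the replay buffer
    # (rl_loss_fn stands in for the actual actor-critic loss).
    rl_loss = rl_loss_fn(policy_head(encoder(obs)))

    # Self-supervised rotation-prediction objective on the same batch.
    rotated, labels = rotate_batch(obs)
    ssl_loss = F.cross_entropy(ssl_head(encoder(rotated)), labels)

    # Jointly optimise both objectives through the shared encoder.
    optimizer.zero_grad()
    (rl_loss + ssl_loss).backward()
    optimizer.step()
```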

Using the DeepMind Control (DMC) suite and the CRLMaze navigation task, the authors experiment with both stationary (colors, objects, textures, lighting) and non-stationary (videos) environment changes. Policy adaptation also transfers from simulation to a real robot and successfully adapts to environmental differences during deployment in two robotic manipulation tasks; PAD can even adapt to unexpected objects in the scene. Overall, PAD improves generalization, and the results show that the choice of self-supervised auxiliary task can make a meaningful difference to performance. This research demonstrates fast test-time adaptation.
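At deployment the same pieces can be reused for test-time adaptation: with no reward available, each incoming observation provides a self-supervised gradient that updates the shared encoder before the robot acts. The single gradient step per observation, the learning rate, and updating only the encoder (rather than the encoder plus the self-supervised head) are simplifying assumptions in this sketch.

```python
# Test-time adaptation: no reward is available during deployment, so only
# the self-supervised objective is used, and here only the shared encoder
# is updated while the policy head stays frozen.
adapt_optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-4)

def deployment_step(obs):
    # One self-supervised gradient step on the current observation batch.
    rotated, labels = rotate_batch(obs)
    ssl_loss = F.cross_entropy(ssl_head(encoder(rotated)), labels)
    adapt_optimizer.zero_grad()
    ssl_loss.backward()
    adapt_optimizer.step()
    ssl_head.zero_grad()  # discard unused gradients on the frozen head

    # Act using the adapted features; the policy head itself is unchanged.
    with torch.no_grad():
        return policy_head(encoder(obs))
```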


Another paper, Bodnar et al. (2020), carried out a similar experiment but studied long-term, reward-free adaptation scenarios. They reveal some undesirable effects on performance: the original representations of the source environment are progressively forgotten, and a large perturbation of those representations can cause catastrophic forgetting of the actions learned in the source environment, which is likely to propagate to the actions taken in the target environment. This can, however, be overcome with behaviour cloning.
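One way to picture the behaviour-cloning remedy is to keep a frozen snapshot of the source policy and penalise the adapted policy for drifting away from it while self-supervised adaptation continues. The snapshot setup, the MSE form of the penalty, and the bc_weight coefficient below are assumptions for illustration, not the exact scheme used by Bodnar et al.

```python
import copy

# Frozen snapshot of the source-environment policy (encoder + head),
# used as a behaviour-cloning anchor during long-horizon adaptation.
src_encoder = copy.deepcopy(encoder).eval()
src_policy_head = copy.deepcopy(policy_head).eval()
for p in list(src_encoder.parameters()) + list(src_policy_head.parameters()):
    p.requires_grad_(False)

def adaptation_loss(obs, bc_weight=1.0):
    # Self-supervised term, as in the earlier sketches.
    rotated, labels = rotate_batch(obs)
    ssl_loss = F.cross_entropy(ssl_head(encoder(rotated)), labels)

    # Behaviour-cloning term: keep the adapted policy's outputs close to
    # the frozen source policy's outputs so earlier behaviour is retained.
    with torch.no_grad():
        src_actions = src_policy_head(src_encoder(obs))
    bc_loss = F.mse_loss(policy_head(encoder(obs)), src_actions)

    return ssl_loss + bc_weight * bc_loss
```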


Rewards can be sparse and noisy (Shelhamer et al., 2017), which makes end-to-end optimisation difficult. A range of self-supervised tasks that incorporate states, actions, and successors can provide auxiliary losses, and these losses offer ubiquitous and instantaneous supervision for representation learning even in the absence of reward. Pure reinforcement learning methods are constrained by computational and data-efficiency issues that such auxiliary losses can remedy: self-supervised pre-training and joint optimisation improve the data efficiency and policy returns of end-to-end reinforcement learning.
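As an example of such a reward-free auxiliary loss, the sketch below adds an inverse-dynamics head that predicts the (discrete) action linking a state to its successor, reusing the encoder from the earlier sketches. The head architecture and the num_actions value are assumptions for illustration.

```python
# Inverse-dynamics auxiliary task: predict which action led from state s_t
# to its successor s_{t+1}. It needs only (s, a, s') tuples and no reward,
# so it supervises representation learning on every transition.
num_actions = 6  # assumed number of discrete actions
inv_dyn_head = nn.Sequential(
    nn.Linear(64 * 2, 128), nn.ReLU(),
    nn.Linear(128, num_actions),
)

def inverse_dynamics_loss(obs, next_obs, actions):
    """actions: LongTensor of discrete action indices taken at obs."""
    feats = torch.cat([encoder(obs), encoder(next_obs)], dim=-1)
    return F.cross_entropy(inv_dyn_head(feats), actions)
```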

