Train a robotic arm to pick up a cube and place it in a box autonomously, and explore how the model makes decisions using mechanistic interpretability techniques.
AutonomousArm is a research project focused on developing and analyzing a robotic arm controlled by a Transformer-based model. The goal is to understand not only how to achieve autonomous movement, but also how the underlying model represents and memorizes information.
We built a robotic arm using the lerobot guide, which includes instructions for ordering motors and 3D printing components.
3D printing video:
Watch here
We used a teleoperation setup to collect data for training the model.
Teleoperation video
Autonomous arm demo:
Watch here
We applied mechanistic interpretability techniques to analyze how the model makes decisions. The focus was on understanding how the model encodes information about the environment and how it generalizes actions.
-
Inputs:
- Top and lateral camera images (
$I_{\text{top}}, I_{\text{lat}}$ ) - Arm joint angles (
$q$ )
- Top and lateral camera images (
-
Feature Extraction:
- Images processed by ResNet to produce feature vectors.
-
Tokenization:
- 600 pixel tokens (300 per camera), arm state token, and condition token.
-
Transformer Encoding:
- Three encoder layers process all tokens for global information flow.
-
Action Vector Update:
- Zero-initialized action vector updated via cross-attention and MLP to produce robot actions.
Reference: ACT paper
The MLP layer in the decoder memorizes exact actions rather than generalizing, acting as a "lookup table" for robot movements.
Attention maps show only a subset of tokens receive high scores, and arm state information is absorbed by pixel tokens.
- ResNet Feature Extraction:
Verify that the ResNet features encode all necessary information for the model to perform actions without relying on attention layers.
This can be done by training a linear classifier on the ResNet features, with labels from different arm and cube positions, to see if it can predict the action vector. - Probing Action Vector:
Train linear classifiers to verify if the action vector encodes all relevant environmental details. - Fine-tuning Only MLP Layer:
Fine-tune the model on new locations by freezing all layers except the MLP to see if it can adapt to lift cubes from those locations, confirming the hypothesis of memorization.
- ResNet Feature Extraction:
The ResNet features successfully encoded the necessary information for the model to perform actions without relying on attention layers.
- Fine-tuning the MLP layer on new locations confirmed that the model can adapt to lift cubes from those locations, supporting the hypothesis of memorization.
Video before fine-tuning:
Watch here
Video after fine-tuning:
Watch here


