This repository aims to reproduce the results of recent publications that use vision-language models (VLMs) for robot manipulation tasks on low-cost DIY manipulators. The goal is to create a centralized hub for VLM-based manipulator projects, enabling rapid testing and benchmarking. I chose the Koch v1.1 manipulator to start, due to its compatibility with lerobot.
Note: The Koch v1.1 has only 5 DoF, which may be limiting for more complex experiments. For future projects, I would recommend a low-cost 6-DoF robot (e.g., Simple Automation).
Please follow the build instructions in the original repository. Additionally, follow the lerobot example for running the code.
To simplify the forward and inverse kinematics, I set . This is good enough to achieve most pick-and-place tasks.
| Joint | Joint Limits (rad) |
|---|---|
| 1 | |
| 2 | |
| 3 | |
| 4 | |
| 5 | |
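For reference, below is a minimal sketch of position-only inverse kinematics under one common simplification for 5-DoF arms (the gripper always pointing straight down), with the solution checked against the joint limits in the table above. The link lengths, joint conventions, and limit values are placeholder assumptions for illustration, not this repository's actual parameters.

```python
import numpy as np

# Placeholder link lengths in meters (NOT the Koch v1.1's actual dimensions).
L1, L2, L3 = 0.10, 0.10, 0.08  # shoulder->elbow, elbow->wrist, wrist->gripper tip

# Placeholder joint limits in radians; replace with the values from the table above.
JOINT_LIMITS = np.array([[-np.pi, np.pi]] * 5)

def ik_position_only(x, y, z):
    """Position-only IK assuming the gripper always points straight down.

    Joint 1 is the base yaw, joints 2-3 form a planar 2-link arm, joint 4 keeps
    the gripper vertical, and joint 5 (gripper roll) is left at zero.
    """
    theta1 = np.arctan2(y, x)          # base yaw toward the target
    r = np.hypot(x, y)                 # radial distance in the base plane
    # With the gripper pointing down, the wrist sits L3 directly above the target.
    wr, wz = r, z + L3
    d = np.hypot(wr, wz)
    if d > L1 + L2:
        raise ValueError("Target out of reach")
    # Standard 2-link planar IK (elbow-up solution).
    cos_elbow = (d**2 - L1**2 - L2**2) / (2 * L1 * L2)
    theta3 = -np.arccos(np.clip(cos_elbow, -1.0, 1.0))
    theta2 = np.arctan2(wz, wr) - np.arctan2(L2 * np.sin(theta3), L1 + L2 * np.cos(theta3))
    # Wrist pitch that keeps the gripper vertical given the shoulder/elbow angles.
    theta4 = -(theta2 + theta3) - np.pi / 2
    q = np.array([theta1, theta2, theta3, theta4, 0.0])
    if np.any(q < JOINT_LIMITS[:, 0]) or np.any(q > JOINT_LIMITS[:, 1]):
        raise ValueError("Solution violates joint limits")
    return q
```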
For all experiments, a single ZED Mini stereo camera was positioned across from the Koch v1.1 manipulator, ensuring a clear view of the manipulator's workspace.
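As a hedged sketch of how frames could be grabbed for the steps below, the snippet treats the ZED Mini as a standard UVC device that streams the left and right views side by side; the device index and resolution are assumptions that depend on your setup (the ZED SDK could be used instead).

```python
import cv2

# The ZED Mini enumerates as a UVC camera streaming both views side by side.
# Device index 0 and 2560x720 (HD720 per eye) are assumptions; adjust as needed.
cap = cv2.VideoCapture(0)
cap.set(cv2.CAP_PROP_FRAME_WIDTH, 2560)
cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 720)

ok, frame = cap.read()
if not ok:
    raise RuntimeError("Could not read a frame from the ZED Mini")

# Split the side-by-side image; only the left view is used for calibration/tracking.
h, w, _ = frame.shape
left, right = frame[:, : w // 2], frame[:, w // 2 :]
cap.release()
```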
The Perspective-n-Point (PnP) pose computation (cv2.solvePnP) was used to estimate the rotation and translation between the camera frame and the robot/world frame. A blue object held by the robot's end-effector was tracked across the image to obtain pixel coordinates, and the corresponding world coordinates were derived from the robot's inverse kinematics. See the video below:
calibration.mp4
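A minimal sketch of this calibration step is given below. cv2.solvePnP and cv2.Rodrigues are the calls the paragraph above refers to; the HSV threshold used to track the blue object, the camera intrinsics, and the helper names are illustrative assumptions.

```python
import cv2
import numpy as np

def detect_blue_centroid(bgr_image):
    """Return the pixel centroid of the largest blue blob, or None if absent."""
    hsv = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2HSV)
    # Rough HSV range for blue; the exact thresholds are an assumption.
    mask = cv2.inRange(hsv, (100, 120, 70), (130, 255, 255))
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None
    m = cv2.moments(max(contours, key=cv2.contourArea))
    return np.array([m["m10"] / m["m00"], m["m01"] / m["m00"]], dtype=np.float32)

def calibrate_camera_to_robot(world_points, pixel_points, camera_matrix, dist_coeffs):
    """Estimate the camera pose relative to the robot/world frame with PnP.

    world_points: (N, 3) end-effector positions in the robot frame (meters).
    pixel_points: (N, 2) matching pixel coordinates of the tracked blue object.
    """
    ok, rvec, tvec = cv2.solvePnP(
        np.asarray(world_points, dtype=np.float32),
        np.asarray(pixel_points, dtype=np.float32),
        camera_matrix,
        dist_coeffs,
    )
    if not ok:
        raise RuntimeError("solvePnP failed")
    R, _ = cv2.Rodrigues(rvec)  # rotation matrix mapping world frame -> camera frame
    return R, tvec
```

Once R and tvec are known, a point p_world in the robot frame maps into the camera frame as `R @ p_world + tvec`, and the inverse transform maps camera-frame points back into the robot frame.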
Due to limited computational resources, I did not implement collision checking.
demo_eraser.mp4
demo_chess.1.mp4
demo_stack.mp4
TBD
