The videos themselves (the ones in the debug_images folder) have metadata saying they are 15 FPS. The paper says that for DROID, videos are sampled at 5 FPS and actions at 15 FPS, with an action horizon of 24 (or equivalently a video horizon of 8 frames), which comes to 1.6 seconds.
But RELATIVE_OFFSETS = [-23, -16, -8, 0]. First, the diffs are [7, 8, 8], which already seems odd. Second, what's with the 8 anyway? If the video is 15 FPS but I want to sample at 5 FPS, shouldn't the stride be 3, giving something more like [-23, -20, -17, -14, -11, -8, -5, -2]? That way we are essentially generating [0, 3, 6, 9, 12, 15, 18, 21] for the video frames and [0, 1, 2, ..., 23] for the actions.
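To make the arithmetic concrete, here is a small sketch of what I would expect versus what the current offsets imply. Only `RELATIVE_OFFSETS` and the FPS/horizon numbers come from the code and the paper; the other names (`NATIVE_FPS`, `TARGET_FPS`, `expected_offsets`, and so on) are made up for illustration, and anchoring the first frame at -23 is my assumption, not something the repo confirms.

```python
# Numbers taken from the video metadata and the paper.
NATIVE_FPS = 15        # FPS reported in the video metadata
TARGET_FPS = 5         # video sampling rate stated in the paper
ACTION_HORIZON = 24    # action steps, sampled at 15 FPS

# Offsets as they appear in the code.
RELATIVE_OFFSETS = [-23, -16, -8, 0]

# What the paper's numbers imply (hypothetical names, for illustration only).
stride = NATIVE_FPS // TARGET_FPS             # one video frame every 3 native frames
num_frames = ACTION_HORIZON // stride         # 8 video frames per window
duration_s = ACTION_HORIZON / NATIVE_FPS      # window length in seconds

# Assumption: the window starts at the same offset as the current one (-23).
start = -(ACTION_HORIZON - 1)
expected_offsets = [start + i * stride for i in range(num_frames)]

# Re-index so the first frame is 0, to compare against action indices.
frame_indices = [o - start for o in expected_offsets]
action_indices = list(range(ACTION_HORIZON))

# What the current offsets actually imply.
current_diffs = [b - a for a, b in zip(RELATIVE_OFFSETS, RELATIVE_OFFSETS[1:])]
current_effective_fps = NATIVE_FPS / max(current_diffs)

print("stride:", stride, "frames:", num_frames, "duration:", duration_s)
print("expected offsets:", expected_offsets)
print("frame indices:", frame_indices)
print("current diffs:", current_diffs)
print("current effective FPS:", current_effective_fps)
```

With a spacing of 8 native frames, the current offsets work out to roughly 1.9 FPS over 4 frames, not 5 FPS over 8 frames, which is the mismatch I am asking about.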