I agree that some categories may not provide enough aligned vision-language information for multi-modal learning. However, in the paper, you mentioned "video game commentaries" as an example.
I wonder why this category is considered not visually grounded, since commentators' remarks are usually related to what is happening on screen in the game. In my opinion, this category should be filtered out only because of its unreality, i.e., game footage may not benefit downstream tasks on real-world data.