Imagine you've returned from a trip with hundreds of photos, and now you want to find specific objects in those images - or maybe you’d like to rearrange them by content. This tool helps you do exactly that!
It performs open-vocabulary classification with CLIP (applied to whole images or to cut tiles), and it can also visualize attention maps and similarity matrices for batches of images.
```bash
git clone https://github.com/kuzudev/clip_anyclass.git
cd clip_anyclass/docker
docker compose up -d --build
```
After launch, open http://127.0.0.1:8051/ in your browser.
- **Enter path to dir with images**: Type the path to your images (relative to the repo directory), then click **Load Images**
- **Class Descriptions**: Enter a class name and click **+** to add it (add as many as you want)
- **Confidence threshold**: Adjusts how strictly class descriptions must match the images
- **Output save directory**: Directory to save classification results (by default, results are saved in the `results` directory)
- **Checkboxes**:
  - **Use tiles**: Split images into tiles and run classification on each tile separately (uses the slicer from pytorch-toolbelt; see the tiling sketch after this list)
  - **Draw attention maps**: Visualize CLIP's attention maps (via Transformer-MM-Explainability)
  - **Draw similarity matrices**: Display a table showing image-to-class similarity scores
- Click **Run Classification** to start!
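
If you are curious how the tile mode works under the hood, here is a minimal sketch of splitting an image into overlapping tiles with pytorch-toolbelt's `ImageSlicer`. The tile size, step, and image path below are illustrative assumptions, not the tool's actual defaults (check `config.yaml` in the repo):

```python
# Minimal tiling sketch; tile_size/tile_step/path are assumed values for illustration.
import cv2
from pytorch_toolbelt.inference.tiles import ImageSlicer

image = cv2.imread("photos/beach.jpg")           # hypothetical image path
slicer = ImageSlicer(image.shape,
                     tile_size=(224, 224),       # assumed tile size (CLIP input resolution)
                     tile_step=(112, 112))       # assumed 50% overlap between tiles

tiles = slicer.split(image)                      # list of HxWxC numpy tiles
print(f"{len(tiles)} tiles; each tile is then classified with CLIP independently")
```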
- **CLIP backbone**: Uses standard CLIP (dfn2b pretrain by default - see `config.yaml`), based on open_clip
- **Attention maps**: Generated following Generic Attention-Model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers (relevancy maps are computed from CLIP's attention layers).
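
For reference, open-vocabulary classification with open_clip boils down to comparing image and text embeddings. Below is a minimal sketch under assumed names: the `ViT-B-16` / `dfn2b` model pair, the image path, the prompts, and the 0.25 threshold are placeholders, while the real backbone and threshold come from `config.yaml` and the UI:

```python
# Minimal zero-shot classification sketch with open_clip.
# Model name, pretrained tag, image path, and threshold are assumptions for illustration.
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-16", pretrained="dfn2b")
tokenizer = open_clip.get_tokenizer("ViT-B-16")
model.eval()

classes = ["a photo of a dog", "a photo of a beach", "a photo of a mountain"]
image = preprocess(Image.open("photos/beach.jpg")).unsqueeze(0)   # hypothetical path

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(tokenizer(classes))
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    # Rows = images, columns = classes: this is the image-to-class similarity matrix
    # that the "Draw similarity matrices" option visualizes.
    similarity = image_features @ text_features.T

threshold = 0.25   # plays the role of the UI confidence threshold (assumed value)
for name, score in zip(classes, similarity[0].tolist()):
    print(f"{name}: {score:.3f}", "(match)" if score > threshold else "")
```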