Your work is very good!
Simple and effective.
For your information, there is another solution that comes from the explainability of CLIP.
Our work, CLIP Surgery, can achieve text-to-mask with SAM: https://github.com/xmed-lab/CLIP_Surgery
This work focuses on CLIP's explainability. It is able to guide SAM to achieve text-to-mask without manual points.
Besides, it enhances many open-vocabulary tasks, such as segmentation, multi-label classification, and multimodal visualization.
Here is the Jupyter demo:
https://github.com/xmed-lab/CLIP_Surgery/blob/master/demo.ipynb
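
As a rough illustration of the idea of replacing manual points with a text-driven heatmap, here is a minimal sketch. It is not the code from the repository or the notebook; the helper name `heatmap_to_points` and its parameters `num_points` and `threshold` are hypothetical, and the selection rule (take the highest-scoring pixels as foreground points and the lowest-scoring pixels as background points) is an assumption made for illustration. The output follows the point-prompt convention SAM uses: coordinates in (x, y) order, with label 1 for foreground and 0 for background.

```python
import numpy as np


def heatmap_to_points(heatmap, num_points=5, threshold=0.8):
    """Hypothetical helper: turn a 2D similarity heatmap into point prompts.

    Returns (points, labels): points has shape (N, 2) in (x, y) order,
    labels has shape (N,) with 1 = foreground and 0 = background.
    """
    h, w = heatmap.shape

    # Min-max normalize to [0, 1] so the threshold is scale-independent.
    hm = heatmap.astype(np.float64)
    value_range = hm.max() - hm.min()
    hm = (hm - hm.min()) / value_range if value_range > 0 else np.zeros_like(hm)

    flat = hm.ravel()
    order = np.argsort(flat)  # ascending by score

    # Foreground candidates: the highest-scoring pixels above the threshold.
    pos_idx = order[::-1][:num_points]
    pos_idx = pos_idx[flat[pos_idx] >= threshold]

    # Background candidates: the lowest-scoring pixels below (1 - threshold).
    neg_idx = order[:num_points]
    neg_idx = neg_idx[flat[neg_idx] <= 1.0 - threshold]

    def to_xy(indices):
        # Convert flat indices to (x, y) pixel coordinates.
        ys, xs = np.unravel_index(indices, (h, w))
        return np.stack([xs, ys], axis=1)

    points = np.concatenate([to_xy(pos_idx), to_xy(neg_idx)], axis=0)
    labels = np.concatenate([
        np.ones(len(pos_idx), dtype=int),
        np.zeros(len(neg_idx), dtype=int),
    ])
    return points, labels
```

The returned arrays could then be passed as the `point_coords` and `point_labels` arguments of SAM's `SamPredictor.predict`, assuming the heatmap has already been resized to the resolution of the image given to the predictor.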
These are our segmentation results:

This is our heatmap:
