The instance segmentor works in two steps. First, it uses the SAM model to get masks for all objects in the input image. Second, it uses the CLIP model to classify each mask with both image features and your ...
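The classification step can be sketched as below. This is a minimal illustration, not the project's actual code. The description above is cut off, so the second input used here (one embedding per candidate label) is an assumption. The function `classify_masks`, the `temperature` value, and the toy 2-D vectors are all hypothetical; in a real pipeline the embeddings would come from CLIP applied to the image region under each SAM mask.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify_masks(mask_embeddings, label_embeddings, labels, temperature=100.0):
    """Assign each mask the label whose embedding is most similar.

    mask_embeddings:  one feature vector per mask (assumed to come from an
                      image encoder applied to each masked region).
    label_embeddings: one feature vector per candidate label (assumption:
                      the truncated source does not say what these are).
    temperature:      hypothetical scaling applied before the softmax.
    Returns a list of (label, probability) pairs, one per mask.
    """
    results = []
    for emb in mask_embeddings:
        logits = [temperature * cosine(emb, t) for t in label_embeddings]
        probs = softmax(logits)
        best = max(range(len(labels)), key=probs.__getitem__)
        results.append((labels[best], probs[best]))
    return results


# Toy data standing in for real encoder outputs.
labels = ["cat", "dog"]
label_embeddings = [[1.0, 0.0], [0.0, 1.0]]
mask_embeddings = [[0.9, 0.1], [0.2, 0.8]]

predictions = classify_masks(mask_embeddings, label_embeddings, labels)
for label, prob in predictions:
    print(label, round(prob, 3))
```

Keeping the similarity and softmax steps separate from the encoders makes it possible to swap in real mask and label embeddings without changing the classification logic.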