Using Stable Diffusion and SAM to Modify Image Contents Zero Shot



Recent breakthroughs in large language models (LLMs) and foundation computer vision models have unlocked new interfaces and methods for editing images and videos. You may have heard of inpainting, outpainting, generative fill, and text to image; this post shows you how to use these generative AI capabilities by building your own visual editor with only text prompts and the latest open source models.

Image editing is no longer about manual manipulation in hosted software. Models like Segment Anything Model (SAM), Stable Diffusion, and Grounding DINO have made it possible to edit images using only text commands. Together, they form a workflow that combines zero shot detection, segmentation, and inpainting. The goal of this tutorial is to demonstrate the potential of these three models and give you a starting point to build on.

By the end of this guide, you will be able to transform and manipulate images using nothing more than text commands. This blog post walks you through how to leverage these models for image editing!


For full implementation details, refer to the full Colab notebook.

Changing Objects Entirely

Prompt used for zero shot object detection: “Fire Hydrant”, Prompt used for Generation: “Photo Booth”

Changing colors and texture of objects

Prompt used for zero shot object detection: “Car”, Prompt used for Generation: “Red Car”

Creative Applications with Context

Prompt used for zero shot object detection: “Yoda”, Prompt used for Generation: “Raccoon Yoda in Star Wars”

#Step 1: Install Dependencies

Our process starts by installing the necessary libraries and models. We begin with SAM, a powerful segmentation model, Stable Diffusion for image inpainting, and GroundingDINO for zero shot object detection.

!pip -q install diffusers transformers scipy segment_anything
!git clone https://github.com/IDEA-Research/GroundingDINO.git
%cd GroundingDINO
!pip -q install -e .

#Step 2: Detect the Object and Create a Mask

We will use Grounding DINO for zero shot object detection based on the text input, in this case “fire hydrant”. Using the predict function from GroundingDINO, we obtain the boxes, logits, and phrases for our image. We then annotate our image using these results.

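from groundingdino.util.inference import load_model, load_image, predict, annotate

TEXT_PROMPT = "fire hydrant"
# groundingdino_model, img, src, BOX_THRESHOLD, and TEXT_THRESHOLD are assumed to be defined earlier (see the Colab notebook).
boxes, logits, phrases = predict(
    model=groundingdino_model, image=img, caption=TEXT_PROMPT,
    box_threshold=BOX_THRESHOLD, text_threshold=TEXT_THRESHOLD,
)
img_annotated = annotate(image_source=src, boxes=boxes, logits=logits, phrases=phrases)[..., ::-1]

Zero Shot Object Detection using GroundingDINO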
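
GroundingDINO returns each box as normalized (cx, cy, w, h) values between 0 and 1, while SAM expects pixel-space (x1, y1, x2, y2) corners. The full code at the end of this post handles this with box_ops.box_cxcywh_to_xyxy; the dependency-free sketch below shows the same arithmetic on a single box. The function name cxcywh_to_xyxy_pixels is made up for illustration and is not part of either library.

```python
def cxcywh_to_xyxy_pixels(box, width, height):
    """Convert one normalized (cx, cy, w, h) box to pixel (x1, y1, x2, y2)."""
    cx, cy, w, h = box
    x1 = (cx - w / 2) * width
    y1 = (cy - h / 2) * height
    x2 = (cx + w / 2) * width
    y2 = (cy + h / 2) * height
    return (x1, y1, x2, y2)


# A box centered in a 640x480 image that covers half of each dimension.
print(cxcywh_to_xyxy_pixels((0.5, 0.5, 0.5, 0.5), 640, 480))
# → (160.0, 120.0, 480.0, 360.0)
```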

Then, we will use SAM to extract masks from the bounding box.

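from segment_anything import SamPredictor, sam_model_registry

# model_type and device are assumed to be set earlier (see the Colab notebook).
predictor = SamPredictor(sam_model_registry[model_type](checkpoint="./weights/sam_vit_h_4b8939.pth").to(device=device))
predictor.set_image(src)  # compute the image embedding before predicting masks

# new_boxes holds the detected boxes converted to SAM's input format (see process_boxes in the full code).
masks, _, _ = predictor.predict_torch(
    point_coords=None,
    point_labels=None,
    boxes=new_boxes,
    multimask_output=False,
)

Segmented Object with Mask using SAM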
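
SAM returns a boolean mask, while the inpainting pipelines in diffusers expect a mask image in which white pixels mark the region to repaint and black pixels are kept. The sketch below shows that conversion on a plain nested list, plus an optional one-pixel dilation that gives the generator a little room around the object's edge. Both helper names are hypothetical, and the dilation step is an addition for illustration, not something the original notebook is known to do.

```python
def mask_to_grayscale(mask):
    """Map a boolean mask to 0/255 pixel values (255 = repaint, 0 = keep)."""
    return [[255 if cell else 0 for cell in row] for row in mask]


def dilate(mask, radius=1):
    """Grow the True region by `radius` pixels using a square neighborhood."""
    rows, cols = len(mask), len(mask[0])
    out = [[False] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if mask[r][c]:
                for rr in range(max(0, r - radius), min(rows, r + radius + 1)):
                    for cc in range(max(0, c - radius), min(cols, c + radius + 1)):
                        out[rr][cc] = True
    return out


mask = [
    [False, False, False, False],
    [False, True,  False, False],
    [False, False, False, False],
]
for row in mask_to_grayscale(dilate(mask)):
    print(row)
# → [255, 255, 255, 0] printed three times
```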

#Step 3: Modify Image Using Stable Diffusion

Then, we will modify the image based on a text prompt using Stable Diffusion. The pipe function from Stable Diffusion inpaints the areas identified by the mask with the contents of the text prompt. Keep this in mind for your use cases: the inpainted object should have a similar form and shape to the object it replaces.

prompt = "Phone Booth"
# original_img and only_mask are assumed to be the source image and the SAM mask as same-sized PIL images.
edited = pipe(prompt=prompt, image=original_img, mask_image=only_mask).images[0]
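
The call above assumes a pipe object already exists. One way to create it with the diffusers library is sketched below. The checkpoint name stabilityai/stable-diffusion-2-inpainting, the half-precision setting, and the helper name load_inpaint_pipeline are assumptions for illustration, so check the Colab notebook for the exact setup it uses. The imports live inside the function so that defining it does not require a GPU or the heavy dependencies.

```python
def load_inpaint_pipeline(model_id="stabilityai/stable-diffusion-2-inpainting", device="cuda"):
    """Build a Stable Diffusion inpainting pipeline (assumed checkpoint, see above)."""
    import torch
    from diffusers import StableDiffusionInpaintPipeline

    pipe = StableDiffusionInpaintPipeline.from_pretrained(
        model_id,
        torch_dtype=torch.float16,  # half precision to reduce GPU memory use
    )
    return pipe.to(device)


# Usage (needs a CUDA GPU and downloads the model weights on first call):
# pipe = load_inpaint_pipeline()
```

The pipeline expects image and mask_image at matching sizes, which is why the full code at the end resizes both to 512x512.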

Use Cases for Editing Images with Text Prompts

  • Rapid Prototyping: Accelerate product development and testing with quick visualization, enabling faster feedback and decision making for designers and developers.
  • Image Translation and Localization: Support diversity by translating and localizing visual content with alternatives.
  • Video/Image Editing and Content Management: Speed up editing images and videos by using text prompts instead of a UI, for individual creators and for enterprises with mass editing tasks.
  • Object Identification and Replacement: Easily identify objects and replace them with other objects, such as replacing a beer bottle with a coke bottle.


That’s it! Leveraging powerful models such as SAM, Stable Diffusion, and Grounding DINO makes image transformations easier and more accessible. With text-based commands, we can instruct the models to execute precise tasks such as recognizing objects, segmenting them, and replacing them with other objects.

The code in this tutorial provides a starting point for text-based image editing, and we encourage you to experiment with different objects and see what interesting results you can achieve.

Full Code

For full implementation details, refer to the full Colab notebook.

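import torch
from PIL import Image
from groundingdino.util import box_ops

def process_boxes(boxes, src):
    # Convert normalized (cx, cy, w, h) boxes to pixel (x1, y1, x2, y2), then to SAM's input frame.
    H, W, _ = src.shape
    boxes_xyxy = box_ops.box_cxcywh_to_xyxy(boxes) * torch.Tensor([W, H, W, H])
    return predictor.transform.apply_boxes_torch(boxes_xyxy, src.shape[:2]).to(device)

def edit_image(path, item, prompt, box_threshold, text_threshold):
    src, img = load_image(path)
    # groundingdino_model is assumed to have been loaded earlier with load_model.
    boxes, logits, phrases = predict(
        model=groundingdino_model, image=img, caption=item,
        box_threshold=box_threshold, text_threshold=text_threshold,
    )
    predictor.set_image(src)  # SAM needs the image embedding before predicting masks
    new_boxes = process_boxes(boxes, src)
    masks, _, _ = predictor.predict_torch(
        point_coords=None, point_labels=None, boxes=new_boxes, multimask_output=False,
    )
    # show_mask is a plotting helper assumed to be defined elsewhere (see the Colab notebook).
    img_annotated_mask = show_mask(masks[0][0].cpu(),
        annotate(image_source=src, boxes=boxes, logits=logits, phrases=phrases)[..., ::-1])
    return pipe(prompt=prompt,
        image=Image.fromarray(src).resize((512, 512)),
        mask_image=Image.fromarray(masks[0][0].cpu().numpy()).resize((512, 512)),
    ).images[0]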
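
To apply the same idea to many images, which is the mass-editing use case mentioned earlier, edit_image can be wrapped in a small loop. The sketch below is hypothetical: batch_edit and the job tuple layout are illustrative names, and the threshold defaults of 0.3 and 0.25 are placeholder values, not settings taken from the notebook. The editing function is passed in as a parameter so the loop can be tried with a stand-in before running the real models.

```python
def batch_edit(jobs, edit_fn, box_threshold=0.3, text_threshold=0.25):
    """Run edit_fn over (path, item, prompt) jobs; collect results and failures."""
    results, failures = {}, {}
    for path, item, prompt in jobs:
        try:
            results[path] = edit_fn(path, item, prompt, box_threshold, text_threshold)
        except Exception as exc:  # keep going if one image fails
            failures[path] = repr(exc)
    return results, failures


# Stand-in for edit_image so the loop can be exercised without a GPU.
def fake_edit(path, item, prompt, box_threshold, text_threshold):
    if item == "unicorn":
        raise ValueError("no detections")
    return f"{path}: {item} -> {prompt}"


results, failures = batch_edit(
    [("street.jpg", "fire hydrant", "phone booth"), ("park.jpg", "unicorn", "horse")],
    fake_edit,
)
print(results)
print(failures)
```

With the real models loaded, the same call becomes batch_edit(jobs, edit_image).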


