How to Build a Photo Memories App with CLIP

Many popular photo applications have features that collate images into slideshows, sometimes called “memories.” These slideshows are centered around a theme such as a particular location, person, or a concept common across your photos.

Using CLIP, an image model developed by OpenAI, we can build a photo memories app that groups images according to a specified theme. We can then collate the images retrieved by CLIP into a video that you can share with friends and family.

Here is the memories slideshow we make during the tutorial:



Without further ado, let’s get started!

How to Build a Photo Memories App with CLIP

To build our photo memories app, we will:

  1. Install the required dependencies
  2. Use CLIP to calculate embeddings for each image in a folder
  3. Use CLIP to find related images given a text query (e.g. “people” or “city”)
  4. Write logic to turn related images into a video
  5. Save the slideshow we have generated

You may be wondering: “what are embeddings?” Embeddings are numeric representations of images, text, and other data that you can compare. Embeddings are the key to this project: we can compare text and image embeddings to find images related to the themes for which we want to make memories.
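
To make "comparing embeddings" concrete, here is a minimal sketch using made-up three-dimensional vectors. Real CLIP embeddings are much longer (this guide indexes 512 values per image), and the vector values and variable names below are purely illustrative. The sketch uses cosine similarity, one common way to compare embeddings; the tutorial itself relies on the L2 distance computed by FAISS.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up "embeddings" for illustration only; these are not CLIP outputs.
text_city = [0.9, 0.1, 0.0]
image_skyline = [0.8, 0.2, 0.1]
image_dog = [0.0, 0.2, 0.9]

# The skyline vector points in nearly the same direction as the text vector,
# so it scores higher than the dog vector.
skyline_score = cosine_similarity(text_city, image_skyline)
dog_score = cosine_similarity(text_city, image_dog)
print(skyline_score > dog_score)
```

A higher score means the two vectors point in a more similar direction, which is the sense in which an image can be "related" to a text theme.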

Step #1: Install Required Dependencies

Before we can start building our app, we need to install a few dependencies. Run the following command to install the Python packages we will use in our application:

pip install faiss-cpu opencv-python Pillow requests numpy

(If you are working on a computer with a CUDA-enabled GPU, install faiss-gpu instead of faiss-cpu)

With the required dependencies installed, we are now ready to start building our memories app.

Start by importing the required dependencies for the project:

import base64
import os
from io import BytesIO
import cv2
import faiss
import numpy as np
import requests
from PIL import Image
import json

We are going to use the Roboflow Inference Server for retrieving CLIP embeddings. You can host the Inference Server yourself, but for this guide we will use the hosted version of the server.

Add the following constant variables to your Python script, which we will use later on to query the inference server.

INFERENCE_ENDPOINT = "https://infer.roboflow.com"
API_KEY = "API_KEY"

Replace the `API_KEY` value with your Roboflow API key. Learn how to find your Roboflow API key.

Now, let’s start working on the logic for our application.

Step #2: Calculate Image Embeddings

Our application is going to take a folder of images and a text input. We will then return a slideshow that contains images related to the text input. For this, we need to calculate two types of embeddings:

  1. Image embeddings for each image, and;
  2. A text embedding for the theme for a slideshow.

Let’s define a function that calls the Roboflow Inference Server and calculates an image embedding:

def get_image_embedding(image: Image.Image) -> list:
    # Re-encode the image as a base64 JPEG string for the request body
    image = image.convert("RGB")

    buffer = BytesIO()
    image.save(buffer, format="JPEG")
    image = base64.b64encode(buffer.getvalue()).decode("utf-8")

    payload = {
        "body": API_KEY,
        "image": {"type": "base64", "value": image},
    }

    data = requests.post(
        INFERENCE_ENDPOINT + "/clip/embed_image?api_key=" + API_KEY, json=payload
    )

    response = data.json()

    embedding = response["embeddings"]

    return embedding

Next, let’s define another function that retrieves a text embedding for a query:

def get_text(prompt):
    text_prompt = requests.post(
        f"{INFERENCE_ENDPOINT}/clip/embed_text?api_key={API_KEY}", json={"text": prompt}
    ).json()["embeddings"]

    # FAISS searches expect float32 vectors
    return np.array(text_prompt, dtype="float32")
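
Both functions read the "embeddings" key from the server response without checking it first. As a hedged sketch, a small helper like the one below could validate a parsed response before it reaches the index. The name `extract_embedding` is hypothetical, and the helper assumes the response holds a list of vectors under "embeddings" with 512 values each, matching the index size used later in this guide.

```python
def extract_embedding(response: dict, expected_dim: int = 512) -> list:
    """Return the first embedding in a parsed response, or raise ValueError.

    Assumes the response looks like {"embeddings": [[...], ...]}.
    """
    embeddings = response.get("embeddings")
    if not embeddings:
        raise ValueError(f"No embeddings found in response: {response}")

    vector = embeddings[0]
    if len(vector) != expected_dim:
        raise ValueError(f"Expected {expected_dim} values, got {len(vector)}")

    return vector


# Stand-in for a parsed server response; not real CLIP output.
fake_response = {"embeddings": [[0.0] * 512]}
vector = extract_embedding(fake_response)
```

Failing early with a clear message is easier to debug than a `KeyError` or a shape mismatch raised later, for example when an API key is missing or invalid.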

Step #3: Create an Index

The two functions we wrote in the previous step both return embeddings. But we haven’t written the logic to use them yet! Next, we need to calculate image embeddings for a folder of images. We can do this using the following code:

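index = faiss.IndexFlatL2(512)
image_frames = []
image_file_names = []

for file_name in sorted(os.listdir("./images")):
    frame = Image.open("./images/" + file_name)

    embedding = get_image_embedding(frame)

    # FAISS expects a float32 array of shape (n, 512)
    index.add(np.array(embedding, dtype="float32").reshape(1, -1))

    # Keep the image and its file name in the same order as the index
    image_frames.append(frame)
    image_file_names.append(file_name)

with open("image_frames.json", "w+") as f:
    json.dump(image_file_names, f)

faiss.write_index(index, "index.bin")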

This code creates an “index”. This index will store all of our embeddings. We can efficiently search this index using text embeddings to find images for our slideshow.

At the end of this code, we save the index to a file for later use. We also save all of the image frame file names to a file. This is important because the index doesn’t store these, and we need to know with what file each frame in the index is associated so we can make our slideshow.
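
To illustrate why the file names must be kept alongside the index, here is a small pure-Python sketch of what a flat L2 search does conceptually: it measures the squared distance from the query to every stored vector and returns the positions of the closest ones. This is not FAISS code; the function name, file names, and two-dimensional vectors are invented for illustration.

```python
def flat_l2_search(vectors, query, k):
    """Return (distances, positions) of the k stored vectors closest to query."""
    scored = []
    for position, vector in enumerate(vectors):
        # Squared Euclidean distance between the stored vector and the query
        distance = sum((v - q) ** 2 for v, q in zip(vector, query))
        scored.append((distance, position))

    scored.sort()
    top = scored[:k]
    return [d for d, _ in top], [p for _, p in top]


# Hypothetical file names, stored in the same order as the vectors.
file_names = ["beach.jpg", "bridge.jpg", "skyline.jpg"]
vectors = [[0.0, 1.0], [1.0, 0.0], [0.9, 0.1]]

distances, positions = flat_l2_search(vectors, [1.0, 0.0], 2)

# The search only returns positions; the list of names turns them into files.
matches = [file_names[p] for p in positions]
print(matches)
```

The search result is a list of positions, so a separate list kept in insertion order is what connects each position back to a file on disk.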

Step #4: Retrieve Images for the Slideshow

Next, we need to retrieve images for our slideshow. We can do this with two lines of code:

query = get_text("san francisco")
D, I = index.search(query, 3)

In the first line of code, we call the get_text() function we defined earlier to retrieve a text embedding for a query. In this example, our query is “san francisco”. Then, we search our image index for images whose embeddings are similar to our text embedding.

This code will return images ordered by their relevance to the query. If you don’t have any images relevant to the query, results will still be returned, although they will not be useful in making a thematic slideshow. Thus, make sure you search for themes that are featured in your images.

The value 3 states that we want the top three images related to our text query. You can increase or decrease this number to retrieve more or fewer images for your slideshow.
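
Because a search always returns results, one option is to drop matches whose distance is too large. The sketch below works on plain lists standing in for `D[0]` and `I[0]`. The function name `filter_matches` and the cut-off of 1.5 are assumptions for illustration; a useful threshold depends on your images and would need to be tuned by inspecting real distances.

```python
def filter_matches(distances, indices, max_distance):
    """Keep index positions whose distance is at most max_distance."""
    kept = []
    for distance, position in zip(distances, indices):
        # FAISS fills missing results with -1 when the index holds
        # fewer vectors than the number of results requested.
        if position == -1:
            continue
        if distance <= max_distance:
            kept.append(position)
    return kept


# Stand-ins for D[0] and I[0]; these numbers are invented.
example_distances = [0.8, 1.1, 1.9]
example_indices = [4, 0, 7]

kept = filter_matches(example_distances, example_indices, max_distance=1.5)
print(kept)
```

If the filtered list comes back empty, that is a hint the theme is probably not present in the folder.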

Step #5: Find Maximum Image Width and Height

There is one more step we need to complete before we can start creating slideshows: we need to find the largest image width and height values in the images we will use to create each slideshow. This is because we need to know at what resolution we should save our video.

To find the maximum width and height values in the frames we have gathered, we can use the following code:

video_frames = []
largest_width = 0
largest_height = 0

for i in I[0]:
    frame = image_frames[i]
    # PIL loads images as RGB; OpenCV expects BGR channel order
    cv2_frame = np.array(frame.convert("RGB"))
    cv2_frame = cv2.cvtColor(cv2_frame, cv2.COLOR_RGB2BGR)

    # Repeat each image for 20 frames so it stays on screen
    video_frames.extend([cv2_frame] * 20)

    height, width, _ = cv2_frame.shape

    if width > largest_width:
        largest_width = width

    if height > largest_height:
        largest_height = height
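
The code above repeats each image 20 times, and the video in this guide is written at 20 frames per second, so each image stays on screen for one second. As a rough sketch of that arithmetic, the hypothetical helpers below convert between seconds and frame counts if you want a different pace.

```python
FPS = 20  # frame rate used when the video is written in this guide


def frames_for(seconds: float, fps: int = FPS) -> int:
    """Number of repeated frames needed to show one image for `seconds`."""
    return round(seconds * fps)


def slideshow_length(image_count: int, frames_per_image: int, fps: int = FPS) -> float:
    """Total slideshow duration in seconds."""
    return image_count * frames_per_image / fps


# Three images repeated 20 times each at 20 frames per second
print(slideshow_length(3, 20))
```

For example, replacing the 20 in the repeat count with `frames_for(2.5)` would hold each image for two and a half seconds.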

Step #6: Generate the Slideshow

We’re onto the final step: create the slideshow. All of the pieces are in place. We have found images related to a text query, and calculated the resolution we will use for our slideshow. The final step is to create a video that uses the images.

We can create our slideshow using the following code:

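final_frames = []

for frame in video_frames:
    # Pad shorter frames with black bars above and below
    if frame.shape[0] < largest_height:
        difference = largest_height - frame.shape[0]
        padding = difference // 2

        frame = cv2.copyMakeBorder(
            frame, padding, difference - padding, 0, 0,
            cv2.BORDER_CONSTANT, value=(0, 0, 0),
        )

    # Pad narrower frames with black bars on the left and right
    if frame.shape[1] < largest_width:
        difference = largest_width - frame.shape[1]
        padding = difference // 2

        frame = cv2.copyMakeBorder(
            frame, 0, 0, padding, difference - padding,
            cv2.BORDER_CONSTANT, value=(0, 0, 0),
        )

    final_frames.append(frame)

video = cv2.VideoWriter(
    "video1.avi", cv2.VideoWriter_fourcc(*"MJPG"), 20, (largest_width, largest_height)
)

for frame in final_frames:
    video.write(frame)

video.release()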

This code creates a big list of all of the frames we want to include in our slideshow. These frames are padded with black pixels according to the maximum height and width we identified earlier. This ensures images are not stretched to fit exactly the same resolution as the largest image. We then add all of these frames to a video and save the results to a file called video1.avi.

Let’s run our code on a folder of images. For this guide, we have run the memories app on a series of city photos. Here is what our video looks like:
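
The padding arithmetic is easy to get subtly wrong, so here is a small sketch of it in isolation. The helper name `split_padding` is hypothetical. It splits the missing pixels into a "before" and an "after" amount that always add up to the full difference, which matters because the video is created with one fixed frame size.

```python
def split_padding(size: int, target: int) -> tuple:
    """Return (before, after) padding that grows `size` to exactly `target`."""
    difference = max(target - size, 0)
    before = difference // 2
    # Give any leftover pixel to the second side so the total is exact
    after = difference - before
    return before, after


# A 480-pixel-tall frame padded up to 720 pixels
print(split_padding(480, 720))

# An odd difference: the two sides differ by one pixel
print(split_padding(481, 720))
```

If the same `difference // 2` value were applied to both sides, a frame with an odd difference would end up one pixel short of the target size.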



We have successfully generated a video with images related to “san francisco”.


CLIP is a versatile tool with many uses in computer vision. In this guide, we have demonstrated how to build a photo memories app with CLIP. We used CLIP to calculate image embeddings for all images in a folder. We then saved these embeddings in an index.

Next, we used CLIP to calculate a text embedding that we used to find images related to a text query. In our example, this query was “san francisco”. Finally, we completed some post-processing to ensure images were all the same size, and compiled images related to our query into a slideshow.


