Saturday, April 13, 2024

How to Analyze and Classify Video with CLIP

CLIP, a computer vision model by OpenAI, can be used to solve a range of video analysis and classification problems. Consider a scenario where you want to archive and enable search on a collection of advertisements. You could use CLIP to classify videos into various categories (i.e. advertisements featuring football, the beach, etc.). You could then use these categories to build a media search engine for advertisements.

In this guide, we're going to show how to analyze and classify video with CLIP. We'll take a video with 5 scenes that is featured on the Roboflow homepage. We'll use CLIP to answer three questions about the video:

  1. Does the video contain a construction scene?
  2. If the video contains a construction scene, when does that scene begin?
  3. How long do construction scenes last?

Here is the video with which we will be working:

The approach we use in this article could be used to solve other media analytics and analysis problems, such as:

  1. Which of a series of categories best describes a video?
  2. Does a video contain a restricted item (i.e. alcohol)?
  3. At what timestamps do specific scenes occur?
  4. How long is an item on screen?

Without further ado, let's get started!

How to Classify Video with CLIP

To answer the questions we had earlier – does a video contain a construction scene, and when does that scene begin – we will follow these steps:

  1. Install the required dependencies
  2. Split up a video into frames
  3. Run CLIP to categorize a limited set of frames

Step #1: Install Required Dependencies

We're going to use CLIP with the Roboflow Inference Server. The Inference Server provides a web API through which you can query Roboflow models as well as foundation models such as CLIP. We'll use the hosted Inference Server, so we don't need to install it.

We need to install the Roboflow Python package and supervision, which we will use for running inference and working with video, respectively:

pip install roboflow supervision

Now that we have the required dependencies installed, we can start classifying our video.

Step #2: Write Code to Use CLIP

To start our script to analyze and classify video, we need to import dependencies and set a few variables that we will use throughout our script.

Create a new Python file and add the following code:

import requests
import base64
from PIL import Image
from io import BytesIO
import os
import supervision as sv

# Hosted Roboflow Inference Server endpoint
INFERENCE_ENDPOINT = "https://infer.roboflow.com"
API_KEY = ""  # your Roboflow API key
VIDEO = "./"  # path to the video to analyze

prompts = [
    "construction site",
    "something else"
]

ACTIVE_PROMPT = "construction site"

Replace the following values above as required:

  • API_KEY: Your Roboflow API key. Learn how to retrieve your Roboflow API key.
  • VIDEO: The name of the video to analyze and classify.
  • prompts: A list of categories into which each video frame should be classified.
  • ACTIVE_PROMPT: The prompt for which you want to compute analytics. We use this later to report whether a video contains the active prompt, and when the scene featuring the active prompt first begins.

In this example, we're looking for scenes that contain a construction site. We have provided two prompts: "construction site" and "something else".

Next, we need to define a function that will run inference on each frame in our video:

def classify_image(image) -> str:
    # Convert the raw video frame (a numpy array) to a base64-encoded JPEG
    image_data = Image.fromarray(image)

    buffer = BytesIO()
    image_data.save(buffer, format="JPEG")
    image_data = base64.b64encode(buffer.getvalue()).decode("utf-8")

    payload = {
        "api_key": API_KEY,
        "subject": {
            "type": "base64",
            "value": image_data
        },
        "prompt": prompts,
    }

    data = requests.post(INFERENCE_ENDPOINT + "/clip/compare?api_key=" + API_KEY, json=payload)

    response = data.json()

    highest_prediction = 0
    highest_prediction_index = 0

    # Pick the prompt with the highest CLIP similarity score
    for i, prediction in enumerate(response["similarity"]):
        if prediction > highest_prediction:
            highest_prediction = prediction
            highest_prediction_index = i

    return prompts[highest_prediction_index]

This function will take a video frame, run inference using CLIP and the Roboflow Inference Server, then return a classification for that frame using the prompts we set earlier.
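The prompt-selection step at the end of `classify_image` is simply an argmax over the similarity scores CLIP returns, one per prompt. Isolated as a small standalone helper (a sketch for illustration; the name `best_prompt` and the sample scores are our own, not part of the Roboflow API), it looks like this:

```python
def best_prompt(similarities: list, prompts: list) -> str:
    # Return the prompt whose CLIP similarity score is highest.
    highest_prediction = 0
    highest_prediction_index = 0

    for i, prediction in enumerate(similarities):
        if prediction > highest_prediction:
            highest_prediction = prediction
            highest_prediction_index = i

    return prompts[highest_prediction_index]

# Example: a frame whose embedding scores closer to "construction site"
label = best_prompt([0.71, 0.24], ["construction site", "something else"])
print(label)  # construction site
```

Because "something else" acts as a catch-all, any frame that is not clearly a construction site should score higher against it, which is what makes the two-prompt setup work.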

Finally, we need to call this function on frames in our video. To do so, we will use supervision to split up our video into frames. We'll then run CLIP on each frame:

results = []

for i, frame in enumerate(sv.get_video_frames_generator(source_path=VIDEO, stride=10)):
    print("Frame", i)
    label = classify_image(frame)
    results.append(label)

video_length = 10 * len(results)

video_length = video_length / 24

print(f"Does this video contain a {ACTIVE_PROMPT}?", "yes" if ACTIVE_PROMPT in results else "no")

if ACTIVE_PROMPT in results:
    print(f"When does the {ACTIVE_PROMPT} first appear?", round(results.index(ACTIVE_PROMPT) * 10 / 24, 0), "seconds")

print(f"For how long is the {ACTIVE_PROMPT} visible?", round(results.count(ACTIVE_PROMPT) * 10 / 24, 0), "seconds")

This code sets a stride value of 10. This means that a frame will be collected for use in classification every 10 frames in the video. For faster results, set a higher stride value. For more precise results, set a lower stride value. A stride value of 10 means ~2 frames are collected per second (given a 24 FPS video).
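The timestamp arithmetic behind those print statements can be captured in one small helper (a sketch; the name `frame_to_seconds` is our own): with stride 10 and a 24 FPS video, the i-th sampled frame corresponds to second `i * 10 / 24`.

```python
def frame_to_seconds(sample_index: int, stride: int = 10, fps: float = 24) -> float:
    # Convert the index of a sampled frame back into a video timestamp,
    # rounded to the nearest whole second as in the script above.
    return round(sample_index * stride / fps, 0)

# The 17th sampled frame of a 24 FPS video read with stride 10:
print(frame_to_seconds(17))  # 7.0
```

This also makes the precision trade-off concrete: each step of `sample_index` spans `stride / fps` seconds of video, so a stride of 10 at 24 FPS can only locate a scene to within about 0.4 seconds.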

After the code above has run CLIP on the video, the code then finds:

  1. Whether the video contains a construction site;
  2. When the construction scene begins, and;
  3. How long the construction scene lasts.
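If you need every construction segment rather than only the first appearance and total duration, the per-frame labels can be grouped into contiguous runs. The helper below is a sketch of that extension (the name `find_segments` is our own, not part of the article's code); it returns (start, end) timestamps in seconds for each run of the active prompt.

```python
def find_segments(results: list, active_prompt: str, stride: int = 10, fps: float = 24) -> list:
    # Group consecutive frames labeled with the active prompt into
    # (start_seconds, end_seconds) segments.
    segments = []
    start = None
    for i, label in enumerate(results):
        if label == active_prompt and start is None:
            start = i
        elif label != active_prompt and start is not None:
            segments.append((start * stride / fps, i * stride / fps))
            start = None
    if start is not None:
        # The video ended while a segment was still open
        segments.append((start * stride / fps, len(results) * stride / fps))
    return segments

# Hypothetical label sequence: construction appears from roughly 7s to 10s
labels = ["something else"] * 17 + ["construction site"] * 7 + ["something else"] * 5
print(find_segments(labels, "construction site"))
```

A segment-based view would also make it easy to filter out spurious one-frame detections before reporting analytics.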

Let's run our code and see what happens:

Does this video contain a construction site? yes
When does the construction site first appear? 7 seconds
For how long is the construction site visible? 6 seconds

Our code has successfully identified that our video contains a construction scene, has identified a time at which the scene begins, and the duration of the scene. CLIP did, however, count the shipyard scene as construction.

This is why the "construction site visible" metric is six seconds instead of the ~3 seconds for which the actual construction site is visible. CLIP is likely interpreting the moving heavy vehicles and the general environment of the shipyard as construction, even though no construction is occurring.

CLIP isn't perfect: the model may not pick up on what is obvious to humans. If CLIP doesn't perform well for your use case, it's worth exploring how to create a purpose-built classification model for your project. You can use Roboflow to train custom classification models.

Note: The timestamps returned aren't fully precise because we have set a stride value of 10. For more precise timestamps, set a lower stride value. Lower stride values will run inference on more frames, so inference will take longer.


CLIP is a versatile tool with many applications in video analysis and classification. In this guide, we showed how to use the Roboflow Inference Server to classify video with CLIP. We used CLIP to find whether a video contains a particular scene, when that scene begins, and how much of the video contains that scene.

If you need to identify specific objects in an image – company logos, specific products, defects – you will need to use an object detection model instead of CLIP. We are preparing a guide on this topic that we will release in the coming weeks.


