Computer Vision for Video Artificial Intelligence

By David Jeyes

This is the follow-up to The Foundations of Video Artificial Intelligence; we recommend you read the concepts there first.

AI will disrupt many processes, markets, and even entire job roles. We are most excited about the many ways video content will be enabled in the enterprise. Video packs much more information than the still images that deep learning models have thrived on so far.

Video also requires many more resources to store, process, and manage at enterprise scale. This post will focus on how Computer Vision algorithms will maximize the value generated from video. We expect this to be achieved with faster analysis, production, and even artistic direction.

Object Recognition

After a machine learning algorithm has digested a video frame, the Object Recognition process identifies the various subjects within it. Object Recognition for an AI is a collection of related tasks and not the single step human vision perceives it as. The key elements of Object Recognition include image classification, object localization, and finally object detection.

The image above shows how a complex scene is digested using the rapid iterative process detailed in the chart. Object Recognition is a key foundation of Computer Vision algorithms and lays the groundwork for more advanced features like facial recognition.

Automated Content Governance

Many enterprises host large amounts of video, and it is critical to ensure that content is both brand appropriate and legal. As noted previously, video data is more complex than other forms of media, and checking that every frame adheres to content guidelines and compliance rules is a tedious task for humans. Content Detection algorithms can now flag videos that do not comply with corporate standards, and can be adapted to governance needs across roles and industries.

Action Detection

Action Detection Demo

One key advantage of video content is the ability to show a story instead of telling it. Computer Vision advances are enabling AIs to decode what is being done in a scene and not just who is in it.

Combining Object Recognition with Action Detection will allow the analysis or prediction of why a subject is performing an action. The algorithm once more needs extensive training to recognize an action, and the action must be visually detectable. The ability to infer that an off-screen action has occurred still eludes AI observers.

Emotion Detection

One of the biggest challenges for digital messaging is that as little as 7% of human communication is expressed verbally. This is what makes video such an attractive medium for businesses. It is also why, when we discussed text analysis by AI, we were limited to sentiment analysis. Beyond contextual clues and verbal tone, most emotional decoding is sourced from the speaker’s face and body language.

The visible signs of emotion can be subtle or faked. This makes video an ideal medium for flagging and decoding these elusive data points. AI has already gained competency at labeling the emotions being displayed, and is now developing accuracy at predicting their context. Unlocking emotional response data from footage of a live crowd would offer businesses new and invaluable engagement analytics.

Researchers at Ohio State have developed an AI that is more accurate than human observers at identifying emotions in images. The computer vision algorithm was taught to measure the vascular response of faces via color analysis. This gives it an edge over human observers because the technique can defeat attempts to mask or mislead with facial expressions.

Scene Detection & Automated Editing

Video editing is currently a highly complex, tedious, and thus expensive task. We are rapidly approaching an era where this long-established fact is no longer true. An algorithm can easily detect a scene transition because most of the visual content changes from one frame to the next.

This makes it possible to train video editing algorithms on human-edited videos so they can learn to replicate what makes a compellingly edited scene. Current generation technology can be great for roughly recapping a large event, but true journalism still takes a human mind.
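
To make the scene transition idea concrete, here is a minimal sketch in Python. It assumes each frame has already been decoded into a flat list of grayscale pixel values (0 to 255). The function names and both thresholds are our own illustrative choices, not taken from any particular product. It shows only the basic frame-difference idea, and production tools typically use more robust measures.

```python
from typing import List, Sequence

Frame = Sequence[int]  # flattened grayscale pixels, each 0-255 (assumed input format)


def changed_fraction(prev: Frame, curr: Frame, pixel_delta: int = 30) -> float:
    """Fraction of pixels whose brightness moved by more than pixel_delta."""
    if len(prev) != len(curr):
        raise ValueError("frames must have the same number of pixels")
    if not prev:
        return 0.0
    changed = sum(1 for a, b in zip(prev, curr) if abs(a - b) > pixel_delta)
    return changed / len(prev)


def find_scene_cuts(frames: List[Frame], cut_ratio: float = 0.5) -> List[int]:
    """Return the index of every frame that appears to start a new scene."""
    cuts = []
    for i in range(1, len(frames)):
        # A cut is declared when most of the picture changed since the last frame.
        if changed_fraction(frames[i - 1], frames[i]) > cut_ratio:
            cuts.append(i)
    return cuts


# Toy clip: two dark frames (with slight noise), then two bright frames.
dark = [10, 12, 11, 9]
dark_noisy = [12, 10, 13, 9]
bright = [200, 210, 205, 198]
clip = [dark, dark_noisy, bright, bright]
print(find_scene_cuts(clip))  # [2]
```

The per-pixel threshold ignores small sensor or compression noise, while the ratio threshold separates a hard cut from ordinary motion inside a scene. Gradual transitions such as fades would need a different approach.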

Facial Recognition & Replacement

Face Replacement Using A Single Source Image

Facial Recognition software has been popularized in crime TV shows as a means of scouring for suspects in video footage or social media. We now see this technology operating on our mobile phones, with morphing social media filters and animated emojis in real time.

Recognizing faces adds further complexity on top of Object Recognition, and a common limitation is the supply of reference images or videos. However, our first video here illustrates how much current algorithms can accomplish with modest reference data.

A startling application of this has been the rise of “deepfakes.” A deepfake is video footage that a Deep Learning model has doctored in a realistic manner. This could involve overlaying a target's face (and sometimes voice) onto another person. These techniques have been migrating from Hollywood studios into home offices.

Deepfakes are being created that show prominent people doing and saying things without that subject’s consent. As deepfakes continue to improve, we will grow reliant on AI programs to verify that media content is authentic.

PBS- Deepfakes are getting Real

Video AI Diagnostics

In the near future, Artificial Intelligence will be used for more than interpreting or indexing video data. Videos will be flagged if content is duplicative or pirated. Video files will be upscaled or repaired if they are of subpar quality or contain errors. Humans will focus more on the creation of new content, while technical and admin tasks get streamlined within video platforms themselves.


A key tool for every content creator is the ability to assign and preserve ownership rights by using a unique digital fingerprint. This digital fingerprint is then used to signal if someone has pirated or sampled content without consent. This is already being applied with Digital Rights Management (DRM) software, where premium content is protected from such piracy or misuse. Fingerprinting techniques can be adapted for content management, such as removing duplicate files in video storage or minimizing archived files.
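
As a rough illustration of how a fingerprint can catch duplicates, the sketch below uses a simple "average hash", one of several perceptual hashing approaches and not necessarily what any DRM product uses. It assumes each video is represented by one small grayscale keyframe stored as a flat list of pixel values. The file names and pixel data are invented for the example.

```python
from typing import Dict, List, Sequence, Tuple


def average_hash(pixels: Sequence[int]) -> int:
    """One bit per pixel: 1 if the pixel is brighter than the frame's mean."""
    if not pixels:
        raise ValueError("frame has no pixels")
    mean = sum(pixels) / len(pixels)
    bits = 0
    for p in pixels:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions where two fingerprints differ."""
    return bin(a ^ b).count("1")


def find_duplicates(
    library: Dict[str, Sequence[int]], max_distance: int = 1
) -> List[Tuple[str, str]]:
    """Return pairs of files whose fingerprints are nearly identical."""
    names = sorted(library)
    hashes = {name: average_hash(library[name]) for name in names}
    pairs = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if hamming_distance(hashes[first], hashes[second]) <= max_distance:
                pairs.append((first, second))
    return pairs


# Hypothetical library: a video, a re-encoded copy of it, and an unrelated video.
library = {
    "promo.mp4": [10, 200, 30, 220, 15, 210, 25, 205],
    "promo_copy.mp4": [12, 190, 35, 215, 20, 205, 22, 200],
    "webinar.mp4": [200, 10, 220, 30, 210, 15, 205, 25],
}
print(find_duplicates(library))  # [('promo.mp4', 'promo_copy.mp4')]
```

Because the hash depends on which pixels are brighter than average and not on their exact values, a re-encoded copy with slightly different pixels still matches, which a plain file checksum would miss.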

Video Quality Analysis

When distributing content across a complex network, important data can be lost or corrupted during transit. It has also proven hard to judge the quality of streaming video across networks and devices until after playback has started. However, Netflix partnered with academics to better predict video quality using a metric they call Video Multimethod Assessment Fusion (VMAF).

This metric is being used to maximize the quality of video streams across user environments. This is done by testing various combinations of video codecs, encoders, settings, and transmission variants. Automating this highly technical and tedious work ensures that video quality is optimized across a wide array of environments.
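
The selection step can be sketched in a few lines of Python. We assume the quality scores have already been measured elsewhere (VMAF scores fall on a 0 to 100 scale). The `EncodeResult` type, the function name, and every number below are made-up placeholders for illustration, not real measurements or Netflix's actual pipeline.

```python
from typing import List, NamedTuple, Optional


class EncodeResult(NamedTuple):
    codec: str
    bitrate_kbps: int
    quality: float  # a VMAF-style score, 0-100, measured by an external tool


def cheapest_acceptable(
    results: List[EncodeResult], min_quality: float
) -> Optional[EncodeResult]:
    """Pick the lowest-bitrate encode that still meets the quality target."""
    acceptable = [r for r in results if r.quality >= min_quality]
    if not acceptable:
        return None
    # Prefer lower bitrate; break ties in favor of higher quality.
    return min(acceptable, key=lambda r: (r.bitrate_kbps, -r.quality))


# Placeholder trial results (illustrative numbers only).
trials = [
    EncodeResult("h264", 3000, 91.0),
    EncodeResult("h264", 1500, 78.0),
    EncodeResult("hevc", 1500, 90.5),
    EncodeResult("hevc", 800, 74.0),
]
best = cheapest_acceptable(trials, min_quality=90.0)
print(best)
```

With these placeholder values the function picks the 1500 kbps HEVC encode, since it clears the quality target at half the bitrate of the H.264 alternative.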

Creative AI

As we have seen, Artificial Intelligence currently provides a broad range of tools for interpreting, managing, and enhancing video. A key obstacle for AI systems is their struggle to generate new content rather than just mimic existing work. This is similar to how trained parrots can “talk,” but are generally poor at engaging in a coherent conversation.

The more intuition a task requires, the harder it currently is for a computer to learn. However, the following video tasks are likely to find themselves disrupted by AI algorithms in the not-too-distant future.

Narration and Storytelling

Literary Components of a Story

The first disruptive trend we will see from combining Video and AI will be automated narration and storytelling. Current generation technology can identify objects, people, and words within a video. However, AIs cannot yet weave these components into an original, coherent story without significant human assistance. This ability is just starting to develop as traditional media outlets begin to outsource certain journalism tasks to machines.

Stories such as recaps of sports events or earnings releases are already being automated. For these, tabular data is instantly translated into natural language insights for readers. For business videos, these complex insights will need to be cohesive and draw on information outside of spreadsheets.
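
The simplest form of this data-to-text translation is a template filled in from a table row. The sketch below is a deliberately tiny illustration, with a hypothetical `recap` function and placeholder team names. Real newsroom systems are considerably more sophisticated.

```python
def recap(home: str, away: str, home_score: int, away_score: int) -> str:
    """Turn one row of a results table into a one-sentence recap."""
    if home_score == away_score:
        return f"{home} and {away} drew {home_score}-{away_score}."
    winner, loser = (home, away) if home_score > away_score else (away, home)
    high, low = max(home_score, away_score), min(home_score, away_score)
    # Vary the verb with the margin so the output reads less mechanically.
    verb = "edged" if high - low == 1 else "beat"
    return f"{winner} {verb} {loser} {high}-{low}."


print(recap("Team A", "Team B", 3, 2))  # Team A edged Team B 3-2.
print(recap("Team A", "Team B", 1, 4))  # Team B beat Team A 4-1.
```

Templates like this work because the facts already sit in neat columns. The harder problem for business video is assembling a story when the facts do not.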

The science of storytelling has been studied thoroughly, and an ample supply of public stories exists to train models with. It’s realistic to believe that the obstacles preventing a quality AI storyteller will soon be overcome.

Generating New Content

A major milestone in creativity will be Generative AI, which will combine and improve upon various skills detailed above. Generative AI programs will be capable of creating cohesive video content based on an input set of visuals, text, and audio. Currently these algorithms can only piece together concepts, follow templates, or fill in small gaps left in existing videos. The results from current generation AIs range from comical to creepy, as shown by our Friends “script.”

AI with true creative skill will enable high quality video content by easing the budget and talent constraints on producing it. Our last video demo shows realistic models being generated from scratch. Enabling computers to both write and act out compelling stories will unlock video created without cameras, microphones, or even a human editor.

AI Friends Scene Generator – Screenwriters are still safe for now
AI Generated Models

DataGrid’s Deep Learning Model Creating Realistic Models from Scratch