Video Speech Generator from YouTube with Artificial Intelligence

13 minute read

Hello everyone, today we are going to build an interesting application that creates video messages from Youtube Videos.

A simple application that replaces the original speech of the video from your input text. The program tries to clone the original voice and replace it with your new speech and the libs of the mouth are synchronized with Machine Learning.

For example, by taking one Youtube Video, like King Charles III like this:

we can transform it into a new video, where you say the following:

I am clonning your voice. Charles!. Machine intelligence is the last invention that humanity will ever need to make.

with this program you got the following results:

The application that we will create will be on hosted on Amazon Web Services (AWS) , in ordering to to perform the calculations

We are going to use SageMaker Notebook to create this application.

There are two versions of this app.

  • The notebook version (.ipynb ) that is the extended version that we are going to see first,
  • The python version (.py) that is the simplified version.

We are going to discuss the extended version to understand how things were done.

The web app that we want to to create is like this:


where you can make Albert Einstein speak about machine learning

or even Mark Zuckerberg.

Step 1 - Cloning the repository

You should login to AWS account and create a SageMaker Notebook, as was explained in the previous blog: How-to-run-WebApp-on-SageMaker-Notebook. For this project I have used ml.g4dn.4xlarge instances and we open a new terminal File>New> Terminal.


Then in the terminal, we clone the repository, and type the following

git clone

then enter to the directory

cd Video-Speech-Generator-from-Youtube

Step 2 - Environment setup

First, we need to install the environment to this application, which will be VideoMessage, which will be executed on python=3.7.13 and it is required ffmpeg and git-lfs.

To do this task. open a terminal and type:

sh-4.2$ sh

after it is done, you will get something like


You can open the Video-Message-Generator.ipynb notebook and choose the kernel VideoMessage.

Step 3 - Definition of variables.

In order to begin the construction of the application, we require to manage well the environments of this Cloud Service. We can import the module sys and determine which environment we are using.

import sys ,os

However, to install and import modules to our environment from the terminal we should be careful because Sagemaker runs on its own container, and we have to load properly the environment that we have created.

Sagemaker = True
if Sagemaker :
    env='source activate python3 && conda activate VideoMessage &&'

Step 4. Setup of the dependencies

We are going to install the dependencies of our custom program.

We will use the model Wav2Lip to synchronize the speech to the video.

Due to we are going to install it for the first time, we define True otherwise False

is_first_time = True
#Install dependency
# Download pretrained model
# Import the os module
import os
# Get the current working directory
parent_dir = os.getcwd()
if is_first_time:
    # Directory 
    directory = "sample_data"
    # Path 
    path = os.path.join(parent_dir, directory) 
        print("Directory '% s' created" % directory) 
    except Exception:
         print("Directory '% s'was already created" % directory)
    os.system('git clone')
    os.system('cd Wav2Lip &&{} pip install  -r requirements.txt'.format(env))

After the cell was executed, several modules will be installed:

Looking in indexes:,
Collecting librosa==0.7.0
  Downloading librosa-0.7.0.tar.gz (1.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.6/1.6 MB 24.4 MB/s eta 0:00:00
  Preparing metadata ( started
  Preparing metadata ( finished with status 'done'

Successfully built librosa audioread
Installing collected packages: llvmlite, tqdm, threadpoolctl, pycparser, pillow, numpy, joblib, audioread, torch, scipy, opencv-python, opencv-contrib-python, numba, cffi, torchvision, soundfile, scikit-learn, resampy, librosa
Successfully installed audioread-3.0.0 cffi-1.15.1 joblib-1.1.0 librosa-0.7.0 llvmlite-0.31.0 numba-0.48.0 numpy-1.17.1 opencv-contrib-python- opencv-python- pillow-9.2.0 pycparser-2.21 resampy-0.3.1 scikit-learn-1.0.2 scipy-1.7.3 soundfile-0.10.3.post1 threadpoolctl-3.1.0 torch-1.1.0 torchvision-0.3.0 tqdm-4.45.0

Then we require to download some pre-trained models that use Wav2Lip.

from utils.default_models import ensure_default_models
from pathlib import Path
## Load the models one by one.
print("Preparing the models of Wav2Lip")
Preparing the models of Wav2Lip

After this, we require another program that will synthesize the voice from your text. We are going to use the Coqui-TTS that is needed for the generation of voice.

if is_first_time:
    os.system('git clone -b multilingual-torchaudio-SE TTS')
    os.system('{} pip install -q -e TTS/'.format(env))
    os.system('{} pip install -q torchaudio==0.9.0'.format(env))
    os.system('{} pip install -q youtube-dl'.format(env))
    os.system('{} pip install ffmpeg-python'.format(env))
    os.system('{} pip install gradio==3.0.4'.format(env))
    os.system('{} pip install pytube==12.1.0'.format(env))

then we load the repositories

#this code for recording audio
from IPython.display import HTML, Audio
from base64 import b64decode
import numpy as np
from import read as wav_read
import io
import ffmpeg
from pytube import YouTube
import random
from subprocess import call
import os
from IPython.display import HTML
from base64 import b64encode
from IPython.display import clear_output
from datetime import datetime

If the previous modules are loaded, then we have completed the first part of the setup of dependencies.

Step 5. Definition de modules used in this program

To manage all the information that contains our web application we require to create some helper functions,

def showVideo(path):
  mp4 = open(str(path),'rb').read()
  data_url = "data:video/mp4;base64," + b64encode(mp4).decode()
  return HTML("""
  <video width=700 controls>
        <source src="%s" type="video/mp4">
  """ % data_url)
def time_between(t1, t2):
    FMT = '%H:%M:%S'
    t1 = datetime.strptime(t1, FMT)
    t2 = datetime.strptime(t2, FMT)
    delta = t2 - t1
    return str(delta)

In order to check that works

time_between("00:00:01","00:00:10" )

the result is ok.


The next step is to create the module to download videos from YouTube, I have defined two versions, one given by pytube and the other youtube-dl

def download_video(url):
    local_file = (
        .streams.filter(progressive=True, file_extension="mp4")
        .download(filename="youtube{}.mp4".format(random.randint(0, 10000)))
    return local_file
def download_youtube(url):
    #Select a Youtube Video
    #find youtube video id
    from urllib import parse as urlparse
    url_data = urlparse.urlparse(url)
    query = urlparse.parse_qs(url_data.query)
    YOUTUBE_ID = query["v"][0]
    url_download ="{}".format(YOUTUBE_ID)
    # download the youtube with the given ID
    os.system("{} youtube-dl -f  mp4 --output youtube.mp4 '{}'".format(env,url_download))

the difference is that youtube-dl takes too much time and can be used to download YouTube videos with higher quality but I prefer, for now, better performance. We can test both modules.

if is_first_time:
    URL = ''
    #URL = ''

Then we need to define some modules to clean our files.

def cleanup():
    import pathlib
    import glob
    types = ('*.mp4','*.mp3', '*.wav') # the tuple of file types
    #Finding mp4 and wave files
    junks = []
    for files in types:
        # Deleting those files
        for junk in junks:
            # Setting the path for the file to delete
            file = pathlib.Path(junk)
            # Calling the unlink method on the path
    except Exception:
        print("I cannot delete the file because it is being used by another process")
def clean_data():
    # importing all necessary libraries
    import sys, os
    # initial directory
    home_dir = os.getcwd()
    # some non existing directory
    fd = 'sample_data/'
    # Join various path components
    print("Path to clean:",path_to_clean)
    # trying to insert to false directory
        print("Inside to clean", os.getcwd())
    # Caching the exception
        print("Something wrong with specified\
        directory. Exception- ", sys.exc_info())
    # handling with finally
        print("Restoring the path")
        print("Current directory is-", os.getcwd())

The next step is to define a program that will trim the YouTube videos.

def youtube_trim(url,start,end):
    #cancel previous youtube
    #download youtube
    #download_youtube(url) # with youtube-dl (slow)
    # Get the current working directory
    parent_dir = os.getcwd()
    # Trim the video (start, end) seconds
    start =  start
    end =  end
    #Note: the trimmed video must have face on all frames
    interval = time_between(start, end)
    trimmed_video= parent_dir+'/sample_data/input_video.mp4'
    trimmed_audio= parent_dir+'/sample_data/input_audio.mp3'
    #delete trimmed if already exits
    # cut the video  
    call(["ffmpeg","-y","-i",input_videos,"-ss", start,"-t",interval,"-async","1",trimmed_video])
    # cut the audio
    call(["ffmpeg","-i",trimmed_video, "-q:a", "0", "-map","a",trimmed_audio])
    #Preview trimmed video
    print("Trimmed Video+Audio")
    return trimmed_video, trimmed_audio

Step 7- Simple check of pandas and NumPy versions

print("In our enviroment")
os.system('{} pip show pandas numpy'.format(env))
In our enviroment
Name: pandas
Version: 1.3.5
Summary: Powerful data structures for data analysis, time series, and statistics
Author: The Pandas Development Team
Author-email: [email protected]
License: BSD-3-Clause
Location: /home/ec2-user/anaconda3/envs/VideoMessage/lib/python3.7/site-packages
Requires: numpy, python-dateutil, pytz
Required-by: gradio, TTS
Name: numpy
Version: 1.19.5
Summary: NumPy is the fundamental package for array computing with Python.
Author: Travis E. Oliphant et al.
License: BSD
Location: /home/ec2-user/anaconda3/envs/VideoMessage/lib/python3.7/site-packages
Required-by: gradio, librosa, matplotlib, numba, opencv-contrib-python, opencv-python, pandas, resampy, scikit-learn, scipy, tensorboardX, torchvision, TTS, umap-learn

I need to select a custom version of OpenCV.

if is_first_time:
    os.system('{} pip install opencv-contrib-python-headless=='.format(env))

Step 8 - Libraries for voice recognition

We need to clone the voice from the YouTube clip, in order to reproduce the most real possible speech.

import sys
VOICE_PATH = "utils/"
# add libraries into environment
sys.path.append(VOICE_PATH) # set this if VOICE is not installed globally

Finally, we will import the last dependency with the following ::

from utils.voice import *

If was imported, then we have loaded all the essential modules required to run this program. The next step is the creation of the main program .

Step 9 - Video creation

In this part, we extract the trimmed audio and video, from the trimmed audio we extract the embeddings of the original sound, and decode and encode the sound by using our input text. Then from this new sound spectrum, we detect all the faces of the trimmed video clip, and then replaced it with a new mouth that is synchronized with the new sound spectrum. In a few words, we replace the original sound speech with the cloned voice in the new video.

def create_video(Text,Voicetoclone):
    clonned_audio = os.path.join(current_dir, out_audio)
    #Start Crunching and Preview Output
    #Note: Only change these, if you have to
    pad_top =  0#@param {type:"integer"}
    pad_bottom =  10#@param {type:"integer"}
    pad_left =  0#@param {type:"integer"}
    pad_right =  0#@param {type:"integer"}
    rescaleFactor =  1#@param {type:"integer"}
    nosmooth = False #@param {type:"boolean"}
    out_name ="result_voice.mp4"
    if nosmooth == False:
        is_command_ok = os.system('{} cd Wav2Lip && python --checkpoint_path checkpoints/wav2lip_gan.pth --face "../sample_data/input_video.mp4" --audio "../out/clonned_audio.wav" --outfile {} --pads {} {} {} {} --resize_factor {}'.format(env,out_file,pad_top ,pad_bottom ,pad_left ,pad_right ,rescaleFactor))
        is_command_ok = os.system('{} cd Wav2Lip && python --checkpoint_path checkpoints/wav2lip_gan.pth --face "../sample_data/input_video.mp4" --audio "../out/clonned_audio.wav" --outfile {} --pads {} {} {} {} --resize_factor {} --nosmooth'.format(env,out_file,pad_top ,pad_bottom ,pad_left ,pad_right ,rescaleFactor))

    if is_command_ok > 0:
        print("Error : Ensure the video contains a face in all the frames.")
        return  out_file
    print("Creation of Video done")
    return out_name

Step 10 - Test trimmed video

Now that we have finished writing all the pieces of the program, let’s choose one example video, from The King’s Speech: King Charles III, we trim the video,

URL = ''
#Testing first time
trimmed_video, trimmed_audio=youtube_trim(URL,"00:00:01","00:00:10")

Trimmed Video+Audio

size=     165kB time=00:00:09.01 bitrate= 150.2kbits/s speed=76.7x    
video:0kB audio:165kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 0.210846%

Step 11 - Add text + cloned voice + lib motion

From the trimmed video The King’s Speech: King Charles III we process everything together.

Text=' I am clonning your voice. Charles!. Machine intelligence is the last invention that humanity will ever need to make.'
#Testing first time
#Preview output video
print("Final Video Preview")
final_video= parent_dir+'/'+outvideo
print("Dowload this video from", final_video)

During its creation, it was created the synthesized sound

path url
 > text:  I am clonning your voice. Charles!. Machine intelligence is the last invention that humanity will ever need to make.
Generated Audio

in the following path:

 > Saving output to out/clonned_audio.wav

then it was created the video

Using cuda for inference.
Reading video frames...
Number of frames available for inference: 225
(80, 605)
Length of mel chunks: 186
Load checkpoint from: checkpoints/wav2lip_gan.pth
Model loaded
Creation of Video done
Final Video Preview
Dowload this video from /home/ec2-user/SageMaker/VideoMessageGen/result_voice.mp4

Step 12 - Creation of modules for the quality control

Here we create the modules needed for the web application.

def time_format_check(input1):
    timeformat = "%H:%M:%S"
        validtime = datetime.strptime(input1, timeformat)
        print("The time format is valid", input1)
        #Do your logic with validtime, which is a valid format
        return False
    except ValueError:
        print("The time {}  has not valid format hh:mm:ss".format(input1))
        return True

We need to count the time in seconds, so we define the following

def to_seconds(datetime_obj):
    from datetime import datetime
    time =datetime_obj
    date_time = datetime.strptime(time, "%H:%M:%S")
    a_timedelta = date_time - datetime(1900, 1, 1)
    seconds = a_timedelta.total_seconds()
    return seconds

and finally, we require to define the YouTube URL validator

def validate_youtube(url):
    #This creates a youtube objet
        yt = YouTube(url)  
    except Exception:
        print("Hi there URL seems invalid")
        return True, 0
    #This will return the length of the video in sec as an int
    video_length = yt.length
    if    video_length > 600:
        print("Your video is larger than 10 minutes")
        return True, video_length
        print("Your video is less than 10 minutes")
        return False, video_length

Step 13 - Creation of the main module

The following function give us the the final video.

def video_generator(text_to_say,url,initial_time,final_time):
    print('Checking the url',url)
    check1, video_length = validate_youtube(url)
    if check1 is True: return "./demo/tryagain2.mp4"
    check2 = validate_time(initial_time,final_time, video_length)
    if check2 is True: return "./demo/tryagain0.mp4"
    trimmed_video, trimmed_audio=youtube_trim(url