Simulanics AI Labs Blog

Transcribing Audio from Video Files Using Python

Introduction

Transcription is the process of converting spoken language into written text. While it might seem straightforward, the applications and importance of this technology are far-reaching and significant in today’s data-driven world. Transcribing audio can be invaluable in fields such as journalism, legal services, customer support, and academic research, among many others.

Why Is Transcription Important?

  1. Accessibility: Transcripts make content more accessible to people with hearing impairments.
  2. Data Analysis: Transcribed text can be easier to analyze, sort, and search through compared to audio or video formats.
  3. Content Discovery: It enhances SEO capabilities, making it easier for people to find your content.
  4. Multilingual Services: Transcripts can be translated into multiple languages more readily than audio.
  5. Learning and Development: It can be used in educational settings for better retention and understanding of the content.

The Code

To begin, we need to install the required Python libraries. Run the following command:

pip install pydub speechrecognition moviepy



Code Breakdown

Here is a breakdown of the code, section by section:

Importing Libraries

import argparse
import speech_recognition as sr
import os
from pydub import AudioSegment
from pydub.silence import split_on_silence
import moviepy.editor as mp



This section imports all the required libraries: argparse for command-line arguments, speech_recognition for converting audio to text, pydub (and its split_on_silence helper) for audio manipulation, os for file and path handling, and moviepy for extracting the audio track from the video.

Function: video_to_audio

def video_to_audio(in_path):
    """Convert video file to audio file"""



This function takes a video file path as input and converts it into an audio file in WAV format, returning the path to the new file.
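Inside the function (shown in full in the script below), the output name is derived from the video's base name, the extension is swapped for .wav, and moviepy writes out the audio track:

# Extract base name and change extension to .wav
base_name = os.path.basename(in_path)
file_name, _ = os.path.splitext(base_name)
out_path = f"{file_name}.wav"

# Extract the audio track with moviepy and save it as WAV
video = mp.VideoFileClip(in_path)
video.audio.write_audiofile(out_path)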

Function: large_audio_to_text

def large_audio_to_text(path):
    """Split audio into chunks and apply speech recognition"""



Here, the audio is split into chunks at points of silence, which makes long recordings much easier for the speech recognition engine to process. Each chunk is exported as its own WAV file, transcribed to text, and the results are joined into a single transcript.
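The central calls, taken from the full script below, split the audio wherever a silence of at least 700 ms is detected and then run the recognizer on each exported chunk:

# Split audio where silence is 700 ms or greater and get chunks
chunks = split_on_silence(sound, min_silence_len=700,
                          silence_thresh=sound.dBFS - 14, keep_silence=700)

# Recognize an exported chunk
with sr.AudioFile(chunk_filename) as source:
    audio_listened = r.record(source)
    text = r.recognize_google(audio_listened)

recognize_google uses Google's free Web Speech API, so an internet connection is required; chunks the engine cannot understand raise sr.UnknownValueError, which the script catches and skips.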

Main Execution

# Create a speech recognition object
r = sr.Recognizer()

# Video to audio to text
audio_path = video_to_audio(args.in_video)
result = large_audio_to_text(audio_path)



In the main part of the script, we use argparse to read the video file path from the command line. The path is passed through the two functions above to produce the transcribed text.
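The argument parsing itself is only three lines:

# Use argparse for command line arguments
parser = argparse.ArgumentParser(description='Convert video speech to text.')
parser.add_argument('in_video', type=str, help='Input video file path')
args = parser.parse_args()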

How to Use

To execute this script, use the following command:

python transcription.py video_name.mp4

When the run finishes, the transcript is printed to the shell and also written to result.txt in the working directory; the exported chunks remain in the audio-chunks folder.



Full Script

# Usage: python transcription.py video_name.mp4

import argparse
import speech_recognition as sr
import os
from pydub import AudioSegment
from pydub.silence import split_on_silence
import moviepy.editor as mp


def video_to_audio(in_path):
    """Convert video file to audio file"""

    # Extract base name and change extension to .wav
    base_name = os.path.basename(in_path)
    file_name, _ = os.path.splitext(base_name)
    out_path = f"{file_name}.wav"

    video = mp.VideoFileClip(in_path)
    video.audio.write_audiofile(out_path)

    return out_path


def large_audio_to_text(path):
    """Split audio into chunks and apply speech recognition"""

    # Open audio file with pydub
    sound = AudioSegment.from_wav(path)

    # Split where silence lasts at least 700 ms; anything quieter than
    # 14 dB below the clip's average loudness counts as silence, and
    # 700 ms of silence is kept at the chunk boundaries
    chunks = split_on_silence(sound, min_silence_len=700, silence_thresh=sound.dBFS - 14, keep_silence=700)

    # Create folder to store audio chunks
    folder_name = "audio-chunks"
    if not os.path.isdir(folder_name):
        os.mkdir(folder_name)

    whole_text = ""
    # Process each chunk
    for i, audio_chunk in enumerate(chunks, start=1):
        # Export chunk and save in folder
        chunk_filename = os.path.join(folder_name, f"chunk{i}.wav")
        audio_chunk.export(chunk_filename, format="wav")

        # Recognize chunk
        with sr.AudioFile(chunk_filename) as source:
            audio_listened = r.record(source)
            # Convert to text
            try:
                text = r.recognize_google(audio_listened)
            except sr.UnknownValueError as e:
                print("Error:", str(e))
            else:
                text = f"{text.capitalize()}. "
                print(chunk_filename, ":", text)
                whole_text += text

    # Return text for all chunks
    return whole_text


# Create a speech recognition object
r = sr.Recognizer()

# Use argparse for command line arguments
parser = argparse.ArgumentParser(description='Convert video speech to text.')
parser.add_argument('in_video', type=str, help='Input video file path')

args = parser.parse_args()

# Video to audio to text
audio_path = video_to_audio(args.in_video)
result = large_audio_to_text(audio_path)

# Print to shell and to result.txt
print(result)
with open('result.txt', 'w') as f:
    print(result, file=f)

If you have any questions or need further assistance, feel free to drop me a message.

Happy coding!