Simulanics AI Labs Blog

Transcribing Audio from Video Files Using Python

Introduction

Transcription is the process of converting spoken language into written text. While it might seem straightforward, the applications and importance of this technology are far-reaching and significant in today’s data-driven world. Transcribing audio can be invaluable in fields such as journalism, legal services, customer support, and academic research, among many others.

Why Is Transcription Important?

  1. Accessibility: Transcripts make content more accessible to people with hearing impairments.
  2. Data Analysis: Transcribed text can be easier to analyze, sort, and search through compared to audio or video formats.
  3. Content Discovery: It enhances SEO capabilities, making it easier for people to find your content.
  4. Multilingual Services: Transcripts can be translated into multiple languages more readily than audio.
  5. Learning and Development: It can be used in educational settings for better retention and understanding of the content.

The Code

To begin, we need to install the required Python libraries. Run the following command:

pip install pydub speechrecognition moviepy



Code Breakdown

Here is a breakdown of the code, section by section:

Importing Libraries

import argparse
import speech_recognition as sr
import os
from pydub import AudioSegment
from pydub.silence import split_on_silence
import moviepy.editor as mp



This section imports all the required libraries: argparse for command-line arguments, speech_recognition for converting audio to text, pydub (and its split_on_silence helper) for audio manipulation, os for file and path handling, and moviepy for extracting the audio track from the video.

Function: video_to_audio

def video_to_audio(in_path):
    """Convert video file to audio file"""



This function takes a video file path as input and converts it into an audio file in WAV format, returning the path to the new file.
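Inside the function (shown in full in the script below), the output name is derived from the video's base name, the extension is swapped for .wav, and moviepy writes out the audio track:

# Extract base name and change extension to .wav
base_name = os.path.basename(in_path)
file_name, _ = os.path.splitext(base_name)
out_path = f"{file_name}.wav"

# Extract the audio track with moviepy and save it as WAV
video = mp.VideoFileClip(in_path)
video.audio.write_audiofile(out_path)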

Function: large_audio_to_text

def large_audio_to_text(path):
    """Split audio into chunks and apply speech recognition"""



Here, the audio is split into chunks at points of silence, which makes long recordings much easier for the speech recognition engine to process. Each chunk is exported as its own WAV file, transcribed to text, and the results are joined into a single transcript.
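The central calls, taken from the full script below, split the audio wherever a silence of at least 700 ms is detected and then run the recognizer on each exported chunk:

# Split audio where silence is 700 ms or greater and get chunks
chunks = split_on_silence(sound, min_silence_len=700,
                          silence_thresh=sound.dBFS - 14, keep_silence=700)

# Recognize an exported chunk
with sr.AudioFile(chunk_filename) as source:
    audio_listened = r.record(source)
    text = r.recognize_google(audio_listened)

recognize_google uses Google's free Web Speech API, so an internet connection is required; chunks the engine cannot understand raise sr.UnknownValueError, which the script catches and skips.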

Main Execution

# Create a speech recognition object
r = sr.Recognizer()

# Video to audio to text
audio_path = video_to_audio(args.in_video)
result = large_audio_to_text(audio_path)



In the main part of the script, we use argparse to read the video file path from the command line. The path is passed through the two functions above to produce the transcribed text.
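The argument parsing itself is only three lines:

# Use argparse for command line arguments
parser = argparse.ArgumentParser(description='Convert video speech to text.')
parser.add_argument('in_video', type=str, help='Input video file path')
args = parser.parse_args()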

How to Use

To execute this script, use the following command:

python transcription.py video_name.mp4

When the run finishes, the transcript is printed to the shell and also written to result.txt in the working directory; the exported chunks remain in the audio-chunks folder.



Full Script

# Usage: python transcription.py video_name.mp4

import argparse
import speech_recognition as sr
import os
from pydub import AudioSegment
from pydub.silence import split_on_silence
import moviepy.editor as mp


def video_to_audio(in_path):
    """Convert video file to audio file"""

    # Extract base name and change extension to .wav
    base_name = os.path.basename(in_path)
    file_name, _ = os.path.splitext(base_name)
    out_path = f"{file_name}.wav"

    video = mp.VideoFileClip(in_path)
    video.audio.write_audiofile(out_path)

    return out_path


def large_audio_to_text(path):
    """Split audio into chunks and apply speech recognition"""

    # Open audio file with pydub
    sound = AudioSegment.from_wav(path)

    # Split where silence lasts at least 700 ms; anything quieter than
    # 14 dB below the clip's average loudness counts as silence, and
    # 700 ms of silence is kept at the chunk boundaries
    chunks = split_on_silence(sound, min_silence_len=700, silence_thresh=sound.dBFS - 14, keep_silence=700)

    # Create folder to store audio chunks
    folder_name = "audio-chunks"
    if not os.path.isdir(folder_name):
        os.mkdir(folder_name)

    whole_text = ""
    # Process each chunk
    for i, audio_chunk in enumerate(chunks, start=1):
        # Export chunk and save in folder
        chunk_filename = os.path.join(folder_name, f"chunk{i}.wav")
        audio_chunk.export(chunk_filename, format="wav")

        # Recognize chunk
        with sr.AudioFile(chunk_filename) as source:
            audio_listened = r.record(source)
            # Convert to text
            try:
                text = r.recognize_google(audio_listened)
            except sr.UnknownValueError as e:
                print("Error:", str(e))
            else:
                text = f"{text.capitalize()}. "
                print(chunk_filename, ":", text)
                whole_text += text

    # Return text for all chunks
    return whole_text


# Create a speech recognition object
r = sr.Recognizer()

# Use argparse for command line arguments
parser = argparse.ArgumentParser(description='Convert video speech to text.')
parser.add_argument('in_video', type=str, help='Input video file path')

args = parser.parse_args()

# Video to audio to text
audio_path = video_to_audio(args.in_video)
result = large_audio_to_text(audio_path)

# Print to shell and to result.txt
print(result)
with open('result.txt', 'w') as f:
    print(result, file=f)

If you have any questions or need further assistance, feel free to drop me a message.

Happy coding!