Python Script for Analyzing Media File Audio Languages

This Python script scans a specified directory for media files, analyzes their audio streams using ffprobe (part of FFmpeg), and generates a report categorizing files based on their audio language. It separates files with non-English audio, other language audio (or no audio), and provides a detailed report of all audio streams. The script is designed to run on Debian 12.
Key Features
  • Comprehensive Language Detection: Identifies English (‘eng’), specific non-English languages, and undefined (‘und’) audio streams.
  • Recursive Scanning: Processes all media files in the specified directory and its subdirectories.
  • Detailed Reporting: Provides both a summary of non-English/undefined files and a detailed breakdown of all audio streams.
  • Robust Error Handling: Skips problematic files and continues processing, with clear error messages.
  • Customizable: Media file extensions can be modified in the media_extensions set.
Script Summary:
  • Uses ffprobe to analyze media file streams
  • Supports common video formats (.mp4, .mkv, .avi, .mov, .wmv, .flv, .m4v)
  • Recursively scans all subdirectories
  • Creates a text file (no_english_audio.txt) with results
  • Handles errors gracefully
  • Prompts for directory path (defaults to current directory if Enter is pressed)
The output file will contain:
  • A list of file paths for media files without English audio streams
  • Or a message indicating all files have English audio if none are found without it
Install Dependencies
First, install the required dependencies:
sudo apt update
sudo apt install ffmpeg python3
Save the script to a file (e.g., check_audio.py)
Script:
#!/usr/bin/env python3

import os
import subprocess
import json
import sys
from pathlib import Path

def get_audio_streams(file_path):
    """
    Get detailed information about audio streams in a media file
    Returns list of dictionaries containing stream info
    """
    try:
        cmd = [
            'ffprobe',
            '-v', 'error',
            '-show_streams',
            '-print_format', 'json',
            str(file_path)
        ]
        result = subprocess.run(cmd, capture_output=True, text=True, check=True)
        
        data = json.loads(result.stdout)
        audio_streams = []
        
        for stream in data.get('streams', []):
            if stream.get('codec_type') == 'audio':
                stream_info = {
                    'index': stream.get('index', 'unknown'),
                    'codec': stream.get('codec_name', 'unknown'),
                    'language': stream.get('tags', {}).get('language', 'und'),
                    'channels': stream.get('channels', 'unknown')
                }
                audio_streams.append(stream_info)
        
        return audio_streams
    
    except (subprocess.CalledProcessError, json.JSONDecodeError) as e:
        print(f"Error processing {file_path}: {e}")
        return []

def scan_directory(directory_path, output_file):
    """
    Scan directory for media files and analyze audio stream languages
    """
    media_extensions = {'.mp4', '.mkv', '.avi', '.mov', '.wmv', '.flv', '.m4v'}
    directory = Path(directory_path)
    
    if not directory.is_dir():
        print(f"Error: {directory_path} is not a valid directory")
        return
    
    # Lists for results
    no_english_files = []
    undefined_lang_files = []
    detailed_report = []
    
    # Scan directory
    for file_path in directory.rglob('*'):
        if file_path.is_file() and file_path.suffix.lower() in media_extensions:
            print(f"Analyzing: {file_path}")
            audio_streams = get_audio_streams(file_path)
            
            # Build detailed report entry
            file_entry = f"File: {file_path}\n"
            if audio_streams:
                file_entry += f"  Found {len(audio_streams)} audio stream(s):\n"
                has_english = False
                all_undefined = True
                
                for stream in audio_streams:
                    lang = stream['language'].lower()
                    file_entry += f"    Stream {stream['index']}: {lang} ({stream['codec']}, {stream['channels']} channels)\n"
                    if lang == 'eng':
                        has_english = True
                    if lang != 'und':
                        all_undefined = False
                
                # Categorize the file
                if not has_english:
                    if all_undefined:
                        undefined_lang_files.append(str(file_path))
                    else:
                        no_english_files.append(str(file_path))
            else:
                file_entry += "  No audio streams found\n"
                undefined_lang_files.append(str(file_path))  # Treat no audio as undefined
            
            detailed_report.append(file_entry)
    
    # Write results
    try:
        with open(output_file, 'w') as f:
            # Summary of files without English (specific non-English languages)
            f.write("=== Files With Non-English Audio (Excluding Undefined) ===\n")
            if no_english_files:
                f.write(f"Found {len(no_english_files)} file(s) with specific non-English audio:\n")
                f.write("\n".join(no_english_files))
                f.write("\n\n")
            else:
                f.write("No files found with specific non-English audio.\n\n")
            
            # Summary of files with undefined language
            f.write("=== Files With Undefined Language Audio (or No Audio) ===\n")
            if undefined_lang_files:
                f.write(f"Found {len(undefined_lang_files)} file(s) with undefined language audio:\n")
                f.write("\n".join(undefined_lang_files))
                f.write("\n\n")
            else:
                f.write("No files found with undefined language audio.\n\n")
            
            # Detailed report
            f.write("=== Detailed Audio Stream Report ===\n")
            f.write("\n".join(detailed_report))
        
        print(f"Results written to {output_file}")
    except IOError as e:
        print(f"Error writing to output file: {e}")

def main():
    if len(sys.argv) != 2:
        print("Usage: ./check_audio.py <directory_path>")
        print("Example: ./check_audio.py /path/to/media")
        sys.exit(1)
    
    directory_path = sys.argv[1]
    output_file = "audio_language_report.txt"
    
    try:
        subprocess.run(['ffprobe', '-version'], capture_output=True, check=True)
    except (subprocess.CalledProcessError, FileNotFoundError):
        print("Error: FFmpeg is not installed. Please install it using 'sudo apt install ffmpeg'")
        sys.exit(1)
    
    scan_directory(directory_path, output_file)

if __name__ == "__main__":
    main()
Make the script executable:
chmod +x check_audio.py
Run the script:
./check_audio.py /path/to/media
Example output in audio_language_report.txt:
=== Files With Non-English Audio (Excluding Undefined) ===
Found 1 file(s) with specific non-English audio:
/path/to/video1.mkv

=== Files With Undefined Language Audio (or No Audio) ===
Found 2 file(s) with undefined language audio:
/path/to/video2.mp4
/path/to/video4.avi

=== Detailed Audio Stream Report ===
File: /path/to/video1.mkv
  Found 1 audio stream(s):
    Stream 1: spa (aac, 2 channels)

File: /path/to/video2.mp4
  Found 1 audio stream(s):
    Stream 1: und (mp3, 2 channels)

File: /path/to/video3.mkv
  Found 2 audio stream(s):
    Stream 1: eng (aac, 6 channels)
    Stream 2: jpn (aac, 2 channels)

File: /path/to/video4.avi
  Found 0 audio stream(s):
    No audio streams found
Step-by-Step Breakdown
Script Initialization and Dependencies
  • The script uses Python 3, which is included with Debian 12.
  • Requires FFmpeg (ffprobe) to analyze media files. Install it with:
sudo apt update
sudo apt install ffmpeg
  • Imports necessary Python modules: os, subprocess, json, sys, and pathlib.Path.
  • The script is executed with a command-line argument specifying the directory to scan.
Command-Line Argument Handling
  • The script expects a single command-line argument: the path to the directory to scan.
  • Usage example:
./check_audio.py /path/to/media
  • If no or incorrect arguments are provided, it displays usage instructions and exits:
Usage: ./check_audio.py <directory_path>
Example: ./check_audio.py /path/to/media
  • The output report is saved to a file named audio_language_report.txt.
FFmpeg Availability Check
  • Verifies that ffprobe is installed by running:
ffprobe -version
  • If FFmpeg is not installed, the script exits with an error message instructing the user to install it.
Audio Stream Analysis (get_audio_streams Function)
  • Uses ffprobe to extract stream information from a media file in JSON format.
  • Command executed:
ffprobe -v error -show_streams -print_format json <file_path>
  • Parses the JSON output to identify audio streams.
  • For each audio stream, collects:

    • Stream index
    • Codec name (e.g., aac, mp3)
    • Language tag (defaults to ‘und’ if undefined)
    • Number of channels
  • Returns a list of dictionaries containing stream details or an empty list if an error occurs (e.g., file corruption or invalid format).
Directory Scanning (scan_directory Function)
  • Accepts the directory path and output file name as parameters.
  • Supports common media file extensions: .mp4, .mkv, .avi, .mov, .wmv, .flv, .m4v.
  • Recursively scans the directory using Path.rglob to find all media files.
  • For each file:
    • Calls get_audio_streams to retrieve audio stream details.
    • Builds a detailed report entry listing all audio streams, including their language, codec, and channels.
    • Categorizes the file based on its audio streams:
      • Files with English audio: If any stream has language ‘eng’, the file is excluded from summary lists.
      • Files with non-English audio: If no ‘eng’ stream exists and at least one stream has a specific language (e.g., ‘spa’, ‘fre’), the file is added to no_english_files.
      • Files with undefined language or no audio: If all streams are ‘und’ (undefined) or no audio streams exist, the file is added to undefined_lang_files.
Output Generation
Writes results to audio_language_report.txt in three sections:
  • Files With Non-English Audio (Excluding Undefined):
    • Lists files with specific non-English languages (e.g., Spanish, French).
    • Example:
Found 1 file(s) with specific non-English audio:
/path/to/video1.mkv
  • Files With Undefined Language Audio (or No Audio)
    • Lists files with only ‘und’ language tags or no audio streams.
    • Example:
Found 2 file(s) with undefined language audio:
/path/to/video2.mp4
/path/to/video4.avi
  • Detailed Audio Stream Report:
    • Lists all files with their audio stream details.
    • Example:
File: /path/to/video1.mkv
  Found 1 audio stream(s):
    Stream 1: spa (aac, 2 channels)
File: /path/to/video2.mp4
  Found 1 audio stream(s):
    Stream 1: und (mp3, 2 channels)
File: /path/to/video3.mkv
  Found 2 audio stream(s):
    Stream 1: eng (aac, 6 channels)
    Stream 2: jpn (aac, 2 channels)
File: /path/to/video4.avi
  Found 0 audio stream(s):
    No audio streams found
  • Handles IO errors gracefully, printing an error message if the output file cannot be written.
Error Handling
  • Checks for valid directory input; exits if the directory is invalid.
  • Handles ffprobe errors (e.g., corrupted files) by skipping problematic files and logging errors.
  • Manages JSON parsing errors, ensuring the script continues processing other files.