Project Overview
This project transforms PDF files into spoken audio using Python. The goal was to create a lightweight, dependency-minimal script that extracts text from PDFs and converts it into natural-sounding speech, effectively generating a “free audiobook.” The result is a clean, self-contained tool that demonstrates Python automation, API integration, and a deliberately simple design.
Core Technologies
- Python: scripting and automation
- pypdf: text extraction from PDF pages
- gTTS: Python library that calls Google Translate's text-to-speech API for audio generation
- argparse: command-line interface for configuration (Python standard library)
Technical Execution
The script reads each page of a PDF and sends its extracted text to Google's text-to-speech service through gTTS. Audio is generated per page, producing individual MP3 files named sequentially (page_001.mp3, page_002.mp3, etc.).
This page-by-page approach avoids long-text API limits while keeping processing simple and efficient.
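The page-by-page flow described above could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names (page_filename, extract_page_texts, synthesize_with_gtts, convert_pages) are hypothetical, and skipping blank pages while keeping the original page number in the file name is an assumption.

```python
from pathlib import Path
from typing import Callable, Iterable, List


def page_filename(page_number: int) -> str:
    """Return the sequential file name for a 1-based page number."""
    return f"page_{page_number:03d}.mp3"


def extract_page_texts(pdf_path: str) -> List[str]:
    """Extract text from every page; pages without a text layer yield ''."""
    # Imported lazily so the other helpers can be used without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    return [(page.extract_text() or "") for page in reader.pages]


def synthesize_with_gtts(text: str, out_path: Path,
                         lang: str = "en", slow: bool = False) -> None:
    """Convert one page of text to speech with gTTS and save it as an MP3."""
    # Lazy import; this call needs network access at run time.
    from gtts import gTTS

    gTTS(text=text, lang=lang, slow=slow).save(str(out_path))


def convert_pages(
    page_texts: Iterable[str],
    out_dir: Path,
    synthesize: Callable[[str, Path], None] = synthesize_with_gtts,
) -> List[Path]:
    """Write one MP3 per non-empty page and return the paths created."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written: List[Path] = []
    for number, text in enumerate(page_texts, start=1):
        if not text.strip():
            continue  # assumption: blank pages are skipped, numbering is kept
        target = out_dir / page_filename(number)
        synthesize(text, target)
        written.append(target)
    return written
```

With both libraries installed, a call such as convert_pages(extract_page_texts("book.pdf"), Path("audio")) would write the per-page files into an audio directory. The synthesize parameter is injectable so the loop can be exercised without network access.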
The implementation emphasizes clean structure and minimalism, relying on only two external libraries and using Python's built-in argparse for flexibility. It gracefully handles missing text, empty PDFs, and scanned (non-text) pages, making it robust for general use.
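A command-line interface along these lines could expose the configuration. The flag names (--out-dir, --lang, --slow) and their defaults are assumptions chosen for illustration; the script's real options may differ.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a hypothetical CLI for the PDF-to-audio script."""
    parser = argparse.ArgumentParser(
        description="Convert a text-based PDF into one MP3 file per page."
    )
    parser.add_argument("pdf", help="path to the input PDF")
    parser.add_argument("-o", "--out-dir", default="audio",
                        help="directory for the generated MP3 files")
    parser.add_argument("--lang", default="en",
                        help="language code passed to the TTS library")
    parser.add_argument("--slow", action="store_true",
                        help="read the text at the slower speed")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(f"Converting {args.pdf} -> {args.out_dir} (lang={args.lang}, slow={args.slow})")
```

The speed option is modeled as a boolean because gTTS exposes a slow flag rather than a continuous rate.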
Problem-Solving & Design Considerations
One of the key challenges was ensuring compatibility with both text-based and scanned PDFs. Because pypdf extracts only embedded text and performs no OCR on page images, the solution recommends an OCR pre-processing step if text is not detected. This decision keeps the project modular: the TTS component remains independent of OCR complexity.
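One way to detect the "no text found" case is to check the extracted page texts before any audio is generated. The sketch below is an assumption about how such a check might look; has_text_layer, ocr_hint, the min_chars threshold, and the message wording are all hypothetical.

```python
from typing import Optional, Sequence


def has_text_layer(page_texts: Sequence[str], min_chars: int = 1) -> bool:
    """Return True if at least one page has min_chars or more non-blank characters."""
    return any(len(text.strip()) >= min_chars for text in page_texts)


def ocr_hint(page_texts: Sequence[str]) -> Optional[str]:
    """Return a user-facing message when conversion cannot proceed, else None."""
    if not page_texts:
        return "The PDF contains no pages."
    if not has_text_layer(page_texts):
        return ("No extractable text was found. The PDF may be scanned; "
                "run it through an OCR tool first, then retry.")
    return None
```

Keeping the check separate from the text-to-speech step means an OCR stage could later be slotted in without touching the audio code.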
Another design decision was to output one MP3 per page. While concatenating them into a single file is possible, separating audio per page keeps dependencies light (no ffmpeg/pydub) and processing faster for users who only need certain chapters or sections.
Key Features
- ✅ Converts any text-based PDF into natural speech
- ✅ Creates one MP3 file per page for modular listening
- ✅ Supports language and speed options
- ✅ Simple, dependency-light architecture
Demonstrated Skill Set
- Proficiency in Python scripting and CLI design
- Integration with web-based APIs (HTTP request patterns)
- Pragmatic problem-solving and modular thinking
- Balancing simplicity, maintainability, and real-world utility
Reflection
The key takeaway from this project was learning to minimize developer friction through fewer dependencies, clearer logic, and more maintainable code. Future improvements could include adding optional OCR for scanned PDFs, multi-language support, and the ability to compile a single audiobook file.