A comprehensive Python tool for adding Bates numbers to PDF documents, commonly used in legal document management and discovery processes.
# Install with Poetry
poetry install
# Launch the web interface
poetry run streamlit run app.py
The web app opens at http://localhost:8501 with an intuitive drag & drop interface.
# Install the package
poetry install
# Use the bates command
poetry run bates --input document.pdf --bates-prefix "CASE-"
User-friendly GUI - No command-line experience required!
- Drag & drop file upload - Upload single or multiple PDFs with reordering support
- Configuration presets - Pre-configured for Legal Discovery, Confidential, Exhibits
- Real-time preview - See your Bates format before processing
- Live progress tracking - Real-time status updates with cancel button and individual file progress
- Instant downloads - Individual files or bundled ZIP archive
- Advanced customization - Logos, QR codes, borders, watermarks
- Logo upload - SVG, PNG, JPG, WEBP with flexible positioning
- QR code generation - Embed Bates numbers as scannable QR codes
- Border styling - 4 decorative border styles for separator pages
- Watermark support - Custom text overlays with opacity control
- Session persistence - Save and load configurations for repeated use
- Undo/Redo - Full undo/redo support for all configuration changes
- Keyboard shortcuts - Fast navigation and actions (Ctrl+Z, Ctrl+Y, Ctrl+S, etc.)
- OCR text extraction - Extract text from scanned PDFs (local Tesseract and cloud options)
- Pre-flight validation - Automatic PDF validation before processing
- Batch export formats - Export to JSON, CSV, Excel, and TIFF
- PDF preview panel - View PDF pages before processing
- Page rotation - Rotate pages during processing
- Bates validation - Real-time validation of Bates number formats
- Performance optimizations - 10-15x faster processing with parallel execution
- Processing history - View past processing jobs and their configurations
- Improved UI - Wider sidebar (420px), collapsible sections, professional design
- Responsive layout - Works on different screen sizes
- AI document analysis - Detect discrimination, problematic content, and extract metadata
- Default - Blank slate for custom configurations
- Legal Discovery - PLAINTIFF-PROD-000001 format
- Confidential - CONFIDENTIAL-0001-AEO format with red text
- Exhibit - EXHIBIT-101 format starting at 101
- Local - Run on your computer
- Network - Share on your local network
- Streamlit Cloud - Free cloud hosting
- Docker - Container deployment
- Self-hosted - Deploy on your own server (or run it on a MacBook)
Complete Web UI Guide - Installation, usage, deployment, and troubleshooting
- Add sequential Bates numbers to each page of a PDF
- Customizable prefix and suffix (e.g., "CASE123-0001-DRAFT")
- Preserve original PDF attributes, bookmarks, and metadata
- Support for password-protected PDFs
- Batch processing of multiple PDFs with continuous numbering
- Progress tracking for large documents with individual file status
- Real-time status updates - Live progress tracking with cancellation support (Web UI)
- Combine multiple PDFs into single file with continuous Bates numbering
- Index page generation - Professional document index for combined PDFs
- Bates number filenames - Name output files by first Bates number with CSV/PDF mappings
- Custom fonts - Support for TrueType (.ttf) and OpenType (.otf) fonts
- Logo placement - Upload and position logos on separator pages (SVG, PNG, JPG, WEBP)
- QR codes - Generate QR codes with Bates numbers on all pages or separators
- Watermarks - Add customizable text watermarks with opacity and rotation
- ZIP download - Download all processed files as a single archive
- Session persistence - Save and load processing configurations
- Undo/Redo - Full history tracking for configuration changes
- Keyboard shortcuts - Efficient keyboard navigation and actions
- OCR support - Extract text from scanned documents (local and cloud)
- Pre-flight validation - Automatic PDF health checks before processing
- Multi-format export - JSON, CSV, Excel, TIFF batch export options
- Drag-and-drop reordering - Reorder files before processing
- PDF preview - View PDF pages in-app before processing
- Page rotation - Rotate individual pages during processing
- Bates validation - Real-time format validation with helpful error messages
- Performance optimization - Parallel processing with 10-15x speed improvements
- AI-powered document analysis - Discrimination detection, problematic content identification, metadata extraction
- Position: Place Bates numbers at various positions on the page
- Font: Customize font family, size, color, and style (bold/italic) or upload custom fonts
- Date/Time: Include optional timestamp with Bates numbers
- Padding: Configure number padding (e.g., 4 digits: "0001")
- Formatting: Full control over prefix/suffix format
- Separator Pages: Add separator pages between documents showing Bates ranges with optional logos and borders
- Index Pages: Generate professional table of contents for combined documents
- Logos: Upload custom logos (SVG, PNG, JPG, WEBP) with 8 placement options
- QR Codes: Generate QR codes containing Bates numbers (all pages or separator only)
- Borders: Add decorative borders to separator pages (solid, dashed, double, asterisks)
- Watermarks: Overlay custom text with configurable opacity, rotation, and positioning
- Download Options: Individual files or bundled ZIP archive
- AI Analysis: Detect discrimination patterns, identify problematic content, extract document metadata (optional)
- Python 3.9 or higher (3.9.7 not supported due to Streamlit compatibility)
- Poetry (recommended) or pip
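The version requirement above can be checked before installing. The following is a minimal sketch, not part of the package; the function name `is_supported_python` is hypothetical and only encodes the rule stated here (3.9 or newer, excluding 3.9.7).

```python
import sys
from typing import Optional, Tuple


def is_supported_python(version: Optional[Tuple[int, int, int]] = None) -> bool:
    """Return True for Python 3.9+ except 3.9.7 (hypothetical helper)."""
    major, minor, micro = (version or sys.version_info)[:3]
    if (major, minor) < (3, 9):
        return False
    # 3.9.7 is excluded because of the Streamlit compatibility issue noted above.
    return (major, minor, micro) != (3, 9, 7)


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    verdict = "supported" if is_supported_python() else "not supported"
    print(f"Python {current} is {verdict}")
```

Running the script prints whether the current interpreter meets the requirement.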
# Clone the repository
git clone https://github.com/thepingdoctor/Bates-Labeler.git
cd Bates-Labeler
# Install with Poetry
poetry install
# This installs:
# - Core dependencies: pypdf, reportlab, tqdm
# - Web UI: streamlit
# - Dev tools: pytest, black, flake8, mypy (optional)
# Clone the repository
git clone https://github.com/thepingdoctor/Bates-Labeler.git
cd Bates-Labeler
# Install the package
pip install -e .
# Or install from PyPI (when published)
pip install bates-labeler
# Build the Docker image
docker build -t bates-labeler .
# Run the web interface
docker run -p 8501:8501 bates-labeler
# Access at http://localhost:8501
- pypdf ^4.0.0 - PDF manipulation
- reportlab ^4.0.7 - PDF generation
- tqdm ^4.66.1 - Progress bars
- streamlit ^1.28.0 - Web interface (optional for CLI-only use)
- pytesseract - OCR text extraction (optional, requires Tesseract installation)
- Pillow - Image processing for OCR and previews
- pandas - Export to Excel and CSV formats
- openpyxl - Excel file generation
Detailed Installation Guide - Poetry setup, publishing to PyPI, and more
For AI-powered document analysis, install additional dependencies:
# For OpenRouter (recommended)
pip install requests
# For Google Cloud Vertex AI
pip install google-cloud-aiplatform
# For Anthropic Claude
pip install anthropic
AI Features Documentation - Complete guide to AI document analysis
AI Setup Guide - Step-by-step configuration for AI providers
Web Interface - Best for:
- Non-technical users
- Visual configuration
- One-time or occasional use
- Seeing results immediately
Command Line - Best for:
- Automation and scripting
- Batch processing workflows
- Integration with other tools
- Repeated operations
Add Bates numbers to a single PDF:
poetry run bates --input "evidence.pdf" --bates-prefix "CASE123-"
This creates evidence_bates.pdf with Bates numbers like "CASE123-0001", "CASE123-0002", etc.
Custom Position and Formatting
poetry run bates \
--input "contract.pdf" \
--bates-prefix "SMITH-v-JONES-" \
--bates-suffix "-CONFIDENTIAL" \
--start-number 100 \
--position top-right \
--font-size 12 \
--font-color red \
--bold
Include Date Stamp
poetry run bates \
--input "deposition.pdf" \
--bates-prefix "DEP-" \
--include-date \
--date-format "%Y/%m/%d %H:%M" \
--position bottom-center
Batch Processing
Process multiple PDFs with continuous numbering:
poetry run bates \
--batch doc1.pdf doc2.pdf doc3.pdf \
--bates-prefix "DISCOVERY-" \
--output-dir "./bates_stamped/"
Password-Protected PDFs
poetry run bates \
--input "secured.pdf" \
--bates-prefix "SECURE-" \
--password "mypassword"
Or omit the password flag to be prompted securely:
poetry run bates --input "secured.pdf" --bates-prefix "SECURE-"
# You'll be prompted: PDF is password protected. Enter password:

| Option | Description | Default |
|---|---|---|
| --input, -i | Input PDF file path | Required* |
| --batch, -b | Batch process multiple PDFs | Required* |
| --output, -o | Output PDF file path | {input}_bates.pdf |
| --output-dir | Output directory for batch mode | Same as input |

*Either --input or --batch is required
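The default output name in the table can be illustrated with a short sketch. It assumes the `{input}_bates.pdf` pattern means "input file stem plus `_bates`", and that `--output-dir` only changes the folder; the helper `default_output_path` is hypothetical and not the tool's own code.

```python
from pathlib import Path
from typing import Optional


def default_output_path(input_path: str, output_dir: Optional[str] = None) -> Path:
    """Derive '<stem>_bates<ext>' next to the input, or inside output_dir (assumed rule)."""
    source = Path(input_path)
    target_dir = Path(output_dir) if output_dir else source.parent
    return target_dir / f"{source.stem}_bates{source.suffix}"


print(default_output_path("evidence.pdf"))                       # evidence_bates.pdf
print(default_output_path("docs/doc1.pdf", "./bates_stamped/"))  # bates_stamped/doc1_bates.pdf (POSIX separators)
```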

| Option | Description | Default |
|---|---|---|
| --bates-prefix | Prefix for Bates number | "" |
| --bates-suffix | Suffix for Bates number | "" |
| --start-number | Starting number | 1 |
| --padding | Number padding width | 4 |
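How the four numbering options combine can be sketched in a few lines. This is an illustration of the format described in this README (prefix, zero-padded number, suffix), not the package's internal code; `format_bates` is a hypothetical name.

```python
def format_bates(number: int, prefix: str = "", suffix: str = "", padding: int = 4) -> str:
    """Build a Bates label: prefix + zero-padded number + suffix."""
    # Zero-padding only pads; numbers wider than `padding` are kept in full.
    return f"{prefix}{number:0{padding}d}{suffix}"


print(format_bates(1, prefix="CASE123-", suffix="-DRAFT"))   # CASE123-0001-DRAFT
print(format_bates(1, prefix="PLAINTIFF-PROD-", padding=6))  # PLAINTIFF-PROD-000001
print(format_bates(101, prefix="EXHIBIT-", padding=3))       # EXHIBIT-101
```

The three calls reproduce the example formats used elsewhere in this document.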

| Option | Description | Default |
|---|---|---|
| --position | Position on page | bottom-right |

Available positions:
- top-left, top-center, top-right
- bottom-left, bottom-center, bottom-right
- center
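To make the position names concrete, the sketch below maps each one to an anchor point in PDF coordinates (points, origin at the bottom-left corner). The 36-point margin and the function `anchor_point` are assumptions for illustration; the tool's actual placement logic may differ.

```python
from typing import Tuple

POSITIONS = {
    "top-left", "top-center", "top-right",
    "bottom-left", "bottom-center", "bottom-right",
    "center",
}


def anchor_point(position: str, page_width: float, page_height: float,
                 margin: float = 36.0) -> Tuple[float, float]:
    """Return an (x, y) anchor in points for a named position (illustrative only)."""
    if position not in POSITIONS:
        raise ValueError(f"Unknown position: {position!r}")
    if position == "center":
        return page_width / 2, page_height / 2
    vertical, horizontal = position.split("-")
    x = {"left": margin, "center": page_width / 2, "right": page_width - margin}[horizontal]
    y = margin if vertical == "bottom" else page_height - margin
    return x, y


# US Letter is 612 x 792 points.
print(anchor_point("bottom-right", 612, 792))  # (576.0, 36.0)
print(anchor_point("top-center", 612, 792))    # (306.0, 756.0)
```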

| Option | Description | Default |
|---|---|---|
| --font-name | Font family | Helvetica |
| --font-size | Font size in points | 10 |
| --font-color | Color name or hex | black |
| --bold | Use bold font | False |
| --italic | Use italic font | False |

Available fonts: Helvetica, Times-Roman, Courier
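The "color name or hex" behavior, with the fallback to black reported by the invalid-color warning, can be sketched as below. The set of named colors, the `#RRGGBB` hex form, and the function `parse_color` are assumptions; the package may accept more names or other hex forms.

```python
import re
from typing import Tuple

# A deliberately small table of names; the real tool may recognize more.
NAMED_COLORS = {
    "black": (0.0, 0.0, 0.0),
    "red": (1.0, 0.0, 0.0),
    "blue": (0.0, 0.0, 1.0),
}
HEX_PATTERN = re.compile(r"#?([0-9a-f]{6})")


def parse_color(value: str) -> Tuple[float, float, float]:
    """Turn a color name or RRGGBB hex string into RGB floats in the range 0..1."""
    key = value.strip().lower()
    if key in NAMED_COLORS:
        return NAMED_COLORS[key]
    match = HEX_PATTERN.fullmatch(key)
    if match:
        digits = match.group(1)
        red, green, blue = (int(digits[i:i + 2], 16) / 255 for i in (0, 2, 4))
        return red, green, blue
    # Mirrors the documented behavior: warn and fall back to black.
    print(f"Warning: Invalid color: {value!r} not recognized, defaulting to black")
    return NAMED_COLORS["black"]


print(parse_color("red"))      # (1.0, 0.0, 0.0)
print(parse_color("#0000FF"))  # (0.0, 0.0, 1.0)
```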

| Option | Description | Default |
|---|---|---|
| --include-date | Include date stamp | False |
| --date-format | Date format string | %Y-%m-%d |
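The format strings shown for `--date-format` look like Python `strftime` patterns. Assuming that is how the tool interprets them, the sketch below shows what the patterns used in this README produce for a fixed sample time; `date_stamp` is a hypothetical helper.

```python
from datetime import datetime


def date_stamp(moment: datetime, date_format: str = "%Y-%m-%d") -> str:
    """Render a timestamp with a strftime-style format string."""
    return moment.strftime(date_format)


sample = datetime(2024, 3, 5, 14, 30)
print(date_stamp(sample))                    # 2024-03-05
print(date_stamp(sample, "%Y/%m/%d %H:%M"))  # 2024/03/05 14:30
print(date_stamp(sample, "%Y%m%d"))          # 20240305
```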

| Option | Description | Default |
|---|---|---|
| --combine | Combine all batch files into single PDF | False |
| --document-separators | Add separator pages between documents (with --combine) | False |
| --add-index | Generate index page listing all documents (with --combine) | False |
| --bates-filenames | Use Bates number as output filename (e.g., CASE-0001.pdf) | False |
| --mapping-prefix | Prefix for CSV/PDF mapping files | bates_mapping |
| --custom-font | Path to custom TrueType (.ttf) or OpenType (.otf) font | None |
| --add-separator | Add separator page at beginning showing Bates range | False |
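Continuous numbering across a batch, and the filename mapping it implies, can be worked through with a small sketch. It assumes each page consumes exactly one number; the page counts, the CSV column names, and the helper names are hypothetical, and the real mapping file may be laid out differently.

```python
import csv
import io
from typing import Iterable, List, Tuple


def bates_ranges(documents: Iterable[Tuple[str, int]], prefix: str,
                 start_number: int = 1, padding: int = 4) -> List[Tuple[str, str, str]]:
    """Return (filename, first_label, last_label) per document, numbered continuously."""
    ranges = []
    next_number = start_number
    for filename, page_count in documents:
        first, last = next_number, next_number + page_count - 1
        ranges.append((filename,
                       f"{prefix}{first:0{padding}d}",
                       f"{prefix}{last:0{padding}d}"))
        next_number = last + 1  # the next document continues where this one ended
    return ranges


def mapping_csv(ranges: List[Tuple[str, str, str]]) -> str:
    """Render the ranges as CSV text (column names are illustrative)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["original_file", "first_bates", "last_bates"])
    writer.writerows(ranges)
    return buffer.getvalue()


docs = [("deposition.pdf", 25), ("contract.pdf", 10)]  # hypothetical page counts
print(mapping_csv(bates_ranges(docs, "PROD-", start_number=1000, padding=6)))
```

With a 25-page first document starting at 1000, the second document begins at PROD-001025, which is the label a Bates-based filename would carry.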

| Option | Description | Default |
|---|---|---|
| --password | Password for encrypted PDFs | Prompt if needed |
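The "prompt if needed" default can be sketched with the standard library's `getpass`, which reads input without echoing it. This is an assumed flow, not the tool's actual code; `resolve_password` is a hypothetical helper, and the prompt function is injectable so the logic can be exercised without a terminal.

```python
import getpass
from typing import Callable, Optional


def resolve_password(cli_password: Optional[str], is_encrypted: bool,
                     prompt: Callable[[str], str] = getpass.getpass) -> Optional[str]:
    """Pick the password source: none needed, the CLI flag, or a hidden prompt."""
    if not is_encrypted:
        return None
    if cli_password is not None:
        return cli_password
    # No flag given: ask interactively so the password stays out of shell history.
    return prompt("PDF is password protected. Enter password: ")


print(resolve_password("mypassword", is_encrypted=True))  # mypassword
print(resolve_password(None, is_encrypted=False))         # None
```

Prompting keeps the password out of shell history and process listings, which is the usual reason to prefer it over a command-line flag.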
You can also use the package as a Python module:
from bates_labeler import BatesNumberer

# Create a numberer instance
numberer = BatesNumberer(
    prefix="CASE2023-",
    start_number=1,
    padding=6,
    position="top-right",
    font_size=12,
    font_color="blue",
    bold=True,
    include_date=True
)

# Process a single PDF
numberer.process_pdf("input.pdf", "output.pdf")

# Batch processing is handled by the CLI
# For programmatic batch processing, loop through files:
for pdf_file in ["doc1.pdf", "doc2.pdf", "doc3.pdf"]:
    output_name = pdf_file.replace(".pdf", "_bates.pdf")
    numberer.process_pdf(pdf_file, output_name)
# Note: numbering continues across files automatically

# Legal discovery production
poetry run bates \
--batch *.pdf \
--bates-prefix "PLAINTIFF-PROD-" \
--start-number 1 \
--padding 6 \
--position bottom-right \
--font-size 8

# Confidential documents
poetry run bates \
--input "trade_secrets.pdf" \
--bates-prefix "CONFIDENTIAL-" \
--bates-suffix "-AEO" \
--font-color red \
--bold \
--position top-center

# Exhibits starting at 101
poetry run bates \
--input "exhibit.pdf" \
--bates-prefix "EXHIBIT-" \
--start-number 101 \
--padding 3 \
--position top-right \
--font-size 14 \
--bold

# Archive batch with date stamp
poetry run bates \
--batch archive/*.pdf \
--bates-prefix "ARCH-" \
--include-date \
--date-format "%Y%m%d" \
--position bottom-left \
--output-dir "./archived_bates/"

# Combine into one PDF with separators and index
poetry run bates \
--batch doc1.pdf doc2.pdf doc3.pdf \
--bates-prefix "CASE-" \
--combine \
--document-separators \
--add-index \
--output "combined_discovery.pdf"

# Name output files by Bates number
poetry run bates \
--batch discovery/*.pdf \
--bates-prefix "PROD-" \
--start-number 1000 \
--bates-filenames \
--output-dir "./numbered_files/"
# Creates: PROD-001000.pdf, PROD-001025.pdf, etc.
# Also generates: bates_mapping.csv and bates_mapping.pdf

# Custom font
poetry run bates \
--input "contract.pdf" \
--bates-prefix "CONTRACT-" \
--custom-font "/path/to/custom-font.ttf" \
--font-size 10 \
--position bottom-right

- WEB_UI_GUIDE.md - Complete guide to the Streamlit web interface
  - Installation methods (Poetry, pip, Docker)
  - Step-by-step usage instructions
  - Configuration presets and options
  - Deployment options (local, cloud, Docker, self-hosted)
  - Security considerations and troubleshooting
- AI_FEATURES.md - AI document analysis features
  - Discrimination detection capabilities
  - Problematic content identification
  - Metadata extraction
  - Supported AI providers (OpenRouter, Google Cloud, Anthropic)
  - Configuration and usage examples
  - API cost considerations and best practices
- AI_SETUP_GUIDE.md - AI analysis setup guide
  - Environment variable configuration
  - Provider-specific setup instructions
  - API key acquisition and security
  - Testing and troubleshooting
- PACKAGING.md - Developer guide for Poetry and packaging
  - Poetry setup and configuration
  - Publishing to PyPI
  - Development workflow
  - Testing and quality tools

- Import Error: Install dependencies with Poetry:
      poetry install
- Python Version: Ensure you have Python 3.9 or higher (not 3.9.7):
      python --version
- Font Not Displaying Correctly: The package uses standard PDF fonts by default. For other fonts, supply a .ttf or .otf file with --custom-font.
- Large Files Running Slowly: Progress bars show status. Very large files (1000+ pages) may take several minutes.
- Password Protected PDFs: Use the --password flag or wait for the secure prompt.
- Overlapping with Existing Content: Try different positions or adjust font size.
- Web UI Not Loading: Ensure Streamlit is installed and port 8501 is available. If the port is taken, start the app on another one:
      poetry run streamlit run app.py --server.port 8502

- Error: Input file not found: Check the file path
- Error: Invalid password: Verify the PDF password
- Warning: Invalid color: Color name not recognized, defaulting to black
- poetry: command not found: Install Poetry from https://python-poetry.org

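For the "Web UI Not Loading" case, whether port 8501 is already taken can be checked with the standard library before launching Streamlit. This is a minimal sketch under the assumption that the app binds to localhost; `port_is_free` and `first_free_port` are hypothetical helpers, not part of the package.

```python
import socket


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP socket can be bound to host:port right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            return False
    return True


def first_free_port(start: int = 8501, attempts: int = 20) -> int:
    """Scan upward from `start` and return the first port that can be bound."""
    for port in range(start, start + attempts):
        if port_is_free(port):
            return port
    raise RuntimeError(f"No free port found in {start}-{start + attempts - 1}")


if __name__ == "__main__":
    port = first_free_port()
    print(f"poetry run streamlit run app.py --server.port {port}")
```

The check is only a snapshot: another process could claim the port between the probe and the launch, so treat the result as a hint.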
- Test First: Always test on a copy of your documents first
- Backup Originals: Keep original files unchanged
- Consistent Prefixes: Use meaningful prefixes for easy identification
- Document Your System: Keep a record of your Bates numbering scheme
- Batch Processing: Group related documents for continuous numbering
- Currently supports standard PDF fonts (Helvetica, Times-Roman, Courier)
- Custom TrueType fonts require additional setup
- Very complex PDFs with forms may need special handling
- Rotated pages maintain their orientation
Feel free to submit issues, fork the repository, and create pull requests for any improvements.
This script is provided as-is for legal document management purposes. Ensure compliance with your jurisdiction's requirements for legal document numbering. No warranties are provided, express or implied.
Completed in v1.1.0:
- Streamlit Web UI - Professional GUI interface for non-technical users
- Poetry packaging - Modern Python dependency management
- Docker support - Container deployment option
- Configuration presets - Quick-start templates for common use cases
- Custom TrueType/OpenType fonts - Upload and use custom .ttf/.otf fonts in both UI and CLI
- CSV/PDF mapping files - Automatic generation when using Bates number filenames
- PDF combining - Merge multiple PDFs with continuous Bates numbering
- Index page generation - Professional document index for combined PDFs
- Separator pages - Optional pages between documents showing Bates ranges with logos and borders
- Logo upload and placement - SVG, PNG, JPG, WEBP support with 8 placement options
- QR code generation - Scannable QR codes containing Bates numbers (all pages or separator only)
- Border styling - 4 decorative border styles for separator pages (solid, dashed, double, asterisks)
- Watermark capabilities - Custom text overlays with opacity, rotation, and positioning control
- ZIP download - Bundle all processed files into single archive
- Real-time status updates - Live progress tracking with cancellation support (Web UI)
Completed in v2.0.0:
- Session persistence - Save and load configurations for repeated workflows
- Undo/Redo functionality - Complete state management with Ctrl+Z/Ctrl+Y support
- Keyboard shortcuts - Fast navigation (Ctrl+S save, Ctrl+L load, Ctrl+P process, etc.)
- OCR text extraction - Extract text from scanned PDFs (local Tesseract and cloud options)
- Pre-flight PDF validation - Automatic PDF health checks before processing
- Batch export formats - Export to JSON, CSV, Excel (.xlsx), and TIFF
- Drag-and-drop file reordering - Reorder files in queue before processing
- PDF preview panel - In-app PDF page preview before processing
- Individual file progress - Track progress for each file in batch operations
- Page rotation support - Rotate pages during processing
- Bates number validation - Real-time format validation with error messages
- Performance optimizations - 10-15x faster with parallel processing and caching
- Processing history - View and restore previous processing jobs
Completed in v2.1.0:
- AI-powered document analysis - Optional AI integration for intelligent document processing
- Multi-provider AI support - OpenRouter, Google Cloud Vertex AI, Anthropic Claude
- Discrimination detection - Identify patterns across 8 categories (race, gender, age, disability, etc.)
- Problematic content identification - Detect harassment, bias, PII exposure, confidential data leaks
- Metadata extraction - Document classification, named entities, topics, sentiment analysis
- Intelligent caching - 60-90% cost reduction on repeat analyses
- Cost optimization - Typical cost $0.01-0.10 per document with efficient processing
Planned for future versions:
- Integration with document management systems
- Multi-threaded processing for large batches
- Cloud storage integration (Google Drive, Dropbox, OneDrive)
- Batch job scheduling and automation
- PDF form field preservation
- Advanced reporting and analytics
- Template management system
- Digital signatures and certification
Phase 1: Document Intelligence (Foundation) - Completed in v2.0.0
- OCR support for scanned documents - Extract text from image-based PDFs using Tesseract/Cloud OCR
Phase 2: AI-Powered Analysis - Completed in v2.1.0
- Discrimination detection - Identify patterns across 8 categories (race, gender, age, disability, religion, national origin, sexual orientation, pregnancy)
- Problematic content identification - Detect harassment, threats, hate speech, bias, PII exposure, confidential data leaks
- Metadata extraction - Document type classification, named entity recognition, topic modeling, sentiment analysis
- Multi-provider support - OpenRouter (100+ models), Google Cloud Vertex AI, Anthropic Claude
- Cost optimization - Intelligent caching (60-90% cost reduction), efficient chunking, rate limiting
Phase 3: Smart Processing & Quality (Planned)
- AI-powered quality assurance - Verify numbering continuity, detect missing pages, flag anomalies
- Duplicate and near-duplicate detection - Identify redundant pages in batch processing
- Auto-suggest Bates prefixes - Recommend prefixes based on document content and type
- Intelligent redaction detection - Identify and suggest redaction of PII (SSNs, account numbers, etc.)
Phase 4: Search & Discovery (Planned)
- Full-text searchable index generation - Create searchable database of all processed documents
- Semantic search capabilities - Find documents by concept, not just keywords
- AI document summarization - Generate executive summaries of long documents
Phase 5: Enhanced User Experience (Planned)
- Natural language configuration - Process documents using conversational commands
- AI assistant for troubleshooting - Help users optimize workflows and solve issues
- Smart defaults based on usage patterns - Learn from past configurations
- Workflow template suggestions - AI-generated processing templates
Phase 6: Advanced Automation (Planned)
- Automatic document routing - Organize processed files by type/category
- Batch processing optimization - Suggest optimal grouping and numbering strategies
- Anomaly detection and alerting - Flag unusual document characteristics
The AI analysis feature is optional and requires API credentials from one of the supported providers.
Quick Start:
- Choose a Provider:
  - OpenRouter (recommended): Access to 100+ models, cost-effective ($0.01-0.10 per document)
  - Google Cloud Vertex AI: Enterprise-grade with Gemini models
  - Anthropic Claude: Privacy-focused with long context windows
- Configure Environment:
      # Copy example configuration
      cp docs/.env.example .env
      # Edit .env and add your API key
      AI_ANALYSIS_ENABLED=true
      AI_PROVIDER=openrouter
      OPENROUTER_API_KEY=sk-or-v1-your-key-here
- Use in Python:
      from bates_labeler import BatesNumberer

      numberer = BatesNumberer(
          prefix="CASE-",
          ai_analysis_enabled=True,
          ai_provider="openrouter",
          ai_api_key="sk-or-v1-your-key",
          ai_analysis_callback=lambda result: print(f"Analysis: {result}")
      )

      # Process PDF with AI analysis
      numberer.process_pdf("document.pdf", "output.pdf")
- Use in Web UI:
  - Launch: poetry run streamlit run app.py
  - Expand "AI Document Analysis (Optional)" in sidebar
  - Enable AI analysis and enter API key
  - Process documents and view analysis results
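The environment variables from the configuration step can be turned into the keyword arguments used in the Python step, so the API key never appears in source code. The sketch below assumes only the three variables shown above and the OpenRouter provider; the helper `ai_settings_from_env` is hypothetical, and whether the package reads `.env` on its own is not covered here.

```python
import os
from typing import Dict, Mapping, Optional

# Only the OpenRouter variable name appears in this README; others are not assumed.
API_KEY_VARIABLES = {"openrouter": "OPENROUTER_API_KEY"}


def ai_settings_from_env(env: Optional[Mapping[str, str]] = None) -> Dict[str, object]:
    """Build BatesNumberer AI keyword arguments from environment variables."""
    env = os.environ if env is None else env
    enabled = env.get("AI_ANALYSIS_ENABLED", "false").strip().lower() == "true"
    if not enabled:
        return {"ai_analysis_enabled": False}
    provider = env.get("AI_PROVIDER", "openrouter").strip().lower()
    if provider not in API_KEY_VARIABLES:
        raise ValueError(f"No API key variable known for provider {provider!r}")
    key_variable = API_KEY_VARIABLES[provider]
    api_key = env.get(key_variable)
    if not api_key:
        raise RuntimeError(f"{key_variable} is not set")
    return {"ai_analysis_enabled": True, "ai_provider": provider, "ai_api_key": api_key}


example_env = {
    "AI_ANALYSIS_ENABLED": "true",
    "AI_PROVIDER": "openrouter",
    "OPENROUTER_API_KEY": "sk-or-v1-your-key-here",
}
settings = ai_settings_from_env(example_env)
print(sorted(settings))  # ['ai_analysis_enabled', 'ai_api_key', 'ai_provider']
```

The resulting dictionary could then be passed along as `BatesNumberer(prefix="CASE-", **settings)`.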
What It Detects:
- Discrimination patterns (race, gender, age, disability, etc.)
- Problematic content (harassment, bias, PII exposure)
- Document metadata (type, entities, topics, sentiment)
Cost: Typical cost is $0.01-0.10 per document. Caching reduces repeat analysis costs by 60-90%.
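Why caching lowers the cost of repeat analyses can be shown with a content-addressed cache: identical text hashes to the same key, so the paid API call runs only once. This is a generic sketch, not the package's cache; `AnalysisCache` and the stand-in analyzer are hypothetical.

```python
import hashlib
from typing import Callable, Dict


class AnalysisCache:
    """Cache analysis results keyed by a SHA-256 hash of the document text."""

    def __init__(self, analyze: Callable[[str], dict]) -> None:
        self._analyze = analyze
        self._results: Dict[str, dict] = {}
        self.hits = 0
        self.misses = 0

    def get(self, text: str) -> dict:
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key in self._results:
            self.hits += 1
        else:
            self.misses += 1
            # Only a cache miss reaches the (potentially paid) analyzer.
            self._results[key] = self._analyze(text)
        return self._results[key]


def fake_analyzer(text: str) -> dict:
    """Stand-in for a provider call; returns a trivial result."""
    return {"characters": len(text)}


cache = AnalysisCache(fake_analyzer)
for _ in range(3):
    cache.get("Same document text")
print(cache.hits, cache.misses)  # 2 1
```

In this toy run two of three requests are served from the cache; the actual savings depend on how often the same content is re-analyzed.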
See docs/AI_FEATURES.md for complete documentation