Python tool for adding Bates numbers to PDF documents for legal discovery and document management. Features both a user-friendly web interface and CLI, with support for batch processing, custom formatting, logos, QR codes, and watermarks.

PDF Bates Numbering Tool

A comprehensive Python tool for adding Bates numbers to PDF documents, commonly used in legal document management and discovery processes.

Quick Start

Option 1: Web Interface (Recommended for Most Users)

# Install with Poetry
poetry install

# Launch the web interface
poetry run streamlit run app.py

The web app opens at http://localhost:8501 with an intuitive drag & drop interface.

Option 2: Command Line Interface

# Install the package
poetry install

# Use the bates command
poetry run bates --input document.pdf --bates-prefix "CASE-"

Web Interface

User-friendly GUI - No command-line experience required!

Features

  • Drag & drop file upload - Upload single or multiple PDFs with reordering support
  • Configuration presets - Pre-configured for Legal Discovery, Confidential, Exhibits
  • Real-time preview - See your Bates format before processing
  • Live progress tracking - Real-time status updates with cancel button and individual file progress
  • Instant downloads - Individual files or bundled ZIP archive
  • Advanced customization - Logos, QR codes, borders, watermarks
  • Logo upload - SVG, PNG, JPG, WEBP with flexible positioning
  • QR code generation - Embed Bates numbers as scannable QR codes
  • Border styling - 4 decorative border styles for separator pages
  • Watermark support - Custom text overlays with opacity control
  • Session persistence - Save and load configurations for repeated use
  • Undo/Redo - Full undo/redo support for all configuration changes
  • Keyboard shortcuts - Fast navigation and actions (Ctrl+Z, Ctrl+Y, Ctrl+S, etc.)
  • OCR text extraction - Extract text from scanned PDFs (local Tesseract and cloud options)
  • Pre-flight validation - Automatic PDF validation before processing
  • Batch export formats - Export to JSON, CSV, Excel, and TIFF
  • PDF preview panel - View PDF pages before processing
  • Page rotation - Rotate pages during processing
  • Bates validation - Real-time validation of Bates number formats
  • Performance optimizations - 10-15x faster processing with parallel execution
  • Processing history - View past processing jobs and their configurations
  • Improved UI - Wider sidebar (420px), collapsible sections, professional design
  • Responsive layout - Works on different screen sizes
  • AI document analysis - Detect discrimination, problematic content, and extract metadata

Configuration Presets

  • Default - Blank slate for custom configurations
  • Legal Discovery - PLAINTIFF-PROD-000001 format
  • Confidential - CONFIDENTIAL-0001-AEO format with red text
  • Exhibit - EXHIBIT-101 format starting at 101

Deployment Options

  • Local - Run on your computer
  • Network - Share on your local network
  • Streamlit Cloud - Free cloud hosting
  • Docker - Container deployment
  • Self-hosted - Deploy on your own server (or run it on a MacBook)

Complete Web UI Guide - Installation, usage, deployment, and troubleshooting


Features

Core Functionality

  • Add sequential Bates numbers to each page of a PDF
  • Customizable prefix and suffix (e.g., "CASE123-0001-DRAFT")
  • Preserve original PDF attributes, bookmarks, and metadata
  • Support for password-protected PDFs
  • Batch processing of multiple PDFs with continuous numbering
  • Progress tracking for large documents with individual file status
  • Real-time status updates - Live progress tracking with cancellation support (Web UI)
  • Combine multiple PDFs into single file with continuous Bates numbering
  • Index page generation - Professional document index for combined PDFs
  • Bates number filenames - Name output files by first Bates number with CSV/PDF mappings
  • Custom fonts - Support for TrueType (.ttf) and OpenType (.otf) fonts
  • Logo placement - Upload and position logos on separator pages (SVG, PNG, JPG, WEBP)
  • QR codes - Generate QR codes with Bates numbers on all pages or separators
  • Watermarks - Add customizable text watermarks with opacity and rotation
  • ZIP download - Download all processed files as a single archive
  • Session persistence - Save and load processing configurations
  • Undo/Redo - Full history tracking for configuration changes
  • Keyboard shortcuts - Efficient keyboard navigation and actions
  • OCR support - Extract text from scanned documents (local and cloud)
  • Pre-flight validation - Automatic PDF health checks before processing
  • Multi-format export - JSON, CSV, Excel, TIFF batch export options
  • Drag-and-drop reordering - Reorder files before processing
  • PDF preview - View PDF pages in-app before processing
  • Page rotation - Rotate individual pages during processing
  • Bates validation - Real-time format validation with helpful error messages
  • Performance optimization - Parallel processing with 10-15x speed improvements
  • AI-powered document analysis - Discrimination detection, problematic content identification, metadata extraction

Customization Options

  • Position: Place Bates numbers at various positions on the page
  • Font: Customize font family, size, color, and style (bold/italic) or upload custom fonts
  • Date/Time: Include optional timestamp with Bates numbers
  • Padding: Configure number padding (e.g., 4 digits: "0001")
  • Formatting: Full control over prefix/suffix format
  • Separator Pages: Add separator pages between documents showing Bates ranges with optional logos and borders
  • Index Pages: Generate professional table of contents for combined documents
  • Logos: Upload custom logos (SVG, PNG, JPG, WEBP) with 8 placement options
  • QR Codes: Generate QR codes containing Bates numbers (all pages or separator only)
  • Borders: Add decorative borders to separator pages (solid, dashed, double, asterisks)
  • Watermarks: Overlay custom text with configurable opacity, rotation, and positioning
  • Download Options: Individual files or bundled ZIP archive
  • AI Analysis: Detect discrimination patterns, identify problematic content, extract document metadata (optional)

Installation

Requirements

  • Python 3.9 or higher (3.9.7 not supported due to Streamlit compatibility)
  • Poetry (recommended) or pip

Method 1: Poetry (Recommended)

# Clone the repository
git clone https://github.com/thepingdoctor/Bates-Labeler.git
cd Bates-Labeler

# Install with Poetry
poetry install

# This installs:
# - Core dependencies: pypdf, reportlab, tqdm
# - Web UI: streamlit
# - Dev tools: pytest, black, flake8, mypy (optional)

Method 2: pip

# Clone the repository
git clone https://github.com/thepingdoctor/Bates-Labeler.git
cd Bates-Labeler

# Install the package
pip install -e .

# Or install from PyPI (when published)
pip install bates-labeler

Method 3: Docker

# Build the Docker image
docker build -t bates-labeler .

# Run the web interface
docker run -p 8501:8501 bates-labeler

# Access at http://localhost:8501

Dependencies

  • pypdf ^4.0.0 - PDF manipulation
  • reportlab ^4.0.7 - PDF generation
  • tqdm ^4.66.1 - Progress bars
  • streamlit ^1.28.0 - Web interface (optional for CLI-only use)
  • pytesseract - OCR text extraction (optional, requires Tesseract installation)
  • Pillow - Image processing for OCR and previews
  • pandas - Export to Excel and CSV formats
  • openpyxl - Excel file generation

Detailed Installation Guide - Poetry setup, publishing to PyPI, and more

Optional: AI Analysis Dependencies

For AI-powered document analysis, install additional dependencies:

# For OpenRouter (recommended)
pip install requests

# For Google Cloud Vertex AI
pip install google-cloud-aiplatform

# For Anthropic Claude
pip install anthropic

AI Features Documentation - Complete guide to AI document analysis

AI Setup Guide - Step-by-step configuration for AI providers

Usage

Choose Your Interface

Web Interface - Best for:

  • Non-technical users
  • Visual configuration
  • One-time or occasional use
  • Seeing results immediately

Command Line - Best for:

  • Automation and scripting
  • Batch processing workflows
  • Integration with other tools
  • Repeated operations

Command Line Interface (CLI)

Basic Usage

Add Bates numbers to a single PDF:

poetry run bates --input "evidence.pdf" --bates-prefix "CASE123-"

This will create evidence_bates.pdf with Bates numbers like "CASE123-0001", "CASE123-0002", etc.
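
The label shape described above (prefix, zero-padded number, optional suffix) can be illustrated with a few lines of Python. This is only a sketch of the format, not the package's implementation; `format_bates` is a hypothetical helper name, and its default padding of 4 mirrors the documented `--padding` default.

```python
def format_bates(number: int, prefix: str = "", suffix: str = "", padding: int = 4) -> str:
    """Build a Bates label: prefix + zero-padded number + suffix."""
    if number < 0:
        raise ValueError("Bates numbers must be non-negative")
    # Numbers wider than `padding` are kept whole, never truncated.
    return f"{prefix}{number:0{padding}d}{suffix}"


labels = [format_bates(n, prefix="CASE123-") for n in range(1, 4)]
print(labels)  # ['CASE123-0001', 'CASE123-0002', 'CASE123-0003']
```

Padding only sets a minimum width, so in this sketch page 12345 with a padding of 4 yields "12345" rather than an error.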

Advanced Examples

Custom Position and Formatting

poetry run bates \
  --input "contract.pdf" \
  --bates-prefix "SMITH-v-JONES-" \
  --bates-suffix "-CONFIDENTIAL" \
  --start-number 100 \
  --position top-right \
  --font-size 12 \
  --font-color red \
  --bold

Include Date Stamp

poetry run bates \
  --input "deposition.pdf" \
  --bates-prefix "DEP-" \
  --include-date \
  --date-format "%Y/%m/%d %H:%M" \
  --position bottom-center

Batch Processing

Process multiple PDFs with continuous numbering:

poetry run bates \
  --batch doc1.pdf doc2.pdf doc3.pdf \
  --bates-prefix "DISCOVERY-" \
  --output-dir "./bates_stamped/"
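
To make "continuous numbering" concrete, the sketch below computes the first and last Bates number each document would receive when numbering carries over from one file to the next. It is an illustration under assumed page counts, not the tool's code; `assign_ranges` is a hypothetical name.

```python
from typing import List, Tuple


def assign_ranges(page_counts: List[int], start: int = 1) -> List[Tuple[int, int]]:
    """Return (first, last) Bates numbers per document, numbered continuously."""
    ranges = []
    next_number = start
    for pages in page_counts:
        if pages < 1:
            raise ValueError("each document needs at least one page")
        first, last = next_number, next_number + pages - 1
        ranges.append((first, last))
        next_number = last + 1  # the next document picks up where this one ended
    return ranges


# Assumed page counts for doc1.pdf, doc2.pdf, doc3.pdf
print(assign_ranges([3, 10, 2]))  # [(1, 3), (4, 13), (14, 15)]
```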

Password-Protected PDFs

poetry run bates \
  --input "secured.pdf" \
  --bates-prefix "SECURE-" \
  --password "mypassword"

Or omit the password flag to be prompted securely:

poetry run bates --input "secured.pdf" --bates-prefix "SECURE-"
# You'll be prompted: PDF is password protected. Enter password:

Command Line Options

Input/Output Options

| Option | Description | Default |
|---|---|---|
| --input, -i | Input PDF file path | Required* |
| --batch, -b | Batch process multiple PDFs | Required* |
| --output, -o | Output PDF file path | {input}_bates.pdf |
| --output-dir | Output directory for batch mode | Same as input |
*Either --input or --batch is required

Bates Numbering Options

| Option | Description | Default |
|---|---|---|
| --bates-prefix | Prefix for Bates number | "" |
| --bates-suffix | Suffix for Bates number | "" |
| --start-number | Starting number | 1 |
| --padding | Number padding width | 4 |
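
Given a prefix, suffix, and padding width, a label can also be checked or parsed back into its number. The following is a minimal sketch using a regular expression; `bates_pattern` and `parse_number` are hypothetical helpers, and the rule that a label must carry at least `padding` digits is an assumption, not documented behavior of the tool's validator.

```python
import re
from typing import Optional


def bates_pattern(prefix: str = "", suffix: str = "", padding: int = 4) -> re.Pattern:
    """Compile a pattern matching prefix + at least `padding` digits + suffix."""
    return re.compile(rf"{re.escape(prefix)}(\d{{{padding},}}){re.escape(suffix)}")


def parse_number(label: str, prefix: str = "", suffix: str = "", padding: int = 4) -> Optional[int]:
    """Return the numeric part of a well-formed label, or None if it does not match."""
    match = bates_pattern(prefix, suffix, padding).fullmatch(label)
    return int(match.group(1)) if match else None


print(parse_number("CASE123-0042", prefix="CASE123-"))  # 42
print(parse_number("CASE123-42", prefix="CASE123-"))    # None (too few digits)
```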

Position Options

| Option | Description | Default |
|---|---|---|
| --position | Position on page | bottom-right |

Available positions:

  • top-left, top-center, top-right
  • bottom-left, bottom-center, bottom-right
  • center
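
As a rough picture of what these position names mean, the sketch below maps each one to an (x, y) anchor in PDF points, with the origin at the bottom-left corner as in PDF's default coordinate system. The 36-point (half-inch) margin and the `anchor_point` helper are assumptions for illustration; the tool's actual margins and text alignment may differ.

```python
from typing import Tuple


def anchor_point(position: str, page_width: float, page_height: float,
                 margin: float = 36.0) -> Tuple[float, float]:
    """Return an (x, y) anchor in points for a named position."""
    if position == "center":
        return page_width / 2, page_height / 2
    vertical, _, horizontal = position.partition("-")
    ys = {"top": page_height - margin, "bottom": margin}
    xs = {"left": margin, "center": page_width / 2, "right": page_width - margin}
    if vertical not in ys or horizontal not in xs:
        raise ValueError(f"unknown position: {position!r}")
    return xs[horizontal], ys[vertical]


# US Letter is 612 x 792 points
print(anchor_point("bottom-right", 612, 792))  # (576.0, 36.0)
```

A real stamp would also need to account for the rendered text width, for example by right-aligning the label at a "right" anchor.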

Font Options

| Option | Description | Default |
|---|---|---|
| --font-name | Font family | Helvetica |
| --font-size | Font size in points | 10 |
| --font-color | Color name or hex | black |
| --bold | Use bold font | False |
| --italic | Use italic font | False |

Available fonts: Helvetica, Times-Roman, Courier

Date/Time Options

| Option | Description | Default |
|---|---|---|
| --include-date | Include date stamp | False |
| --date-format | Date format string | %Y-%m-%d |
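
The format strings used in this README look like Python `strftime` directives. Assuming `--date-format` is interpreted that way, the three formats that appear in the examples would render a fixed timestamp as follows:

```python
from datetime import datetime

# Fixed example timestamp so the output is reproducible
stamp_time = datetime(2024, 3, 15, 14, 30)

print(stamp_time.strftime("%Y-%m-%d"))        # 2024-03-15
print(stamp_time.strftime("%Y/%m/%d %H:%M"))  # 2024/03/15 14:30
print(stamp_time.strftime("%Y%m%d"))          # 20240315
```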

Advanced Options

| Option | Description | Default |
|---|---|---|
| --combine | Combine all batch files into single PDF | False |
| --document-separators | Add separator pages between documents (with --combine) | False |
| --add-index | Generate index page listing all documents (with --combine) | False |
| --bates-filenames | Use Bates number as output filename (e.g., CASE-0001.pdf) | False |
| --mapping-prefix | Prefix for CSV/PDF mapping files | bates_mapping |
| --custom-font | Path to custom TrueType (.ttf) or OpenType (.otf) font | None |
| --add-separator | Add separator page at beginning showing Bates range | False |

Security Options

| Option | Description | Default |
|---|---|---|
| --password | Password for encrypted PDFs | Prompt if needed |

Python API Usage

You can also use the package as a Python module:

from bates_labeler import BatesNumberer

# Create a numberer instance
numberer = BatesNumberer(
    prefix="CASE2023-",
    start_number=1,
    padding=6,
    position="top-right",
    font_size=12,
    font_color="blue",
    bold=True,
    include_date=True
)

# Process a single PDF
numberer.process_pdf("input.pdf", "output.pdf")

# Batch processing is handled by the CLI
# For programmatic batch processing, loop through files:
for pdf_file in ["doc1.pdf", "doc2.pdf", "doc3.pdf"]:
    output_name = pdf_file.replace(".pdf", "_bates.pdf")
    numberer.process_pdf(pdf_file, output_name)
    # Note: numbering continues across files automatically

Common Use Cases

Legal Discovery

poetry run bates \
  --batch *.pdf \
  --bates-prefix "PLAINTIFF-PROD-" \
  --start-number 1 \
  --padding 6 \
  --position bottom-right \
  --font-size 8

Confidential Documents

poetry run bates \
  --input "trade_secrets.pdf" \
  --bates-prefix "CONFIDENTIAL-" \
  --bates-suffix "-AEO" \
  --font-color red \
  --bold \
  --position top-center

Exhibit Marking

poetry run bates \
  --input "exhibit.pdf" \
  --bates-prefix "EXHIBIT-" \
  --start-number 101 \
  --padding 3 \
  --position top-right \
  --font-size 14 \
  --bold

Archived Documents

poetry run bates \
  --batch archive/*.pdf \
  --bates-prefix "ARCH-" \
  --include-date \
  --date-format "%Y%m%d" \
  --position bottom-left \
  --output-dir "./archived_bates/"

Combine Multiple PDFs with Index

poetry run bates \
  --batch doc1.pdf doc2.pdf doc3.pdf \
  --bates-prefix "CASE-" \
  --combine \
  --document-separators \
  --add-index \
  --output "combined_discovery.pdf"

Use Bates Numbers as Filenames

poetry run bates \
  --batch discovery/*.pdf \
  --bates-prefix "PROD-" \
  --start-number 1000 \
  --padding 6 \
  --bates-filenames \
  --output-dir "./numbered_files/"
# Creates: PROD-001000.pdf, PROD-001025.pdf, etc.
# Also generates: bates_mapping.csv and bates_mapping.pdf
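
A mapping file of this kind ties each original filename to its Bates range. The actual column layout of bates_mapping.csv is not documented here, so the sketch below uses assumed column names and a hypothetical `write_mapping` helper purely to show the idea.

```python
import csv
import io
from typing import Iterable, Tuple


def write_mapping(rows: Iterable[Tuple[str, str, str]], out) -> None:
    """Write (original filename, first Bates, last Bates) rows as CSV."""
    writer = csv.writer(out)
    # Column names are assumptions, not the tool's actual header.
    writer.writerow(["original_filename", "first_bates", "last_bates", "new_filename"])
    for original, first, last in rows:
        # The output file is named after the first Bates number of the document.
        writer.writerow([original, first, last, f"{first}.pdf"])


buffer = io.StringIO()
write_mapping([("letter.pdf", "PROD-001000", "PROD-001024")], buffer)
print(buffer.getvalue())
```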

Custom Font for Specialized Documents

poetry run bates \
  --input "contract.pdf" \
  --bates-prefix "CONTRACT-" \
  --custom-font "/path/to/custom-font.ttf" \
  --font-size 10 \
  --position bottom-right

Documentation

  • WEB_UI_GUIDE.md - Complete guide to the Streamlit web interface

    • Installation methods (Poetry, pip, Docker)
    • Step-by-step usage instructions
    • Configuration presets and options
    • Deployment options (local, cloud, Docker, self-hosted)
    • Security considerations and troubleshooting
  • AI_FEATURES.md - AI document analysis features

    • Discrimination detection capabilities
    • Problematic content identification
    • Metadata extraction
    • Supported AI providers (OpenRouter, Google Cloud, Anthropic)
    • Configuration and usage examples
    • API cost considerations and best practices
  • AI_SETUP_GUIDE.md - AI analysis setup guide

    • Environment variable configuration
    • Provider-specific setup instructions
    • API key acquisition and security
    • Testing and troubleshooting
  • PACKAGING.md - Developer guide for Poetry and packaging

    • Poetry setup and configuration
    • Publishing to PyPI
    • Development workflow
    • Testing and quality tools

Troubleshooting

Common Issues

  1. Import Error: Install dependencies with Poetry:

    poetry install
  2. Python Version: Ensure you have Python 3.9 or higher (not 3.9.7):

    python --version
  3. Font Not Displaying Correctly: The built-in choices are the standard PDF fonts (Helvetica, Times-Roman, Courier). To use any other typeface, pass a .ttf or .otf file with --custom-font.

  4. Large Files Running Slowly: Progress bars show status. Very large files (1000+ pages) may take several minutes.

  5. Password Protected PDFs: Use the --password flag or wait for the secure prompt.

  6. Overlapping with Existing Content: Try different positions or adjust font size.

  7. Web UI Not Loading: Ensure Streamlit is installed and port 8501 is available. If the port is already in use, start the app on a different one:

    poetry run streamlit run app.py --server.port 8502
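
If it is unclear whether port 8501 is taken, a quick local check can help before choosing another port. The sketch below tries to bind a TCP socket on 127.0.0.1; the helper names and the candidate port list are assumptions, and a port that is free on the loopback address may still be unavailable on another interface.

```python
import socket
from typing import Iterable, Optional


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP socket can be bound to host:port right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
    return True


def first_free_port(candidates: Iterable[int] = (8501, 8502, 8503)) -> Optional[int]:
    """Return the first candidate port that can be bound, or None."""
    for port in candidates:
        if port_is_free(port):
            return port
    return None


print(first_free_port())
```

The returned value can then be passed to `--server.port`.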

Error Messages

  • Error: Input file not found: Check the file path
  • Error: Invalid password: Verify the PDF password
  • Warning: Invalid color: Color name not recognized, defaulting to black
  • poetry: command not found: Install Poetry from https://python-poetry.org

Best Practices

  1. Test First: Always test on a copy of your documents first
  2. Backup Originals: Keep original files unchanged
  3. Consistent Prefixes: Use meaningful prefixes for easy identification
  4. Document Your System: Keep a record of your Bates numbering scheme
  5. Batch Processing: Group related documents for continuous numbering

Limitations

  • Built-in fonts are limited to the standard PDF fonts (Helvetica, Times-Roman, Courier)
  • Other typefaces must be supplied as .ttf or .otf files via --custom-font
  • Very complex PDFs with forms may need special handling
  • Rotated pages maintain their orientation

Contributing

Feel free to submit issues, fork the repository, and create pull requests for any improvements.

License

This script is provided as-is for legal document management purposes. Ensure compliance with your jurisdiction's requirements for legal document numbering. No warranties are provided, express or implied.

Future Enhancements

Completed in v1.1.0:

  • Streamlit Web UI - Professional GUI interface for non-technical users
  • Poetry packaging - Modern Python dependency management
  • Docker support - Container deployment option
  • Configuration presets - Quick-start templates for common use cases
  • Custom TrueType/OpenType fonts - Upload and use custom .ttf/.otf fonts in both UI and CLI
  • CSV/PDF mapping files - Automatic generation when using Bates number filenames
  • PDF combining - Merge multiple PDFs with continuous Bates numbering
  • Index page generation - Professional document index for combined PDFs
  • Separator pages - Optional pages between documents showing Bates ranges with logos and borders
  • Logo upload and placement - SVG, PNG, JPG, WEBP support with 8 placement options
  • QR code generation - Scannable QR codes containing Bates numbers (all pages or separator only)
  • Border styling - 4 decorative border styles for separator pages (solid, dashed, double, asterisks)
  • Watermark capabilities - Custom text overlays with opacity, rotation, and positioning control
  • ZIP download - Bundle all processed files into single archive
  • Real-time status updates - Live progress tracking with cancellation support (Web UI)

Completed in v2.0.0:

  • Session persistence - Save and load configurations for repeated workflows
  • Undo/Redo functionality - Complete state management with Ctrl+Z/Ctrl+Y support
  • Keyboard shortcuts - Fast navigation (Ctrl+S save, Ctrl+L load, Ctrl+P process, etc.)
  • OCR text extraction - Extract text from scanned PDFs (local Tesseract and cloud options)
  • Pre-flight PDF validation - Automatic PDF health checks before processing
  • Batch export formats - Export to JSON, CSV, Excel (.xlsx), and TIFF
  • Drag-and-drop file reordering - Reorder files in queue before processing
  • PDF preview panel - In-app PDF page preview before processing
  • Individual file progress - Track progress for each file in batch operations
  • Page rotation support - Rotate pages during processing
  • Bates number validation - Real-time format validation with error messages
  • Performance optimizations - 10-15x faster with parallel processing and caching
  • Processing history - View and restore previous processing jobs

Completed in v2.1.0:

  • AI-powered document analysis - Optional AI integration for intelligent document processing
  • Multi-provider AI support - OpenRouter, Google Cloud Vertex AI, Anthropic Claude
  • Discrimination detection - Identify patterns across 8 categories (race, gender, age, disability, etc.)
  • Problematic content identification - Detect harassment, bias, PII exposure, confidential data leaks
  • Metadata extraction - Document classification, named entities, topics, sentiment analysis
  • Intelligent caching - 60-90% cost reduction on repeat analyses
  • Cost optimization - Typical cost $0.01-0.10 per document with efficient processing

Planned for future versions:

  • Integration with document management systems
  • Multi-threaded processing for large batches
  • Cloud storage integration (Google Drive, Dropbox, OneDrive)
  • Batch job scheduling and automation
  • PDF form field preservation
  • Advanced reporting and analytics
  • Template management system
  • Digital signatures and certification

AI & Intelligence Features

Phase 1: Document Intelligence (Foundation) - Completed in v2.0.0

  • OCR support for scanned documents - Extract text from image-based PDFs using Tesseract/Cloud OCR

Phase 2: AI-Powered Analysis - Completed in v2.1.0

  • Discrimination detection - Identify patterns across 8 categories (race, gender, age, disability, religion, national origin, sexual orientation, pregnancy)
  • Problematic content identification - Detect harassment, threats, hate speech, bias, PII exposure, confidential data leaks
  • Metadata extraction - Document type classification, named entity recognition, topic modeling, sentiment analysis
  • Multi-provider support - OpenRouter (100+ models), Google Cloud Vertex AI, Anthropic Claude
  • Cost optimization - Intelligent caching (60-90% cost reduction), efficient chunking, rate limiting

Phase 3: Smart Processing & Quality (Planned)

  • AI-powered quality assurance - Verify numbering continuity, detect missing pages, flag anomalies
  • Duplicate and near-duplicate detection - Identify redundant pages in batch processing
  • Auto-suggest Bates prefixes - Recommend prefixes based on document content and type
  • Intelligent redaction detection - Identify and suggest redaction of PII (SSNs, account numbers, etc.)

Phase 4: Search & Discovery (Planned)

  • Full-text searchable index generation - Create searchable database of all processed documents
  • Semantic search capabilities - Find documents by concept, not just keywords
  • AI document summarization - Generate executive summaries of long documents

Phase 5: Enhanced User Experience (Planned)

  • Natural language configuration - Process documents using conversational commands
  • AI assistant for troubleshooting - Help users optimize workflows and solve issues
  • Smart defaults based on usage patterns - Learn from past configurations
  • Workflow template suggestions - AI-generated processing templates

Phase 6: Advanced Automation (Planned)

  • Automatic document routing - Organize processed files by type/category
  • Batch processing optimization - Suggest optimal grouping and numbering strategies
  • Anomaly detection and alerting - Flag unusual document characteristics

Using AI Document Analysis

The AI analysis feature is optional and requires API credentials from one of the supported providers.

Quick Start:

  1. Choose a Provider:

    • OpenRouter (recommended): Access to 100+ models, cost-effective ($0.01-0.10 per document)
    • Google Cloud Vertex AI: Enterprise-grade with Gemini models
    • Anthropic Claude: Privacy-focused with long context windows
  2. Configure Environment:

    # Copy example configuration
    cp docs/.env.example .env
    
    # Edit .env and add your API key
    AI_ANALYSIS_ENABLED=true
    AI_PROVIDER=openrouter
    OPENROUTER_API_KEY=sk-or-v1-your-key-here
  3. Use in Python:

    from bates_labeler import BatesNumberer
    
    numberer = BatesNumberer(
        prefix="CASE-",
        ai_analysis_enabled=True,
        ai_provider="openrouter",
        ai_api_key="sk-or-v1-your-key",
        ai_analysis_callback=lambda result: print(f"Analysis: {result}")
    )
    
    # Process PDF with AI analysis
    numberer.process_pdf("document.pdf", "output.pdf")
  4. Use in Web UI:

    • Launch: poetry run streamlit run app.py
    • Expand "🤖 AI Document Analysis (Optional)" in the sidebar
    • Enable AI analysis and enter API key
    • Process documents and view analysis results
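
Rather than hard-coding the API key in a script, the values from step 2 can be read from the environment and passed to the constructor shown in step 3. The sketch below is one way to do that; `load_ai_settings` is a hypothetical helper, only the OpenRouter variable named above is handled, and the "openrouter" default is an assumption. Note that Python does not read a .env file by itself: the variables must be exported in the shell or loaded with a tool such as python-dotenv.

```python
import os
from typing import Mapping, Optional


def load_ai_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Collect AI settings from environment-style variables."""
    env = os.environ if env is None else env
    enabled = env.get("AI_ANALYSIS_ENABLED", "false").strip().lower() == "true"
    provider = env.get("AI_PROVIDER", "openrouter")
    # Only the OpenRouter key name appears in this README; other providers are not covered.
    api_key = env.get("OPENROUTER_API_KEY") if provider == "openrouter" else None
    if enabled and provider == "openrouter" and not api_key:
        raise RuntimeError("AI analysis is enabled but OPENROUTER_API_KEY is not set")
    return {"ai_analysis_enabled": enabled, "ai_provider": provider, "ai_api_key": api_key}


settings = load_ai_settings({
    "AI_ANALYSIS_ENABLED": "true",
    "AI_PROVIDER": "openrouter",
    "OPENROUTER_API_KEY": "sk-or-v1-your-key-here",
})
print(settings["ai_provider"])  # openrouter
```

The dictionary keys match the keyword arguments used in the `BatesNumberer` example above, so the result could be unpacked as `BatesNumberer(prefix="CASE-", **settings)`.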

What It Detects:

  • Discrimination patterns (race, gender, age, disability, etc.)
  • Problematic content (harassment, bias, PII exposure)
  • Document metadata (type, entities, topics, sentiment)

Cost: Typical cost is $0.01-0.10 per document. Caching reduces repeat analysis costs by 60-90%.
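
How the package caches results is not described here, but the general idea behind such savings can be sketched as a cache keyed by a hash of the document's bytes, so that identical content is only sent to the provider once. `AnalysisCache` and `fake_analyze` below are hypothetical stand-ins, not part of the package.

```python
import hashlib
from typing import Callable, Dict


class AnalysisCache:
    """Memoize analysis results by SHA-256 of the document content."""

    def __init__(self, analyze: Callable[[bytes], dict]):
        self._analyze = analyze
        self._results: Dict[str, dict] = {}
        self.api_calls = 0  # counts how often the (paid) analysis actually ran

    def get(self, document: bytes) -> dict:
        key = hashlib.sha256(document).hexdigest()
        if key not in self._results:
            self.api_calls += 1
            self._results[key] = self._analyze(document)
        return self._results[key]


def fake_analyze(document: bytes) -> dict:
    # Stand-in for a real provider call
    return {"length": len(document)}


cache = AnalysisCache(fake_analyze)
for doc in [b"memo A", b"memo B", b"memo A", b"memo A"]:
    cache.get(doc)
print(cache.api_calls)  # 2
```

In this toy run, four requests trigger only two analyses; the real saving depends on how often identical documents recur.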

See docs/AI_FEATURES.md for complete documentation
