Command Line Interface

docviz-python provides a command-line interface, docviz, for extracting content such as tables, text, figures, and equations from documents.

Basic Usage

# Extract content from a single document
docviz extract document.pdf

# Extract with a specific output format and output file
docviz extract document.pdf --format json --output results.json

# Extract from multiple documents in a directory
docviz batch input_directory/ --output output_directory/

# Extract specific content types
docviz extract document.pdf --types table --types text

Command Reference

Extract Command

Extract content from a single document.

docviz extract [OPTIONS] FILE_PATH

Options:

    --output, -o PATH          Output file path
    --format, -f TEXT          Output format (json, csv, excel, xml)
    --types, -t TEXT           Content types to extract (all, table, text, figure, equation, other)
    --confidence FLOAT         Detection confidence threshold (default: 0.5)
    --device TEXT              Device to use for detection (cpu, cuda)
    --verbose, -v              Enable verbose output
    --help                     Show help message

Batch Command

Extract content from multiple documents in a directory.

docviz batch [OPTIONS] INPUT_DIR

Options:

    --output, -o PATH          Output directory
    --format, -f TEXT          Output format (json, csv, excel, xml)
    --types, -t TEXT           Content types to extract (all, table, text, figure, equation, other)
    --confidence FLOAT         Detection confidence threshold (default: 0.5)
    --device TEXT              Device to use for detection (cpu, cuda)
    --pattern TEXT             File pattern to match (default: *.pdf)
    --verbose, -v              Enable verbose output
    --help                     Show help message

Info Command

Show information about DocViz and available options.

docviz info

Examples

Basic Extraction

# Extract all content from a PDF
docviz extract document.pdf

# Save as JSON to a specific output file
docviz extract document.pdf --format json --output results.json

# Save as CSV
docviz extract document.pdf --format csv --output results.csv

Selective Extraction

# Extract only tables and text (repeat --types once per content type)
docviz extract document.pdf --types table --types text

# Extract all content types (default)
docviz extract document.pdf --types all

Batch Processing

# Process all PDFs in a directory
docviz batch input_directory/ --output results/

# Process only files matching a pattern (*.pdf is the default)
docviz batch input_directory/ --output results/ --pattern "*.pdf"
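
When per-file control is needed (for example, to keep going after one document fails), a shell loop over docviz extract is one alternative to docviz batch. The following is a minimal sketch, not part of docviz itself: the helper name extract_each and the DOCVIZ override variable are hypothetical, and it assumes that docviz exits with a non-zero status when extraction fails.

```shell
# Sketch: run `docviz extract` once per PDF and keep going on failure.
# Assumption: docviz exits with a non-zero status when extraction fails.
DOCVIZ="${DOCVIZ:-docviz}"   # override (e.g. DOCVIZ=echo) for a dry run

# extract_each INPUT_DIR OUTPUT_DIR  (hypothetical helper name)
extract_each() {
    in_dir="$1"
    out_dir="$2"
    failed=0
    mkdir -p "$out_dir" || return 1
    for pdf in "$in_dir"/*.pdf; do
        [ -e "$pdf" ] || continue            # the glob matched nothing
        name="$(basename "$pdf" .pdf)"
        if ! "$DOCVIZ" extract "$pdf" --format json --output "$out_dir/$name.json"; then
            echo "extraction failed: $pdf" >&2
            failed=$((failed + 1))
        fi
    done
    [ "$failed" -eq 0 ]                      # non-zero status if any file failed
}
```

Usage: extract_each input_directory results. Setting DOCVIZ=echo first prints the commands instead of running them.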

Custom Configuration

# Use GPU for faster processing
docviz extract document.pdf --device cuda

# Set confidence threshold
docviz extract document.pdf --confidence 0.7
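
The --device and --confidence options can be combined in one call. The sketch below picks a device value automatically before doing so. It is an illustration under assumptions: pick_device and extract_tuned are hypothetical helper names, and the presence of nvidia-smi on PATH is only a rough hint that a CUDA device may be available, not a guarantee that docviz can use it.

```shell
# Sketch: choose a --device value, then combine it with --confidence.
# Assumption: nvidia-smi on PATH is treated as a rough hint that CUDA may work.
DOCVIZ="${DOCVIZ:-docviz}"   # override (e.g. DOCVIZ=echo) for a dry run

# pick_device: print "cuda" if nvidia-smi is found, otherwise "cpu".
pick_device() {
    if command -v nvidia-smi >/dev/null 2>&1; then
        echo cuda
    else
        echo cpu
    fi
}

# extract_tuned FILE [CONFIDENCE]  (0.5 is the documented default threshold)
extract_tuned() {
    "$DOCVIZ" extract "$1" --device "$(pick_device)" --confidence "${2:-0.5}"
}
```

Usage: extract_tuned document.pdf 0.7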

Output Formats

JSON Format

docviz extract document.pdf --format json

Produces a JSON file with structured extraction results.

CSV Format

docviz extract document.pdf --format csv

Produces a CSV file with tabular data.

Excel Format

docviz extract document.pdf --format excel

Produces an Excel file (.xlsx) with extraction results.

XML Format

docviz extract document.pdf --format xml

Produces an XML file with structured extraction results.
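
To compare the four formats side by side, one option is to run the extraction once per format. This is a sketch only: the helper name extract_all_formats and the DOCVIZ override variable are hypothetical, and the output file names are chosen here for illustration (with .xlsx for the excel format, as noted above).

```shell
# Sketch: write one output file per documented format next to the input file.
DOCVIZ="${DOCVIZ:-docviz}"   # override (e.g. DOCVIZ=echo) for a dry run

# extract_all_formats FILE  (hypothetical helper name)
extract_all_formats() {
    input="$1"
    stem="${input%.*}"                       # strip the last extension
    for fmt in json csv excel xml; do
        ext="$fmt"
        if [ "$fmt" = "excel" ]; then
            ext="xlsx"                       # excel output is an .xlsx file
        fi
        "$DOCVIZ" extract "$input" --format "$fmt" --output "$stem.$ext" || return 1
    done
}
```

Usage: extract_all_formats document.pdf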

Environment Variables

You can set environment variables for configuration:

# Set API key for LLM integration (if used in processing)
export OPENAI_API_KEY="your-api-key-here"
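
In scripts, it can help to check that the key is present before starting a long run. The following is a minimal sketch; require_api_key is a hypothetical helper, and whether a given docviz run needs the key at all depends on whether LLM integration is used.

```shell
# Sketch: fail early if OPENAI_API_KEY is unset or empty.
# require_api_key  (hypothetical helper name)
require_api_key() {
    if [ -z "${OPENAI_API_KEY:-}" ]; then
        echo "OPENAI_API_KEY is not set" >&2
        return 1
    fi
}

# Example use in a script:
#   require_api_key && docviz extract document.pdf
```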

Error Handling

The CLI provides informative error messages for common issues:

  • File not found

  • Invalid file format

  • Missing dependencies

  • API authentication errors

  • Memory issues with large documents
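
In a script, the first of these cases can be caught before docviz is even started, and the rest can be detected from the exit status. The sketch below is an illustration, not documented behavior: safe_extract is a hypothetical helper, and it assumes that docviz follows the usual convention of exiting with a non-zero status on error.

```shell
# Sketch: pre-check the input file, then report a failed run.
# Assumption: docviz exits with a non-zero status on error.
DOCVIZ="${DOCVIZ:-docviz}"   # override (e.g. DOCVIZ=echo) for a dry run

# safe_extract FILE  (hypothetical helper name)
safe_extract() {
    file="$1"
    if [ ! -f "$file" ]; then
        echo "not a file: $file" >&2
        return 2
    fi
    status=0
    "$DOCVIZ" extract "$file" || status=$?
    if [ "$status" -ne 0 ]; then
        echo "docviz exited with status $status for $file" >&2
    fi
    return "$status"
}
```

Usage: safe_extract document.pdf || echo "see the message above"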

Verbose Output

Enable verbose output for debugging:

docviz extract document.pdf --verbose

This will show:

  • Processing progress

  • Detailed error messages

  • Configuration information

  • Performance metrics
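
Verbose output is often easier to inspect from a file. The following is a minimal sketch: extract_with_log is a hypothetical helper, and because this page does not say whether verbose messages go to standard output or standard error, both streams are redirected to the log.

```shell
# Sketch: save everything a verbose run prints to a log file.
# Both stdout and stderr are captured, since the target stream is not documented.
DOCVIZ="${DOCVIZ:-docviz}"   # override (e.g. DOCVIZ=echo) for a dry run

# extract_with_log FILE LOG_FILE  (hypothetical helper name)
extract_with_log() {
    "$DOCVIZ" extract "$1" --verbose >"$2" 2>&1
}
```

Usage: extract_with_log document.pdf extract.log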