Command Line Interface¶
docviz-python provides a command-line interface for document extraction.
Basic Usage¶
# Extract content from a single document
docviz extract document.pdf
# Extract with specific output format
docviz extract document.pdf --format json --output results.json
# Extract from multiple documents in a directory
docviz batch input_directory/ --output output_directory/
# Extract specific content types
docviz extract document.pdf --types table --types text
Command Reference¶
Extract Command¶
Extract content from a single document.
docviz extract [OPTIONS] FILE_PATH
Options:
--output, -o PATH Output file path
--format, -f TEXT Output format (json, csv, excel, xml)
--types, -t TEXT Content types to extract (all, table, text, figure, equation, other)
--confidence FLOAT Detection confidence threshold (default: 0.5)
--device TEXT Device to use for detection (cpu, cuda)
--verbose, -v Enable verbose output
--help Show help message
Batch Command¶
Extract content from multiple documents in a directory.
docviz batch [OPTIONS] INPUT_DIR
Options:
--output, -o PATH Output directory
--format, -f TEXT Output format (json, csv, excel, xml)
--types, -t TEXT Content types to extract (all, table, text, figure, equation, other)
--confidence FLOAT Detection confidence threshold (default: 0.5)
--device TEXT Device to use for detection (cpu, cuda)
--pattern TEXT File pattern to match (default: *.pdf)
--verbose, -v Enable verbose output
--help Show help message
Info Command¶
Show information about DocViz and available options.
docviz info
Examples¶
Basic Extraction¶
# Extract all content from a PDF
docviz extract document.pdf
# Save as JSON with specific output file
docviz extract document.pdf --format json --output results.json
# Save as CSV
docviz extract document.pdf --format csv --output results.csv
Selective Extraction¶
# Extract only tables and text
docviz extract document.pdf --types table --types text
# Extract all content types (default)
docviz extract document.pdf --types all
Batch Processing¶
# Process all PDFs in a directory
docviz batch input_directory/ --output results/
# Process with specific pattern
docviz batch input_directory/ --output results/ --pattern "*.pdf"
Custom Configuration¶
# Use GPU for faster processing
docviz extract document.pdf --device cuda
# Set confidence threshold
docviz extract document.pdf --confidence 0.7
Output Formats¶
JSON Format¶
docviz extract document.pdf --format json
Produces a JSON file with structured extraction results.
CSV Format¶
docviz extract document.pdf --format csv
Produces a CSV file with tabular data.
Excel Format¶
docviz extract document.pdf --format excel
Produces an Excel file (.xlsx) with extraction results.
XML Format¶
docviz extract document.pdf --format xml
Produces an XML file with structured extraction results.
Environment Variables¶
You can set environment variables for configuration:
# Set API key for LLM integration (if used in processing)
export OPENAI_API_KEY="your-api-key-here"
Error Handling¶
The CLI provides informative error messages for common issues:
File not found
Invalid file format
Missing dependencies
API authentication errors
Memory issues with large documents
Verbose Output¶
Enable verbose output for debugging:
docviz extract document.pdf --verbose
This will show: * Processing progress * Detailed error messages * Configuration information * Performance metrics