Welcome to docviz-python documentation!

Extract content from documents easily with Python.

docviz-python is a Python library for extracting and analyzing content from documents (currently PDF). It supports batch and selective extraction, custom configuration, and multiple output formats.

Key Features

PDF Support: Extract content from PDF documents (other formats coming soon)
Streaming Extraction: Process large documents with real-time results
Batch Processing: Handle multiple files efficiently
Selective Extraction: Choose what to extract (tables, text, figures, equations, etc.)
Multiple Output Formats: Export to JSON, CSV, Excel, XML
Simple API: Easy-to-use interface with high configurability
Async Support: Both synchronous and asynchronous processing
Chart Detection: Advanced detection and analysis of charts and figures

Quick Start

import asyncio
import docviz

async def main():
    # Create a document instance
    document = docviz.Document("path/to/your/document.pdf")

    # Extract all content asynchronously
    extractions = await document.extract_content()

    # Save results
    extractions.save("results", save_format=docviz.SaveFormat.JSON)

asyncio.run(main())
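
The example above processes a single file. For several files, one possible approach is to run the extractions concurrently with a cap on how many run at once. The sketch below is not part of docviz-python: `extract_many`, `extract_one`, and `max_concurrent` are hypothetical names chosen for illustration, and the helper uses only the standard library so it does not depend on any docviz API beyond what you pass in.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Iterable, TypeVar, Union

T = TypeVar("T")


async def extract_many(
    paths: Iterable[str],
    extract_one: Callable[[str], Awaitable[T]],
    max_concurrent: int = 4,
) -> Dict[str, Union[T, BaseException]]:
    """Run extract_one(path) for every path, at most max_concurrent at a time.

    Hypothetical helper: extract_one is supplied by the caller. Failures are
    returned as exception objects instead of aborting the whole batch.
    """
    path_list = list(paths)
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(path: str) -> T:
        # The semaphore limits how many extractions are in flight at once.
        async with semaphore:
            return await extract_one(path)

    # return_exceptions=True keeps one bad file from cancelling the others.
    results = await asyncio.gather(
        *(guarded(path) for path in path_list),
        return_exceptions=True,
    )
    return dict(zip(path_list, results))
```

With the API shown in the Quick Start, `extract_one` could be something like `lambda path: docviz.Document(path).extract_content()`. That assumes separate `Document` instances can safely be processed concurrently, which this page does not confirm. The library's own batch processing feature may be a better fit where it applies.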

Installation

Using uv (recommended):

uv add docviz-python

Using pip:

pip install docviz-python

Package Structure

For a detailed overview of the package structure and components, see Package Structure.
