Welcome to docviz-python documentation!

Extract content from documents easily with Python.

docviz-python is a robust Python library for extracting and analyzing content from documents. It offers batch and selective extraction, flexible configuration options, and supports multiple output formats.

GitHub: https://github.com/privateai-com/docviz

Try docviz Online on our website

You can test docviz functionality directly in your browser without any installation.

🚀 Try docviz Demo

Getting Started with the Web Interface

Follow these simple steps to process your documents online:

Create an Account

Visit the website and register for a new account to access the docviz interface.

Navigate to docviz

Once logged in, locate and click on the “docviz” tab in the left sidebar to access the document processing interface.

Upload Your PDF

Upload your PDF file using the “Upload PDF” field. The system will automatically begin processing your document.

Note

File Size Limit: The maximum file size is 50 MB. Larger files will be rejected during upload.

File Retention: Uploaded files are stored on our servers for exactly 7 days, after which they are automatically deleted for security and privacy reasons.

Wait for Processing

The system will process your document in the background. You can monitor the progress as the file is analyzed and content is extracted.

Download Results

When processing is finished, a green “success” badge appears beside your document. Click the “Download” button to retrieve your results. The extracted data is provided in JSON format, so it can be viewed, analyzed, or integrated into your own applications and workflows.

View and Explore Results

Click on the document card to view detailed results organized by content type:
- Text: Extracted textual content
- Tables: Structured tabular data
- Images: Detected visual elements
- Formulas: Mathematical equations and expressions

Extracted images can also be viewed directly in the document preview on the right side.
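
The upload limit and the JSON download described in the steps above can both be handled with small local helpers. The following is a minimal sketch and not part of docviz: the function names are hypothetical, the 50 MB limit is assumed to mean 50 × 1024 × 1024 bytes, and because the layout of the downloaded JSON is not documented here, the summary only reports whatever top-level structure it finds.

```python
import json
from pathlib import Path

# Assumption: "50 MB" is interpreted as 50 * 1024 * 1024 bytes.
MAX_UPLOAD_BYTES = 50 * 1024 * 1024


def fits_upload_limit(pdf_path, limit=MAX_UPLOAD_BYTES):
    """Return True if the file is small enough to upload."""
    return Path(pdf_path).stat().st_size <= limit


def summarize_results(json_path):
    """Describe the top-level structure of a downloaded results file.

    No schema is assumed: a dict is summarized per key, anything else
    is summarized as a single "<root>" entry.
    """
    with open(json_path, encoding="utf-8") as fh:
        data = json.load(fh)

    def describe(value):
        if isinstance(value, (list, dict)):
            return f"{type(value).__name__} with {len(value)} items"
        return type(value).__name__

    if isinstance(data, dict):
        return {key: describe(value) for key, value in data.items()}
    return {"<root>": describe(data)}
```

Calling `fits_upload_limit("paper.pdf")` before uploading avoids a rejected upload, and `summarize_results("results.json")` gives a quick overview of a download before you write code against it.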

Key Features

- PDF Support: Extract content from PDF documents (other formats coming soon)
- URL Inputs: Load documents from local paths or HTTP(S) URLs
- Streaming Extraction: Process large documents with real-time results
- Batch Processing: Handle multiple files efficiently
- Chunked Extraction: Process documents in configurable page chunks
- Selective Extraction: Choose what to extract (tables, text, figures, equations, etc.)
- Multiple Output Formats: Export to JSON, CSV, Excel, XML
- CLI Included: docviz command for single-file and batch processing
- Async Support: Both synchronous and asynchronous processing
- Chart Detection & LLM Summarization (optional): Detect visual elements and optionally summarize charts using an LLM
- Automatic Dependencies: On first import, downloads required models and helps install Tesseract on Windows
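
As an illustration of how the batch and async features listed above might be combined, here is a minimal sketch that processes several documents with bounded concurrency. It relies only on the three docviz calls shown in the Quick Start (`docviz.Document`, `extract_content`, and `save` with `docviz.SaveFormat.JSON`). The helper names `extract_one` and `extract_many`, the `results_<index>` output names, and the concurrency limit are assumptions made for this example. docviz may ship its own batch API, which this sketch does not use, and whether extractions actually overlap depends on how the library schedules its work.

```python
import asyncio


async def extract_one(source, output_name):
    """Extract one document and save it as JSON, using the Quick Start calls."""
    import docviz  # imported lazily: the first import may download models

    document = docviz.Document(source)
    extractions = await document.extract_content()
    extractions.save(output_name, save_format=docviz.SaveFormat.JSON)
    return output_name


async def extract_many(sources, worker=extract_one, max_concurrent=2):
    """Run `worker` over all sources, at most `max_concurrent` at a time.

    Returns a dict mapping each source to the worker's result, or to the
    exception it raised, so one failing document does not stop the batch.
    """
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(index, source):
        async with semaphore:
            try:
                return await worker(source, f"results_{index}")
            except Exception as exc:
                return exc

    results = await asyncio.gather(
        *(guarded(i, s) for i, s in enumerate(sources))
    )
    return dict(zip(sources, results))
```

A call such as `asyncio.run(extract_many(["a.pdf", "b.pdf"]))` would then return one entry per input. The `worker` parameter exists so the scheduling logic can be exercised with a stand-in coroutine before running real extractions.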

Quick Start

import asyncio
import docviz

async def main():
    # Create a document instance
    document = docviz.Document("path/to/your/document.pdf")

    # Extract all content asynchronously
    extractions = await document.extract_content()

    # Save results
    extractions.save("results", save_format=docviz.SaveFormat.JSON)

asyncio.run(main())
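
Since a document can be given as a local path or an HTTP(S) URL, it can help to validate the source before starting an extraction. The sketch below wraps the Quick Start calls in a reusable coroutine. The names `is_url` and `extract_to_json` are hypothetical helpers written for this example, not docviz functions, and the local-file check is an assumption about what is worth validating up front.

```python
import asyncio
from pathlib import Path
from urllib.parse import urlparse


def is_url(source):
    """Return True for http(s) URLs, False for anything else."""
    return urlparse(str(source)).scheme in ("http", "https")


async def extract_to_json(source, output_name="results"):
    """Validate the source, then extract and save as in the Quick Start."""
    # Local paths must exist; URLs are passed through unchecked.
    if not is_url(source) and not Path(source).is_file():
        raise FileNotFoundError(f"No such document: {source}")

    import docviz  # lazy import so the validation runs before docviz loads

    document = docviz.Document(str(source))
    extractions = await document.extract_content()
    extractions.save(output_name, save_format=docviz.SaveFormat.JSON)
    return extractions
```

With this in place, `asyncio.run(extract_to_json("path/to/your/document.pdf"))` fails fast with a clear error when the path is wrong, instead of failing somewhere inside the extraction.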

Installation

Using uv (recommended):

uv add docviz-python

Using pip:

pip install docviz-python
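
After installing, you may want to confirm that the package is visible to your interpreter and whether Tesseract is already available. This is a small sketch using only the Python standard library. The helper names are hypothetical, and looking for a `tesseract` executable on `PATH` is an assumption about how you would check for it, not a description of how docviz locates Tesseract.

```python
import shutil
from importlib import metadata


def installed_version(dist_name="docviz-python"):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def tesseract_path():
    """Return the path of a `tesseract` executable on PATH, or None."""
    return shutil.which("tesseract")


if __name__ == "__main__":
    print("docviz-python version:", installed_version())
    print("tesseract executable:", tesseract_path())
```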

Package Structure

For a detailed overview of the package structure and components, see Package Structure.