Welcome to docviz-python documentation!

docviz

Extract content from documents easily with Python.

docviz-python is a robust Python library for extracting and analyzing content from documents. It offers batch and selective extraction, flexible configuration options, and supports multiple output formats.

GitHub: https://github.com/privateai-com/docviz

Try docviz Online on our website

You can test docviz functionality directly in your browser without any installation.

Getting Started with the Web Interface

Follow these simple steps to process your documents online:

  1. Create an Account

    Login page screenshot

    Visit the website and register for a new account to access the docviz interface.

  2. Navigate to docviz

    Main dashboard screenshot

    Once logged in, locate and click on the “docviz” tab in the left sidebar to access the document processing interface.

  3. Upload Your PDF

    docviz upload interface screenshot

    Upload your PDF file using the “Upload PDF” field. The system will automatically begin processing your document.

    Note

    File Size Limit: The maximum file size is 50 MB. Larger files will be rejected during upload.

    File Retention: Uploaded files are stored on our servers for exactly 7 days, after which they are automatically deleted for security and privacy reasons.

  4. Wait for Processing

    The system will process your document in the background. You can monitor the progress as the file is analyzed and content is extracted.

  5. Download Results

    When processing is finished, a green “success” badge will appear beside your document. Simply click the “Download” button to retrieve your results. The extracted data is provided in JSON format, making it easy to view, analyze, or integrate into your own applications and workflows.

  6. View and Explore Results

    Click on the document card to view detailed results organized by content type:

    • Text: Extracted textual content

    • Tables: Structured tabular data

    • Images: Detected visual elements

    • Formulas: Mathematical equations and expressions

    Document results overview screenshot

    Here is an example of an extracted image, which you can also view directly in the document preview on the right side:

    Document results page 2 screenshot

Key Features

  • PDF Support: Extract content from PDF documents (other formats coming soon)

  • URL Inputs: Load documents from local paths or HTTP(S) URLs

  • Streaming Extraction: Process large documents with real-time results

  • Batch Processing: Handle multiple files efficiently

  • Chunked Extraction: Process documents in configurable page chunks

  • Selective Extraction: Choose what to extract (tables, text, figures, equations, etc.)

  • Multiple Output Formats: Export to JSON, CSV, Excel, XML

  • CLI Included: docviz command for single-file and batch processing

  • Async Support: Both synchronous and asynchronous processing

  • Chart Detection & LLM Summarization (optional): Detect visual elements and optionally summarize charts using an LLM

  • Automatic Dependencies: On first import, downloads required models and helps install Tesseract on Windows

Quick Start

import asyncio
import docviz

async def main():
    # Create a document instance
    document = docviz.Document("path/to/your/document.pdf")

    # Extract all content asynchronously
    extractions = await document.extract_content()

    # Save results
    extractions.save("results", save_format=docviz.SaveFormat.JSON)

asyncio.run(main())

Installation

Using uv (recommended):

uv add docviz-python

Using pip:

pip install docviz-python

Package Structure

For a detailed overview of the package structure and components, see Package Structure.

Table of Contents

Indices and tables