Basic Usage

This guide covers the fundamental concepts and basic operations in docviz-python.

Core Concepts

Document Class

The Document class is the main interface for working with documents. It represents a single document file and provides methods for content extraction.

import docviz

# Create a document instance
document = docviz.Document("path/to/document.pdf")

# Get document information
print(f"Document name: {document.name}")
print(f"Page count: {document.page_count}")

Extraction Types

docviz can extract different types of content from documents:

TEXT: Plain text content
TABLE: Tabular data with structure
FIGURE: Images, charts, and diagrams
EQUATION: Mathematical expressions
OTHER: Content that fits none of the types above

Extraction Results

The extraction process returns an ExtractionResult object containing:

entries: List of extracted content items (ExtractionEntry objects)
page_number: Number of pages processed

Each ExtractionEntry contains:

text: The extracted text content
class_: The type of content (table, text, figure, etc.)
confidence: Confidence score for the detection
bbox: Bounding box coordinates [x1, y1, x2, y2]
page_number: Page number where the content was found
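
The two field lists above can be pictured as a pair of small record types. The sketch below is a stand-in for illustration only: the class names EntrySketch and ResultSketch, the field types, and the sample values are assumptions, not the library's own definitions.

```python
from dataclasses import dataclass, field


@dataclass
class EntrySketch:
    """Stand-in for an ExtractionEntry; not the library's own class."""
    text: str            # extracted text content
    class_: str          # content type, e.g. "text", "table", "figure"
    confidence: float    # confidence score for the detection
    bbox: list[float]    # [x1, y1, x2, y2]
    page_number: int     # page where the content was found


@dataclass
class ResultSketch:
    """Stand-in for an ExtractionResult; not the library's own class."""
    entries: list[EntrySketch] = field(default_factory=list)
    page_number: int = 0


# Made-up sample values, used only to show the shape of the data.
entry = EntrySketch(
    text="Quarterly revenue",
    class_="table",
    confidence=0.93,
    bbox=[72.0, 100.0, 540.0, 320.0],
    page_number=2,
)
result = ResultSketch(entries=[entry], page_number=2)
print(result.entries[0].class_)  # prints "table"
```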

Note

The bbox (bounding box) field is only meaningful for figures and other visual elements. For text and tables, the bbox spans the full page rather than the region where the content appears.
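
A bbox in [x1, y1, x2, y2] form can be turned into a width and height, and compared against the page size to spot full-page boxes. This is a minimal sketch: the helper names bbox_size and covers_page, the tolerance, and the page dimensions are all assumptions, and this guide does not state the coordinate units or origin.

```python
def bbox_size(bbox):
    """Return (width, height) of an [x1, y1, x2, y2] bounding box."""
    x1, y1, x2, y2 = bbox
    return x2 - x1, y2 - y1


def covers_page(bbox, page_width, page_height, tolerance=1.0):
    """True if the box spans the whole page, within a small tolerance."""
    width, height = bbox_size(bbox)
    return (abs(width - page_width) <= tolerance
            and abs(height - page_height) <= tolerance)


# Made-up coordinates; the page size is a hypothetical 612 x 792 page.
figure_bbox = [72.0, 100.0, 540.0, 320.0]
page_bbox = [0.0, 0.0, 612.0, 792.0]

print(bbox_size(figure_bbox))                  # (468.0, 220.0)
print(covers_page(figure_bbox, 612.0, 792.0))  # False
print(covers_page(page_bbox, 612.0, 792.0))    # True
```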

Basic Operations

Simple Extraction

Extract all content from a document:

import docviz

# Create document and extract content
document = docviz.Document("sample.pdf")
extractions = document.extract_content_sync()

# Print summary
print(f"Extracted {len(extractions.entries)} items")

Saving Results

Save extraction results in various formats:

import docviz

document = docviz.Document("sample.pdf")
extractions = document.extract_content_sync()

# Save as JSON
extractions.save("results", save_format=docviz.SaveFormat.JSON)

# Save as CSV
extractions.save("results", save_format=docviz.SaveFormat.CSV)

# Save in multiple formats
extractions.save("results", save_format=[
    docviz.SaveFormat.JSON,
    docviz.SaveFormat.CSV,
    docviz.SaveFormat.EXCEL
])
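
If the built-in formats do not match the layout you need, the entry fields listed under Extraction Results can also be serialized by hand with the standard library. The sketch below is not how docviz saves files internally; it uses SimpleNamespace stand-ins with made-up values in place of real entries, and the FIELDS order and the entry_to_dict helper are assumptions.

```python
import csv
import io
import json
from types import SimpleNamespace

# Column order for the export; chosen arbitrarily for this sketch.
FIELDS = ["class_", "page_number", "confidence", "bbox", "text"]


def entry_to_dict(entry):
    """Copy the documented entry fields into a plain dict."""
    return {name: getattr(entry, name) for name in FIELDS}


# Stand-ins for ExtractionEntry objects (made-up values).
entries = [
    SimpleNamespace(text="Intro paragraph", class_="text", confidence=0.98,
                    bbox=[0, 0, 612, 792], page_number=1),
    SimpleNamespace(text="A | B", class_="table", confidence=0.91,
                    bbox=[0, 0, 612, 792], page_number=2),
]

rows = [entry_to_dict(e) for e in entries]

# JSON keeps bbox as a list.
json_text = json.dumps(rows, indent=2)

# CSV flattens bbox to its string form, quoted because it contains commas.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELDS)
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()

print(csv_text)
```

Writing json_text or csv_text to disk is then a plain file write, which keeps the export format fully under your control.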

Selective Extraction

Extract only specific types of content:

import docviz

document = docviz.Document("sample.pdf")

# Extract only tables and text
extractions = document.extract_content_sync(
    includes=[
        docviz.ExtractionType.TABLE,
        docviz.ExtractionType.TEXT,
    ]
)

# Extract everything except figures
extractions = document.extract_content_sync(
    includes=[
        docviz.ExtractionType.TABLE,
        docviz.ExtractionType.TEXT,
        docviz.ExtractionType.EQUATION,
        docviz.ExtractionType.OTHER,
    ]
)
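
Listing every type except one by hand is easy to get wrong as the list grows. A small helper can build the includes list by exclusion instead. The sketch below uses a stand-in enum, ExtractionTypeSketch, whose members mirror the names used in this guide; its values and the all_except helper are assumptions. The real docviz.ExtractionType may define further members, so check it before iterating over it in the same way.

```python
from enum import Enum


class ExtractionTypeSketch(Enum):
    """Stand-in enum; member values are invented for this sketch."""
    TEXT = "text"
    TABLE = "table"
    FIGURE = "figure"
    EQUATION = "equation"
    OTHER = "other"


def all_except(*excluded):
    """Return every member of the enum that is not in `excluded`."""
    return [t for t in ExtractionTypeSketch if t not in excluded]


includes = all_except(ExtractionTypeSketch.FIGURE)
print([t.name for t in includes])  # ['TEXT', 'TABLE', 'EQUATION', 'OTHER']
```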

Working with URLs

Load documents from URLs:

import asyncio
import docviz

async def load_from_url():
    # Create document from URL
    document = await docviz.Document.from_url(
        "https://example.com/document.pdf"
    )

    # Extract content
    extractions = await document.extract_content()
    extractions.save("url_results", save_format=docviz.SaveFormat.JSON)

    return extractions

# Run the async function
result = asyncio.run(load_from_url())

Asynchronous vs Synchronous

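docviz supports both synchronous and asynchronous extraction. extract_content_sync() blocks until extraction finishes, while extract_content() is a coroutine that must be awaited.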

Synchronous Usage

import docviz

# Simple synchronous extraction
document = docviz.Document("sample.pdf")
extractions = document.extract_content_sync()
extractions.save("results", save_format=docviz.SaveFormat.JSON)

Asynchronous Usage

import asyncio
import docviz

async def extract_async():
    document = docviz.Document("sample.pdf")
    extractions = await document.extract_content()
    extractions.save("results", save_format=docviz.SaveFormat.JSON)
    return extractions

# Run async function
result = asyncio.run(extract_async())
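
One reason to prefer the asynchronous form is that several awaitable calls can be scheduled together with asyncio.gather. The sketch below shows the pattern with a stand-in coroutine, fake_extract, that only sleeps; it does not call docviz. Whether real extractions speed up when run this way depends on how much of the work is I/O-bound, which this guide does not state.

```python
import asyncio


async def fake_extract(path, delay=0.01):
    """Stand-in for an awaitable extraction call; it only sleeps."""
    await asyncio.sleep(delay)
    return f"result for {path}"


async def extract_many(paths):
    # gather schedules the coroutines concurrently and returns
    # their results in the same order as the inputs.
    return await asyncio.gather(*(fake_extract(p) for p in paths))


results = asyncio.run(extract_many(["a.pdf", "b.pdf", "c.pdf"]))
print(results)
```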

Error Handling

Handle common errors gracefully:

import docviz
from pathlib import Path

def safe_extract(file_path: str):
    try:
        # Check if file exists
        if not Path(file_path).exists():
            print(f"File not found: {file_path}")
            return None

        # Create document and extract
        document = docviz.Document(file_path)
        extractions = document.extract_content_sync()

        # Save results
        extractions.save("output", save_format=docviz.SaveFormat.JSON)
        return extractions

    except Exception as e:
        print(f"Error processing {file_path}: {e}")
        return None

# Use the safe extraction function
result = safe_extract("sample.pdf")
if result:
    print(f"Successfully extracted {len(result.entries)} items")
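
The same pattern extends to a batch of files, where one failure should not stop the rest. The sketch below takes the extraction step as a callable so that it runs without docviz installed; extract_batch and demo_extract are hypothetical names, and the temporary files exist only to exercise the three outcomes (success, parse error, missing file).

```python
import tempfile
from pathlib import Path


def extract_batch(paths, extract):
    """Run `extract` on each path; return (successes, failures) dicts."""
    successes, failures = {}, {}
    for path in paths:
        if not Path(path).exists():
            failures[path] = "file not found"
            continue
        try:
            successes[path] = extract(path)
        except Exception as exc:  # broad on purpose: keep the batch going
            failures[path] = str(exc)
    return successes, failures


def demo_extract(path):
    """Stand-in for the real extraction step."""
    if path.endswith("broken.pdf"):
        raise ValueError("cannot parse")
    return f"parsed {Path(path).name}"


with tempfile.TemporaryDirectory() as tmp:
    good = str(Path(tmp) / "good.pdf")
    broken = str(Path(tmp) / "broken.pdf")
    missing = str(Path(tmp) / "missing.pdf")
    Path(good).touch()
    Path(broken).touch()
    ok, failed = extract_batch([good, broken, missing], demo_extract)

print(f"{len(ok)} succeeded, {len(failed)} failed")  # 1 succeeded, 2 failed
```

In real use, the callable could wrap the Document creation and extract_content_sync() call shown in safe_extract above.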

Progress Tracking

Monitor extraction progress:

import docviz
from tqdm import tqdm

document = docviz.Document("large_document.pdf")

# Create progress bar
with tqdm(total=document.page_count, desc="Extracting") as pbar:
    extractions = document.extract_content_sync(
        progress_callback=pbar.update
    )

extractions.save("progress_results", save_format=docviz.SaveFormat.JSON)
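
If tqdm is not available, any callable can serve as the progress callback. The sketch below shows the mechanism with a stand-in extraction loop, fake_extract, and a small counter class. It assumes the callback receives an increment per processed page, which matches how tqdm's update(n) behaves, but this guide does not state what arguments docviz actually passes, so treat that as an assumption.

```python
def fake_extract(page_count, progress_callback=None):
    """Stand-in extraction loop: reports one unit of progress per page."""
    for _ in range(page_count):
        # Per-page work would happen here.
        if progress_callback is not None:
            progress_callback(1)
    return page_count


class ProgressCounter:
    """Minimal replacement for a progress bar (hypothetical helper)."""

    def __init__(self, total):
        self.total = total
        self.done = 0

    def update(self, n=1):
        # Same call shape as tqdm's update: add n to the running count.
        self.done += n
        print(f"{self.done}/{self.total} pages")


counter = ProgressCounter(total=3)
fake_extract(3, progress_callback=counter.update)
```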

Working with Results

Access and process extraction results:

import docviz

document = docviz.Document("sample.pdf")
extractions = document.extract_content_sync()

# Iterate through extracted items
for entry in extractions.entries:
    print(f"Type: {entry.class_}")
    print(f"Page: {entry.page_number}")
    print(f"Content: {entry.text[:100]}...")
    print(f"Confidence: {entry.confidence:.2f}")
    print("---")

# Filter by content type
tables = [entry for entry in extractions.entries
          if entry.class_ == "table"]

text_items = [entry for entry in extractions.entries
              if entry.class_ == "text"]

print(f"Found {len(tables)} tables and {len(text_items)} text items")
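
Beyond simple filtering, the same entry fields support counting, grouping, and confidence thresholds. The sketch below uses SimpleNamespace stand-ins with made-up values so that it runs without a document; the 0.8 threshold is arbitrary and chosen only for illustration.

```python
from collections import Counter, defaultdict
from types import SimpleNamespace

# Stand-ins for ExtractionEntry objects (made-up values).
entries = [
    SimpleNamespace(class_="text", confidence=0.97, page_number=1, text="Intro"),
    SimpleNamespace(class_="table", confidence=0.55, page_number=1, text="A | B"),
    SimpleNamespace(class_="text", confidence=0.88, page_number=2, text="Body"),
    SimpleNamespace(class_="figure", confidence=0.91, page_number=2, text=""),
]

# Count entries per content type.
counts = Counter(e.class_ for e in entries)

# Group entries by the page they were found on.
by_page = defaultdict(list)
for e in entries:
    by_page[e.page_number].append(e)

# Keep only detections at or above an arbitrary confidence threshold.
MIN_CONFIDENCE = 0.8
confident = [e for e in entries if e.confidence >= MIN_CONFIDENCE]

print(dict(counts))    # {'text': 2, 'table': 1, 'figure': 1}
print(len(confident))  # 3
```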

Next Steps

Now that you understand the basics, explore:

Advanced Usage - Advanced features and configurations
Configuration - Detailed configuration options
API Reference - Complete API reference
Examples - More examples and use cases