Quick Start Guide
This guide shows how to install docviz-python and perform basic document extraction.
Installation
Install docviz-python using your preferred package manager:
Using uv (recommended):
uv add docviz-python
Using pip:
pip install docviz-python
From source:
git clone https://github.com/privateai-com/docviz.git
cd docviz
pip install -e .
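To confirm that the install worked before running the examples, a quick check along these lines may help. It is a minimal sketch using only the standard library. It assumes the import name is `docviz` (as in the examples in this guide) and the distribution name is `docviz-python` (as in the install commands); `describe_install` is an illustrative helper, not part of the library.

```python
from importlib import metadata, util


def describe_install(import_name: str, dist_name: str) -> str:
    """Return a one-line summary of whether a package is importable."""
    if util.find_spec(import_name) is None:
        return f"{import_name}: not installed"
    try:
        version = metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        # Importable, but no distribution metadata under this name
        # (for example a stdlib module or a differently named distribution).
        version = "unknown version"
    return f"{import_name}: installed ({version})"


if __name__ == "__main__":
    # Both names below are assumptions taken from this guide.
    print(describe_install("docviz", "docviz-python"))
```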
Basic Usage
The simplest way to extract content from a document is to create a Document
instance and call the extraction method.
Asynchronous Usage
import asyncio
import docviz

async def main():
    # Create a document instance (can be a local file or a URL)
    document = docviz.Document("path/to/your/document.pdf")
    # Extract all content asynchronously
    extractions = await document.extract_content()
    # Save results; pass the file name without an extension (it is added based on the chosen format)
    extractions.save("results", save_format=docviz.SaveFormat.JSON)

asyncio.run(main())
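One reason to prefer the asynchronous API is that several documents can be awaited concurrently. The sketch below shows that pattern with `asyncio.gather` and a concurrency cap. To stay runnable without docviz, it takes the extraction coroutine as a parameter, and `fake_extract` is a hypothetical stand-in. With docviz, the assumption is that you would pass something like `lambda p: docviz.Document(p).extract_content()`; how much the extractions actually overlap depends on how the library does its work internally.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def extract_all(
    paths: list[str],
    extract: Callable[[str], Awaitable[T]],
    max_concurrent: int = 4,
) -> list[T]:
    """Run `extract` on every path, at most `max_concurrent` at a time."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(path: str) -> T:
        async with semaphore:
            return await extract(path)

    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*(guarded(p) for p in paths))


async def fake_extract(path: str) -> str:
    """Hypothetical stand-in for a real extraction coroutine."""
    await asyncio.sleep(0)  # yield control, as real I/O would
    return f"extracted:{path}"


if __name__ == "__main__":
    print(asyncio.run(extract_all(["a.pdf", "b.pdf"], fake_extract)))
```

The cap is there so that a large directory does not start every extraction at once; the default of 4 is arbitrary.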
Synchronous Usage
import docviz
document = docviz.Document("path/to/your/document.pdf")
extractions = document.extract_content_sync()
extractions.save("results", save_format=docviz.SaveFormat.JSON)
Working with URLs
You can also work with documents from URLs:
import asyncio
import docviz

async def main():
    # Create document from URL
    document = await docviz.Document.from_url("https://example.com/document.pdf")
    # Extract content
    extractions = await document.extract_content()
    extractions.save("results", save_format=docviz.SaveFormat.JSON)

asyncio.run(main())
What Gets Extracted
By default, docviz extracts the following content types:
Text: All text content from the document
Tables: Tabular data with structure preserved
Figures: Images, charts, and diagrams
Equations: Mathematical expressions
You can customize what gets extracted using the includes parameter:
import docviz
document = docviz.Document("path/to/document.pdf")
# Extract only specific types of content
extractions = document.extract_content_sync(
    includes=[
        docviz.ExtractionType.TABLE,
        docviz.ExtractionType.TEXT,
        docviz.ExtractionType.FIGURE,
    ]
)
Output Formats
docviz supports multiple output formats:
JSON: Structured data format
CSV: Comma-separated values
Excel: Microsoft Excel format
XML: Extensible Markup Language format
import docviz
document = docviz.Document("path/to/document.pdf")
extractions = document.extract_content_sync()
# Save in multiple formats
extractions.save("results", save_format=[
    docviz.SaveFormat.JSON,
    docviz.SaveFormat.CSV,
    docviz.SaveFormat.EXCEL,
])
Batch Processing
Process multiple documents efficiently:
import docviz
from pathlib import Path
# Process all PDF files in a directory
pdf_directory = Path("data/papers/")
output_dir = Path("output/")
output_dir.mkdir(exist_ok=True)
pdfs = pdf_directory.glob("*.pdf")
documents = [docviz.Document(str(pdf)) for pdf in pdfs]
extractions = docviz.batch_extract(documents)
for ext in extractions:
    ext.save(output_dir, save_format=[docviz.SaveFormat.JSON, docviz.SaveFormat.CSV])
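If each document's results should land in a separately named file, one option is to derive the output name from the input file's stem. The helper below is a sketch using only `pathlib`; `output_base` is an illustrative name, and whether `save` accepts a path that includes a directory component is an assumption to check against the API Reference.

```python
from pathlib import Path


def output_base(pdf_path: Path, output_dir: Path) -> Path:
    """Map an input PDF to an extension-less output path inside output_dir."""
    # .stem drops only the final suffix, so "survey.v2.pdf" becomes "survey.v2".
    return output_dir / pdf_path.stem


if __name__ == "__main__":
    pdfs = [Path("data/papers/intro.pdf"), Path("data/papers/survey.v2.pdf")]
    for pdf in pdfs:
        print(output_base(pdf, Path("output")))
```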
Streaming Processing
For large documents, use streaming to process page by page:
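import asyncio
import docviz

async def main():
    document = docviz.Document("path/to/large_document.pdf")
    # Stream results one page at a time to save memory.
    # `async for` is only valid inside a coroutine, hence main().
    async for page_result in document.extract_streaming():
        # Save each page as soon as it is ready
        page_result.save(f"page_{page_result.page_number}", save_format=docviz.SaveFormat.JSON)

asyncio.run(main())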
Custom Configuration
Configure extraction parameters and LLM settings:
import os
import docviz
document = docviz.Document("path/to/document.pdf")
extractions = document.extract_content_sync(
    extraction_config=docviz.ExtractionConfig(
        page_limit=30,
        zoom_x=2.0,
        zoom_y=2.0,
    ),
    llm_config=docviz.LLMConfig(
        model="gpt-4o-mini",
        api_key=os.getenv("OPENAI_API_KEY"),
        base_url="https://api.openai.com/v1",
    ),
)
extractions.save("configured_results", save_format=docviz.SaveFormat.JSON)
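Note that `os.getenv` returns `None` when the variable is unset, so a missing key may only surface later, for example as an authentication error. A small guard such as the following fails early with a clearer message. This is a sketch; `require_env` is an illustrative helper and not part of docviz.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if unset or empty."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


if __name__ == "__main__":
    try:
        api_key = require_env("OPENAI_API_KEY")
        print("API key found")
    except RuntimeError as exc:
        print(exc)
```

The returned value could then be passed as `api_key` in place of the direct `os.getenv` call.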
Progress Tracking
Monitor extraction progress:
import docviz
from tqdm import tqdm
document = docviz.Document("path/to/document.pdf")
# Extract with progress bar
with tqdm(total=document.page_count, desc="Extracting content") as pbar:
    extractions = document.extract_content_sync(progress_callback=pbar.update)
extractions.save("progress_results", save_format=docviz.SaveFormat.JSON)
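If tqdm is not available, a plain callable can stand in for the progress bar. The sketch below assumes the callback is invoked the same way `pbar.update` would be, that is, with an optional increment that defaults to 1; that call shape is inferred from the example above, not confirmed by the API Reference. `PageCounter` is an illustrative name.

```python
class PageCounter:
    """Callable progress tracker with the same call shape as tqdm's update(n=1)."""

    def __init__(self, total: int) -> None:
        self.total = total
        self.done = 0

    def __call__(self, n: int = 1) -> None:
        # Accumulate the increment and report the running total.
        self.done += n
        print(f"{self.done}/{self.total} pages")


if __name__ == "__main__":
    counter = PageCounter(total=3)
    for _ in range(3):
        counter()  # simulate one page finishing
```

Under that assumption, an instance could be passed as `progress_callback=counter`.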
Next Steps
Now that you have the basics, explore:
User Guide - Detailed usage guide
API Reference - Complete API reference
Examples - More examples and use cases
For more advanced features and configurations, see the User Guide section.