Basic Usage¶
This guide covers the fundamental concepts and basic operations in docviz-python.
Core Concepts¶
Document Class¶
The Document
class is the main interface for working with documents. It represents a single document file and provides methods for content extraction.
import docviz
# Create a document instance
document = docviz.Document("path/to/document.pdf")
# Get document information
print(f"Document name: {document.name}")
print(f"Page count: {document.page_count}")
Extraction Types¶
docviz can extract different types of content from documents:
TEXT: Plain text content
TABLE: Tabular data with structure
FIGURE: Images, charts, and diagrams
EQUATION: Mathematical expressions
Extraction Results¶
The extraction process returns an ExtractionResult
object containing:
entries: List of extracted content items (ExtractionEntry objects)
page_number: Number of pages processed
Each ExtractionEntry contains:
text: The extracted text content
class_: The type of content (table, text, figure, etc.)
confidence: Confidence score for the detection
bbox: Bounding box coordinates [x1, y1, x2, y2]
page_number: Page number where the content was found
Note
The bbox
(bounding box) field is only meaningful for content types that represent images, figures, or other visual elements. For text and tables, the bbox
will be the size of the page.
Basic Operations¶
Simple Extraction¶
Extract all content from a document:
import docviz
# Create document and extract content
document = docviz.Document("sample.pdf")
extractions = document.extract_content_sync()
# Print summary
print(f"Extracted {len(extractions.entries)} items")
Saving Results¶
Save extraction results in various formats:
import docviz
document = docviz.Document("sample.pdf")
extractions = document.extract_content_sync()
# Save as JSON
extractions.save("results", save_format=docviz.SaveFormat.JSON)
# Save as CSV
extractions.save("results", save_format=docviz.SaveFormat.CSV)
# Save in multiple formats
extractions.save("results", save_format=[
docviz.SaveFormat.JSON,
docviz.SaveFormat.CSV,
docviz.SaveFormat.EXCEL
])
Selective Extraction¶
Extract only specific types of content:
import docviz
document = docviz.Document("sample.pdf")
# Extract only tables and text
extractions = document.extract_content_sync(
includes=[
docviz.ExtractionType.TABLE,
docviz.ExtractionType.TEXT,
]
)
# Extract everything except figures
extractions = document.extract_content_sync(
includes=[
docviz.ExtractionType.TABLE,
docviz.ExtractionType.TEXT,
docviz.ExtractionType.EQUATION,
docviz.ExtractionType.OTHER,
]
)
Working with URLs¶
Load documents from URLs:
import asyncio
import docviz
async def load_from_url():
# Create document from URL
document = await docviz.Document.from_url(
"https://example.com/document.pdf"
)
# Extract content
extractions = await document.extract_content()
extractions.save("url_results", save_format=docviz.SaveFormat.JSON)
return extractions
# Run the async function
result = asyncio.run(load_from_url())
Asynchronous vs Synchronous¶
docviz supports both synchronous and asynchronous operations:
Synchronous Usage¶
import docviz
# Simple synchronous extraction
document = docviz.Document("sample.pdf")
extractions = document.extract_content_sync()
extractions.save("results", save_format=docviz.SaveFormat.JSON)
Asynchronous Usage¶
import asyncio
import docviz
async def extract_async():
document = docviz.Document("sample.pdf")
extractions = await document.extract_content()
extractions.save("results", save_format=docviz.SaveFormat.JSON)
return extractions
# Run async function
result = asyncio.run(extract_async())
Error Handling¶
Handle common errors gracefully:
import docviz
from pathlib import Path
def safe_extract(file_path: str):
try:
# Check if file exists
if not Path(file_path).exists():
print(f"File not found: {file_path}")
return None
# Create document and extract
document = docviz.Document(file_path)
extractions = document.extract_content_sync()
# Save results
extractions.save("output", save_format=docviz.SaveFormat.JSON)
return extractions
except Exception as e:
print(f"Error processing {file_path}: {e}")
return None
# Use the safe extraction function
result = safe_extract("sample.pdf")
if result:
print(f"Successfully extracted {len(result.entries)} items")
Progress Tracking¶
Monitor extraction progress:
import docviz
from tqdm import tqdm
document = docviz.Document("large_document.pdf")
# Create progress bar
with tqdm(total=document.page_count, desc="Extracting") as pbar:
extractions = document.extract_content_sync(
progress_callback=pbar.update
)
extractions.save("progress_results", save_format=docviz.SaveFormat.JSON)
Working with Results¶
Access and process extraction results:
import docviz
document = docviz.Document("sample.pdf")
extractions = document.extract_content_sync()
# Iterate through extracted items
for entry in extractions.entries:
print(f"Type: {entry.class_}")
print(f"Page: {entry.page_number}")
print(f"Content: {entry.text[:100]}...")
print(f"Confidence: {entry.confidence:.2f}")
print("---")
# Filter by content type
tables = [entry for entry in extractions.entries
if entry.class_ == "table"]
text_items = [entry for entry in extractions.entries
if entry.class_ == "text"]
print(f"Found {len(tables)} tables and {len(text_items)} text items")
Next Steps¶
Now that you understand the basics, explore:
Advanced Usage - Advanced features and configurations
Configuration - Detailed configuration options
API Reference - Complete API reference
Examples - More examples and use cases