docviz

class docviz.DetectionConfig(imagesize, confidence, device, layout_detection_backend, model_path)[source]

Bases: object

Configuration for document layout detection and analysis.

This configuration class controls the behavior of the document layout detection system, which identifies and locates different content types (text, tables, figures, equations) within document pages.

The detection system uses computer vision models to analyze document images and identify regions of interest. This configuration allows fine-tuning of the detection process for optimal performance and accuracy.

Variables:
  • imagesize (int) – The size of the image to process for detection. Larger images generally provide better accuracy but require more computational resources. Common values are 512, 1024, or 2048. Default is typically 1024.

  • confidence (float) – The confidence threshold for detection results. Only detections with confidence scores above this threshold are included in results. Range: 0.0 to 1.0. Higher values are more selective but may miss valid content. Lower values include more content but may include false positives.

  • device (str) – The computing device to use for detection. Options include “cpu”, “cuda”, “mps” (Apple Silicon), or specific device identifiers like “cuda:0”. Use “cpu” for compatibility, “cuda” for NVIDIA GPUs.

  • layout_detection_backend (DetectionBackendEnum) – The detection backend to use. Different backends may use different models or algorithms for layout detection. Options include various YOLO-based models and other detection frameworks.

  • model_path (str) – Path to the detection model file. This should point to a valid model file compatible with the specified backend. The model file contains the trained weights and architecture for the detection system.

Parameters:
  • imagesize (int)

  • confidence (float)

  • device (str)

  • layout_detection_backend (DetectionBackendEnum)

  • model_path (str)

Example

>>> # Basic CPU configuration
>>> config = DetectionConfig(
...     imagesize=1024,
...     confidence=0.5,
...     device="cpu",
...     layout_detection_backend=DetectionBackendEnum.DOCLAYOUT_YOLO,
...     model_path="/path/to/model.pt"
... )
>>>
>>> # High-accuracy GPU configuration
>>> config = DetectionConfig(
...     imagesize=2048,
...     confidence=0.7,
...     device="cuda",
...     layout_detection_backend=DetectionBackendEnum.DOCLAYOUT_YOLO,
...     model_path="/path/to/model.pt"
... )
imagesize: int
confidence: float
device: str
layout_detection_backend: DetectionBackendEnum
model_path: str
__init__(imagesize, confidence, device, layout_detection_backend, model_path)
Parameters:
  • imagesize (int)

  • confidence (float)

  • device (str)

  • layout_detection_backend (DetectionBackendEnum)

  • model_path (str)

Return type:

None

class docviz.Document(file_path, config=None, filename=None)[source]

Bases: object

A class representing a document for content extraction and analysis.

The Document class is the primary interface for working with documents in DocViz. It provides methods for extracting various types of content (text, tables, figures, equations) from PDF documents and other supported formats.

The class handles document loading, validation, and provides both synchronous and asynchronous extraction methods. It supports streaming extraction for memory-efficient processing of large documents and chunked extraction for batch processing scenarios.

Variables:
  • file_path – Path object representing the document file location. Automatically resolved from input string (local path or URL).

  • config – ExtractionConfig instance containing default extraction settings. Used when no specific config is provided to extraction methods.

  • name – String name of the document, derived from the file stem.

Parameters:
  • file_path (str) – Path to the document file.

  • config (ExtractionConfig | None) – Configuration for extraction. If None, uses default extraction settings.

  • filename (str | None) – Optional filename for the document. If None, the filename will be extracted from the file path or a default name will be used.

Methods:

extract_content()[source]

Extract all content from the document asynchronously.

Return type:

ExtractionResult

extract_content_sync()[source]

Extract all content from the document synchronously.

Return type:

ExtractionResult

extract_streaming()[source]

Extract content page by page asynchronously.

Return type:

AsyncIterator[ExtractionResult]

extract_streaming_sync()[source]

Extract content page by page synchronously.

Return type:

Iterator[ExtractionResult]

extract_chunked()[source]

Extract content in configurable page chunks.

Return type:

Iterator[ExtractionChunk]

Properties:
page_count: The total number of pages in the document. Lazy-loaded on first access using PyMuPDF.

Class Methods:

from_url: Create a Document instance from a URL, downloading the file first.

Example

>>> # Create document from local file
>>> doc = Document("document.pdf")
>>> print(f"Document has {doc.page_count} pages")
>>>
>>> # Extract all content
>>> result = await doc.extract_content()
>>> print(f"Extracted {len(result.entries)} elements")
>>>
>>> # Extract specific content types
>>> tables_only = await doc.extract_content(
...     includes=[ExtractionType.TABLE]
... )
>>>
>>> # Stream processing for large documents
>>> async for page_result in doc.extract_streaming():
...     print(f"Page {page_result.page_number}: {len(page_result.entries)} elements")
__init__(file_path, config=None, filename=None)[source]

Initialize a Document instance.

The Document class is the primary interface for working with documents in DocViz. It provides methods for extracting various types of content (text, tables, figures, equations) from PDF documents and other supported formats.

The class handles document loading, validation, and provides both synchronous and asynchronous extraction methods. It supports streaming extraction for memory-efficient processing of large documents and chunked extraction for batch processing scenarios.

Parameters:
  • file_path (str) – Path to the document file.

  • config (ExtractionConfig | None) – Configuration for extraction. If None, uses default extraction settings.

  • filename (str | None) – Optional filename for the document. If None, the filename will be extracted from the file path or a default name will be used.

async classmethod from_url(url, config=None, filename=None)[source]

Create a Document instance from a URL.

This class method downloads a document from a URL and creates a Document instance for it. The downloaded file is saved to a temporary location and managed by the Document instance.

The method supports various URL schemes (http, https, ftp, etc.) and automatically handles file naming. If no filename is provided, it attempts to extract one from the URL or uses a default name.

Parameters:
  • url (str) – URL to download the document from. Must be a valid URL pointing to a downloadable document file (PDF, etc.).

  • config (ExtractionConfig | None) – Configuration for extraction. If None, uses default extraction settings. This config will be used as the default for all extraction methods on this document.

  • filename (str | None) – Optional filename for the downloaded file. If None, the filename will be extracted from the URL or a default name will be used.

Returns:

Document instance with the downloaded file ready for extraction.

Return type:

Document

Raises:

Exception – If the URL is invalid, the file cannot be downloaded, or the downloaded file is not a valid document format.

Example

>>> # Download document from URL
>>> doc = await Document.from_url(
...     "https://example.com/document.pdf",
...     filename="my_document.pdf"
... )
>>>
>>> # Extract content from downloaded document
>>> result = await doc.extract_content()
>>> print(f"Extracted {len(result.entries)} elements")
property page_count: int

Get the total number of pages in the document.

This property provides lazy loading of the page count. The page count is only calculated when first accessed, and then cached for subsequent accesses. This approach avoids unnecessary file operations when the page count isn’t needed.

The method uses PyMuPDF (fitz) to open the document and count pages. If the document cannot be opened or the page count cannot be determined, it returns 0 and logs a warning.

Returns:

The total number of pages in the document. Returns 0 if the page count cannot be determined.

Return type:

int

Raises:

No explicit exceptions are raised, but warnings may be logged if the document cannot be opened or processed.

Example

>>> doc = Document("document.pdf")
>>> print(f"Document has {doc.page_count} pages")
>>> # The page count is now cached and won't be recalculated
>>> print(f"Still has {doc.page_count} pages")
async extract_content(extraction_config=None, detection_config=None, includes=None, progress_callback=None, llm_config=None)[source]

Extract all content from the document asynchronously.

This method extracts all content from the document in a single operation and returns a complete ExtractionResult containing all extracted elements. It’s the primary async method for document content extraction.

The method uses the document’s default configuration if no extraction_config is provided, allowing for document-specific default settings while still supporting per-extraction customization.

Processing characteristics:
  • Processes the entire document at once

  • Returns complete results in a single ExtractionResult

  • Uses document’s default config if no config provided

  • Supports all content types (text, tables, figures, equations)

  • Provides progress tracking capabilities

Parameters:
  • extraction_config (ExtractionConfig | None) – Configuration for extraction process. If None, uses the document’s default configuration (self.config).

  • detection_config (DetectionConfig | None) – Configuration for layout detection. If None, uses default detection settings optimized for general document processing.

  • includes (list[ExtractionType] | None) – List of content types to extract. If None, extracts all available content types. Use ExtractionType.ALL for all types or specify individual types like [ExtractionType.TABLE, ExtractionType.TEXT].

  • progress_callback (Callable[[int], None] | None) – Optional callback for progress tracking. Called with current page number during processing. Useful for UI progress updates.

  • llm_config (LLMConfig | None) – Configuration for LLM-based content analysis. If None, uses default LLM settings for content enhancement and analysis.

Returns:

Complete extraction result containing all extracted content from the document, organized by page and content type.

Return type:

ExtractionResult

Raises:

Exception – If document processing fails, file access issues, or pipeline errors occur. The specific exception depends on the failure point.

Example

>>> doc = Document("document.pdf")
>>> # Extract all content using document's default config
>>> result = await doc.extract_content()
>>> print(f"Extracted {len(result.entries)} elements")
>>>
>>> # Extract specific content types with custom config
>>> tables_only = await doc.extract_content(
...     includes=[ExtractionType.TABLE],
...     progress_callback=lambda page: print(f"Processing page {page}")
... )
extract_content_sync(extraction_config=None, detection_config=None, includes=None, progress_callback=None, llm_config=None)[source]

Extract all content from the document synchronously.

This is the synchronous counterpart of extract_content() and accepts the same parameters (extraction_config, detection_config, includes, progress_callback, llm_config).

Return type:

ExtractionResult
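
Example

A minimal sketch of synchronous extraction; it mirrors the async example above, since this method accepts the same arguments as extract_content().

>>> doc = Document("document.pdf")
>>> result = doc.extract_content_sync(includes=[ExtractionType.TABLE])
>>> print(f"Extracted {len(result.entries)} elements")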

async extract_streaming(extraction_config=None, detection_config=None, includes=None, progress_callback=None, llm_config=None)[source]

Extract content page by page for memory-efficient streaming processing.

This method provides memory-efficient streaming extraction by yielding results page by page as they are processed. It’s ideal for large documents where loading all content into memory at once would be problematic.

The method processes pages sequentially and yields each page’s results as soon as processing is complete. This allows for real-time processing and reduces memory usage compared to loading all results at once.

Key benefits:
  • Memory efficient: Only one page is processed at a time

  • Real-time results: Pages are yielded as soon as they’re processed

  • Progress tracking: Can track progress on a per-page basis

  • Scalable: Suitable for documents of any size

  • Configurable: Uses document’s default config if no config provided

Parameters:
  • extraction_config (ExtractionConfig | None) – Configuration for extraction process. If None, uses the document’s default configuration (self.config).

  • detection_config (DetectionConfig | None) – Configuration for layout detection. If None, uses default detection settings optimized for general document processing.

  • includes (list[ExtractionType] | None) – List of content types to extract. If None, extracts all available content types. Use ExtractionType.ALL for all types or specify individual types like [ExtractionType.TABLE, ExtractionType.TEXT].

  • progress_callback (Callable[[int], None] | None) – Optional callback for progress tracking. Called with current page number during processing. Useful for UI progress updates.

  • llm_config (LLMConfig | None) – Configuration for LLM-based content analysis. If None, uses default LLM settings for content enhancement and analysis.

Yields:

ExtractionResult – Extraction result for each processed page. Each result contains all extracted content for that specific page.

Raises:

Exception – If document processing fails, file access issues, or pipeline errors occur. The specific exception depends on the failure point.

Return type:

AsyncIterator[ExtractionResult]

Example

>>> doc = Document("large_document.pdf")
>>> # Process pages as they become available
>>> async for page_result in doc.extract_streaming():
...     print(f"Page {page_result.page_number}: {len(page_result.entries)} elements")
...     # Process each page immediately
...     for entry in page_result.entries:
...         if entry.class_ == "table":
...             print(f"Found table: {entry.text[:50]}...")
extract_streaming_sync(extraction_config=None, detection_config=None, includes=None, progress_callback=None, llm_config=None)[source]

Extract content page by page for memory-efficient streaming processing (sync version).

Parameters:
  • extraction_config (ExtractionConfig | None) – Configuration for extraction

  • detection_config (DetectionConfig | None) – Configuration for detection

  • includes (list[ExtractionType] | None) – Types of content to include

  • progress_callback (Callable[[int], None] | None) – Optional callback for progress tracking

  • llm_config (LLMConfig | None) – Configuration for LLM

Yields:

ExtractionResult – Extraction result for each processed page

Return type:

Iterator[ExtractionResult]
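
Example

A brief sketch of synchronous streaming, analogous to the async streaming example above.

>>> doc = Document("large_document.pdf")
>>> for page_result in doc.extract_streaming_sync():
...     print(f"Page {page_result.page_number}: {len(page_result.entries)} elements")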

extract_chunked(chunk_size=10, extraction_config=None, detection_config=None, includes=None, llm_config=None)[source]

Extract content in chunks for memory-efficient processing.

This method processes the document in configurable page chunks, providing a balance between memory efficiency and processing efficiency. It’s useful for large documents where you want to process multiple pages at once but still maintain reasonable memory usage.

The method divides the document into chunks of specified size and processes each chunk as a separate extraction operation. This approach allows for better memory management while still providing batch processing benefits.

Chunking strategy:
  • Divides document into chunks of chunk_size pages

  • Processes each chunk independently

  • Returns ExtractionChunk objects with chunk metadata

  • Maintains page numbering across chunks

Parameters:
  • chunk_size (int) – Number of pages to process in each chunk. Default is 10 pages. Larger chunks use more memory but may be more efficient for processing.

  • extraction_config (ExtractionConfig | None) – Configuration for extraction process. If None, uses the document’s default configuration (self.config).

  • detection_config (DetectionConfig | None) – Configuration for layout detection. If None, uses default detection settings optimized for general document processing.

  • includes (list[ExtractionType] | None) – List of content types to extract. If None, extracts all available content types. Use ExtractionType.ALL for all types or specify individual types like [ExtractionType.TABLE, ExtractionType.TEXT].

  • llm_config (LLMConfig | None) – Configuration for LLM-based content analysis. If None, uses default LLM settings for content enhancement and analysis.

Yields:

ExtractionChunk – Chunks of extraction results. Each chunk contains:
  • result: ExtractionResult for the chunk’s pages

  • start_page: First page number in the chunk

  • end_page: Last page number in the chunk

Raises:

Exception – If document processing fails, file access issues, or pipeline errors occur. The specific exception depends on the failure point.

Return type:

Iterator[ExtractionChunk]

Example

>>> doc = Document("large_document.pdf")
>>> # Process document in 5-page chunks
>>> for chunk in doc.extract_chunked(chunk_size=5):
...     print(f"Chunk {chunk.start_page}-{chunk.end_page}: {len(chunk.result.entries)} elements")
...     # Process each chunk
...     for entry in chunk.result.entries:
...         if entry.class_ == "table":
...             print(f"Table on page {entry.page_number}")
class docviz.ExtractionChunk(result, start_page, end_page)[source]

Bases: object

Represents a chunk of extraction results from streaming processing.

Variables:
  • result (ExtractionResult) – The extraction results for this chunk.

  • page_range (str) – The page range this chunk covers (e.g., “1-10”).

  • start_page (int) – The starting page number (1-indexed).

  • end_page (int) – The ending page number (1-indexed).

Parameters:
  • result (ExtractionResult) – The extraction results for this chunk.

  • start_page (int) – The starting page number (1-indexed).

  • end_page (int) – The ending page number (1-indexed).
result: ExtractionResult
start_page: int
end_page: int
property page_range: str

Get the page range as a string.

save(file_path, save_format)[source]

Save the chunk results to a file.

Parameters:
  • file_path – The path to the file to save the chunk results to.

  • save_format – The format to save the chunk results in.
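
Example

A short usage sketch; the chunks come from Document.extract_chunked(), and it is assumed here that save_format accepts SaveFormat values as in ExtractionResult.save().

>>> for chunk in doc.extract_chunked(chunk_size=5):
...     print(f"Pages {chunk.page_range}: {len(chunk.result.entries)} elements")
...     chunk.save(f"chunk_{chunk.start_page}", SaveFormat.JSON)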
__init__(result, start_page, end_page)
Parameters:
  • result (ExtractionResult)

  • start_page (int)

  • end_page (int)
Return type:

None

class docviz.ExtractionConfig(page_limit=None, zoom_x=3.0, zoom_y=3.0, pdf_text_threshold_chars=1000, labels_to_exclude=<factory>, prefer_pdf_text=False)[source]

Bases: object

Configuration for document content extraction and processing.

This configuration class controls the behavior of the document content extraction system, which extracts and processes different types of content (text, tables, figures, equations) from detected document regions.

The extraction system handles various aspects of content processing including OCR, text extraction, table parsing, and content filtering. This configuration allows fine-tuning of the extraction process for optimal results.

Variables:
  • page_limit (int | None) – The maximum number of pages to extract from the document. If None, all pages in the document will be processed. Useful for processing large documents in parts or for testing with limited page ranges.

  • zoom_x (float) – The horizontal zoom factor for image processing. Higher values increase image resolution for better OCR accuracy but require more memory and processing time. Default is 3.0 for good balance of quality and performance.

  • zoom_y (float) – The vertical zoom factor for image processing. Higher values increase image resolution for better OCR accuracy but require more memory and processing time. Default is 3.0 for good balance of quality and performance.

  • pdf_text_threshold_chars (int) – The minimum number of characters required in a PDF text element to be considered valid content. Elements with fewer characters may be ignored in favor of OCR extraction. Default is 1000 characters.

  • labels_to_exclude (list[str]) – List of content labels to exclude from extraction. These labels correspond to specific content types that should be skipped during processing. Common exclusions include headers, footers, and other non-content elements.

  • prefer_pdf_text (bool) – Whether to prefer PDF-embedded text over OCR when both are available. When True, the system will use PDF text when it meets quality thresholds. When False, OCR will be used even when PDF text is available. Default is False for maximum compatibility.

Parameters:
  • page_limit (int | None)

  • zoom_x (float)

  • zoom_y (float)

  • pdf_text_threshold_chars (int)

  • labels_to_exclude (list[str])

  • prefer_pdf_text (bool)

Example

>>> # Basic extraction configuration
>>> config = ExtractionConfig(
...     page_limit=None,  # Process all pages
...     zoom_x=3.0,
...     zoom_y=3.0,
...     pdf_text_threshold_chars=1000,
...     labels_to_exclude=["header", "footer"],
...     prefer_pdf_text=False
... )
>>>
>>> # High-quality extraction for small documents
>>> config = ExtractionConfig(
...     page_limit=10,
...     zoom_x=4.0,
...     zoom_y=4.0,
...     pdf_text_threshold_chars=500,
...     labels_to_exclude=[],
...     prefer_pdf_text=True
... )
page_limit: int | None = None
zoom_x: float = 3.0
zoom_y: float = 3.0
pdf_text_threshold_chars: int = 1000
labels_to_exclude: list[str]
prefer_pdf_text: bool = False
__init__(page_limit=None, zoom_x=3.0, zoom_y=3.0, pdf_text_threshold_chars=1000, labels_to_exclude=<factory>, prefer_pdf_text=False)
Parameters:
  • page_limit (int | None)

  • zoom_x (float)

  • zoom_y (float)

  • pdf_text_threshold_chars (int)

  • labels_to_exclude (list[str])

  • prefer_pdf_text (bool)

Return type:

None

class docviz.ExtractionEntry(text, class_, confidence=-1.0, bbox=<factory>, page_number=-1)[source]

Bases: object

Extraction entry.

Variables:
  • text (str) – The text of the entry.

  • class_ (str) – The class of the entry.

  • confidence (float) – The confidence of the entry.

  • bbox (list[float]) – The bounding box of the entry.

  • page_number (int) – The page number of the entry.

Parameters:
  • text (str) – The text of the entry.

  • class_ (str) – The class of the entry.

  • confidence (float) – The confidence of the entry.

  • bbox (list[float]) – The bounding box of the entry.

  • page_number (int) – The page number of the entry.
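
Example

An illustrative sketch of constructing and inspecting an entry; the field values are arbitrary.

>>> entry = ExtractionEntry(
...     text="Quarterly revenue grew 12%.",
...     class_="text",
...     confidence=0.92,
...     bbox=[10.0, 20.0, 300.0, 45.0],
...     page_number=1,
... )
>>> print(entry.class_, entry.page_number)
text 1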
text: str
class_: str
confidence: float = -1.0
bbox: list[float]
page_number: int = -1
__init__(text, class_, confidence=-1.0, bbox=<factory>, page_number=-1)
Parameters:
  • text (str)

  • class_ (str)

  • confidence (float)

  • bbox (list[float])

  • page_number (int)
Return type:

None

class docviz.ExtractionResult(entries, page_number)[source]

Bases: object

Container for extracted document content.

Parameters:
  • entries (list[ExtractionEntry]) – The extracted content entries.

  • page_number (int) – The page number associated with the result.

__init__(entries, page_number)[source]
Parameters:
  • entries (list[ExtractionEntry])

  • page_number (int)
to_json(file_path)[source]

Save the extraction result to a JSON file.

Parameters:

file_path (str | Path) – The path to the file to save the result to without extension.

to_csv(file_path)[source]

Save the extraction result to a CSV file.

Parameters:

file_path (str | Path) – The path to the file to save the result to without extension.

to_excel(file_path)[source]

Save the extraction result to an Excel file.

Parameters:

file_path (str | Path) – The path to the file to save the result to without extension.

to_xml(file_path)[source]

Save the extraction result to an XML file.

Parameters:

file_path (str | Path) – The path to the file to save the result to without extension.

save(file_path_without_ext, save_format)[source]

Save the extraction result to a file. It is important to note that the file path is given without an extension.

Parameters:
  • file_path_without_ext (str | Path) – The path to the file to save the result to, without extension.

  • save_format (SaveFormat | list[SaveFormat]) – The format to save the result in.

Raises:

ValueError – If the provided save format is not a member of the SaveFormat enum.

to_dict()[source]

Convert the extraction result to a dictionary.

Returns:

Dictionary representation of the extraction result.

Return type:

dict

to_dataframe()[source]

Convert the extraction result to a pandas DataFrame.

Returns:

DataFrame representation of the extraction result.

Return type:

pd.DataFrame

__str__()[source]

Return a human-readable string representation of the ExtractionResult.

Returns:

Pretty-printed summary of the extraction result.

Return type:

str
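
Example

A minimal export sketch; the result here would come from one of the extraction methods, and file paths are given without extensions.

>>> result = doc.extract_content_sync()
>>> df = result.to_dataframe()
>>> result.to_json("output")
>>> result.save("output", [SaveFormat.JSON, SaveFormat.CSV])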

class docviz.ExtractionType(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]

Bases: Enum

Enumeration of content types that can be extracted from documents.

This enum defines the different types of content that can be extracted and processed from documents. Each type corresponds to a specific category of document content with its own processing requirements and characteristics.

The enum provides utility methods for working with extraction types, including getting all types (excluding the special ALL type) and converting to canonical label names used by the detection system.

Variables:
  • ALL – Special value indicating all content types should be extracted. This is a convenience option that expands to all individual types.

  • TABLE – Tabular data and structured information organized in rows and columns. Includes data tables, comparison tables, and other tabular formats.

  • TEXT – Regular text content including paragraphs, headings, lists, and other textual elements. This is the most common content type.

  • FIGURE – Visual elements including charts, graphs, diagrams, images, and other graphical content. Also includes charts and visualizations.

  • EQUATION – Mathematical expressions, formulas, and equations. Includes both inline and block mathematical content.

  • OTHER – Miscellaneous content that doesn’t fit into other categories. May include special elements, annotations, or unrecognized content.

Example

>>> # Extract all content types
>>> types = [ExtractionType.ALL]
>>>
>>> # Extract specific content types
>>> types = [ExtractionType.TABLE, ExtractionType.TEXT]
>>>
>>> # Get all individual types (excluding ALL)
>>> all_types = ExtractionType.get_all()
>>>
>>> # Convert to canonical label
>>> label = ExtractionType.TABLE.to_canonical_label()
>>> print(label)  # "table"
ALL = 'all'
TABLE = 'table'
TEXT = 'text'
FIGURE = 'figure'
EQUATION = 'equation'
OTHER = 'other'
classmethod get_all()[source]
to_canonical_label()[source]

Convert the extraction type to a canonical label.

This method maps the extraction type to a canonical label used by the detection system. The canonical label is a string representation of the extraction type that is used to identify the type of content in the document.

Returns:

The canonical label for the extraction type.

Return type:

str

class docviz.LLMConfig(model, api_key, base_url)[source]

Bases: object

Data class representing a single LLM config.

Variables:
  • model (str) – The model to use for the LLM.

  • api_key (str) – The API key to use for the LLM.

  • base_url (str) – The base URL to use for the LLM.

Parameters:
  • model (str) – The model to use for the LLM.

  • api_key (str) – The API key to use for the LLM.

  • base_url (str) – The base URL to use for the LLM.
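
Example

An illustrative configuration for a local Ollama server; the model name and base URL below are assumptions, not values shipped with docviz.

>>> llm_config = LLMConfig(
...     model="gemma3",
...     api_key="",
...     base_url="http://localhost:11434/v1",
... )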
model: str
api_key: str
base_url: str
__init__(model, api_key, base_url)
Parameters:
  • model (str)

  • api_key (str)

  • base_url (str)
Return type:

None

class docviz.OCRConfig(lang, chart_labels, labels_to_exclude)[source]

Bases: object

Configuration for OCR.

Variables:
  • lang (str) – The language to use for OCR.

  • chart_labels (list[str]) – The labels to use for chart OCR.

  • labels_to_exclude (list[str]) – The labels to exclude from OCR.

Parameters:
  • lang (str) – The language to use for OCR.

  • chart_labels (list[str]) – The labels to use for chart OCR.

  • labels_to_exclude (list[str]) – The labels to exclude from OCR.
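
Example

A hedged sketch of an English OCR configuration; the language code and label strings are illustrative assumptions.

>>> ocr_config = OCRConfig(
...     lang="en",
...     chart_labels=["picture", "table", "formula"],
...     labels_to_exclude=["header", "footer"],
... )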
lang: str
chart_labels: list[str]
labels_to_exclude: list[str]
__init__(lang, chart_labels, labels_to_exclude)
Parameters:
  • lang (str)

  • chart_labels (list[str])

  • labels_to_exclude (list[str])
Return type:

None

class docviz.SaveFormat(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]

Bases: Enum

JSON = 'json'
CSV = 'csv'
EXCEL = 'excel'
XML = 'xml'
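
Example

A short sketch of choosing output formats with ExtractionResult.save() (documented above); paths are given without extensions.

>>> result.save("report", SaveFormat.XML)
>>> result.save("report", [SaveFormat.JSON, SaveFormat.EXCEL])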
docviz.batch_extract(documents, extraction_config=None, detection_config=None, includes=None, progress_callback=None)[source]

Extract content from multiple documents in batch.

This function processes multiple documents sequentially using the same configuration settings. It’s designed for bulk document processing scenarios where you need to extract content from a collection of documents with consistent settings.

Performance considerations:
  • Documents are processed sequentially, not in parallel

  • Memory usage scales with the number of documents and their sizes

  • Progress tracking is available for long-running batch operations

  • Each document is processed independently, so failures don’t affect other documents

Parameters:
  • documents (list[Document]) – List of Document objects to process. Each document should be a valid Document instance with an accessible file path.

  • extraction_config (ExtractionConfig | None) – Configuration for extraction. If None, default settings will be used for all documents.

  • detection_config (DetectionConfig | None) – Configuration for detection. If None, default settings will be used for all documents.

  • includes (list[ExtractionType] | None) – Types of content to include in extraction. If None, all content types will be extracted. Use ExtractionType.ALL for all types or specify individual types like [ExtractionType.TABLE, ExtractionType.TEXT].

  • progress_callback (Callable[[int], None] | None) – Optional callback function for progress tracking. The callback receives the current document index (1-based) as its argument. Useful for updating progress bars or logging in user interfaces.

Returns:

List of extraction results, one for each input document.

The order of results matches the order of input documents. Each result contains all extracted content for that document.

Return type:

list[ExtractionResult]

Example

>>> docs = [Document("doc1.pdf"), Document("doc2.pdf")]
>>> results = batch_extract(
...     documents=docs,
...     includes=[ExtractionType.TABLE, ExtractionType.TEXT],
...     progress_callback=lambda i: print(f"Processing document {i}")
... )
>>> len(results) == 2
True
async docviz.extract_content(document, extraction_config=None, detection_config=None, includes=None, progress_callback=None, ocr_config=None, llm_config=None)[source]

Extract content from a document asynchronously.

This is the primary async function for document content extraction. It processes the entire document and returns all extracted content in a single result. The function runs the synchronous extraction pipeline in a thread pool to provide async behavior while maintaining compatibility with the underlying processing pipeline.

The function automatically sets up default configurations if none are provided:
  • DetectionConfig: Uses CPU device with 1024 image size and 0.5 confidence threshold

  • OCRConfig: English language with chart labels for pictures, tables, and formulas

  • LLMConfig: Uses Gemma3 model with local Ollama server

  • ExtractionConfig: Uses default extraction settings

Processing workflow:
  1. Validates and sets up default configurations

  2. Creates a temporary directory for processing artifacts

  3. Runs the extraction pipeline in a thread pool

  4. Converts pipeline results to standardized format

  5. Cleans up temporary files automatically

Parameters:
  • document (Document) – Document object to extract content from. Must have a valid file path accessible to the current process.

  • extraction_config (ExtractionConfig | None) – Configuration for extraction process. If None, uses default settings optimized for general document processing.

  • detection_config (DetectionConfig | None) – Configuration for layout detection. If None, uses CPU-based detection with balanced speed/accuracy settings.

  • includes (list[ExtractionType] | None) – List of content types to extract. If None, extracts all available content types. Use ExtractionType.ALL for all types or specify individual types like [ExtractionType.TABLE, ExtractionType.TEXT].

  • progress_callback (Callable[[int], None] | None) – Optional callback for progress tracking. Called with current page number during processing. Useful for UI progress updates.

  • ocr_config (OCRConfig | None) – Configuration for OCR processing. If None, uses English language with optimized settings for document analysis.

  • llm_config (LLMConfig | None) – Configuration for LLM-based content analysis. If None, uses local Gemma3 model via Ollama server.

Returns:

Complete extraction result containing all extracted content from the document, organized by page and content type.

Return type:

ExtractionResult

Raises:

Exception – If document processing fails, file access issues, or pipeline errors occur. The specific exception depends on the failure point.

Example

>>> doc = Document("document.pdf")
>>> result = await extract_content(
...     document=doc,
...     includes=[ExtractionType.TABLE, ExtractionType.TEXT],
...     progress_callback=lambda page: print(f"Processing page {page}")
... )
>>> print(f"Extracted {len(result.entries)} elements")
async docviz.extract_content_streaming(document, extraction_config=None, detection_config=None, includes=None, progress_callback=None, ocr_config=None, llm_config=None)[source]

Extract content from a document asynchronously with streaming results.

This function provides memory-efficient streaming extraction by yielding results page by page as they are processed. It’s ideal for large documents where loading all content into memory at once would be problematic.

The function runs the synchronous streaming pipeline in a thread pool to provide async behavior while maintaining the memory efficiency of streaming processing. Each yielded result contains the extracted content for a single page.

Key benefits:
  • Memory efficient: Only one page is processed at a time

  • Real-time results: Pages are yielded as soon as they’re processed

  • Progress tracking: Can track progress on a per-page basis

  • Scalable: Suitable for documents of any size

Processing workflow:
  1. Sets up default configurations if none provided

  2. Creates temporary directory for processing artifacts

  3. Runs streaming pipeline in thread pool

  4. Yields page results as they become available

  5. Cleans up temporary files on completion

Parameters:
  • document (Document) – Document object to extract content from. Must have a valid file path accessible to the current process.

  • extraction_config (ExtractionConfig | None) – Configuration for extraction process. If None, uses default settings optimized for general document processing.

  • detection_config (DetectionConfig | None) – Configuration for layout detection. If None, uses CPU-based detection with balanced speed/accuracy settings.

  • includes (list[ExtractionType] | None) – List of content types to extract. If None, extracts all available content types. Use ExtractionType.ALL for all types or specify individual types like [ExtractionType.TABLE, ExtractionType.TEXT].

  • progress_callback (Callable[[int], None] | None) – Optional callback for progress tracking. Called with current page number during processing. Useful for UI progress updates.

  • ocr_config (OCRConfig | None) – Configuration for OCR processing. If None, uses English language with optimized settings for document analysis.

  • llm_config (LLMConfig | None) – Configuration for LLM-based content analysis. If None, uses local Gemma3 model via Ollama server.

Yields:

ExtractionResult – Extraction result for each processed page. Each result contains all extracted content for that specific page.

Raises:

Exception – If document processing fails, file access issues, or pipeline errors occur. The specific exception depends on the failure point.

Return type:

AsyncIterator[ExtractionResult]

Example

>>> doc = Document("large_document.pdf")
>>> async for page_result in extract_content_streaming(doc):
...     print(f"Page {page_result.page_number}: {len(page_result.entries)} elements")
...     # Process each page as it becomes available
docviz.extract_content_streaming_sync(document, extraction_config=None, detection_config=None, includes=None, progress_callback=None, ocr_config=None, llm_config=None)[source]

Extract content from a document synchronously with streaming results.

This function provides memory-efficient streaming extraction by yielding results page by page as they are processed. It’s the core synchronous implementation that powers both sync and async streaming workflows.

The function processes pages one at a time and yields results immediately upon completion of each page. This approach is ideal for large documents where loading all content into memory at once would be problematic or when you need to start processing results before the entire document is complete.

Key benefits:
  • Memory efficient: Only one page is processed at a time

  • Real-time results: Pages are yielded as soon as they’re processed

  • Progress tracking: Can track progress on a per-page basis

  • Scalable: Suitable for documents of any size

  • Synchronous: No async/await complexity for simple use cases

Processing workflow:
  1. Sets up default configurations if none provided

  2. Creates temporary directory for processing artifacts

  3. Runs streaming pipeline synchronously

  4. Yields page results as they become available

  5. Cleans up temporary files on completion

Parameters:
  • document (Document) – Document object to extract content from. Must have a valid file path accessible to the current process.

  • extraction_config (ExtractionConfig | None) – Configuration for extraction process. If None, uses default settings optimized for general document processing.

  • detection_config (DetectionConfig | None) – Configuration for layout detection. If None, uses CPU-based detection with balanced speed/accuracy settings.

  • includes (list[ExtractionType] | None) – List of content types to extract. If None, extracts all available content types. Use ExtractionType.ALL for all types or specify individual types like [ExtractionType.TABLE, ExtractionType.TEXT].

  • progress_callback (Callable[[int], None] | None) – Optional callback for progress tracking. Called with current page number during processing. Useful for UI progress updates.

  • ocr_config (OCRConfig | None) – Configuration for OCR processing. If None, uses English language with optimized settings for document analysis.

  • llm_config (LLMConfig | None) – Configuration for LLM-based content analysis. If None, uses local Gemma3 model via Ollama server.

Yields:

ExtractionResult – Extraction result for each processed page. Each result contains all extracted content for that specific page.

Raises:

Exception – If document processing fails, file access issues, or pipeline errors occur. The specific exception depends on the failure point.

Return type:

Iterator[ExtractionResult]

Example

>>> doc = Document("large_document.pdf")
>>> for page_result in extract_content_streaming_sync(doc):
...     print(f"Page {page_result.page_number}: {len(page_result.entries)} elements")
...     # Process each page as it becomes available
docviz.extract_content_sync(document, extraction_config=None, detection_config=None, includes=None, progress_callback=None, ocr_config=None, llm_config=None)[source]

Extract content from a document synchronously.

This is the core synchronous function for document content extraction. It processes the entire document in the current thread and returns all extracted content in a single result. This function is the foundation for both sync and async extraction workflows.

The function automatically sets up default configurations if none are provided:
  • DetectionConfig: Uses CPU device with 1024 image size and 0.5 confidence threshold

  • OCRConfig: English language with chart labels for pictures, tables, and formulas

  • LLMConfig: Uses Gemma3 model with local Ollama server

  • ExtractionConfig: Uses default extraction settings

Processing workflow:
  1. Validates and sets up default configurations

  2. Creates a temporary directory for processing artifacts

  3. Runs the extraction pipeline synchronously

  4. Converts pipeline results to standardized format

  5. Cleans up temporary files automatically

Memory and performance considerations:
  • Processes the entire document in memory

  • Uses temporary files for intermediate processing steps

  • Automatically cleans up temporary files on completion

  • Suitable for documents up to several hundred pages

Parameters:
  • document (Document) – Document object to extract content from. Must have a valid file path accessible to the current process.

  • extraction_config (ExtractionConfig | None) – Configuration for extraction process. If None, uses default settings optimized for general document processing.

  • detection_config (DetectionConfig | None) – Configuration for layout detection. If None, uses CPU-based detection with balanced speed/accuracy settings.

  • includes (list[ExtractionType] | None) – List of content types to extract. If None, extracts all available content types. Use ExtractionType.ALL for all types or specify individual types like [ExtractionType.TABLE, ExtractionType.TEXT].

  • progress_callback (Callable[[int], None] | None) – Optional callback for progress tracking. Called with current page number during processing. Useful for UI progress updates.

  • ocr_config (OCRConfig | None) – Configuration for OCR processing. If None, uses English language with optimized settings for document analysis.

  • llm_config (LLMConfig | None) – Configuration for LLM-based content analysis. If None, uses local Gemma3 model via Ollama server.

Returns:

Complete extraction result containing all extracted content from the document, organized by page and content type.

Return type:

ExtractionResult

Raises:

Exception – If document processing fails, file access issues, or pipeline errors occur. The specific exception depends on the failure point.

Example

>>> doc = Document("document.pdf")
>>> result = extract_content_sync(
...     document=doc,
...     includes=[ExtractionType.TABLE, ExtractionType.TEXT],
...     progress_callback=lambda page: print(f"Processing page {page}")
... )
>>> print(f"Extracted {len(result.entries)} elements")
docviz._check_dependencies_once()[source]

Ensure dependencies are checked only once in a thread-safe and process-safe manner.

This function is called automatically on module import to verify that all required dependencies (models, libraries, etc.) are available before document processing. This prevents runtime errors and provides early feedback about missing dependencies.

A global variable tracks whether dependencies have been checked in the current thread. For process-level safety, a lock file at ~/.docviz/dependencies_checked.lock prevents multiple processes from performing the check simultaneously. Double-checked locking is used to minimize unnecessary locking and improve performance.

The function handles different asyncio contexts:
  • Creates a new event loop if none exists

  • Uses asyncio.run() for clean execution

  • Handles cases where an event loop is already running (e.g., Jupyter notebooks)

Raises:

Exception – If any required dependency is missing or the dependency check fails. The specific exception type depends on what dependency is missing (e.g., FileNotFoundError for missing models, ImportError for missing packages).
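
Example

A simplified, illustrative sketch of the double-checked locking idea described above; the names here are assumptions, not the actual docviz implementation.

>>> import threading
>>> _checked = False
>>> _lock = threading.Lock()
>>> def ensure_checked_once(run_check):
...     global _checked
...     if _checked:              # fast path: already verified
...         return
...     with _lock:               # serialize concurrent callers
...         if not _checked:      # re-check after acquiring the lock
...             run_check()       # perform the expensive dependency check once
...             _checked = True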

docviz._run_async_dependency_check()[source]

Run the async dependency check with proper event loop handling.

This helper function handles different asyncio contexts gracefully:
  1. If no event loop is running, use asyncio.run() (preferred modern approach)

  2. If an event loop is already running (e.g., in Jupyter), create a new thread

  3. Handle various edge cases and provide clear error messages

Raises:
  • RuntimeError – If dependency check fails after multiple attempts

  • Exception – Original exception from check_dependencies() if it’s not event loop related
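
Example

A hedged sketch of the event-loop handling strategy described above; run_coro_safely is an illustrative helper, not a documented docviz API.

>>> import asyncio, threading
>>> def run_coro_safely(coro):
...     try:
...         asyncio.get_running_loop()
...     except RuntimeError:
...         return asyncio.run(coro)  # no loop running: the normal path
...     # A loop is already running (e.g. Jupyter): execute in a fresh thread
...     outcome = {}
...     worker = threading.Thread(target=lambda: outcome.update(value=asyncio.run(coro)))
...     worker.start()
...     worker.join()
...     return outcome.get("value")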