Document Class

docviz.lib.document.class_._get_page_count_cached(file_path_str)[source]

Get page count for a document file with caching.

Parameters:

file_path_str (str) – String representation of the file path

Returns:

Number of pages in the document

Return type:

int

class docviz.lib.document.class_.Document(file_path, config=None, filename=None)[source]

Bases: object

A class representing a document for content extraction and analysis.

The Document class is the primary interface for working with documents in DocViz. It provides methods for extracting various types of content (text, tables, figures, equations) from PDF documents and other supported formats.

The class handles document loading, validation, and provides both synchronous and asynchronous extraction methods. It supports streaming extraction for memory-efficient processing of large documents and chunked extraction for batch processing scenarios.

Variables:
  • file_path – Path object representing the document file location. Automatically resolved from input string (local path or URL).

  • config – ExtractionConfig instance containing default extraction settings. Used when no specific config is provided to extraction methods.

  • name – String name of the document, derived from the file stem.

Methods:

extract_content()[source]

Extract all content from the document asynchronously.

Return type:

ExtractionResult

extract_content_sync()[source]

Extract all content from the document synchronously.

Return type:

ExtractionResult

extract_streaming()[source]

Extract content page by page asynchronously.

Return type:

AsyncIterator[ExtractionResult]

extract_streaming_sync()[source]

Extract content page by page synchronously.

Return type:

Iterator[ExtractionResult]

extract_chunked()[source]

Extract content in configurable page chunks.

Return type:

Iterator[ExtractionChunk]

Properties:
page_count: The total number of pages in the document. Lazy-loaded on first access using PyMuPDF.

Class Methods:

from_url: Create a Document instance from a URL, downloading the file first.

Example

>>> # Create document from local file
>>> doc = Document("document.pdf")
>>> print(f"Document has {doc.page_count} pages")
>>>
>>> # Extract all content
>>> result = await doc.extract_content()
>>> print(f"Extracted {len(result.entries)} elements")
>>>
>>> # Extract specific content types
>>> tables_only = await doc.extract_content(
...     includes=[ExtractionType.TABLE]
... )
>>>
>>> # Stream processing for large documents
>>> async for page_result in doc.extract_streaming():
...     print(f"Page {page_result.page_number}: {len(page_result.entries)} elements")
__init__(file_path, config=None, filename=None)[source]

Initialize a Document instance.


Parameters:
  • file_path (str) – Path to the document file.

  • config (ExtractionConfig | None) – Configuration for extraction. If None, uses default extraction settings.

  • filename (str | None) – Optional filename for the document. If None, the filename will be extracted from the file path or a default name will be used.

async classmethod from_url(url, config=None, filename=None)[source]

Create a Document instance from a URL.

This class method downloads a document from a URL and creates a Document instance for it. The downloaded file is saved to a temporary location and managed by the Document instance.

The method supports various URL schemes (http, https, ftp, etc.) and automatically handles file naming. If no filename is provided, it attempts to extract one from the URL or uses a default name.

Parameters:
  • url (str) – URL to download the document from. Must be a valid URL pointing to a downloadable document file (PDF, etc.).

  • config (ExtractionConfig | None) – Configuration for extraction. If None, uses default extraction settings. This config will be used as the default for all extraction methods on this document.

  • filename (str | None) – Optional filename for the downloaded file. If None, the filename will be extracted from the URL or a default name will be used.

Returns:

Document instance with the downloaded file ready for extraction.

Return type:

Document

Raises:

Exception – If the URL is invalid, the file cannot be downloaded, or the downloaded file is not a valid document format.

Example

>>> # Download document from URL
>>> doc = await Document.from_url(
...     "https://example.com/document.pdf",
...     filename="my_document.pdf"
... )
>>>
>>> # Extract content from downloaded document
>>> result = await doc.extract_content()
>>> print(f"Extracted {len(result.entries)} elements")
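The filename fallback described for from_url can be sketched with the standard library. Note that `filename_from_url` is a hypothetical helper shown for illustration, not part of the docviz API:

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

def filename_from_url(url: str, default: str = "document.pdf") -> str:
    """Best-effort filename from a URL path, falling back to a default name."""
    name = PurePosixPath(urlparse(url).path).name
    return name or default

print(filename_from_url("https://example.com/reports/q3.pdf"))  # q3.pdf
print(filename_from_url("https://example.com/"))                # document.pdf
```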
property page_count: int

Get the total number of pages in the document.

This property provides lazy loading of the page count. The page count is only calculated when first accessed, and then cached for subsequent accesses. This approach avoids unnecessary file operations when the page count isn’t needed.

The method uses PyMuPDF (fitz) to open the document and count pages. If the document cannot be opened or the page count cannot be determined, it returns 0 and logs a warning.

Returns:

The total number of pages in the document. Returns 0 if the page count cannot be determined.

Return type:

int

Raises:

No explicit exceptions are raised, but a warning is logged if the document cannot be opened or processed.

Example

>>> doc = Document("document.pdf")
>>> print(f"Document has {doc.page_count} pages")
>>> # The page count is now cached and won't be recalculated
>>> print(f"Still has {doc.page_count} pages")
async extract_content(extraction_config=None, detection_config=None, includes=None, progress_callback=None, llm_config=None)[source]

Extract all content from the document asynchronously.

This method extracts all content from the document in a single operation and returns a complete ExtractionResult containing all extracted elements. It’s the primary async method for document content extraction.

The method uses the document’s default configuration if no extraction_config is provided, allowing for document-specific default settings while still supporting per-extraction customization.

Processing characteristics:
  • Processes the entire document at once

  • Returns complete results in a single ExtractionResult

  • Uses the document’s default config if no config is provided

  • Supports all content types (text, tables, figures, equations)

  • Provides progress tracking capabilities

Parameters:
  • extraction_config (ExtractionConfig | None) – Configuration for extraction process. If None, uses the document’s default configuration (self.config).

  • detection_config (DetectionConfig | None) – Configuration for layout detection. If None, uses default detection settings optimized for general document processing.

  • includes (list[ExtractionType] | None) – List of content types to extract. If None, extracts all available content types. Use ExtractionType.ALL for all types or specify individual types like [ExtractionType.TABLE, ExtractionType.TEXT].

  • progress_callback (Callable[[int], None] | None) – Optional callback for progress tracking. Called with current page number during processing. Useful for UI progress updates.

  • llm_config (LLMConfig | None) – Configuration for LLM-based content analysis. If None, uses default LLM settings for content enhancement and analysis.

Returns:

Complete extraction result containing all extracted content from the document, organized by page and content type.

Return type:

ExtractionResult

Raises:

Exception – If document processing fails, file access issues, or pipeline errors occur. The specific exception depends on the failure point.

Example

>>> doc = Document("document.pdf")
>>> # Extract all content using document's default config
>>> result = await doc.extract_content()
>>> print(f"Extracted {len(result.entries)} elements")
>>>
>>> # Extract specific content types with custom config
>>> tables_only = await doc.extract_content(
...     includes=[ExtractionType.TABLE],
...     progress_callback=lambda page: print(f"Processing page {page}")
... )
extract_content_sync(extraction_config=None, detection_config=None, includes=None, progress_callback=None, llm_config=None)[source]

Extract all content from the document synchronously. This is the blocking counterpart of extract_content() and accepts the same parameters.

Return type:

ExtractionResult

async extract_streaming(extraction_config=None, detection_config=None, includes=None, progress_callback=None, llm_config=None)[source]

Extract content page by page for memory-efficient streaming processing.

This method provides memory-efficient streaming extraction by yielding results page by page as they are processed. It’s ideal for large documents where loading all content into memory at once would be problematic.

The method processes pages sequentially and yields each page’s results as soon as processing is complete. This allows for real-time processing and reduces memory usage compared to loading all results at once.

Key benefits:
  • Memory efficient: only one page is processed at a time

  • Real-time results: pages are yielded as soon as they’re processed

  • Progress tracking: progress can be tracked per page

  • Scalable: suitable for documents of any size

  • Configurable: uses the document’s default config if no config is provided

Parameters:
  • extraction_config (ExtractionConfig | None) – Configuration for extraction process. If None, uses the document’s default configuration (self.config).

  • detection_config (DetectionConfig | None) – Configuration for layout detection. If None, uses default detection settings optimized for general document processing.

  • includes (list[ExtractionType] | None) – List of content types to extract. If None, extracts all available content types. Use ExtractionType.ALL for all types or specify individual types like [ExtractionType.TABLE, ExtractionType.TEXT].

  • progress_callback (Callable[[int], None] | None) – Optional callback for progress tracking. Called with current page number during processing. Useful for UI progress updates.

  • llm_config (LLMConfig | None) – Configuration for LLM-based content analysis. If None, uses default LLM settings for content enhancement and analysis.

Yields:

ExtractionResult – Extraction result for each processed page. Each result contains all extracted content for that specific page.

Raises:

Exception – If document processing fails, file access issues, or pipeline errors occur. The specific exception depends on the failure point.

Return type:

AsyncIterator[ExtractionResult]

Example

>>> doc = Document("large_document.pdf")
>>> # Process pages as they become available
>>> async for page_result in doc.extract_streaming():
...     print(f"Page {page_result.page_number}: {len(page_result.entries)} elements")
...     # Process each page immediately
...     for entry in page_result.entries:
...         if entry.class_ == "table":
...             print(f"Found table: {entry.text[:50]}...")
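The page-by-page yielding described above follows Python's async-generator pattern. The sketch below shows the shape of that pattern in isolation; PageResult and stream_pages are simplified stand-ins for docviz's ExtractionResult and extract_streaming, not the library's actual types.

```python
import asyncio
from dataclasses import dataclass, field

# Simplified stand-in for docviz's ExtractionResult.
@dataclass
class PageResult:
    page_number: int
    entries: list = field(default_factory=list)

async def stream_pages(page_count: int):
    """Yield one result per page as soon as that page is 'processed'."""
    for page in range(1, page_count + 1):
        await asyncio.sleep(0)  # stand-in for real per-page extraction work
        yield PageResult(page_number=page, entries=[f"element-{page}"])

async def main():
    seen = []
    async for page_result in stream_pages(3):
        seen.append(page_result.page_number)  # each page arrives incrementally
    return seen

print(asyncio.run(main()))  # [1, 2, 3]
```

Because each result is yielded before the next page is touched, peak memory stays proportional to one page rather than the whole document.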
extract_streaming_sync(extraction_config=None, detection_config=None, includes=None, progress_callback=None, llm_config=None)[source]

Extract content page by page for memory-efficient streaming processing (sync version).

Parameters:
  • extraction_config (ExtractionConfig | None) – Configuration for extraction

  • detection_config (DetectionConfig | None) – Configuration for detection

  • includes (list[ExtractionType] | None) – Types of content to include

  • progress_callback (Callable[[int], None] | None) – Optional callback for progress tracking

  • llm_config (LLMConfig | None) – Configuration for LLM

Yields:

ExtractionResult – Extraction result for each processed page

Return type:

Iterator[ExtractionResult]
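One way a synchronous wrapper over an async iterator can work is to drive the iterator from an event loop one item at a time. This is a sketch of the general technique, not docviz's actual implementation; `iterate_sync` and `pages` are illustrative names.

```python
import asyncio
from typing import AsyncIterator, Iterator, TypeVar

T = TypeVar("T")

def iterate_sync(aiter: AsyncIterator[T]) -> Iterator[T]:
    """Drive an async iterator from blocking code, one item at a time."""
    loop = asyncio.new_event_loop()
    try:
        while True:
            try:
                # Run the event loop just long enough to produce the next item.
                yield loop.run_until_complete(aiter.__anext__())
            except StopAsyncIteration:
                break
    finally:
        loop.close()

async def pages():
    for n in (1, 2, 3):
        yield n

print(list(iterate_sync(pages())))  # [1, 2, 3]
```

A wrapper like this cannot be called from code that is already running inside an event loop; in that situation the async variant should be used directly.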

extract_chunked(chunk_size=10, extraction_config=None, detection_config=None, includes=None, llm_config=None)[source]

Extract content in chunks for memory-efficient processing.

This method processes the document in configurable page chunks, providing a balance between memory efficiency and processing efficiency. It’s useful for large documents where you want to process multiple pages at once but still maintain reasonable memory usage.

The method divides the document into chunks of specified size and processes each chunk as a separate extraction operation. This approach allows for better memory management while still providing batch processing benefits.

Chunking strategy:
  • Divides the document into chunks of chunk_size pages

  • Processes each chunk independently

  • Yields ExtractionChunk objects with chunk metadata

  • Maintains page numbering across chunks

Parameters:
  • chunk_size (int) – Number of pages to process in each chunk. Default is 10 pages. Larger chunks use more memory but may be more efficient for processing.

  • extraction_config (ExtractionConfig | None) – Configuration for extraction process. If None, uses the document’s default configuration (self.config).

  • detection_config (DetectionConfig | None) – Configuration for layout detection. If None, uses default detection settings optimized for general document processing.

  • includes (list[ExtractionType] | None) – List of content types to extract. If None, extracts all available content types. Use ExtractionType.ALL for all types or specify individual types like [ExtractionType.TABLE, ExtractionType.TEXT].

  • llm_config (LLMConfig | None) – Configuration for LLM-based content analysis. If None, uses default LLM settings for content enhancement and analysis.

Yields:

ExtractionChunk – Chunks of extraction results. Each chunk contains:
  • result: ExtractionResult for the chunk’s pages

  • start_page: First page number in the chunk

  • end_page: Last page number in the chunk

Raises:

Exception – If document processing fails, file access issues, or pipeline errors occur. The specific exception depends on the failure point.

Return type:

Iterator[ExtractionChunk]

Example

>>> doc = Document("large_document.pdf")
>>> # Process document in 5-page chunks
>>> for chunk in doc.extract_chunked(chunk_size=5):
...     print(f"Chunk {chunk.start_page}-{chunk.end_page}: {len(chunk.result.entries)} elements")
...     # Process each chunk
...     for entry in chunk.result.entries:
...         if entry.class_ == "table":
...             print(f"Table on page {entry.page_number}")
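The chunk-boundary arithmetic described in the chunking strategy can be sketched in a few lines. `plan_chunks` is a hypothetical helper showing only how start/end page numbers are derived; it performs no extraction.

```python
def plan_chunks(page_count: int, chunk_size: int = 10) -> list[tuple[int, int]]:
    """Split pages 1..page_count into consecutive (start_page, end_page) chunks."""
    return [
        (start, min(start + chunk_size - 1, page_count))
        for start in range(1, page_count + 1, chunk_size)
    ]

# A 23-page document in 10-page chunks: the last chunk is shorter.
print(plan_chunks(23, chunk_size=10))  # [(1, 10), (11, 20), (21, 23)]
```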


Usage Examples

Basic Document Creation

import docviz

# Create document from local file
document = docviz.Document("path/to/document.pdf")

# Create document from a URL (must be awaited inside an async function)
document = await docviz.Document.from_url("https://example.com/document.pdf")

Asynchronous Extraction

import asyncio
import docviz

async def extract_document():
    document = docviz.Document("document.pdf")
    extractions = await document.extract_content()
    return extractions

result = asyncio.run(extract_document())

Synchronous Extraction

import docviz

document = docviz.Document("document.pdf")
extractions = document.extract_content_sync()
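The extraction methods accept a progress_callback with the signature Callable[[int], None], invoked with the current page number. The sketch below shows a callback of that shape in isolation; the loop stands in for the extraction pipeline, which would normally invoke the callback itself.

```python
def make_progress_callback(total_pages: int):
    """Build a Callable[[int], None] that records and reports page progress."""
    seen: list[int] = []

    def report(page: int) -> None:
        seen.append(page)
        print(f"processed page {page}/{total_pages}")

    report.seen = seen  # expose history for inspection
    return report

callback = make_progress_callback(total_pages=3)
for page in (1, 2, 3):  # stand-in for the extraction loop invoking the callback
    callback(page)
print(callback.seen)  # [1, 2, 3]
```

In real usage the callback would be passed as `document.extract_content_sync(progress_callback=callback)` and driven by the pipeline.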

Streaming Extraction

import asyncio
import docviz

async def process_streaming():
    document = docviz.Document("large_document.pdf")

    # Process page by page asynchronously
    async for page_result in document.extract_streaming():
        print(f"Page {page_result.page_number}: {len(page_result.entries)} items")

# Run the async streaming function
asyncio.run(process_streaming())