Data Types
This section documents the data types and enums used in docviz-python.
Extraction Types
- class docviz.types.extraction_type.ExtractionType(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)
Bases:
Enum
Enumeration of content types that can be extracted from documents.
This enum defines the different types of content that can be extracted and processed from documents. Each type corresponds to a specific category of document content with its own processing requirements and characteristics.
The enum provides utility methods for working with extraction types, including getting all types (excluding the special ALL type) and converting to canonical label names used by the detection system.
- Variables:
ALL – Special value indicating all content types should be extracted. This is a convenience option that expands to all individual types.
TABLE – Tabular data and structured information organized in rows and columns. Includes data tables, comparison tables, and other tabular formats.
TEXT – Regular text content including paragraphs, headings, lists, and other textual elements. This is the most common content type.
FIGURE – Visual elements including charts, graphs, diagrams, images, visualizations, and other graphical content.
EQUATION – Mathematical expressions, formulas, and equations. Includes both inline and block mathematical content.
OTHER – Miscellaneous content that doesn’t fit into other categories. May include special elements, annotations, or unrecognized content.
Example
>>> # Extract all content types
>>> types = [ExtractionType.ALL]
>>>
>>> # Extract specific content types
>>> types = [ExtractionType.TABLE, ExtractionType.TEXT]
>>>
>>> # Get all individual types (excluding ALL)
>>> all_types = ExtractionType.get_all()
>>>
>>> # Convert to canonical label
>>> label = ExtractionType.TABLE.to_canonical_label()
>>> print(label)  # "table"
- ALL = 'all'
- TABLE = 'table'
- TEXT = 'text'
- FIGURE = 'figure'
- EQUATION = 'equation'
- OTHER = 'other'
- to_canonical_label()
Convert the extraction type to a canonical label.
This method maps the extraction type to a canonical label used by the detection system. The canonical label is a string representation of the extraction type that is used to identify the type of content in the document.
- Returns:
The canonical label for the extraction type.
- Return type:
str
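The behaviour described above can be illustrated with a small, self-contained stand-in. This is a sketch only, not the docviz implementation: `ExtractionTypeSketch` and the `expand` helper are hypothetical names, and the assumption that the canonical label equals the member's string value is inferred from the `"table"` output in the example.

```python
from enum import Enum


class ExtractionTypeSketch(Enum):
    """Hypothetical stand-in mirroring the documented members."""

    ALL = "all"
    TABLE = "table"
    TEXT = "text"
    FIGURE = "figure"
    EQUATION = "equation"
    OTHER = "other"

    @classmethod
    def get_all(cls) -> list["ExtractionTypeSketch"]:
        # Every member except the special ALL value.
        return [member for member in cls if member is not cls.ALL]

    def to_canonical_label(self) -> str:
        # Assumption: the canonical label equals the member's string value.
        return self.value


def expand(types: list[ExtractionTypeSketch]) -> list[ExtractionTypeSketch]:
    """Expand ALL into the individual types (hypothetical helper)."""
    if ExtractionTypeSketch.ALL in types:
        return ExtractionTypeSketch.get_all()
    return list(types)
```

Keeping ALL out of `get_all()` means downstream code can iterate over concrete content types without special-casing the convenience value.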
Extraction Configuration
- class docviz.types.extraction_config.ExtractionConfig(page_limit=None, zoom_x=3.0, zoom_y=3.0, pdf_text_threshold_chars=1000, labels_to_exclude=<factory>, prefer_pdf_text=False)
Bases:
object
Configuration for document content extraction and processing.
This configuration class controls the behavior of the document content extraction system, which extracts and processes different types of content (text, tables, figures, equations) from detected document regions.
The extraction system handles various aspects of content processing including OCR, text extraction, table parsing, and content filtering. This configuration allows fine-tuning of the extraction process for optimal results.
- Variables:
page_limit (int | None) – The maximum number of pages to extract from the document. If None, all pages in the document will be processed. Useful for processing large documents in parts or for testing with limited page ranges.
zoom_x (float) – The horizontal zoom factor for image processing. Higher values increase image resolution for better OCR accuracy but require more memory and processing time. Default is 3.0, a good balance of quality and performance.
zoom_y (float) – The vertical zoom factor for image processing. Higher values increase image resolution for better OCR accuracy but require more memory and processing time. Default is 3.0, a good balance of quality and performance.
pdf_text_threshold_chars (int) – The minimum number of characters required in a PDF text element to be considered valid content. Elements with fewer characters may be ignored in favor of OCR extraction. Default is 1000 characters.
labels_to_exclude (list[str]) – List of content labels to exclude from extraction. These labels correspond to specific content types that should be skipped during processing. Common exclusions include headers, footers, and other non-content elements.
prefer_pdf_text (bool) – Whether to prefer PDF-embedded text over OCR when both are available. When True, the system will use PDF text when it meets quality thresholds. When False, OCR will be used even when PDF text is available. Default is False for maximum compatibility.
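- Parameters:
page_limit (int | None)
zoom_x (float)
zoom_y (float)
pdf_text_threshold_chars (int)
labels_to_exclude (list[str])
prefer_pdf_text (bool)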
Example
>>> # Basic extraction configuration
>>> config = ExtractionConfig(
...     page_limit=None,  # Process all pages
...     zoom_x=3.0,
...     zoom_y=3.0,
...     pdf_text_threshold_chars=1000,
...     labels_to_exclude=["header", "footer"],
...     prefer_pdf_text=False
... )
>>>
>>> # High-quality extraction for small documents
>>> config = ExtractionConfig(
...     page_limit=10,
...     zoom_x=4.0,
...     zoom_y=4.0,
...     pdf_text_threshold_chars=500,
...     labels_to_exclude=[],
...     prefer_pdf_text=True
... )
- __init__(page_limit=None, zoom_x=3.0, zoom_y=3.0, pdf_text_threshold_chars=1000, labels_to_exclude=<factory>, prefer_pdf_text=False)
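The `<factory>` marker in the signature is how Python's `dataclasses` module renders a `default_factory` default, which suggests the class is a dataclass with a per-instance list. The sketch below shows that pattern with a hypothetical stand-in, `ExtractionConfigSketch`; the assumption that the factory yields an empty list is illustrative, since the real default contents are not shown in the signature.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ExtractionConfigSketch:
    """Hypothetical stand-in with the documented fields and defaults."""

    page_limit: Optional[int] = None
    zoom_x: float = 3.0
    zoom_y: float = 3.0
    pdf_text_threshold_chars: int = 1000
    # Assumption: the factory produces an empty list. A factory gives each
    # instance its own list instead of sharing one mutable default.
    labels_to_exclude: list[str] = field(default_factory=list)
    prefer_pdf_text: bool = False


a = ExtractionConfigSketch()
b = ExtractionConfigSketch(labels_to_exclude=["header", "footer"])

# Mutating one instance's list leaves other instances untouched.
a.labels_to_exclude.append("footnote")
```

Because each instance receives a fresh list, appending to `a.labels_to_exclude` does not affect `b` or any configuration created later.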
Detection Configuration
- class docviz.types.detection_config.DetectionConfig(imagesize, confidence, device, layout_detection_backend, model_path)
Bases:
object
Configuration for document layout detection and analysis.
This configuration class controls the behavior of the document layout detection system, which identifies and locates different content types (text, tables, figures, equations) within document pages.
The detection system uses computer vision models to analyze document images and identify regions of interest. This configuration allows fine-tuning of the detection process for optimal performance and accuracy.
- Variables:
imagesize (int) – The size of the image to process for detection. Larger images generally provide better accuracy but require more computational resources. Common values are 512, 1024, or 2048, with 1024 a typical choice.
confidence (float) – The confidence threshold for detection results. Only detections with confidence scores above this threshold are included in results. Range: 0.0 to 1.0. Higher values are more selective but may miss valid content. Lower values include more content but may include false positives.
device (str) – The computing device to use for detection. Options include “cpu”, “cuda”, “mps” (Apple Silicon), or specific device identifiers like “cuda:0”. Use “cpu” for compatibility, “cuda” for NVIDIA GPUs.
layout_detection_backend (DetectionBackendEnum) – The detection backend to use. Different backends may use different models or algorithms for layout detection. Options include various YOLO-based models and other detection frameworks.
model_path (str) – Path to the detection model file. This should point to a valid model file compatible with the specified backend. The model file contains the trained weights and architecture for the detection system.
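- Parameters:
imagesize (int)
confidence (float)
device (str)
layout_detection_backend (DetectionBackendEnum)
model_path (str)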
Example
>>> # Basic CPU configuration
>>> config = DetectionConfig(
...     imagesize=1024,
...     confidence=0.5,
...     device="cpu",
...     layout_detection_backend=DetectionBackendEnum.DOCLAYOUT_YOLO,
...     model_path="/path/to/model.pt"
... )
>>>
>>> # High-accuracy GPU configuration
>>> config = DetectionConfig(
...     imagesize=2048,
...     confidence=0.7,
...     device="cuda",
...     layout_detection_backend=DetectionBackendEnum.DOCLAYOUT_YOLO,
...     model_path="/path/to/model.pt"
... )
- layout_detection_backend: DetectionBackendEnum
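To make the effect of the `confidence` threshold concrete, here is a minimal sketch of threshold filtering. The `Detection` record and `filter_by_confidence` function are hypothetical and do not come from docviz; the strict "greater than" comparison follows the "above this threshold" wording in the description.

```python
from typing import NamedTuple


class Detection(NamedTuple):
    """Hypothetical detection record: a label and a confidence score."""

    label: str
    score: float


def filter_by_confidence(detections, threshold):
    # Keep only detections scoring strictly above the threshold.
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be within [0.0, 1.0]")
    return [d for d in detections if d.score > threshold]


detections = [
    Detection("table", 0.92),
    Detection("text", 0.55),
    Detection("figure", 0.30),
]

# A low threshold keeps everything; a high one keeps only the table.
loose = filter_by_confidence(detections, 0.25)
strict = filter_by_confidence(detections, 0.7)
```

Raising the threshold trades recall for precision: the strict call drops the lower-scoring text and figure regions, which may or may not be valid content.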
LLM Configuration
OCR Configuration
Save Formats
Extraction Result
- class docviz.types.extraction_result.ExtractionResult(entries, page_number)
Bases:
object
- Parameters:
entries (list[ExtractionEntry])
page_number (int)
- __init__(entries, page_number)
- Parameters:
entries (list[ExtractionEntry])
page_number (int)
- to_json(file_path)
Save the extraction result to a JSON file.
- Parameters:
file_path (str | Path) – The path of the output file, without extension.
- to_csv(file_path)
Save the extraction result to a CSV file.
- Parameters:
file_path (str | Path) – The path of the output file, without extension.
- to_excel(file_path)
Save the extraction result to an Excel file.
- Parameters:
file_path (str | Path) – The path of the output file, without extension.
- to_xml(file_path)
Save the extraction result to an XML file.
- Parameters:
file_path (str | Path) – The path of the output file, without extension.
- save(file_path_without_ext, save_format)
Save the extraction result to a file. Note that the file path must be given without an extension.
- Parameters:
file_path_without_ext (str | Path) – The path of the output file, without extension.
save_format (SaveFormat | list[SaveFormat]) – The format or formats to save the result in.
- Raises:
ValueError – If the provided save format is not a member of the SaveFormat enum.
- to_dict()
Convert the extraction result to a dictionary.
- Returns:
Dictionary representation of the extraction result.
- Return type:
dict
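The "path without extension" convention used by `save` can be sketched as follows. This is an illustration under stated assumptions, not docviz code: `SaveFormatSketch`, its members, the file extensions, and `resolve_output_paths` are all hypothetical, since the actual SaveFormat members are not listed in this section.

```python
from enum import Enum
from pathlib import Path


class SaveFormatSketch(Enum):
    """Hypothetical stand-in; members and extensions are assumptions."""

    JSON = "json"
    CSV = "csv"
    EXCEL = "xlsx"
    XML = "xml"


def resolve_output_paths(file_path_without_ext, save_format):
    """Return one output path per requested format."""
    # Accept a single format or a list of formats.
    formats = save_format if isinstance(save_format, list) else [save_format]
    base = Path(file_path_without_ext)
    paths = []
    for fmt in formats:
        if not isinstance(fmt, SaveFormatSketch):
            raise ValueError(f"Unsupported save format: {fmt!r}")
        # Append the extension rather than replacing a suffix, so a base
        # name such as "report.v2" is preserved.
        paths.append(base.with_name(f"{base.name}.{fmt.value}"))
    return paths


paths = resolve_output_paths(
    "out/report", [SaveFormatSketch.JSON, SaveFormatSketch.CSV]
)
```

Passing the base path once and deriving each extension from the format lets a single call write several formats side by side without the caller building file names by hand.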
Extraction Entry
- class docviz.types.extraction_result.ExtractionEntry(text, class_, confidence=-1.0, bbox=<factory>, page_number=-1)
Bases:
object
Extraction entry.
Extraction Chunk
- class docviz.types.extraction_chunk.ExtractionChunk(result, start_page, end_page)
Bases:
object
Represents a chunk of extraction results from streaming processing.
- Variables:
result (ExtractionResult) – The extraction results for this chunk.
page_range (str) – The page range this chunk covers (e.g., “1-10”).
start_page (int) – The starting page number (1-indexed).
end_page (int) – The ending page number (1-indexed).
- Parameters:
result (ExtractionResult)
start_page (int)
end_page (int)
- result: ExtractionResult
- save(file_path, save_format)
Save the chunk results to a file.
- Parameters:
file_path (str | Path) – The path to save the file.
save_format (SaveFormat | list[SaveFormat]) – The format(s) to save in.
- __init__(result, start_page, end_page)
- Parameters:
result (ExtractionResult)
start_page (int)
end_page (int)
- Return type:
None
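Since `page_range` appears among the attributes but not among the constructor parameters, it is presumably derived from `start_page` and `end_page`. The sketch below shows one way that could work, together with a helper that splits a document into consecutive 1-indexed chunks. `ChunkSketch` and `chunk_pages` are hypothetical names, the `result` field is omitted, and the `"<start>-<end>"` format is assumed from the "1-10" example.

```python
from dataclasses import dataclass


@dataclass
class ChunkSketch:
    """Hypothetical stand-in for a streamed chunk (result omitted)."""

    start_page: int  # 1-indexed, inclusive
    end_page: int    # 1-indexed, inclusive

    @property
    def page_range(self) -> str:
        # Assumption: the range string is "<start>-<end>".
        return f"{self.start_page}-{self.end_page}"


def chunk_pages(total_pages: int, chunk_size: int) -> list[ChunkSketch]:
    """Split pages 1..total_pages into consecutive chunks."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [
        ChunkSketch(start, min(start + chunk_size - 1, total_pages))
        for start in range(1, total_pages + 1, chunk_size)
    ]


ranges = [c.page_range for c in chunk_pages(25, 10)]
# ranges == ["1-10", "11-20", "21-25"]
```

The last chunk is clamped to the document length, so a 25-page document split by 10 ends with a shorter "21-25" chunk.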
Detection Result
Type Aliases
- docviz.types.aliases.numeric: TypeAlias = int | float
A number that can be either an integer or a float.
- docviz.types.aliases.RectangleTuple
A rectangle defined by (x1, y1, x2, y2) coordinates.
alias of tuple[int | float, int | float, int | float, int | float]
- docviz.types.aliases.RectangleList
A list of rectangles defined by (x1, y1, x2, y2) coordinates.
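As a rough illustration of how these aliases might be used, the sketch below redefines them locally and computes rectangle areas. The local alias definitions are stand-ins, not imports from docviz; in particular, treating `RectangleList` as `list[RectangleTuple]` is an assumption based on its description, and `area` and `total_area` are hypothetical helpers.

```python
from typing import Union

Numeric = Union[int, float]
RectangleTuple = tuple[Numeric, Numeric, Numeric, Numeric]
# Assumption: the list alias is a list of RectangleTuple values.
RectangleList = list[RectangleTuple]


def area(rect: RectangleTuple) -> Numeric:
    """Area of an (x1, y1, x2, y2) rectangle; assumes x2 >= x1 and y2 >= y1."""
    x1, y1, x2, y2 = rect
    return (x2 - x1) * (y2 - y1)


def total_area(rects: RectangleList) -> Numeric:
    # Sums areas without accounting for overlap between rectangles.
    return sum(area(r) for r in rects)


boxes: RectangleList = [(0, 0, 10, 5), (2.5, 1.0, 4.5, 3.0)]
```

Allowing both `int` and `float` coordinates means pixel-based and scaled (zoomed) bounding boxes can share the same type.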