Data Types

This section documents the data types and enums used in docviz-python.

Extraction Types

class docviz.types.extraction_type.ExtractionType(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)

Bases: Enum

Enumeration of content types that can be extracted from documents.

This enum defines the different types of content that can be extracted and processed from documents. Each type corresponds to a specific category of document content with its own processing requirements and characteristics.

The enum provides utility methods for working with extraction types, including getting all types (excluding the special ALL type) and converting to canonical label names used by the detection system.

Variables:

ALL – Special value indicating all content types should be extracted. This is a convenience option that expands to all individual types.
TABLE – Tabular data and structured information organized in rows and columns. Includes data tables, comparison tables, and other tabular formats.
TEXT – Regular text content including paragraphs, headings, lists, and other textual elements. This is the most common content type.
FIGURE – Visual elements including charts, graphs, diagrams, images, and other graphical content.
EQUATION – Mathematical expressions, formulas, and equations. Includes both inline and block mathematical content.
OTHER – Miscellaneous content that doesn’t fit into other categories. May include special elements, annotations, or unrecognized content.

Example

>>> # Extract all content types
>>> types = [ExtractionType.ALL]
>>>
>>> # Extract specific content types
>>> types = [ExtractionType.TABLE, ExtractionType.TEXT]
>>>
>>> # Get all individual types (excluding ALL)
>>> all_types = ExtractionType.get_all()
>>>
>>> # Convert to canonical label
>>> label = ExtractionType.TABLE.to_canonical_label()
>>> print(label)  # "table"

ALL = 'all'

TABLE = 'table'

TEXT = 'text'

FIGURE = 'figure'

EQUATION = 'equation'

OTHER = 'other'

classmethod get_all()

to_canonical_label()

Convert the extraction type to a canonical label.

This method maps the extraction type to a canonical label used by the detection system. The canonical label is a string representation of the extraction type that is used to identify the type of content in the document.

Returns: The canonical label for the extraction type.
Return type: str
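
As a rough illustration of how the special ALL value could expand into the individual types, here is a minimal sketch. The enum below is a stand-in that only mirrors the documented members, and expand_types is a hypothetical helper, not part of docviz.

```python
from enum import Enum


class ExtractionType(Enum):
    """Stand-in mirroring the documented members; not the docviz source."""

    ALL = "all"
    TABLE = "table"
    TEXT = "text"
    FIGURE = "figure"
    EQUATION = "equation"
    OTHER = "other"


def expand_types(requested):
    """Hypothetical helper: replace ALL with every individual type."""
    individual = [t for t in ExtractionType if t is not ExtractionType.ALL]
    if ExtractionType.ALL in requested:
        return individual
    # Keep enum definition order and drop duplicates.
    return [t for t in individual if t in requested]


print([t.value for t in expand_types([ExtractionType.ALL])])
print([t.value for t in expand_types([ExtractionType.TEXT, ExtractionType.TABLE])])
```

Returning the types in definition order keeps the result stable regardless of how the caller ordered the request.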

Extraction Configuration

class docviz.types.extraction_config.ExtractionConfig(page_limit=None, zoom_x=3.0, zoom_y=3.0, pdf_text_threshold_chars=1000, labels_to_exclude=<factory>, prefer_pdf_text=False)

Bases: object

Configuration for document content extraction and processing.

This configuration class controls the behavior of the document content extraction system, which extracts and processes different types of content (text, tables, figures, equations) from detected document regions.

The extraction system handles various aspects of content processing including OCR, text extraction, table parsing, and content filtering. This configuration allows fine-tuning of the extraction process for optimal results.

Variables:

page_limit (int | None) – The maximum number of pages to extract from the document. If None, all pages in the document will be processed. Useful for processing large documents in parts or for testing with limited page ranges.
zoom_x (float) – The horizontal zoom factor for image processing. Higher values increase image resolution for better OCR accuracy but require more memory and processing time. Default is 3.0 for good balance of quality and performance.
zoom_y (float) – The vertical zoom factor for image processing. Higher values increase image resolution for better OCR accuracy but require more memory and processing time. Default is 3.0 for good balance of quality and performance.
pdf_text_threshold_chars (int) – The minimum number of characters required in a PDF text element to be considered valid content. Elements with fewer characters may be ignored in favor of OCR extraction. Default is 1000 characters.
labels_to_exclude (list[str]) – List of content labels to exclude from extraction. These labels correspond to specific content types that should be skipped during processing. Common exclusions include headers, footers, and other non-content elements.
prefer_pdf_text (bool) – Whether to prefer PDF-embedded text over OCR when both are available. When True, the system will use PDF text when it meets quality thresholds. When False, OCR will be used even when PDF text is available. Default is False for maximum compatibility.

Parameters:

page_limit (int | None)
zoom_x (float)
zoom_y (float)
pdf_text_threshold_chars (int)
labels_to_exclude (list[str])
prefer_pdf_text (bool)

Example

>>> # Basic extraction configuration
>>> config = ExtractionConfig(
...     page_limit=None,  # Process all pages
...     zoom_x=3.0,
...     zoom_y=3.0,
...     pdf_text_threshold_chars=1000,
...     labels_to_exclude=["header", "footer"],
...     prefer_pdf_text=False
... )
>>>
>>> # High-quality extraction for small documents
>>> config = ExtractionConfig(
...     page_limit=10,
...     zoom_x=4.0,
...     zoom_y=4.0,
...     pdf_text_threshold_chars=500,
...     labels_to_exclude=[],
...     prefer_pdf_text=True
... )
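
The `<factory>` default shown for labels_to_exclude suggests a dataclass field built with a default factory, so each instance gets its own list. The sketch below uses a stand-in class with the documented field names and defaults; the factory is assumed to be `list` (an empty list), which this page does not confirm.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ExtractionConfig:
    """Stand-in with the documented fields and defaults; not the docviz source."""

    page_limit: Optional[int] = None
    zoom_x: float = 3.0
    zoom_y: float = 3.0
    pdf_text_threshold_chars: int = 1000
    # Assumption: the undocumented factory produces an empty list.
    labels_to_exclude: list[str] = field(default_factory=list)
    prefer_pdf_text: bool = False


a = ExtractionConfig()
b = ExtractionConfig(labels_to_exclude=["header", "footer"])

# Mutating one instance's list does not leak into other instances.
a.labels_to_exclude.append("footnote")
print(a.labels_to_exclude)
print(b.labels_to_exclude)
print(ExtractionConfig().labels_to_exclude)
```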

page_limit: int | None = None

zoom_x: float = 3.0

zoom_y: float = 3.0

pdf_text_threshold_chars: int = 1000

labels_to_exclude: list[str]

prefer_pdf_text: bool = False

__init__(page_limit=None, zoom_x=3.0, zoom_y=3.0, pdf_text_threshold_chars=1000, labels_to_exclude=<factory>, prefer_pdf_text=False)

Parameters:

page_limit (int | None)
zoom_x (float)
zoom_y (float)
pdf_text_threshold_chars (int)
labels_to_exclude (list[str])
prefer_pdf_text (bool)

Return type:

None

Detection Configuration

class docviz.types.detection_config.DetectionConfig(imagesize, confidence, device, layout_detection_backend, model_path)

Bases: object

Configuration for document layout detection and analysis.

This configuration class controls the behavior of the document layout detection system, which identifies and locates different content types (text, tables, figures, equations) within document pages.

The detection system uses computer vision models to analyze document images and identify regions of interest. This configuration allows fine-tuning of the detection process for optimal performance and accuracy.

Variables:

imagesize (int) – The size of the image to process for detection. Larger images generally provide better accuracy but require more computational resources. Common values are 512, 1024, or 2048. The constructor defines no default, so a value must always be passed; 1024 is a typical choice.
confidence (float) – The confidence threshold for detection results. Only detections with confidence scores above this threshold are included in results. Range: 0.0 to 1.0. Higher values are more selective but may miss valid content. Lower values include more content but may include false positives.
device (str) – The computing device to use for detection. Options include “cpu”, “cuda”, “mps” (Apple Silicon), or specific device identifiers like “cuda:0”. Use “cpu” for compatibility, “cuda” for NVIDIA GPUs.
layout_detection_backend (DetectionBackendEnum) – The detection backend to use. Different backends may use different models or algorithms for layout detection. Options include various YOLO-based models and other detection frameworks.
model_path (str) – Path to the detection model file. This should point to a valid model file compatible with the specified backend. The model file contains the trained weights and architecture for the detection system.

Parameters:

imagesize (int)
confidence (float)
device (str)
layout_detection_backend (DetectionBackendEnum)
model_path (str)

Example

>>> # Basic CPU configuration
>>> config = DetectionConfig(
...     imagesize=1024,
...     confidence=0.5,
...     device="cpu",
...     layout_detection_backend=DetectionBackendEnum.DOCLAYOUT_YOLO,
...     model_path="/path/to/model.pt"
... )
>>>
>>> # High-accuracy GPU configuration
>>> config = DetectionConfig(
...     imagesize=2048,
...     confidence=0.7,
...     device="cuda",
...     layout_detection_backend=DetectionBackendEnum.DOCLAYOUT_YOLO,
...     model_path="/path/to/model.pt"
... )
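
To make the effect of the confidence threshold concrete, here is a minimal sketch of threshold filtering. Detection and filter_by_confidence are hypothetical names, and the strict "greater than" comparison is an assumption drawn from the wording "above this threshold"; the library may compare differently.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """Hypothetical stand-in for a single detection."""

    label_name: str
    confidence: float


def filter_by_confidence(detections, threshold):
    """Keep detections whose score is strictly above the threshold."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be in the range 0.0 to 1.0")
    return [d for d in detections if d.confidence > threshold]


detections = [
    Detection("table", 0.92),
    Detection("text", 0.55),
    Detection("figure", 0.31),
]

# A higher threshold is more selective; a lower one admits more candidates.
print([d.label_name for d in filter_by_confidence(detections, 0.7)])
print([d.label_name for d in filter_by_confidence(detections, 0.5)])
```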

imagesize: int

confidence: float

device: str

layout_detection_backend: DetectionBackendEnum

model_path: str

__init__(imagesize, confidence, device, layout_detection_backend, model_path)

Parameters:

imagesize (int)
confidence (float)
device (str)
layout_detection_backend (DetectionBackendEnum)
model_path (str)

Return type:

None

LLM Configuration

class docviz.types.llm_config.LLMConfig(model, api_key, base_url)

Bases: object

Data class representing a single LLM config.

Variables:

model (str) – The model to use for the LLM.
api_key (str) – The API key to use for the LLM.
base_url (str) – The base URL to use for the LLM.

Parameters:

model (str)
api_key (str)
base_url (str)

model: str

api_key: str

base_url: str

__init__(model, api_key, base_url)

Parameters:

model (str)
api_key (str)
base_url (str)

Return type:

None

OCR Configuration

class docviz.types.ocr_config.OCRConfig(lang, chart_labels, labels_to_exclude)

Bases: object

Configuration for OCR.

Variables:

lang (str) – The language to use for OCR.
chart_labels (list[str]) – The labels to use for chart OCR.
labels_to_exclude (list[str]) – The labels to exclude from OCR.

Parameters:

lang (str)
chart_labels (list[str])
labels_to_exclude (list[str])

lang: str

chart_labels: list[str]

labels_to_exclude: list[str]

__init__(lang, chart_labels, labels_to_exclude)

Parameters:

lang (str)
chart_labels (list[str])
labels_to_exclude (list[str])

Return type:

None

Save Formats

class docviz.types.save_format.SaveFormat(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)

Bases: Enum

JSON = 'json'

CSV = 'csv'

EXCEL = 'excel'

XML = 'xml'

Extraction Result

class docviz.types.extraction_result.ExtractionEntry(text, class_, confidence=-1.0, bbox=<factory>, page_number=-1)

Bases: object

Extraction entry.

Variables:

text (str) – The text of the entry.
class_ (str) – The class of the entry.
confidence (float) – The confidence of the entry.
bbox (list[float]) – The bounding box of the entry.
page_number (int) – The page number of the entry.

Parameters:

text (str)
class_ (str)
confidence (float)
bbox (list[float])
page_number (int)
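
The following sketch shows how an entry with these fields behaves when only the required arguments are given. The class is a stand-in built from the documented signature, not the docviz source, and the empty-list factory for bbox is an assumption. The -1 defaults presumably act as "not set" placeholders, although this page does not say so explicitly.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class ExtractionEntry:
    """Stand-in with the documented fields; not the docviz source."""

    text: str
    class_: str  # trailing underscore because "class" is a reserved word
    confidence: float = -1.0
    # Assumption: the undocumented factory produces an empty list.
    bbox: list[float] = field(default_factory=list)
    page_number: int = -1


entry = ExtractionEntry(text="Revenue by quarter", class_="table")

# Only text and class_ are required; the rest fall back to their defaults.
print(asdict(entry))
```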

text: str

class_: str

confidence: float = -1.0

bbox: list[float]

page_number: int = -1

__init__(text, class_, confidence=-1.0, bbox=<factory>, page_number=-1)

Parameters:

text (str)
class_ (str)
confidence (float)
bbox (list[float])
page_number (int)

Return type:

None

class docviz.types.extraction_result.ExtractionResult(entries, page_number)

Bases: object

Parameters:

entries (list[ExtractionEntry])
page_number (int)

__init__(entries, page_number)

Parameters:

entries (list[ExtractionEntry])
page_number (int)

to_json(file_path)

Save the extraction result to a JSON file.

Parameters: file_path (str | Path) – Path of the output file, without the extension.

to_csv(file_path)

Save the extraction result to a CSV file.

Parameters: file_path (str | Path) – Path of the output file, without the extension.

to_excel(file_path)

Save the extraction result to an Excel file.

Parameters: file_path (str | Path) – Path of the output file, without the extension.

to_xml(file_path)

Save the extraction result to an XML file.

Parameters: file_path (str | Path) – Path of the output file, without the extension.

save(file_path_without_ext, save_format)

Save the extraction result to a file. Note that the file path must be given without an extension.

Parameters:

file_path_without_ext (str | Path) – Path of the output file, without the extension.
save_format (SaveFormat | list[SaveFormat]) – The format, or list of formats, to save the result in.

Raises:

ValueError – If the provided save format is not a member of the SaveFormat enum.
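
One way to picture how an extension-less path and one or more formats could map to output files is sketched below. SaveFormat here is a stand-in with the documented members, resolve_targets is a hypothetical helper, and the suffix table (in particular ".xlsx" for EXCEL) is an assumption rather than documented behavior.

```python
from enum import Enum
from pathlib import Path


class SaveFormat(Enum):
    """Stand-in mirroring the documented members; not the docviz source."""

    JSON = "json"
    CSV = "csv"
    EXCEL = "excel"
    XML = "xml"


# Assumed file suffixes; the ".xlsx" mapping in particular is a guess.
SUFFIXES = {
    SaveFormat.JSON: ".json",
    SaveFormat.CSV: ".csv",
    SaveFormat.EXCEL: ".xlsx",
    SaveFormat.XML: ".xml",
}


def resolve_targets(file_path_without_ext, save_format):
    """Hypothetical helper: one output path per requested format."""
    formats = save_format if isinstance(save_format, list) else [save_format]
    base = Path(file_path_without_ext)
    targets = []
    for fmt in formats:
        if not isinstance(fmt, SaveFormat):
            raise ValueError(f"Unsupported save format: {fmt!r}")
        # Append the suffix instead of replacing one, so "report.v2" stays intact.
        targets.append(Path(f"{base}{SUFFIXES[fmt]}"))
    return targets


paths = resolve_targets("out/report", [SaveFormat.JSON, SaveFormat.EXCEL])
print([p.name for p in paths])
```

Accepting either a single format or a list and normalizing to a list up front keeps the rest of the function free of type checks.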

to_dict()

Convert the extraction result to a dictionary.

Returns: Dictionary representation of the extraction result.
Return type: dict

to_dataframe()

Convert the extraction result to a pandas DataFrame.

Returns: DataFrame representation of the extraction result.
Return type: pd.DataFrame

__str__()

Return a human-readable string representation of the ExtractionResult.

Returns: Pretty-printed summary of the extraction result.
Return type: str

Extraction Chunk

class docviz.types.extraction_chunk.ExtractionChunk(result, start_page, end_page)

Bases: object

Represents a chunk of extraction results from streaming processing.

Variables:

result (ExtractionResult) – The extraction results for this chunk.
page_range (str) – The page range this chunk covers (e.g., “1-10”).
start_page (int) – The starting page number (1-indexed).
end_page (int) – The ending page number (1-indexed).

Parameters:

result (ExtractionResult)
start_page (int)
end_page (int)

result: ExtractionResult

start_page: int

end_page: int

property page_range: str

Get the page range as a string.
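
A derived property like page_range can be sketched as follows. Chunk is a hypothetical stand-in that keeps only the two page fields, and the "start-end" format is inferred from the "1-10" example given above rather than taken from the docviz source.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """Hypothetical stand-in holding only the page bounds (1-indexed)."""

    start_page: int
    end_page: int

    @property
    def page_range(self) -> str:
        # Computed on access, so it always reflects the current bounds.
        return f"{self.start_page}-{self.end_page}"

    @property
    def page_count(self) -> int:
        # Both bounds are inclusive, hence the +1.
        return self.end_page - self.start_page + 1


chunk = Chunk(start_page=1, end_page=10)
print(chunk.page_range, chunk.page_count)
```

Because page_range is computed from start_page and end_page, it is not a constructor argument, which matches the three-argument signature shown for ExtractionChunk.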

save(file_path, save_format)

Save the chunk results to a file.

Parameters:

file_path (str | Path) – The path to save the file.
save_format (SaveFormat | list[SaveFormat]) – The format(s) to save in.

__init__(result, start_page, end_page)

Parameters:

result (ExtractionResult)
start_page (int)
end_page (int)

Return type:

None

Detection Result

class docviz.types.detection_result.DetectionResult(label, label_name, bbox, confidence)

Bases: object

Data class representing a single detection result. Kept for backward compatibility.

Variables:

label (int) – The class index of the detected object.
label_name (str) – The class name of the detected object.
bbox (list[float]) – Bounding box coordinates in pixel values [x1, y1, x2, y2].
confidence (float) – Confidence score of the detection.

Parameters:

label (int)
label_name (str)
bbox (list[float])
confidence (float)
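
Since bbox stores two corner points rather than a width and height, deriving the box size takes a small calculation. The helper below is a hypothetical sketch and assumes x2 >= x1 and y2 >= y1, which this page does not state.

```python
def bbox_size(bbox):
    """Return (width, height, area) for a [x1, y1, x2, y2] box in pixels."""
    x1, y1, x2, y2 = bbox
    width = x2 - x1
    height = y2 - y1
    return width, height, width * height


print(bbox_size([10.0, 20.0, 110.0, 70.0]))
```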

label: int

label_name: str

bbox: list[float]

confidence: float

__init__(label, label_name, bbox, confidence)

Parameters:

label (int)
label_name (str)
bbox (list[float])
confidence (float)

Return type:

None

Type Aliases

docviz.types.aliases.numeric: TypeAlias = int | float

A number that can be either an integer or a float.

docviz.types.aliases.RectangleTuple

A rectangle defined by (x1, y1, x2, y2) coordinates.

alias of tuple[int | float, int | float, int | float, int | float]

docviz.types.aliases.RectangleList

A rectangle given as a list of (x1, y1, x2, y2) coordinates.

alias of list[float]

docviz.types.aliases.RectangleUnion: TypeAlias = tuple[int | float, int | float, int | float, int | float] | list[float]

A rectangle defined by (x1, y1, x2, y2) coordinates, given either as a tuple or as a list.

docviz.types.aliases.Color

An RGB color represented as a tuple (R, G, B).

alias of tuple[int, int, int]
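
Code that accepts a rectangle in either form may want to normalize it first. The sketch below redefines the aliases locally for illustration (with a capitalized Numeric name) and adds a hypothetical as_rectangle_tuple helper; neither is taken from docviz.

```python
from typing import Union

# Local re-declarations for illustration only.
Numeric = Union[int, float]
RectangleTuple = tuple[Numeric, Numeric, Numeric, Numeric]
RectangleList = list[float]
RectangleUnion = Union[RectangleTuple, RectangleList]


def as_rectangle_tuple(rect: RectangleUnion) -> RectangleTuple:
    """Hypothetical helper: normalize a tuple or list rectangle to a tuple."""
    if len(rect) != 4:
        raise ValueError(f"expected 4 coordinates, got {len(rect)}")
    x1, y1, x2, y2 = rect
    return (x1, y1, x2, y2)


print(as_rectangle_tuple([0.0, 0.0, 640.0, 480.0]))
print(as_rectangle_tuple((5, 10, 15, 20)))
```

Normalizing to an immutable tuple also makes the rectangle hashable, which helps when it is used as a dictionary key or set member.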