Package Structure¶

This diagram shows the structure of the docviz-python package and its main components.

        graph LR
subgraph "External Dependencies"
    TT[PyMuPDF<br/>PDF Processing]
    UU[OpenCV<br/>Computer Vision]
    VV[Ultralytics<br/>Object Detection]
    WW[OpenAI<br/>LLM Integration]
    XX[Tesseract<br/>OCR Engine]
    YY[Pandas<br/>Data Analysis]
end

subgraph "docviz-python"
    subgraph "Project Configuration"
        A[pyproject.toml]
        B[README.md]
        C[LICENSE]
        D[uv.lock]
    end

    subgraph "Entry Points"
        direction TB
        I[CLI Interface<br/>cli/__main__.py]
        E[Package Init<br/>__init__.py]
    end

    subgraph "Core Infrastructure"
        direction TB
        F[constants.py]
        G[environment.py]
        H[logging.py]
    end

    subgraph "Type System"
        direction TB
        K[types/__init__.py]
        L[aliases.py]
        M[detection_config.py]
        N[detection_result.py]
        O[extraction_chunk.py]
        P[extraction_config.py]
        Q[extraction_result.py]
        R[extraction_type.py]
        S[llm_config.py]
        T[ocr_config.py]
        U[save_format.py]
    end

    subgraph "Processing Pipeline"
        direction TB

        subgraph "Input Layer"
            W[Document Handler<br/>document/class_.py]
            X[Document Utils<br/>document/utils.py]
        end

        subgraph "Detection Layer"
            Y[Detection Core<br/>detection/__init__.py]
            Z[Detection Backends<br/>backends/]
            AA[Deduplication<br/>deduplication.py]
            BB[Detection Frontend<br/>frontend.py]
            CC[Labels<br/>labels.py]
        end

        subgraph "Processing Layer"
            II[Image Preprocessing<br/>preprocessing.py]
            HH[Image Annotation<br/>annotate.py]
            LL[PDF Converter<br/>convert.py]
            MM[PDF Analyzer<br/>analyzer.py]
            NN[Text Extraction<br/>text_extraction.py]
        end

        subgraph "Extraction Layer"
            DD[Extraction Core<br/>extraction/__init__.py]
            EE[Extraction Pipeline<br/>pipeline.py]
            FF[Extraction Utils<br/>utils.py]
            JJ[Image Summarizer<br/>summarizer.py]
        end

        subgraph "Output Layer"
            V[Library Core<br/>lib/__init__.py]
            OO[Common Functions<br/>functions.py]
        end
    end

    subgraph "Resources"
        PP[Documentation<br/>docs/]
        QQ[Examples<br/>examples/]
        RR[Models<br/>models/]
        SS[Tools<br/>tools/]
    end
end

%% Entry point connections
I --> E
E --> F
E --> G
E --> H
E --> K

%% Type system connections
K --> L
K --> M
K --> N
K --> O
K --> P
K --> Q
K --> R
K --> S
K --> T
K --> U

%% Processing flow
W --> Y
X --> Y
Y --> Z
Y --> AA
Y --> BB
Y --> CC

Z --> II
Z --> LL

II --> HH
II --> DD
LL --> MM
MM --> NN
NN --> DD

HH --> EE
DD --> EE
EE --> FF
EE --> JJ

FF --> V
JJ --> V
V --> OO

%% External dependency connections
TT -.-> W
TT -.-> LL
TT -.-> MM
TT -.-> NN

UU -.-> II
UU -.-> HH

VV -.-> Z
VV -.-> Y

WW -.-> JJ
WW -.-> EE

XX -.-> NN

YY -.-> Q
YY -.-> OO

%% Resource connections
PP -.-> QQ
QQ -.-> RR

%% Enhanced color scheme
%%{init: {'theme':'dark'}}%%
classDef config fill:#2D3748,color:#FFFFFF,stroke:#1A202C,stroke-width:2px
classDef entry fill:#3182CE,color:#FFFFFF,stroke:#2C5AA0,stroke-width:3px
classDef core fill:#38A169,color:#FFFFFF,stroke:#2F855A,stroke-width:2px
classDef types fill:#805AD5,color:#FFFFFF,stroke:#6B46C1,stroke-width:2px
classDef input fill:#E53E3E,color:#FFFFFF,stroke:#C53030,stroke-width:2px
classDef detection fill:#0BC5EA,color:#000000,stroke:#00B5D8,stroke-width:2px
classDef processing fill:#ED8936,color:#FFFFFF,stroke:#DD6B20,stroke-width:2px
classDef extraction fill:#38B2AC,color:#FFFFFF,stroke:#319795,stroke-width:2px
classDef output fill:#9F7AEA,color:#FFFFFF,stroke:#805AD5,stroke-width:2px
classDef resources fill:#718096,color:#FFFFFF,stroke:#4A5568,stroke-width:2px
classDef dependencies fill:#FBD38D,color:#000000,stroke:#F6AD55,stroke-width:2px

class A,B,C,D config
class I,E entry
class F,G,H core
class K,L,M,N,O,P,Q,R,S,T,U types
class W,X input
class Y,Z,AA,BB,CC detection
class II,HH,LL,MM,NN processing
class DD,EE,FF,JJ extraction
class V,OO output
class PP,QQ,RR,SS resources
class TT,UU,VV,WW,XX,YY dependencies

Repository Structure¶

Package Overview¶

Core Components¶

Main Package (__init__.py) - Exports main classes and functions - Handles dependency checking - Provides public API
Types Module - ExtractionType: Enum for content types (TEXT, TABLE, FIGURE, EQUATION, OTHER) - SaveFormat: Enum for output formats (JSON, CSV, EXCEL, XML) - ExtractionResult: Main result container - ExtractionEntry: Individual extracted content items - Configuration classes for detection, extraction, LLM, and OCR
Library Module (lib/) - Document Processing: Document class and utilities - Detection: YOLO-based layout detection - Extraction: Main extraction pipeline - Image Processing: Image analysis and chart summarization - PDF Processing: PDF conversion and text extraction
CLI Module - Command-line interface for document processing - Supports single file and batch processing - Rich output formatting

Key Features¶

Document Analysis: PDF to image conversion, layout detection
Content Extraction: Text, tables, figures, equations
AI Integration: Optional LLM-powered content summarization
Multiple Formats: JSON, CSV, Excel, XML output
Batch Processing: Handle multiple documents efficiently
Streaming: Process large documents page by page
Async Support: Both synchronous and asynchronous APIs

External Dependencies¶

PyMuPDF: PDF processing and conversion
OpenCV: Image processing and analysis
Ultralytics: YOLO model inference
OpenAI: LLM integration for content summarization
Tesseract: OCR for text extraction
Pandas: Data manipulation and Excel export