Package Structure
This diagram shows the structure of the docviz-python package and its main components.
graph LR
subgraph "External Dependencies"
TT[PyMuPDF<br/>PDF Processing]
UU[OpenCV<br/>Computer Vision]
VV[Ultralytics<br/>Object Detection]
WW[OpenAI<br/>LLM Integration]
XX[Tesseract<br/>OCR Engine]
YY[Pandas<br/>Data Analysis]
end
subgraph "docviz-python"
subgraph "Project Configuration"
A[pyproject.toml]
B[README.md]
C[LICENSE]
D[uv.lock]
end
subgraph "Entry Points"
direction TB
I[CLI Interface<br/>cli/__main__.py]
E[Package Init<br/>__init__.py]
end
subgraph "Core Infrastructure"
direction TB
F[constants.py]
G[environment.py]
H[logging.py]
end
subgraph "Type System"
direction TB
K[types/__init__.py]
L[aliases.py]
M[detection_config.py]
N[detection_result.py]
O[extraction_chunk.py]
P[extraction_config.py]
Q[extraction_result.py]
R[extraction_type.py]
S[llm_config.py]
T[ocr_config.py]
U[save_format.py]
end
subgraph "Processing Pipeline"
direction TB
subgraph "Input Layer"
W[Document Handler<br/>document/class_.py]
X[Document Utils<br/>document/utils.py]
end
subgraph "Detection Layer"
Y[Detection Core<br/>detection/__init__.py]
Z[Detection Backends<br/>backends/]
AA[Deduplication<br/>deduplication.py]
BB[Detection Frontend<br/>frontend.py]
CC[Labels<br/>labels.py]
end
subgraph "Processing Layer"
II[Image Preprocessing<br/>preprocessing.py]
HH[Image Annotation<br/>annotate.py]
LL[PDF Converter<br/>convert.py]
MM[PDF Analyzer<br/>analyzer.py]
NN[Text Extraction<br/>text_extraction.py]
end
subgraph "Extraction Layer"
DD[Extraction Core<br/>extraction/__init__.py]
EE[Extraction Pipeline<br/>pipeline.py]
FF[Extraction Utils<br/>utils.py]
JJ[Image Summarizer<br/>summarizer.py]
end
subgraph "Output Layer"
V[Library Core<br/>lib/__init__.py]
OO[Common Functions<br/>functions.py]
end
end
subgraph "Resources"
PP[Documentation<br/>docs/]
QQ[Examples<br/>examples/]
RR[Models<br/>models/]
SS[Tools<br/>tools/]
end
end
%% Entry point connections
I --> E
E --> F
E --> G
E --> H
E --> K
%% Type system connections
K --> L
K --> M
K --> N
K --> O
K --> P
K --> Q
K --> R
K --> S
K --> T
K --> U
%% Processing flow
W --> Y
X --> Y
Y --> Z
Y --> AA
Y --> BB
Y --> CC
Z --> II
Z --> LL
II --> HH
II --> DD
LL --> MM
MM --> NN
NN --> DD
HH --> EE
DD --> EE
EE --> FF
EE --> JJ
FF --> V
JJ --> V
V --> OO
%% External dependency connections
TT -.-> W
TT -.-> LL
TT -.-> MM
TT -.-> NN
UU -.-> II
UU -.-> HH
VV -.-> Z
VV -.-> Y
WW -.-> JJ
WW -.-> EE
XX -.-> NN
YY -.-> Q
YY -.-> OO
%% Resource connections
PP -.-> QQ
QQ -.-> RR
%% Enhanced color scheme
%%{init: {'theme':'dark'}}%%
classDef config fill:#2D3748,color:#FFFFFF,stroke:#1A202C,stroke-width:2px
classDef entry fill:#3182CE,color:#FFFFFF,stroke:#2C5AA0,stroke-width:3px
classDef core fill:#38A169,color:#FFFFFF,stroke:#2F855A,stroke-width:2px
classDef types fill:#805AD5,color:#FFFFFF,stroke:#6B46C1,stroke-width:2px
classDef input fill:#E53E3E,color:#FFFFFF,stroke:#C53030,stroke-width:2px
classDef detection fill:#0BC5EA,color:#000000,stroke:#00B5D8,stroke-width:2px
classDef processing fill:#ED8936,color:#FFFFFF,stroke:#DD6B20,stroke-width:2px
classDef extraction fill:#38B2AC,color:#FFFFFF,stroke:#319795,stroke-width:2px
classDef output fill:#9F7AEA,color:#FFFFFF,stroke:#805AD5,stroke-width:2px
classDef resources fill:#718096,color:#FFFFFF,stroke:#4A5568,stroke-width:2px
classDef dependencies fill:#FBD38D,color:#000000,stroke:#F6AD55,stroke-width:2px
class A,B,C,D config
class I,E entry
class F,G,H core
class K,L,M,N,O,P,Q,R,S,T,U types
class W,X input
class Y,Z,AA,BB,CC detection
class II,HH,LL,MM,NN processing
class DD,EE,FF,JJ extraction
class V,OO output
class PP,QQ,RR,SS resources
class TT,UU,VV,WW,XX,YY dependencies
Repository Structure
Package Overview
Core Components
Main Package (__init__.py)
- Exports main classes and functions
- Handles dependency checking
- Provides public API

Types Module
- ExtractionType: Enum for content types (TEXT, TABLE, FIGURE, EQUATION, OTHER)
- SaveFormat: Enum for output formats (JSON, CSV, EXCEL, XML)
- ExtractionResult: Main result container
- ExtractionEntry: Individual extracted content items
- Configuration classes for detection, extraction, LLM, and OCR

Library Module (lib/)
- Document Processing: Document class and utilities
- Detection: YOLO-based layout detection
- Extraction: Main extraction pipeline
- Image Processing: Image analysis and chart summarization
- PDF Processing: PDF conversion and text extraction

CLI Module
- Command-line interface for document processing
- Supports single file and batch processing
- Rich output formatting
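
To make the Types Module description more concrete, here is a minimal sketch of how the two enums and the result container could be modeled. The enum member names come from the list above; the string values, the field names (`kind`, `text`, `page_number`, `entries`), and the `of_type` helper are assumptions made for illustration and may not match the actual docviz-python definitions.

```python
from dataclasses import dataclass, field
from enum import Enum


class ExtractionType(Enum):
    # Member names follow the documentation; the string values are assumed.
    TEXT = "text"
    TABLE = "table"
    FIGURE = "figure"
    EQUATION = "equation"
    OTHER = "other"


class SaveFormat(Enum):
    # Member names follow the documentation; the string values are assumed.
    JSON = "json"
    CSV = "csv"
    EXCEL = "excel"
    XML = "xml"


@dataclass
class ExtractionEntry:
    # Hypothetical fields describing one extracted content item.
    kind: ExtractionType
    text: str
    page_number: int


@dataclass
class ExtractionResult:
    # Hypothetical container holding every entry extracted from a document.
    entries: list[ExtractionEntry] = field(default_factory=list)

    def of_type(self, kind: ExtractionType) -> list[ExtractionEntry]:
        """Return only the entries of the requested content type."""
        return [entry for entry in self.entries if entry.kind is kind]


result = ExtractionResult(
    entries=[
        ExtractionEntry(ExtractionType.TEXT, "Introduction", 1),
        ExtractionEntry(ExtractionType.TABLE, "Revenue by quarter", 2),
    ]
)
tables = result.of_type(ExtractionType.TABLE)
```

Using enums rather than bare strings for content types and output formats lets a type checker catch misspelled values before any document is processed.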
Key Features
Document Analysis: PDF to image conversion, layout detection
Content Extraction: Text, tables, figures, equations
AI Integration: Optional LLM-powered content summarization
Multiple Formats: JSON, CSV, Excel, XML output
Batch Processing: Handle multiple documents efficiently
Streaming: Process large documents page by page
Async Support: Both synchronous and asynchronous APIs
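
The streaming and async features listed above can be pictured with a generic pattern: a generator that yields one page's result at a time, plus a coroutine that runs the same blocking work in a worker thread. This is only a sketch of the general technique. The function names `extract_pages` and `extract_pages_async` are hypothetical, the per-page work is a placeholder, and none of this is the docviz-python API.

```python
import asyncio
from collections.abc import Iterator


def extract_pages(pages: list[str]) -> Iterator[tuple[int, str]]:
    """Yield (page_number, result) one page at a time instead of building a full list."""
    for number, page in enumerate(pages, start=1):
        # Placeholder for real per-page work such as layout detection or OCR.
        yield number, page.upper()


async def extract_pages_async(pages: list[str]) -> list[tuple[int, str]]:
    """Run the blocking generator in a worker thread so the event loop stays free."""
    return await asyncio.to_thread(lambda: list(extract_pages(pages)))


sync_results = list(extract_pages(["first page", "second page"]))
async_results = asyncio.run(extract_pages_async(["first page", "second page"]))
```

Note that this async wrapper collects every page before returning, so it trades the memory benefit of streaming for simplicity. `asyncio.to_thread` requires Python 3.9 or later.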
External Dependencies
PyMuPDF: PDF processing and conversion
OpenCV: Image processing and analysis
Ultralytics: YOLO model inference
OpenAI: LLM integration for content summarization
Tesseract: OCR for text extraction
Pandas: Data manipulation and Excel export
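
Since the main package is described as handling dependency checking, here is one way such a check could look, using only the standard library. The helper `missing_modules` is hypothetical, and the import names are assumptions: `fitz` and `cv2` are the usual import names for PyMuPDF and OpenCV, while `pytesseract` is assumed as the Python wrapper for Tesseract. The package's actual check may work differently.

```python
from importlib.util import find_spec


def missing_modules(module_names: list[str]) -> list[str]:
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in module_names if find_spec(name) is None]


# Assumed import names for the dependencies listed above.
ASSUMED_IMPORT_NAMES = ["fitz", "cv2", "ultralytics", "openai", "pytesseract", "pandas"]

missing = missing_modules(ASSUMED_IMPORT_NAMES)
if missing:
    print("Modules not found:", ", ".join(missing))
```

`find_spec` locates a module without importing it, so the check stays fast even for heavy packages. It does not verify that the Tesseract binary itself is installed; that would need a separate check such as `shutil.which("tesseract")`.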