Output Formats¶
This guide covers the output formats supported by docviz-python and when to use each of them.
Supported Formats¶
docviz-python supports the following output formats:
JSON: Structured data format with full metadata
CSV: Tabular format for spreadsheet applications
Excel: Microsoft Excel workbook (.xlsx) with the extraction data on a single sheet
XML: Extensible markup language format
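Each format suits a different kind of downstream processing: JSON preserves the full structure, CSV and Excel are convenient for spreadsheet analysis, and XML is suited to integration with systems that consume markup.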
JSON Format¶
JSON is the most comprehensive format, preserving all metadata and structure information.
Basic Usage¶
import docviz
document = docviz.Document("document.pdf")
extractions = document.extract_content_sync()
# Save as JSON
extractions.save("results", save_format=docviz.SaveFormat.JSON)
JSON Structure¶
The JSON output contains the following structure:
{
  "entries": [
    {
      "text": "This is extracted text content...",
      "class": "text",
      "confidence": 0.95,
      "bbox": [100, 200, 500, 250],
      "page_number": 1
    },
    {
      "text": "| Column 1 | Column 2 | Column 3 |\n|----------|----------|----------|",
      "class": "table",
      "confidence": 0.88,
      "bbox": [50, 300, 550, 450],
      "page_number": 1
    }
  ]
}
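As a minimal sketch of consuming this structure outside docviz, the saved file can be read with Python's standard `json` module. The example writes a small sample file that mirrors the structure above so that it runs on its own. The file name `results.json` and the helper `count_by_class` are illustrative assumptions, not part of the docviz API.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path

# Sample data mirroring the documented structure (illustrative values only).
sample = {
    "entries": [
        {"text": "This is extracted text content...", "class": "text",
         "confidence": 0.95, "bbox": [100, 200, 500, 250], "page_number": 1},
        {"text": "| Column 1 | Column 2 | Column 3 |", "class": "table",
         "confidence": 0.88, "bbox": [50, 300, 550, 450], "page_number": 1},
    ]
}


def count_by_class(json_path):
    """Return a Counter of entry classes found in a saved JSON file."""
    with open(json_path, encoding="utf-8") as f:
        data = json.load(f)
    return Counter(entry["class"] for entry in data["entries"])


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical output file; a real run would point at the file docviz wrote.
    path = Path(tmp) / "results.json"
    path.write_text(json.dumps(sample), encoding="utf-8")
    counts = count_by_class(path)

print(dict(counts))
```

Reading the file with plain `json` keeps downstream scripts independent of docviz itself, which can be convenient when the extraction and the analysis run in different environments.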
JSON Processing¶
# Access JSON data programmatically
result_dict = extractions.to_dict()
# Filter entries by type
text_entries = [
    entry for entry in result_dict["entries"]
    if entry["class"] == "text"
]
# Print structure information
print(f"Total entries: {len(result_dict['entries'])}")
for entry in result_dict["entries"]:
    print(f"Page {entry['page_number']}: {entry['class']} (confidence: {entry['confidence']:.2f})")
Filtering JSON Content¶
# Save only specific content types
filtered_entries = [
    entry for entry in extractions.entries
    if entry.class_ in ["table", "figure"]
]
filtered_extractions = docviz.ExtractionResult(
    entries=filtered_entries,
    page_number=extractions.page_number
)
filtered_extractions.save(
    "tables_and_figures",
    save_format=docviz.SaveFormat.JSON
)
CSV Format¶
CSV format provides a tabular view of extracted content, suitable for analysis in spreadsheet applications.
Basic Usage¶
import docviz
document = docviz.Document("document.pdf")
extractions = document.extract_content_sync()
# Save as CSV
extractions.save("results", save_format=docviz.SaveFormat.CSV)
CSV Structure¶
The CSV output contains the following columns:
| Column | Description | Example |
|---|---|---|
| page_number | Page where content was found | 1 |
| class | Type of content | text, table, figure, equation |
| text | Extracted text content | This is the extracted text… |
| confidence | Detection confidence score | 0.95 |
| bbox | Bounding box coordinates [x1, y1, x2, y2] | [100, 200, 500, 250] |
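Because every CSV cell is read back as text, numeric columns and the bounding box need converting before use. The sketch below uses only the standard library and an in-memory sample. It assumes the header names match the table above and that `bbox` is serialized as a bracketed list, as in the example column; the helper `parse_rows` is hypothetical.

```python
import csv
import io
import json

# Illustrative CSV content using the columns documented above.
sample_csv = (
    "page_number,class,text,confidence,bbox\n"
    '1,text,This is the extracted text,0.95,"[100, 200, 500, 250]"\n'
    '1,table,| Column 1 | Column 2 |,0.88,"[50, 300, 550, 450]"\n'
)


def parse_rows(csv_text):
    """Convert CSV rows into dicts with typed values."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        rows.append({
            "page_number": int(row["page_number"]),
            "class": row["class"],
            "text": row["text"],
            "confidence": float(row["confidence"]),
            # Assumes bbox is stored as a bracketed list "[x1, y1, x2, y2]".
            "bbox": json.loads(row["bbox"]),
        })
    return rows


rows = parse_rows(sample_csv)
x1, y1, x2, y2 = rows[0]["bbox"]
print(f"First bbox width: {x2 - x1}, height: {y2 - y1}")
```

If the real file uses a different header for the content type or a different bounding-box encoding, only the corresponding lines in `parse_rows` need to change.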
Working with CSV Data¶
# After saving as CSV, you can read it with pandas
import pandas as pd
# Read the generated CSV
df = pd.read_csv("results.csv")
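# Analyze the data
# The content-type column is documented above as "class"; fall back to "class_" if that is what the file contains
type_col = "class" if "class" in df.columns else "class_"
print(f"Total entries: {len(df)}")
print(f"Content types: {df[type_col].unique()}")
print(f"Average confidence: {df['confidence'].mean():.2f}")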
Filtering CSV Content¶
# Save only high-confidence extractions
high_confidence = [
    entry for entry in extractions.entries
    if entry.confidence > 0.8
]
# Create filtered result
filtered_result = docviz.ExtractionResult(
    entries=high_confidence,
    page_number=extractions.page_number
)
filtered_result.save("high_confidence", save_format=docviz.SaveFormat.CSV)
Excel Format¶
Excel format stores the same tabular data as CSV in a workbook (.xlsx), which can then be formatted and extended in Excel.
Basic Usage¶
import docviz
document = docviz.Document("document.pdf")
extractions = document.extract_content_sync()
# Save as Excel
extractions.save("results", save_format=docviz.SaveFormat.EXCEL)
Excel Structure¶
The Excel output creates a single sheet with all extraction data in tabular format, similar to CSV but with Excel formatting capabilities.
Working with Excel Data¶
# After saving as Excel, you can read it with pandas
import pandas as pd
# Read the generated Excel file
df = pd.read_excel("results.xlsx")
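# Analyze the data
# The content-type column may be named "class" or "class_"; check df.columns
type_col = "class" if "class" in df.columns else "class_"
print(f"Total entries: {len(df)}")
print(f"Content types: {df[type_col].unique()}")
# Create pivot table
pivot = df.pivot_table(
    values='confidence',
    index=type_col,
    aggfunc=['count', 'mean']
)
print(pivot)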
XML Format¶
XML format provides structured markup suitable for integration with other systems.
Basic Usage¶
import docviz
document = docviz.Document("document.pdf")
extractions = document.extract_content_sync()
# Save as XML
extractions.save("results", save_format=docviz.SaveFormat.XML)
XML Structure¶
<?xml version="1.0" encoding="UTF-8"?>
<ExtractionResults>
  <ExtractionEntry>
    <text>This is extracted text content...</text>
    <class_>text</class_>
    <confidence>0.95</confidence>
    <bbox>[100, 200, 500, 250]</bbox>
    <page_number>1</page_number>
  </ExtractionEntry>
  <ExtractionEntry>
    <text>| Column 1 | Column 2 | Column 3 |</text>
    <class_>table</class_>
    <confidence>0.88</confidence>
    <bbox>[50, 300, 550, 450]</bbox>
    <page_number>1</page_number>
  </ExtractionEntry>
</ExtractionResults>
Working with XML Data¶
# Parse the generated XML file
import xml.etree.ElementTree as ET
tree = ET.parse("results.xml")
root = tree.getroot()
# Extract data from XML
for entry in root.findall('ExtractionEntry'):
    text = entry.find('text').text
    class_ = entry.find('class_').text
    confidence = float(entry.find('confidence').text)
    page_number = int(entry.find('page_number').text)
    print(f"Page {page_number}: {class_} (confidence: {confidence:.2f})")
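A slightly more defensive variant is sketched below. It converts each `ExtractionEntry` element into a dictionary and uses `findtext`, which returns a default for a missing element and an empty string for an empty one, whereas `.find(...).text` fails on a missing element and yields `None` for an empty one. The sample document and the helper `entries_from_xml` are illustrative; the element names follow the structure shown above, and the bracketed `bbox` encoding is assumed from that sample.

```python
import json
import xml.etree.ElementTree as ET

# Illustrative document following the element names shown above.
sample_xml = """<?xml version="1.0" encoding="UTF-8"?>
<ExtractionResults>
  <ExtractionEntry>
    <text>This is extracted text content...</text>
    <class_>text</class_>
    <confidence>0.95</confidence>
    <bbox>[100, 200, 500, 250]</bbox>
    <page_number>1</page_number>
  </ExtractionEntry>
  <ExtractionEntry>
    <text></text>
    <class_>figure</class_>
    <confidence>0.80</confidence>
    <bbox>[50, 300, 550, 450]</bbox>
    <page_number>2</page_number>
  </ExtractionEntry>
</ExtractionResults>
"""


def entries_from_xml(xml_bytes):
    """Convert ExtractionEntry elements into a list of typed dicts."""
    root = ET.fromstring(xml_bytes)
    entries = []
    for node in root.findall("ExtractionEntry"):
        entries.append({
            # findtext gives "" for an empty element and the default if missing.
            "text": node.findtext("text", default=""),
            "class_": node.findtext("class_", default=""),
            "confidence": float(node.findtext("confidence", default="0")),
            # Assumes bbox is serialized as a bracketed list of numbers.
            "bbox": json.loads(node.findtext("bbox", default="[]")),
            "page_number": int(node.findtext("page_number", default="0")),
        })
    return entries


entries = entries_from_xml(sample_xml.encode("utf-8"))
for item in entries:
    print(f"Page {item['page_number']}: {item['class_']} {item['bbox']}")
```

Figures and other entries without text are the typical case where the empty-element handling matters.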
Multiple Format Output¶
Save extraction results in multiple formats simultaneously:
Basic Multi-Format¶
import docviz
document = docviz.Document("document.pdf")
extractions = document.extract_content_sync()
# Save in multiple formats
extractions.save("results", save_format=[
    docviz.SaveFormat.JSON,
    docviz.SaveFormat.CSV,
    docviz.SaveFormat.EXCEL
])
This creates:
- results.json
- results.csv
- results.xlsx
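In a batch script it can be useful to confirm that every expected file was actually written. The following is a small sketch using `pathlib`; the helper `missing_outputs` is hypothetical and simply checks for the file names listed above, with a temporary directory standing in for the real output location.

```python
import tempfile
from pathlib import Path


def missing_outputs(base_name, extensions, directory="."):
    """Return the names of expected output files that do not exist."""
    directory = Path(directory)
    expected = [directory / f"{base_name}{ext}" for ext in extensions]
    return [path.name for path in expected if not path.exists()]


with tempfile.TemporaryDirectory() as tmp:
    # Simulate a run in which only two of the three files were written.
    (Path(tmp) / "results.json").write_text("{}", encoding="utf-8")
    (Path(tmp) / "results.csv").write_text("", encoding="utf-8")
    missing = missing_outputs("results", [".json", ".csv", ".xlsx"], tmp)

print(missing)
```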
Format-Specific Output¶
# Different content for different formats
base_name = "document_extractions"
# Full data as JSON
extractions.save(f"{base_name}_full", save_format=docviz.SaveFormat.JSON)
# Tables only as Excel
tables_only = docviz.ExtractionResult(
    entries=[e for e in extractions.entries if e.class_ == "table"],
    page_number=extractions.page_number
)
tables_only.save(f"{base_name}_tables", save_format=docviz.SaveFormat.EXCEL)
# Summary as CSV
summary_data = [
    entry for entry in extractions.entries
    if entry.confidence > 0.7
]
summary_result = docviz.ExtractionResult(
    entries=summary_data,
    page_number=extractions.page_number
)
summary_result.save(f"{base_name}_summary", save_format=docviz.SaveFormat.CSV)
Custom Output Processing¶
Post-Process Extraction Data¶
import json
from pathlib import Path

import pandas as pd

import docviz


def create_analysis_report(extractions, output_dir):
    """Create a comprehensive analysis report."""
    output_path = Path(output_dir)
    output_path.mkdir(parents=True, exist_ok=True)
    # Averages and pivot tables are undefined for an empty result
    if not extractions.entries:
        raise ValueError("No entries to analyze")
    # Extract statistics
    stats = {
        "total_entries": len(extractions.entries),
        "pages_processed": extractions.page_number,
        "content_types": {},
        "average_confidence": 0,
        "high_confidence_count": 0
    }
    # Calculate statistics
    for entry in extractions.entries:
        content_type = entry.class_
        stats["content_types"][content_type] = stats["content_types"].get(content_type, 0) + 1
        stats["average_confidence"] += entry.confidence
        if entry.confidence > 0.8:
            stats["high_confidence_count"] += 1
    stats["average_confidence"] /= len(extractions.entries)
    # Save statistics
    with open(output_path / "statistics.json", "w") as f:
        json.dump(stats, f, indent=2)
    # Create detailed DataFrame
    detailed_data = []
    for entry in extractions.entries:
        detailed_data.append({
            "page": entry.page_number,
            "type": entry.class_,
            "confidence": entry.confidence,
            "text_length": len(entry.text),
            "bbox_area": (entry.bbox[2] - entry.bbox[0]) * (entry.bbox[3] - entry.bbox[1]),
            # Truncate long text to a 50-character preview
            "preview": (entry.text[:50] + "...") if len(entry.text) > 50 else entry.text
        })
    df = pd.DataFrame(detailed_data)
    # Save detailed analysis (writing .xlsx with pandas requires openpyxl)
    df.to_excel(output_path / "detailed_analysis.xlsx", index=False)
    df.to_csv(output_path / "detailed_analysis.csv", index=False)
    # Create pivot tables
    pivot_by_type = df.pivot_table(
        values=['confidence', 'text_length'],
        index='type',
        aggfunc=['mean', 'count']
    )
    pivot_by_type.to_excel(output_path / "analysis_by_type.xlsx")
    pivot_by_page = df.pivot_table(
        values=['confidence', 'text_length'],
        index='page',
        columns='type',
        aggfunc=['mean', 'count'],
        fill_value=0
    )
    pivot_by_page.to_excel(output_path / "analysis_by_page.xlsx")


# Usage
document = docviz.Document("document.pdf")
extractions = document.extract_content_sync()
create_analysis_report(extractions, "analysis_output")
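When pandas is not available, the statistics step can be approximated with the standard library alone. The sketch below works on plain dictionaries shaped like the JSON entries (with a `class` key) rather than on docviz objects. The helper `summarize`, its `threshold` parameter, and the sample values are illustrative assumptions.

```python
from collections import Counter
from statistics import mean

# Illustrative entries shaped like the JSON output (values are made up).
entries = [
    {"class": "text", "confidence": 0.95, "page_number": 1},
    {"class": "table", "confidence": 0.88, "page_number": 1},
    {"class": "text", "confidence": 0.60, "page_number": 2},
]


def summarize(entries, threshold=0.8):
    """Compute entry counts and confidence statistics for a list of dicts."""
    if not entries:
        # Avoid calling mean() on an empty list.
        return {
            "total_entries": 0,
            "content_types": {},
            "average_confidence": 0.0,
            "high_confidence_count": 0,
        }
    confidences = [entry["confidence"] for entry in entries]
    return {
        "total_entries": len(entries),
        "content_types": dict(Counter(entry["class"] for entry in entries)),
        "average_confidence": mean(confidences),
        "high_confidence_count": sum(c > threshold for c in confidences),
    }


summary = summarize(entries)
print(summary)
```

Returning zeroed statistics for an empty input is one possible design choice; raising an error instead may be preferable when an empty extraction indicates a problem upstream.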
Format Conversion Utilities¶
import json
from pathlib import Path

import docviz


def convert_formats(input_json_path, output_formats):
    """Convert existing JSON results to other formats."""
    # Load JSON data
    with open(input_json_path, 'r', encoding='utf-8') as f:
        data = json.load(f)
    # Reconstruct ExtractionResult
    entries = []
    for entry_data in data['entries']:
        entry = docviz.ExtractionEntry(
            text=entry_data['text'],
            class_=entry_data['class'],
            confidence=entry_data['confidence'],
            bbox=entry_data['bbox'],
            page_number=entry_data['page_number']
        )
        entries.append(entry)
    # The JSON structure shown earlier has no top-level page count, so fall
    # back to the highest page number among the entries (an assumption about
    # what ExtractionResult.page_number represents)
    last_page = max((e['page_number'] for e in data['entries']), default=0)
    extractions = docviz.ExtractionResult(
        entries=entries,
        page_number=data.get('page_number', last_page)
    )
    # Save in requested formats
    base_name = Path(input_json_path).stem
    extractions.save(base_name, save_format=output_formats)


# Usage
convert_formats(
    "existing_results.json",
    [docviz.SaveFormat.CSV, docviz.SaveFormat.EXCEL]
)
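For environments where docviz is not installed, a JSON result can also be turned into CSV with the standard library alone. This is a stand-in technique, not docviz's own conversion, and the column order is taken from the CSV table earlier in this guide. The helper `json_entries_to_csv` and the sample data are hypothetical, and the CSV that docviz itself writes may differ in detail.

```python
import csv
import io
import json

# Column order follows the CSV structure documented earlier in this guide.
FIELDS = ["page_number", "class", "text", "confidence", "bbox"]


def json_entries_to_csv(json_text):
    """Render the "entries" list of a JSON result as CSV text."""
    data = json.loads(json_text)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    for entry in data["entries"]:
        row = dict(entry)
        # Serialize the bounding box as a bracketed list in a single cell.
        row["bbox"] = json.dumps(entry["bbox"])
        writer.writerow(row)
    return buffer.getvalue()


# Illustrative input shaped like the documented JSON output.
sample_json = json.dumps({
    "entries": [
        {"text": "This is extracted text content...", "class": "text",
         "confidence": 0.95, "bbox": [100, 200, 500, 250], "page_number": 1},
        {"text": "| Column 1 | Column 2 |", "class": "table",
         "confidence": 0.88, "bbox": [50, 300, 550, 450], "page_number": 1},
    ]
})

csv_text = json_entries_to_csv(sample_json)
print(csv_text)
```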
Next Steps¶
Basic Usage - Basic usage guide
Advanced Usage - Advanced features
Configuration - Configuration options
API Reference - Complete API reference