Contributing to docviz-python¶

Thank you for your interest in contributing to docviz-python! This document provides guidelines and information for contributors.

Getting Started¶

Fork the repository on GitHub
Clone your fork locally
Create a virtual environment using uv
Install dependencies in development mode

# Fork and clone
git clone https://github.com/YOUR_USERNAME/docviz.git
cd docviz

# Create virtual environment and install
uv venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
uv pip install -e ".[dev]"

Development Setup¶

Install development dependencies:

# Install all development dependencies
uv pip install -e ".[dev,showcase,cli]"

# Install pre-commit hooks
pre-commit install

Code Style¶

docviz-python follows strict coding standards:

Type hints: All functions must have type hints
Docstrings: Use Google-style docstrings
Formatting: Use ruff for formatting and linting
Line length: Maximum 100 characters
Naming: snake_case for functions, PascalCase for classes

Example of proper code style:

def extract_content(
    document_path: str,
    config: ExtractionConfig | None = None,
) -> ExtractionResult:
    """
    Extract content from a document.

    Args:
        document_path: Path to the document file
        config: Optional extraction configuration

    Returns:
        ExtractionResult containing extracted content

    Raises:
        FileNotFoundError: If document file doesn't exist
        ValueError: If document format is not supported
    """
    if not Path(document_path).exists():
        raise FileNotFoundError(f"Document not found: {document_path}")

    # Implementation here
    return result

Running Tests¶

Run the test suite:

# Run all tests
uv run pytest

# Run with coverage
uv run pytest --cov=docviz

# Run specific test file
uv run pytest tests/test_document.py

# Run with verbose output
uv run pytest -v

Code Quality Checks¶

Run quality checks:

# Run ruff linter
uv run ruff check .

# Run ruff formatter
uv run ruff format .

# Run type checking
uv run pyright

# Run all quality checks
uv run python tools/quality-check.py src/

Making Changes¶

Create a feature branch: .. code-block:: bash

git checkout -b feature/your-feature-name
Make your changes following the coding standards
Add tests for new functionality
Update documentation if needed
Run quality checks: .. code-block:: bash

uv run python tools/quality-check.py src/
Commit your changes with a descriptive message: .. code-block:: bash

git commit -m “feat: add new extraction feature”
Push to your fork: .. code-block:: bash

git push origin feature/your-feature-name
Create a pull request on GitHub

Commit Message Format¶

Use conventional commit messages:

feat: - New features
fix: - Bug fixes
docs: - Documentation changes
style: - Code style changes
refactor: - Code refactoring
test: - Test additions or changes
chore: - Maintenance tasks

Examples: .. code-block:: bash

feat: add support for Excel output format fix: resolve memory leak in batch processing docs: update installation instructions test: add tests for URL document loading

Pull Request Guidelines¶

Title: Clear and descriptive
Description: Explain what the PR does and why
Tests: Include tests for new functionality
Documentation: Update docs if needed
Quality: All quality checks must pass

Example PR description:

## Description

Adds support for extracting equations from PDF documents using OCR.

## Changes

- Add equation detection using Tesseract OCR
- Add ExtractionType.EQUATION enum value
- Update Document class to handle equation extraction
- Add tests for equation extraction

## Testing

- [x] Added unit tests for equation extraction
- [x] Added integration tests with sample PDFs
- [x] All existing tests pass

## Documentation

- [x] Updated API documentation
- [x] Added examples for equation extraction

Issue Reporting¶

When reporting issues, please include:

Environment: Python version, OS, docviz version
Reproduction: Steps to reproduce the issue
Expected behavior: What should happen
Actual behavior: What actually happens
Error messages: Full error traceback
Sample files: If applicable, provide sample documents

Example issue:

## Environment

- Python: 3.11.0
- OS: Ubuntu 22.04
- docviz-python: 0.7.0

## Issue

When extracting tables from PDFs with complex layouts, some table cells are missing.

## Steps to Reproduce

1. Load a PDF with complex table layout
2. Extract content with `includes=[ExtractionType.TABLE]`
3. Check the extracted table data

## Expected Behavior

All table cells should be extracted correctly.

## Actual Behavior

Some cells in the middle of tables are empty or missing.

## Error Messages

No errors, but incomplete data extraction.

Documentation¶

When contributing documentation:

Use RST format for Sphinx documentation
Include code examples that work
Update API docs for new features
Add type hints in docstrings
Test documentation builds correctly

Building Documentation¶

# Install documentation dependencies
uv pip install sphinx sphinx-rtd-theme sphinx-copybutton

# Build documentation
cd docs
make html

# View documentation
open _build/html/index.html

Release Process¶

For maintainers, the release process involves:

Update version in pyproject.toml
Update changelog with new features/fixes
Create release tag on GitHub
Build and publish to PyPI

# Update version
# Edit pyproject.toml

# Build package
uv build

# Publish to PyPI
uv publish

Getting Help¶

GitHub Issues: For bug reports and feature requests
GitHub Discussions: For questions and general discussion
Documentation: Check the docs for usage examples

Thank you for contributing to docviz-python!