Contributing to docviz-python¶
Thank you for your interest in contributing to docviz-python! This document provides guidelines and information for contributors.
Getting Started¶
Fork the repository on GitHub
Clone your fork locally
Create a virtual environment using uv
Install dependencies in development mode
# Fork and clone
git clone https://github.com/YOUR_USERNAME/docviz.git
cd docviz
# Create virtual environment and install
uv venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
uv pip install -e ".[dev]"
Development Setup¶
Install development dependencies:
# Install all development dependencies
uv pip install -e ".[dev,showcase,cli]"
# Install pre-commit hooks
pre-commit install
Code Style¶
docviz-python follows strict coding standards:
Type hints: All functions must have type hints
Docstrings: Use Google-style docstrings
Formatting: Use ruff for formatting and linting
Line length: Maximum 100 characters
Naming: snake_case for functions, PascalCase for classes
Example of proper code style:
def extract_content(
document_path: str,
config: ExtractionConfig | None = None,
) -> ExtractionResult:
"""
Extract content from a document.
Args:
document_path: Path to the document file
config: Optional extraction configuration
Returns:
ExtractionResult containing extracted content
Raises:
FileNotFoundError: If document file doesn't exist
ValueError: If document format is not supported
"""
if not Path(document_path).exists():
raise FileNotFoundError(f"Document not found: {document_path}")
# Implementation here
return result
Running Tests¶
Run the test suite:
# Run all tests
uv run pytest
# Run with coverage
uv run pytest --cov=docviz
# Run specific test file
uv run pytest tests/test_document.py
# Run with verbose output
uv run pytest -v
Code Quality Checks¶
Run quality checks:
# Run ruff linter
uv run ruff check .
# Run ruff formatter
uv run ruff format .
# Run type checking
uv run pyright
# Run all quality checks
uv run python tools/quality-check.py src/
Making Changes¶
Create a feature branch: .. code-block:: bash
git checkout -b feature/your-feature-name
Make your changes following the coding standards
Add tests for new functionality
Update documentation if needed
Run quality checks: .. code-block:: bash
uv run python tools/quality-check.py src/
Commit your changes with a descriptive message: .. code-block:: bash
git commit -m “feat: add new extraction feature”
Push to your fork: .. code-block:: bash
git push origin feature/your-feature-name
Create a pull request on GitHub
Commit Message Format¶
Use conventional commit messages:
feat: - New features
fix: - Bug fixes
docs: - Documentation changes
style: - Code style changes
refactor: - Code refactoring
test: - Test additions or changes
chore: - Maintenance tasks
Examples: .. code-block:: bash
feat: add support for Excel output format fix: resolve memory leak in batch processing docs: update installation instructions test: add tests for URL document loading
Pull Request Guidelines¶
Title: Clear and descriptive
Description: Explain what the PR does and why
Tests: Include tests for new functionality
Documentation: Update docs if needed
Quality: All quality checks must pass
Example PR description:
## Description
Adds support for extracting equations from PDF documents using OCR.
## Changes
- Add equation detection using Tesseract OCR
- Add ExtractionType.EQUATION enum value
- Update Document class to handle equation extraction
- Add tests for equation extraction
## Testing
- [x] Added unit tests for equation extraction
- [x] Added integration tests with sample PDFs
- [x] All existing tests pass
## Documentation
- [x] Updated API documentation
- [x] Added examples for equation extraction
Issue Reporting¶
When reporting issues, please include:
Environment: Python version, OS, docviz version
Reproduction: Steps to reproduce the issue
Expected behavior: What should happen
Actual behavior: What actually happens
Error messages: Full error traceback
Sample files: If applicable, provide sample documents
Example issue:
## Environment
- Python: 3.11.0
- OS: Ubuntu 22.04
- docviz-python: 0.7.0
## Issue
When extracting tables from PDFs with complex layouts, some table cells are missing.
## Steps to Reproduce
1. Load a PDF with complex table layout
2. Extract content with `includes=[ExtractionType.TABLE]`
3. Check the extracted table data
## Expected Behavior
All table cells should be extracted correctly.
## Actual Behavior
Some cells in the middle of tables are empty or missing.
## Error Messages
No errors, but incomplete data extraction.
Documentation¶
When contributing documentation:
Use RST format for Sphinx documentation
Include code examples that work
Update API docs for new features
Add type hints in docstrings
Test documentation builds correctly
Building Documentation¶
# Install documentation dependencies
uv pip install sphinx sphinx-rtd-theme sphinx-copybutton
# Build documentation
cd docs
make html
# View documentation
open _build/html/index.html
Release Process¶
For maintainers, the release process involves:
Update version in pyproject.toml
Update changelog with new features/fixes
Create release tag on GitHub
Build and publish to PyPI
# Update version
# Edit pyproject.toml
# Build package
uv build
# Publish to PyPI
uv publish
Getting Help¶
GitHub Issues: For bug reports and feature requests
GitHub Discussions: For questions and general discussion
Documentation: Check the docs for usage examples
Thank you for contributing to docviz-python!