
pagesmith

Splits HTML into pages while keeping HTML tags intact and respecting the original document structure and text integrity.

Uses the fast lxml parser.

How It Works

The HtmlPageSplitter class intelligently splits HTML content into appropriately sized pages while ensuring all HTML tags remain properly closed and valid. This preserves both the document structure and styling.
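The idea of keeping tags closed across page boundaries can be illustrated with a minimal sketch. This is not pagesmith's implementation (which, per the description above, relies on lxml); it uses the standard library's html.parser as a stand-in, and the names split_html, _SketchSplitter and target_length are hypothetical. At each page break it closes every open tag and reopens the same tags at the start of the next page.

```python
import html
from html.parser import HTMLParser

# Elements that never have a closing tag (partial list, enough for a sketch).
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class _SketchSplitter(HTMLParser):
    """Hypothetical helper: collects pages whose tags are always balanced."""

    def __init__(self, target_length):
        super().__init__()  # convert_charrefs=True: text arrives unescaped
        self.target_length = target_length
        self.pages = []
        self._open = []      # stack of (tag name, raw start tag text)
        self._buf = []       # markup collected for the current page
        self._text_len = 0   # visible text length on the current page

    def _flush(self):
        # Close every open tag, innermost first, to finish the page ...
        closing = "".join(f"</{name}>" for name, _ in reversed(self._open))
        self.pages.append("".join(self._buf) + closing)
        # ... then reopen the same tags at the top of the next page.
        self._buf = [raw for _, raw in self._open]
        self._text_len = 0

    def handle_starttag(self, tag, attrs):
        raw = self.get_starttag_text()
        self._buf.append(raw)
        if tag not in VOID_TAGS:
            self._open.append((tag, raw))

    def handle_startendtag(self, tag, attrs):
        self._buf.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self._buf.append(f"</{tag}>")
        if self._open and self._open[-1][0] == tag:
            self._open.pop()

    def handle_data(self, data):
        # Break before a text chunk that would push the page over the target.
        if self._text_len and self._text_len + len(data) > self.target_length:
            self._flush()
        self._buf.append(html.escape(data, quote=False))
        self._text_len += len(data)


def split_html(markup, target_length):
    splitter = _SketchSplitter(target_length)
    splitter.feed(markup)
    splitter.close()
    if splitter._text_len:
        splitter._flush()
    return splitter.pages


sample = '<div class="a"><p>Hello <b>bold</b> world</p><p>Second paragraph</p></div>'
for page in split_html(sample, target_length=10):
    print(page)
```

The sketch only breaks between text chunks, drops comments, and can leave an empty element at a page boundary; a production splitter would need to handle those cases.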

You can use refine_html for refining HTML.

It also contains a class for splitting plain text into pages and extracting a table of contents from it.

How It Works

The ChapterDetector class analyzes text to find standard chapter heading formats. It automatically identifies the position of each chapter and extracts the title.
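As a rough illustration of this kind of detection, the sketch below finds chapter headings with a regular expression and records each title with its character offset. It is an assumption-laden stand-in, not ChapterDetector's actual logic: the pattern, the Chapter tuple and the find_chapters function are hypothetical, and only headings of the form "Chapter N" (Arabic or Roman numerals) at the start of a line are recognized.

```python
import re
from typing import NamedTuple


class Chapter(NamedTuple):
    title: str
    position: int  # character offset of the heading within the text


# Hypothetical pattern: "Chapter 1", "CHAPTER IV", "Chapter 12: Title", ...
CHAPTER_RE = re.compile(
    r"^[ \t]*(chapter[ \t]+(?:\d+|[ivxlcdm]+)\b[^\n]*)$",
    re.IGNORECASE | re.MULTILINE,
)


def find_chapters(text):
    """Return every detected heading with the offset where it starts."""
    return [
        Chapter(match.group(1).strip(), match.start(1))
        for match in CHAPTER_RE.finditer(text)
    ]


book = "Intro\nChapter 1: Start\nbody\nCHAPTER II\nmore"
for chapter in find_chapters(book):
    print(chapter.position, chapter.title)
```

The recorded offsets can then serve as anchors for a table of contents or as preferred page-break positions.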

Installation

pip install pagesmith

Usage