pagesmith.HtmlPageSplitter

Split HTML into pages, preserving HTML tags while respecting the original document structure.

Attributes

pagesmith.HtmlPageSplitter.max_error `instance-attribute`

max_error = max_size - target_length

pagesmith.HtmlPageSplitter.max_size `instance-attribute`

max_size = int(target_length * (1 + error_tolerance))
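
As a worked example of the two size attributes above, the following sketch computes them for a given target. It assumes `error_tolerance` is a fraction (for example 0.25) and that the intended grouping is `target_length * (1 + error_tolerance)`; the helper name `size_bounds` is hypothetical and not part of pagesmith.

```python
def size_bounds(target_length: int, error_tolerance: float) -> tuple[int, int]:
    """Return (max_size, max_error) for a target page length.

    Assumption: error_tolerance is a fraction of target_length.
    """
    max_size = int(target_length * (1 + error_tolerance))
    max_error = max_size - target_length
    return max_size, max_error


print(size_bounds(1000, 0.25))  # (1250, 250)
```

With a target of 1000 characters and a tolerance of 0.25, a page may grow to 1250 characters, i.e. 250 characters over the target.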

pagesmith.HtmlPageSplitter.root `instance-attribute`

root = root

pagesmith.HtmlPageSplitter.target_page_size `instance-attribute`

target_page_size = target_length

Functions

pagesmith.HtmlPageSplitter.pages

pages() -> Iterator[str]

Split content into pages.

pagesmith.PageSplitter

Split pure text into pages at natural break points such as paragraphs or sentences.

Attributes

pagesmith.PageSplitter.end `instance-attribute`

end = len(text) if end == 0 else end

pagesmith.PageSplitter.error_tolerance `instance-attribute`

error_tolerance = error_tolerance

pagesmith.PageSplitter.max_length `instance-attribute`

max_length = int(target_length * (1 + error_tolerance))

pagesmith.PageSplitter.min_length `instance-attribute`

min_length = int(target_length * (1 - error_tolerance))

pagesmith.PageSplitter.start `instance-attribute`

start = start

pagesmith.PageSplitter.target_length `instance-attribute`

target_length = target_length

pagesmith.PageSplitter.text `instance-attribute`

text = text

Functions

pagesmith.PageSplitter.find_nearest_page_end

find_nearest_page_end(page_start_index: int) -> int

Find the nearest page end.

pagesmith.PageSplitter.find_nearest_page_end_match

find_nearest_page_end_match(page_start_index: int, pattern: Pattern[str]) -> int | None

Find the nearest regex match around expected end of page.

If there is no such match in the vicinity, return None. The vicinity is calculated from the expected PAGE_LENGTH_TARGET and PAGE_LENGTH_ERROR_TOLERANCE.
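
The search described above can be sketched roughly as follows. This is a standalone illustration, not the library's implementation: the function name `nearest_match_end`, its extra parameters, and the choice to return the end of the match are all assumptions.

```python
import re
from typing import Optional, Pattern


def nearest_match_end(
    text: str,
    page_start: int,
    pattern: Pattern[str],
    target_length: int = 100,
    error_tolerance: float = 0.25,
) -> Optional[int]:
    """Return the end of the match closest to the expected page end, or None."""
    expected_end = page_start + target_length
    # Assumed vicinity: target length plus or minus the tolerance fraction.
    window_start = page_start + int(target_length * (1 - error_tolerance))
    window_end = page_start + int(target_length * (1 + error_tolerance))
    candidates = [
        m.end()
        for m in pattern.finditer(text, window_start, min(window_end, len(text)))
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda pos: abs(pos - expected_end))


sentence_end = re.compile(r"\.\s")
sample = "a" * 90 + ". " + "b" * 50 + ". " + "c" * 100
print(nearest_match_end(sample, 0, sentence_end))  # 92
```

Only matches inside the window are considered, so a sentence end far from the expected position never produces a page that is much too short or too long.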

pagesmith.PageSplitter.handle_p_tag_split

handle_p_tag_split(page_start_index: int, nearest_page_end: int) -> int

Find the position of the last closing `</p>` tag before the split.
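
One way to picture this step is a reverse search within the current page. The sketch below assumes the tag in question is the closing `</p>` and that the split falls back to the proposed position when no such tag exists; the function name `last_closing_p_before` is hypothetical, and the real method may return something different.

```python
def last_closing_p_before(text: str, page_start: int, split_at: int) -> int:
    """Return the index just past the last '</p>' in text[page_start:split_at].

    Assumption: when the page contains no closing tag, keep the proposed split.
    """
    tag = "</p>"
    pos = text.rfind(tag, page_start, split_at)
    return split_at if pos == -1 else pos + len(tag)


html = "<p>one</p><p>two three</p>"
print(last_closing_p_before(html, 0, 18))  # 10
```

Moving the split back to index 10 keeps the second paragraph whole instead of cutting it in the middle.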

pagesmith.PageSplitter.normalize

normalize(text: str) -> str

pagesmith.PageSplitter.pages

pages() -> Iterator[str]

Split a text into pages of approximately equal length.

Also clear headings and recollect them during page generation.

pagesmith.ChapterDetector

Detect chapters in pure text to create a Table of Contents.

Attributes

pagesmith.ChapterDetector.min_chapter_distance `instance-attribute`

min_chapter_distance = min_chapter_distance

Functions

pagesmith.ChapterDetector.get_chapters

get_chapters(page_text: str) -> list[Chapter]

Detect chapter headings in the text.

Return a list of Chapter objects containing:

- title: The chapter title
- position: The character position in the text where the chapter starts
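
To illustrate the shape of the result, here is a minimal regex-based detector. The `Chapter` dataclass, the single heading pattern, and the reading of `min_chapter_distance` as a minimum number of characters between headings are assumptions made for this sketch; pagesmith's own patterns are likely richer.

```python
import re
from dataclasses import dataclass


@dataclass
class Chapter:
    title: str
    position: int


# Hypothetical pattern: "Chapter" followed by an Arabic or Roman numeral.
HEADING = re.compile(
    r"^(chapter\s+(?:\d+|[ivxlcdm]+)\b[^\n]*)$", re.IGNORECASE | re.MULTILINE
)


def detect_chapters(text: str, min_distance: int = 20) -> list[Chapter]:
    """Collect headings, skipping any that follow the previous one too closely."""
    chapters: list[Chapter] = []
    for match in HEADING.finditer(text):
        if chapters and match.start() - chapters[-1].position < min_distance:
            continue
        chapters.append(Chapter(title=match.group(1).strip(), position=match.start()))
    return chapters


book = "Chapter 1\nIt was a dark night.\n\nChapter II: The Storm\nRain fell.\n"
for chapter in detect_chapters(book):
    print(chapter.position, chapter.title)
```

The positions can be used directly as offsets into the text when building a table of contents.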

pagesmith.ChapterDetector.prepare_chapter_patterns

prepare_chapter_patterns() -> list[Pattern[str]]

Prepare regex patterns for detecting chapter headings.

pagesmith.refine_html

Attributes

pagesmith.refine_html.ALLOWED_TAGS `module-attribute`

ALLOWED_TAGS = ('p', 'div', 'span', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'ul', 'ol', 'li', 'a', 'img', 'br', 'hr', 'table', 'tr', 'td', 'th', 'thead', 'tbody', 'b', 'i', 'strong', 'em', 'code', 'pre', 'blockquote', 'sub', 'small', 'sup')

pagesmith.refine_html.CDATA_END `module-attribute`

CDATA_END = ']]>'

pagesmith.refine_html.CDATA_START `module-attribute`

CDATA_START = '<![CDATA['

pagesmith.refine_html.KEEP_EMPTY_TAGS `module-attribute`

KEEP_EMPTY_TAGS = ('img', 'br', 'hr', 'input', 'a')

pagesmith.refine_html.REMOVE_WITH_CONTENT `module-attribute`

REMOVE_WITH_CONTENT = ('script', 'style', 'head', 'iframe', 'noscript')

pagesmith.refine_html.TAGS_WITH_CLASSES `module-attribute`

TAGS_WITH_CLASSES = {'h1': 'display-4 fw-semibold text-primary mb-4', 'h2': 'display-5 fw-semibold text-secondary mb-3', 'h3': 'h3 fw-normal text-dark mb-3', 'h4': 'h4 fw-normal text-dark mb-2', 'h5': 'h5 fw-normal text-dark mb-2'}

pagesmith.refine_html.input_html `module-attribute`

input_html = "<![CDATA[This is CDATA content with <tags> that shouldn't be parsed]]>"

pagesmith.refine_html.logger `module-attribute`

logger = getLogger(__name__)

pagesmith.refine_html.result `module-attribute`

result = refine_html(input_html)

Functions

pagesmith.refine_html.collapse_consecutive_br

collapse_consecutive_br(root: Element, keep_empty_tags_set: set[str], ids_to_keep_set: set[str]) -> None

From a sequence of consecutive `<br>` tags, keep only the first one.

This function searches for consecutive `<br>` tags and removes all but the first one in each sequence. Whitespace between `<br>` tags is ignored when determining whether tags are consecutive.

Parameters:

Name	Type	Description	Default
`root`	`Element`	The root element of the lxml tree	required
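
The collapsing behaviour can be sketched with the standard library's `xml.etree.ElementTree` (pagesmith itself works on lxml trees). The sketch assumes the tags being collapsed are `<br>` elements, omits the `keep_empty_tags_set` and `ids_to_keep_set` parameters, and uses the hypothetical name `collapse_br_runs`.

```python
import xml.etree.ElementTree as ET


def collapse_br_runs(root: ET.Element) -> None:
    """Keep only the first <br> in each run separated by whitespace alone."""
    for parent in list(root.iter()):
        kept_br = None  # first <br> of the current run, if any
        for child in list(parent):
            is_br = child.tag == "br"
            if is_br and kept_br is not None and not (kept_br.tail or "").strip():
                # Duplicate <br>: move its trailing text to the kept one, then drop it.
                kept_br.tail = (kept_br.tail or "") + (child.tail or "")
                parent.remove(child)
            elif is_br:
                kept_br = child
            else:
                kept_br = None


doc = ET.fromstring("<div>a<br/> <br/><br/>b<br/>c</div>")
collapse_br_runs(doc)
print(len(doc.findall("br")))  # 2
```

Text that follows a removed tag is appended to the tail of the kept tag, so no visible content is lost.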

pagesmith.refine_html.has_meaningful_content

has_meaningful_content(element: Element, keep_empty_tags_set: set[str], ids_to_keep_set: set[str], check_tail: bool = True) -> bool

Check whether the element or any of its children has non-whitespace content or is in keep_empty_tags_set.
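
A recursive check along these lines might look like the sketch below, written against `xml.etree.ElementTree` rather than lxml. The name `has_content` and the exact order of the checks are assumptions.

```python
import xml.etree.ElementTree as ET


def has_content(
    element: ET.Element,
    keep_empty_tags: set[str],
    ids_to_keep: set[str],
    check_tail: bool = True,
) -> bool:
    """Return True if the element should survive empty-element removal."""
    if element.tag in keep_empty_tags or element.get("id") in ids_to_keep:
        return True
    if (element.text or "").strip():
        return True
    if check_tail and (element.tail or "").strip():
        return True
    # A child's tail text sits inside this element, so it always counts.
    return any(has_content(child, keep_empty_tags, ids_to_keep) for child in element)


print(has_content(ET.fromstring("<div><span> </span></div>"), set(), set()))  # False
print(has_content(ET.fromstring("<div><img/></div>"), {"img"}, set()))  # True
```

Passing `check_tail=False` is useful when deciding whether the element itself is empty, independent of the text that follows it.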

pagesmith.refine_html.process_class_and_style

process_class_and_style(root: Element, tags_with_classes: dict[str, str]) -> None

Remove class and style attributes from elements not in tags_with_classes.
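
The attribute handling can be illustrated with a short standard-library sketch. It assumes that tags listed in the mapping receive the mapped class string, while every other element loses both attributes; the name `apply_classes` is hypothetical, and the real function may treat mapped tags differently.

```python
import xml.etree.ElementTree as ET


def apply_classes(root: ET.Element, tags_with_classes: dict[str, str]) -> None:
    """Strip class/style from unmapped tags; set the mapped class on the rest."""
    for element in root.iter():
        if element.tag in tags_with_classes:
            # Assumption: the mapped class replaces whatever was there before.
            element.set("class", tags_with_classes[element.tag])
        else:
            element.attrib.pop("class", None)
            element.attrib.pop("style", None)


page = ET.fromstring(
    '<div class="x" style="color:red"><h1 class="old">T</h1><p class="y">t</p></div>'
)
apply_classes(page, {"h1": "display-4"})
print(ET.tostring(page, encoding="unicode"))
```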

pagesmith.refine_html.refine_html

refine_html(input_html: str | None = None, *, root: Optional[Element] = None, allowed_tags: Iterable[str] = ALLOWED_TAGS, tags_to_remove_with_content: Iterable[str] = REMOVE_WITH_CONTENT, keep_empty_tags: Iterable[str] = KEEP_EMPTY_TAGS, ids_to_keep: Iterable[str] = (), tags_with_classes: dict[str, str] | None = None) -> str

Sanitize and normalize HTML content.

Parameters:

Name	Type	Description	Default
`input_html`	`str \| None`	HTML string to clean	`None`
`root`	`Optional[Element]`	lxml tree root element, as an alternative to input_html	`None`
`allowed_tags`	`Iterable[str]`	Tags that are allowed in the output HTML	`ALLOWED_TAGS`
`tags_to_remove_with_content`	`Iterable[str]`	Tags to be completely removed along with their content	`REMOVE_WITH_CONTENT`
`keep_empty_tags`	`Iterable[str]`	Tags that should be kept even if they have no content	`KEEP_EMPTY_TAGS`
`ids_to_keep`	`Iterable[str]`	IDs that should be kept even if their tags are not in allowed_tags	`()`
`tags_with_classes`	`dict[str, str] \| None`	Dictionary mapping tag names to class strings to add	`None`

Returns:

Type	Description
`str`	Cleaned HTML string

pagesmith.refine_html.remove_empty_elements

remove_empty_elements(ids_to_keep_set: set[str], keep_empty_tags_set: set[str], root: Element) -> None

Remove empty elements and divs that contain only `<br>` tags and whitespace.

Parameters:

Name	Type	Description	Default
`ids_to_keep_set`	`set[str]`	Set of element IDs that should be preserved	required
`keep_empty_tags_set`	`set[str]`	Set of tags that should be kept even when empty	required
`root`	`Element`	The root element of the lxml tree	required

Returns:

Type	Description
`None`	List of removed elements

pagesmith.refine_html.remove_tags_with_content

remove_tags_with_content(root: Element, tags_to_remove_set: set[str]) -> None

Remove specified tags along with their content.
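
As a rough illustration using `xml.etree.ElementTree`, removal with content could be sketched like this. Preserving the text that follows a removed element is an assumption of the sketch, and `drop_with_content` is a hypothetical name.

```python
import xml.etree.ElementTree as ET


def drop_with_content(root: ET.Element, tags_to_remove: set[str]) -> None:
    """Delete matching elements and everything inside them, keeping tail text."""
    for parent in list(root.iter()):
        previous = None  # last surviving sibling before the current child
        for child in list(parent):
            if child.tag not in tags_to_remove:
                previous = child
                continue
            if child.tail:
                # Re-attach the text that followed the removed element.
                if previous is None:
                    parent.text = (parent.text or "") + child.tail
                else:
                    previous.tail = (previous.tail or "") + child.tail
            parent.remove(child)


tree = ET.fromstring("<div>a<script>x()</script>b<p>c<style>s</style>d</p></div>")
drop_with_content(tree, {"script", "style"})
print(ET.tostring(tree, encoding="unicode"))  # <div>ab<p>cd</p></div>
```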

pagesmith.refine_html.unwrap_unknow_tags

unwrap_unknow_tags(allowed_tags_set: set[str], ids_to_keep_set: set[str], root: Element) -> None

Unwrap tags that are not in the allowed set but preserve their content.
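
Unwrapping differs from removal in that the children and text move up into the parent. The following standard-library sketch shows the idea; it ignores `ids_to_keep_set`, never unwraps the root itself, and uses hypothetical names (`unwrap_unknown`, `_append_text`).

```python
import xml.etree.ElementTree as ET


def _append_text(parent: ET.Element, index: int, text) -> None:
    """Append text immediately before child position `index` of parent."""
    if not text:
        return
    if index == 0:
        parent.text = (parent.text or "") + text
    else:
        sibling = parent[index - 1]
        sibling.tail = (sibling.tail or "") + text


def unwrap_unknown(parent: ET.Element, allowed: set[str]) -> None:
    """Replace each disallowed descendant with its own text and children."""
    for child in list(parent):
        unwrap_unknown(child, allowed)  # depth first, so nested tags go first
        if child.tag in allowed:
            continue
        index = list(parent).index(child)
        _append_text(parent, index, child.text)
        grandchildren = list(child)
        for offset, grandchild in enumerate(grandchildren):
            parent.insert(index + offset, grandchild)
        parent.remove(child)
        _append_text(parent, index + len(grandchildren), child.tail)


node = ET.fromstring("<div>a<font>b<b>c</b>d</font>e</div>")
unwrap_unknown(node, {"div", "b"})
print(ET.tostring(node, encoding="unicode"))  # <div>ab<b>c</b>de</div>
```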
