pagesmith.HtmlPageSplitter

Split HTML into pages, preserving HTML tags while respecting the original document structure.

Attributes

pagesmith.HtmlPageSplitter.max_error instance-attribute

max_error = max_size - target_length

pagesmith.HtmlPageSplitter.max_size instance-attribute

max_size = int(target_length * (1 + error_tolerance))

pagesmith.HtmlPageSplitter.root instance-attribute

root = root

pagesmith.HtmlPageSplitter.target_page_size instance-attribute

target_page_size = target_length
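
The size attributes above all derive from two inputs. A minimal sketch of the arithmetic, assuming `error_tolerance` is a fraction such as 0.25 that scales the target length; the helper name `page_size_limits` is hypothetical and not part of pagesmith:

```python
def page_size_limits(target_length: int, error_tolerance: float) -> tuple[int, int]:
    """Return (max_size, max_error) for a target page length.

    Assumption: the tolerance is a fraction of the target length.
    """
    max_size = int(target_length * (1 + error_tolerance))
    max_error = max_size - target_length
    return max_size, max_error


max_size, max_error = page_size_limits(3000, 0.25)
print(max_size, max_error)
```

With a target of 3000 characters and a tolerance of 0.25, a page may grow by up to a quarter of the target before it has to be cut.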

Functions

pagesmith.HtmlPageSplitter.pages

pages() -> Iterator[str]

Split content into pages.

pagesmith.PageSplitter

Split pure text into pages at natural break points such as paragraphs or sentences.

Attributes

pagesmith.PageSplitter.end instance-attribute

end = len(text) if end == 0 else end

pagesmith.PageSplitter.error_tolerance instance-attribute

error_tolerance = error_tolerance

pagesmith.PageSplitter.max_length instance-attribute

max_length = int(target_length * (1 + error_tolerance))

pagesmith.PageSplitter.min_length instance-attribute

min_length = int(target_length * (1 - error_tolerance))

pagesmith.PageSplitter.start instance-attribute

start = start

pagesmith.PageSplitter.target_length instance-attribute

target_length = target_length

pagesmith.PageSplitter.text instance-attribute

text = text

Functions

pagesmith.PageSplitter.find_nearest_page_end

find_nearest_page_end(page_start_index: int) -> int

Find the nearest page end.

pagesmith.PageSplitter.find_nearest_page_end_match

find_nearest_page_end_match(page_start_index: int, pattern: Pattern[str]) -> int | None

Find the nearest regex match around the expected end of the page.

If there is no such match in the vicinity, return None. The vicinity is calculated from the expected PAGE_LENGTH_TARGET and PAGE_LENGTH_ERROR_TOLERANCE.
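
The search described above can be sketched with the standard `re` module. This is an illustration under assumptions, not the pagesmith implementation: the function name `nearest_match_end`, its defaults, and the choice to return the end offset of the match are all hypothetical.

```python
import re
from typing import Optional, Pattern


def nearest_match_end(
    text: str,
    page_start: int,
    pattern: Pattern[str],
    target_length: int = 3000,
    error_tolerance: float = 0.25,
) -> Optional[int]:
    """Return the end offset of the match closest to the expected page end.

    Only matches inside the tolerance window are considered; None is
    returned when the window contains no match.
    """
    expected_end = page_start + target_length
    window_start = max(page_start, int(page_start + target_length * (1 - error_tolerance)))
    window_end = min(len(text), int(page_start + target_length * (1 + error_tolerance)))
    best: Optional[int] = None
    for match in pattern.finditer(text, window_start, window_end):
        if best is None or abs(match.end() - expected_end) < abs(best - expected_end):
            best = match.end()
    return best


sample = "a" * 95 + "\n\n" + "b" * 50 + "\n\n" + "c" * 100
paragraph_break = re.compile(r"\n\n")
print(nearest_match_end(sample, 0, paragraph_break, target_length=100))
```

Restricting `finditer` to the window keeps the search cheap on long texts, because only the region around the expected page end is scanned.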

pagesmith.PageSplitter.handle_p_tag_split

handle_p_tag_split(page_start_index: int, nearest_page_end: int) -> int

Find the position of the last closing </p> tag before the split.

pagesmith.PageSplitter.normalize

normalize(text: str) -> str

pagesmith.PageSplitter.pages

pages() -> Iterator[str]

Split a text into pages of approximately equal length.

Headings are also cleared and re-collected during page generation.
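
A rough sketch of how a generator of this kind can cut text into pages of approximately equal length. It assumes paragraph breaks are the only natural break points and does not handle headings; `split_pages` and `PARAGRAPH_BREAK` are hypothetical names, not pagesmith APIs.

```python
import re
from typing import Iterator

# Assumption: a blank line separates paragraphs.
PARAGRAPH_BREAK = re.compile(r"\n\s*\n")


def split_pages(
    text: str, target_length: int = 3000, error_tolerance: float = 0.25
) -> Iterator[str]:
    """Yield consecutive pages whose concatenation equals the input text."""
    min_length = int(target_length * (1 - error_tolerance))
    max_length = int(target_length * (1 + error_tolerance))
    start = 0
    while start < len(text):
        # The remainder fits on one page: emit it and stop.
        if len(text) - start <= max_length:
            yield text[start:]
            return
        expected_end = start + target_length
        candidates = [
            match.end()
            for match in PARAGRAPH_BREAK.finditer(text, start + min_length, start + max_length)
        ]
        # Prefer the break closest to the target; otherwise cut at the target.
        end = min(candidates, key=lambda pos: abs(pos - expected_end)) if candidates else expected_end
        yield text[start:end]
        start = end


book = ("word " * 30 + "\n\n") * 10
for number, page in enumerate(split_pages(book, target_length=300), start=1):
    print(number, len(page))
```

Because each page ends exactly where the next begins, joining the pages reproduces the original text.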

pagesmith.ChapterDetector

Detect chapters in pure text to create a Table of Contents.

Attributes

pagesmith.ChapterDetector.min_chapter_distance instance-attribute

min_chapter_distance = min_chapter_distance

Functions

pagesmith.ChapterDetector.get_chapters

get_chapters(page_text: str) -> list[Chapter]

Detect chapter headings in the text.

Return a list of Chapter objects containing:

- title: The chapter title
- position: The character position in the text where the chapter starts
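
To illustrate the idea, here is a minimal heading detector built on one regular expression. The pattern, the `Chapter` stand-in dataclass, and the function `detect_chapters` are assumptions made for this sketch; the patterns pagesmith actually uses are not shown in this reference.

```python
import re
from dataclasses import dataclass


@dataclass
class Chapter:
    """Stand-in for the Chapter type: a title and its character position."""
    title: str
    position: int


# Hypothetical pattern: a line that starts with "Chapter" and a number.
CHAPTER_PATTERN = re.compile(r"^\s*(Chapter\s+\d+[^\n]*)$", re.MULTILINE | re.IGNORECASE)


def detect_chapters(text: str, min_chapter_distance: int = 20) -> list[Chapter]:
    """Collect headings, skipping any that follow the previous one too closely."""
    chapters: list[Chapter] = []
    for match in CHAPTER_PATTERN.finditer(text):
        position = match.start(1)
        if chapters and position - chapters[-1].position < min_chapter_distance:
            continue
        chapters.append(Chapter(title=match.group(1).strip(), position=position))
    return chapters


text = "Chapter 1 The Start\n" + "x" * 50 + "\nChapter 2 The Middle\nChapter 3 Too Close\n"
for chapter in detect_chapters(text, min_chapter_distance=30):
    print(chapter.position, chapter.title)
```

The minimum distance filters out false positives such as a table of contents, where many heading-like lines appear within a few characters of each other.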

pagesmith.ChapterDetector.prepare_chapter_patterns

prepare_chapter_patterns() -> list[Pattern[str]]

Prepare regex patterns for detecting chapter headings.

pagesmith.refine_html

Attributes

pagesmith.refine_html.ALLOWED_TAGS module-attribute

ALLOWED_TAGS = ('p', 'div', 'span', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'ul', 'ol', 'li', 'a', 'img', 'br', 'hr', 'table', 'tr', 'td', 'th', 'thead', 'tbody', 'b', 'i', 'strong', 'em', 'code', 'pre', 'blockquote', 'sub', 'small', 'sup')

pagesmith.refine_html.CDATA_END module-attribute

CDATA_END = ']]>'

pagesmith.refine_html.CDATA_START module-attribute

CDATA_START = '<![CDATA['

pagesmith.refine_html.KEEP_EMPTY_TAGS module-attribute

KEEP_EMPTY_TAGS = ('img', 'br', 'hr', 'input', 'a')

pagesmith.refine_html.REMOVE_WITH_CONTENT module-attribute

REMOVE_WITH_CONTENT = ('script', 'style', 'head', 'iframe', 'noscript')

pagesmith.refine_html.TAGS_WITH_CLASSES module-attribute

TAGS_WITH_CLASSES = {'h1': 'display-4 fw-semibold text-primary mb-4', 'h2': 'display-5 fw-semibold text-secondary mb-3', 'h3': 'h3 fw-normal text-dark mb-3', 'h4': 'h4 fw-normal text-dark mb-2', 'h5': 'h5 fw-normal text-dark mb-2'}

pagesmith.refine_html.input_html module-attribute

input_html = "<![CDATA[This is CDATA content with <tags> that shouldn't be parsed]]>"

pagesmith.refine_html.logger module-attribute

logger = getLogger(__name__)

pagesmith.refine_html.result module-attribute

Functions

pagesmith.refine_html.collapse_consecutive_br

collapse_consecutive_br(root: Element, keep_empty_tags_set: set[str], ids_to_keep_set: set[str]) -> None

From a sequence of <br> tags, keep only the first one.

This function searches for consecutive <br> tags and removes all but the first one in each sequence. Whitespace between <br> tags is ignored when determining whether tags are consecutive.

Parameters:

Name  Type     Description                        Default
root  Element  The root element of the lxml tree  required
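
Collapsing runs of consecutive <br> tags can be sketched as follows. To stay self-contained, the example uses the standard library's `xml.etree.ElementTree` instead of lxml and ignores the `keep_empty_tags_set` and `ids_to_keep_set` arguments; `collapse_br` is a hypothetical name.

```python
import xml.etree.ElementTree as ET


def collapse_br(root: ET.Element) -> None:
    """Keep only the first <br> in each run of consecutive <br> siblings."""
    # Snapshot the elements first so removals do not disturb the traversal.
    for parent in list(root.iter()):
        previous = None
        for child in list(parent):
            is_repeat = (
                child.tag == "br"
                and previous is not None
                and previous.tag == "br"
                # Only whitespace may separate the two tags.
                and not (previous.tail or "").strip()
            )
            if is_repeat:
                # Move text that followed the dropped <br> onto the kept one.
                previous.tail = (previous.tail or "") + (child.tail or "")
                parent.remove(child)
            else:
                previous = child


root = ET.fromstring("<div>one<br/> <br/><br/>two<br/>three</div>")
collapse_br(root)
print(ET.tostring(root, encoding="unicode"))
```

In the ElementTree model, text after a tag lives in that tag's `tail`, so the tail of each removed tag must be carried over or the text would be lost.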

pagesmith.refine_html.has_meaningful_content

has_meaningful_content(element: Element, keep_empty_tags_set: set[str], ids_to_keep_set: set[str], check_tail: bool = True) -> bool

Check whether the element or any of its children has non-whitespace content, or is in keep_empty_tags_set.

pagesmith.refine_html.process_class_and_style

process_class_and_style(root: Element, tags_with_classes: dict[str, str]) -> None

Remove class and style attributes from elements not in tags_with_classes.

pagesmith.refine_html.refine_html

refine_html(input_html: str | None = None, *, root: Optional[Element] = None, allowed_tags: Iterable[str] = ALLOWED_TAGS, tags_to_remove_with_content: Iterable[str] = REMOVE_WITH_CONTENT, keep_empty_tags: Iterable[str] = KEEP_EMPTY_TAGS, ids_to_keep: Iterable[str] = (), tags_with_classes: dict[str, str] | None = None) -> str

Sanitize and normalize HTML content.

Parameters:

Name                         Type                   Description                                                      Default
input_html                   str | None             HTML string to clean                                             None
root                         Optional[Element]      Alternative to input_html: the root element of an lxml tree      None
allowed_tags                 Iterable[str]          Tags that are allowed in the output HTML                         ALLOWED_TAGS
tags_to_remove_with_content  Iterable[str]          Tags to be completely removed along with their content           REMOVE_WITH_CONTENT
keep_empty_tags              Iterable[str]          Tags that should be kept even if they have no content            KEEP_EMPTY_TAGS
ids_to_keep                  Iterable[str]          IDs that should be kept even if their tags are not in allowed_tags  ()
tags_with_classes            dict[str, str] | None  Dictionary mapping tag names to class strings to add             None

Returns:

Type  Description
str   Cleaned HTML string

pagesmith.refine_html.remove_empty_elements

remove_empty_elements(ids_to_keep_set: set[str], keep_empty_tags_set: set[str], root: Element) -> None

Remove empty elements and divs that contain only <br> tags and whitespace.

Parameters:

Name                 Type      Description                                      Default
ids_to_keep_set      set[str]  Set of element IDs that should be preserved      required
keep_empty_tags_set  set[str]  Set of tags that should be kept even when empty  required
root                 Element   The root element of the lxml tree                required

Returns:

Type  Description
None  List of removed elements

pagesmith.refine_html.remove_tags_with_content

remove_tags_with_content(root: Element, tags_to_remove_set: set[str]) -> None

Remove specified tags along with their content.
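
A sketch of what removing a tag with its content involves on a parsed tree, again using the standard library's `xml.etree.ElementTree` as a stand-in for lxml. The function name `remove_with_content` is hypothetical, and the real function may differ in how it treats surrounding text.

```python
import xml.etree.ElementTree as ET


def remove_with_content(root: ET.Element, tags_to_remove: set[str]) -> None:
    """Drop matching elements and everything inside them, keeping trailing text."""
    # Snapshot the elements first so removals do not disturb the traversal.
    for parent in list(root.iter()):
        previous = None
        for child in list(parent):
            if child.tag in tags_to_remove:
                tail = child.tail or ""
                # Text after the removed element belongs to the document, not
                # to the element, so reattach it to what precedes it.
                if previous is None:
                    parent.text = (parent.text or "") + tail
                else:
                    previous.tail = (previous.tail or "") + tail
                parent.remove(child)
            else:
                previous = child


root = ET.fromstring("<div>a<script>x()</script>b<p>c</p><style>s</style>d</div>")
remove_with_content(root, {"script", "style"})
print(ET.tostring(root, encoding="unicode"))
```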

pagesmith.refine_html.unwrap_unknow_tags

unwrap_unknow_tags(allowed_tags_set: set[str], ids_to_keep_set: set[str], root: Element) -> None

Unwrap tags that are not in the allowed set but preserve their content.