pagesmith.HtmlPageSplitter
Split HTML into pages, preserving HTML tags while respecting the original document structure.
Attributes
pagesmith.HtmlPageSplitter.max_error
instance-attribute
max_error = max_size - target_length
pagesmith.HtmlPageSplitter.max_size
instance-attribute
max_size = int(target_length * (1 + error_tolerance))
pagesmith.HtmlPageSplitter.root
instance-attribute
root = root
pagesmith.HtmlPageSplitter.target_page_size
instance-attribute
target_page_size = target_length
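The size attributes can be read as a budget around the target page size. The following is a minimal sketch, assuming error_tolerance is a fraction of target_length (so 0.25 means "up to 25% over"). The helper name size_budget and the sample numbers are hypothetical and not part of pagesmith.

```python
def size_budget(target_length: int, error_tolerance: float) -> tuple[int, int]:
    """Return (max_size, max_error) for a target page size."""
    # Assumption: the tolerance scales the target, e.g. 0.25 means "+25%".
    max_size = int(target_length * (1 + error_tolerance))
    # max_error is the number of characters a page may exceed the target by.
    max_error = max_size - target_length
    return max_size, max_error


max_size, max_error = size_budget(target_length=3000, error_tolerance=0.25)
print(max_size, max_error)
```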
Functions
pagesmith.HtmlPageSplitter.pages
pages() -> Iterator[str]
Split content into pages.
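To illustrate what "preserving HTML tags" can mean in practice, here is a deliberately simplified sketch that is not the library's algorithm. It uses the standard library's xml.etree.ElementTree instead of lxml, assumes well-formed markup, and keeps every top-level child element whole so no tag is ever cut. The function name split_children is hypothetical, and the sketch makes no attempt to split inside an oversized element.

```python
import xml.etree.ElementTree as ET
from typing import Iterator


def split_children(root: ET.Element, target_page_size: int) -> Iterator[str]:
    """Group whole child elements into pages by their visible text length."""
    page: list[str] = []
    size = 0
    for child in root:
        html = ET.tostring(child, encoding="unicode")
        length = len("".join(child.itertext()))
        # Start a new page when adding this element would exceed the target.
        if page and size + length > target_page_size:
            yield "".join(page)
            page, size = [], 0
        page.append(html)
        size += length
    if page:
        yield "".join(page)


root = ET.fromstring("<body><p>aaaa</p><p>bbbb</p><p>cc</p></body>")
for number, page in enumerate(split_children(root, target_page_size=6), start=1):
    print(number, page)
```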
pagesmith.PageSplitter
Split pure text into pages at natural break points such as paragraphs or sentences.
Attributes
pagesmith.PageSplitter.end
instance-attribute
end = len(text) if end == 0 else end
pagesmith.PageSplitter.error_tolerance
instance-attribute
error_tolerance = error_tolerance
pagesmith.PageSplitter.max_length
instance-attribute
max_length = int(target_length * (1 + error_tolerance))
pagesmith.PageSplitter.min_length
instance-attribute
min_length = int(target_length * (1 - error_tolerance))
pagesmith.PageSplitter.start
instance-attribute
start = start
pagesmith.PageSplitter.target_length
instance-attribute
target_length = target_length
pagesmith.PageSplitter.text
instance-attribute
text = text
Functions
pagesmith.PageSplitter.find_nearest_page_end
find_nearest_page_end(page_start_index: int) -> int
Find the nearest page end.
pagesmith.PageSplitter.find_nearest_page_end_match
find_nearest_page_end_match(page_start_index: int, pattern: Pattern[str]) -> int | None
Find the nearest regex match around expected end of page.
If there is no such match in the vicinity, return None. The vicinity is calculated from the expected PAGE_LENGTH_TARGET and PAGE_LENGTH_ERROR_TOLERANCE.
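The vicinity search can be sketched as follows. This is an illustration under assumptions, not the library's code: TARGET and TOLERANCE are hypothetical stand-ins for PAGE_LENGTH_TARGET and PAGE_LENGTH_ERROR_TOLERANCE (whose real values are not given here), the window is taken as target * (1 ± tolerance), and the function name nearest_match_end is made up.

```python
import re
from typing import Optional, Pattern

# Hypothetical values standing in for PAGE_LENGTH_TARGET and
# PAGE_LENGTH_ERROR_TOLERANCE.
TARGET = 40
TOLERANCE = 0.25


def nearest_match_end(text: str, page_start: int, pattern: Pattern[str]) -> Optional[int]:
    """Return the end index of the match closest to the expected page end, or None."""
    expected_end = page_start + TARGET
    lo = page_start + int(TARGET * (1 - TOLERANCE))
    hi = min(page_start + int(TARGET * (1 + TOLERANCE)), len(text))
    best = None
    # Only matches that lie inside the [lo, hi) window are considered.
    for m in pattern.finditer(text, lo, hi):
        if best is None or abs(m.end() - expected_end) < abs(best - expected_end):
            best = m.end()
    return best


sentence_end = re.compile(r"\.\s")
sample = "a" * 35 + ". " + "b" * 20
print(nearest_match_end(sample, 0, sentence_end))
```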
pagesmith.PageSplitter.handle_p_tag_split
handle_p_tag_split(page_start_index: int, nearest_page_end: int) -> int
Find the position of the last closing </p> tag before the split.
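A minimal sketch of the idea, assuming the tag in question is </p> and that the adjusted split point falls immediately after it. The function name, the extra text parameter, and the fallback of returning the original split point when no closing tag is found are all assumptions, not documented behavior.

```python
def last_closing_p_before(text: str, page_start: int, split_at: int) -> int:
    """Move a split point back so it falls right after the last </p> on the page."""
    closing = "</p>"
    # rfind only reports tags that lie entirely within text[page_start:split_at].
    pos = text.rfind(closing, page_start, split_at)
    if pos == -1:
        return split_at  # assumed fallback: no closing tag, keep the original split
    return pos + len(closing)


html = "<p>one</p><p>two three</p>"
cut = last_closing_p_before(html, 0, 18)
print(html[:cut])
```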
pagesmith.PageSplitter.normalize
normalize(text: str) -> str
pagesmith.PageSplitter.pages
pages() -> Iterator[str]
Split a text into pages of approximately equal length.
Also clear headings and recollect them during pages generation.
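The page loop can be sketched like this. It is a simplified illustration, not the library's implementation: it ignores heading handling entirely, the break patterns and their priority order are assumptions, and split_pages with its default values is a hypothetical name.

```python
import re
from typing import Iterator

# Break patterns from strongest to weakest (an assumed ordering):
# paragraph break, sentence end, any whitespace.
BREAKS = [re.compile(r"\n\s*\n"), re.compile(r"[.!?]\s+"), re.compile(r"\s+")]


def split_pages(text: str, target: int = 40, tolerance: float = 0.25) -> Iterator[str]:
    """Yield pages of roughly `target` characters, cut at natural break points."""
    start = 0
    while start < len(text):
        expected = start + target
        lo = start + int(target * (1 - tolerance))
        hi = start + int(target * (1 + tolerance))
        if hi >= len(text):
            yield text[start:]  # the remainder fits on one page
            return
        end = expected  # hard cut when no break point is in range
        for pattern in BREAKS:
            ends = [m.end() for m in pattern.finditer(text, lo, hi)]
            if ends:
                # Pick the break closest to the expected page end.
                end = min(ends, key=lambda e: abs(e - expected))
                break
        yield text[start:end]
        start = end


for page in split_pages("One sentence here. " * 10):
    print(repr(page))
```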
pagesmith.ChapterDetector
Detect chapters in pure text to create a Table of Contents.
Attributes
pagesmith.ChapterDetector.min_chapter_distance
instance-attribute
min_chapter_distance = min_chapter_distance
Functions
pagesmith.ChapterDetector.get_chapters
get_chapters(page_text: str) -> list[Chapter]
Detect chapter headings in the text.
Return a list of Chapter objects containing:
- title: The chapter title
- position: The character position in the text where the chapter starts
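A rough sketch of chapter detection with a minimum distance between headings. The Chapter dataclass mirrors the two documented fields, but the single heading pattern, the function name detect_chapters, and the way min_chapter_distance is applied (measured from the previously accepted heading) are assumptions for illustration only.

```python
import re
from dataclasses import dataclass


@dataclass
class Chapter:
    title: str
    position: int


# One hypothetical heading pattern; the library's actual patterns are not shown here.
HEADING = re.compile(r"^(Chapter\s+\d+[^\n]*)$", re.MULTILINE)


def detect_chapters(text: str, min_chapter_distance: int = 20) -> list[Chapter]:
    """Collect headings, skipping any that start too close to the previous one."""
    chapters: list[Chapter] = []
    for m in HEADING.finditer(text):
        if chapters and m.start() - chapters[-1].position < min_chapter_distance:
            continue
        chapters.append(Chapter(title=m.group(1).strip(), position=m.start()))
    return chapters


book = "Chapter 1 Start\n" + "x" * 30 + "\nChapter 2 Next\nChapter 3 Too close\n"
for chapter in detect_chapters(book):
    print(chapter.position, chapter.title)
```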
pagesmith.ChapterDetector.prepare_chapter_patterns
prepare_chapter_patterns() -> list[Pattern[str]]
Prepare regex patterns for detecting chapter headings.
pagesmith.refine_html
Attributes
pagesmith.refine_html.ALLOWED_TAGS
module-attribute
ALLOWED_TAGS = ('p', 'div', 'span', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'ul', 'ol', 'li', 'a', 'img', 'br', 'hr', 'table', 'tr', 'td', 'th', 'thead', 'tbody', 'b', 'i', 'strong', 'em', 'code', 'pre', 'blockquote', 'sub', 'small', 'sup')
pagesmith.refine_html.CDATA_END
module-attribute
CDATA_END = ']]>'
pagesmith.refine_html.CDATA_START
module-attribute
CDATA_START = '<![CDATA['
pagesmith.refine_html.KEEP_EMPTY_TAGS
module-attribute
KEEP_EMPTY_TAGS = ('img', 'br', 'hr', 'input', 'a')
pagesmith.refine_html.REMOVE_WITH_CONTENT
module-attribute
REMOVE_WITH_CONTENT = ('script', 'style', 'head', 'iframe', 'noscript')
pagesmith.refine_html.TAGS_WITH_CLASSES
module-attribute
TAGS_WITH_CLASSES = {'h1': 'display-4 fw-semibold text-primary mb-4', 'h2': 'display-5 fw-semibold text-secondary mb-3', 'h3': 'h3 fw-normal text-dark mb-3', 'h4': 'h4 fw-normal text-dark mb-2', 'h5': 'h5 fw-normal text-dark mb-2'}
pagesmith.refine_html.input_html
module-attribute
input_html = "<![CDATA[This is CDATA content with <tags> that shouldn't be parsed]]>"
pagesmith.refine_html.logger
module-attribute
logger = getLogger(__name__)
Functions
pagesmith.refine_html.collapse_consecutive_br
collapse_consecutive_br(root: Element, keep_empty_tags_set: set[str], ids_to_keep_set: set[str]) -> None
From a sequence of <br> tags, keep only the first one.
This function searches for consecutive <br> tags and removes all but the first one in each sequence. Whitespace between <br> tags is ignored when determining whether tags are consecutive.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
root | Element | The root element of the lxml tree | required |
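The collapsing rule can be sketched with the standard library's xml.etree.ElementTree (the library itself works on an lxml tree). This sketch assumes well-formed markup and ignores the keep_empty_tags_set and ids_to_keep_set parameters, whose role is not described here. The name collapse_br is hypothetical, and keeping the text that followed a removed tag is an assumption.

```python
import xml.etree.ElementTree as ET


def collapse_br(root: ET.Element) -> None:
    """Keep only the first <br> in each run of consecutive <br> siblings."""
    for parent in list(root.iter()):
        previous = None
        for child in list(parent):
            is_repeat = (
                child.tag == "br"
                and previous is not None
                and previous.tag == "br"
                # Only whitespace may separate two "consecutive" tags.
                and not (previous.tail or "").strip()
            )
            if is_repeat:
                # Keep any text that followed the removed tag.
                previous.tail = (previous.tail or "") + (child.tail or "")
                parent.remove(child)
            else:
                previous = child


root = ET.fromstring("<div>a<br/> <br/><br/>b<br/>c</div>")
collapse_br(root)
print(ET.tostring(root, encoding="unicode"))
```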
pagesmith.refine_html.has_meaningful_content
has_meaningful_content(element: Element, keep_empty_tags_set: set[str], ids_to_keep_set: set[str], check_tail: bool = True) -> bool
Check whether the element or any of its children has non-whitespace content or is in keep_empty_tags_set.
pagesmith.refine_html.process_class_and_style
process_class_and_style(root: Element, tags_with_classes: dict[str, str]) -> None
Remove class and style attributes from elements not in tags_with_classes.
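A small sketch of this step using the standard library's xml.etree.ElementTree rather than lxml. The removal branch follows the description; assigning the configured class string to mapped tags is an assumption based on how the tags_with_classes option of refine_html is described, and apply_classes is a hypothetical name.

```python
import xml.etree.ElementTree as ET


def apply_classes(root: ET.Element, tags_with_classes: dict[str, str]) -> None:
    """Strip class/style from unmapped tags and set classes on mapped ones."""
    for element in root.iter():
        if element.tag in tags_with_classes:
            # Assumption: mapped tags receive the configured class string.
            element.set("class", tags_with_classes[element.tag])
        else:
            element.attrib.pop("class", None)
            element.attrib.pop("style", None)


root = ET.fromstring(
    '<div class="x" style="color:red"><h1 class="old">T</h1><p style="a">t</p></div>'
)
apply_classes(root, {"h1": "display-4"})
print(ET.tostring(root, encoding="unicode"))
```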
pagesmith.refine_html.refine_html
refine_html(input_html: str | None = None, *, root: Optional[Element] = None, allowed_tags: Iterable[str] = ALLOWED_TAGS, tags_to_remove_with_content: Iterable[str] = REMOVE_WITH_CONTENT, keep_empty_tags: Iterable[str] = KEEP_EMPTY_TAGS, ids_to_keep: Iterable[str] = (), tags_with_classes: dict[str, str] | None = None) -> str
Sanitize and normalize HTML content.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input_html | str \| None | HTML string to clean | None |
root | Optional[Element] | lxml tree root element, used as an alternative to input_html | None |
allowed_tags | Iterable[str] | Tags that are allowed in the output HTML | ALLOWED_TAGS |
tags_to_remove_with_content | Iterable[str] | Tags to be completely removed along with their content | REMOVE_WITH_CONTENT |
keep_empty_tags | Iterable[str] | Tags that should be kept even if they have no content | KEEP_EMPTY_TAGS |
ids_to_keep | Iterable[str] | IDs that should be kept even if their tags are not in allowed_tags | () |
tags_with_classes | dict[str, str] \| None | Dictionary mapping tag names to class strings to add | None |
Returns:
Type | Description |
---|---|
str | Cleaned HTML string |
pagesmith.refine_html.remove_empty_elements
remove_empty_elements(ids_to_keep_set: set[str], keep_empty_tags_set: set[str], root: Element) -> None
Remove empty elements and divs that contain only <br> tags and whitespace.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
ids_to_keep_set | set[str] | Set of element IDs that should be preserved | required |
keep_empty_tags_set | set[str] | Set of tags that should be kept even when empty | required |
root | Element | The root element of the lxml tree | required |
Returns:
Type | Description |
---|---|
None | Nothing is returned; the tree is modified in place |
pagesmith.refine_html.remove_tags_with_content
remove_tags_with_content(root: Element, tags_to_remove_set: set[str]) -> None
Remove specified tags along with their content.
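A sketch of removing tags together with their content, written against the standard library's xml.etree.ElementTree rather than lxml. In this tree model the text after a closing tag is stored on the removed element as its tail, so the sketch moves it to the preceding node. Whether the library preserves that text is an assumption, and drop_with_content is a hypothetical name.

```python
import xml.etree.ElementTree as ET


def drop_with_content(root: ET.Element, tags_to_remove: set[str]) -> None:
    """Remove matching elements and everything inside them."""
    for parent in list(root.iter()):
        previous = None
        for child in list(parent):
            if child.tag in tags_to_remove:
                # The tail is text after the closing tag; it is not part of
                # the removed content, so attach it to what came before.
                if child.tail:
                    if previous is None:
                        parent.text = (parent.text or "") + child.tail
                    else:
                        previous.tail = (previous.tail or "") + child.tail
                parent.remove(child)
            else:
                previous = child


root = ET.fromstring("<div>a<script>x()</script>b<p>c<style>s</style></p>d</div>")
drop_with_content(root, {"script", "style"})
print(ET.tostring(root, encoding="unicode"))
```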
pagesmith.refine_html.unwrap_unknow_tags
unwrap_unknow_tags(allowed_tags_set: set[str], ids_to_keep_set: set[str], root: Element) -> None
Unwrap tags that are not in the allowed set but preserve their content.
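Unwrapping can be sketched as splicing a disallowed element's text and children into its parent. The code below uses the standard library's xml.etree.ElementTree instead of lxml, never checks the root element itself, and ignores ids_to_keep_set. The function name unwrap and its two-argument shape are hypothetical.

```python
import xml.etree.ElementTree as ET


def unwrap(parent: ET.Element, allowed: set[str]) -> None:
    """Replace each disallowed child with its own text and children, depth first."""
    for child in list(parent):
        unwrap(child, allowed)  # clean descendants before touching this child
        if child.tag in allowed:
            continue
        index = list(parent).index(child)
        # The wrapper's leading text joins whatever precedes it in the parent.
        if child.text:
            if index == 0:
                parent.text = (parent.text or "") + child.text
            else:
                parent[index - 1].tail = (parent[index - 1].tail or "") + child.text
        grandchildren = list(child)
        # The wrapper's tail follows its last child, or the preceding text
        # when the wrapper had no children.
        if child.tail:
            if grandchildren:
                grandchildren[-1].tail = (grandchildren[-1].tail or "") + child.tail
            elif index == 0:
                parent.text = (parent.text or "") + child.tail
            else:
                parent[index - 1].tail = (parent[index - 1].tail or "") + child.tail
        parent.remove(child)
        for offset, grandchild in enumerate(grandchildren):
            parent.insert(index + offset, grandchild)


root = ET.fromstring("<div>a<font>b<b>c</b>d</font>e<span>f</span></div>")
unwrap(root, {"div", "b", "span"})
print(ET.tostring(root, encoding="unicode"))
```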