Unicode Text Handling
apsw.unicode - Up to date Unicode aware methods and lookups
This module helps with Full text search and general Unicode, addressing the following:
The standard library unicodedata has limited information available (eg no information about emoji), and is only updated to new Unicode versions on a new Python version.
Multiple consecutive codepoints can combine into a single user perceived character (grapheme cluster), such as combining accents, vowels and marks in some writing systems, variation selectors, joiners and linkers, etc. That means you can’t use indexes into str safely without potentially breaking them.
The standard library provides no help in splitting text into grapheme clusters, words, and sentences, or in breaking text into multiple lines.
Text processing is performance sensitive - FTS5 easily handles hundreds of megabytes to gigabytes of text, and so should this module. It also affects the latency of each query as that is tokenized, and results can highlight words and sentences.
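The indexing problem described above can be seen with the standard library alone. This is a small illustration, not part of this module: the sample string is an assumption chosen to contain a combining accent.

```python
import unicodedata

# "café" written as a plain 'e' followed by COMBINING ACUTE ACCENT
text = "cafe\u0301"

# len() counts codepoints, not user perceived characters
codepoint_count = len(text)

# Slicing by codepoint index separates the 'e' from its accent
truncated = text[:4]

# The final perceived character is really two codepoints
last_names = [unicodedata.name(c) for c in text[3:]]

print(codepoint_count)  # 5, although a reader sees 4 characters
print(truncated)        # cafe (the accent has been lost)
print(last_names)
```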
This module is independent of the main apsw module, and loading it does not load any database functionality. The majority of the functionality is implemented in C for size and performance reasons.
See unicode_version for the implemented version.
Grapheme cluster, word, and sentence splitting
Unicode Technical Report #29 rules for finding grapheme clusters, words, and sentences are implemented. TR29 specifies break points which can be found via grapheme_next_break(), word_next_break(), and sentence_next_break().
Building on those are iterators providing optional offsets and the text. This is used for tokenization (getting character and word boundaries correct), and for result highlighting (showing words/sentences before and after match).
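The way iterators can be built on a next break function can be sketched as follows. This is a minimal sketch, not the module's implementation. The names segments and toy_next_break are hypothetical, and toy_next_break is a crude stand-in that only attaches combining marks to the preceding codepoint so the example runs without apsw installed. Any callable with the (text, offset) -> int shape documented below, such as grapheme_next_break, could be passed instead.

```python
import unicodedata


def toy_next_break(text, offset=0):
    # Hypothetical stand-in for a TR29 break function: it only keeps
    # combining marks attached to the codepoint before them
    end = offset + 1
    while end < len(text) and unicodedata.combining(text[end]):
        end += 1
    return end


def segments(text, next_break, offset=0):
    # Repeatedly ask for the next break point and yield
    # (start, end, segment text) until the text is consumed
    while offset < len(text):
        end = next_break(text, offset)
        yield offset, end, text[offset:end]
        offset = end


for start, end, segment in segments("e\u0301a", toy_next_break):
    print(start, end, repr(segment))
```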
Line break splitting
Unicode Technical Report #14 rules for finding where text can be broken and resumed on the next line are implemented. TR14 specifies break points which can be found via line_break_next_break().
Building on those are iterators providing optional offsets and the text. This is used for text_wrap().
Unicode lookups
Category information: category()
Is an emoji or similar: is_extended_pictographic()
Flag characters: is_regional_indicator()
Codepoint names: codepoint_name()
Case folding, accent removal
casefold() is used to do case insensitive comparisons.
strip() is used to remove accents, marks, punctuation, joiners etc
Helpers
These are aware of grapheme cluster boundaries which Python’s builtin string operations are not. The text width functions take into account how wide the text is when displayed on most terminals.
grapheme_length() to get the number of grapheme clusters in a string
grapheme_substr() to get substrings using grapheme cluster indexing
grapheme_find() to find a substring
split_lines() to split text into lines using all the Unicode hard line break codepoints
text_width() to count how wide the text is
expand_tabs() to expand tabs using text width
text_width_substr() to extract substrings based on text width
text_wrap() to wrap paragraphs using Unicode words, line breaking, and text width
guess_paragraphs() to establish paragraph boundaries for text that has line breaks in paragraphs like many plain text and similar markup formats.
Size
Performance
There are some pure Python alternatives, with less functionality. They take 5 to 15 times more CPU time to process the same text. Run python3 -m apsw.unicode benchmark --help for the benchmark options.
- apsw.unicode.unicode_version = '17.0'
The Unicode version that the rules and data tables implement
- apsw.unicode.category(codepoint: int | str) -> str
Returns the general category - eg Lu for Letter Uppercase
See apsw.fts5.unicode_categories for descriptions mapping
- apsw.unicode.is_extended_pictographic(text: str) -> bool
Returns True if any of the text has the extended pictographic property (Emoji and similar)
- apsw.unicode.is_regional_indicator(text: str) -> bool
Returns True if any of the text is one of the 26 regional indicators used in pairs to represent country flags
- apsw.unicode.casefold(text: str) -> str
Returns the text for equality comparison without case distinction
Case folding maps text to a canonical form where case differences are removed allowing case insensitive comparison. Unlike upper, lower, and title case, the result is not intended to be displayed to people.
- apsw.unicode.strip(text: str) -> str
Returns the text for less exact comparison with accents, punctuation, marks etc removed
It will strip diacritics leaving the underlying characters so áççéñțś becomes accents, punctuation so e.g. becomes eg and don't becomes dont, marks so देवनागरी becomes दवनगर, as well as all spacing, formatting, variation selectors and similar codepoints.
Codepoints are also converted to their compatibility representation. For example the single codepoint Roman numeral Ⅲ becomes III (three separate regular upper case I), and 🄷🄴🄻🄻🄾 becomes HELLO.
The resulting text should not be shown to people, and is intended for doing relaxed equality comparisons, at the expense of false positives when the accents, marks, punctuation etc were intended.
You should do case folding after this.
Emoji are preserved but variation selectors, Fitzpatrick modifiers and joiners are stripped.
Regional indicators are preserved.
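For intuition only, some of the behaviour described above can be approximated with the standard library. This is a rough sketch and not how strip() is implemented: the name rough_strip is hypothetical, it relies on compatibility decomposition (NFKD) and then drops mark and punctuation categories, and it does not handle emoji, joiners, or formatting codepoints.

```python
import unicodedata


def rough_strip(text):
    # Compatibility decomposition separates base characters from their
    # marks and expands codepoints like the Roman numeral three
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop marks (categories M*) and punctuation (categories P*)
    return "".join(
        c for c in decomposed
        if not unicodedata.category(c).startswith(("M", "P"))
    )


print(rough_strip("áççéñțś"))  # accents
print(rough_strip("don't"))    # dont

# Case folding is applied after stripping, as recommended above
print(rough_strip("\u2162").casefold())  # iii
```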
- apsw.unicode.split_lines(text: str, offset: int = 0) -> Iterator[str]
Each line, using hard line break rules
This is an iterator yielding a line at a time. Each yielded line does not include the hard line break characters that ended it.
- apsw.unicode.expand_tabs(text: str, tabsize: int = 8, invalid: str = '.') -> str
Turns tabs into spaces aligning on tabsize boundaries, similar to str.expandtabs()
This is aware of grapheme clusters and text width. Codepoints that have an invalid width are also replaced by invalid. Control characters are an example of an invalid character. Line breaks are replaced with newline.
- apsw.unicode.grapheme_length(text: str, offset: int = 0) -> int
Returns number of grapheme clusters in the text. Unicode aware version of len
- apsw.unicode.grapheme_substr(text: str, start: int | None = None, stop: int | None = None) -> str
Like text[start:stop] but in grapheme cluster units
start and stop can be negative to index from the end, or outside the bounds of the text but are never an invalid combination (you get empty string returned).
To get one grapheme cluster, make stop one more than start. For example to get the 3rd last grapheme cluster:
grapheme_substr(text, -3, -3 + 1)
- apsw.unicode.grapheme_endswith(text: str, substring: str) -> bool
Returns True if text ends with substring being aware of grapheme cluster boundaries
- apsw.unicode.grapheme_startswith(text: str, substring: str) -> bool
Returns True if text starts with substring being aware of grapheme cluster boundaries
- apsw.unicode.grapheme_find(text: str, substring: str, start: int = 0, end: int | None = None) -> int
Returns the offset in text where substring can be found, being aware of grapheme clusters. The start and end of the substring have to be at a grapheme cluster boundary.
- Parameters:
start – Where in text to start the search (default beginning)
end – Where to stop the search exclusive (default remaining text)
- Returns:
offset into text, or -1 if not found or substring is zero length
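The reason boundary awareness matters for searching and prefix tests can be shown with the builtin str methods, which work on codepoints. This sketch uses only the standard library and an assumed sample string. Going by the descriptions above, the grapheme aware functions would be expected not to report these matches, because each one ends in the middle of a grapheme cluster.

```python
# "café menu" with the accent as a separate combining codepoint
text = "cafe\u0301 menu"

# str.find matches the 'e' that is really part of the 'é' cluster
hit = text.find("e")

# str.startswith accepts "cafe" even though the text starts with "café"
prefix = text.startswith("cafe")

print(hit)     # 3
print(prefix)  # True
```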
- apsw.unicode.text_width(text: str, offset: int = 0) -> int
Returns how many columns the text would be if displayed in a terminal
You should split_lines() first and then operate on each line separately.
If the text contains new lines, control characters, and similar unrepresentable codepoints then minus 1 is returned.
Terminals aren’t entirely consistent with each other, and Unicode has many kinds of codepoints, and combinations. Consequently this is right the vast majority of the time, but not always.
Note that web browsers do variable widths even in monospaced sections like <pre> so they won’t always agree with the terminal either.
- apsw.unicode.text_width_substr(text: str, width: int, offset: int = 0) -> tuple[int, str]
Extracts a substring that is width or less wide, being aware of grapheme cluster boundaries. For example you could use this to get a substring that is 80 (or less) wide.
- Returns:
A tuple of how wide the substring is, and the substring
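The text width functions above count terminal columns rather than codepoints. The difference can be illustrated with a crude standard library estimate. This is only a sketch under simplifying assumptions, not the algorithm this module uses: the name crude_width is hypothetical, combining marks are treated as zero columns, East Asian wide and fullwidth codepoints as two, and everything else as one.

```python
import unicodedata


def crude_width(text):
    # Very rough column estimate; ignores emoji sequences, control
    # characters, and many other cases a real implementation handles
    width = 0
    for c in text:
        if unicodedata.combining(c):
            continue  # drawn on top of the previous codepoint
        wide = unicodedata.east_asian_width(c) in ("W", "F")
        width += 2 if wide else 1
    return width


japanese = "\u65e5\u672c"  # two codepoints, each two columns wide
print(len(japanese), crude_width(japanese))  # 2 4
print(len("e\u0301"), crude_width("e\u0301"))  # 2 1
```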
- apsw.unicode.guess_paragraphs(text: str, tabsize: int = 8) -> str
Given text that contains paragraphs containing newlines, guesses where the paragraphs end. The returned str will have \n (newline) removed where it was determined to not mark a paragraph end.
    If you have text like this, where paragraphs have newlines in
    them, then each line gets wrapped separately by text_wrap.
    This function tries to guess where the paragraphs end.

    Blank lines like above are definite.
      Indented lines that continue preserving the indent
      are considered the same paragraph, and a change of indent
      (in or out) is a new paragraph.
        So this will be a new paragraph,
    And this will be a new paragraph.

     * Punctuation/numbers at the start of line
       followed by indented text are considered the same
       paragraph
    2. So this is a new paragraph, while
       this line is part of the line above

    3. Optional numbers followed by punctuation then space
    - are considered new paragraphs
- apsw.unicode.text_wrap(text: str, width: int = 70, *, tabsize: int = 8, hyphen: str = '-', combine_space: bool = True, invalid: str = '?') -> Iterator[str]
Similar to textwrap.wrap() but Unicode grapheme cluster and line break aware
Note
Newlines in the text are treated as end of paragraph. If your text has paragraphs with newlines in them, then call guess_paragraphs() first.
- Parameters:
text – string to process
width – width of yielded lines, if rendered using a monospace font such as to a terminal
tabsize – Tab stop spacing as tabs are expanded
hyphen – Used to show a segment was broken because it was wider than width
combine_space – Leading space on each line (indent) is always preserved. Other spaces where multiple occur are combined into one space.
invalid – If invalid codepoints are encountered such as control characters and surrogates then they are replaced with this.
This yields one line of str at a time, which will be exactly width when output to a terminal. It will be right padded with spaces if necessary and not have a trailing newline.
apsw.ext.format_query_table() uses this method to ensure each column is the desired width.
- apsw.unicode.codepoint_name(codepoint: int | str) -> str | None
Name or None if it doesn’t have one
For example codepoint 65 is named LATIN CAPITAL LETTER A while codepoint U+D1234 is not assigned and would return None.
- apsw.unicode.version_added(codepoint: int | str) -> str | None
Returns the Unicode version in which the codepoint was added
- apsw.unicode.version_dates = {'1.0': (1991, 10, 1), '1.1': (1993, 6, 1), '10.0': (2017, 6, 20), '11.0': (2018, 6, 5), '12.0': (2019, 3, 5), '12.1': (2019, 5, 7), '13.0': (2020, 3, 10), '14.0': (2021, 9, 14), '15.0': (2022, 9, 13), '15.1': (2023, 9, 12), '16.0': (2024, 9, 10), '17.0': (2025, 9, 9), '2.0': (1996, 7, 1), '2.1': (1998, 5, 1), '3.0': (1999, 9, 1), '3.1': (2001, 3, 1), '3.2': (2002, 3, 1), '4.0': (2003, 4, 1), '4.1': (2005, 3, 31), '5.0': (2006, 7, 14), '5.1': (2008, 4, 4), '5.2': (2009, 10, 1), '6.0': (2010, 10, 11), '6.1': (2012, 1, 31), '6.2': (2012, 9, 26), '6.3': (2013, 9, 30), '7.0': (2014, 6, 16), '8.0': (2015, 6, 17), '9.0': (2016, 6, 21)}
Release date (year, month, day) for each Unicode version, intended for use with version_added()
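The keys above are strings, so the listing shows '10.0' before '2.0'. A minimal sketch of ordering versions numerically and turning a tuple into a date follows. The dictionary here is a hand copied subset of the listing above so the example runs without apsw, and the helper names version_key and release_date are hypothetical.

```python
import datetime

# Subset copied from the version_dates listing above
version_dates = {
    "1.0": (1991, 10, 1),
    "10.0": (2017, 6, 20),
    "17.0": (2025, 9, 9),
    "9.0": (2016, 6, 21),
}


def version_key(version):
    # "10.0" -> (10, 0) so that versions compare numerically
    return tuple(int(part) for part in version.split("."))


def release_date(version):
    # Unpack (year, month, day) into a date object
    return datetime.date(*version_dates[version])


ordered = sorted(version_dates, key=version_key)
print(ordered)
print(release_date("17.0").isoformat())  # 2025-09-09
```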
- apsw.unicode.grapheme_next_break(text: str, offset: int = 0) -> int
Returns end of Grapheme cluster / User Perceived Character
For example regional indicators are in pairs, and a base codepoint can be combined with zero or more additional codepoints providing diacritics, marks, and variations. Break points are defined in the TR29 spec.
- Parameters:
text – The text to examine
offset – The first codepoint to examine
- Returns:
Index of first codepoint not part of the grapheme cluster starting at offset. You should extract text[offset:end], where end is the returned index.
- apsw.unicode.grapheme_next(text: str, offset: int = 0) -> tuple[int, int]
Returns span of next grapheme cluster
- apsw.unicode.grapheme_iter(text: str, offset: int = 0) -> Iterator[str]
Iterator providing text of each grapheme cluster
- apsw.unicode.grapheme_iter_with_offsets(text: str, offset: int = 0) -> Iterator[tuple[int, int, str]]
Iterator providing start, end, text of each grapheme cluster
- apsw.unicode.grapheme_iter_with_offsets_filtered(text: str, offset: int = 0, *, categories: Iterable[str], emoji: bool = False, regional_indicator: bool = False) -> Iterator[tuple[int, int, str]]
Iterator providing start, end, text of each grapheme cluster, provided it includes codepoints from categories, emoji, or regional indicator
- apsw.unicode.word_next_break(text: str, offset: int = 0) -> int
Returns end of next word or non-word
Finds the next break point according to the TR29 spec. Note that the segment returned may be a word, or a non-word (spaces, punctuation etc). Use word_next() to get words.
- Parameters:
text – The text to examine
offset – The first codepoint to examine
- Returns:
Next break point
- apsw.unicode.word_default_categories = {'Ll', 'Lm', 'Lo', 'Lt', 'Lu', 'Nd', 'Nl', 'No'}
Default categories for selecting word segments - letters and numbers
- apsw.unicode.word_next(text: str, offset: int = 0, *, categories: Iterable[str] = word_default_categories, emoji: bool = False, regional_indicator: bool = False) -> tuple[int, int]
Returns span of next word
A segment is considered a word if it contains at least one codepoint corresponding to any of the categories, plus:
emoji (Extended_Pictographic in Unicode specs)
regional indicator - two character sequence for flags like 🇧🇷🇨🇦
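The category test described above can be sketched with the standard library. This is an illustration under stated assumptions rather than the module's code: looks_like_word is a hypothetical name, the category set is copied from word_default_categories above, the emoji and regional indicator options are ignored, and unicodedata may implement an older Unicode version than this module.

```python
import unicodedata

# Copied from word_default_categories above
word_default_categories = {"Ll", "Lm", "Lo", "Lt", "Lu", "Nd", "Nl", "No"}


def looks_like_word(segment, categories=word_default_categories):
    # A segment counts as a word if at least one codepoint
    # falls in one of the chosen general categories
    return any(unicodedata.category(c) in categories for c in segment)


print(looks_like_word("hello"))  # True
print(looks_like_word("42"))     # True
print(looks_like_word(", "))     # False
```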
- apsw.unicode.word_iter(text: str, offset: int = 0, *, categories: Iterable[str] = word_default_categories, emoji: bool = False, regional_indicator: bool = False) -> Iterator[str]
Iterator providing text of each word
- apsw.unicode.word_iter_with_offsets(text: str, offset: int = 0, *, categories: Iterable[str] = word_default_categories, emoji: bool = False, regional_indicator: bool = False) -> Iterator[tuple[int, int, str]]
Iterator providing start, end, text of each word
- apsw.unicode.sentence_next_break(text: str, offset: int = 0) -> int
Returns end of sentence location.
Finds the next break point according to the TR29 spec. Note that the segment returned includes leading and trailing white space.
- Parameters:
text – The text to examine
offset – The first codepoint to examine
- Returns:
Next break point
- apsw.unicode.sentence_next(text: str, offset: int = 0) -> tuple[int, int]
Returns span of next sentence
- apsw.unicode.sentence_iter(text: str, offset: int = 0) -> Iterator[str]
Iterator providing text of each sentence
- apsw.unicode.sentence_iter_with_offsets(text: str, offset: int = 0) -> Iterator[tuple[int, int, str]]
Iterator providing start, end, text of each sentence
- apsw.unicode.line_break_next_break(text: str, offset: int = 0) -> int
Returns next opportunity to break a line
Finds the next break point according to the TR14 spec.
- Parameters:
text – The text to examine
offset – The first codepoint to examine
- Returns:
Next break point
- apsw.unicode.line_break_next(text: str, offset: int = 0) -> tuple[int, int]
Returns span of next line