Unicode Text Handling

apsw.unicode - Up to date Unicode aware methods and lookups

This module helps with full text search and general Unicode text handling, addressing the following:

  • The standard library unicodedata has limited information available (e.g. no information about emoji), and is only updated to new Unicode versions with a new Python version.

  • Multiple consecutive codepoints can combine into a single user perceived character (grapheme cluster), such as combining accents, vowels and marks in some writing systems, variant selectors, joiners and linkers, etc. That means you can’t use indexes into str safely without potentially breaking them.

  • The standard library provides no help in splitting text into grapheme clusters, words, and sentences, or in breaking text across multiple lines.

  • Text processing is performance sensitive - FTS5 easily handles hundreds of megabytes to gigabytes of text, and so should this module. It also affects the latency of each query, as the query text is tokenized, and results can highlight words and sentences.

This module is independent of the main apsw module, and loading it does not load any database functionality. The majority of the functionality is implemented in C for size and performance reasons.

See unicode_version for the implemented version.

Grapheme cluster, word, and sentence splitting

Unicode Technical Report #29 rules for finding grapheme clusters, words, and sentences are implemented. TR29 specifies break points which can be found via grapheme_next_break(), word_next_break(), and sentence_next_break().

Building on those are iterators providing optional offsets and the text. This is used for tokenization (getting character and word boundaries correct), and for result highlighting (showing words/sentences before and after match).

Line break splitting

Unicode Technical Report #14 rules for finding where text can be broken and resumed on the next line are implemented. TR14 specifies break points which can be found via line_break_next_break().

Building on those are iterators providing optional offsets and the text. This is used for text_wrap().

Unicode lookups

Case folding, accent removal

  • casefold() is used to do case insensitive comparisons.

  • strip() is used to remove accents, marks, punctuation, joiners, etc.

Helpers

These are aware of grapheme cluster boundaries, which Python’s builtin string operations are not. The text width functions take into account how wide the text is when displayed on most terminals.

Size

The ICU extension is 5MB of code, which then links to shared libraries containing another 5MB of code and 30MB of data. This module is 0.5MB, 5 to 50% faster, and has no dependencies. (ICU does include numerous extra customisations, formatting, locale helpers, etc.)

Performance

There are some pure Python alternatives, with less functionality. They take 5 to 15 times more CPU time to process the same text. Use python3 -m apsw.unicode benchmark --help to run the benchmarks yourself.

apsw.unicode.unicode_version = '17.0'

The Unicode version that the rules and data tables implement

apsw.unicode.category(codepoint: int | str) str[source]

Returns the general category - e.g. Lu for Letter Uppercase

See apsw.fts5.unicode_categories for a mapping of category codes to their descriptions

apsw.unicode.is_extended_pictographic(text: str) bool[source]

Returns True if any of the text has the extended pictographic property (Emoji and similar)

apsw.unicode.is_regional_indicator(text: str) bool[source]

Returns True if any of the text is one of the 26 regional indicators used in pairs to represent country flags

apsw.unicode.casefold(text: str) str[source]

Returns the text for equality comparison without case distinction

Case folding maps text to a canonical form where case differences are removed allowing case insensitive comparison. Unlike upper, lower, and title case, the result is not intended to be displayed to people.
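
For example (a minimal sketch - case folded text is for comparison only, not display):

from apsw.unicode import casefold

# folded forms compare equal regardless of original casing
assert casefold("Hello World") == casefold("HELLO WORLD")

# folding can change the length - German sharp s folds to "ss"
assert casefold("straße") == casefold("STRASSE")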

apsw.unicode.strip(text: str) str[source]

Returns the text for less exact comparison with accents, punctuation, marks etc removed

It will strip diacritics leaving the underlying characters so áççéñțś becomes accents, punctuation so e.g. becomes eg and don't becomes dont, marks so देवनागरी becomes दवनगर, as well as all spacing, formatting, variation selectors and similar codepoints.

Codepoints are also converted to their compatibility representation. For example the single codepoint Roman numeral Ⅲ becomes III (three separate regular upper case I), and 🄷🄴🄻🄻🄾 becomes HELLO.

The resulting text should not be shown to people, and is intended for doing relaxed equality comparisons, at the expense of false positives when the accents, marks, punctuation etc were intended.

You should do case folding after this.

Emoji are preserved but variation selectors, fitzpatrick and joiners are stripped.

Regional indicators are preserved.
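
A sketch using the examples above, with case folding applied afterwards for a relaxed comparison (relaxed_equal is a hypothetical helper, not part of the module):

from apsw.unicode import casefold, strip

assert strip("áççéñțś") == "accents"
assert strip("e.g.") == "eg"
assert strip("don't") == "dont"

def relaxed_equal(a: str, b: str) -> bool:
    # strip first, then case fold, as recommended above
    return casefold(strip(a)) == casefold(strip(b))

assert relaxed_equal("Café", "cafe")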

apsw.unicode.split_lines(text: str, offset: int = 0) Iterator[str][source]

Each line, using hard line break rules

This is an iterator yielding one line at a time. The yielded lines do not include the hard line break characters.
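
For example (a sketch - a \r\n pair counts as a single hard line break):

from apsw.unicode import split_lines

assert list(split_lines("first\nsecond\r\nthird")) == ["first", "second", "third"]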

apsw.unicode.expand_tabs(text: str, tabsize: int = 8, invalid: str = '.') str[source]

Turns tabs into spaces aligning on tabsize boundaries, similar to str.expandtabs()

This is aware of grapheme clusters and text width. Codepoints that have an invalid width, such as control characters, are also replaced by invalid. Line breaks are replaced with newline.
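
A sketch assuming the default tabsize of 8:

from apsw.unicode import expand_tabs

# the tab advances to the next multiple of 8 columns
assert expand_tabs("ab\tc") == "ab      c"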

apsw.unicode.grapheme_length(text: str, offset: int = 0) int[source]

Returns number of grapheme clusters in the text. Unicode aware version of len
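
For example combining characters and regional indicator pairs each count as one grapheme cluster:

from apsw.unicode import grapheme_length

accented = "e\u0301"              # e followed by COMBINING ACUTE ACCENT
assert len(accented) == 2
assert grapheme_length(accented) == 1

flag = "\U0001F1E8\U0001F1E6"     # regional indicators C and A - the Canada flag
assert len(flag) == 2
assert grapheme_length(flag) == 1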

apsw.unicode.grapheme_substr(text: str, start: int | None = None, stop: int | None = None) str[source]

Like text[start:stop] but in grapheme cluster units

start and stop can be negative to index from the end, or outside the bounds of the text, but are never an invalid combination (you get an empty string returned).

To get one grapheme cluster, make stop one more than start. For example to get the 3rd last grapheme cluster:

grapheme_substr(text, -3, -3 + 1)

apsw.unicode.grapheme_endswith(text: str, substring: str) bool[source]

Returns True if text ends with substring being aware of grapheme cluster boundaries

apsw.unicode.grapheme_startswith(text: str, substring: str) bool[source]

Returns True if text starts with substring being aware of grapheme cluster boundaries

apsw.unicode.grapheme_find(text: str, substring: str, start: int = 0, end: int | None = None) int[source]

Returns the offset in text where substring can be found, being aware of grapheme clusters. The start and end of the substring have to be at a grapheme cluster boundary.

Parameters:
  • start – Where in text to start the search (default beginning)

  • end – Where to stop the search exclusive (default remaining text)

Returns:

offset into text, or -1 if not found or substring is zero length
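
A sketch showing a match that would split a grapheme cluster is not reported, unlike str.find:

from apsw.unicode import grapheme_find

text = "e\u0301"                        # one cluster: e + COMBINING ACUTE ACCENT

assert text.find("e") == 0              # str.find splits the cluster
assert grapheme_find(text, "e") == -1   # "e" alone does not end on a cluster boundary
assert grapheme_find(text, "e\u0301") == 0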

apsw.unicode.text_width(text: str, offset: int = 0) int[source]

Returns how many columns the text would be if displayed in a terminal

You should split_lines() first and then operate on each line separately.

If the text contains new lines, control characters, and similar unrepresentable codepoints then minus 1 is returned.

Terminals aren’t entirely consistent with each other, and Unicode has many kinds of codepoints, and combinations. Consequently this is right the vast majority of the time, but not always.

Note that web browsers do variable widths even in monospaced sections like <pre> so they won’t always agree with the terminal either.
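
A sketch - East Asian wide characters typically occupy two terminal columns:

from apsw.unicode import text_width

assert text_width("hello") == 5
assert text_width("日本語") == 6          # each character is two columns wide
assert text_width("line\nbreak") == -1   # new lines are unrepresentable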

apsw.unicode.text_width_substr(text: str, width: int, offset: int = 0) tuple[int, str][source]

Extracts a substring that is at most width columns wide, being aware of grapheme cluster boundaries. For example you could use this to get a substring that is 80 (or fewer) columns wide.

Returns:

A tuple of how wide the substring is, and the substring
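
For example, to fit text into an 80 column terminal (a sketch with a hypothetical input line):

from apsw.unicode import text_width_substr

line = "some long line of text ..."      # hypothetical input
width, prefix = text_width_substr(line, 80)
# width is how many columns prefix actually occupies, never more than 80,
# and prefix never ends part way through a grapheme cluster
assert width <= 80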

apsw.unicode.guess_paragraphs(text: str, tabsize: int = 8) str[source]

Given text that contains paragraphs containing newlines, guesses where the paragraphs end. The returned str will have \n (newline) removed where it was determined to not mark a paragraph end.

If you have text like this, where paragraphs have newlines in
them, then each line gets wrapped separately by text_wrap.
This function tries to guess where the paragraphs end.

Blank lines like above are definite.
  Indented lines that continue preserving the indent
  are considered the same paragraph, and a change of indent
  (in or out) is a new paragraph.
    So this will be a new paragraph,
And this will be a new paragraph.

 * Punctuation/numbers at the start of line
   followed by indented text are considered the same
   paragraph
2. So this is a new paragraph, while
   this line is part of the line above

3. Optional numbers followed by punctuation then space
   - are considered new paragraphs

apsw.unicode.text_wrap(text: str, width: int = 70, *, tabsize: int = 8, hyphen: str = '-', combine_space: bool = True, invalid: str = '?') Iterator[str][source]

Similar to textwrap.wrap() but Unicode grapheme cluster and line break aware

Note

Newlines in the text are treated as end of paragraph. If your text has paragraphs with newlines in them, then call guess_paragraphs() first.

Parameters:
  • text – string to process

  • width – width of yielded lines, if rendered using a monospace font such as to a terminal

  • tabsize – Tab stop spacing as tabs are expanded

  • hyphen – Used to show a segment was broken because it was wider than width

  • combine_space – Leading space on each line (the indent) is always preserved. Elsewhere, runs of multiple spaces are combined into one space.

  • invalid – If invalid codepoints are encountered such as control characters and surrogates then they are replaced with this.

This yields one line of str at a time, which will be exactly width when output to a terminal. It will be right padded with spaces if necessary and not have a trailing newline.

apsw.ext.format_query_table() uses this method to ensure each column is the desired width.
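
A sketch combining guess_paragraphs() and text_wrap():

from apsw.unicode import guess_paragraphs, text_wrap

text = "A paragraph that was stored\nwith newlines inside it.\n\nA second paragraph."

for line in text_wrap(guess_paragraphs(text), width=30):
    # each yielded line is exactly 30 columns, right padded with spaces
    print(f"|{line}|")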

apsw.unicode.codepoint_name(codepoint: int | str) str | None[source]

Name or None if it doesn’t have one

For example codepoint 65 is named LATIN CAPITAL LETTER A while codepoint U+D1234 is not assigned and would return None.
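
For example:

from apsw.unicode import codepoint_name

assert codepoint_name(65) == "LATIN CAPITAL LETTER A"
assert codepoint_name("A") == "LATIN CAPITAL LETTER A"
assert codepoint_name(0xD1234) is None   # unassigned codepoint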

apsw.unicode.version_added(codepoint: int | str) str | None[source]

Returns the Unicode version the codepoint was added in

apsw.unicode.version_dates = {'1.0': (1991, 10, 1), '1.1': (1993, 6, 1), '10.0': (2017, 6, 20), '11.0': (2018, 6, 5), '12.0': (2019, 3, 5), '12.1': (2019, 5, 7), '13.0': (2020, 3, 10), '14.0': (2021, 9, 14), '15.0': (2022, 9, 13), '15.1': (2023, 9, 12), '16.0': (2024, 9, 10), '17.0': (2025, 9, 9), '2.0': (1996, 7, 1), '2.1': (1998, 5, 1), '3.0': (1999, 9, 1), '3.1': (2001, 3, 1), '3.2': (2002, 3, 1), '4.0': (2003, 4, 1), '4.1': (2005, 3, 31), '5.0': (2006, 7, 14), '5.1': (2008, 4, 4), '5.2': (2009, 10, 1), '6.0': (2010, 10, 11), '6.1': (2012, 1, 31), '6.2': (2012, 9, 26), '6.3': (2013, 9, 30), '7.0': (2014, 6, 16), '8.0': (2015, 6, 17), '9.0': (2016, 6, 21)}

Release date (year, month, day) for each Unicode version, intended for use with version_added()
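
A sketch combining version_added() with version_dates, using the Euro sign which was added in Unicode 2.1:

from apsw.unicode import version_added, version_dates

ver = version_added("€")                 # "2.1"
year, month, day = version_dates[ver]
print(f"U+20AC was added in Unicode {ver}, released {year}-{month:02}-{day:02}")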

apsw.unicode.grapheme_next_break(text: str, offset: int = 0) int[source]

Returns end of Grapheme cluster / User Perceived Character

For example regional indicators are in pairs, and a base codepoint can be combined with zero or more additional codepoints providing diacritics, marks, and variations. Break points are defined in the TR29 spec.

Parameters:
  • text – The text to examine

  • offset – The first codepoint to examine

Returns:

Index of first codepoint not part of the grapheme cluster starting at offset. You should extract text[offset:span]
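
A sketch iterating over the break points directly - grapheme_iter() does this for you:

from apsw.unicode import grapheme_next_break

def clusters(text: str):
    offset = 0
    while offset < len(text):
        end = grapheme_next_break(text, offset)
        yield text[offset:end]           # extract text[offset:end] as described above
        offset = end

assert list(clusters("e\u0301!")) == ["e\u0301", "!"]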

apsw.unicode.grapheme_next(text: str, offset: int = 0) tuple[int, int][source]

Returns span of next grapheme cluster

apsw.unicode.grapheme_iter(text: str, offset: int = 0) Iterator[str][source]

Iterator providing text of each grapheme cluster

apsw.unicode.grapheme_iter_with_offsets(text: str, offset: int = 0) Iterator[tuple[int, int, str]][source]

Iterator providing start, end, text of each grapheme cluster

apsw.unicode.grapheme_iter_with_offsets_filtered(text: str, offset: int = 0, *, categories: Iterable[str], emoji: bool = False, regional_indicator: bool = False) Iterator[tuple[int, int, str]][source]

Iterator providing start, end, text of each grapheme cluster, provided it includes codepoints from categories, emoji, or regional indicator

apsw.unicode.word_next_break(text: str, offset: int = 0) int[source]

Returns end of next word or non-word

Finds the next break point according to the TR29 spec. Note that the segment returned may be a word, or a non-word (spaces, punctuation etc). Use word_next() to get words.

Parameters:
  • text – The text to examine

  • offset – The first codepoint to examine

Returns:

Next break point

apsw.unicode.word_default_categories = {'Ll', 'Lm', 'Lo', 'Lt', 'Lu', 'Nd', 'Nl', 'No'}

Default categories for selecting word segments - letters and numbers

apsw.unicode.word_next(text: str, offset: int = 0, *, categories: Iterable[str] = word_default_categories, emoji: bool = False, regional_indicator: bool = False) tuple[int, int][source]

Returns span of next word

A segment is considered a word if it contains at least one codepoint corresponding to any of the categories, plus:

  • emoji (Extended_Pictographic in Unicode specs)

  • regional indicator - two character sequence for flags like 🇧🇷🇨🇦

apsw.unicode.word_iter(text: str, offset: int = 0, *, categories: Iterable[str] = word_default_categories, emoji: bool = False, regional_indicator: bool = False) Iterator[str][source]

Iterator providing text of each word

apsw.unicode.word_iter_with_offsets(text: str, offset: int = 0, *, categories: Iterable[str] = word_default_categories, emoji: bool = False, regional_indicator: bool = False) Iterator[tuple[int, int, str]][source]

Iterator providing start, end, text of each word
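
A sketch showing how word segmentation skips punctuation and white space with the default categories:

from apsw.unicode import word_iter, word_iter_with_offsets

assert list(word_iter("Hello, world - 42 times!")) == ["Hello", "world", "42", "times"]

for start, end, word in word_iter_with_offsets("Hello, world"):
    print(start, end, word)              # 0 5 Hello then 7 12 world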

apsw.unicode.sentence_next_break(text: str, offset: int = 0) int[source]

Returns end of sentence location.

Finds the next break point according to the TR29 spec. Note that the segment returned includes leading and trailing white space.

Parameters:
  • text – The text to examine

  • offset – The first codepoint to examine

Returns:

Next break point

apsw.unicode.sentence_next(text: str, offset: int = 0) tuple[int, int][source]

Returns span of next sentence

apsw.unicode.sentence_iter(text: str, offset: int = 0) Iterator[str][source]

Iterator providing text of each sentence

apsw.unicode.sentence_iter_with_offsets(text: str, offset: int = 0) Iterator[tuple[int, int, str]][source]

Iterator providing start, end, text of each sentence
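
A sketch - note that each sentence segment keeps its leading and trailing white space as described for sentence_next_break():

from apsw.unicode import sentence_iter

for sentence in sentence_iter("First sentence. Second one? Third!"):
    print(repr(sentence))                # 'First sentence. ' 'Second one? ' 'Third!'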

apsw.unicode.line_break_next_break(text: str, offset: int = 0) int[source]

Returns next opportunity to break a line

Finds the next break point according to the TR14 spec.

Parameters:
  • text – The text to examine

  • offset – The first codepoint to examine

Returns:

Next break point

apsw.unicode.line_break_next(text: str, offset: int = 0) tuple[int, int][source]

Returns span of next line

apsw.unicode.line_break_iter(text: str, offset: int = 0) Iterator[str][source]

Iterator providing text of each line

apsw.unicode.line_break_iter_with_offsets(text: str, offset: int = 0) Iterator[tuple[int, int, str]][source]

Iterator providing start, end, text of each line