Full text search
APSW provides complete access to SQLite’s full text search extension (FTS5), and extensive related functionality. The tour demonstrates some of what is available. Highlights include:
Access to all the FTS5 C APIs to register a tokenizer, retrieve a tokenizer, call a tokenizer, register an auxiliary function, and all of the extension API. This includes the locale option added in SQLite 3.47.
The ftsq shell command to do a FTS5 query
apsw.fts5.Table for a Pythonic interface to a FTS5 table, including getting the structure
A create() method to create the table that handles the SQL quoting, triggers, and all FTS5 options.
upsert() to insert or update content, and delete() to delete, that both understand external content tables
query_suggest() that improves a query by correcting spelling, and suggesting more popular search terms
key_tokens() for statistically significant content in a row
more_like() to provide statistically similar rows
Unicode Word tokenizer that better determines word boundaries across all codepoints, punctuation, and conventions.
Tokenizers that work with regular expressions, HTML, and JSON
Helpers for writing your own tokenizers including argument parsing, and handling conversion between the UTF8 offsets used by FTS5 and str offsets used in Python.
SimplifyTokenizer() can handle case folding and accent removal on behalf of other tokenizers, using the latest Unicode standard.
apsw.fts5 module for working with FTS5
apsw.fts5query module for generating, parsing, and modifying FTS5 queries
apsw.fts5aux module with auxiliary functions and helpers
apsw.unicode module supporting the latest version of Unicode:
Splitting text into user perceived characters (grapheme clusters), words, sentences, and line breaks.
Methods to work on strings in grapheme cluster units rather than Python’s individual codepoints
Case folding
Removing accents, combining marks, and using compatibility codepoints
Codepoint names, regional indicator, extended pictographic
Helpers for outputting text to terminals that understand grapheme cluster boundaries, how wide the text will be, and using line breaking to choose the best location to split lines
Key Concepts
How it works
A series of tokens typically corresponding to words is produced for each text value. For each token, FTS5 indexes:
The rowid the token occurred in
The column the token was in
Which token number it was
A query is turned into tokens, and the FTS5 index consulted to find the rows and columns those exact same query tokens occur in. Phrases can be found by looking for consecutive token numbers.
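A minimal sketch of this flow using APSW (the table and column names are illustrative):
import apsw

connection = apsw.Connection("search.db")
# each text value is broken into tokens; FTS5 records the rowid,
# the column, and the token's position for every token
connection.execute("CREATE VIRTUAL TABLE docs USING fts5(title, body)")
connection.execute("INSERT INTO docs VALUES(?, ?)", ("big world", "a big dog in a big world"))
# the query is tokenized the same way and looked up in the index;
# consecutive token numbers are how the phrase "big world" is matched
for (title,) in connection.execute("SELECT title FROM docs WHERE docs MATCH ?", ('"big world"',)):
    print(title)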
Ranking
Once matches are found, you want the most relevant ones first. A ranking function is used to assign each match a numerical score typically taking into account how rare the tokens are, and how densely they appear in that row. You can usually weight each column so for example matches in a title column count for more.
You can change the ranking function on a per query basis or via config_rank() for all queries.
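A sketch of both approaches, assuming the docs table above and an apsw.fts5.Table wrapper named table:
# per query: give hits in the first column ten times the weight
sql = """SELECT title FROM docs
         WHERE docs MATCH ? AND rank MATCH 'bm25(10.0, 1.0)'
         ORDER BY rank"""
for (title,) in connection.execute(sql, ("big world",)):
    print(title)

# for all queries on the table
table.config_rank("bm25(10.0, 1.0)")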
Tokens
While tokens typically correspond to words, there is no requirement that they do so. Tokens are not shown to the user. Generating appropriate tokens for your text is the key to making searches effective. FTS5 has a tokendata option to store extra information with each token. You should consider:
Extracting meaningful tokens first. An example shows extracting product ids and then treating what remains as regular text.
Mapping equivalent text to the same token by using the techniques described below (stemming, synonyms)
Consider alternate approaches. For example the unidecode algorithm turns Unicode text into ascii text that sounds approximately similar
Processing content to normalize it. For example unifying spelling so colour and color become the same token. You can use dictionaries to ensure content is consistent.
Stemming
Queries only work with exact matches on the tokens. It is often desirable to make related words produce the same token. Stemming does this, such as removing singular vs plural so dog and dogs become the same token, and determining the base of a word so likes, liked, likely, and liking become the same token. Third party libraries provide this for various languages.
Synonyms
Synonyms are words that mean the same thing. FTS5 calls them colocated tokens. In a search you may want first to find that as well as 1st, or dog to also find canine, k9, and puppy. While you can provide additional tokens when content is being tokenized for the index, a better place is when a query is being tokenized. The SynonymTokenizer() provides an implementation.
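A sketch using a plain dict, where the mapping and registered name are illustrative:
synonyms = {"first": "1st", "dog": ("canine", "k9", "puppy")}
# dict.get returns None for unknown tokens, meaning no extra tokens are added
tokenizer = apsw.fts5.SynonymTokenizer(synonyms.get)
connection.register_fts5_tokenizer("synonyms", tokenizer)
# then list it in the table's tokenize option before another tokenizer,
# eg: synonyms unicodewords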
Stop words
Search terms are only useful if they narrow down which rows are potential matches. Something occurring in almost every row increases the size of the index, and the ranking function has to be run on more rows for each search. See the example for determining how many rows tokens occur in, and StopWordsTokenizer() for removing them.
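A sketch with an illustrative hand-picked list:
stop_words = {"the", "and", "of", "a"}
# the callable returns True when a token should be dropped
tokenizer = apsw.fts5.StopWordsTokenizer(stop_words.__contains__)
connection.register_fts5_tokenizer("stopwords", tokenizer)
# use in the tokenize option before another tokenizer, eg: stopwords unicodewords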
Locale
SQLite 3.47 added support for locale - an arbitrary string that can be used to mark text. It is typically used to denote a language and region - for example Portuguese in Portugal differs from Portuguese in Brazil, and American English differs from British English. You can use the locale for other purposes - for example if your text includes code then the locale could be used to mark what programming language it is.
Tokenizer order and parameters
The tokenize option specifies how tokenization happens. The string specifies a list of items. You can use apsw.fts5.Table.create() to provide the list and have all quoting done correctly, and use apsw.fts5.Table.structure to see what an existing table specifies.
Tokenizers often take parameters, which are provided as a separate name and value:
tokenizer_name param1 value1 param2 value2 param3 value3
Some tokenizers work in conjunction with others. For example the
HTMLTokenizer()
passes on the text, excluding HTML
tags, and the StopWordsTokenizer()
removes tokens
coming back from another tokenizer. When they see a parameter name
they do not understand, they treat that as the name of the next
tokenizer, and following items as parameters to that tokenizer.
The overall flow is that the text to be tokenized flows from left to right amongst the named tokenizers. The resulting token stream then flows from right to left.
This means it matters what order the tokenizers are given, and you should ensure the order is what is expected.
Recommendations
Unicode normalization
For backwards compatibility Unicode allows multiple different ways of specifying what will be drawn as the same character. For example Ç can be
One codepoint U+00C7 LATIN CAPITAL LETTER C WITH CEDILLA
Two codepoints U+0043 LATIN CAPITAL LETTER C, and U+0327 COMBINING CEDILLA
There are more complex examples and a description at Unicode TR15, which describes normalization as the solution.
If you have text from multiple sources it is possible that it is in
multiple normalization forms. You should use
unicodedata.normalize()
to ensure your text is all in the same
form for indexing, and also ensure query text is in that same form. If
you do not do this, then searches will be confusing and not match when
it visually looks like they should.
Form NFC is recommended. If you use SimplifyTokenizer() with strip enabled then it won’t matter, as that removes combining marks and uses compatibility codepoints.
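A sketch of normalizing both indexed text and query text to NFC, assuming an apsw.fts5.Table wrapper named table with title and body columns:
import unicodedata

def nfc(text: str) -> str:
    # same form for indexed content and for query text
    return unicodedata.normalize("NFC", text)

table.upsert(title=nfc(title), body=nfc(body))
for match in table.search(nfc(query_text)):
    ...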
Tokenizer order
For general text, use the following:
simplify casefold true strip true unicodewords
Uses compatibility codepoints
Removes marks and diacritics
Neutralizes case distinctions
Finds words using the Unicode algorithm
Makes emoji be individually searchable
Makes regional indicators be individually searchable
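A sketch of creating a table with that option, after registering the APSW tokenizers on an open connection (table and column names are illustrative):
apsw.fts5.register_tokenizers(connection, apsw.fts5.map_tokenizers)
table = apsw.fts5.Table.create(
    connection,
    "docs",
    ["title", "body"],
    # each tokenizer name is followed by its parameters, left to right
    tokenize=["simplify", "casefold", "true", "strip", "true", "unicodewords"],
)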
External content tables
Using external content tables is well handled by apsw.fts5.Table. The create() method has a parameter to generate triggers that keep the FTS5 table up to date with the content table.
The major advantage of using external content tables is that you can have multiple FTS5 tables sharing the same content table. For example for the same content table you could have FTS5 tables for different purposes:
ngram for doing autocomplete
case folded, accent stripped, stop words, and synonyms for broad searching
full fidelity index preserving case, accents, stop words, and no synonyms for doing exact match searches.
If you do not use an external content table then FTS5 by default makes one of its own. The content is used for auxiliary functions such as highlighting or snippets from matches.
Your external content table can have more columns, triggers, and other SQLite functionality that the FTS5 internal content table does not support. It is also possible to have no content table.
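A sketch with illustrative table and column names, assuming an open apsw.Connection named connection:
connection.execute(
    "CREATE TABLE articles(id INTEGER PRIMARY KEY, title TEXT, body TEXT, media TEXT)"
)
search = apsw.fts5.Table.create(
    connection,
    "search",
    None,                      # column names come from the content table
    content="articles",
    content_rowid="id",
    generate_triggers=True,    # keep the index in sync with articles
    unindexed=["media"],       # available for faceting, not searched
)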
Ranking functions
You can fine tune how matches are scored by applying column weights to existing ranking functions, or by writing your own ranking functions. See apsw.fts5aux for some examples.
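One hedged sketch is to delegate to the Python bm25 translation with a heavier weight on the first column (the function name and weights are illustrative):
import apsw.fts5aux

def title_heavy(api, *args):
    # weight the first column double, second column normal
    return apsw.fts5aux.bm25(api, 2.0, 1.0)

connection.register_fts5_function("title_heavy", title_heavy)
# per query:  ... AND rank MATCH 'title_heavy()' ORDER BY rank
# or for all queries:  table.config_rank("title_heavy()")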
Facets
It is often desirable to group results together. For example if searching media then grouping by books, music, and movies. Searching a book could group by chapter. Dated content could be grouped by items from the last month, the last year, and older than that. This is known as faceting.
This is easy to do in SQLite, and an example of when you would use unindexed columns. You can use GROUP BY to group by a facet and LIMIT to limit how many results are available in each. In our media example where an unindexed column named media containing values like book, music, and movie exists you could do:
SELECT title, release_date, media AS facet
FROM search(?)
GROUP BY facet
ORDER BY rank
LIMIT 5;
If you were using date facets, then you can write an auxiliary function that returns the facet (eg 0 for last month, 1 for last year, and 2 for older than that).
SELECT title, date_facet(search, release_date) AS facet
FROM search(?)
GROUP BY facet
ORDER BY rank
LIMIT 5;
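A sketch of what such a function could look like, assuming ISO formatted dates (the name date_facet matches the queries above, everything else is illustrative):
import datetime

def date_facet(api, release_date):
    # bucket into 0 (last month), 1 (last year), 2 (older)
    age = (datetime.date.today() - datetime.date.fromisoformat(release_date)).days
    if age <= 31:
        return 0
    return 1 if age <= 365 else 2

connection.register_fts5_function("date_facet", date_facet)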
Multiple GROUP BY work, so you could facet by media type and date.
SELECT title, media AS media_facet,
date_facet(search, release_date) AS date_facet
FROM search(?)
GROUP BY media_facet, date_facet
ORDER BY rank
LIMIT 5;
You do not need to store the facet information in the FTS5 table - it can be in an external content table or any other table, using JOIN on the rowid of the FTS5 table.
Performance
Search queries are processed in two logical steps:
Find all the rows matching the relevant query tokens
Run the ranking function on each row to sort the best matches first
FTS5 performs very well. If you need to improve performance then closely analyse the find all rows step. The fewer rows the query tokens match, the fewer ranking function calls happen, and the less overall work has to be done.
A typical cause of too many matching rows is having too few different tokens. If tokens are case folded, accent stripped, and stemmed then there may not be that many different tokens.
Initial indexing of your content will take a while as it involves a lot of text processing. Profiling will show bottlenecks.
Outgrowing FTS5
FTS5 depends on exact matches between query tokens and content indexed tokens. This constrains search queries to exact matching after optional steps like stemming and similar processing.
If you have a large amount of text and want to do similarity searching then you will need to use a solution outside of FTS5.
The approach used is to convert words and sentences into a fixed length list of floating point values - a vector. To find matches, the vectors closest to the query vector have to be found, which approximately means comparing the query vector to all of the content vectors to find the smallest overall difference. This is highly parallel, with implementations using hardware/GPU functionality.
Producing the vectors requires access to a multi-gigabyte model, either locally or via a networked service. In general the bigger the model, the better vectors it can provide. For example a model will have been trained so that the vectors for runner and jogger are close to each other, while orange is further away.
This is all well outside the scope of SQLite and FTS5.
The process of producing vectors is known as word embedding and sentence embedding. Gensim is a good package to start with, with its tutorial giving a good overview of what you have to do.
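As an illustrative sketch, assuming the gensim package and one of its downloadable pretrained models:
import gensim.downloader

# a small pretrained word embedding model
model = gensim.downloader.load("glove-wiki-gigaword-50")
print(model.most_similar("runner", topn=3))   # jogger style words score highly
print(model.similarity("runner", "orange"))   # much lower similarity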
Available tokenizers
SQLite includes 4 builtin tokenizers while APSW provides several more.
Name | Purpose
---|---
unicode61 | SQLite builtin using Unicode categories to generate tokens
ascii | SQLite builtin using ASCII to generate tokens
porter | SQLite builtin wrapper applying the porter stemming algorithm to supplied tokens
trigram | SQLite builtin that turns the entire text into trigrams. Note it does not turn tokens into trigrams, but the entire text including all spaces and punctuation.
UnicodeWordsTokenizer | Use Unicode algorithm for determining word segments.
SimplifyTokenizer | Wrapper that transforms the token stream by neutralizing case, and removing diacritics and similar marks
RegexTokenizer | Use regular expressions to generate tokens
RegexPreTokenizer | Use regular expressions to find tokens of interest (eg identifiers), with another tokenizer handling the remaining text
NGramTokenizer | Generates ngrams from the text, where you can specify the sizes and Unicode categories. Useful for doing autocomplete as you type, and substring searches. Unlike trigram this works on units of Unicode grapheme clusters, not individual codepoints.
HTMLTokenizer | Wrapper that converts HTML to plain text for a further tokenizer to generate tokens
JSONTokenizer | Wrapper that converts JSON to plain text for a further tokenizer to generate tokens
SynonymTokenizer | Wrapper that provides additional tokens for existing ones, such as 1st for first
StopWordsTokenizer | Wrapper that removes tokens from the token stream that occur too often to be useful, such as the
TransformTokenizer | Wrapper to transform tokens, such as when stemming.
QueryTokensTokenizer | Wrapper that recognises a special marker so a query can be supplied as tokens directly
StringTokenizer | A decorator for your own tokenizers so that they operate on str, with the conversion between UTF8 byte offsets and str offsets handled for you
string_tokenize | If you have a string and want to call another tokenizer, use this to do the conversion to UTF8 and the offsets back
Third party libraries
There are several libraries available on PyPI that can be pip
installed (pip name in parentheses). You can use them with the
tokenizers APSW provides.
NLTK (nltk)
Natural Language Toolkit has several useful methods to help with search. You can use it to do stemming in many different languages, with different algorithms:
stemmer = apsw.fts5.TransformTokenizer(
    nltk.stem.snowball.EnglishStemmer().stem
)
connection.register_fts5_tokenizer("english_stemmer", stemmer)
You can use wordnet to get synonyms:
from nltk.corpus import wordnet

def synonyms(word):
    return [syn.name() for syn in wordnet.synsets(word)]

wrapper = apsw.fts5.SynonymTokenizer(synonyms)
connection.register_fts5_tokenizer("english_synonyms", wrapper)
Snowball Stemmer (snowballstemmer)
Snowball is a successor to the Porter stemming algorithm (included in FTS5), and supports many more languages. It is also included as part of nltk:
stemmer = apsw.fts5.TransformTokenizer(
    snowballstemmer.stemmer("english").stemWord
)
connection.register_fts5_tokenizer("english_stemmer", stemmer)
Unidecode (unidecode)
The algorithm turns Unicode text into ascii text that sounds approximately similar:
transform = apsw.fts5.TransformTokenizer(
unidecode.unidecode
)
connection.register_fts5_tokenizer("unidecode", transform)
Available auxiliary functions
SQLite includes 3 builtin auxiliary functions, with APSW providing some more.
Name | Purpose
---|---
bm25 | SQLite builtin standard algorithm for ranking matches. It balances how rare the search tokens are with how densely they occur.
highlight | SQLite builtin that returns the whole text value with the search terms highlighted
snippet | SQLite builtin that returns a small portion of the text containing the highlighted search terms
apsw.fts5aux.bm25 | A Python implementation of bm25. This is useful as an example of how to write your own ranking function
apsw.fts5aux.position_rank | Uses bm25 as a base, increasing rank the earlier in the content the search terms occur
apsw.fts5aux.subsequence | Uses bm25 as a base, increasing rank when the search phrases occur in the same order and closer to each other
Command line tools
FTS5 Tokenization viewer
Use python3 -m apsw.fts5 --help to see detailed help information. This tool produces an HTML file showing how a tokenizer performs on text you supply, or builtin test text. This is useful if you are developing your own tokenizer, or want to work out the best tokenizer and parameters for you. (Note the tips in the bottom right of the HTML.)
The builtin test text includes lots of complicated text from across all of Unicode including all forms of spaces, numbers, multiple codepoint sequences, homoglyphs, various popular languages, and hard to tokenize text.
It is useful to compare the default unicode61 tokenizer against the recommended simplify casefold 1 strip 1 unicodewords.
Unicode
Use python3 -m apsw.unicode --help to see detailed help information. Of interest are the codepoint subcommand to see exactly which codepoints make up some text, and textwrap to line wrap text for terminal width.
FTS5 module
apsw.fts5
Various classes and functions to work with full text search.
This includes Table for creating and working with FTS5 tables in a Pythonic way, numerous tokenizers, and related functionality.
- class apsw.fts5.FTS5TableStructure(name: str, columns: tuple[str], unindexed: set[str], tokenize: tuple[str], prefix: set[int], content: str | None, content_rowid: str | None, contentless_delete: bool | None, contentless_unindexed: bool | None, columnsize: bool, tokendata: bool, locale: bool, detail: Literal['full', 'column', 'none'])[source]
Table structure from SQL declaration available as
Table.structure
See the example
- content: str | None
External content/content less or None for regular
- contentless_delete: bool | None
Contentless delete option if contentless table else None
- contentless_unindexed: bool | None
Contentless unindexed option if contentless table else None
- apsw.fts5.HTMLTokenizer(con: apsw.Connection, args: list[str]) apsw.Tokenizer [source]
Extracts text from HTML suitable for passing on to other tokenizers
This should be before the actual tokenizer in the tokenizer list. Behind the scenes it extracts text from the HTML, and manages the offset mapping between the HTML and the text passed on to other tokenizers. It also expands entities and charrefs. Content inside SVG tags is ignored.
If the html doesn’t start with optional whitespace then < or &, it is not considered HTML and will be passed on unprocessed. This would typically be the case for queries. html.parser is used for the HTML processing. See the example.
- apsw.fts5.JSONTokenizer(con: apsw.Connection, args: list[str]) apsw.Tokenizer [source]
Extracts text from JSON suitable for passing to a tokenizer
The following tokenizer arguments are accepted:
- include_keys
0 (default) or 1 if keys are extracted in addition to values
If the JSON doesn’t start with optional whitespace then { or [, it is not considered JSON and will be passed on unprocessed. This would typically be the case for queries. See the example.
- class apsw.fts5.MatchInfo(query_info: QueryInfo, rowid: int, column_size: tuple[int], phrase_columns: tuple[tuple[int], ...])[source]
Information about a matched row, returned by
Table.search()
- apsw.fts5.NGramTokenizer(con: apsw.Connection, args: list[str]) apsw.Tokenizer [source]
Generates ngrams from the text
For example if doing 3 (trigram) then a big dog would result in 'a b', ' bi', 'big', 'ig ', 'g d', ' do', 'dog'
This is useful for queries where less than an entire word has been provided such as doing completions, substring, or suffix matches. For example a query of ing would find all occurrences even at the end of words with ngrams, but not with the UnicodeWordsTokenizer() which requires the query to provide complete words.
This tokenizer works on units of user perceived characters (grapheme clusters) where more than one codepoint can make up one user perceived character.
The following tokenizer arguments are accepted
- ngrams
Numeric ranges to generate. Smaller values allow showing results with less input but a larger index, while larger values will result in quicker searches as the input grows. Default is 3. You can specify multiple values.
- categories
Which Unicode categories to include, by default all. You could include everything except punctuation and separators with * !P* !Z*.
- emoji
0 or 1 (default) if emoji are included, even if categories would exclude them.
- regional_indicator
0 or 1 (default) if regional indicators are included, even if categories would exclude them.
See the example.
- class apsw.fts5.QueryInfo(phrases: tuple[tuple[str | None, ...], ...])[source]
Information relevant to the query as a whole, returned by
Table.search()
- apsw.fts5.QueryTokensTokenizer(con: apsw.Connection, args: list[str]) apsw.Tokenizer [source]
Recognises a special tokens marker and returns those tokens for a query. This is useful for making queries directly using tokens, instead of pre-tokenized text.
It must be the first tokenizer in the list. Any text not using the special marker is passed to the following tokenizer.
See
apsw.fts5query.QueryTokens
for more details on the marker format.
- apsw.fts5.RegexPreTokenizer(con: apsw.Connection, args: list[str], *, pattern: str | re.Pattern, flags: int = 0) apsw.Tokenizer [source]
Combines regular expressions and another tokenizer
RegexTokenizer() only finds tokens matching a regular expression, and ignores all other text. This tokenizer calls another tokenizer to handle the gaps between the patterns it finds. This is useful to extract identifiers and other known patterns, while still doing word search on the rest of the text.
- Parameters:
pattern – The regular expression. For example \w+ is all alphanumeric and underscore characters.
flags – Regular expression flags. Ignored if pattern is an already compiled pattern
You must specify an additional tokenizer name and arguments.
See the example
- apsw.fts5.RegexTokenizer(con: apsw.Connection, args: list[str], *, pattern: str | re.Pattern, flags: int = 0) apsw.Tokenizer [source]
Finds tokens using a regular expression
- Parameters:
pattern – The regular expression. For example \w+ is all alphanumeric and underscore characters.
flags – Regular expression flags. Ignored if pattern is an already compiled pattern
See the example
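For instance, a sketch of registering it with a bound pattern via functools.partial (the registered name and pattern are illustrative):
import functools

tokenizer = functools.partial(apsw.fts5.RegexTokenizer, pattern=r"\w+")
connection.register_fts5_tokenizer("words_regex", tokenizer)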
- apsw.fts5.SimplifyTokenizer(con: apsw.Connection, args: list[str]) apsw.Tokenizer [source]
Tokenizer wrapper that simplifies tokens by neutralizing case, canonicalization, and diacritic/mark removal
Put this before another tokenizer to simplify its output. For example:
simplify casefold true unicodewords
The following tokenizer arguments are accepted, and are applied to each token in this order. If you do not specify an argument then it is off.
- strip
Codepoints become their compatibility representation - for example the Roman numeral Ⅲ becomes III. Diacritics, marks, and similar are removed. See apsw.unicode.strip().
- casefold
Neutralizes case distinction. See apsw.unicode.casefold().
See the example.
- apsw.fts5.StopWordsTokenizer(test: Callable[[str], bool] | None = None) apsw.FTS5TokenizerFactory [source]
Removes tokens that are too frequent to be useful
To use you need a callable that takes a str, and returns a boolean. If True then the token is ignored.
The following tokenizer arguments are accepted, or use as a decorator.
- test
Specify a test
See the example.
- apsw.fts5.StringTokenizer(func: apsw.FTS5TokenizerFactory) apsw.Tokenizer [source]
Decorator for tokenizers that operate on strings
FTS5 tokenizers operate on UTF8 bytes for the text and offsets. This decorator provides your tokenizer with text and expects text offsets back, performing the conversions back to UTF8 byte offsets.
- apsw.fts5.SynonymTokenizer(get: Callable[[str], None | str | tuple[str]] | None = None) apsw.FTS5TokenizerFactory [source]
Adds colocated tokens such as 1st for first.
To use you need a callable that takes a str, and returns a str, a sequence of str, or None. For example dict.get() does that.
The following tokenizer arguments are accepted:
- reasons
Which tokenize_reasons you want the lookups to happen in as a space separated list. Default is QUERY.
- get
Specify a get, or use as a decorator.
See the example.
- class apsw.fts5.Table(db: Connection, name: str, schema: str = 'main')[source]
A helpful wrapper around a FTS5 table
The table must already exist. You can use the class method create() to create a new FTS5 table.
- Parameters:
db – Connection to use
name – Table name
schema – Which attached database to use
- property change_cookie: int
An int that changes if the content of the table has changed.
This is useful to validate cached information.
- closest_tokens(token: str, *, n: int = 10, cutoff: float = 0.6, min_docs: int = 1, all_tokens: Iterable[tuple[str, int]] | None = None) list[tuple[float, str]] [source]
Returns closest known tokens to token with a score for each.
This uses the difflib.get_close_matches() algorithm to find close matches. Note that it is a statistical operation, and has no understanding of the tokens and their meaning.
- Parameters:
token – Token to use
n – Maximum number of tokens to return
cutoff – Passed to difflib.get_close_matches(). Larger values require closer matches and decrease computation time.
min_docs – Only test against other tokens that appear in at least this many rows. Experience is that about a third of tokens appear only in one row. Larger values significantly decrease computation time, but reduce the candidates.
all_tokens – A sequence of tuples of candidate token and number of rows it occurs in. If not provided then tokens is used.
- column_named(name: str) str | None [source]
Returns the column matching name or None if it doesn’t exist
SQLite is ascii case-insensitive, so this tells you the declared name, or None if it doesn’t exist.
- property columns: tuple[str, ...]
All columns of this table, including unindexed ones. Unindexed columns are ignored in queries.
- command_delete(rowid: int, *column_values: str)[source]
Does delete
See delete() for regular row deletion.
If you are using an external content table, it is better to use triggers on that table.
- command_delete_all() None [source]
Does delete all
If you are using an external content table, it is better to use triggers on that table.
- command_integrity_check(external_content: bool = True) None [source]
Does integrity check
If external_content is True, then the FTS index is compared to the external content.
- command_merge(n: int) int [source]
Does merge
See the documentation for what positive and negative values of n mean.
- Returns:
The difference between sqlite3_total_changes() before and after running the command.
- config(name: str, value: apsw.SQLiteValue = None, *, prefix: str = 'x-apsw-') apsw.SQLiteValue [source]
Optionally sets, and gets a config value
If the value is not None, then it is changed. It is not recommended to change SQLite’s own values.
The prefix is to ensure your own config names don’t clash with those used by SQLite. For example you could remember the Unicode version used by your tokenizer, and rebuild if the version is updated.
The advantage of using this is that the names/values will survive the FTS5 table being renamed, backed up, restored etc.
- config_rank(val: str | None = None) str [source]
Optionally sets, and returns rank
When setting rank it must consist of a function name, open parentheses, zero or more SQLite value literals that will be arguments to the function, and a close parenthesis. For example:
my_func(3, x'aabb', 'hello')
- config_secure_delete(val: bool | None = None) bool [source]
Optionally sets, and returns secure-delete
- classmethod create(db: Connection, name: str, columns: Iterable[str] | None, *, schema: str = 'main', unindexed: Iterable[str] | None = None, tokenize: Iterable[str] | None = None, support_query_tokens: bool = False, rank: str | None = None, prefix: Iterable[int] | int | None = None, content: str | None = None, content_rowid: str | None = None, contentless_delete: bool = False, contentless_unindexed: bool = False, columnsize: bool = True, detail: Literal['full', 'column', 'none'] = 'full', tokendata: bool = False, locale: bool = False, generate_triggers: bool = False, drop_if_exists: bool = False) Self [source]
Creates the table, returning a Table on success.
You can use apsw.Connection.table_exists() to check if a table already exists.
- Parameters:
db – connection to create the table on
name – name of table
columns – A sequence of column names. If you are using an external content table (recommended) you can supply None and the column names will be from the table named by the content parameter
schema – Which attached database the table is being created in
unindexed – Columns that will be unindexed
tokenize – The tokenize option. Supply as a sequence of strings which will be correctly quoted together.
support_query_tokens – Configure the tokenize option to allow queries using tokens.
rank – The rank option if not using the default. See config_rank() for required syntax.
prefix – The prefix option. Supply an int, or a sequence of int.
content – Name of the external content table. The external content table must be in the same database as the FTS5 table.
content_rowid – Name of the content rowid column if not using the default when using an external content table
contentless_delete – Set the contentless delete option for contentless tables.
contentless_unindexed – Set the contentless unindexed option for contentless tables
columnsize – Indicate if the column size tracking should be disabled to save space
detail – Indicate if detail should be reduced to save space
tokendata – Indicate if tokens have separate data after a null char
locale – Indicate if a locale is available to tokenizers and stored in the table
generate_triggers – If using an external content table and this is True, then triggers are created to keep this table updated with changes to the external content table. These require a table not a view.
drop_if_exists – The FTS5 table will be dropped if it already exists, and then created.
If you create with an external content table, then command_rebuild() and command_optimize() will be run to populate the contents.
- delete(rowid: int) bool [source]
Deletes the identified row
If you are using an external content table then the delete is directed to that table.
- Returns:
True if a row was deleted
- fts5vocab_name(type: Literal['row', 'col', 'instance']) str [source]
Creates a fts5vocab table in temp and returns fully quoted name
- key_tokens(rowid: int, *, limit: int = 10, columns: str | Sequence[str] | None = None) Sequence[tuple[float, str]] [source]
Finds tokens that are dense in this row, but rare in other rows
This is purely statistical and has no understanding of the tokens. Tokens that occur only in this row are ignored.
- Parameters:
rowid – Which row to examine
limit – Maximum number to return
columns – If provided then only look at specified column(s), else all indexed columns.
- Returns:
A sequence of tuples where each is a tuple of token and float score with bigger meaning more unique, sorted highest score first.
See the example.
See also
text_for_token()
to get original document text corresponding to a token
- more_like(ids: Sequence[int], *, columns: str | Sequence[str] | None = None, token_limit: int = 3) Iterator[MatchInfo] [source]
Like
search()
providing results similar to the provided ids.This is useful for providing infinite scrolling. Do a search remembering the rowids. When you get to the end, call this method with those rowids.
key_tokens()
is used to get key tokens from rows which is purely statistical and has no understanding of the text.- Parameters:
ids – rowids to consider
columns – If provided then only look at specified column(s), else all indexed columns.
token_limit – How many tokens are extracted from each row. Bigger values result in a broader search, while smaller values narrow it.
See the example.
- query_suggest(query: str, threshold: float = 0.01, *, tft_docs: int = 2, locale: str | None = None) str | None [source]
Suggests alternate query
This is useful if a query returns no or few matches. It is purely a statistical operation based on the tokens in the query and index. There is no guarantee that there will be more (or any) matches. The query structure (AND, OR, column filters etc) is maintained.
Transformations include:
Ensuring column names in column filters are of the closest indexed column name
Combining such as some thing to something
Splitting such as noone to no one
Replacing unknown/rare words with more popular ones
The query is parsed, tokenized, replacement tokens established, and original text via
text_for_token()
used to reconstitute the query.- Parameters:
query – A valid query string
threshold – Fraction of rows between 0.0 and 1.0 to be rare - eg 0.01 means a token occurring in less than 1% of rows is considered for replacement. Larger fractions increase the likelihood of replacements, while smaller reduces it. A value of 0 will only replace tokens that are not in the index at all - essentially spelling correction only
tft_docs – Passed to text_for_token() as the doc_limit parameter. Larger values produce more representative text, but also increase processing time.
locale – Locale used to tokenize the query.
- Returns:
None
if no suitable changes were found, or a replacement query string.
- property quoted_table_name: str
Provides the full table name for composing your own queries
It includes the attached database name and quotes special characters like spaces.
You can’t use bindings for table names in queries, so use this when constructing a query string:
my_table = apsw.fts5.Table(con, 'my_table')
sql = f"""SELECT ... FROM { my_table.quoted_table_name } WHERE ...."""
- row_by_id(id: int, column: str | Sequence[str]) apsw.SQLiteValue | tuple[apsw.SQLiteValue] [source]
Returns the contents of the row id
You can request one column, or several columns. If one column is requested then just that value is returned, and a tuple of values for more than one column.
KeyError
is raised if the row does not exist.See the example.
- search(query: str, locale: str | None = None) Iterator[MatchInfo] [source]
Iterates query matches, best matches first
This avoids the need to write SQL. See the example.
- property structure: FTS5TableStructure
Structure of the table from the declared SQL
- property supports_query_tokens: bool
True if you can use
apsw.fts5query.QueryTokens
with this table
- text_for_token(token: str, doc_limit: int) str [source]
Provides the original text used to produce token.
Different text produces the same token because case can be ignored, accents and punctuation removed, synonyms and other processing.
This method finds the text that produced a token, by re-tokenizing the documents containing the token. Highest rowids are examined first so this biases towards the newest content.
- Parameters:
token – The token to find
doc_limit – Maximum number of documents to examine. The higher the limit the longer it takes, but the more representative the text is.
- Returns:
The most popular text used to produce the token in the examined documents
See the example.
- token_doc_frequency(count: int = 10) list[tuple[str, int]] [source]
Most frequent occurring tokens, useful for building a stop words list
This counts the total number of documents containing the token, so appearing 1,000 times in 1 document counts as 1, while once each in 1,000 documents counts as 1,000.
See also
- token_frequency(count: int = 10) list[tuple[str, int]] [source]
Most frequent tokens, useful for building a stop words list
This counts the total occurrences of the token, so appearing 1,000 times in 1 document counts the same as once each in 1,000 documents.
See also
- tokenize(utf8: bytes, reason: int = apsw.FTS5_TOKENIZE_DOCUMENT, locale: str | None = None, include_offsets=True, include_colocated=True)[source]
Tokenize the supplied utf8
- property tokenizer: FTS5Tokenizer
Tokenizer instance as used by this table
- property tokens: dict[str, int]
All the tokens as a dict with token as key, and the value being how many rows they are in
This can take some time on a large corpus - eg 2 seconds on a gigabyte dataset with half a million documents and 650,000 tokens. It is cached until the next content change.
- property tokens_per_column: list[int]
Count of tokens in each column, across all rows. Unindexed columns have a value of zero
- upsert(*args: apsw.SQLiteValue, **kwargs: apsw.SQLiteValue) int [source]
Insert or update with columns by positional and keyword arguments
You can mix and match positional and keyword arguments:
table.upsert("hello")
table.upsert("hello", header="world")
table.upsert(header="world")
If you specify a rowid keyword argument that is used as the rowid for the insert. If the corresponding row already exists then the row is modified with the provided values. rowids are always integers.
The rowid of the inserted/modified row is returned.
If you are using an external content table (https://www.sqlite.org/fts5.html#external_content_tables):
The insert will be directed to the external content table
rowid will map to the content_rowid option if used
The column names and positions of the FTS5 table, not the external content table, are used
The FTS5 table is not updated - you should use triggers on the external content table to do that. See the generate_triggers option on create().
See the example
- class apsw.fts5.TokenizerArgument(default: Any = None, choices: Sequence[Any] | None = None, convertor: Callable[[str], Any] | None = None, convert_default: bool = False)[source]
Used as spec values to
parse_tokenizer_args()
- example
- apsw.fts5.TransformTokenizer(transform: Callable[[str], str | Sequence[str]] | None = None) apsw.FTS5TokenizerFactory [source]
Transforms tokens to a different token, such as stemming
To use you need a callable that takes a str, and returns a list of str, or just a str to use as replacements. You can return an empty list to remove the token.
The following tokenizer arguments are accepted.
- transform
Specify a transform, or use as a decorator.
See the example.
- apsw.fts5.UnicodeWordsTokenizer(con: apsw.Connection, args: list[str]) apsw.Tokenizer [source]
Uses Unicode segmentation to extract words
The following tokenizer parameters are accepted. A segment is considered a word if a codepoint matching any of the categories, emoji, or regional indicator is present.
- categories
Default L* N* to include letters, and numbers. You should consider Pd for punctuation dash if you want words separated with dashes to be considered one word. Sm for maths symbols and Sc for currency symbols may also be relevant.
- emoji
0 or 1 (default) if emoji are included. They will be a word by themselves.
- regional_indicator
0 or 1 (default) if regional indicators like 🇬🇧 🇵🇭 are included. They will be a word by themselves.
This does a lot better than the unicode61 tokenizer builtin to FTS5. It understands user perceived characters (made of many codepoints), and punctuation within words (eg don't is considered two words don and t by unicode61), as well as how various languages work.
For languages where there is no spacing or similar between words, only a dictionary can determine actual word boundaries. Examples include Japanese, Chinese, and Khmer. In this case the algorithm returns the user perceived characters individually, making it similar to NGramTokenizer() which will provide a good search experience at the cost of a slightly larger index.
Use the SimplifyTokenizer() to make case insensitive, remove diacritics, combining marks, and use compatibility code points.
See the example
- apsw.fts5.convert_boolean(value: str) bool [source]
Converts to boolean
Accepts 0, 1, false, and true
- apsw.fts5.convert_number_ranges(numbers: str) set[int] [source]
Converts comma separated number ranges
Takes input like 2,3-5,17 and converts to {2, 3, 4, 5, 17}
- apsw.fts5.convert_string_to_python(expr: str) Any [source]
Converts a string to a Python object
This is useful to process command line arguments and arguments to tokenizers. It automatically imports the necessary modules.
Warning
The string is ultimately evaluated, allowing arbitrary code execution and side effects.
Some examples of what is accepted are:
3 + 4
apsw.fts5.RegexTokenizer
snowballstemmer.stemmer("english").stemWord
nltk.stem.snowball.EnglishStemmer().stem
shutil.rmtree("a/directory/location") COULD DELETE ALL FILES
- apsw.fts5.convert_tokenize_reason(value: str) set[int] [source]
Converts a space separated list of tokenize_reasons into a set of corresponding values.
Use with parse_tokenizer_args()
- apsw.fts5.convert_unicode_categories(patterns: str) set[str] [source]
Returns Unicode categories matching space separated values
fnmatch.fnmatchcase() is used to check matches. An example pattern is L* Pc which would return {'Pc', 'Lm', 'Lo', 'Lu', 'Lt', 'Ll'}
You can also put ! in front to exclude categories, so * !*m would be all categories except those ending in m.
- apsw.fts5.map_functions = {'position_rank': 'apsw.fts5aux.position_rank', 'subsequence': 'apsw.fts5aux.subsequence'}
APSW provided auxiliary functions for use with
register_functions()
- apsw.fts5.map_tokenizers = {'html': <function HTMLTokenizer>, 'json': <function JSONTokenizer>, 'ngram': <function NGramTokenizer>, 'querytokens': <function QueryTokensTokenizer>, 'simplify': <function SimplifyTokenizer>, 'unicodewords': <function UnicodeWordsTokenizer>}
APSW provided tokenizers for use with
register_tokenizers()
- apsw.fts5.parse_tokenizer_args(spec: dict[str, TokenizerArgument | Any], con: Connection, args: list[str]) dict[str, Any] [source]
Parses the arguments to a tokenizer based on spec returning corresponding values
- Parameters:
spec – A dictionary where the key is a string, and the value is either the corresponding default, or TokenizerArgument.
con – Used to lookup other tokenizers
args – A list of strings as received by apsw.FTS5TokenizerFactory
For example to parse
["arg1", "3", "big", "ship", "unicode61", "yes", "two"]
# spec on input
{
    # Converts to integer
    "arg1": TokenizerArgument(convertor=int, default=7),
    # Limit allowed values
    "big": TokenizerArgument(choices=("ship", "plane")),
    # Accepts any string, with a default
    "small": "hello",
    # gathers up remaining arguments, if you intend
    # to process the results of another tokenizer
    "+": None
}

# options on output
{
    "arg1": 3,
    "big": "ship",
    "small": "hello",
    "+": db.Tokenizer("unicode61", ["yes", "two"])
}

# Using "+" in your tokenize functions
def tokenize(utf8, flags, locale):
    tok = options["+"]
    for start, end, *tokens in tok(utf8, flags, locale):
        # do something
        yield start, end, *tokens
See also
Some useful convertors
See the example.
- apsw.fts5.register_functions(db: Connection, map: dict[str, str | Callable])[source]
Registers auxiliary functions named in map with the connection, if not already registered
The map contains the function name, and either the callable or a string which will be automatically imported.
See map_functions
- apsw.fts5.register_tokenizers(db: Connection, map: dict[str, str | Callable])[source]
Registers tokenizers named in map with the connection, if not already registered
The map contains the tokenizer name, and either the callable or a string which will be automatically imported.
See map_tokenizers
- apsw.fts5.string_tokenize(tokenizer: FTS5Tokenizer, text: str, flags: int, locale: str | None)[source]
Tokenizer caller to get string offsets back
Calls the tokenizer doing the conversion of text to UTF8, and converting the received UTF8 offsets back to text offsets.
- apsw.fts5.tokenize_reasons: dict[str, int] = {'AUX': 8, 'DOCUMENT': 4, 'QUERY': 1, 'QUERY_PREFIX': 3}
Mapping between friendly strings and constants for xTokenize flags
- apsw.fts5.tokenizer_test_strings(filename: str | Path | None = None) tuple[tuple[bytes, str], ...] [source]
Provides utf-8 bytes sequences for interesting test strings
- Parameters:
filename – File to load. If None then the builtin one is used
- Returns:
A tuple where each item is a tuple of utf8 bytes and comment str
The test file should be UTF-8 encoded text.
If it starts with a # then it is considered to be multiple text sections where a # line contains a description of the section. Any lines beginning ## are ignored.
- apsw.fts5.unicode_categories = {'Cc': 'Other control', 'Cf': 'Other format', 'Cn': 'Other not assigned', 'Co': 'Other private use', 'Cs': 'Other surrogate', 'Ll': 'Letter Lowercase', 'Lm': 'Letter modifier', 'Lo': 'Letter other', 'Lt': 'Letter titlecase', 'Lu': 'Letter Uppercase', 'Mc': 'Mark spacing combining', 'Me': 'Mark enclosing', 'Mn': 'Mark nonspacing', 'Nd': 'Number decimal digit', 'Nl': 'Number letter', 'No': 'Number other', 'Pc': 'Punctuation connector', 'Pd': 'Punctuation dash', 'Pe': 'Punctuation close', 'Pf': 'Punctuation final quote', 'Pi': 'Punctuation initial quote', 'Po': 'Punctuation other', 'Ps': 'Punctuation open', 'Sc': 'Symbol currency', 'Sk': 'Symbol modifier', 'Sm': 'Symbol math', 'So': 'Symbol other', 'Zl': 'Separator line', 'Zp': 'Separator paragraph', 'Zs': 'Separator space'}
Unicode categories and descriptions for reference
FTS5 Query module
apsw.fts5query
Create, parse, and modify queries
There are 3 representations of a query available:
query string
This is the string syntax accepted by FTS5 where you represent AND, OR, NEAR, column filtering etc inline in the string. An example is:
love AND (title:^"big world" NOT summary:"sunset cruise")
parsed
This is a hierarchical representation using dataclasses with all fields present. Represented as QUERY, it uses PHRASE, NEAR, COLUMNFILTER, AND, NOT, and OR. The string example above is:
AND(queries=[PHRASE(phrase='love', initial=False, prefix=False, plus=None), NOT(match=COLUMNFILTER(columns=['title'], filter='include', query=PHRASE(phrase='big world', initial=True, prefix=False, plus=None)), no_match=COLUMNFILTER(columns=['summary'], filter='include', query=PHRASE(phrase='sunset cruise', initial=False, prefix=False, plus=None)))])
dict
This is a hierarchical representation using Python dictionaries which is easy for logging, storing as JSON, and manipulating. Fields containing default values are omitted. When provided to methods in this module, you do not need to provide intermediate PHRASE - just Python lists and strings directly. This is the easiest form to programmatically compose and modify queries in. The string example above is:
{'@': 'AND', 'queries': [{'@': 'PHRASE', 'phrase': 'love'}, {'@': 'NOT', 'match': {'@': 'COLUMNFILTER', 'columns': ['title'], 'filter': 'include', 'query': {'@': 'PHRASE', 'initial': True, 'phrase': 'big world'}}, 'no_match': {'@': 'COLUMNFILTER', 'columns': ['summary'], 'filter': 'include', 'query': {'@': 'PHRASE', 'phrase': 'sunset cruise'}}}]}
See the example.
From type | To type | Conversion method
---|---|---
query string | parsed | parse_query_string()
parsed | dict | to_dict()
dict | parsed | from_dict()
parsed | query string | to_query_string()
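A short sketch of converting between the forms with those functions:
import apsw.fts5query

q = {"@": "AND", "queries": ["hello", "world"]}
parsed = apsw.fts5query.from_dict(q)
print(apsw.fts5query.to_query_string(parsed))   # eg: hello AND world
print(apsw.fts5query.to_dict(apsw.fts5query.parse_query_string("hello AND world")))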
Other helpful functionality includes:
quote() to appropriately double quote strings
extract_with_column_filters() to get a QUERY for a node within an existing QUERY but applying the intermediate column filters.
applicable_columns() to work out which columns apply to part of a QUERY
walk() to traverse a parsed query
- class apsw.fts5query.AND(queries: Sequence[COLUMNFILTER | NEAR | AND | OR | NOT | PHRASE])[source]
All queries must match
- class apsw.fts5query.COLUMNFILTER(columns: Sequence[str], filter: Literal['include', 'exclude'], query: COLUMNFILTER | NEAR | AND | OR | NOT | PHRASE)[source]
Limit query to certain columns
This always reduces the columns that phrase matching will be done against.
- class apsw.fts5query.NOT(match: COLUMNFILTER | NEAR | AND | OR | NOT | PHRASE, no_match: COLUMNFILTER | NEAR | AND | OR | NOT | PHRASE)[source]
match must match, but no_match must not
- class apsw.fts5query.OR(queries: Sequence[COLUMNFILTER | NEAR | AND | OR | NOT | PHRASE])[source]
Any query must match
- class apsw.fts5query.PHRASE(phrase: str | QueryTokens, initial: bool = False, prefix: bool = False, plus: PHRASE | None = None)[source]
One phrase
- phrase: str | QueryTokens
Text of the phrase. If + was used (eg one+two) then it will be a list of phrases
- exception apsw.fts5query.ParseError(query: str, message: str, position: int)[source]
This exception is raised when an error parsing a query string is encountered
A simple printer:
print(exc.query)
print(" " * exc.position + "^", exc.message)
- apsw.fts5query.QUERY
Type representing all query types.
- apsw.fts5query.QUERY_TOKENS_MARKER = '$!Tokens~'
Special marker at the start of a string to recognise it as a list of tokens for
QueryTokens
- class apsw.fts5query.QueryTokens(tokens: list[str | Sequence[str]])[source]
FTS5 query strings are passed to tokenizers which extract tokens, such as by splitting on whitespace, lower casing text, and removing characters like accents.
If you want to query tokens directly then use this class with the tokens member, using it where PHRASE.phrase goes, and use to_query_string() to compose your query.
Your FTS5 table must use the apsw.fts5.QueryTokensTokenizer as the first tokenizer in the list. If the reason for tokenizing includes FTS5_TOKENIZE_QUERY and the text to be tokenized starts with the special marker, then the tokens are returned. apsw.fts5.Table.supports_query_tokens will tell you if query tokens are handled correctly. The apsw.fts5.Table.create() parameter support_query_tokens will ensure the tokenize table option is correctly set. You can get the tokens from apsw.fts5.Table.tokens.
You can construct QueryTokens like this:
# One token
QueryTokens(["hello"])
# Token sequence
QueryTokens(["hello", "world", "today"])
# Colocated tokens use a nested list
QueryTokens(["hello", ["first", "1st"]])
To use in a query:
{"@": "NOT", "match": QueryTokens(["hello", "world"]), "no_match": QueryTokens([["first", "1st"]])}
That would be equivalent to a query of "Hello World" NOT "First" if tokens were lower cased, and a tokenizer added a colocated 1st on seeing first.
- classmethod decode(data: str | bytes) QueryTokens | None [source]
If the marker is present then returns the corresponding QueryTokens, otherwise None.
- apsw.fts5query.applicable_columns(node: COLUMNFILTER | NEAR | AND | OR | NOT | PHRASE, start: COLUMNFILTER | NEAR | AND | OR | NOT | PHRASE, columns: Sequence[str]) set[str] [source]
Return which columns apply to node.
You can use apsw.fts5.Table.columns_indexed() to get the column list for a table. The column names are matched using SQLite semantics (ASCII case insensitive).
If a query column is not in the provided columns, then KeyError is raised.
- apsw.fts5query.extract_with_column_filters(node: COLUMNFILTER | NEAR | AND | OR | NOT | PHRASE, start: COLUMNFILTER | NEAR | AND | OR | NOT | PHRASE) COLUMNFILTER | NEAR | AND | OR | NOT | PHRASE [source]
Return a new QUERY for a query rooted at start with child node, with intermediate
COLUMNFILTER
in between applied.This is useful if you want to execute a node from a top level query ensuring the column filters apply.
- apsw.fts5query.from_dict(d: dict[str, Any] | Sequence[str] | str | QueryTokens) COLUMNFILTER | NEAR | AND | OR | NOT | PHRASE [source]
Turns dict back into a
QUERY
You can take shortcuts putting str or QueryTokens in places where PHRASE is expected. For example this is accepted:
{"@": "AND", "queries": ["hello", "world"]}
- apsw.fts5query.parse_query_string(query: str) COLUMNFILTER | NEAR | AND | OR | NOT | PHRASE [source]
Returns the corresponding
QUERY
for the query string
- apsw.fts5query.quote(text: str | QueryTokens) str [source]
Quotes text if necessary to keep it as one unit using FTS5 quoting rules
Some examples:
text | return
---|---
hello | hello
one two | "one two"
(empty string) | ""
one"two | "one""two"
- apsw.fts5query.to_dict(q: COLUMNFILTER | NEAR | AND | OR | NOT | PHRASE) dict[str, Any] [source]
Converts structure to a dict
This is useful for pretty printing, logging, saving as JSON, modifying etc.
The dict has a key @ with value corresponding to the dataclass (eg NEAR, PHRASE, AND) and the same field names as the corresponding dataclasses. Only fields with non-default values are emitted.
- apsw.fts5query.to_query_string(q: COLUMNFILTER | NEAR | AND | OR | NOT | PHRASE) str [source]
Returns the corresponding query in text format
- apsw.fts5query.walk(start: COLUMNFILTER | NEAR | AND | OR | NOT | PHRASE) Iterator[tuple[tuple[COLUMNFILTER | NEAR | AND | OR | NOT | PHRASE, ...], COLUMNFILTER | NEAR | AND | OR | NOT | PHRASE]] [source]
Yields the parents and each node for a query recursively
The query tree is traversed top down. Use it like this:
for parents, node in walk(query):
    # parents will be a tuple of parent nodes
    # node will be current node
    if isinstance(node, PHRASE):
        print(node.phrase)
FTS5 Auxiliary functions module
apsw.fts5aux
Implementation of FTS5 auxiliary functions in Python.
Auxiliary functions are used for ranking results, and for processing search results.
- apsw.fts5aux.bm25(api: apsw.FTS5ExtensionApi, *args: apsw.SQLiteValue) float [source]
Perform the BM25 calculation for a matching row
It accepts weights for each column (default 1) which means how much a hit in that column counts for.
The builtin function is described here. This is a translation of the SQLite C version into Python for illustrative purposes.
- apsw.fts5aux.inverse_document_frequency(api: FTS5ExtensionApi) list[float] [source]
Measures how rare each search phrase is in the content
This helper method is intended for use in your own ranking functions. The result is the idf for each phrase in the query.
A phrase occurring in almost every row will have a value close to zero, while less frequent phrases have increasingly large positive numbers.
The values will always be at least 0.000001 so you don’t have to worry about negative numbers or division by zero, even for phrases that are not found.
- apsw.fts5aux.position_rank(api: apsw.FTS5ExtensionApi, *args: apsw.SQLiteValue)[source]
Ranking function boosting the earlier in a column phrases are located
bm25() doesn’t take into account where phrases occur. It makes no difference if a phrase occurs at the beginning, middle, or end. This boost takes into account how early the phrase match is, suitable for content with more significant text towards the beginning.
If the query has phrases and operators (AND, OR, NOT) then those operators are not visible to this function, and only the location of each phrase is taken into consideration. See apsw.fts5.QueryInfo.phrases.
It accepts parameters giving the weights for each column (default 1).
- apsw.fts5aux.subsequence(api: apsw.FTS5ExtensionApi, *args: apsw.SQLiteValue)[source]
Ranking function boosting rows where tokens are in order
bm25() doesn’t take into account ordering. Phrase matches like "big truck" must occur exactly together in that order. Matches for big truck score the same providing both words exist anywhere. This function boosts matches where the order does match, so big red truck gets a boost while truck, big does not for the same query.
If the query has phrases and operators (AND, OR, NOT) then those operators are not visible to this function, and it looks for ordering of each phrase. For example big OR truck NOT red will result in this function boosting big ... truck ... red in that order. See apsw.fts5.QueryInfo.phrases.
It accepts parameters giving the weights for each column (default 1).
FTS5Tokenizer class
- class apsw.FTS5Tokenizer
Wraps a registered tokenizer. Returned by
Connection.fts5_tokenizer()
.
- FTS5Tokenizer.__call__(utf8: bytes, flags: int, locale: str | None, *, include_offsets: bool = True, include_colocated: bool = True) list[tuple[int, int, *tuple[str, ...]]]
Does a tokenization, returning a list of the results. If you have no interest in token offsets or colocated tokens then they can be omitted from the results.
- Parameters:
utf8 – Input bytes
flags – Reason flag
include_offsets – Returned list includes offsets into utf8 for each token
include_colocated – Returned list can include colocated tokens
Example outputs
Tokenizing b"first place" where 1st has been provided as a colocated token for first.
(Default) include_offsets True, include_colocated True
[ (0, 5, "first", "1st"), (6, 11, "place"), ]
include_offsets False, include_colocated True
[ ("first", "1st"), ("place", ), ]
include_offsets True, include_colocated False
[ (0, 5, "first"), (6, 11, "place"), ]
include_offsets False, include_colocated False
[ "first", "place", ]
- FTS5Tokenizer.connection: Connection
The
Connection
this tokenizer is registered with.
FTS5ExtensionApi class
- class apsw.FTS5ExtensionApi
Auxiliary functions run in
the context of a FTS5 search, and can be used for ranking,
highlighting, and similar operations. Auxiliary functions are
registered via Connection.register_fts5_function()
. This wraps
the auxiliary functions API
passed as the first parameter to auxiliary functions.
See the example.
- FTS5ExtensionApi.aux_data: Any
You can store an object as auxiliary data which is available across matching rows. It starts out as None.
An example use is to do up front calculations once, rather than on every matched row, such as fts5aux.inverse_document_frequency().
- FTS5ExtensionApi.column_count: int
Returns the number of columns in the table
- FTS5ExtensionApi.column_locale(column: int) str | None
Retrieves the locale for a column on this row.
- FTS5ExtensionApi.column_size(col: int = -1) int
Returns the total number of tokens in the current row for a specific column, or if
col
is negative then for all columns.
- FTS5ExtensionApi.column_text(col: int) bytes
Returns the utf8 bytes for the column of the current row.
- FTS5ExtensionApi.column_total_size(col: int = -1) int
Returns the total number of tokens in the table for a specific column, or if
col
is negative then for all columns.
- FTS5ExtensionApi.inst_count: int
Returns the number of hits in the current row
- FTS5ExtensionApi.inst_tokens(inst: int) tuple[str, ...] | None
Access tokens of hit inst in current row. None is returned if the call is not supported.
- FTS5ExtensionApi.phrase_column_offsets(phrase: int, column: int) list[int]
Returns the token offsets at which the phrase number occurs in the specified column.
- FTS5ExtensionApi.phrase_count: int
Returns the number of phrases in the query
- FTS5ExtensionApi.phrase_locations(phrase: int) list[list[int]]
Returns which columns and token offsets the phrase number occurs in.
The returned list is the same length as the number of columns. Each member is a list of token offsets in that column, and will be empty if the phrase is not in that column.
- FTS5ExtensionApi.phrases: tuple[tuple[str | None, ...], ...]
A tuple where each member is a phrase from the query. Each phrase is a tuple of str (or None when not available) per token of the phrase.
This combines the results of xPhraseCount, xPhraseSize and xQueryToken
- FTS5ExtensionApi.query_phrase(phrase: int, callback: FTS5QueryPhrase, closure: Any) None
Searches the table for the numbered phrase. The callback takes two parameters - a different apsw.FTS5ExtensionApi and the closure.
An example usage for this method is to see how often the phrases occur in the table. Set up a tracking counter here, and then in the callback you can update it on each visited row. This is shown in the example.
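A minimal sketch of that pattern; the function name and the returned string format are illustrative, not part of the library.

import apsw

def phrase_row_counts(api: apsw.FTS5ExtensionApi, *args):
    # Count how many rows of the table contain each phrase of the query.
    # The counts depend only on the query, so do it once and cache in aux_data.
    if api.aux_data is None:
        counts = [0] * api.phrase_count

        def saw_row(match_api: apsw.FTS5ExtensionApi, phrase_number: int) -> None:
            # called for each row the phrase occurs in
            counts[phrase_number] += 1

        for phrase_number in range(api.phrase_count):
            api.query_phrase(phrase_number, saw_row, phrase_number)
        api.aux_data = counts

    # return the counts as readable text, eg "3, 17"
    return ", ".join(str(count) for count in api.aux_data)

Once registered with Connection.register_fts5_function() it could be called from SQL, for example SELECT phrase_row_counts(docs) FROM docs WHERE docs MATCH ? against a placeholder table named docs.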
- FTS5ExtensionApi.row_count: int
Returns the number of rows in the table
- FTS5ExtensionApi.rowid: int
Rowid of the current row
- FTS5ExtensionApi.tokenize(utf8: bytes, locale: str | None, *, include_offsets: bool = True, include_colocated: bool = True) list
Tokenizes the utf8. FTS5 sets the reason to FTS5_TOKENIZE_AUX. See apsw.FTS5Tokenizer.__call__() for details.
Unicode Text Handling
apsw.unicode
- Up to date Unicode aware methods and lookups
This module helps with Full text search and general Unicode, addressing the following:
The standard library unicodedata has limited information available (eg no information about emoji), and is only updated to new Unicode versions on a new Python version.
Multiple consecutive codepoints can combine into a single user perceived character (grapheme cluster), such as combining accents, vowels and marks in some writing systems, variant selectors, joiners and linkers, etc. That means you can't use indexes into str safely without potentially breaking them.
The standard library provides no help in splitting text into grapheme clusters, words, and sentences, or into breaking text into multiple lines.
Text processing is performance sensitive - FTS5 easily handles hundreds of megabytes to gigabytes of text, and so should this module. It also affects the latency of each query as that is tokenized, and results can highlight words and sentences.
This module is independent of the main apsw module, and loading it does not load any database functionality. The majority of the functionality is implemented in C for size and performance reasons.
See unicode_version
for the implemented version.
Grapheme cluster, word, and sentence splitting
Unicode Technical Report #29 rules for finding grapheme clusters, words, and sentences are implemented. TR29 specifies break points which can be found via grapheme_next_break(), word_next_break(), and sentence_next_break().
Building on those are iterators providing optional offsets and the text. This is used for tokenization (getting character and word boundaries correct), and for result highlighting (showing words/sentences before and after a match).
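A short sketch using the iterators documented later in this section; the sample text is arbitrary.

import apsw.unicode

text = "Hello world. Flags 🇧🇷🇨🇦 and emoji 🚀 too."

# words - by default only segments containing letters or numbers are yielded
print(list(apsw.unicode.word_iter(text)))

# include emoji and flag (regional indicator) sequences as words as well
print(list(apsw.unicode.word_iter(text, emoji=True, regional_indicator=True)))

# sentences, with start and end offsets into the str
for start, end, sentence in apsw.unicode.sentence_iter_with_offsets(text):
    print(start, end, repr(sentence))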
Line break splitting
Unicode Technical Report #14 rules for finding where text can be broken and resumed on the next line are implemented. TR14 specifies break points which can be found via line_break_next_break().
Building on those are iterators providing optional offsets and the text. This is used for text_wrap().
Unicode lookups
Category information
category()
Is an emoji or similar
is_extended_pictographic()
Flag characters
is_regional_indicator()
Codepoint names
codepoint_name()
Case folding, accent removal
casefold()
is used to do case insensitive comparisons.
strip()
is used to remove accents, marks, punctuation, joiners etc
Helpers
These are aware of grapheme cluster boundaries which Python’s builtin string operations are not. The text width functions take into account how wide the text is when displayed on most terminals.
grapheme_length()
to get the number of grapheme clusters in a string
grapheme_substr()
to get substrings using grapheme cluster indexing
grapheme_find()
to find a substring
split_lines()
to split text into lines using all the Unicode hard line break codepoints
text_width()
to count how wide the text is
expand_tabs()
to expand tabs using text width
text_width_substr()
to extract substrings based on text width
text_wrap()
to wrap paragraphs using Unicode words, line breaking, and text width
guess_paragraphs()
to establish paragraph boundaries for text that has line breaks inside paragraphs, as in many plain text and similar markup formats.
Size
Performance
There are some pure Python alternatives, with less functionality. They take 5 to 15 times more CPU time to process the same text. Use
python3 -m apsw.unicode benchmark --help
.
- apsw.unicode.unicode_version = '16.0'
The Unicode version that the rules and data tables implement
- apsw.unicode.category(codepoint: int | str) str [source]
Returns the general category - eg Lu for Letter Uppercase.
See apsw.fts5.unicode_categories for a mapping of categories to descriptions.
- apsw.unicode.is_extended_pictographic(text: str) bool [source]
Returns True if any of the text has the extended pictographic property (Emoji and similar)
- apsw.unicode.is_regional_indicator(text: str) bool [source]
Returns True if any of the text is one of the 26 regional indicators used in pairs to represent country flags
- apsw.unicode.casefold(text: str) str [source]
Returns the text for equality comparison without case distinction
Case folding maps text to a canonical form where case differences are removed allowing case insensitive comparison. Unlike upper, lower, and title case, the result is not intended to be displayed to people.
- apsw.unicode.strip(text: str) str [source]
Returns the text for less exact comparison with accents, punctuation, marks etc removed
It will strip diacritics leaving the underlying characters, so áççéñțś becomes accents; punctuation, so e.g. becomes eg and don't becomes dont; and marks, so देवनागरी becomes दवनगर; as well as all spacing, formatting, variation selectors and similar codepoints.
Codepoints are also converted to their compatibility representation. For example the single codepoint Roman numeral Ⅲ becomes III (three separate regular upper case I), and 🄷🄴🄻🄻🄾 becomes HELLO.
The resulting text should not be shown to people, and is intended for doing relaxed equality comparisons, at the expense of false positives when the accents, marks, punctuation etc were intended.
You should do case folding after this.
Emoji are preserved but variation selectors, fitzpatrick and joiners are stripped.
Regional indicators are preserved.
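A short sketch combining strip() and casefold() for relaxed comparisons; the sample strings are arbitrary and the inline expected values follow the descriptions above.

import apsw.unicode

# accents and marks are removed, leaving the underlying characters
print(apsw.unicode.strip("áççéñțś"))                          # accents

# punctuation is dropped and compatibility codepoints are expanded
print(apsw.unicode.strip("don't"), apsw.unicode.strip("Ⅲ"))   # dont III

# strip() does not change case - case fold afterwards for relaxed comparisons
print(apsw.unicode.casefold(apsw.unicode.strip("Naïve"))
      == apsw.unicode.casefold(apsw.unicode.strip("NAIVE")))   # True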
- apsw.unicode.split_lines(text: str, offset: int = 0) Iterator[str] [source]
Each line, using hard line break rules
This is an iterator yielding a line at a time. The lines yielded will not include the hard line break characters.
- apsw.unicode.expand_tabs(text: str, tabsize: int = 8, invalid: str = '.') str [source]
Turns tabs into spaces aligning on tabsize boundaries, similar to
str.expandtabs()
This is aware of grapheme clusters and text width. Codepoints that have an invalid width are also replaced by
invalid
. Control characters are an example of an invalid character. Line breaks are replaced with newline.
- apsw.unicode.grapheme_length(text: str, offset: int = 0) int [source]
Returns number of grapheme clusters in the text. Unicode aware version of len
- apsw.unicode.grapheme_substr(text: str, start: int | None = None, stop: int | None = None) str [source]
Like text[start:stop] but in grapheme cluster units. start and stop can be negative to index from the end, or outside the bounds of the text, but are never an invalid combination (you get an empty string returned).
To get one grapheme cluster, make stop one more than start. For example to get the 3rd last grapheme cluster: grapheme_substr(text, -3, -3 + 1)
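A short sketch contrasting codepoint indexing with grapheme cluster indexing; the family emoji is just one example of a multi-codepoint cluster.

import apsw.unicode

# one user perceived character made of seven codepoints (four emoji joined by ZWJ)
family = "👩\u200d👩\u200d👧\u200d👧"

print(len(family))                                       # 7 codepoints
print(apsw.unicode.grapheme_length(family))              # 1 grapheme cluster

# codepoint slicing can split the cluster; grapheme_substr never will
print(repr(family[:1]))                                  # a lone 👩
print(repr(apsw.unicode.grapheme_substr(family, 0, 1)))  # the whole family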
- apsw.unicode.grapheme_endswith(text: str, substring: str) bool [source]
Returns True if text ends with substring being aware of grapheme cluster boundaries
- apsw.unicode.grapheme_startswith(text: str, substring: str) bool [source]
Returns True if text starts with substring being aware of grapheme cluster boundaries
- apsw.unicode.grapheme_find(text: str, substring: str, start: int = 0, end: int | None = None) int [source]
Returns the offset in text where substring can be found, being aware of grapheme clusters. The start and end of the substring have to be at a grapheme cluster boundary.
- Parameters:
start – Where in text to start the search (default beginning)
end – Where to stop the search exclusive (default remaining text)
- Returns:
offset into text, or -1 if not found or substring is zero length
- apsw.unicode.text_width(text: str, offset: int = 0) int [source]
Returns how many columns the text would be if displayed in a terminal
You should split_lines() first and then operate on each line separately.
If the text contains new lines, control characters, and similar unrepresentable codepoints then minus 1 is returned.
Terminals aren't entirely consistent with each other, and Unicode has many kinds of codepoints and combinations. Consequently this is right the vast majority of the time, but not always.
Note that web browsers do variable widths even in monospaced sections like <pre> so they won't always agree with the terminal either.
- apsw.unicode.text_width_substr(text: str, width: int, offset: int = 0) tuple[int, str] [source]
Extracts substring width or less wide being aware of grapheme cluster boundaries. For example you could use this to get a substring that is 80 (or less) wide.
- Returns:
A tuple of how wide the substring is, and the substring
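A short sketch of the width functions; the sample text is arbitrary and the comments show typical terminal widths.

import apsw.unicode

print(apsw.unicode.text_width("hello"))   # 5 columns
print(apsw.unicode.text_width("你好"))     # 4 columns - wide characters take two each

# take at most 6 terminal columns worth of text from the start
width, prefix = apsw.unicode.text_width_substr("你好 world", 6)
print(width, repr(prefix))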
- apsw.unicode.guess_paragraphs(text: str, tabsize: int = 8) str [source]
- Given text that contains paragraphs containing newlines, guesses where the paragraphs end.
The returned str will have newlines removed where they were determined not to mark a paragraph end.
If your text has paragraphs with newlines inside them, each line would otherwise get wrapped separately by text_wrap(). This function tries to guess where the paragraphs actually end:
Blank lines definitely end a paragraph.
Indented lines that continue preserving the indent are considered the same paragraph; a change of indent (in or out) starts a new paragraph.
Punctuation or numbers at the start of a line followed by indented text are considered the same paragraph.
Optional numbers followed by punctuation then a space are considered to start new paragraphs.
- apsw.unicode.text_wrap(text: str, width: int = 70, *, tabsize: int = 8, hyphen: str = '-', combine_space: bool = True, invalid: str = '?') Iterator[str] [source]
Similar to textwrap.wrap() but Unicode grapheme cluster and line break aware.
Note: Newlines in the text are treated as end of paragraph. If your text has paragraphs with newlines in them, then call guess_paragraphs() first.
- Parameters:
text – string to process
width – width of yielded lines, if rendered using a monospace font such as to a terminal
tabsize – Tab stop spacing as tabs are expanded
hyphen – Used to show a segment was broken because it was wider than
width
combine_space – Leading space on each line (the indent) is always preserved. Other spaces where multiple occur are combined into one space.
invalid – If invalid codepoints are encountered such as control characters and surrogates then they are replaced with this.
This yields one line of str at a time, which will be exactly width when output to a terminal. It will be right padded with spaces if necessary and will not have a trailing newline.
apsw.ext.format_query_table() uses this method to ensure each column is the desired width.
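A short sketch of wrapping hard wrapped text; the sample text and width are arbitrary.

import apsw.unicode

text = """These lines belong to
the same paragraph but were hard
wrapped in the source file.

This is a second paragraph."""

# join the hard wrapped lines back into paragraphs first
paragraphs = apsw.unicode.guess_paragraphs(text)

# then wrap to the desired width - each yielded line is padded to exactly that width
for line in apsw.unicode.text_wrap(paragraphs, width=30):
    print(f"|{line}|")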
- apsw.unicode.codepoint_name(codepoint: int | str) str | None [source]
Name or None if it doesn't have one.
For example codepoint 65 is named LATIN CAPITAL LETTER A while codepoint U+D1234 is not assigned and would return None.
- apsw.unicode.version_added(codepoint: int | str) str | None [source]
Returns the Unicode version in which the codepoint was added
- apsw.unicode.version_dates = {'1.0': (1991, 10, 1), '1.1': (1993, 6, 1), '10.0': (2017, 6, 20), '11.0': (2018, 6, 5), '12.0': (2019, 3, 5), '12.1': (2019, 5, 7), '13.0': (2020, 3, 10), '14.0': (2021, 9, 14), '15.0': (2022, 9, 13), '15.1': (2023, 9, 12), '16.0': (2024, 9, 10), '2.0': (1996, 7, 1), '2.1': (1998, 5, 1), '3.0': (1999, 9, 1), '3.1': (2001, 3, 1), '3.2': (2002, 3, 1), '4.0': (2003, 4, 1), '4.1': (2005, 3, 31), '5.0': (2006, 7, 14), '5.1': (2008, 4, 4), '5.2': (2009, 10, 1), '6.0': (2010, 10, 11), '6.1': (2012, 1, 31), '6.2': (2012, 9, 26), '6.3': (2013, 9, 30), '7.0': (2014, 6, 16), '8.0': (2015, 6, 17), '9.0': (2016, 6, 21)}
Release date (year, month, day) for each Unicode version, intended for use with
version_added()
- apsw.unicode.grapheme_next_break(text: str, offset: int = 0) int [source]
Returns end of Grapheme cluster / User Perceived Character
For example regional indicators are in pairs, and a base codepoint can be combined with zero or more additional codepoints providing diacritics, marks, and variations. Break points are defined in the TR29 spec.
- Parameters:
text – The text to examine
offset – The first codepoint to examine
- Returns:
Index of first codepoint not part of the grapheme cluster starting at offset. You should extract
text[offset:span]
- apsw.unicode.grapheme_next(text: str, offset: int = 0) tuple[int, int] [source]
Returns span of next grapheme cluster
- apsw.unicode.grapheme_iter(text: str, offset: int = 0) Iterator[str] [source]
Iterator providing text of each grapheme cluster
- apsw.unicode.grapheme_iter_with_offsets(text: str, offset: int = 0) Iterator[tuple[int, int, str]] [source]
Iterator providing start, end, text of each grapheme cluster
- apsw.unicode.grapheme_iter_with_offsets_filtered(text: str, offset: int = 0, *, categories: Iterable[str], emoji: bool = False, regional_indicator: bool = False) Iterator[tuple[int, int, str]] [source]
Iterator providing start, end, text of each grapheme cluster, provided it includes codepoints from the categories, emoji, or regional indicators
- apsw.unicode.word_next_break(text: str, offset: int = 0) int [source]
Returns end of next word or non-word
Finds the next break point according to the TR29 spec. Note that the segment returned may be a word, or a non-word (spaces, punctuation etc). Use
word_next()
to get words.- Parameters:
text – The text to examine
offset – The first codepoint to examine
- Returns:
Next break point
- apsw.unicode.word_default_categories = {'Ll', 'Lm', 'Lo', 'Lt', 'Lu', 'Nd', 'Nl', 'No'}
Default categories for selecting word segments - letters and numbers
- apsw.unicode.word_next(text: str, offset: int = 0, *, categories: Iterable[str] = word_default_categories, emoji: bool = False, regional_indicator: bool = False) tuple[int, int] [source]
Returns span of next word
A segment is considered a word if it contains at least one codepoint corresponding to any of the categories, plus:
emoji (Extended_Pictographic in Unicode specs)
regional indicator - two character sequence for flags like 🇧🇷🇨🇦
- apsw.unicode.word_iter(text: str, offset: int = 0, *, categories: Iterable[str] = word_default_categories, emoji: bool = False, regional_indicator: bool = False) Iterator[str] [source]
Iterator providing text of each word
- apsw.unicode.word_iter_with_offsets(text: str, offset: int = 0, *, categories: Iterable[str] = word_default_categories, emoji: bool = False, regional_indicator: bool = False) Iterator[tuple[int, int, str]] [source]
Iterator providing start, end, text of each word
- apsw.unicode.sentence_next_break(text: str, offset: int = 0) int [source]
Returns end of sentence location.
Finds the next break point according to the TR29 spec. Note that the segment returned includes leading and trailing white space.
- Parameters:
text – The text to examine
offset – The first codepoint to examine
- Returns:
Next break point
- apsw.unicode.sentence_next(text: str, offset: int = 0) tuple[int, int] [source]
Returns span of next sentence
- apsw.unicode.sentence_iter(text: str, offset: int = 0) Iterator[str] [source]
Iterator providing text of each sentence
- apsw.unicode.sentence_iter_with_offsets(text: str, offset: int = 0) Iterator[tuple[int, int, str]] [source]
Iterator providing start, end, text of each sentence
- apsw.unicode.line_break_next_break(text: str, offset: int = 0) int [source]
Returns next opportunity to break a line
Finds the next break point according to the TR14 spec.
- Parameters:
text – The text to examine
offset – The first codepoint to examine
- Returns:
Next break point
- apsw.unicode.line_break_next(text: str, offset: int = 0) tuple[int, int] [source]
Returns span of next line