Full text search
APSW provides complete access to SQLite’s full text search extension (FTS5), and extensive related functionality. The tour demonstrates some of what is available. Highlights include:
Access to all the FTS5 C APIs to register a tokenizer, retrieve a tokenizer, call a tokenizer, register an auxiliary function, and all of the extension API. This includes the locale option added in SQLite 3.47.
The ftsq shell command to do a FTS5 query
apsw.fts5.Table for a Pythonic interface to a FTS5 table, including getting the structure
A create() method to create the table that handles the SQL quoting, triggers, and all FTS5 options.
upsert() to insert or update content, and delete() to delete, that both understand external content tables
query_suggest() that improves a query by correcting spelling, and suggesting more popular search terms
key_tokens() for statistically significant content in a row
more_like() to provide statistically similar rows
Unicode Word tokenizer that better determines word boundaries across all codepoints, punctuation, and conventions.
Tokenizers that work with regular expressions, HTML, and JSON
Helpers for writing your own tokenizers including argument parsing, and handling conversion between the UTF8 offsets used by FTS5 and str offsets used in Python.
SimplifyTokenizer() can handle case folding and accent removal on behalf of other tokenizers, using the latest Unicode standard.
apsw.fts5 module for working with FTS5
apsw.fts5query module for generating, parsing, and modifying FTS5 queries
apsw.fts5aux module with auxiliary functions and helpers
apsw.unicode module supporting the latest version of Unicode:
Splitting text into user perceived characters (grapheme clusters), words, sentences, and line breaks.
Methods to work on strings in grapheme cluster units rather than Python’s individual codepoints
Case folding
Removing accents, combining marks, and using compatibility codepoints
Codepoint names, regional indicator, extended pictographic
Helpers for outputting text to terminals that understand grapheme cluster boundaries, how wide the text will be, and using line breaking to choose the best location to split lines
Key Concepts
How it works
A series of tokens typically corresponding to words is produced for each text value. For each token, FTS5 indexes:
The rowid the token occurred in
The column the token was in
Which token number it was
A query is turned into tokens, and the FTS5 index consulted to find the rows and columns those exact same query tokens occur in. Phrases can be found by looking for consecutive token numbers.
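A minimal sketch of this flow using APSW (the table and column names are illustrative):
import apsw

connection = apsw.Connection("search.db")
# each text value is broken into tokens; FTS5 records the rowid,
# the column, and the token's position for every token
connection.execute("CREATE VIRTUAL TABLE docs USING fts5(title, body)")
connection.execute("INSERT INTO docs VALUES(?, ?)", ("big world", "a big dog in a big world"))
# the query is tokenized the same way and looked up in the index;
# consecutive token numbers are how the phrase "big world" is matched
for (title,) in connection.execute("SELECT title FROM docs WHERE docs MATCH ?", ('"big world"',)):
    print(title)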
Ranking
Once matches are found, you want the most relevant ones first. A ranking function is used to assign each match a numerical score typically taking into account how rare the tokens are, and how densely they appear in that row. You can usually weight each column so for example matches in a title column count for more.
You can change the ranking function on a per query basis or via config_rank() for all queries.
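A sketch of both approaches, assuming the docs table above and an apsw.fts5.Table wrapper named table:
# per query: give hits in the first column ten times the weight
sql = """SELECT title FROM docs
         WHERE docs MATCH ? AND rank MATCH 'bm25(10.0, 1.0)'
         ORDER BY rank"""
for (title,) in connection.execute(sql, ("big world",)):
    print(title)

# for all queries on the table
table.config_rank("bm25(10.0, 1.0)")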
Tokens
While tokens typically correspond to words, there is no requirement that they do so. Tokens are not shown to the user. Generating appropriate tokens for your text is the key to making searches effective. FTS5 has a tokendata option to store extra information with each token. You should consider:
Extracting meaningful tokens first. An example shows extracting product ids and then treating what remains as regular text.
Mapping equivalent text to the same token by using the techniques described below (stemming, synonyms)
Consider alternate approaches. For example the unidecode algorithm turns Unicode text into ascii text that sounds approximately similar
Processing content to normalize it. For example unifying spelling so colour and color become the same token. You can use dictionaries to ensure content is consistent.
Stemming
Queries only work with exact matches on the tokens. It is often desirable to make related words produce the same token. Stemming does this, such as removing singular vs plural so dog and dogs become the same token, and determining the base of a word so likes, liked, likely, and liking become the same token. Third party libraries provide this for various languages.
Synonyms
Synonyms are words that mean the same thing. FTS5 calls them colocated tokens. In a search you may want first to find that as well as 1st, or dog to also find canine, k9, and puppy. While you can provide additional tokens when content is being tokenized for the index, a better place is when a query is being tokenized. The SynonymTokenizer() provides an implementation.
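A sketch using a plain dict, where the mapping and registered name are illustrative:
synonyms = {"first": "1st", "dog": ("canine", "k9", "puppy")}
# dict.get returns None for unknown tokens, meaning no extra tokens are added
tokenizer = apsw.fts5.SynonymTokenizer(synonyms.get)
connection.register_fts5_tokenizer("synonyms", tokenizer)
# then list it in the table's tokenize option before another tokenizer,
# eg: synonyms unicodewords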
Stop words
Search terms are only useful if they narrow down which rows are potential matches. Something occurring in almost every row increases the size of the index, and the ranking function has to be run on more rows for each search. See the example for determining how many rows tokens occur in, and StopWordsTokenizer() for removing them.
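A sketch with an illustrative hand-picked list:
stop_words = {"the", "and", "of", "a"}
# the callable returns True when a token should be dropped
tokenizer = apsw.fts5.StopWordsTokenizer(stop_words.__contains__)
connection.register_fts5_tokenizer("stopwords", tokenizer)
# use in the tokenize option before another tokenizer, eg: stopwords unicodewords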
Locale
SQLite 3.47 added support for locale - an arbitrary string that can be used to mark text. It is typically used to denote a language and region - for example Portuguese in Portugal differs from Portuguese in Brazil, and American English differs from British English. You can use the locale for other purposes - for example if your text includes code then the locale could be used to mark what programming language it is.
Tokenizer order and parameters
The tokenize option specifies how tokenization happens. The string specifies a list of items. You can use apsw.fts5.Table.create() to provide the list and have all quoting done correctly, and use apsw.fts5.Table.structure to see what an existing table specifies.
Tokenizers often take parameters, which are provided as a separate name and value:
tokenizer_name param1 value1 param2 value2 param3 value3
Some tokenizers work in conjunction with others. For example the
HTMLTokenizer()
passes on the text, excluding HTML
tags, and the StopWordsTokenizer()
removes tokens
coming back from another tokenizer. When they see a parameter name
they do not understand, they treat that as the name of the next
tokenizer, and following items as parameters to that tokenizer.
The overall flow is that the text to be tokenized flows from left to right amongst the named tokenizers. The resulting token stream then flows from right to left.
This means it matters what order the tokenizers are given, and you should ensure the order is what is expected.
Recommendations
Unicode normalization
For backwards compatibility Unicode allows multiple different ways of specifying what will be drawn as the same character. For example Ç can be
One codepoint U+00C7 LATIN CAPITAL LETTER C WITH CEDILLA
Two codepoints U+0043 LATIN CAPITAL LETTER C, and U+0327 COMBINING CEDILLA
There are more complex examples and a description at Unicode TR15, which describes normalization as the solution.
If you have text from multiple sources it is possible that it is in
multiple normalization forms. You should use
unicodedata.normalize()
to ensure your text is all in the same
form for indexing, and also ensure query text is in that same form. If
you do not do this, then searches will be confusing and not match when
it visually looks like they should.
Form NFC is recommended. If you use SimplifyTokenizer() with strip enabled then it won’t matter, as that removes combining marks and uses compatibility codepoints.
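A sketch of normalizing both indexed text and query text to NFC, assuming an apsw.fts5.Table wrapper named table with title and body columns:
import unicodedata

def nfc(text: str) -> str:
    # same form for indexed content and for query text
    return unicodedata.normalize("NFC", text)

table.upsert(title=nfc(title), body=nfc(body))
for match in table.search(nfc(query_text)):
    ...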
Tokenizer order
For general text, use the following:
simplify casefold true strip true unicodewords
Uses compatibility codepoints
Removes marks and diacritics
Neutralizes case distinctions
Finds words using the Unicode algorithm
Makes emoji be individually searchable
Makes regional indicators be individually searchable
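A sketch of creating a table with that option, after registering the APSW tokenizers on an open connection (table and column names are illustrative):
apsw.fts5.register_tokenizers(connection, apsw.fts5.map_tokenizers)
table = apsw.fts5.Table.create(
    connection,
    "docs",
    ["title", "body"],
    # each tokenizer name is followed by its parameters, left to right
    tokenize=["simplify", "casefold", "true", "strip", "true", "unicodewords"],
)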
External content tables
Using external content tables is well handled by apsw.fts5.Table. The create() method has a parameter to generate triggers that keep the FTS5 table up to date with the content table.
The major advantage of using external content tables is that you can have multiple FTS5 tables sharing the same content table. For example for the same content table you could have FTS5 tables for different purposes:
ngram for doing autocomplete
case folded, accent stripped, stop words, and synonyms for broad searching
full fidelity index preserving case, accents, stop words, and no synonyms for doing exact match searches.
If you do not use an external content table then FTS5 by default makes one of its own. The content is used for auxiliary functions such as highlighting or snippets from matches.
Your external content table can have more columns, triggers, and other SQLite functionality that the FTS5 internal content table does not support. It is also possible to have no content table.
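A sketch with illustrative table and column names, assuming an open apsw.Connection named connection:
connection.execute(
    "CREATE TABLE articles(id INTEGER PRIMARY KEY, title TEXT, body TEXT, media TEXT)"
)
search = apsw.fts5.Table.create(
    connection,
    "search",
    None,                      # column names come from the content table
    content="articles",
    content_rowid="id",
    generate_triggers=True,    # keep the index in sync with articles
    unindexed=["media"],       # available for faceting, not searched
)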
Ranking functions
You can fine tune how matches are scored by applying column weights to existing ranking functions, or by writing your own ranking functions. See apsw.fts5aux for some examples.
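One hedged sketch is to delegate to the Python bm25 translation with a heavier weight on the first column (the function name and weights are illustrative):
import apsw.fts5aux

def title_heavy(api, *args):
    # weight the first column double, second column normal
    return apsw.fts5aux.bm25(api, 2.0, 1.0)

connection.register_fts5_function("title_heavy", title_heavy)
# per query:  ... AND rank MATCH 'title_heavy()' ORDER BY rank
# or for all queries:  table.config_rank("title_heavy()")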
Facets
It is often desirable to group results together. For example if searching media then grouping by books, music, and movies. Searching a book could group by chapter. Dated content could be grouped by items from the last month, the last year, and older than that. This is known as faceting.
This is easy to do in SQLite, and an example of when you would use unindexed columns. You can use GROUP BY to group by a facet and LIMIT to limit how many results are available in each. In our media example where an unindexed column named media containing values like book, music, and movie exists you could do:
SELECT title, release_date, media AS facet
FROM search(?)
GROUP BY facet
ORDER BY rank
LIMIT 5;
If you were using date facets, then you can write an auxiliary function that returns the facet (eg 0 for last month, 1 for last year, and 2 for older than that).
SELECT title, date_facet(search, release_date) AS facet
FROM search(?)
GROUP BY facet
ORDER BY rank
LIMIT 5;
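A sketch of what such a function could look like, assuming ISO formatted dates (the name date_facet matches the queries above, everything else is illustrative):
import datetime

def date_facet(api, release_date):
    # bucket into 0 (last month), 1 (last year), 2 (older)
    age = (datetime.date.today() - datetime.date.fromisoformat(release_date)).days
    if age <= 31:
        return 0
    return 1 if age <= 365 else 2

connection.register_fts5_function("date_facet", date_facet)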
Multiple GROUP BY work, so you could facet by media type and date.
SELECT title, media AS media_facet,
date_facet(search, release_date) AS date_facet
FROM search(?)
GROUP BY media_facet, date_facet
ORDER BY rank
LIMIT 5;
You do not need to store the facet information in the FTS5 table - it can be in an external content table or any other table, using JOIN on the rowid of the FTS5 table.
Performance
Search queries are processed in two logical steps:
Find all the rows matching the relevant query tokens
Run the ranking function on each row to sort the best matches first
FTS5 performs very well. If you need to improve performance then closely analyse the find all rows step. The fewer rows the query tokens match, the fewer ranking function calls happen, and the less overall work has to be done.
A typical cause of too many matching rows is having too few different tokens. If tokens are case folded, accent stripped, and stemmed then there may not be that many different tokens.
Initial indexing of your content will take a while as it involves a lot of text processing. Profiling will show bottlenecks.
Outgrowing FTS5
FTS5 depends on exact matches between query tokens and content indexed tokens. This constrains search queries to exact matching after optional steps like stemming and similar processing.
If you have a large amount of text and want to do similarity searching then you will need to use a solution outside of FTS5.
The approach used is to convert words and sentences into a fixed length list of floating point values - a vector. To find matches, the vectors closest to the query vector have to be found, which approximately means comparing the query vector to all of the content vectors to find the smallest overall difference. This is highly parallel, with implementations using hardware/GPU functionality.
Producing the vectors requires access to a multi-gigabyte model, either locally or via a networked service. In general the bigger the model, the better vectors it can provide. For example a model will have been trained so that the vectors for runner and jogger are close to each other, while orange is further away.
This is all well outside the scope of SQLite and FTS5.
The process of producing vectors is known as word embedding and sentence embedding. Gensim is a good package to start with, with its tutorial giving a good overview of what you have to do.
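As an illustrative sketch, assuming the gensim package and one of its downloadable pretrained models:
import gensim.downloader

# a small pretrained word embedding model
model = gensim.downloader.load("glove-wiki-gigaword-50")
print(model.most_similar("runner", topn=3))   # jogger style words score highly
print(model.similarity("runner", "orange"))   # much lower similarity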
Available tokenizers
SQLite includes 4 builtin tokenizers while APSW provides several more.
Name | Purpose
---|---
unicode61 | SQLite builtin using Unicode categories to generate tokens
ascii | SQLite builtin using ASCII to generate tokens
porter | SQLite builtin wrapper applying the porter stemming algorithm to supplied tokens
trigram | SQLite builtin that turns the entire text into trigrams. Note it does not turn tokens into trigrams, but the entire text including all spaces and punctuation.
UnicodeWordsTokenizer | Use Unicode algorithm for determining word segments.
SimplifyTokenizer | Wrapper that transforms the token stream by neutralizing case, and removing diacritics and similar marks
RegexTokenizer | Use regular expressions to generate tokens
RegexPreTokenizer | Use regular expressions to find tokens of interest (eg identifiers), with another tokenizer handling the remaining text
NGramTokenizer | Generates ngrams from the text, where you can specify the sizes and Unicode categories. Useful for doing autocomplete as you type, and substring searches. Unlike trigram this works on units of Unicode grapheme clusters, not individual codepoints.
HTMLTokenizer | Wrapper that converts HTML to plain text for a further tokenizer to generate tokens
JSONTokenizer | Wrapper that converts JSON to plain text for a further tokenizer to generate tokens
SynonymTokenizer | Wrapper that provides additional tokens for existing ones, such as 1st for first
StopWordsTokenizer | Wrapper that removes tokens from the token stream that occur too often to be useful, such as the
TransformTokenizer | Wrapper to transform tokens, such as when stemming.
QueryTokensTokenizer | Wrapper that recognises a special marker so a query can be supplied as tokens directly
StringTokenizer | A decorator for your own tokenizers so that they operate on str, with the conversion between UTF8 byte offsets and str offsets handled for you
string_tokenize | If you have a string and want to call another tokenizer, use this to do the conversion to UTF8 and the offsets back
Third party libraries
There are several libraries available on PyPI that can be pip
installed (pip name in parentheses). You can use them with the
tokenizers APSW provides.
NLTK (nltk)
Natural Language Toolkit has several useful methods to help with search. You can use it to do stemming in many different languages, with different algorithms:
stemmer = apsw.fts5.TransformTokenizer(
    nltk.stem.snowball.EnglishStemmer().stem
)
connection.register_fts5_tokenizer("english_stemmer", stemmer)
You can use wordnet to get synonyms:
from nltk.corpus import wordnet

def synonyms(word):
    return [syn.name() for syn in wordnet.synsets(word)]

wrapper = apsw.fts5.SynonymTokenizer(synonyms)
connection.register_fts5_tokenizer("english_synonyms", wrapper)
Snowball Stemmer (snowballstemmer)
Snowball is a successor to the Porter stemming algorithm (included in FTS5), and supports many more languages. It is also included as part of nltk:
stemmer = apsw.fts5.TransformTokenizer(
    snowballstemmer.stemmer("english").stemWord
)
connection.register_fts5_tokenizer("english_stemmer", stemmer)
Unidecode (unidecode)
The algorithm turns Unicode text into ascii text that sounds approximately similar:
transform = apsw.fts5.TransformTokenizer(
unidecode.unidecode
)
connection.register_fts5_tokenizer("unidecode", transform)
Available auxiliary functions
SQLite includes 3 builtin auxiliary functions, with APSW providing some more.
Name | Purpose
---|---
bm25 | SQLite builtin standard algorithm for ranking matches. It balances how rare the search tokens are with how densely they occur.
highlight | SQLite builtin that returns the whole text value with the search terms highlighted
snippet | SQLite builtin that returns a small portion of the text containing the highlighted search terms
apsw.fts5aux.bm25 | A Python implementation of bm25. This is useful as an example of how to write your own ranking function
apsw.fts5aux.position_rank | Uses bm25 as a base, increasing rank the earlier in the content the search terms occur
apsw.fts5aux.subsequence | Uses bm25 as a base, increasing rank when the search phrases occur in the same order and closer to each other
Command line tools
FTS5 Tokenization viewer
Use python3 -m apsw.fts5 --help to see detailed help information. This tool produces an HTML file showing how a tokenizer performs on text you supply, or builtin test text. This is useful if you are developing your own tokenizer, or want to work out the best tokenizer and parameters for you. (Note the tips in the bottom right of the HTML.)
The builtin test text includes lots of complicated text from across all of Unicode including all forms of spaces, numbers, multiple codepoint sequences, homoglyphs, various popular languages, and hard to tokenize text.
It is useful to compare the default unicode61 tokenizer against the recommended simplify casefold 1 strip 1 unicodewords.
Unicode
Use python3 -m apsw.unicode --help to see detailed help information. Of interest are the codepoint subcommand to see exactly which codepoints make up some text, and textwrap to line wrap text for terminal width.
FTS5 module
apsw.fts5
Various classes and functions to work with full text search.
This includes Table for creating and working with FTS5 tables in a Pythonic way, numerous tokenizers, and related functionality.
- class apsw.fts5.FTS5TableStructure(name: str, columns: tuple[str], unindexed: set[str], tokenize: tuple[str], prefix: set[int], content: str | None, content_rowid: str | None, contentless_delete: bool | None, contentless_unindexed: bool | None, columnsize: bool, tokendata: bool, locale: bool, detail: Literal['full', 'column', 'none'])[source]
Table structure from SQL declaration available as
Table.structure
See the example
- content: str | None
External content/content less or None for regular
- contentless_delete: bool | None
Contentless delete option if contentless table else None
- contentless_unindexed: bool | None
Contentless unindexed option if contentless table else None
- apsw.fts5.HTMLTokenizer(con: apsw.Connection, args: list[str]) apsw.Tokenizer [source]
Extracts text from HTML suitable for passing on to other tokenizers
This should be before the actual tokenizer in the tokenizer list. Behind the scenes it extracts text from the HTML, and manages the offset mapping between the HTML and the text passed on to other tokenizers. It also expands entities and charrefs. Content inside SVG tags is ignored.
If the html doesn’t start with optional whitespace then < or &, it is not considered HTML and will be passed on unprocessed. This would typically be the case for queries. html.parser is used for the HTML processing. See the example.
- apsw.fts5.JSONTokenizer(con: apsw.Connection, args: list[str]) apsw.Tokenizer [source]
Extracts text from JSON suitable for passing to a tokenizer
The following tokenizer arguments are accepted:
- include_keys
0 (default) or 1 if keys are extracted in addition to values
If the JSON doesn’t start with optional whitespace then { or [, it is not considered JSON and will be passed on unprocessed. This would typically be the case for queries. See the example.
- class apsw.fts5.MatchInfo(query_info: QueryInfo, rowid: int, column_size: tuple[int], phrase_columns: tuple[tuple[int], ...])[source]
Information about a matched row, returned by
Table.search()
- apsw.fts5.NGramTokenizer(con: apsw.Connection, args: list[str]) apsw.Tokenizer [source]
Generates ngrams from the text
For example if doing 3 (trigram) then a big dog would result in 'a b', ' bi', 'big', 'ig ', 'g d', ' do', 'dog'
This is useful for queries where less than an entire word has been provided such as doing completions, substring, or suffix matches. For example a query of ing would find all occurrences even at the end of words with ngrams, but not with the UnicodeWordsTokenizer() which requires the query to provide complete words.
This tokenizer works on units of user perceived characters (grapheme clusters) where more than one codepoint can make up one user perceived character.
The following tokenizer arguments are accepted
- ngrams
Numeric ranges to generate. Smaller values allow showing results with less input but a larger index, while larger values will result in quicker searches as the input grows. Default is 3. You can specify multiple values.
- categories
Which Unicode categories to include, by default all. You could include everything except punctuation and separators with * !P* !Z*.
- emoji
0 or 1 (default) if emoji are included, even if categories would exclude them.
- regional_indicator
0 or 1 (default) if regional indicators are included, even if categories would exclude them.
See the example.
- class apsw.fts5.QueryInfo(phrases: tuple[tuple[str | None, ...], ...])[source]
Information relevant to the query as a whole, returned by
Table.search()
- apsw.fts5.QueryTokensTokenizer(con: apsw.Connection, args: list[str]) apsw.Tokenizer [source]
Recognises a special tokens marker and returns those tokens for a query. This is useful for making queries directly using tokens, instead of pre-tokenized text.
It must be the first tokenizer in the list. Any text not using the special marker is passed to the following tokenizer.
See
apsw.fts5query.QueryTokens
for more details on the marker format.
- apsw.fts5.RegexPreTokenizer(con: apsw.Connection, args: list[str], *, pattern: str | re.Pattern, flags: int = 0) apsw.Tokenizer [source]
Combines regular expressions and another tokenizer
RegexTokenizer() only finds tokens matching a regular expression, and ignores all other text. This tokenizer calls another tokenizer to handle the gaps between the patterns it finds. This is useful to extract identifiers and other known patterns, while still doing word search on the rest of the text.
- Parameters:
pattern – The regular expression. For example \w+ is all alphanumeric and underscore characters.
flags – Regular expression flags. Ignored if pattern is an already compiled pattern
You must specify an additional tokenizer name and arguments.
See the example
- apsw.fts5.RegexTokenizer(con: apsw.Connection, args: list[str], *, pattern: str | re.Pattern, flags: int = 0) apsw.Tokenizer [source]
Finds tokens using a regular expression
- Parameters:
pattern – The regular expression. For example \w+ is all alphanumeric and underscore characters.
flags – Regular expression flags. Ignored if pattern is an already compiled pattern
See the example
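For instance, a sketch of registering it with a bound pattern via functools.partial (the registered name and pattern are illustrative):
import functools

tokenizer = functools.partial(apsw.fts5.RegexTokenizer, pattern=r"\w+")
connection.register_fts5_tokenizer("words_regex", tokenizer)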
- apsw.fts5.SimplifyTokenizer(con: apsw.Connection, args: list[str]) apsw.Tokenizer [source]
Tokenizer wrapper that simplifies tokens by neutralizing case, canonicalization, and diacritic/mark removal
Put this before another tokenizer to simplify its output. For example:
simplify casefold true unicodewords
The following tokenizer arguments are accepted, and are applied to each token in this order. If you do not specify an argument then it is off.
- strip
Codepoints become their compatibility representation - for example the Roman numeral Ⅲ becomes III. Diacritics, marks, and similar are removed. See apsw.unicode.strip().
- casefold
Neutralizes case distinction. See apsw.unicode.casefold().
See the example.
- apsw.fts5.StopWordsTokenizer(test: Callable[[str], bool] | None = None) apsw.FTS5TokenizerFactory [source]
Removes tokens that are too frequent to be useful
To use you need a callable that takes a str, and returns a boolean. If True then the token is ignored.
The following tokenizer arguments are accepted, or use as a decorator.
- test
Specify a test
See the example.
- apsw.fts5.StringTokenizer(func: apsw.FTS5TokenizerFactory) apsw.Tokenizer [source]
Decorator for tokenizers that operate on strings
FTS5 tokenizers operate on UTF8 bytes for the text and offsets. This decorator provides your tokenizer with text and expects text offsets back, performing the conversions back to UTF8 byte offsets.
- apsw.fts5.SynonymTokenizer(get: Callable[[str], None | str | tuple[str]] | None = None) apsw.FTS5TokenizerFactory [source]
Adds colocated tokens such as 1st for first.
To use you need a callable that takes a str, and returns a str, a sequence of str, or None. For example dict.get() does that.
The following tokenizer arguments are accepted:
- reasons
Which tokenize_reasons you want the lookups to happen in as a space separated list. Default is QUERY.
- get
Specify a get, or use as a decorator.
See the example.
- class apsw.fts5.Table(db: Connection, name: str, schema: str = 'main')[source]
A helpful wrapper around a FTS5 table
The table must already exist. You can use the class method create() to create a new FTS5 table.
- Parameters:
db – Connection to use
name – Table name
schema – Which attached database to use
- property change_cookie: int
An int that changes if the content of the table has changed.
This is useful to validate cached information.
- closest_tokens(token: str, *, n: int = 10, cutoff: float = 0.6, min_docs: int = 1, all_tokens: Iterable[tuple[str, int]] | None = None) list[tuple[float, str]] [source]
Returns closest known tokens to token with a score for each.
This uses the difflib.get_close_matches() algorithm to find close matches. Note that it is a statistical operation, and has no understanding of the tokens and their meaning.
- Parameters:
token – Token to use
n – Maximum number of tokens to return
cutoff – Passed to difflib.get_close_matches(). Larger values require closer matches and decrease computation time.
min_docs – Only test against other tokens that appear in at least this many rows. Experience is that about a third of tokens appear only in one row. Larger values significantly decrease computation time, but reduce the candidates.
all_tokens – A sequence of tuples of candidate token and number of rows it occurs in. If not provided then tokens is used.
- column_named(name: str) str | None [source]
Returns the column matching name or None if it doesn’t exist
SQLite is ascii case-insensitive, so this tells you the declared name, or None if it doesn’t exist.
- property columns: tuple[str, ...]
All columns of this table, including unindexed ones. Unindexed columns are ignored in queries.
- command_delete(rowid: int, *column_values: str)[source]
Does delete
See delete() for regular row deletion.
If you are using an external content table, it is better to use triggers on that table.
- command_delete_all() None [source]
Does delete all
If you are using an external content table, it is better to use triggers on that table.
- command_integrity_check(external_content: bool = True) None [source]
Does integrity check
If external_content is True, then the FTS index is compared to the external content.
- command_merge(n: int) int [source]
Does merge
See the documentation for what positive and negative values of n mean.
- Returns:
The difference between sqlite3_total_changes() before and after running the command.
- config(name: str, value: apsw.SQLiteValue = None, *, prefix: str = 'x-apsw-') apsw.SQLiteValue [source]
Optionally sets, and gets a config value
If the value is not None, then it is changed. It is not recommended to change SQLite’s own values.
The prefix is to ensure your own config names don’t clash with those used by SQLite. For example you could remember the Unicode version used by your tokenizer, and rebuild if the version is updated.
The advantage of using this is that the names/values will survive the FTS5 table being renamed, backed up, restored etc.
- config_rank(val: str | None = None) str [source]
Optionally sets, and returns rank
When setting rank it must consist of a function name, open parentheses, zero or more SQLite value literals that will be arguments to the function, and a close parenthesis. For example:
my_func(3, x'aabb', 'hello')
- config_secure_delete(val: bool | None = None) bool [source]
Optionally sets, and returns secure-delete
- classmethod create(db: Connection, name: str, columns: Iterable[str] | None, *, schema: str = 'main', unindexed: Iterable[str] | None = None, tokenize: Iterable[str] | None = None, support_query_tokens: bool = False, rank: str | None = None, prefix: Iterable[int] | int | None = None, content: str | None = None, content_rowid: str | None = None, contentless_delete: bool = False, contentless_unindexed: bool = False, columnsize: bool = True, detail: Literal['full', 'column', 'none'] = 'full', tokendata: bool = False, locale: bool = False, generate_triggers: bool = False, drop_if_exists: bool = False) Self [source]
Creates the table, returning a Table on success.
You can use apsw.Connection.table_exists() to check if a table already exists.
- Parameters:
db – connection to create the table on
name – name of table
columns – A sequence of column names. If you are using an external content table (recommended) you can supply None and the column names will be from the table named by the content parameter
schema – Which attached database the table is being created in
unindexed – Columns that will be unindexed
tokenize – The tokenize option. Supply as a sequence of strings which will be correctly quoted together.
support_query_tokens – Configure the tokenize option to allow queries using tokens.
rank – The rank option if not using the default. See config_rank() for required syntax.
prefix – The prefix option. Supply an int, or a sequence of int.
content – Name of the external content table. The external content table must be in the same database as the FTS5 table.
content_rowid – Name of the content rowid column if not using the default when using an external content table
contentless_delete – Set the contentless delete option for contentless tables.
contentless_unindexed – Set the contentless unindexed option for contentless tables
columnsize – Indicate if the column size tracking should be disabled to save space
detail – Indicate if detail should be reduced to save space
tokendata – Indicate if tokens have separate data after a null char
locale – Indicate if a locale is available to tokenizers and stored in the table
generate_triggers – If using an external content table and this is True, then triggers are created to keep this table updated with changes to the external content table. These require a table not a view.
drop_if_exists – The FTS5 table will be dropped if it already exists, and then created.
If you create with an external content table, then command_rebuild() and command_optimize() will be run to populate the contents.
- delete(rowid: int) bool [source]
Deletes the identified row
If you are using an external content table then the delete is directed to that table.
- Returns:
True if a row was deleted
- fts5vocab_name(type: Literal['row', 'col', 'instance']) str [source]
Creates a fts5vocab table in temp and returns fully quoted name
- key_tokens(rowid: int, *, limit: int = 10, columns: str | Sequence[str] | None = None) Sequence[tuple[float, str]] [source]
Finds tokens that are dense in this row, but rare in other rows
This is purely statistical and has no understanding of the tokens. Tokens that occur only in this row are ignored.
- Parameters:
rowid – Which row to examine
limit – Maximum number to return
columns – If provided then only look at specified column(s), else all indexed columns.
- Returns:
A sequence of tuples where each is a tuple of token and float score with bigger meaning more unique, sorted highest score first.
See the example.
See also
text_for_token()
to get original document text corresponding to a token
- more_like(ids: Sequence[int], *, columns: str | Sequence[str] | None = None, token_limit: int = 3) Iterator[MatchInfo] [source]
Like
search()
providing results similar to the provided ids.This is useful for providing infinite scrolling. Do a search remembering the rowids. When you get to the end, call this method with those rowids.
key_tokens()
is used to get key tokens from rows which is purely statistical and has no understanding of the text.- Parameters:
ids – rowids to consider
columns – If provided then only look at specified column(s), else all indexed columns.
token_limit – How many tokens are extracted from each row. Bigger values result in a broader search, while smaller values narrow it.
See the example.
- query_suggest(query: str, threshold: float = 0.01, *, tft_docs: int = 2, locale: str | None = None) str | None [source]
Suggests alternate query
This is useful if a query returns no or few matches. It is purely a statistical operation based on the tokens in the query and index. There is no guarantee that there will be more (or any) matches. The query structure (AND, OR, column filters etc) is maintained.
Transformations include:
Ensuring column names in column filters are of the closest indexed column name
Combining such as some thing to something
Splitting such as noone to no one
Replacing unknown/rare words with more popular ones
The query is parsed, tokenized, replacement tokens established, and original text via
text_for_token()
used to reconstitute the query.- Parameters:
query – A valid query string
threshold – Fraction of rows between 0.0 and 1.0 to be rare - eg 0.01 means a token occurring in less than 1% of rows is considered for replacement. Larger fractions increase the likelihood of replacements, while smaller reduces it. A value of 0 will only replace tokens that are not in the index at all - essentially spelling correction only
tft_docs – Passed to text_for_token() as the doc_limit parameter. Larger values produce more representative text, but also increase processing time.
locale – Locale used to tokenize the query.
- Returns:
None
if no suitable changes were found, or a replacement query string.
- property quoted_table_name: str
Provides the full table name for composing your own queries
It includes the attached database name and quotes special characters like spaces.
You can’t use bindings for table names in queries, so use this when constructing a query string:
my_table = apsw.fts5.Table(con, 'my_table')
sql = f"""SELECT ... FROM { my_table.quoted_table_name } WHERE ...."""
- row_by_id(id: int, column: str | Sequence[str]) apsw.SQLiteValue | tuple[apsw.SQLiteValue] [source]
Returns the contents of the row id
You can request one column, or several columns. If one column is requested then just that value is returned, and a tuple of values for more than one column.
KeyError
is raised if the row does not exist.See the example.
- search(query: str, locale: str | None = None) Iterator[MatchInfo] [source]
Iterates query matches, best matches first
This avoids the need to write SQL. See the example.
- property structure: FTS5TableStructure
Structure of the table from the declared SQL
- property supports_query_tokens: bool
True if you can use
apsw.fts5query.QueryTokens
with this table
- text_for_token(token: str, doc_limit: int) str [source]
Provides the original text used to produce token.
Different text produces the same token because case can be ignored, accents and punctuation removed, synonyms and other processing.
This method finds the text that produced a token, by re-tokenizing the documents containing the token. Highest rowids are examined first so this biases towards the newest content.
- Parameters:
token – The token to find
doc_limit – Maximum number of documents to examine. The higher the limit the longer it takes, but the more representative the text is.
- Returns:
The most popular text used to produce the token in the examined documents
See the example.
- token_doc_frequency(count: int = 10) list[tuple[str, int]] [source]
Most frequent occurring tokens, useful for building a stop words list
This counts the total number of documents containing the token, so appearing 1,000 times in 1 document counts as 1, while once each in 1,000 documents counts as 1,000.
See also
- token_frequency(count: int = 10) list[tuple[str, int]] [source]
Most frequent tokens, useful for building a stop words list
This counts the total occurrences of the token, so appearing 1,000 times in 1 document counts the same as once each in 1,000 documents.
See also
- tokenize(utf8: bytes, reason: int = apsw.FTS5_TOKENIZE_DOCUMENT, locale: str | None = None, include_offsets=True, include_colocated=True)[source]
Tokenize the supplied utf8
- property tokenizer: FTS5Tokenizer
Tokenizer instance as used by this table
- property tokens: dict[str, int]
All the tokens as a dict with token as key, and the value being how many rows they are in
This can take some time on a large corpus - eg 2 seconds on a gigabyte dataset with half a million documents and 650,000 tokens. It is cached until the next content change.
- property tokens_per_column: list[int]
Count of tokens in each column, across all rows. Unindexed columns have a value of zero
- upsert(*args: apsw.SQLiteValue, **kwargs: apsw.SQLiteValue) int [source]
Insert or update with columns by positional and keyword arguments
You can mix and match positional and keyword arguments:
table.upsert("hello")
table.upsert("hello", header="world")
table.upsert(header="world")
If you specify a rowid keyword argument that is used as the rowid for the insert. If the corresponding row already exists then the row is modified with the provided values. rowids are always integers.
The rowid of the inserted/modified row is returned.
If you are using an external content table (https://www.sqlite.org/fts5.html#external_content_tables):
The insert will be directed to the external content table
rowid will map to the content_rowid option if used
The column names and positions of the FTS5 table, not the external content table, are used
The FTS5 table is not updated - you should use triggers on the external content table to do that. See the generate_triggers option on create().
See the example
- class apsw.fts5.TokenizerArgument(default: Any = None, choices: Sequence[Any] | None = None, convertor: Callable[[str], Any] | None = None, convert_default: bool = False)[source]
Used as spec values to
parse_tokenizer_args()
- example
- apsw.fts5.TransformTokenizer(transform: Callable[[str], str | Sequence[str]] | None = None) apsw.FTS5TokenizerFactory [source]
Transforms tokens to a different token, such as stemming
To use you need a callable that takes a str, and returns a list of str, or just a str to use as replacements. You can return an empty list to remove the token.
The following tokenizer arguments are accepted.
- transform
Specify a transform, or use as a decorator.
See the example.
- apsw.fts5.UnicodeWordsTokenizer(con: apsw.Connection, args: list[str]) apsw.Tokenizer [source]
Uses Unicode segmentation to extract words
The following tokenizer parameters are accepted. A segment is considered a word if a codepoint matching any of the categories, emoji, or regional indicator is present.
- categories
Default L* N* to include letters, and numbers. You should consider Pd for punctuation dash if you want words separated with dashes to be considered one word. Sm for maths symbols and Sc for currency symbols may also be relevant.
- emoji
0 or 1 (default) if emoji are included. They will be a word by themselves.
- regional_indicator
0 or 1 (default) if regional indicators like 🇬🇧 🇵🇭 are included. They will be a word by themselves.
This does a lot better than the unicode61 tokenizer builtin to FTS5. It understands user perceived characters (made of many codepoints), and punctuation within words (eg don't is considered two words don and t by unicode61), as well as how various languages work.
For languages where there is no spacing or similar between words, only a dictionary can determine actual word boundaries. Examples include Japanese, Chinese, and Khmer. In this case the algorithm returns the user perceived characters individually, making it similar to NGramTokenizer() which will provide a good search experience at the cost of a slightly larger index.
Use the SimplifyTokenizer() to make case insensitive, remove diacritics, combining marks, and use compatibility code points.
See the example
- apsw.fts5.convert_boolean(value: str) bool [source]
Converts to boolean
Accepts 0, 1, false, and true
- apsw.fts5.convert_number_ranges(numbers: str) set[int] [source]
Converts comma separated number ranges
Takes input like 2,3-5,17 and converts to {2, 3, 4, 5, 17}
- apsw.fts5.convert_string_to_python(expr: str) Any [source]
Converts a string to a Python object
This is useful to process command line arguments and arguments to tokenizers. It automatically imports the necessary modules.
Warning
The string is ultimately evaluated, allowing arbitrary code execution and side effects.
Some examples of what is accepted are:
3 + 4
apsw.fts5.RegexTokenizer
snowballstemmer.stemmer("english").stemWord
nltk.stem.snowball.EnglishStemmer().stem
shutil.rmtree("a/directory/location") COULD DELETE ALL FILES
- apsw.fts5.convert_tokenize_reason(value: str) set[int] [source]
Converts a space separated list of tokenize_reasons into a set of corresponding values.
Use with parse_tokenizer_args()
- apsw.fts5.convert_unicode_categories(patterns: str) set[str] [source]
Returns Unicode categories matching space separated values
fnmatch.fnmatchcase() is used to check matches. An example pattern is L* Pc which would return {'Pc', 'Lm', 'Lo', 'Lu', 'Lt', 'Ll'}
You can also put ! in front to exclude categories, so * !*m would be all categories except those ending in m.
- apsw.fts5.map_functions = {'position_rank': 'apsw.fts5aux.position_rank', 'subsequence': 'apsw.fts5aux.subsequence'}
APSW provided auxiliary functions for use with
register_functions()
- apsw.fts5.map_tokenizers = {'html': <function HTMLTokenizer>, 'json': <function JSONTokenizer>, 'ngram': <function NGramTokenizer>, 'querytokens': <function QueryTokensTokenizer>, 'simplify': <function SimplifyTokenizer>, 'unicodewords': <function UnicodeWordsTokenizer>}
APSW provided tokenizers for use with
register_tokenizers()
- apsw.fts5.parse_tokenizer_args(spec: dict[str, TokenizerArgument | Any], con: Connection, args: list[str]) dict[str, Any] [source]
Parses the arguments to a tokenizer based on spec returning corresponding values
- Parameters:
spec – A dictionary where the key is a string, and the value is either the corresponding default, or TokenizerArgument.
con – Used to lookup other tokenizers
args – A list of strings as received by apsw.FTS5TokenizerFactory
For example to parse
["arg1", "3", "big", "ship", "unicode61", "yes", "two"]
# spec on input
{
    # Converts to integer
    "arg1": TokenizerArgument(convertor=int, default=7),
    # Limit allowed values
    "big": TokenizerArgument(choices=("ship", "plane")),
    # Accepts any string, with a default
    "small": "hello",
    # gathers up remaining arguments, if you intend
    # to process the results of another tokenizer
    "+": None
}

# options on output
{
    "arg1": 3,
    "big": "ship",
    "small": "hello",
    "+": db.Tokenizer("unicode61", ["yes", "two"])
}

# Using "+" in your tokenize functions
def tokenize(utf8, flags, locale):
    tok = options["+"]
    for start, end, *tokens in tok(utf8, flags, locale):
        # do something
        yield start, end, *tokens
See also
Some useful convertors
See the example.
- apsw.fts5.register_functions(db: Connection, map: dict[str, str | Callable])[source]
Registers auxiliary functions named in map with the connection, if not already registered
The map contains the function name, and either the callable or a string which will be automatically imported.
See map_functions
- apsw.fts5.register_tokenizers(db: Connection, map: dict[str, str | Callable])[source]
Registers tokenizers named in map with the connection, if not already registered
The map contains the tokenizer name, and either the callable or a string which will be automatically imported.
See map_tokenizers
- apsw.fts5.string_tokenize(tokenizer: FTS5Tokenizer, text: str, flags: int, locale: str | None)[source]
Tokenizer caller to get string offsets back
Calls the tokenizer doing the conversion of text to UTF8, and converting the received UTF8 offsets back to text offsets.
- apsw.fts5.tokenize_reasons: dict[str, int] = {'AUX': 8, 'DOCUMENT': 4, 'QUERY': 1, 'QUERY_PREFIX': 3}
Mapping between friendly strings and constants for xTokenize flags
- apsw.fts5.tokenizer_test_strings(filename: str | Path | None = None) tuple[tuple[bytes, str], ...] [source]
Provides utf-8 bytes sequences for interesting test strings
- Parameters:
filename – File to load. If None then the builtin one is used
- Returns:
A tuple where each item is a tuple of utf8 bytes and comment str
The test file should be UTF-8 encoded text.
If it starts with a # then it is considered to be multiple text sections where a # line contains a description of the section. Any lines beginning ## are ignored.
- apsw.fts5.unicode_categories = {'Cc': 'Other control', 'Cf': 'Other format', 'Cn': 'Other not assigned', 'Co': 'Other private use', 'Cs': 'Other surrogate', 'Ll': 'Letter Lowercase', 'Lm': 'Letter modifier', 'Lo': 'Letter other', 'Lt': 'Letter titlecase', 'Lu': 'Letter Uppercase', 'Mc': 'Mark spacing combining', 'Me': 'Mark enclosing', 'Mn': 'Mark nonspacing', 'Nd': 'Number decimal digit', 'Nl': 'Number letter', 'No': 'Number other', 'Pc': 'Punctuation connector', 'Pd': 'Punctuation dash', 'Pe': 'Punctuation close', 'Pf': 'Punctuation final quote', 'Pi': 'Punctuation initial quote', 'Po': 'Punctuation other', 'Ps': 'Punctuation open', 'Sc': 'Symbol currency', 'Sk': 'Symbol modifier', 'Sm': 'Symbol math', 'So': 'Symbol other', 'Zl': 'Separator line', 'Zp': 'Separator paragraph', 'Zs': 'Separator space'}
Unicode categories and descriptions for reference
FTS5 Query module
apsw.fts5query
Create, parse, and modify queries
There are 3 representations of a query available:
query string
This is the string syntax accepted by FTS5 where you represent AND, OR, NEAR, column filtering etc inline in the string. An example is:
love AND (title:^"big world" NOT summary:"sunset cruise")
parsed
This is a hierarchical representation using dataclasses with all fields present. Represented as QUERY, it uses PHRASE, NEAR, COLUMNFILTER, AND, NOT, and OR. The string example above is:
AND(queries=[PHRASE(phrase='love', initial=False, prefix=False, plus=None), NOT(match=COLUMNFILTER(columns=['title'], filter='include', query=PHRASE(phrase='big world', initial=True, prefix=False, plus=None)), no_match=COLUMNFILTER(columns=['summary'], filter='include', query=PHRASE(phrase='sunset cruise', initial=False, prefix=False, plus=None)))])
dict
This is a hierarchical representation using Python dictionaries which is easy for logging, storing as JSON, and manipulating. Fields containing default values are omitted. When provided to methods in this module, you do not need to provide intermediate PHRASE - just Python lists and strings directly. This is the easiest form to programmatically compose and modify queries in. The string example above is:
{'@': 'AND', 'queries': [{'@': 'PHRASE', 'phrase': 'love'}, {'@': 'NOT', 'match': {'@': 'COLUMNFILTER', 'columns': ['title'], 'filter': 'include', 'query': {'@': 'PHRASE', 'initial': True, 'phrase': 'big world'}}, 'no_match': {'@': 'COLUMNFILTER', 'columns': ['summary'], 'filter': 'include', 'query': {'@': 'PHRASE', 'phrase': 'sunset cruise'}}}]}
See the example.
From type | To type | Conversion method
---|---|---
query string | parsed | parse_query_string()
parsed | dict | to_dict()
dict | parsed | from_dict()
parsed | query string | to_query_string()
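A short sketch of converting between the forms with those functions:
import apsw.fts5query

q = {"@": "AND", "queries": ["hello", "world"]}
parsed = apsw.fts5query.from_dict(q)
print(apsw.fts5query.to_query_string(parsed))   # eg: hello AND world
print(apsw.fts5query.to_dict(apsw.fts5query.parse_query_string("hello AND world")))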
Other helpful functionality includes:
quote() to appropriately double quote strings
extract_with_column_filters() to get a QUERY for a node within an existing QUERY but applying the intermediate column filters.
applicable_columns() to work out which columns apply to part of a QUERY
walk() to traverse a parsed query
- class apsw.fts5query.AND(queries: Sequence[COLUMNFILTER | NEAR | AND | OR | NOT | PHRASE])[source]
All queries must match
- class apsw.fts5query.COLUMNFILTER(columns: Sequence[str], filter: Literal['include', 'exclude'], query: COLUMNFILTER | NEAR | AND | OR | NOT | PHRASE)[source]
Limit query to certain columns
This always reduces the columns that phrase matching will be done against.
- class apsw.fts5query.NOT(match: COLUMNFILTER | NEAR | AND | OR | NOT | PHRASE, no_match: COLUMNFILTER | NEAR | AND | OR | NOT | PHRASE)[source]
match must match, but no_match must not
- class apsw.fts5query.OR(queries: Sequence[COLUMNFILTER | NEAR | AND | OR | NOT | PHRASE])[source]
Any query must match
- class apsw.fts5query.PHRASE(phrase: str | QueryTokens, initial: bool = False, prefix: bool = False, plus: PHRASE | None = None)[source]
One phrase
- phrase: str | QueryTokens
Text of the phrase. If + was used (eg one+two) then it will be a list of phrases
- exception apsw.fts5query.ParseError(query: str, message: str, position: int)[source]
This exception is raised when an error parsing a query string is encountered
A simple printer:
print(exc.query)
print(" " * exc.position + "^", exc.message)
- apsw.fts5query.QUERY
Type representing all query types.
- apsw.fts5query.QUERY_TOKENS_MARKER = '$!Tokens~'
Special marker at the start of a string to recognise it as a list of tokens for
QueryTokens
- class apsw.fts5query.QueryTokens(tokens: list[str | Sequence[str]])[source]
FTS5 query strings are passed to tokenizers which extract tokens, such as by splitting on whitespace, lower casing text, and removing characters like accents.
If you want to query tokens directly then use this class with the tokens member, using it where PHRASE.phrase goes, and use to_query_string() to compose your query.
Your FTS5 table must use the apsw.fts5.QueryTokensTokenizer as the first tokenizer in the list. If the reason for tokenizing includes FTS5_TOKENIZE_QUERY and the text to be tokenized starts with the special marker, then the tokens are returned. apsw.fts5.Table.supports_query_tokens will tell you if query tokens are handled correctly. The apsw.fts5.Table.create() parameter support_query_tokens will ensure the tokenize table option is correctly set. You can get the tokens from apsw.fts5.Table.tokens.
You can construct QueryTokens like this:
# One token
QueryTokens(["hello"])
# Token sequence
QueryTokens(["hello", "world", "today"])
# Colocated tokens use a nested list
QueryTokens(["hello", ["first", "1st"]])
To use in a query:
{"@": "NOT", "match": QueryTokens(["hello", "world"]), "no_match": QueryTokens([["first", "1st"]])}
That would be equivalent to a query of "Hello World" NOT "First" if tokens were lower cased, and a tokenizer added a colocated 1st on seeing first.
- classmethod decode(data: str | bytes) QueryTokens | None [source]
If the marker is present then returns the corresponding QueryTokens, otherwise None.
- apsw.fts5query.applicable_columns(node: COLUMNFILTER | NEAR | AND | OR | NOT | PHRASE, start: COLUMNFILTER | NEAR | AND | OR | NOT | PHRASE, columns: Sequence[str]) set[str] [source]
Return which columns apply to node.
You can use apsw.fts5.Table.columns_indexed() to get the column list for a table. The column names are matched using SQLite semantics (ASCII case insensitive).
If a query column is not in the provided columns, then KeyError is raised.
- apsw.fts5query.extract_with_column_filters(node: COLUMNFILTER | NEAR | AND | OR | NOT | PHRASE, start: COLUMNFILTER | NEAR | AND | OR | NOT | PHRASE) COLUMNFILTER | NEAR | AND | OR | NOT | PHRASE [source]
Return a new QUERY for a query rooted at start with child node, with intermediate
COLUMNFILTER
in between applied.This is useful if you want to execute a node from a top level query ensuring the column filters apply.
- apsw.fts5query.from_dict(d: dict[str, Any] | Sequence[str] | str | QueryTokens) COLUMNFILTER | NEAR | AND | OR | NOT | PHRASE [source]
Turns dict back into a
QUERY
You can take shortcuts putting str or QueryTokens in places where PHRASE is expected. For example this is accepted:
{"@": "AND", "queries": ["hello", "world"]}
- apsw.fts5query.parse_query_string(query: str) COLUMNFILTER | NEAR | AND | OR | NOT | PHRASE [source]
Returns the corresponding
QUERY
for the query string
- apsw.fts5query.quote(text: str | QueryTokens) str [source]
Quotes text if necessary to keep it as one unit using FTS5 quoting rules
Some examples:
text | return
---|---
hello | hello
one two | "one two"
(empty string) | ""
one"two | "one""two"
- apsw.fts5query.to_dict(q: COLUMNFILTER | NEAR | AND | OR | NOT | PHRASE) dict[str, Any] [source]
Converts structure to a dict
This is useful for pretty printing, logging, saving as JSON, modifying etc.
The dict has a key @ with value corresponding to the dataclass (eg NEAR, PHRASE, AND) and the same field names as the corresponding dataclasses. Only fields with non-default values are emitted.
- apsw.fts5query.to_query_string(q: COLUMNFILTER | NEAR | AND | OR | NOT | PHRASE) str [source]
Returns the corresponding query in text format
- apsw.fts5query.walk(start: COLUMNFILTER | NEAR | AND | OR | NOT | PHRASE) Iterator[tuple[tuple[COLUMNFILTER | NEAR | AND | OR | NOT | PHRASE, ...], COLUMNFILTER | NEAR | AND | OR | NOT | PHRASE]] [source]
Yields the parents and each node for a query recursively
The query tree is traversed top down. Use it like this:
for parents, node in walk(query):
    # parents will be a tuple of parent nodes
    # node will be current node
    if isinstance(node, PHRASE):
        print(node.phrase)
FTS5 Auxiliary functions module
apsw.fts5aux
Implementation of FTS5 auxiliary functions in Python.
Auxiliary functions are used for ranking results, and for processing search results.
- apsw.fts5aux.bm25(api: apsw.FTS5ExtensionApi, *args: apsw.SQLiteValue) float [source]
Perform the BM25 calculation for a matching row
It accepts weights for each column (default 1) which means how much a hit in that column counts for.
The builtin function is described here. This is a translation of the SQLite C version into Python for illustrative purposes.
- apsw.fts5aux.inverse_document_frequency(api: FTS5ExtensionApi) list[float] [source]
Measures how rare each search phrase is in the content
This helper method is intended for use in your own ranking functions. The result is the idf for each phrase in the query.
A phrase occurring in almost every row will have a value close to zero, while less frequent phrases have increasingly large positive numbers.
The values will always be at least 0.000001 so you don’t have to worry about negative numbers or division by zero, even for phrases that are not found.
- apsw.fts5aux.position_rank(api: apsw.FTS5ExtensionApi, *args: apsw.SQLiteValue)[source]
Ranking function boosting the earlier in a column phrases are located
bm25() doesn’t take into account where phrases occur. It makes no difference if a phrase occurs at the beginning, middle, or end. This boost takes into account how early the phrase match is, suitable for content with more significant text towards the beginning.
If the query has phrases and operators (AND, OR, NOT) then those operators are not visible to this function, and only the location of each phrase is taken into consideration. See apsw.fts5.QueryInfo.phrases.
It accepts parameters giving the weights for each column (default 1).
- apsw.fts5aux.subsequence(api: apsw.FTS5ExtensionApi, *args: apsw.SQLiteValue)[source]
Ranking function boosting rows where tokens are in order
bm25() doesn’t take into account ordering. Phrase matches like "big truck" must occur exactly together in that order. Matches for big truck score the same providing both words exist anywhere. This function boosts matches where the order does match, so big red truck gets a boost while truck, big does not for the same query.
If the query has phrases and operators (AND, OR, NOT) then those operators are not visible to this function, and it looks for ordering of each phrase. For example big OR truck NOT red will result in this function boosting big ... truck ... red in that order. See apsw.fts5.QueryInfo.phrases.
It accepts parameters giving the weights for each column (default 1).
FTS5Tokenizer class
- class apsw.FTS5Tokenizer
Wraps a registered tokenizer. Returned by
Connection.fts5_tokenizer()
.
- FTS5Tokenizer.__call__(utf8: bytes, flags: int, locale: str | None, *, include_offsets: bool = True, include_colocated: bool = True) list[tuple[int, int, *tuple[str, ...]]]
Does a tokenization, returning a list of the results. If you have no interest in token offsets or colocated tokens then they can be omitted from the results.
- Parameters:
utf8 – Input bytes
flags – Reason flag
include_offsets – Returned list includes offsets into utf8 for each token
include_colocated – Returned list can include colocated tokens
Example outputs
Tokenizing b"first place" where 1st has been provided as a colocated token for first.
(Default) include_offsets True, include_colocated True
[ (0, 5, "first", "1st"), (6, 11, "place"), ]
include_offsets False, include_colocated True
[ ("first", "1st"), ("place", ), ]
include_offsets True, include_colocated False
[ (0, 5, "first"), (6, 11, "place"), ]
include_offsets False, include_colocated False
[ "first", "place", ]
- FTS5Tokenizer.connection: Connection
The
Connection
this tokenizer is registered with.
FTS5ExtensionApi class
- class apsw.FTS5ExtensionApi
Auxiliary functions run in
the context of a FTS5 search, and can be used for ranking,
highlighting, and similar operations. Auxiliary functions are
registered via Connection.register_fts5_function()
. This wraps
the auxiliary functions API
passed as the first parameter to auxiliary functions.
See the example.
- FTS5ExtensionApi.aux_data: Any
You can store an object as auxiliary data which is available across matching rows. It starts out as None.
An example use is to do up front calculations once, rather than on every matched row, such as fts5aux.inverse_document_frequency().
- FTS5ExtensionApi.column_count: int
Returns the number of columns in the table
- FTS5ExtensionApi.column_locale(column: int) str | None
Retrieves the locale for a column on this row.
- FTS5ExtensionApi.column_size(col: int = -1) int
Returns the total number of tokens in the current row for a specific column, or if
col
is negative then for all columns.
- FTS5ExtensionApi.column_text(col: int) bytes
Returns the utf8 bytes for the column of the current row.
- FTS5ExtensionApi.column_total_size(col: int = -1) int
Returns the total number of tokens in the table for a specific column, or if
col
is negative then for all columns.
- FTS5ExtensionApi.inst_count: int
Returns the number of hits in the current row
- FTS5ExtensionApi.inst_tokens(inst: int) tuple[str, ...] | None
Access tokens of hit inst in current row. None is returned if the call is not supported.
- FTS5ExtensionApi.phrase_column_offsets(phrase: int, column: int) list[int]
Returns the token offsets at which the phrase number occurs in the specified column.
- FTS5ExtensionApi.phrase_count: int
Returns the number of phrases in the query
- FTS5ExtensionApi.phrase_locations(phrase: int) list[list[int]]
Returns which columns and token offsets the phrase number occurs in.
The returned list is the same length as the number of columns. Each member is a list of token offsets in that column, and will be empty if the phrase is not in that column.
- FTS5ExtensionApi.phrases: tuple[tuple[str | None, ...], ...]
A tuple where each member is a phrase from the query. Each phrase is a tuple of str (or None when not available) per token of the phrase.
This combines the results of xPhraseCount, xPhraseSize and xQueryToken
- FTS5ExtensionApi.query_phrase(phrase: int, callback: FTS5QueryPhrase, closure: Any) None
Searches the table for the numbered phrase. The callback takes two parameters - a different apsw.FTS5ExtensionApi and the closure.
An example usage for this method is to see how often the phrases occur in the table. Set up a tracking counter here, and then in the callback you can update it on each visited row. This is shown in the example.
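A minimal sketch of that pattern; the function name and the returned string format are illustrative, not part of the library.

import apsw

def phrase_row_counts(api: apsw.FTS5ExtensionApi, *args):
    # Count how many rows of the table contain each phrase of the query.
    # The counts depend only on the query, so do it once and cache in aux_data.
    if api.aux_data is None:
        counts = [0] * api.phrase_count

        def saw_row(match_api: apsw.FTS5ExtensionApi, phrase_number: int) -> None:
            # called for each row the phrase occurs in
            counts[phrase_number] += 1

        for phrase_number in range(api.phrase_count):
            api.query_phrase(phrase_number, saw_row, phrase_number)
        api.aux_data = counts

    # return the counts as readable text, eg "3, 17"
    return ", ".join(str(count) for count in api.aux_data)

Once registered with Connection.register_fts5_function() it could be called from SQL, for example SELECT phrase_row_counts(docs) FROM docs WHERE docs MATCH ? against a placeholder table named docs.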
- FTS5ExtensionApi.row_count: int
Returns the number of rows in the table
- FTS5ExtensionApi.rowid: int
Rowid of the current row
- FTS5ExtensionApi.tokenize(utf8: bytes, locale: str | None, *, include_offsets: bool = True, include_colocated: bool = True) list
Tokenizes the utf8. FTS5 sets the reason to FTS5_TOKENIZE_AUX. See apsw.FTS5Tokenizer.__call__() for details.
Unicode Text Handling
apsw.unicode
- Up to date Unicode aware methods and lookups
This module helps with Full text search and general Unicode, addressing the following:
The standard library unicodedata has limited information available (eg no information about emoji), and is only updated to new Unicode versions on a new Python version.
Multiple consecutive codepoints can combine into a single user perceived character (grapheme cluster), such as combining accents, vowels and marks in some writing systems, variant selectors, joiners and linkers, etc. That means you can't use indexes into str safely without potentially breaking them.
The standard library provides no help in splitting text into grapheme clusters, words, and sentences, or into breaking text into multiple lines.
Text processing is performance sensitive - FTS5 easily handles hundreds of megabytes to gigabytes of text, and so should this module. It also affects the latency of each query as that is tokenized, and results can highlight words and sentences.
This module is independent of the main apsw module, and loading it does not load any database functionality. The majority of the functionality is implemented in C for size and performance reasons.
See unicode_version
for the implemented version.
Grapheme cluster, word, and sentence splitting
Unicode Technical Report #29 rules for finding grapheme clusters, words, and sentences are implemented. TR29 specifies break points which can be found via grapheme_next_break(), word_next_break(), and sentence_next_break().
Building on those are iterators providing optional offsets and the text. This is used for tokenization (getting character and word boundaries correct), and for result highlighting (showing words/sentences before and after a match).
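A short sketch using the iterators documented later in this section; the sample text is arbitrary.

import apsw.unicode

text = "Hello world. Flags 🇧🇷🇨🇦 and emoji 🚀 too."

# words - by default only segments containing letters or numbers are yielded
print(list(apsw.unicode.word_iter(text)))

# include emoji and flag (regional indicator) sequences as words as well
print(list(apsw.unicode.word_iter(text, emoji=True, regional_indicator=True)))

# sentences, with start and end offsets into the str
for start, end, sentence in apsw.unicode.sentence_iter_with_offsets(text):
    print(start, end, repr(sentence))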
Line break splitting
Unicode Technical Report #14 rules for finding where text can be broken and resumed on the next line are implemented. TR14 specifies break points which can be found via line_break_next_break().
Building on those are iterators providing optional offsets and the text. This is used for text_wrap().
Unicode lookups
Category information
category()
Is an emoji or similar
is_extended_pictographic()
Flag characters
is_regional_indicator()
Codepoint names
codepoint_name()
Case folding, accent removal
casefold()
is used to do case insensitive comparisons.
strip()
is used to remove accents, marks, punctuation, joiners etc
Helpers
These are aware of grapheme cluster boundaries which Python’s builtin string operations are not. The text width functions take into account how wide the text is when displayed on most terminals.
grapheme_length()
to get the number of grapheme clusters in a string
grapheme_substr()
to get substrings using grapheme cluster indexing
grapheme_find()
to find a substring
split_lines()
to split text into lines using all the Unicode hard line break codepoints
text_width()
to count how wide the text is
expand_tabs()
to expand tabs using text width
text_width_substr()
to extract substrings based on text width
text_wrap()
to wrap paragraphs using Unicode words, line breaking, and text width
guess_paragraphs()
to establish paragraph boundaries for text that has line breaks inside paragraphs, as in many plain text and similar markup formats.
Size
Performance
There are some pure Python alternatives, with less functionality. They take 5 to 15 times more CPU time to process the same text. Use
python3 -m apsw.unicode benchmark --help
.
- apsw.unicode.unicode_version = '16.0'
The Unicode version that the rules and data tables implement
- apsw.unicode.category(codepoint: int | str) str [source]
Returns the general category - eg Lu for Letter Uppercase.
See apsw.fts5.unicode_categories for a mapping of categories to descriptions.
- apsw.unicode.is_extended_pictographic(text: str) bool [source]
Returns True if any of the text has the extended pictographic property (Emoji and similar)
- apsw.unicode.is_regional_indicator(text: str) bool [source]
Returns True if any of the text is one of the 26 regional indicators used in pairs to represent country flags
- apsw.unicode.casefold(text: str) str [source]
Returns the text for equality comparison without case distinction
Case folding maps text to a canonical form where case differences are removed allowing case insensitive comparison. Unlike upper, lower, and title case, the result is not intended to be displayed to people.
- apsw.unicode.strip(text: str) str [source]
Returns the text for less exact comparison with accents, punctuation, marks etc removed
It will strip diacritics leaving the underlying characters, so áççéñțś becomes accents; punctuation, so e.g. becomes eg and don't becomes dont; and marks, so देवनागरी becomes दवनगर; as well as all spacing, formatting, variation selectors and similar codepoints.
Codepoints are also converted to their compatibility representation. For example the single codepoint Roman numeral Ⅲ becomes III (three separate regular upper case I), and 🄷🄴🄻🄻🄾 becomes HELLO.
The resulting text should not be shown to people, and is intended for doing relaxed equality comparisons, at the expense of false positives when the accents, marks, punctuation etc were intended.
You should do case folding after this.
Emoji are preserved but variation selectors, fitzpatrick and joiners are stripped.
Regional indicators are preserved.
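A short sketch combining strip() and casefold() for relaxed comparisons; the sample strings are arbitrary and the inline expected values follow the descriptions above.

import apsw.unicode

# accents and marks are removed, leaving the underlying characters
print(apsw.unicode.strip("áççéñțś"))                          # accents

# punctuation is dropped and compatibility codepoints are expanded
print(apsw.unicode.strip("don't"), apsw.unicode.strip("Ⅲ"))   # dont III

# strip() does not change case - case fold afterwards for relaxed comparisons
print(apsw.unicode.casefold(apsw.unicode.strip("Naïve"))
      == apsw.unicode.casefold(apsw.unicode.strip("NAIVE")))   # True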
- apsw.unicode.split_lines(text: str, offset: int = 0) Iterator[str] [source]
Each line, using hard line break rules
This is an iterator yielding a line at a time. The lines yielded will not include the hard line break characters.
- apsw.unicode.expand_tabs(text: str, tabsize: int = 8, invalid: str = '.') str [source]
Turns tabs into spaces aligning on tabsize boundaries, similar to
str.expandtabs()
This is aware of grapheme clusters and text width. Codepoints that have an invalid width are also replaced by
invalid
. Control characters are an example of an invalid character. Line breaks are replaced with newline.
- apsw.unicode.grapheme_length(text: str, offset: int = 0) int [source]
Returns number of grapheme clusters in the text. Unicode aware version of len
- apsw.unicode.grapheme_substr(text: str, start: int | None = None, stop: int | None = None) str [source]
Like text[start:stop] but in grapheme cluster units. start and stop can be negative to index from the end, or outside the bounds of the text, but are never an invalid combination (you get an empty string returned).
To get one grapheme cluster, make stop one more than start. For example to get the 3rd last grapheme cluster: grapheme_substr(text, -3, -3 + 1)
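A short sketch contrasting codepoint indexing with grapheme cluster indexing; the family emoji is just one example of a multi-codepoint cluster.

import apsw.unicode

# one user perceived character made of seven codepoints (four emoji joined by ZWJ)
family = "👩\u200d👩\u200d👧\u200d👧"

print(len(family))                                       # 7 codepoints
print(apsw.unicode.grapheme_length(family))              # 1 grapheme cluster

# codepoint slicing can split the cluster; grapheme_substr never will
print(repr(family[:1]))                                  # a lone 👩
print(repr(apsw.unicode.grapheme_substr(family, 0, 1)))  # the whole family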
- apsw.unicode.grapheme_endswith(text: str, substring: str) bool [source]
Returns True if text ends with substring being aware of grapheme cluster boundaries
- apsw.unicode.grapheme_startswith(text: str, substring: str) bool [source]
Returns True if text starts with substring being aware of grapheme cluster boundaries
- apsw.unicode.grapheme_find(text: str, substring: str, start: int = 0, end: int | None = None) int [source]
Returns the offset in text where substring can be found, being aware of grapheme clusters. The start and end of the substring have to be at a grapheme cluster boundary.
- Parameters:
start – Where in text to start the search (default beginning)
end – Where to stop the search exclusive (default remaining text)
- Returns:
offset into text, or -1 if not found or substring is zero length
- apsw.unicode.text_width(text: str, offset: int = 0) int [source]
Returns how many columns the text would be if displayed in a terminal
You should split_lines() first and then operate on each line separately.
If the text contains new lines, control characters, and similar unrepresentable codepoints then minus 1 is returned.
Terminals aren't entirely consistent with each other, and Unicode has many kinds of codepoints and combinations. Consequently this is right the vast majority of the time, but not always.
Note that web browsers do variable widths even in monospaced sections like <pre> so they won't always agree with the terminal either.
- apsw.unicode.text_width_substr(text: str, width: int, offset: int = 0) tuple[int, str] [source]
Extracts substring width or less wide being aware of grapheme cluster boundaries. For example you could use this to get a substring that is 80 (or less) wide.
- Returns:
A tuple of how wide the substring is, and the substring
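A short sketch of the width functions; the sample text is arbitrary and the comments show typical terminal widths.

import apsw.unicode

print(apsw.unicode.text_width("hello"))   # 5 columns
print(apsw.unicode.text_width("你好"))     # 4 columns - wide characters take two each

# take at most 6 terminal columns worth of text from the start
width, prefix = apsw.unicode.text_width_substr("你好 world", 6)
print(width, repr(prefix))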
- apsw.unicode.guess_paragraphs(text: str, tabsize: int = 8) str [source]
- Given text that contains paragraphs containing newlines, guesses where the paragraphs end.
The returned str will have newlines removed where they were determined not to mark a paragraph end.
If your text has paragraphs with newlines inside them, each line would otherwise get wrapped separately by text_wrap(). This function tries to guess where the paragraphs actually end:
Blank lines definitely end a paragraph.
Indented lines that continue preserving the indent are considered the same paragraph; a change of indent (in or out) starts a new paragraph.
Punctuation or numbers at the start of a line followed by indented text are considered the same paragraph.
Optional numbers followed by punctuation then a space are considered to start new paragraphs.
- apsw.unicode.text_wrap(text: str, width: int = 70, *, tabsize: int = 8, hyphen: str = '-', combine_space: bool = True, invalid: str = '?') Iterator[str] [source]
Similar to textwrap.wrap() but Unicode grapheme cluster and line break aware.
Note: Newlines in the text are treated as end of paragraph. If your text has paragraphs with newlines in them, then call guess_paragraphs() first.
- Parameters:
text – string to process
width – width of yielded lines, if rendered using a monospace font such as to a terminal
tabsize – Tab stop spacing as tabs are expanded
hyphen – Used to show a segment was broken because it was wider than
width
combine_space – Leading space on each line (the indent) is always preserved. Other spaces where multiple occur are combined into one space.
invalid – If invalid codepoints are encountered such as control characters and surrogates then they are replaced with this.
This yields one line of str at a time, which will be exactly width when output to a terminal. It will be right padded with spaces if necessary and will not have a trailing newline.
apsw.ext.format_query_table() uses this method to ensure each column is the desired width.
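A short sketch of wrapping hard wrapped text; the sample text and width are arbitrary.

import apsw.unicode

text = """These lines belong to
the same paragraph but were hard
wrapped in the source file.

This is a second paragraph."""

# join the hard wrapped lines back into paragraphs first
paragraphs = apsw.unicode.guess_paragraphs(text)

# then wrap to the desired width - each yielded line is padded to exactly that width
for line in apsw.unicode.text_wrap(paragraphs, width=30):
    print(f"|{line}|")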
- apsw.unicode.codepoint_name(codepoint: int | str) str | None [source]
Name or None if it doesn't have one.
For example codepoint 65 is named LATIN CAPITAL LETTER A while codepoint U+D1234 is not assigned and would return None.
- apsw.unicode.version_added(codepoint: int | str) str | None [source]
Returns the Unicode version in which the codepoint was added
- apsw.unicode.version_dates = {'1.0': (1991, 10, 1), '1.1': (1993, 6, 1), '10.0': (2017, 6, 20), '11.0': (2018, 6, 5), '12.0': (2019, 3, 5), '12.1': (2019, 5, 7), '13.0': (2020, 3, 10), '14.0': (2021, 9, 14), '15.0': (2022, 9, 13), '15.1': (2023, 9, 12), '16.0': (2024, 9, 10), '2.0': (1996, 7, 1), '2.1': (1998, 5, 1), '3.0': (1999, 9, 1), '3.1': (2001, 3, 1), '3.2': (2002, 3, 1), '4.0': (2003, 4, 1), '4.1': (2005, 3, 31), '5.0': (2006, 7, 14), '5.1': (2008, 4, 4), '5.2': (2009, 10, 1), '6.0': (2010, 10, 11), '6.1': (2012, 1, 31), '6.2': (2012, 9, 26), '6.3': (2013, 9, 30), '7.0': (2014, 6, 16), '8.0': (2015, 6, 17), '9.0': (2016, 6, 21)}
Release date (year, month, day) for each Unicode version, intended for use with
version_added()
- apsw.unicode.grapheme_next_break(text: str, offset: int = 0) int [source]
Returns end of Grapheme cluster / User Perceived Character
For example regional indicators are in pairs, and a base codepoint can be combined with zero or more additional codepoints providing diacritics, marks, and variations. Break points are defined in the TR29 spec.
- Parameters:
text – The text to examine
offset – The first codepoint to examine
- Returns:
Index of first codepoint not part of the grapheme cluster starting at offset. You should extract
text[offset:span]
- apsw.unicode.grapheme_next(text: str, offset: int = 0) tuple[int, int] [source]
Returns span of next grapheme cluster
- apsw.unicode.grapheme_iter(text: str, offset: int = 0) Iterator[str] [source]
Iterator providing text of each grapheme cluster
- apsw.unicode.grapheme_iter_with_offsets(text: str, offset: int = 0) Iterator[tuple[int, int, str]] [source]
Iterator providing start, end, text of each grapheme cluster
- apsw.unicode.grapheme_iter_with_offsets_filtered(text: str, offset: int = 0, *, categories: Iterable[str], emoji: bool = False, regional_indicator: bool = False) Iterator[tuple[int, int, str]] [source]
Iterator providing start, end, text of each grapheme cluster, provided it includes codepoints from the categories, emoji, or regional indicators
- apsw.unicode.word_next_break(text: str, offset: int = 0) int [source]
Returns end of next word or non-word
Finds the next break point according to the TR29 spec. Note that the segment returned may be a word, or a non-word (spaces, punctuation etc). Use
word_next()
to get words.- Parameters:
text – The text to examine
offset – The first codepoint to examine
- Returns:
Next break point
- apsw.unicode.word_default_categories = {'Ll', 'Lm', 'Lo', 'Lt', 'Lu', 'Nd', 'Nl', 'No'}
Default categories for selecting word segments - letters and numbers
- apsw.unicode.word_next(text: str, offset: int = 0, *, categories: Iterable[str] = word_default_categories, emoji: bool = False, regional_indicator: bool = False) tuple[int, int] [source]
Returns span of next word
A segment is considered a word if it contains at least one codepoint corresponding to any of the categories, plus:
emoji (Extended_Pictographic in Unicode specs)
regional indicator - two character sequence for flags like 🇧🇷🇨🇦
- apsw.unicode.word_iter(text: str, offset: int = 0, *, categories: Iterable[str] = word_default_categories, emoji: bool = False, regional_indicator: bool = False) Iterator[str] [source]
Iterator providing text of each word
- apsw.unicode.word_iter_with_offsets(text: str, offset: int = 0, *, categories: Iterable[str] = word_default_categories, emoji: bool = False, regional_indicator: bool = False) Iterator[tuple[int, int, str]] [source]
Iterator providing start, end, text of each word
- apsw.unicode.sentence_next_break(text: str, offset: int = 0) int [source]
Returns end of sentence location.
Finds the next break point according to the TR29 spec. Note that the segment returned includes leading and trailing white space.
- Parameters:
text – The text to examine
offset – The first codepoint to examine
- Returns:
Next break point
- apsw.unicode.sentence_next(text: str, offset: int = 0) tuple[int, int] [source]
Returns span of next sentence
- apsw.unicode.sentence_iter(text: str, offset: int = 0) Iterator[str] [source]
Iterator providing text of each sentence
- apsw.unicode.sentence_iter_with_offsets(text: str, offset: int = 0) Iterator[tuple[int, int, str]] [source]
Iterator providing start, end, text of each sentence
- apsw.unicode.line_break_next_break(text: str, offset: int = 0) int [source]
Returns next opportunity to break a line
Finds the next break point according to the TR14 spec.
- Parameters:
text – The text to examine
offset – The first codepoint to examine
- Returns:
Next break point
- apsw.unicode.line_break_next(text: str, offset: int = 0) tuple[int, int] [source]
Returns span of next line