Tokenization

dictpress uses tokenizers to convert dictionary entries into searchable tokens for SQLite FTS5 full-text search. There are two types of tokenizers: built-in default tokenizers and custom Lua tokenizers.

Default tokenizers

dictpress bundles Snowball stemming algorithms for 18 languages. These tokenizers lowercase and stem words to their root forms. For example, "running" becomes "run" in English, so searches for both "run" and "running" match the same entry.

Supported languages: arabic, danish, dutch, english, finnish, french, german, greek, hungarian, italian, norwegian, portuguese, romanian, russian, spanish, swedish, tamil, turkish

Configuration:

[lang.english]
tokenizer = "english"
tokenizer_type = "default"

CSV import format: default:english in the tokenizer column.

Lua tokenizers

For languages or use cases not covered by built-in stemmers, custom Lua tokenizers can be used. These are .lua scripts placed in the ./tokenizers directory (defined in config.toml). Lua tokenizers are useful for phonetic search (like Metaphone), transliteration-based search, or any custom tokenization logic.

Configuration in config.toml:

[lang.malayalam]
tokenizer = "indicphone_ml.lua"
tokenizer_type = "lua"

CSV import format: lua:indicphone_ml.lua in the tokenizer column.

Writing a custom Lua tokenizer

A Lua tokenizer must export two functions: tokenize() for indexing and to_query() for search queries.

Required functions

-- Convert text to tokens for indexing.
-- Returns: table of token strings
function tokenize(text, lang)
    local tokens = {}
    for word in utils.words(text) do
        tokens[#tokens + 1] = word:lower()
    end
    return tokens
end

-- Convert search query to FTS5 query format.
-- Returns: string (FTS5 query)
function to_query(text, lang)
    return text:lower()
end
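
A to_query() implementation can also expand multi-word queries into a richer FTS5 expression. The variant below is an illustrative sketch, not part of dictpress: it lowercases each word and joins them with OR so that any matching token satisfies the query.

-- Illustrative variant: match any of the query words.
-- "Running Fast" becomes the FTS5 query "running OR fast".
function to_query(text, lang)
    local parts = {}
    for word in utils.words(text) do
        parts[#parts + 1] = word:lower()
    end
    return table.concat(parts, " OR ")
end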

Global utils

All Lua tokenizers have access to a global utils table with helper functions:

utils.words(s): returns an iterator over whitespace-separated words
utils.trim(s): trims leading and trailing whitespace
utils.split(s, delim): splits a string by a plain-text delimiter
utils.replace_all(s, old, new): replaces all literal occurrences of old with new
utils.replace_all_pattern(s, pattern, repl): replaces all Lua pattern matches of pattern with repl
utils.filter_unicode_range(s, min_cp, max_cp): keeps only characters within the given Unicode codepoint range
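
As a sketch of how these helpers combine, the tokenizer below keeps only Malayalam characters in each word before indexing. It assumes the helper behaviours listed above and runs inside dictpress, which provides the global utils table; the Malayalam codepoint range (U+0D00 to U+0D7F) is only an illustration and should be adapted to the target script.

-- Illustrative sketch: index only the Malayalam characters of each word.
function tokenize(text, lang)
    local tokens = {}
    for word in utils.words(utils.trim(text)) do
        local w = utils.filter_unicode_range(word, 0x0D00, 0x0D7F)
        if w ~= "" then
            tokens[#tokens + 1] = w
        end
    end
    return tokens
end

-- Queries get the same per-word filtering so they line up with the index.
function to_query(text, lang)
    local parts = {}
    for word in utils.words(text) do
        local w = utils.filter_unicode_range(word, 0x0D00, 0x0D7F)
        if w ~= "" then
            parts[#parts + 1] = w
        end
    end
    return table.concat(parts, " ")
end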

Example

See indicphone_ml.lua for a full example of a phonetic tokenizer for Malayalam. It converts Malayalam script to phonetic keys, enabling fuzzy search that matches words by pronunciation rather than exact spelling.
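
The general shape of such a phonetic tokenizer looks roughly like the following. The phonetic_key() helper here is a hypothetical placeholder for the actual script-to-pronunciation mapping; see indicphone_ml.lua for the real logic.

-- Skeleton of a phonetic tokenizer. phonetic_key() is a placeholder for
-- the real mapping from script to pronunciation-based keys.
local function phonetic_key(word)
    -- e.g. transliterate and collapse similar-sounding characters.
    return word:lower() -- placeholder only
end

function tokenize(text, lang)
    local tokens = {}
    for word in utils.words(text) do
        tokens[#tokens + 1] = phonetic_key(word)
    end
    return tokens
end

-- Queries go through the same mapping, so a query and an entry that
-- sound alike produce the same keys and therefore match.
function to_query(text, lang)
    local parts = {}
    for word in utils.words(text) do
        parts[#parts + 1] = phonetic_key(word)
    end
    return table.concat(parts, " ")
end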