Subsequential Finite State Transducer
Given an input text, produces a new text by applying a fixed set of rewrite rules. The algorithm builds a minimal subsequential transducer from the rules and applies the "leftmost-longest match" replacement strategy with skips: scanning left to right, the longest matching dictionary entry is replaced, text that matches no entry is copied unchanged, and replaced parts never overlap. Building the transducer takes time linear in the size of the input dictionary. For any text t of length |t|, a rewrite also runs in linear time, O(|t| + |t'|), where t' denotes the resulting output string.
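To make the replacement strategy concrete, here is a naive reference sketch of leftmost-longest matching with skips. It is not how ssfst works internally (it scans every rule at every position instead of walking a transducer, so it is not linear); `rewriteLeftmostLongest` is a hypothetical name used only for illustration.

```javascript
// Naive illustration of "leftmost-longest match with skips".
// Assumption: rules is an array of { input, output } objects.
function rewriteLeftmostLongest(text, rules) {
    let out = '';
    let i = 0;
    while (i < text.length) {
        // Among all rules matching at position i, keep the longest input.
        let best = null;
        for (const rule of rules) {
            if (rule.input.length > 0 &&
                text.startsWith(rule.input, i) &&
                (best === null || rule.input.length > best.input.length)) {
                best = rule;
            }
        }
        if (best === null) {
            // No rule matches here: copy one character and move on (the "skip").
            out += text[i];
            i += 1;
        } else {
            // Emit the replacement and jump past the match, so matches never overlap.
            out += best.output;
            i += best.input.length;
        }
    }
    return out;
}

const demoRules = [
    { input: 'ab', output: 'X' },
    { input: 'abc', output: 'Y' },
    { input: 'bc', output: 'Z' }
];
console.log(rewriteLeftmostLongest('abcd abx bc', demoRules)); // prints "Yd Xx Z"
```

At position 0 both 'ab' and 'abc' match, and the longer one wins. The 'bc' inside 'abcd' is never considered because it overlaps a part that was already replaced.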
Check out the Online Sandbox.
Usage
npm i --save ssfst
Example: Text Rewriting
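const ssfst = require('ssfst');

// Each rule maps an input string to its replacement.
const spellingCorrector = ssfst.init([
    { input: 'acheive', output: 'achieve' },
    { input: 'arguement', output: 'argument' },
    { input: 'independant', output: 'independent' },
    { input: 'posession', output: 'possession' },
    { input: 'mercy less', output: 'merciless' }
]);

spellingCorrector.process('independant'); // => "independent"
spellingCorrector.process('mercy less arguement'); // => "merciless argument"
spellingCorrector.process('they acheived a lot'); // => "they achieved a lot"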
The init factory function takes a collection of input/output pairs and returns a transducer. The transducer can be initialized from any iterable object, for example a generator.
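// A generator is one kind of iterable that init accepts.
function* dictGen() {
    yield { input: 'dog', output: '<a href="https://en.wikipedia.org/wiki/Dog">dog</a>' };
    yield { input: 'fox', output: '<a href="https://en.wikipedia.org/wiki/Fox">fox</a>' };
}

const transducer = ssfst.init(dictGen());
transducer.process('The quick brown fox jumped over the lazy dog.');
/* => The quick brown <a href="https://en.wikipedia.org/wiki/Fox">fox</a> jumped over the lazy <a href="https://en.wikipedia.org/wiki/Dog">dog</a>. */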
Working with large datasets
Loading the full rewrite dictionary in memory is not optimal when working with large datasets. In this case we want to build the transducer by adding the entries asynchronously one at a time. This is achieved by using an async iterable.
For example, if our dataset is stored in a file, we can read its contents one line at a time.
Berlin,Germany
Buenos Aires,Argentina
London,United Kingdom
Sofia,Bulgaria
Tokyo,Japan
This is the dictionary text file. Each line holds one entry, with the input and output values separated by a comma. We implement an async generator function that reads the file line by line and yields one object per entry; the transducer initialization consumes these objects.
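const fs = require('fs');
const readline = require('readline');
const ssfst = require('ssfst');

// Reads the dictionary file line by line and yields one entry per line.
async function* readLinesGenAsync() {
    const lineReader = readline.createInterface({
        // File name assumed here; point this at your own dictionary file.
        input: fs.createReadStream('capitals.txt')
    });

    for await (const line of lineReader) {
        const [input, output] = line.split(',');
        yield { input, output };
    }
}

We pass the async iterable to the initAsync factory function.

const transducer = await ssfst.initAsync(readLinesGenAsync());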
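Splitting a dictionary line on every comma would truncate an output value that itself contains a comma. The following is a minimal sketch of a more defensive parser, assuming the comma-separated format described above; `parseLine` and `entriesFrom` are hypothetical helper names and not part of ssfst.

```javascript
// Hypothetical helper: splits a dictionary line at the FIRST comma only,
// so the output value may itself contain commas.
// Returns null for blank lines and lines without a usable input part.
function parseLine(line) {
    const sep = line.indexOf(',');
    if (sep <= 0) return null;
    return {
        input: line.slice(0, sep).trim(),
        output: line.slice(sep + 1).trim()
    };
}

// Hypothetical helper: wraps any iterable or async iterable of lines
// (for example a readline interface) and yields only the valid entries.
async function* entriesFrom(lines) {
    for await (const line of lines) {
        const entry = parseLine(line);
        if (entry !== null) yield entry;
    }
}
```

The resulting async iterable has the same shape as the generator above, so it could be handed to the initAsync factory function in its place.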
Example: Key-Value Store
Due to its minimality, the subsequential transducer can also be used to efficiently store key-value pairs.
const val = transducer.process('Sofia'); // => Bulgaria
const invalid = transducer.process('Unknown Key'); // => Unknown Key
If there's no value for a given key, it will return the key itself, which simply reduces to processing a text without applying any rewrite rules.
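Because a missing key comes back unchanged, a caller that needs to tell "not found" apart from a real value has to compare the result with the key. A minimal sketch, assuming the transducer exposes a `process(text)` method; `lookup` is a hypothetical helper, not part of ssfst.

```javascript
// Hypothetical helper: returns undefined when the transducer leaves the key unchanged.
// Assumption: transducer.process(text) returns the rewritten text.
function lookup(transducer, key) {
    const value = transducer.process(key);
    return value === key ? undefined : value;
}

// Stand-in object used only for this illustration; a real transducer would be
// created by the library's factory functions.
const fakeTransducer = {
    process: (text) => (text === 'Sofia' ? 'Bulgaria' : text)
};

console.log(lookup(fakeTransducer, 'Sofia'));    // prints "Bulgaria"
console.log(lookup(fakeTransducer, 'Atlantis')); // prints "undefined"
```

This sketch has two limits. A key stored with itself as its value is reported as missing. And since the transducer rewrites matches anywhere in the text, a key that merely contains a stored key comes back partially rewritten rather than as undefined.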
Use with TypeScript
import * as ssfst from 'ssfst';
Run Locally
git clone https://github.com/deniskyashif/ssfst.git
cd ssfst
npm i
Sample implementations can be found at examples/.
Run the Tests
npm t
References
This implementation follows the construction presented in "Efficient Dictionary-Based Text Rewriting using Subsequential Transducers" by S. Mihov and K. Schulz.