align paired sequences automatically
autoalign learns to align a list of paired but unaligned sequences by alternating between estimating a cost function based on the alignment statistics, and finding the optimal alignment with respect to that estimated cost function; in other words, it learns an edit distance from data, starting with a guess and iterating until convergence:
0. guess initial cost function
🡓
1. align sequences using cost function 🡐
🡓
2. estimate co-ocurrence statistics 🡑
🡓
3. build cost function from statistics 🡑
🡓
4. compute avg alignment cost 🡑
🡓
5. if no improvement go to 6, otherwise 1 🡒
🡓
6. returned aligned pairs
Install
npm i autoalign
to uninstall:
npm un autoalign
Usage
CLI
$ ./node_modules/.bin/autoalign -h Usage: autoalign [options] Options: -V, --version output the version number -i --input <file> input file; format: LEFT, RIGHT with white-space separated symbols -m --machine [file] machine-readable output; format: LEFT, RIGHT, SCORE -u --human [file] human-readable output, format: SCORE LEFT \n\r RIGHT -h, --help output usage information
example using pronunciation dictionary Britfone as input:
$ ./node_modules/.bin/autoalign -i britfone.csv -m aligned.csv -u aligned.txt [1:30:04 PM] read 15861 pairs from britfone.csv[1:30:05 PM] avg.unnorm.cost: 2.2119283496010733[1:30:06 PM] avg.unnorm.cost: 1.3996218560070568[1:30:07 PM] avg.unnorm.cost: 1.383950761182861[1:30:08 PM] avg.unnorm.cost: 1.3785348695282842[1:30:09 PM] avg.unnorm.cost: 1.3784167544364208[1:30:10 PM] avg.unnorm.cost: 1.3784167544364208[1:30:11 PM] wrote 15861 alignments to aligned.csv[1:30:11 PM] wrote 15861 alignments to aligned.txt
britfone.csv
... ... ...
S O L V I N G, s ɒ l v ɪ ŋ
S O M A, s əʊ m ə
S O M A L I A, s əʊ m ɑː l ɪə
S O M E, s ɐ m
... ... ...
aligned.csv
... .... ...
S O L V I N G, s ɒ l v ɪ · ŋ, 0.8138013477578335
S O M A, s əʊ m ə, 0.8147196394392467
S O M A L I A, s əʊ m ɑː l · ɪə, 0.7530844100226581
S O M E, s ɐ m ·, 0.7520632287232023
... .... ...
aligned.txt
... ... ...
0.81 S O L V I N G
s ɒ l v ɪ · ŋ
0.81 W I F E
w aɪ f ·
0.81 P R O T E I N S
p ɹ əʊ t · iː n z
0.81 F O L D E R S
f əʊ l d · ə z
... ... ...
API
> aa = GAP: Getter BY_SCORE: Function BY_ALPHA: Function SEP: Getter UNSEEN: Getter Stats: Getter vocab_sizes: Getter joint_stats_from_alignment: Getter smoothed_stats: Getter non_zero_smoothed_stats: Getter with_zero_smoothing: Getter alignment_cost_fn_estimator: Getter padding_cost_fn_estimator: Getter uniform_cost_fn_estimator: Getter pmi_cost_fn_from: Getter npmi: Getter pmi: Getter autoalign: Function csv_text_from_alignments: Function pretty_text_from_alignments: Function > pairs = 'SOLVING' 'sɒlvɪŋ'... 'SOMA' 's''əʊ''m''ə'... 'SOMALIA' 's''əʊ''ɑː''l''ɪə'... 'SOME' 'sɐm' 'SOLVING' 'sɒlvɪŋ' 'SOMA' 's' 'əʊ' 'm' 'ə' 'SOMALIA' 's' 'əʊ' 'ɑː' 'l' 'ɪə' 'SOME' 'sɐm' > aa 'S' 'O' 'L' 'V' 'I' 'N' 'G' 's' 'ɒ' 'l' 'v' '·' 'ɪ' 'ŋ' 09007558120763637 'S' 'O' 'M' 'A' 's' 'əʊ' 'm' 'ə' 08754834816104571 'S' 'O' 'M' 'A' 'L' 'I' 'A' 's' 'əʊ' '·' 'ɑː' 'l' '·' 'ɪə' 08098602659026892 'S' 'O' 'M' 'E' 's' 'ɐ' 'm' '·' 08158098631621633
Browser
autoalign
will be available as a global
Background
Autoalign uses the agreggated statistics of current alignments, computed with dynamic programming, in order to compute a better cost function for the next alignment pass. Given two aligned strings, it counts the co-ocurrences between the aligned symbols, adding those up over all alignments in the dataset. For instance these two alignments:
P R O T E I N S
p ɹ əʊ t · iː n z
F O L D E R S
f əʊ l d · ə z
will result in co-occurence counts:
P:p = 1
R:ɹ = 1
O:əʊ = 2
T:t = 1
E:· = 2
I:iː = 1
N:n = 1
S:z = 2
F:f = 1
L:l = 1
D:d = 1
R:ə = 1
These statistics are first smoothed using good-turing estimation to avoid zero counts. Then they are used to estimate the strength of association between the symbols in the left (up) and right (down) alphabets (letters and phonemes in this example). This strength is measured with a normalised pointwise mutual information(NPMI) (x is left and y is right):
where pmi(x;y) is
and h(x, y) is the negative logarithm of the joint probability of the left and right symbols
(This use of pmi is analogous to log-odds ratio scoring in bioinformatics[1]).
NPMI's range is [-1, 1] and so it needs renormalising as the edit distance alignment costs must be non-negative. The end result is the final cost between the symbols.
Once the new cost function is estimated, the sequences are realigned and the process starts over again, stopping once the alignment cost for the whole dataset stops decreasing. Non-increase and convergence to a local minimum are guaranted as this is an instance of the Expectation Maximisation algorithm, similar to Viterbi Training and forced alignment in Speech Recognition.
autoalign is limited in three ways:
-
it only matches one symbol in the left alphabet to one symbol in the right alphabet alignment. This one-to-one alignment is inaccure for orthographic/phonetic sequences as phones and letters are often in a many-to-many correspondence. For tools that don't have this limitation see Related
-
it only uses the optimal alignment(s) to compute the counts, this misses the contribution of less likely aligments and makes the algorithm very sentitive to initial conditions[2][3]. For tools that use the full set of paths see Related
-
it can only compute monotonic alignments. This is not a problem for most linguistic data, with a notable exception being machine translation
[1] Biological Sequence Analysis, Durbin, Eddy, Krogh & Mitchison, 1998.
[2] Applying Many-to-Many Alignments and Hidden Markov Models to Letter-to-Phoneme Conversion, Jiampojamarn, Kondrak & Sherif, 2007.
[3] Learning String Distance, Ristad & Yianilos, 1998.
NPM tasks
test
npm test
Other install options
global npm install:
npm i -g autoalignnpm ln autoalign //ensures module is availabe to nodejs
then cli is, e.g. autoalign -i britfone.csv -m aligned.csv -u aligned.txt
and npm un -g autoalign
to uninstall
library-only git install:
git clone git@github.com:josellarena/autoalignjs.gitcd autoalignjs
then in node require('./autoalign)
and in the browser <script src='autoalign.js'></script>
Related
M2MAligner - A HMM-based aligner that can also do many-to-many alignments
AltFstAligner - Another HMM-based aligner, an intended replacement for M2MAligner
See Also
Britfone - a British English phonetic dictionary.
In order to autoalign Britfone, the word column needs to be processed such that letters are separated by whitespace
Changelog
see Changelog
License
MIT © Jose Llarena