Levenshtein Edit Distance Between Strings
How can you measure the distance between two words? How can you find the closest match to a word in a list of words? The Levenshtein distance between two strings is the number of single character deletions, insertions, or substitutions required to transform one string into the other. This is also known as the edit distance.
Vladimir Levenshtein is a Russian mathematician who published this notion in 1966. I am using his distance measure in a project that I will describe in a future post. Other applications include optical character recognition (OCR), spell checking, and DNA sequencing.
I learned about Levenshtein distance from the Wikipedia page.
Contents
"Hi ho, hi ho"
For my examples, I will use these seven famous names.
T = {"Doc","Grumpy","Happy","Sleepy","Bashful","Sneezy","Dopey"}
T = 1×7 cell array Columns 1 through 5 ["Doc"] ["Grumpy"] ["Happy"] ["Sleepy"] ["Bashful"] Columns 6 through 7 ["Sneezy"] ["Dopey"]
The strings "Sleepy" and "Sneezy" are close to each other because they are the same length, and we can transform "Sleepy" to "Sneezy" by two edit operations, change 'l' to 'n' and 'p' to 'z'. The Levenshtein distance between these two words is 2.
On the other hand, "Bashful" is not close to his friends. His name is longer and the only letter he shares with another is an 'a' with "Happy". So his distance to "Happy" is 6, while the distance to any of the others is 7, the length of his name.
Here is the matrix of pair-wise distances
for i = 1:7 for j = 1:7 D(i,j) = lev(T{i},T{j}); end end D
D = 0 6 5 6 7 6 3 6 0 4 4 7 5 5 5 4 0 4 6 5 3 6 4 4 0 7 2 4 7 7 6 7 0 7 7 6 5 5 2 7 0 4 3 5 3 4 7 4 0
Bashful's column is all 7's, except for a 6 in Happy's row and a 0 on the diagonal.
A bar graph of the total distances shows Bashful's name is the furthest from all the others, and that Dopey's is the closest, but just barely.
bar(sum(D)) title('Total Levenshtein Distance') set(gca,'xticklabels',T)
Recursive
Here is a recursive program that provides one way to compute lev(s,t). It compares each character in a string with all of the characters in the other string. For each comparison, a cost c is added to a total that is accumulated by the recursion. The cost of one comparison is 0 if the pair is a match and 1 if it isn't.
type levr
function d = levr(s,t,m,n) % Levenshtein distance between strings, recursive implementation. % levr(s,t) is the number of deletions, insertions, % or substitutions required to transform s to t. % m = length(s), n = length(t), levr(s,t,m,n) initiates recursion. % https://en.wikipedia.org/wiki/Levenshtein_distance if nargin == 2 s = char(s); t = char(t); d = levr(s,t,length(s),length(t)); elseif m == 0 d = n; elseif n == 0 d = m; else c = s(m) ~= t(n); % c = 0 if chars match, 1 if not. d = min([levr(s,t,m-1,n) + 1 levr(s,t,m,n-1) + 1 levr(s,t,m-1,n-1) + c]); end end
Like all recursive programs, this code is impractical to use in practice with long strings or long lists of words because it repeatedly compares the same pairs of characters. The complexity is exponential in the lengths of the strings.
Matrix Memory
A time-memory tradeoff can be made by allocating storage to save the results of individual comparisons. The memory involved is a matrix of size (m+1)-by-(n+1) where m and n are the lengths of the two strings, so it's O(m*n). The time complexity is also O(m*n).
type levm
function d = levm(s,t) % Levenshtein distance between strings, matrix implementation. % levr(s,t) is the number of deletions, insertions, % or substitutions required to transform s to t. % https://en.wikipedia.org/wiki/Levenshtein_distance s = char(s); t = char(t); m = length(s); n = length(t); D = zeros(m+1,n+1); D(:,1) = (0:m)'; D(1,:) = (0:n); for i = 1:m for j = 1:n c = s(i) ~= t(j); % c = 0 if chars match, 1 if not. D(i+1,j+1) = min([D(i,j+1) + 1 D(i+1,j) + 1 D(i,j) + c]); end end levm_print(s,t,D) d = D(m+1,n+1); end
Details
I've included a print routine so we can see some detail. Let's begin by finding the distance between a single letter and a word that doesn't contain that letter. The distance is the length n of the word because one substitution and n-1 deletions are required.
d = levm("S","Dopey")
D o p e y S 1 2 3 4 5 d = 5
Now let's have one character match. In this case the character is 'e'.
d = levm("Sle","Dopey")
D o p e y S 1 2 3 4 5 l 2 2 3 4 5 e 3 3 3 3 4 d = 4
Finally, two full words. The distance is the last entry in the last row or column of the matrix.
d = levm("Sleepy","Dopey")
D o p e y S 1 2 3 4 5 l 2 2 3 4 5 e 3 3 3 3 4 e 4 4 4 3 4 p 5 5 4 4 4 y 6 6 5 5 4 d = 4
Best Program
We don't need storage for the whole matrix, just two rows. The storage cost is now linear in the lengths of the strings. This is the most efficient of my three functions.
type lev
function d = lev(s,t) % Levenshtein distance between strings or char arrays. % lev(s,t) is the number of deletions, insertions, % or substitutions required to transform s to t. % https://en.wikipedia.org/wiki/Levenshtein_distance s = char(s); t = char(t); m = length(s); n = length(t); x = 0:n; y = zeros(1,n+1); for i = 1:m y(1) = i; for j = 1:n c = (s(i) ~= t(j)); % c = 0 if chars match, 1 if not. y(j+1) = min([y(j) + 1 x(j+1) + 1 x(j) + c]); end % swap [x,y] = deal(y,x); end d = x(n+1); end
Metric Space
The function lev makes the set of strings a metric space. That's because of four properties. The distance from an element to itself is zero.
d(x,x) = 0
Otherwise the distance is positive.
d(x,y) > 0 if x ~= y
Distance is symmetric.
d(x,y) = d(y,x)
The triangle inequality, for any three elements,
d(x,y) <= d(x,w) + d(w,y)
44 Programming Languages
I usually avoid programming language debates, but here are implementations in 44 different programming languages, including one in MATLAB. I prefer my own code.
コメント
コメントを残すには、ここ をクリックして MathWorks アカウントにサインインするか新しい MathWorks アカウントを作成します。