Levenshtein Edit Distance Between Strings

How can you measure the distance between two words? How can you find the closest match to a word in a list of words? The Levenshtein distance between two strings is the number of single character deletions, insertions, or substitutions required to transform one string into the other. This is also known as the edit distance.

Vladimir Levenshtein is a Russian mathematician who published this notion in 1966. I am using his distance measure in a project that I will describe in a future post. Other applications include optical character recognition (OCR), spell checking, and DNA sequencing.

I learned about Levenshtein distance from the Wikipedia page.

Contents

"Hi ho, hi ho"

For my examples, I will use these seven famous names.

   T = {"Doc","Grumpy","Happy","Sleepy","Bashful","Sneezy","Dopey"}
T =
  1×7 cell array
  Columns 1 through 5
    ["Doc"]    ["Grumpy"]    ["Happy"]    ["Sleepy"]    ["Bashful"]
  Columns 6 through 7
    ["Sneezy"]    ["Dopey"]

The strings "Sleepy" and "Sneezy" are close to each other because they are the same length, and we can transform "Sleepy" to "Sneezy" by two edit operations, change 'l' to 'n' and 'p' to 'z'. The Levenshtein distance between these two words is 2.

On the other hand, "Bashful" is not close to his friends. His name is longer and the only letter he shares with another is an 'a' with "Happy". So his distance to "Happy" is 6, while the distance to any of the others is 7, the length of his name.

Here is the matrix of pair-wise distances

   for i = 1:7
       for j = 1:7
           D(i,j) = lev(T{i},T{j});
       end
   end
   D
D =
     0     6     5     6     7     6     3
     6     0     4     4     7     5     5
     5     4     0     4     6     5     3
     6     4     4     0     7     2     4
     7     7     6     7     0     7     7
     6     5     5     2     7     0     4
     3     5     3     4     7     4     0

Bashful's column is all 7's, except for a 6 in Happy's row and a 0 on the diagonal.

A bar graph of the total distances shows Bashful's name is the furthest from all the others, and that Dopey's is the closest, but just barely.

   bar(sum(D))
   title('Total Levenshtein Distance')
   set(gca,'xticklabels',T)

Recursive

Here is a recursive program that provides one way to compute lev(s,t). It compares each character in a string with all of the characters in the other string. For each comparison, a cost c is added to a total that is accumulated by the recursion. The cost of one comparison is 0 if the pair is a match and 1 if it isn't.

   type levr

function d = levr(s,t,m,n)
% Levenshtein distance between strings, recursive implementation.
% levr(s,t) is the number of deletions, insertions,
% or substitutions required to transform s to t.
% m = length(s), n = length(t), levr(s,t,m,n) initiates recursion.
% https://en.wikipedia.org/wiki/Levenshtein_distance

    if nargin == 2
        s = char(s);
        t = char(t);
        d = levr(s,t,length(s),length(t));
    elseif m == 0
        d = n;
    elseif n == 0
        d = m;
    else
        c = s(m) ~= t(n); % c = 0 if chars match, 1 if not.
        d = min([levr(s,t,m-1,n) + 1
                 levr(s,t,m,n-1) + 1
                 levr(s,t,m-1,n-1) + c]);
    end
end
        
        

Like all recursive programs, this code is impractical to use in practice with long strings or long lists of words because it repeatedly compares the same pairs of characters. The complexity is exponential in the lengths of the strings.

Matrix Memory

A time-memory tradeoff can be made by allocating storage to save the results of individual comparisons. The memory involved is a matrix of size (m+1)-by-(n+1) where m and n are the lengths of the two strings, so it's O(m*n). The time complexity is also O(m*n).

    type levm

function d = levm(s,t)
% Levenshtein distance between strings, matrix implementation.
% levr(s,t) is the number of deletions, insertions,
% or substitutions required to transform s to t.
% https://en.wikipedia.org/wiki/Levenshtein_distance

    s = char(s);
    t = char(t);
    m = length(s);
    n = length(t);
    D = zeros(m+1,n+1);
    D(:,1) = (0:m)';
    D(1,:) = (0:n);
    for i = 1:m
        for j = 1:n
            c = s(i) ~= t(j); % c = 0 if chars match, 1 if not.
            D(i+1,j+1) = min([D(i,j+1) + 1
                              D(i+1,j) + 1
                              D(i,j)  +  c]);
        end
    end
    levm_print(s,t,D)    
    d = D(m+1,n+1);
end          

Details

I've included a print routine so we can see some detail. Let's begin by finding the distance between a single letter and a word that doesn't contain that letter. The distance is the length n of the word because one substitution and n-1 deletions are required.

    d = levm("S","Dopey")
         D    o    p    e    y
    S    1    2    3    4    5

d =
     5

Now let's have one character match. In this case the character is 'e'.

    d = levm("Sle","Dopey")
         D    o    p    e    y
    S    1    2    3    4    5
    l    2    2    3    4    5
    e    3    3    3    3    4

d =
     4

Finally, two full words. The distance is the last entry in the last row or column of the matrix.

    d = levm("Sleepy","Dopey")
         D    o    p    e    y
    S    1    2    3    4    5
    l    2    2    3    4    5
    e    3    3    3    3    4
    e    4    4    4    3    4
    p    5    5    4    4    4
    y    6    6    5    5    4

d =
     4

Best Program

We don't need storage for the whole matrix, just two rows. The storage cost is now linear in the lengths of the strings. This is the most efficient of my three functions.

   type lev
function d = lev(s,t)
% Levenshtein distance between strings or char arrays.
% lev(s,t) is the number of deletions, insertions,
% or substitutions required to transform s to t.
% https://en.wikipedia.org/wiki/Levenshtein_distance

    s = char(s);
    t = char(t);
    m = length(s);
    n = length(t);
    x = 0:n;
    y = zeros(1,n+1);   
    for i = 1:m
        y(1) = i;
        for j = 1:n
            c = (s(i) ~= t(j)); % c = 0 if chars match, 1 if not.
            y(j+1) = min([y(j) + 1
                          x(j+1) + 1
                          x(j) + c]);
        end
        % swap
        [x,y] = deal(y,x);
    end
    d = x(n+1);
end
        
        

Metric Space

The function lev makes the set of strings a metric space. That's because of four properties. The distance from an element to itself is zero.

d(x,x) = 0

Otherwise the distance is positive.

d(x,y) > 0 if x ~= y

Distance is symmetric.

d(x,y) = d(y,x)

The triangle inequality, for any three elements,

d(x,y) <= d(x,w) + d(w,y)

44 Programming Languages

I usually avoid programming language debates, but here are implementations in 44 different programming languages, including one in MATLAB. I prefer my own code.




Published with MATLAB® R2017a

|
  • print

コメント

コメントを残すには、ここ をクリックして MathWorks アカウントにサインインするか新しい MathWorks アカウントを作成します。