# Levenshtein Edit Distance Between Strings

How can you measure the distance between two words? How can you find the closest match to a word in a list of words? The Levenshtein distance between two strings is the number of single character deletions, insertions, or substitutions required to transform one string into the other. This is also known as the *edit distance*.

Vladimir Levenshtein is a Russian mathematician who published this notion in 1966. I am using his distance measure in a project that I will describe in a future post. Other applications include optical character recognition (OCR), spell checking, and DNA sequencing.

I learned about Levenshtein distance from the Wikipedia page.

### Contents

#### "Hi ho, hi ho"

For my examples, I will use these seven famous names.

T = {"Doc","Grumpy","Happy","Sleepy","Bashful","Sneezy","Dopey"}

T = 1×7 cell array Columns 1 through 5 ["Doc"] ["Grumpy"] ["Happy"] ["Sleepy"] ["Bashful"] Columns 6 through 7 ["Sneezy"] ["Dopey"]

The strings "Sleepy" and "Sneezy" are close to each other because they are the same length, and we can transform "Sleepy" to "Sneezy" by two edit operations, change 'l' to 'n' and 'p' to 'z'. The Levenshtein distance between these two words is 2.

On the other hand, "Bashful" is not close to his friends. His name is longer and the only letter he shares with another is an 'a' with "Happy". So his distance to "Happy" is 6, while the distance to any of the others is 7, the length of his name.

Here is the matrix of pair-wise distances

for i = 1:7 for j = 1:7 D(i,j) = lev(T{i},T{j}); end end D

D = 0 6 5 6 7 6 3 6 0 4 4 7 5 5 5 4 0 4 6 5 3 6 4 4 0 7 2 4 7 7 6 7 0 7 7 6 5 5 2 7 0 4 3 5 3 4 7 4 0

Bashful's column is all 7's, except for a 6 in Happy's row and a 0 on the diagonal.

A bar graph of the total distances shows Bashful's name is the furthest from all the others, and that Dopey's is the closest, but just barely.

bar(sum(D)) title('Total Levenshtein Distance') set(gca,'xticklabels',T)

#### Recursive

Here is a recursive program that provides one way to compute `lev(s,t)`. It compares each character in a string with all of the characters in the other string. For each comparison, a cost `c` is added to a total that is accumulated by the recursion. The cost of one comparison is 0 if the pair is a match and 1 if it isn't.

```
type levr
```

function d = levr(s,t,m,n) % Levenshtein distance between strings, recursive implementation. % levr(s,t) is the number of deletions, insertions, % or substitutions required to transform s to t. % m = length(s), n = length(t), levr(s,t,m,n) initiates recursion. % https://en.wikipedia.org/wiki/Levenshtein_distance if nargin == 2 s = char(s); t = char(t); d = levr(s,t,length(s),length(t)); elseif m == 0 d = n; elseif n == 0 d = m; else c = s(m) ~= t(n); % c = 0 if chars match, 1 if not. d = min([levr(s,t,m-1,n) + 1 levr(s,t,m,n-1) + 1 levr(s,t,m-1,n-1) + c]); end end

Like all recursive programs, this code is impractical to use in practice with long strings or long lists of words because it repeatedly compares the same pairs of characters. The complexity is exponential in the lengths of the strings.

#### Matrix Memory

A time-memory tradeoff can be made by allocating storage to save the results of individual comparisons. The memory involved is a matrix of size (m+1)-by-(n+1) where m and n are the lengths of the two strings, so it's O(m*n). The time complexity is also O(m*n).

```
type levm
```

function d = levm(s,t) % Levenshtein distance between strings, matrix implementation. % levr(s,t) is the number of deletions, insertions, % or substitutions required to transform s to t. % https://en.wikipedia.org/wiki/Levenshtein_distance s = char(s); t = char(t); m = length(s); n = length(t); D = zeros(m+1,n+1); D(:,1) = (0:m)'; D(1,:) = (0:n); for i = 1:m for j = 1:n c = s(i) ~= t(j); % c = 0 if chars match, 1 if not. D(i+1,j+1) = min([D(i,j+1) + 1 D(i+1,j) + 1 D(i,j) + c]); end end levm_print(s,t,D) d = D(m+1,n+1); end

#### Details

I've included a print routine so we can see some detail. Let's begin by finding the distance between a single letter and a word that doesn't contain that letter. The distance is the length n of the word because one substitution and n-1 deletions are required.

d = levm("S","Dopey")

D o p e y S 1 2 3 4 5 d = 5

Now let's have one character match. In this case the character is 'e'.

d = levm("Sle","Dopey")

D o p e y S 1 2 3 4 5 l 2 2 3 4 5 e 3 3 3 3 4 d = 4

Finally, two full words. The distance is the last entry in the last row or column of the matrix.

d = levm("Sleepy","Dopey")

D o p e y S 1 2 3 4 5 l 2 2 3 4 5 e 3 3 3 3 4 e 4 4 4 3 4 p 5 5 4 4 4 y 6 6 5 5 4 d = 4

#### Best Program

We don't need storage for the whole matrix, just two rows. The storage cost is now linear in the lengths of the strings. This is the most efficient of my three functions.

```
type lev
```

function d = lev(s,t) % Levenshtein distance between strings or char arrays. % lev(s,t) is the number of deletions, insertions, % or substitutions required to transform s to t. % https://en.wikipedia.org/wiki/Levenshtein_distance s = char(s); t = char(t); m = length(s); n = length(t); x = 0:n; y = zeros(1,n+1); for i = 1:m y(1) = i; for j = 1:n c = (s(i) ~= t(j)); % c = 0 if chars match, 1 if not. y(j+1) = min([y(j) + 1 x(j+1) + 1 x(j) + c]); end % swap [x,y] = deal(y,x); end d = x(n+1); end

#### Metric Space

The function `lev` makes the set of strings a *metric space*. That's because of four properties. The distance from an element to itself is zero.

d(x,x) = 0

Otherwise the distance is positive.

```
d(x,y) > 0 if x ~= y
```

Distance is *symmetric*.

d(x,y) = d(y,x)

The triangle inequality, for any three elements,

d(x,y) <= d(x,w) + d(w,y)

#### 44 Programming Languages

I usually avoid programming language debates, but here are implementations in 44 different programming languages, including one in MATLAB. I prefer my own code.

## 댓글

댓글을 남기려면 링크 를 클릭하여 MathWorks 계정에 로그인하거나 계정을 새로 만드십시오.