本文主要是介绍Levenshtein Distance 算法,希望对大家解决编程问题提供一定的参考价值,需要的开发者们随着小编来一起学习吧!
编辑距离就是用来计算从原串(s)转换到目标串(t)所需要的最少的插入,删除和替换的数目,在NLP中应用比较广泛,如一些评测方法中就用到了(wer,mWer等),同时也常用来计算你对原文本所作的改动数。
编辑距离的算法是首先由俄国科学家Levenshtein提出的,故又叫Levenshtein Distance。
Levenshtein distance (LD) is a measure of the similarity between two strings, which we will refer to as the source string (s) and the target string (t). The distance is the number of deletions, insertions, or substitutions required to transform s into t. For example,
- If s is "test" and t is "test", then LD(s,t) = 0, because no transformations are needed. The strings are already identical.
- If s is "test" and t is "tent", then LD(s,t) = 1, because one substitution (change "s" to "n") is sufficient to transform s into t.
The greater the Levenshtein distance, the more different the strings are.
Levenshtein distance is named after the Russian scientist Vladimir Levenshtein, who devised the algorithm in 1965. If you can't spell or pronounce Levenshtein, the metric is also sometimes called edit distance.
The Levenshtein distance algorithm has been used in:
- Spell checking
- Speech recognition
- DNA analysis
- Plagiarism detection
The Algorithm
Steps
Step | Description |
---|---|
1 | Set n to be the length of s. Set m to be the length of t. If n = 0, return m and exit. If m = 0, return n and exit. Construct a matrix containing 0..m rows and 0..n columns. |
2 | Initialize the first row to 0..n. Initialize the first column to 0..m. |
3 | Examine each character of s (i from 1 to n). |
4 | Examine each character of t (j from 1 to m). |
5 | If s[i] equals t[j], the cost is 0. If s[i] doesn't equal t[j], the cost is 1. |
6 | Set cell d[i,j] of the matrix equal to the minimum of: a. The cell immediately above plus 1: d[i-1,j] + 1. b. The cell immediately to the left plus 1: d[i,j-1] + 1. c. The cell diagonally above and to the left plus the cost: d[i-1,j-1] + cost. |
7 | After the iteration steps (3, 4, 5, 6) are complete, the distance is found in cell d[n,m]. |
Example
This section shows how the Levenshtein distance is computed when the source string is "GUMBO" and the target string is "GAMBOL".
Steps 1 and 2
G | U | M | B | O | ||
0 | 1 | 2 | 3 | 4 | 5 | |
G | 1 | |||||
A | 2 | |||||
M | 3 | |||||
B | 4 | |||||
O | 5 | |||||
L | 6 |
Steps 3 to 6 When i = 1
G | U | M | B | O | ||
0 | 1 | 2 | 3 | 4 | 5 | |
G | 1 | 0 | ||||
A | 2 | 1 | ||||
M | 3 | 2 | ||||
B | 4 | 3 | ||||
O | 5 | 4 | ||||
L | 6 | 5 |
Steps 3 to 6 When i = 2
G | U | M | B | O | ||
0 | 1 | 2 | 3 | 4 | 5 | |
G | 1 | 0 | 1 | |||
A | 2 | 1 | 1 | |||
M | 3 | 2 | 2 | |||
B | 4 | 3 | 3 | |||
O | 5 | 4 | 4 | |||
L | 6 | 5 | 5 |
Steps 3 to 6 When i = 3
G | U | M | B | O | ||
0 | 1 | 2 | 3 | 4 | 5 | |
G | 1 | 0 | 1 | 2 | ||
A | 2 | 1 | 1 | 2 | ||
M | 3 | 2 | 2 | 1 | ||
B | 4 | 3 | 3 | 2 | ||
O | 5 | 4 | 4 | 3 | ||
L | 6 | 5 | 5 | 4 |
Steps 3 to 6 When i = 4
G | U | M | B | O | ||
0 | 1 | 2 | 3 | 4 | 5 | |
G | 1 | 0 | 1 | 2 | 3 | |
A | 2 | 1 | 1 | 2 | 3 | |
M | 3 | 2 | 2 | 1 | 2 | |
B | 4 | 3 | 3 | 2 | 1 | |
O | 5 | 4 | 4 | 3 | 2 | |
L | 6 | 5 | 5 | 4 | 3 |
Steps 3 to 6 When i = 5
G | U | M | B | O | ||
0 | 1 | 2 | 3 | 4 | 5 | |
G | 1 | 0 | 1 | 2 | 3 | 4 |
A | 2 | 1 | 1 | 2 | 3 | 4 |
M | 3 | 2 | 2 | 1 | 2 | 3 |
B | 4 | 3 | 3 | 2 | 1 | 2 |
O | 5 | 4 | 4 | 3 | 2 | 1 |
L | 6 | 5 | 5 | 4 | 3 | 2 |
算法示例1:
private int ComputeDistance (string s, string t)
{
int n=s.Length;
int m=t.Length;
int[,] distance=new int[n + 1, m + 1]; // matrix
int cost=0;
if(n == 0) return m;
if(m == 0) return n;
//init1
for(int i=0; i <= n; distance[i, 0]=i++);
for(int j=0; j <= m; distance[0, j]=j++);
//find min distance
for(int i=1; i <= n; i++)
{
for(int j=1; j <= m;j++)
{
cost=(t.Substring(j - 1, 1) ==
s.Substring(i - 1, 1) ? 0 : 1);
distance[i,j]=Min3(distance[i - 1, j] + 1,
distance[i, j - 1] + 1,
distance[i - 1, j - 1] + cost);
}
}
return distance[n, m];
}
算法示例2:
private int Levenshtein(string str1, string str2)
{
int n = str1.Length;
int m = str2.Length;
int i; //遍历str1的
int j; //遍历str2的
char ch1; //str1的
char ch2; //str2的
int temp; //记录相同字符,在某个矩阵位置值的增量,不是0就是1
if(n == 0)
{
return m;
}
if(m == 0)
{
return n;
}
int[,] d = new int[n+1,m+1];
for(i=0; i<=n; i++)
{ //初始化第一列
d[i,0] = i;
}
for(j=0; j<=m; j++)
{ //初始化第一行
d[0,j] = j;
}
for(i=1; i<=n; i++)
{ //遍历str1
ch1 = str1[i-1];
//去匹配str2
for(j=1; j<=m; j++)
{
ch2 = str2[j-1];
if(ch1 == ch2)
{
temp = 0;
} else
{
temp = 1;
}
//左边+1,上边+1, 左上角+temp取最小
d[i, j] = Min(d[i - 1, j] + 1, d[i, j - 1] + 1, d[i - 1, j - 1] + temp);
}
}
return d[n,m];
}
private int Min(int one, int two, int three)
{
int min = one;
if (two < min)
{
min = two;
}
if (three < min)
{
min = three;
}
return min;
}
private double Sim(String str1, String str2)
{
int ld = Levenshtein(str1, str2);
return 1 - (double)ld / Math.Max(str1.Length, str2.Length);
}
算法示例3:空間復雜度從O(n*m)降到O(2m)
///*****************************
/// Compute Levenshtein distance
/// Memory efficient version
///*****************************
public int iLD(String sRow, String sCol)
{
int RowLen = sRow.Length; // length of sRow
int ColLen = sCol.Length; // length of sCol
int RowIdx; // iterates through sRow
int ColIdx; // iterates through sCol
char Row_i; // ith character of sRow
char Col_j; // jth character of sCol
int cost; // cost
/// Test string length
if (Math.Max(sRow.Length, sCol.Length) > Math.Pow(2, 31))
throw (new Exception("/nMaximum string length in Levenshtein.iLD is " + Math.Pow(2, 31) + "./nYours is " + Math.Max(sRow.Length, sCol.Length) + "."));
// Step 1
if (RowLen == 0)
{
return ColLen;
}
if (ColLen == 0)
{
return RowLen;
}
/// Create the two vectors
int[] v0 = new int[RowLen + 1];
int[] v1 = new int[RowLen + 1];
int[] vTmp;
/// Step 2
/// Initialize the first vector
for (RowIdx = 1; RowIdx <= RowLen; RowIdx++)
{
v0[RowIdx] = RowIdx;
}
// Step 3
/// Fore each column
for (ColIdx = 1; ColIdx <= ColLen; ColIdx++)
{
/// Set the 0'th element to the column number
v1[0] = ColIdx;
Col_j = sCol[ColIdx - 1];
// Step 4
/// Fore each row
for (RowIdx = 1; RowIdx <= RowLen; RowIdx++)
{
Row_i = sRow[RowIdx - 1];
// Step 5
if (Row_i == Col_j)
{
cost = 0;
}
else
{
cost = 1;
}
// Step 6
/// Find minimum
int m_min = v0[RowIdx] + 1;
int b = v1[RowIdx - 1] + 1;
int c = v0[RowIdx - 1] + cost;
if (b < m_min)
{
m_min = b;
}
if (c < m_min)
{
m_min = c;
}
v1[RowIdx] = m_min;
}
/// Swap the vectors
vTmp = v0;
v0 = v1;
v1 = vTmp;
}
// Step 7
/// Value between 0 - 100
/// 0==perfect match 100==totaly different
///
/// The vectors where swaped one last time at the end of the last loop,
/// that is why the result is now in v0 rather than in v1
System.Console.WriteLine("iDist=" + v0[RowLen]);
int max = System.Math.Max(RowLen, ColLen);
return ((100 * v0[RowLen]) / max);
}
From:http://hi.baidu.com/xining52113339/blog/item/8a23f1388ddfc523b9998f47.html
http://hi.baidu.com/pecefull0513/blog/item/a746ca1a292b9c118618bfbd.html
http://www.codeproject.com/KB/recipes/improvestringsimilarity.aspx
http://en.wikipedia.org/wiki/Levenshtein_distance
http://www.codeproject.com/KB/recipes/Levenshtein.aspx
这篇关于Levenshtein Distance 算法的文章就介绍到这儿,希望我们推荐的文章对编程师们有所帮助!