Levenshtein Distance Algorithm

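Edit distance is the minimum number of insertions, deletions, and substitutions needed to transform a source string (s) into a target string (t). It is widely used in NLP, for example in evaluation metrics such as WER and mWER, and is also commonly used to count the changes made to an original text.
The edit distance algorithm was first proposed by the Russian scientist Levenshtein, so it is also called Levenshtein distance.
Levenshtein distance (LD) is a measure of the similarity between two strings, which we will refer to as the source string (s) and the target string (t). The distance is the minimum number of deletions, insertions, or substitutions required to transform s into t. For example,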

If s is "test" and t is "test", then LD(s,t) = 0, because no transformations are needed. The strings are already identical.
If s is "test" and t is "tent", then LD(s,t) = 1, because one substitution (change "s" to "n") is sufficient to transform s into t.

The greater the Levenshtein distance, the more different the strings are.

Levenshtein distance is named after the Russian scientist Vladimir Levenshtein, who devised the algorithm in 1965. If you can't spell or pronounce Levenshtein, the metric is also sometimes called edit distance.

The Levenshtein distance algorithm has been used in:

Spell checking
Speech recognition
DNA analysis
Plagiarism detection

The Algorithm

Steps

Step	Description
1	Set n to be the length of s. Set m to be the length of t. If n = 0, return m and exit. If m = 0, return n and exit. Construct a matrix containing 0..m rows and 0..n columns.
2	Initialize the first row to 0..n. Initialize the first column to 0..m.
3	Examine each character of s (i from 1 to n).
4	Examine each character of t (j from 1 to m).
5	If s[i] equals t[j], the cost is 0. If s[i] doesn't equal t[j], the cost is 1.
6	Set cell d[i,j] of the matrix (column i, row j) equal to the minimum of: a. The cell immediately to the left plus 1: d[i-1,j] + 1. b. The cell immediately above plus 1: d[i,j-1] + 1. c. The cell diagonally above and to the left plus the cost: d[i-1,j-1] + cost.
7	After the iteration steps (3, 4, 5, 6) are complete, the distance is found in cell d[n,m].
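Steps 2 to 7 can also be written as a recurrence. This is only a restatement of the table above, assuming 1-based character positions for s[i] and t[j]:

```latex
\[
d[i,0] = i \quad (0 \le i \le n), \qquad d[0,j] = j \quad (0 \le j \le m)
\]
\[
\mathrm{cost}(i,j) =
\begin{cases}
0 & \text{if } s[i] = t[j] \\
1 & \text{otherwise}
\end{cases}
\]
\[
d[i,j] = \min\bigl( d[i-1,j] + 1,\; d[i,j-1] + 1,\; d[i-1,j-1] + \mathrm{cost}(i,j) \bigr)
\quad (1 \le i \le n,\; 1 \le j \le m)
\]
\[
\mathrm{LD}(s,t) = d[n,m]
\]
```

The three arguments of the minimum are the two neighbouring cells that differ in exactly one index (each charged one edit) and the diagonal neighbour (charged only when the characters differ).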

Example

This section shows how the Levenshtein distance is computed when the source string is "GUMBO" and the target string is "GAMBOL".

Steps 1 and 2

		G	U	M	B	O
	0	1	2	3	4	5
G	1
A	2
M	3
B	4
O	5
L	6

Steps 3 to 6 When i = 1

		G	U	M	B	O
	0	1	2	3	4	5
G	1	0
A	2	1
M	3	2
B	4	3
O	5	4
L	6	5

Steps 3 to 6 When i = 2

		G	U	M	B	O
	0	1	2	3	4	5
G	1	0	1
A	2	1	1
M	3	2	2
B	4	3	3
O	5	4	4
L	6	5	5

Steps 3 to 6 When i = 3

		G	U	M	B	O
	0	1	2	3	4	5
G	1	0	1	2
A	2	1	1	2
M	3	2	2	1
B	4	3	3	2
O	5	4	4	3
L	6	5	5	4

Steps 3 to 6 When i = 4

		G	U	M	B	O
	0	1	2	3	4	5
G	1	0	1	2	3
A	2	1	1	2	3
M	3	2	2	1	2
B	4	3	3	2	1
O	5	4	4	3	2
L	6	5	5	4	3

Steps 3 to 6 When i = 5

		G	U	M	B	O
	0	1	2	3	4	5
G	1	0	1	2	3	4
A	2	1	1	2	3	4
M	3	2	2	1	2	3
B	4	3	3	2	1	2
O	5	4	4	3	2	1
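L	6	5	5	4	3	2

Step 7

The distance is in the lower right-hand corner of the matrix, i.e. 2. This matches the intuition that "GUMBO" can be transformed into "GAMBOL" by substituting "A" for "U" and appending "L" (one substitution and one insertion = 2 changes).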
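The tables above can be reproduced with a short program. The following is a minimal sketch, not part of the original listings: the class name LevenshteinTable and the methods Build and Render are hypothetical names chosen for illustration. The matrix is stored with rows following t and columns following s, so that the printed layout matches the tables.

```csharp
using System;
using System.Text;

public static class LevenshteinTable
{
    // Hypothetical helper: builds the full matrix with rows indexed by t
    // and columns indexed by s, as in the example tables.
    public static int[,] Build(string s, string t)
    {
        int n = s.Length;
        int m = t.Length;
        var d = new int[m + 1, n + 1];

        for (int col = 0; col <= n; col++) d[0, col] = col; // first row: 0..n
        for (int row = 0; row <= m; row++) d[row, 0] = row; // first column: 0..m

        for (int col = 1; col <= n; col++)
        {
            for (int row = 1; row <= m; row++)
            {
                int cost = s[col - 1] == t[row - 1] ? 0 : 1;
                int best = Math.Min(d[row - 1, col] + 1, d[row, col - 1] + 1);
                d[row, col] = Math.Min(best, d[row - 1, col - 1] + cost);
            }
        }
        return d;
    }

    // Hypothetical helper: renders the matrix as tab-separated text.
    public static string Render(string s, string t)
    {
        int[,] d = Build(s, t);
        var sb = new StringBuilder();

        sb.Append('\t');
        foreach (char c in s) sb.Append('\t').Append(c);
        sb.Append('\n');

        for (int row = 0; row <= t.Length; row++)
        {
            if (row > 0) sb.Append(t[row - 1]);
            for (int col = 0; col <= s.Length; col++) sb.Append('\t').Append(d[row, col]);
            sb.Append('\n');
        }
        return sb.ToString();
    }

    public static void Main()
    {
        Console.Write(Render("GUMBO", "GAMBOL"));
    }
}
```

The bottom-right cell of the printed matrix, `Build("GUMBO", "GAMBOL")[6, 5]`, holds the distance.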

Algorithm example 1:

private int ComputeDistance(string s, string t)
{
    int n = s.Length;
    int m = t.Length;
    if (n == 0) return m;
    if (m == 0) return n;

    int[,] distance = new int[n + 1, m + 1]; // (n + 1) x (m + 1) matrix

    // Initialize the first column and the first row.
    for (int i = 0; i <= n; i++) distance[i, 0] = i;
    for (int j = 0; j <= m; j++) distance[0, j] = j;

    // Fill in the rest of the matrix to find the minimum distance.
    for (int i = 1; i <= n; i++)
    {
        for (int j = 1; j <= m; j++)
        {
            int cost = (s[i - 1] == t[j - 1]) ? 0 : 1;
            // Math.Min lives in the System namespace.
            distance[i, j] = Math.Min(
                Math.Min(distance[i - 1, j] + 1, distance[i, j - 1] + 1),
                distance[i - 1, j - 1] + cost);
        }
    }
    return distance[n, m];
}

Algorithm example 2:

        private int Levenshtein(string str1, string str2)
        {
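            int n = str1.Length;
            int m = str2.Length;
            int i;    // index into str1
            int j;    // index into str2
            char ch1;    // current character of str1
            char ch2;    // current character of str2
            int temp;    // substitution cost for the current cell: 0 if the characters match, otherwise 1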
            
            if(n == 0) 
            {
                return m;
            }
            if(m == 0) 
            {
                return n;
            }
            int[,] d = new int[n+1,m+1];

            for(i=0; i<=n; i++) 
            {    // initialize the first column
                d[i,0] = i;
            }
            for(j=0; j<=m; j++) 
            {    // initialize the first row
                d[0,j] = j;
            }

            for(i=1; i<=n; i++) 
            {    // walk through str1
                ch1 = str1[i-1];
                // compare against each character of str2
                for(j=1; j<=m; j++) 
                {
                    ch2 = str2[j-1];
                    if(ch1 == ch2) 
                    {
                        temp = 0;
                    } else
                    {
                        temp = 1;
                    }
                    // take the minimum of: cell above + 1, cell to the left + 1, upper-left diagonal cell + temp
                    d[i, j] = Min(d[i - 1, j] + 1, d[i, j - 1] + 1, d[i - 1, j - 1] + temp);
                }
            }
            return d[n,m];
        }

        private int Min(int one, int two, int three)
        {
            int min = one;
            if (two < min)
            {
                min = two;
            }
            if (three < min)
            {
                min = three;
            }
            return min;
        }

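        private double Sim(String str1, String str2)
        {
            int maxLen = Math.Max(str1.Length, str2.Length);
            if (maxLen == 0)
            {
                return 1;    // two empty strings are identical; avoids dividing 0 by 0
            }
            int ld = Levenshtein(str1, str2);
            return 1 - (double)ld / maxLen;
        }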
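As a worked instance of the formula used in Sim, take the "GUMBO"/"GAMBOL" pair from the example tables: the distance is the bottom-right cell of the final table, 2, and the longer string has 6 characters.

```latex
\[
\mathrm{Sim}(\text{GUMBO}, \text{GAMBOL})
  = 1 - \frac{\mathrm{LD}(s,t)}{\max(|s|, |t|)}
  = 1 - \frac{2}{6}
  \approx 0.667
\]
```

Because the distance can never exceed the length of the longer string, the value stays between 0 and 1: identical strings score 1, and a pair whose distance equals the longer length scores 0.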

Algorithm example 3: space complexity reduced from O(n*m) to O(2m)

///*****************************
        /// Compute Levenshtein distance 
        /// Memory efficient version
        ///*****************************
        public int iLD(String sRow, String sCol)
        {
            int RowLen = sRow.Length;  // length of sRow
            int ColLen = sCol.Length;  // length of sCol
            int RowIdx;                // iterates through sRow
            int ColIdx;                // iterates through sCol
            char Row_i;                // ith character of sRow
            char Col_j;                // jth character of sCol
            int cost;                   // cost

            /// Test string length
            /// (string.Length is an int, so this guard cannot actually trigger)
            if (Math.Max(sRow.Length, sCol.Length) > Math.Pow(2, 31))
                throw (new Exception("\nMaximum string length in Levenshtein.iLD is " + Math.Pow(2, 31) + ".\nYours is " + Math.Max(sRow.Length, sCol.Length) + "."));

            // Step 1

            if (RowLen == 0)
            {
                return ColLen;
            }

            if (ColLen == 0)
            {
                return RowLen;
            }

            /// Create the two vectors
            int[] v0 = new int[RowLen + 1];
            int[] v1 = new int[RowLen + 1];
            int[] vTmp;


            
            /// Step 2
            /// Initialize the first vector
            for (RowIdx = 1; RowIdx <= RowLen; RowIdx++)
            {
                v0[RowIdx] = RowIdx;
            }

            // Step 3

                /// For each column
            for (ColIdx = 1; ColIdx <= ColLen; ColIdx++)
            {
                /// Set the 0'th element to the column number
                v1[0] = ColIdx;

                Col_j = sCol[ColIdx - 1];


                // Step 4

                /// For each row
                for (RowIdx = 1; RowIdx <= RowLen; RowIdx++)
                {
                    Row_i = sRow[RowIdx - 1];


                    // Step 5

                    if (Row_i == Col_j)
                    {
                        cost = 0;
                    }
                    else
                    {
                        cost = 1;
                    }

                    // Step 6

                    /// Find minimum
                    int m_min = v0[RowIdx] + 1;
                    int b = v1[RowIdx - 1] + 1;
                    int c = v0[RowIdx - 1] + cost;

                    if (b < m_min)
                    {
                        m_min = b;
                    }
                    if (c < m_min)
                    {
                        m_min = c;
                    }

                    v1[RowIdx] = m_min;
                }

                /// Swap the vectors
                vTmp = v0;
                v0 = v1;
                v1 = vTmp;
            }

            // Step 7

            /// Value between 0 - 100
            /// 0==perfect match 100==totally different
            /// 
            /// The vectors were swapped one last time at the end of the last loop,
            /// that is why the result is now in v0 rather than in v1
            System.Console.WriteLine("iDist=" + v0[RowLen]);
            int max = System.Math.Max(RowLen, ColLen);
            return ((100 * v0[RowLen]) / max);
        }
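Note that iLD returns the distance scaled to 0..100 by the longer string length, and also writes the raw distance to the console. Where only the raw distance is wanted, the same two-vector idea can be sketched as follows. The class name TwoRowLevenshtein and the method Distance are hypothetical names used for illustration, not part of the original listing.

```csharp
using System;

public static class TwoRowLevenshtein
{
    // Hypothetical helper: keeps only the previous and the current vector,
    // like iLD, but returns the unscaled edit distance.
    public static int Distance(string s, string t)
    {
        if (s.Length == 0) return t.Length;
        if (t.Length == 0) return s.Length;

        int[] previous = new int[s.Length + 1];
        int[] current = new int[s.Length + 1];

        // The first vector holds the cost of building each prefix of s from nothing.
        for (int i = 0; i <= s.Length; i++) previous[i] = i;

        for (int j = 1; j <= t.Length; j++)
        {
            current[0] = j;
            for (int i = 1; i <= s.Length; i++)
            {
                int cost = s[i - 1] == t[j - 1] ? 0 : 1;
                int best = Math.Min(previous[i] + 1, current[i - 1] + 1);
                current[i] = Math.Min(best, previous[i - 1] + cost);
            }

            // Swap the vectors so that "previous" holds the vector just computed.
            int[] tmp = previous;
            previous = current;
            current = tmp;
        }

        // After the final swap the result sits in "previous".
        return previous[s.Length];
    }

    public static void Main()
    {
        Console.WriteLine(Distance("GUMBO", "GAMBOL")); // prints 2
        Console.WriteLine(Distance("test", "tent"));    // prints 1
    }
}
```

For "GUMBO" and "GAMBOL", iLD's scaled result would be (100 * 2) / 6 = 33 under integer division, whereas this sketch returns 2.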

From:
http://hi.baidu.com/xining52113339/blog/item/8a23f1388ddfc523b9998f47.html
http://hi.baidu.com/pecefull0513/blog/item/a746ca1a292b9c118618bfbd.html
http://www.codeproject.com/KB/recipes/improvestringsimilarity.aspx
http://en.wikipedia.org/wiki/Levenshtein_distance
http://www.codeproject.com/KB/recipes/Levenshtein.aspx
