Levenshtein Distance 算法

2024-03-15 11:58
文章标签 算法 distance levenshtein

本文主要是介绍Levenshtein Distance 算法,希望对大家解决编程问题提供一定的参考价值,需要的开发者们随着小编来一起学习吧!

编辑距离就是用来计算从原串(s)转换到目标串(t)所需要的最少的插入,删除和替换的数目,在NLP中应用比较广泛,如一些评测方法中就用到了(wer,mWer等),同时也常用来计算你对原文本所作的改动数。
编辑距离的算法是首先由俄国科学家Levenshtein提出的,故又叫Levenshtein Distance。
Levenshtein distance (LD) is a measure of the similarity between two strings, which we will refer to as the source string (s) and the target string (t). The distance is the number of deletions, insertions, or substitutions required to transform s into t. For example,

  • If s is "test" and t is "test", then LD(s,t) = 0, because no transformations are needed. The strings are already identical.
  • If s is "test" and t is "tent", then LD(s,t) = 1, because one substitution (change "s" to "n") is sufficient to transform s into t.

The greater the Levenshtein distance, the more different the strings are.

 

Levenshtein distance is named after the Russian scientist Vladimir Levenshtein, who devised the algorithm in 1965. If you can't spell or pronounce Levenshtein, the metric is also sometimes called edit distance.

The Levenshtein distance algorithm has been used in:

  • Spell checking
  • Speech recognition
  • DNA analysis
  • Plagiarism detection

The Algorithm

Steps

StepDescription
1Set n to be the length of s.
Set m to be the length of t.
If n = 0, return m and exit.
If m = 0, return n and exit.
Construct a matrix containing 0..m rows and 0..n columns.
2Initialize the first row to 0..n.
Initialize the first column to 0..m.
3Examine each character of s (i from 1 to n).
4Examine each character of t (j from 1 to m).
5If s[i] equals t[j], the cost is 0.
If s[i] doesn't equal t[j], the cost is 1.
6Set cell d[i,j] of the matrix equal to the minimum of:
a. The cell immediately above plus 1: d[i-1,j] + 1.
b. The cell immediately to the left plus 1: d[i,j-1] + 1.
c. The cell diagonally above and to the left plus the cost: d[i-1,j-1] + cost.
7After the iteration steps (3, 4, 5, 6) are complete, the distance is found in cell d[n,m].

Example

This section shows how the Levenshtein distance is computed when the source string is "GUMBO" and the target string is "GAMBOL".

Steps 1 and 2

  GUMBO
 012345
G1     
A2     
M3     
B4     
O5     
L6     

Steps 3 to 6 When i = 1

  GUMBO
 012345
G10    
A21    
M32    
B43    
O54    
L65    

Steps 3 to 6 When i = 2

  GUMBO
 012345
G101   
A211   
M322   
B433   
O544   
L655   

Steps 3 to 6 When i = 3

  GUMBO
 012345
G1012  
A2112  
M3221  
B4332  
O5443  
L6554  

Steps 3 to 6 When i = 4

  GUMBO
 012345
G10123 
A21123 
M32212 
B43321 
O54432 
L65543 

Steps 3 to 6 When i = 5

  GUMBO
 012345
G101234
A211234
M322123
B433212
O544321
L655432

 

算法示例1:

private int ComputeDistance (string s, string t)
{
int n=s.Length;
int m=t.Length;
int[,] distance=new int[n + 1, m + 1]; // matrix
    int cost=0;
if(n == 0) return m;
if(m == 0) return n;
//init1
    for(int i=0; i <= n; distance[i, 0]=i++);
for(int j=0; j <= m; distance[0, j]=j++);
//find min distance
    for(int i=1; i <= n; i++)
{
for(int j=1; j <= m;j++)
{
cost=(t.Substring(j - 1, 1) == 
s.Substring(i - 1, 1) ? 0 : 1);
distance[i,j]=Min3(distance[i - 1, j] + 1,
distance[i, j - 1] + 1,
distance[i - 1, j - 1] + cost);
}
}
return distance[n, m];
}
算法示例2:
        private int Levenshtein(string str1, string str2)
        {
          int n = str1.Length;
          int m = str2.Length;
          int i;    //遍历str1的
            int j;    //遍历str2的
            char ch1;    //str1的
            char ch2;    //str2的
            int temp;    //记录相同字符,在某个矩阵位置值的增量,不是0就是1
           
            if(n == 0)
            {
                return m;
            }
            if(m == 0)
            {
                return n;
            }
            int[,] d = new int[n+1,m+1];
            for(i=0; i<=n; i++) 
            {    //初始化第一列
                d[i,0] = i;
            }
            for(j=0; j<=m; j++)
            {    //初始化第一行
                d[0,j] = j;
            }
            for(i=1; i<=n; i++) 
            {    //遍历str1
                ch1 = str1[i-1];
                //去匹配str2
                for(j=1; j<=m; j++)
                {
                    ch2 = str2[j-1];
                    if(ch1 == ch2)
                    {
                        temp = 0;
                    } else
                    {
                        temp = 1;
                    }
                    //左边+1,上边+1, 左上角+temp取最小
                    d[i, j] = Min(d[i - 1, j] + 1, d[i, j - 1] + 1, d[i - 1, j - 1] + temp);
                }
            }
            return d[n,m];
        }
        private int Min(int one, int two, int three)
        {
            int min = one;
            if (two < min)
            {
                min = two;
            }
            if (three < min)
            {
                min = three;
            }
            return min;
        }
        private double Sim(String str1, String str2)
        {
            int ld = Levenshtein(str1, str2);
            return 1 - (double)ld / Math.Max(str1.Length, str2.Length);
        }

算法示例3:空間復雜度從O(n*m)降到O(2m)

///*****************************
        /// Compute Levenshtein distance
        /// Memory efficient version
        ///*****************************
        public int iLD(String sRow, String sCol)
        {
            int RowLen = sRow.Length;  // length of sRow
            int ColLen = sCol.Length;  // length of sCol
            int RowIdx;                // iterates through sRow
            int ColIdx;                // iterates through sCol
            char Row_i;                // ith character of sRow
            char Col_j;                // jth character of sCol
            int cost;                   // cost
            /// Test string length
            if (Math.Max(sRow.Length, sCol.Length) > Math.Pow(2, 31))
                throw (new Exception("/nMaximum string length in Levenshtein.iLD is " + Math.Pow(2, 31) + "./nYours is " + Math.Max(sRow.Length, sCol.Length) + "."));
            // Step 1
            if (RowLen == 0)
            {
                return ColLen;
            }
            if (ColLen == 0)
            {
                return RowLen;
            }
            /// Create the two vectors
            int[] v0 = new int[RowLen + 1];
            int[] v1 = new int[RowLen + 1];
            int[] vTmp;

           
            /// Step 2
            /// Initialize the first vector
            for (RowIdx = 1; RowIdx <= RowLen; RowIdx++)
            {
                v0[RowIdx] = RowIdx;
            }
            // Step 3
            /// Fore each column
            for (ColIdx = 1; ColIdx <= ColLen; ColIdx++)
            {
                /// Set the 0'th element to the column number
                v1[0] = ColIdx;
                Col_j = sCol[ColIdx - 1];

                // Step 4
                /// Fore each row
                for (RowIdx = 1; RowIdx <= RowLen; RowIdx++)
                {
                    Row_i = sRow[RowIdx - 1];

                    // Step 5
                    if (Row_i == Col_j)
                    {
                        cost = 0;
                    }
                    else
                    {
                        cost = 1;
                    }
                    // Step 6
                    /// Find minimum
                    int m_min = v0[RowIdx] + 1;
                    int b = v1[RowIdx - 1] + 1;
                    int c = v0[RowIdx - 1] + cost;
                    if (b < m_min)
                    {
                        m_min = b;
                    }
                    if (c < m_min)
                    {
                        m_min = c;
                    }
                    v1[RowIdx] = m_min;
                }
                /// Swap the vectors
                vTmp = v0;
                v0 = v1;
                v1 = vTmp;
            }
               
            // Step 7
            /// Value between 0 - 100
            /// 0==perfect match 100==totaly different
            ///
            /// The vectors where swaped one last time at the end of the last loop,
            /// that is why the result is now in v0 rather than in v1
            System.Console.WriteLine("iDist=" + v0[RowLen]);
            int max = System.Math.Max(RowLen, ColLen);
            return ((100 * v0[RowLen]) / max);
        }
From:http://hi.baidu.com/xining52113339/blog/item/8a23f1388ddfc523b9998f47.html
         http://hi.baidu.com/pecefull0513/blog/item/a746ca1a292b9c118618bfbd.html
        http://www.codeproject.com/KB/recipes/improvestringsimilarity.aspx

   http://en.wikipedia.org/wiki/Levenshtein_distance

  http://www.codeproject.com/KB/recipes/Levenshtein.aspx

这篇关于Levenshtein Distance 算法的文章就介绍到这儿,希望我们推荐的文章对编程师们有所帮助!



http://www.chinasem.cn/article/811899

相关文章

SpringBoot实现MD5加盐算法的示例代码

《SpringBoot实现MD5加盐算法的示例代码》加盐算法是一种用于增强密码安全性的技术,本文主要介绍了SpringBoot实现MD5加盐算法的示例代码,文中通过示例代码介绍的非常详细,对大家的学习... 目录一、什么是加盐算法二、如何实现加盐算法2.1 加盐算法代码实现2.2 注册页面中进行密码加盐2.

Java时间轮调度算法的代码实现

《Java时间轮调度算法的代码实现》时间轮是一种高效的定时调度算法,主要用于管理延时任务或周期性任务,它通过一个环形数组(时间轮)和指针来实现,将大量定时任务分摊到固定的时间槽中,极大地降低了时间复杂... 目录1、简述2、时间轮的原理3. 时间轮的实现步骤3.1 定义时间槽3.2 定义时间轮3.3 使用时

如何通过Golang的container/list实现LRU缓存算法

《如何通过Golang的container/list实现LRU缓存算法》文章介绍了Go语言中container/list包实现的双向链表,并探讨了如何使用链表实现LRU缓存,LRU缓存通过维护一个双向... 目录力扣:146. LRU 缓存主要结构 List 和 Element常用方法1. 初始化链表2.

golang字符串匹配算法解读

《golang字符串匹配算法解读》文章介绍了字符串匹配算法的原理,特别是Knuth-Morris-Pratt(KMP)算法,该算法通过构建模式串的前缀表来减少匹配时的不必要的字符比较,从而提高效率,在... 目录简介KMP实现代码总结简介字符串匹配算法主要用于在一个较长的文本串中查找一个较短的字符串(称为

通俗易懂的Java常见限流算法具体实现

《通俗易懂的Java常见限流算法具体实现》:本文主要介绍Java常见限流算法具体实现的相关资料,包括漏桶算法、令牌桶算法、Nginx限流和Redis+Lua限流的实现原理和具体步骤,并比较了它们的... 目录一、漏桶算法1.漏桶算法的思想和原理2.具体实现二、令牌桶算法1.令牌桶算法流程:2.具体实现2.1

Python中的随机森林算法与实战

《Python中的随机森林算法与实战》本文详细介绍了随机森林算法,包括其原理、实现步骤、分类和回归案例,并讨论了其优点和缺点,通过面向对象编程实现了一个简单的随机森林模型,并应用于鸢尾花分类和波士顿房... 目录1、随机森林算法概述2、随机森林的原理3、实现步骤4、分类案例:使用随机森林预测鸢尾花品种4.1

不懂推荐算法也能设计推荐系统

本文以商业化应用推荐为例,告诉我们不懂推荐算法的产品,也能从产品侧出发, 设计出一款不错的推荐系统。 相信很多新手产品,看到算法二字,多是懵圈的。 什么排序算法、最短路径等都是相对传统的算法(注:传统是指科班出身的产品都会接触过)。但对于推荐算法,多数产品对着网上搜到的资源,都会无从下手。特别当某些推荐算法 和 “AI”扯上关系后,更是加大了理解的难度。 但,不了解推荐算法,就无法做推荐系

康拓展开(hash算法中会用到)

康拓展开是一个全排列到一个自然数的双射(也就是某个全排列与某个自然数一一对应) 公式: X=a[n]*(n-1)!+a[n-1]*(n-2)!+...+a[i]*(i-1)!+...+a[1]*0! 其中,a[i]为整数,并且0<=a[i]<i,1<=i<=n。(a[i]在不同应用中的含义不同); 典型应用: 计算当前排列在所有由小到大全排列中的顺序,也就是说求当前排列是第

csu 1446 Problem J Modified LCS (扩展欧几里得算法的简单应用)

这是一道扩展欧几里得算法的简单应用题,这题是在湖南多校训练赛中队友ac的一道题,在比赛之后请教了队友,然后自己把它a掉 这也是自己独自做扩展欧几里得算法的题目 题意:把题意转变下就变成了:求d1*x - d2*y = f2 - f1的解,很明显用exgcd来解 下面介绍一下exgcd的一些知识点:求ax + by = c的解 一、首先求ax + by = gcd(a,b)的解 这个

综合安防管理平台LntonAIServer视频监控汇聚抖动检测算法优势

LntonAIServer视频质量诊断功能中的抖动检测是一个专门针对视频稳定性进行分析的功能。抖动通常是指视频帧之间的不必要运动,这种运动可能是由于摄像机的移动、传输中的错误或编解码问题导致的。抖动检测对于确保视频内容的平滑性和观看体验至关重要。 优势 1. 提高图像质量 - 清晰度提升:减少抖动,提高图像的清晰度和细节表现力,使得监控画面更加真实可信。 - 细节增强:在低光条件下,抖