Chinese String Fuzzy Matching Algorithm | C# Levenshtein Distance

2010-01-06 09:08:09  

C# Levenshtein Distance
by Sam Allen - Updated November 27, 2009
You want to match strings approximately (fuzzy matching), using the Levenshtein distance algorithm. Many projects need this logic, including programs that manage prescription drugs, spell-checkers, suggestion searches and plagiarism detectors. Here we see a simple but complete implementation of this algorithm in the C# programming language.

Words:                ant, aunt
Levenshtein distance: 1
Note:                 Only 1 edit is needed.
                      The 'u' must be inserted at index 1 (zero-based).

Words:                Samantha, Sam
Levenshtein distance: 5
Note:                 The final 5 letters must be removed.

Words:                Flomax, Volmax
Levenshtein distance: 3
Note:                 The first 3 letters must be changed.
                      Drug names are commonly confused.

Levenshtein algorithm
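First, credit goes to Vladimir Levenshtein, a Russian scientist, who devised the algorithm. Here is the C# code I adapted and optimized. It uses a two-dimensional (rectangular) array instead of a jagged array because the table has a single fixed width and height: one row per character of the first string plus one, and one column per character of the second string plus one.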

=== Program that implements the algorithm (C#) ===

using System;

/// <summary>
/// Contains approximate string matching
/// </summary>
static class LevenshteinDistance
{
    /// <summary>
    /// Compute the distance between two strings.
    /// </summary>
    public static int Compute(string s, string t)
    {
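        int n = s.Length;
        int m = t.Length;

        // Step 1: if either string is empty, the distance is the
        // length of the other string (all insertions or deletions).
        if (n == 0)
        {
            return m;
        }

        if (m == 0)
        {
            return n;
        }

        // Allocate the table only after the early returns above.
        int[,] d = new int[n + 1, m + 1];

        // Step 2: the first column and first row hold the cost of
        // reaching each prefix from the empty string.
        for (int i = 0; i <= n; i++)
        {
            d[i, 0] = i;
        }

        for (int j = 0; j <= m; j++)
        {
            d[0, j] = j;
        }

        // Step 3: walk each character of s.
        for (int i = 1; i <= n; i++)
        {
            // Step 4: walk each character of t.
            for (int j = 1; j <= m; j++)
            {
                // Step 5: substitution is free when the characters match.
                int cost = (t[j - 1] == s[i - 1]) ? 0 : 1;

                // Step 6: take the cheapest of deletion, insertion
                // and substitution.
                d[i, j] = Math.Min(
                    Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                    d[i - 1, j - 1] + cost);
            }
        }
        // Step 7: the bottom-right cell is the distance between
        // the two full strings.
        return d[n, m];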
    }
}

class Program
{
    static void Main()
    {
        Console.WriteLine(LevenshteinDistance.Compute("aunt", "ant"));
        Console.WriteLine(LevenshteinDistance.Compute("Sam", "Samantha"));
        Console.WriteLine(LevenshteinDistance.Compute("flomax", "volmax"));
    }
}

=== Output from the program ===

1
5
3

Description. The Compute method is static because it does not need to store state or instance data. A static method is also invoked with a call instruction rather than callvirt, which can slightly help performance. You can verify that the implementation above is the standard dynamic-programming version of Levenshtein by comparing it with a textbook description of the algorithm.
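One way to check the implementation by hand is to look at the whole table rather than only the final cell. The following is a minimal sketch that fills the same (n + 1) by (m + 1) table with the same recurrence and prints it for "ant" and "aunt". The class name LevenshteinTable and the method name Build are illustrative and not part of the original code.

```csharp
using System;

static class LevenshteinTable
{
    // Fills the same rectangular table as Compute, but returns the
    // whole table instead of only the bottom-right cell.
    public static int[,] Build(string s, string t)
    {
        int n = s.Length;
        int m = t.Length;
        int[,] d = new int[n + 1, m + 1];

        // First column and first row: distance from the empty prefix.
        for (int i = 0; i <= n; i++) d[i, 0] = i;
        for (int j = 0; j <= m; j++) d[0, j] = j;

        for (int i = 1; i <= n; i++)
        {
            for (int j = 1; j <= m; j++)
            {
                int cost = (s[i - 1] == t[j - 1]) ? 0 : 1;
                d[i, j] = Math.Min(
                    Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                    d[i - 1, j - 1] + cost);
            }
        }
        return d;
    }

    static void Main()
    {
        int[,] d = Build("ant", "aunt");

        // Print one table row per line.
        for (int i = 0; i < d.GetLength(0); i++)
        {
            for (int j = 0; j < d.GetLength(1); j++)
            {
                Console.Write(d[i, j] + " ");
            }
            Console.WriteLine();
        }
    }
}
```

For "ant" and "aunt" the table has 4 rows and 5 columns. Its rows are 0 1 2 3 4, then 1 0 1 2 3, then 2 1 1 1 2, then 3 2 2 2 1, and the bottom-right cell (1) is the distance reported earlier.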

Performance notes. I adapted the code above from another source and optimized it so that it is three times faster. However, there are faster variants of the Levenshtein algorithm for some scenarios. [Levenshtein distance - wikipedia.org]
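As one example of such a variant, the sketch below keeps only two rows of the table instead of the full rectangle, because each cell depends only on the current row and the row above it. This is a commonly described variant, not the author's code, and the class name LevenshteinTwoRows is illustrative. It reduces memory use; whether it runs faster in a given program is something you would need to measure.

```csharp
using System;

static class LevenshteinTwoRows
{
    // Same recurrence as the full-table version, but only the previous
    // and current rows are kept in memory.
    public static int Compute(string s, string t)
    {
        if (s.Length == 0) return t.Length;
        if (t.Length == 0) return s.Length;

        int[] previous = new int[t.Length + 1];
        int[] current = new int[t.Length + 1];

        // Row 0: distance from the empty prefix of s to each prefix of t.
        for (int j = 0; j <= t.Length; j++) previous[j] = j;

        for (int i = 1; i <= s.Length; i++)
        {
            // Column 0: distance from the first i characters of s to "".
            current[0] = i;

            for (int j = 1; j <= t.Length; j++)
            {
                int cost = (s[i - 1] == t[j - 1]) ? 0 : 1;
                current[j] = Math.Min(
                    Math.Min(previous[j] + 1, current[j - 1] + 1),
                    previous[j - 1] + cost);
            }

            // The row just filled becomes the previous row for the next pass.
            int[] temp = previous;
            previous = current;
            current = temp;
        }

        // After the final swap, the last computed row is in previous.
        return previous[t.Length];
    }

    static void Main()
    {
        Console.WriteLine(Compute("kitten", "sitting")); // 3
    }
}
```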

Static classes. This algorithm is stateless, which means it doesn't store instance data and therefore can be put in a static class. Static classes are easier to add to new projects than separate methods.

Usage
Here we see how you can call the method in your C# programs. You will often want to compare multiple strings with the Levenshtein algorithm. The example here shows how you can compare strings in a loop. We use a List of string[] arrays.

=== Program that calls Levenshtein in loop (C#) ===

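using System;
using System.Collections.Generic;

class Program
{
    static void Main()
    {
        // Each entry holds a pair of strings to compare.
        List<string[]> pairs = new List<string[]>
        {
            new string[]{"ant", "aunt"},
            new string[]{"Sam", "Samantha"},
            new string[]{"clozapine", "olanzapine"},
            new string[]{"flomax", "volmax"},
            new string[]{"toradol", "tramadol"},
            new string[]{"kitten", "sitting"}
        };

        foreach (string[] pair in pairs)
        {
            // Compute is defined in the static LevenshteinDistance class.
            int cost = LevenshteinDistance.Compute(pair[0], pair[1]);
            Console.WriteLine("{0} -> {1} = {2}",
                pair[0],
                pair[1],
                cost);
        }
    }
}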

=== Output of the program ===

ant -> aunt = 1
Sam -> Samantha = 5
clozapine -> olanzapine = 3
flomax -> volmax = 3
toradol -> tramadol = 3
kitten -> sitting = 3

More resources
Michael Gilleland has an excellent page about the Levenshtein distance with many implementations of it. That resource is worth consulting if you need a more detailed reference. [Levenshtein Distance - merriampark.com]

Performance mistake
I found the C# version linked from merriampark.com and adapted it for a large performance improvement. I changed the first statement below into the second. The original version allocates new one-character strings with Substring for every comparison. The revised version compares the characters directly, with no string allocations, and takes 75% less time to run.

=== Slow version that uses Substring ===

// It makes new strings.
cost = (t.Substring(j - 1, 1) == s.Substring(i - 1, 1) ? 0 : 1);

=== Fast version that uses chars ===

// Doesn't make new strings with Substring.
cost = (t[j - 1] == s[i - 1]) ? 0 : 1;

Summary
Here we saw the famous Levenshtein Distance algorithm, adapted and optimized for the C# programming language. The author places the code here in the public domain and encourages you to test it and improve it, so you are free to use it anywhere you want. Use this code to implement approximate string matching. The brilliance of the algorithm is from Dr. Levenshtein, not the author of this article.
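As a rough sketch of what approximate string matching can look like in practice, the example below picks the candidate with the smallest distance to an input string, as a suggestion search might. The class name ClosestMatch, the helper FindClosest and the sample list are hypothetical and not part of the original article. Distance repeats the same recurrence so that the block stands alone.

```csharp
using System;
using System.Collections.Generic;

static class ClosestMatch
{
    // Same recurrence as the article's Compute method, repeated here
    // so this example compiles on its own.
    public static int Distance(string s, string t)
    {
        int n = s.Length;
        int m = t.Length;
        int[,] d = new int[n + 1, m + 1];

        for (int i = 0; i <= n; i++) d[i, 0] = i;
        for (int j = 0; j <= m; j++) d[0, j] = j;

        for (int i = 1; i <= n; i++)
        {
            for (int j = 1; j <= m; j++)
            {
                int cost = (s[i - 1] == t[j - 1]) ? 0 : 1;
                d[i, j] = Math.Min(
                    Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                    d[i - 1, j - 1] + cost);
            }
        }
        return d[n, m];
    }

    // Hypothetical helper: returns the candidate with the smallest
    // distance to input. The first candidate wins ties, and the result
    // is null when there are no candidates.
    public static string FindClosest(string input, IEnumerable<string> candidates)
    {
        string best = null;
        int bestDistance = int.MaxValue;

        foreach (string candidate in candidates)
        {
            int distance = Distance(input, candidate);
            if (distance < bestDistance)
            {
                best = candidate;
                bestDistance = distance;
            }
        }
        return best;
    }

    static void Main()
    {
        List<string> names = new List<string> { "olanzapine", "tramadol", "volmax" };
        Console.WriteLine(FindClosest("tramadal", names)); // tramadol
    }
}
```

Because the comparison works on individual char values, the same method applies to Chinese text: each character in the Basic Multilingual Plane is one char, so Distance("北京大学", "北京大雪") is 1. Characters stored as surrogate pairs occupy two char values and may count as more than one edit.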
