Easily Parse HTML in C# with HtmlAgilityPack

2023-12-17 10:58


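(Time is short and the workload is heavy, so this is to be continued... I will come back and carry on translating when I have time.)
This is a translation. Yes, you read that right: a hard-won translation by someone whose English is poor.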

The original article is here: Easily Parse HTML Documents in C#


  • Easily Parse HTML in C# with HtmlAgilityPack
  • Original text
    • Easily Parse HTML Documents in C#.
      • HtmlAgilityPack
      • Basic Parsing
      • Using XPath


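Easily Parse HTML in C# with HtmlAgilityPack

You are building a C# application and need to parse the HTML of a web page. You could use regular expressions, but a DOM-based approach seems more efficient. What if you could even take advantage of the power of XPath?

.NET contains an HtmlDocument class,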


Original text

Reproduced here in case the original is lost.

Easily Parse HTML Documents in C#.

So, you are building a C# application and need to parse a web page’s HTML. You could use regular expressions, but it seems more efficient to use a DOM-based approach. What if you could even take advantage of the power of XPath?

.NET contains an HtmlDocument class, along with HtmlElement, in System.Windows.Forms, which could seem pretty interesting. It does provide basic DOM methods like GetElementById and GetElementsByTagName. However, if you try to create an HtmlDocument object, you will soon notice that it has no public constructor. It is actually a wrapper around an unmanaged class and the only way you can get an instance is through the WebBrowser control. Quite slow and annoying… So, what are the other solutions?

XmlDocument and XmlNode are an interesting solution if you have correctly formatted XML or XHTML. If you are retrieving content from the web, you will need another library that checks the markup and corrects it if needed. You may want to try something like Tidy or SGMLReader. Then you can create an XmlDocument and use its methods to parse and manipulate the nodes.
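As a minimal sketch of the XmlDocument route described above, the following uses only classes that ship with .NET. The sample markup, the id "mynode" and the helper name GetHrefs are assumptions made for illustration. Note that real XHTML pages usually declare a default namespace, in which case the XPath expression would also need an XmlNamespaceManager; the sample deliberately omits the namespace to stay short.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

// Hypothetical sample input: well-formed markup, so XmlDocument can load it directly.
const string xhtml =
    "<html><body><div id=\"mynode\">" +
    "<a href=\"http://example.com/\">External</a>" +
    "<a href=\"/about\">Internal</a>" +
    "</div></body></html>";

// Hypothetical helper: returns the href of every <a> under the element with the given id.
static List<string> GetHrefs(string markup, string id)
{
    var document = new XmlDocument();
    document.LoadXml(markup); // throws XmlException if the markup is not well-formed

    var hrefs = new List<string>();
    XmlNodeList links = document.SelectNodes("//*[@id='" + id + "']//a[@href]");
    foreach (XmlNode link in links)
        hrefs.Add(link.Attributes["href"].Value);
    return hrefs;
}

foreach (string href in GetHrefs(xhtml, "mynode"))
    Console.WriteLine(href);
```

Because LoadXml rejects anything that is not well-formed, this approach only suits markup that has already been cleaned up, which is the limitation the paragraph above points out.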

HtmlAgilityPack

Another solution, which I now use every time I need to parse HTML, is the free and open source HtmlAgilityPack library. It provides HtmlDocument and HtmlNode classes, which are quite similar to .NET’s XmlDocument and XmlNode classes. You can load the HTML from a file, a URL or a string. There is no need to check the markup validity first, as HtmlAgilityPack will take care of making everything valid by closing unclosed tags and fixing other markup errors. Once the document is loaded, you can start having fun parsing through the nodes!
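The three loading routes mentioned above can be sketched roughly as follows, assuming the HtmlAgilityPack NuGet package is referenced. The sample string, the file name page.html and the URL are placeholders. The sketch also prints whatever the parser recorded in ParseErrors, since loading sloppy markup does not throw.

```csharp
using System;
using HtmlAgilityPack; // third-party NuGet package, not part of .NET itself

// Hypothetical input with an unclosed <b> element.
const string html = "<div id='mynode'><p>Hello <b>world</p></div>";

var document = new HtmlDocument();
document.LoadHtml(html);                                        // from a string
// document.Load("page.html");                                  // from a file (placeholder path)
// document = new HtmlWeb().Load("http://www.somewebsite.com"); // from a URL

// Problems noticed while parsing are collected here instead of being thrown.
foreach (HtmlParseError error in document.ParseErrors)
    Console.WriteLine($"{error.Code} at line {error.Line}: {error.Reason}");

// The resulting tree can be serialized back to markup.
Console.WriteLine(document.DocumentNode.OuterHtml);
```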

Basic Parsing

The HtmlDocument object provides a GetElementbyId method that lets you target a specific node using its Id. You can use properties such as ChildNodes, FirstChild, NextSibling and ParentNode to navigate through the nodes. You can also use the Ancestors and Descendants methods to get a list of all the ancestors or descendants of a node, respectively. Optionally, a node name can be given to retrieve only one type of node. Use the Attributes property to access a node’s attributes.

Here is a simple example that retrieves a web page and lists all the external links within a given node specified by its Id:

using System;
using System.Collections.Generic;
using HtmlAgilityPack;

// The HtmlWeb class is a utility class to get the HTML over HTTP
HtmlWeb htmlWeb = new HtmlWeb();

// Creates an HtmlDocument object from a URL
HtmlAgilityPack.HtmlDocument document = htmlWeb.Load("http://www.somewebsite.com");

// Targets a specific node
HtmlNode someNode = document.GetElementbyId("mynode");

// If there is no node with that Id, someNode will be null
if (someNode != null)
{
    // Extracts all links within that node
    IEnumerable<HtmlNode> allLinks = someNode.Descendants("a");

    // Outputs the href for external links
    foreach (HtmlNode link in allLinks)
    {
        // Checks whether the link contains an HREF attribute
        if (link.Attributes.Contains("href"))
        {
            // Simple check: if the href begins with "http://", prints it out
            if (link.Attributes["href"].Value.StartsWith("http://"))
                Console.WriteLine(link.Attributes["href"].Value);
        }
    }
}
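The navigation members listed earlier (FirstChild, NextSibling, ParentNode, Ancestors, Descendants) are not exercised by the example above, so here is a small sketch of them on an in-memory document. The markup and the ids "outer" and "menu" are made up for illustration, and the HtmlAgilityPack NuGet package is assumed to be referenced.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical markup with no whitespace between tags, so FirstChild is an element
// rather than a text node.
const string html =
    "<div id='outer'><div id='menu'><span>One</span><span>Two</span></div></div>";

var document = new HtmlDocument();
document.LoadHtml(html);

HtmlNode menu = document.GetElementbyId("menu");
HtmlNode first = menu.FirstChild;      // the first <span>
HtmlNode second = first.NextSibling;   // the second <span>

Console.WriteLine(first.InnerText);                  // One
Console.WriteLine(second.InnerText);                 // Two
Console.WriteLine(menu.ParentNode.Name);             // div
Console.WriteLine(menu.Descendants("span").Count()); // 2

// Walks up from the first <span> and prints the name of each ancestor.
Console.WriteLine(string.Join(" > ", first.Ancestors().Select(n => n.Name)));
```

With real pages, whitespace between tags shows up as text nodes, so FirstChild and NextSibling often return a #text node; filtering with Descendants or Elements by name avoids that.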

Using XPath

As I mentioned above, HtmlAgilityPack supports XPath. If you don’t know XPath, I really suggest you take some time to learn it. It is quite simple, yet powerful. The HtmlNode class provides two methods to retrieve nodes matching an XPath expression: SelectSingleNode and SelectNodes. The first returns only one node (the first one matching) and the latter returns all matching nodes.

Here is almost the same example as above, but using XPath instead. Load the HtmlDocument object the same way and then:

// Targets a specific node
HtmlNode someNode = document.DocumentNode.SelectSingleNode("//*[@id='mynode']");

// If there is no node with that Id, someNode will be null
if (someNode != null)
{
    // Extracts all links within that node
    // Note the leading dot (.) to make it look relative to the current node instead of the whole document
    HtmlNodeCollection allLinks = someNode.SelectNodes(".//a");

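The rest is the same, with one caveat: by default SelectNodes returns null rather than an empty collection when nothing matches, so check allLinks for null before looping over it.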

But that code is not any shorter or simpler than the previous one! It might even actually seem more complicated with that XPath syntax. That’s right, but here comes the power of XPath. Both expressions could be combined into only one that would do everything at once. And here is the new code after the HtmlDocument object loading as above:

// Extracts all links under a specific node that have an href that begins with "http://"
HtmlNodeCollection allLinks = document.DocumentNode.SelectNodes("//*[@id='mynode']//a[starts-with(@href,'http://')]");

// By default SelectNodes returns null (not an empty collection) when nothing matches
if (allLinks != null)
{
    // Outputs the href for external links
    foreach (HtmlNode link in allLinks)
        Console.WriteLine(link.Attributes["href"].Value);
}

Simple enough? Only the XPath part might be a bit hard to understand if you are new to it, but you will get used to it and eventually read it easily. This example is quite simple, but there is a lot more you can do using XPath to parse through nodes.
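To give a rough idea of what else XPath can express, the sketch below runs three further queries against an in-memory document: a positional predicate, a contains() test on an attribute, and a not() test for a missing attribute. The markup and class names are invented for illustration, and the HtmlAgilityPack NuGet package is assumed to be referenced.

```csharp
using System;
using HtmlAgilityPack;

// Hypothetical markup: three links, two of which carry a class attribute.
const string html =
    "<div id='mynode'>" +
    "<a class='nav ext' href='http://example.com/a'>A</a>" +
    "<a class='nav' href='/b'>B</a>" +
    "<a href='http://example.com/c'>C</a>" +
    "</div>";

var document = new HtmlDocument();
document.LoadHtml(html);
HtmlNode root = document.DocumentNode;

// Positional predicate: the second <a> child of #mynode (XPath positions start at 1).
HtmlNode secondLink = root.SelectSingleNode("//*[@id='mynode']/a[2]");

// Substring test: links whose class attribute contains "nav".
HtmlNodeCollection navLinks = root.SelectNodes("//a[contains(@class,'nav')]");

// Negation: links that have no class attribute at all.
HtmlNodeCollection noClass = root.SelectNodes("//a[not(@class)]");

Console.WriteLine(secondLink.GetAttributeValue("href", "")); // /b
Console.WriteLine(navLinks?.Count ?? 0);                     // 2
Console.WriteLine(noClass?.Count ?? 0);                      // 1
```

The null-conditional operators are there because SelectNodes returns null by default when a query matches nothing.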

I hope this short introduction to HtmlAgilityPack will help you get started with this really nice library and help you with your projects!


