Easily Parse HTML in C# with HtmlAgilityPack

2023-12-17 10:58


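(Time is short and the workload is heavy, so this is unfinished. I will come back and continue the translation when I have time.)
This is a translation, you read that right: a painstaking attempt by someone whose English is poor.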

The original article is here: Easily Parse HTML Documents in C#


  • Easily Parse HTML in C# with HtmlAgilityPack
  • Original text
    • Easily Parse HTML Documents in C#.
      • HtmlAgilityPack
      • Basic Parsing
      • Using XPath


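Easily Parse HTML in C# with HtmlAgilityPack

You are building a C# application and need to parse the HTML of a web page. You could use regular expressions, but a DOM-based approach seems more efficient. What if you could also take advantage of the power of XPath?

.NET contains an HtmlDocument class,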


Original text

Reproduced here in case the original is lost.

Easily Parse HTML Documents in C#.

So, you are building a C# application and need to parse a web page’s HTML. You could use regular expressions, but it seems more efficient to use a DOM-based approach. What if you could even take advantage of the power of XPath?

.NET contains an HtmlDocument class, along with HtmlElement, in System.Windows.Forms, which could seem pretty interesting. It does provide basic DOM methods like GetElementById and GetElementsByTagName. However, if you try to create an HtmlDocument object, you will soon notice that it has no public constructor. It is actually a wrapper around an unmanaged class and the only way you can get an instance is through the WebBrowser control. Quite slow and annoying… So, what are the other solutions?

XmlDocument and XmlNode are an interesting solution if you have correctly formatted XML or XHTML. If you are retrieving content from the web, you will need another library that checks the markup and corrects it if needed. You may want to try something like Tidy or SGMLReader. Then you can create an XmlDocument and use its methods to parse and manipulate the nodes.
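As a minimal sketch of the XmlDocument route, the program below parses a small well-formed fragment and lists the links under one element. The class name XhtmlLinks, the helper ExtractHrefs and the sample markup are assumptions made for illustration only; the sample deliberately has no xmlns declaration, since real XHTML with a default namespace would also need an XmlNamespaceManager for the XPath query.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

static class XhtmlLinks
{
    // Hypothetical helper: returns the href of every <a> under the element with id "mynode".
    // LoadXml throws XmlException if the markup is not well-formed XML.
    public static List<string> ExtractHrefs(string xhtml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xhtml);

        var hrefs = new List<string>();
        foreach (XmlNode link in doc.SelectNodes("//*[@id='mynode']//a"))
        {
            XmlAttribute href = link.Attributes["href"];
            if (href != null)
                hrefs.Add(href.Value);
        }
        return hrefs;
    }

    static void Main()
    {
        // Illustrative, well-formed input
        string good = "<html><body><div id=\"mynode\">"
                    + "<a href=\"http://example.com\">Example</a>"
                    + "<a href=\"/local\">Local</a>"
                    + "</div></body></html>";
        foreach (string href in ExtractHrefs(good))
            Console.WriteLine(href);

        // Typical web markup with an unclosed <p> is rejected outright
        try
        {
            ExtractHrefs("<div id=\"mynode\"><p>unclosed</div>");
        }
        catch (XmlException ex)
        {
            Console.WriteLine("Not well-formed: " + ex.Message);
        }
    }
}
```

The second call shows why this approach only suits markup that is already clean: the parser refuses the document instead of repairing it.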

HtmlAgilityPack

Another solution, and the one I now use every time I need to parse HTML, is the free and open source HtmlAgilityPack library. It provides HtmlDocument and HtmlNode classes, which are quite similar to .NET’s XmlDocument and XmlNode classes. You can load the HTML from a file, a URL or a string. There is no need to check the markup validity first, as HtmlAgilityPack will take care of making everything valid by closing unclosed tags and fixing other markup errors. Once the document is loaded, you can start having fun parsing through the nodes!
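The three loading routes can be sketched as follows, assuming the HtmlAgilityPack NuGet package is referenced. The class name LoadSketch, the helper CountLinks, the sample markup and the file name page.html are placeholders for illustration; only the string route actually runs, so the sketch needs neither a file nor network access.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

static class LoadSketch
{
    // Hypothetical helper: parses an HTML string and counts its <a> elements.
    public static int CountLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);   // load from a string
        return doc.DocumentNode.Descendants("a").Count();
    }

    static void Main()
    {
        // Sloppy sample markup: the <p> and <a> elements are never closed
        string html = "<div id='mynode'><p>See <a href='http://example.com'>this link</div>";
        Console.WriteLine(CountLinks(html));

        // The other two entry points, commented out because they need a real
        // file or network access ("page.html" is a placeholder path):
        // var fromFile = new HtmlDocument();
        // fromFile.Load("page.html");
        // HtmlDocument fromUrl = new HtmlWeb().Load("http://www.somewebsite.com");
    }
}
```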

Basic Parsing

The HtmlDocument object provides a GetElementbyId method that lets you target a specific node by its Id. You can use properties such as ChildNodes, FirstChild, NextSibling and ParentNode to navigate through the nodes. You can also use the Ancestors and Descendants methods to get a list of all the ancestors or descendants of a node, respectively. Optionally, a node name can be given to retrieve only nodes of that type. Use the Attributes property to access a node’s attributes.

Here is a simple example that retrieves a web page and lists all the external links within a given node specified by its Id:

// The HtmlWeb class is a utility class to get the HTML over HTTP
HtmlWeb htmlWeb = new HtmlWeb();

// Creates an HtmlDocument object from an URL
HtmlAgilityPack.HtmlDocument document = htmlWeb.Load("http://www.somewebsite.com");

// Targets a specific node
HtmlNode someNode = document.GetElementbyId("mynode");

// If there is no node with that Id, someNode will be null
if (someNode != null)
{
    // Extracts all links within that node
    IEnumerable<HtmlNode> allLinks = someNode.Descendants("a");

    // Outputs the href for external links
    foreach (HtmlNode link in allLinks)
    {
        // Checks whether the link contains an HREF attribute
        if (link.Attributes.Contains("href"))
        {
            // Simple check: if the href begins with "http://", prints it out
            if (link.Attributes["href"].Value.StartsWith("http://"))
                Console.WriteLine(link.Attributes["href"].Value);
        }
    }
}

Using XPath

As I mentioned above, HtmlAgilityPack supports XPath. If you don’t know XPath, I really suggest you take some time to learn it. It is quite simple, yet powerful. The HtmlNode class provides two methods to retrieve nodes matching an XPath expression: SelectSingleNode and SelectNodes. The first returns only one node (the first one matching) and the latter returns all matching nodes.

Here is almost the same example as above, but using XPath instead. Load the HtmlDocument object the same way and then:

// Targets a specific node
HtmlNode someNode = document.DocumentNode.SelectSingleNode("//*[@id='mynode']");

// If there is no node with that Id, someNode will be null
if (someNode != null)
{
    // Extracts all links within that node
    // Note the leading dot (.) to make it look relative to the current node instead of the whole document
    HtmlNodeCollection allLinks = someNode.SelectNodes(".//a");

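The rest is the same, with one caveat: SelectNodes returns null rather than an empty collection when nothing matches, so check allLinks for null before iterating.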

But that code is not any shorter or simpler than the previous one! It might even actually seem more complicated with that XPath syntax. That’s right, but here comes the power of XPath. Both expressions could be combined into only one that would do everything at once. And here is the new code after the HtmlDocument object loading as above:

// Extracts all links under a specific node that have an href that begins with "http://"
HtmlNodeCollection allLinks = document.DocumentNode.SelectNodes("//*[@id='mynode']//a[starts-with(@href,'http://')]");

// SelectNodes returns null (not an empty collection) when nothing matches
if (allLinks != null)
{
    // Outputs the href for external links
    foreach (HtmlNode link in allLinks)
        Console.WriteLine(link.Attributes["href"].Value);
}

Simple enough? Only the XPath part might be a bit hard to understand if you are new to it, but you will get used to it and eventually read it easily. This example is quite simple, but there is a lot more you can do using XPath to parse through nodes.
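To give a flavour of what else XPath can express, here is a small sketch that runs a few other expressions against an in-memory document. It assumes the HtmlAgilityPack NuGet package is referenced; the class name XPathSketch, the helper SelectHrefs and the sample markup are invented for illustration and are not part of the library.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

static class XPathSketch
{
    // Hypothetical helper: returns the href of every node matching the XPath
    // expression, or an empty list when nothing matches.
    public static List<string> SelectHrefs(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var result = new List<string>();
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);
        if (nodes == null)
            return result;   // SelectNodes returns null when nothing matches

        foreach (HtmlNode node in nodes)
            result.Add(node.GetAttributeValue("href", ""));
        return result;
    }

    static void Main()
    {
        // Illustrative markup only
        string html = "<ul id='menu'>"
                    + "<li><a class='ext' href='http://a.example'>A</a></li>"
                    + "<li><a href='/b'>B</a></li>"
                    + "<li><a class='ext' href='http://c.example'>C</a></li>"
                    + "</ul>";

        string[] queries =
        {
            "//a[contains(@class,'ext')]",             // links whose class contains "ext"
            "//ul[@id='menu']/li[2]/a",                // the link in the second list item
            "//a[not(starts-with(@href,'http://'))]",  // links that are not absolute http URLs
            "//a[@id='missing']"                       // matches nothing
        };

        foreach (string query in queries)
            Console.WriteLine(query + " -> " + string.Join(", ", SelectHrefs(html, query)));
    }
}
```

Wrapping the null check in one helper keeps the calling code free of repeated guards, which is one possible way to live with the null return value of SelectNodes.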

I hope this short introduction to HtmlAgilityPack will help you get started with this really nice library and help you with your projects!




http://www.chinasem.cn/article/504124
