使用 XPATH 和 HTML Cleaner 解析 HTML/XML

2024-01-08 20:48

本文主要是介绍使用 XPATH 和 HTML Cleaner 解析 HTML/XML,希望对大家解决编程问题提供一定的参考价值,需要的开发者们随着小编来一起学习吧!

使用 XPATH 和 HTML Cleaner 解析 HTML/XML
(Using XPATH and HTML Cleaner to parse HTML / XML)

太阳火神的美丽人生 (http://blog.csdn.net/opengl_es)

本文遵循“署名-非商业用途-保持一致”创作公用协议

转载请保留此句:太阳火神的美丽人生 -  本博客专注于 敏捷开发及移动和物联设备研究:iOS、Android、Html5、Arduino、pcDuino否则,出自本博客的文章拒绝转载或再转载,谢谢合作。



使用 XPATH 和 HTML Cleaner 解析 HTML/XML
(Using XPATH and HTML Cleaner to parse HTML / XML)

JANUARY 5, 2010
tags: android, examples, HTML, parse, scraping, XML, XPATH

大家好
Hey everyone,

有时我发现有一种能力十分有用,尤其在 Web 相关的应用中,那就是从 web 站点获取 HTML 并且从 HTML 解析数据,或是任何你要想得到的内容(对于我的情况大多总是数据)。
So something that I’ve found to be extremely useful (especially in web related applications) is the ability to retrieve HTML from websites and parse their HTML for data or whatever you may be looking for (in my case it is almost always data).


I actually use this technique to do the real time stock/option imports for my Black-Scholes/Implied Volatility applications, so if you’re looking for an example on how to retrieve and parse HTML and run “queries” over it using, say, XPATH, then this post is for you.

Now, before we begin, in order to do this you will have to reference an external JAR in your project’s build path. The JAR that I use comes from HtmlCleaner which even gives you an example of how they use it here HtmlCleaner Example, but in addition to that I’ll show you an example of how I use it.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
public class OptionScraper {
     // EXAMPLE XPATH QUERIES IN THE FORM OF STRINGS - WILL BE USED LATER
     private static final String NAME_XPATH = "//div[@class='yfi_quote']/div[@class='hd']/h2" ;
     private static final String TIME_XPATH = "//table[@id='time_table']/tbody/tr/td[@class='yfnc_tabledata1']" ;
     private static final String PRICE_XPATH = "//table[@id='price_table']//tr//span" ;
     // TAGNODE OBJECT, ITS USE WILL COME IN LATER
     private static TagNode node;
     // A METHOD THAT HELPS ME RETRIEVE THE STOCK OPTION'S DATA BASED OFF THE NAME (I.E. GOUAA IS ONE OF GOOGLE'S STOCK OPTIONS)
     public static Option getOptionFromName(String name) throws XPatherException, ParserConfigurationException,SAXException, IOException, XPatherException {
         // THE URL WHOSE HTML I WANT TO RETRIEVE AND PARSE
         String option_url = "http://finance.yahoo.com/q?s=" + name.toUpperCase();
         // THIS IS WHERE THE HTMLCLEANER COMES IN, I INITIALIZE IT HERE
         HtmlCleaner cleaner = new HtmlCleaner();
         CleanerProperties props = cleaner.getProperties();
         props.setAllowHtmlInsideAttributes( true );
         props.setAllowMultiWordAttributes( true );
         props.setRecognizeUnicodeChars( true );
         props.setOmitComments( true );
         // OPEN A CONNECTION TO THE DESIRED URL
         URL url = new URL(option_url);
         URLConnection conn = url.openConnection();
         //USE THE CLEANER TO "CLEAN" THE HTML AND RETURN IT AS A TAGNODE OBJECT
         node = cleaner.clean( new InputStreamReader(conn.getInputStream()));
         // ONCE THE HTML IS CLEANED, THEN YOU CAN RUN YOUR XPATH EXPRESSIONS ON THE NODE, WHICH WILL THEN RETURN AN ARRAY OF TAGNODE OBJECTS (THESE ARE RETURNED AS OBJECTS BUT GET CASTED BELOW)
         Object[] info_nodes = node.evaluateXPath(NAME_XPATH);
         Object[] time_nodes = node.evaluateXPath(TIME_XPATH);
         Object[] price_nodes = node.evaluateXPath(PRICE_XPATH);
         // HERE I JUST DO A SIMPLE CHECK TO MAKE SURE THAT MY XPATH WAS CORRECT AND THAT AN ACTUAL NODE(S) WAS RETURNED
         if (info_nodes.length > 0 ) {
             // CASTED TO A TAGNODE
             TagNode info_node = (TagNode) info_nodes[ 0 ];
             // HOW TO RETRIEVE THE CONTENTS AS A STRING
             String info = info_node.getChildren().iterator().next().toString().trim();
             // SOME METHOD THAT PROCESSES THE STRING OF INFORMATION (IN MY CASE, THIS WAS THE STOCK QUOTE, ETC)
             processInfoNode(o, info);
         }
         if (time_nodes.length > 0 ) {
             TagNode time_node = (TagNode) time_nodes[ 0 ];
             String date = time_node.getChildren().iterator().next().toString().trim();
             // DATE RETURNED IN 15-JAN-10 FORMAT, SO THIS IS SOME METHOD I WROTE TO JUST PARSE THAT STRING INTO THE FORMAT THAT I USE
             processDateNode(o, date);
         }
         if (price_nodes.length > 0 ) {
             TagNode price_node = (TagNode) price_nodes[ 0 ];
             double price = Double.parseDouble(price_node.getChildren().iterator().next().toString().trim());
             o.setPremium(price);
         }
         return o;
     }
}

So that’s it! Once you include the JAR in your build path, everything else is pretty easy! It’s a great tool to use. However, it does require knowledge of XPATH but XPATH isn’t too hard to pick up and is useful to know so if you don’t know it then take a look at the link.

Now, a warning to everyone. It’s documented that the XPATH expressions recognized by HtmlCleaner is not complete in the sense that only “basic” XPATH is recognized. What’s excluded? For instance, you can’t use any of the “axes” operators (i.e. parent, ancestor, following, following-sibling, etc), but in my experience everything else is fair game. Yes, it sucks, and many times it can make your life a little bit harder, but usually it just requires you to be a tad more clever with your XPATH expressions before you can pull the desired information.

And of course, this technique works for XML documents as well!

Hope this was helpful to everyone. Let me know if you’re confused anywhere.

- jwei



这篇关于使用 XPATH 和 HTML Cleaner 解析 HTML/XML的文章就介绍到这儿,希望我们推荐的文章对编程师们有所帮助!



http://www.chinasem.cn/article/584772

相关文章

Vue和React受控组件的区别小结

《Vue和React受控组件的区别小结》本文主要介绍了Vue和React受控组件的区别小结,文中通过示例代码介绍的非常详细,对大家的学习或者工作具有一定的参考学习价值,需要的朋友们下面随着小编来一起学... 目录背景React 的实现vue3 的实现写法一:直接修改事件参数写法二:通过ref引用 DOMVu

Java使用Javassist动态生成HelloWorld类

《Java使用Javassist动态生成HelloWorld类》Javassist是一个非常强大的字节码操作和定义库,它允许开发者在运行时创建新的类或者修改现有的类,本文将简单介绍如何使用Javass... 目录1. Javassist简介2. 环境准备3. 动态生成HelloWorld类3.1 创建CtC

使用Python批量将.ncm格式的音频文件转换为.mp3格式的实战详解

《使用Python批量将.ncm格式的音频文件转换为.mp3格式的实战详解》本文详细介绍了如何使用Python通过ncmdump工具批量将.ncm音频转换为.mp3的步骤,包括安装、配置ffmpeg环... 目录1. 前言2. 安装 ncmdump3. 实现 .ncm 转 .mp34. 执行过程5. 执行结

Java实现将HTML文件与字符串转换为图片

《Java实现将HTML文件与字符串转换为图片》在Java开发中,我们经常会遇到将HTML内容转换为图片的需求,本文小编就来和大家详细讲讲如何使用FreeSpire.DocforJava库来实现这一功... 目录前言核心实现:html 转图片完整代码场景 1:转换本地 HTML 文件为图片场景 2:转换 H

Java使用jar命令配置服务器端口的完整指南

《Java使用jar命令配置服务器端口的完整指南》本文将详细介绍如何使用java-jar命令启动应用,并重点讲解如何配置服务器端口,同时提供一个实用的Web工具来简化这一过程,希望对大家有所帮助... 目录1. Java Jar文件简介1.1 什么是Jar文件1.2 创建可执行Jar文件2. 使用java

C#使用Spire.Doc for .NET实现HTML转Word的高效方案

《C#使用Spire.Docfor.NET实现HTML转Word的高效方案》在Web开发中,HTML内容的生成与处理是高频需求,然而,当用户需要将HTML页面或动态生成的HTML字符串转换为Wor... 目录引言一、html转Word的典型场景与挑战二、用 Spire.Doc 实现 HTML 转 Word1

Vue3绑定props默认值问题

《Vue3绑定props默认值问题》使用Vue3的defineProps配合TypeScript的interface定义props类型,并通过withDefaults设置默认值,使组件能安全访问传入的... 目录前言步骤步骤1:使用 defineProps 定义 Props步骤2:设置默认值总结前言使用T

Java中的抽象类与abstract 关键字使用详解

《Java中的抽象类与abstract关键字使用详解》:本文主要介绍Java中的抽象类与abstract关键字使用详解,本文通过实例代码给大家介绍的非常详细,感兴趣的朋友跟随小编一起看看吧... 目录一、抽象类的概念二、使用 abstract2.1 修饰类 => 抽象类2.2 修饰方法 => 抽象方法,没有

MyBatis ParameterHandler的具体使用

《MyBatisParameterHandler的具体使用》本文主要介绍了MyBatisParameterHandler的具体使用,文中通过示例代码介绍的非常详细,对大家的学习或者工作具有一定的参... 目录一、概述二、源码1 关键属性2.setParameters3.TypeHandler1.TypeHa

Spring 中的切面与事务结合使用完整示例

《Spring中的切面与事务结合使用完整示例》本文给大家介绍Spring中的切面与事务结合使用完整示例,本文通过实例代码给大家介绍的非常详细,对大家的学习或工作具有一定的参考借鉴价值,需要的朋友参考... 目录 一、前置知识:Spring AOP 与 事务的关系 事务本质上就是一个“切面”二、核心组件三、完