Using XPATH and HTML Cleaner to Parse HTML/XML

2024-01-08 20:48

This article introduces how to parse HTML/XML with XPATH and HtmlCleaner, and is intended as a practical reference for developers facing the same problem.


太阳火神的美丽人生 (http://blog.csdn.net/opengl_es)

This article is released under the Creative Commons "Attribution - NonCommercial - ShareAlike" license.

When reposting, please keep this line: 太阳火神的美丽人生 - this blog focuses on agile development and on research into mobile and IoT devices: iOS, Android, HTML5, Arduino, pcDuino. Otherwise, articles from this blog may not be reposted or re-reposted. Thank you for your cooperation.



Using XPATH and HTML Cleaner to parse HTML / XML

JANUARY 5, 2010
tags: android, examples, HTML, parse, scraping, XML, XPATH

Hey everyone,

So something that I’ve found to be extremely useful (especially in web related applications) is the ability to retrieve HTML from websites and parse that HTML for data or whatever you may be looking for (in my case it is almost always data).


I actually use this technique to do the real time stock/option imports for my Black-Scholes/Implied Volatility applications, so if you’re looking for an example on how to retrieve and parse HTML and run “queries” over it using, say, XPATH, then this post is for you.

Now, before we begin, in order to do this you will have to reference an external JAR in your project’s build path. The JAR that I use comes from HtmlCleaner, whose site even gives an example of how to use it (see the HtmlCleaner example page), but in addition to that I’ll show you an example of how I use it.

import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

import javax.xml.parsers.ParserConfigurationException;

import org.htmlcleaner.CleanerProperties;
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;
import org.htmlcleaner.XPatherException;
import org.xml.sax.SAXException;

public class OptionScraper {
     // EXAMPLE XPATH QUERIES IN THE FORM OF STRINGS - WILL BE USED LATER
     private static final String NAME_XPATH = "//div[@class='yfi_quote']/div[@class='hd']/h2";
     private static final String TIME_XPATH = "//table[@id='time_table']/tbody/tr/td[@class='yfnc_tabledata1']";
     private static final String PRICE_XPATH = "//table[@id='price_table']//tr//span";

     // TAGNODE OBJECT, ITS USE WILL COME IN LATER
     private static TagNode node;

     // A METHOD THAT HELPS ME RETRIEVE THE STOCK OPTION'S DATA BASED OFF THE NAME (I.E. GOUAA IS ONE OF GOOGLE'S STOCK OPTIONS)
     public static Option getOptionFromName(String name) throws XPatherException, ParserConfigurationException, SAXException, IOException {
         // THE URL WHOSE HTML I WANT TO RETRIEVE AND PARSE
         String option_url = "http://finance.yahoo.com/q?s=" + name.toUpperCase();

         // THE OPTION OBJECT THAT GETS FILLED IN BELOW (THE Option CLASS AND THE
         // processInfoNode/processDateNode HELPERS ARE DEFINED ELSEWHERE IN MY PROJECT AND NOT SHOWN HERE)
         Option o = new Option();

         // THIS IS WHERE THE HTMLCLEANER COMES IN, I INITIALIZE IT HERE
         HtmlCleaner cleaner = new HtmlCleaner();
         CleanerProperties props = cleaner.getProperties();
         props.setAllowHtmlInsideAttributes(true);
         props.setAllowMultiWordAttributes(true);
         props.setRecognizeUnicodeChars(true);
         props.setOmitComments(true);

         // OPEN A CONNECTION TO THE DESIRED URL
         URL url = new URL(option_url);
         URLConnection conn = url.openConnection();

         // USE THE CLEANER TO "CLEAN" THE HTML AND RETURN IT AS A TAGNODE OBJECT
         node = cleaner.clean(new InputStreamReader(conn.getInputStream()));

         // ONCE THE HTML IS CLEANED, THEN YOU CAN RUN YOUR XPATH EXPRESSIONS ON THE NODE, WHICH WILL THEN RETURN AN ARRAY OF TAGNODE OBJECTS (THESE ARE RETURNED AS OBJECTS BUT GET CASTED BELOW)
         Object[] info_nodes = node.evaluateXPath(NAME_XPATH);
         Object[] time_nodes = node.evaluateXPath(TIME_XPATH);
         Object[] price_nodes = node.evaluateXPath(PRICE_XPATH);

         // HERE I JUST DO A SIMPLE CHECK TO MAKE SURE THAT MY XPATH WAS CORRECT AND THAT AN ACTUAL NODE(S) WAS RETURNED
         if (info_nodes.length > 0) {
             // CASTED TO A TAGNODE
             TagNode info_node = (TagNode) info_nodes[0];
             // HOW TO RETRIEVE THE CONTENTS AS A STRING
             String info = info_node.getChildren().iterator().next().toString().trim();
             // SOME METHOD THAT PROCESSES THE STRING OF INFORMATION (IN MY CASE, THIS WAS THE STOCK QUOTE, ETC)
             processInfoNode(o, info);
         }
         if (time_nodes.length > 0) {
             TagNode time_node = (TagNode) time_nodes[0];
             String date = time_node.getChildren().iterator().next().toString().trim();
             // DATE RETURNED IN 15-JAN-10 FORMAT, SO THIS IS SOME METHOD I WROTE TO JUST PARSE THAT STRING INTO THE FORMAT THAT I USE
             processDateNode(o, date);
         }
         if (price_nodes.length > 0) {
             TagNode price_node = (TagNode) price_nodes[0];
             double price = Double.parseDouble(price_node.getChildren().iterator().next().toString().trim());
             o.setPremium(price);
         }
         return o;
     }
}
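
For reference, a minimal sketch of how a class like this might be called from elsewhere in a project. The Option class and its getPremium() accessor are not shown in the post, so their shape here is an assumption inferred from how the scraper fills the object:

// Minimal usage sketch. The Option class and getPremium() are assumptions
// based on how OptionScraper uses the object; they are not shown in the post.
public class OptionScraperDemo {
    public static void main(String[] args) throws Exception {
        // GOUAA is the example option symbol mentioned in the comments above
        Option o = OptionScraper.getOptionFromName("GOUAA");
        System.out.println("Premium: " + o.getPremium());
    }
}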

So that’s it! Once you include the JAR in your build path, everything else is pretty easy! It’s a great tool to use. It does require some knowledge of XPATH, but XPATH isn’t too hard to pick up and is useful to know, so if you don’t know it, take a look at the link above.
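
If XPATH is new to you, a handful of patterns covers most scraping work. A few illustrative queries, written as Java string constants the same way the scraper above declares them (the tag names and attribute values here are made-up examples, not taken from Yahoo’s pages):

// Hypothetical XPATH patterns for the kinds of selections scrapers usually need.
private static final String BY_ID    = "//div[@id='quote_summary']";        // element with a given id
private static final String BY_CLASS = "//span[@class='price']";            // elements with a given class
private static final String NTH_CELL = "//table[@id='data']//tr[2]/td[1]";  // second row, first cell
private static final String HAS_HREF = "//a[@href]";                        // anchors that carry an href attribute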

Now, a warning to everyone. It’s documented that the XPATH support in HtmlCleaner is not complete, in the sense that only “basic” XPATH is recognized. What’s excluded? For instance, you can’t use any of the “axes” operators (i.e. parent, ancestor, following, following-sibling, etc), but in my experience everything else is fair game. Yes, it sucks, and many times it can make your life a little bit harder, but usually it just requires you to be a tad more clever with your XPATH expressions before you can pull the desired information.
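
One way around the missing axes (a workaround sketch, not something from the original post) is to select a nearby node with plain XPATH and then climb the tree with TagNode’s own methods such as getParent(). The 'price' class name below is a hypothetical example:

// Instead of "//span[@class='price']/parent::td" (which HtmlCleaner's XPATH
// won't accept), select the span and walk up to its parent in Java.
Object[] spans = node.evaluateXPath("//span[@class='price']");
if (spans.length > 0) {
    TagNode span = (TagNode) spans[0];
    TagNode cell = span.getParent();                     // what parent:: would have returned
    String cellText = cell.getText().toString().trim();  // text of the enclosing cell
}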

And of course, this technique works for XML documents as well!
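
For XML the flow is identical; since HtmlCleaner’s clean() also accepts a plain String, a minimal sketch (with a made-up document and query) looks like:

// Minimal sketch of the same technique over an XML string; the document
// and the XPATH are made-up examples.
HtmlCleaner cleaner = new HtmlCleaner();
TagNode root = cleaner.clean("<catalog><book id=\"1\"><name>XPath Basics</name></book></catalog>");
Object[] names = root.evaluateXPath("//book[@id='1']/name");
if (names.length > 0) {
    String bookName = ((TagNode) names[0]).getText().toString().trim();
    System.out.println(bookName);  // prints "XPath Basics"
}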

Hope this was helpful to everyone. Let me know if you’re confused anywhere.

- jwei


