PHP: DOM Parsing for Web Scraping

2024-06-22 12:52

Tags: php, dom, scraper parsing


Background

When scraping web pages, PHP offers several built-in classes for parsing the fetched HTML.
This article introduces two of them:

DOMDocument
DOMXPath

Code walkthrough

<?php
// Sample HTML
$html = '<!DOCTYPE html>
<html><head><meta charset="UTF-8"><title>Example</title></head><body><div><p>Hello World!</p><p xx="test_custom_key" k="test_multi_key">this is a test 中文</p><a href="#xxx_link">Link</a><p id="xxx_p">Link</p><a href="#xxx_link——2">Link</a><a href="#xxx_link_%E4%B8%AD%E6%96%87%E9%93%BE%E6%8E%A5">Link</a></div><div id="custom_div_id"><p id="xxx">Another paragraph</p><a href="#">Another link</a><div>second empty div</div><div id="custom_div_id"><p>second div paragraph</p></div></div></body>
</html>';

// Print a node's text content followed by all of its attributes.
function printDomNode($node)
{
    echo "----\nNode: " . $node->nodeValue . "\n";
    echo "all attr: \n";
    for ($i = 0; $i < $node->attributes->length; $i++) {
        $attr = $node->attributes->item($i);
        echo "\t" . $attr->nodeName . ': ' . $attr->nodeValue . "\n";
    }
}

// Create a DOMDocument instance and load the HTML
$dom = new DOMDocument();
// The @ suppresses the warnings libxml raises for markup it does not recognize
@$dom->loadHTML($html);

// Create a DOMXPath instance
$xpath = new DOMXPath($dom);

echo "----------------------------\n";
// Example 1: find all <p> elements
$paragraphs = $xpath->query('//p');
foreach ($paragraphs as $paragraph) {
    echo "----\nNode----------: " . $paragraph->nodeValue . "\n";
    // getAttribute() returns an empty string when the attribute is absent
    echo "id attr: " . $paragraph->getAttribute('id') . "\n";

    // Option 1: walk the attribute map by index
    echo "all attr: \n";
    for ($i = 0; $i < $paragraph->attributes->length; $i++) {
        $attr = $paragraph->attributes->item($i);
        echo "\t" . $attr->nodeName . ': ' . $attr->nodeValue . "\n";
    }

    // Option 2: walk it with foreach; for attributes, name/value carry
    // the same data as nodeName/nodeValue
    echo "=foreach=\n";
    foreach ($paragraph->attributes as $attr) {
        echo "\t" . $attr->name . ': ' . $attr->value . "\n";
        echo "\t" . $attr->nodeName . ': ' . $attr->nodeValue . "\n";
    }
}

echo "----------------------------\n";
// Find <p> elements whose custom attribute xx equals "test_custom_key"
$paragraphs = $xpath->query('//p[@xx="test_custom_key"]');
foreach ($paragraphs as $paragraph) {
    printDomNode($paragraph);
}

echo "----------------------------\n";
// Example 2: find <a> elements whose text is exactly "Link"
$links = $xpath->query('//a[text()="Link"]');
// Without the <a> restriction, every element whose text is "Link" would match:
//$links = $xpath->query('//*[text()="Link"]');
foreach ($links as $link) {
    $href = $link->getAttribute('href');
    echo "origin: " . $href . "\n";
    echo "decode: " . urldecode($href) . "\n";
}

// Find nodes under a specific path
echo "----------------------------p nodes under the given path\n";
$paragraphs = $xpath->query('//div[@id="custom_div_id"]//p[@id="xxx"]');
foreach ($paragraphs as $paragraph) {
    printDomNode($paragraph);
}

echo "----------------------------\n";
// Example 3: find the element children of every <div>
// (//div also matches nested <div> elements, so their children are returned too)
$divChildren = $xpath->query('//div/*');
foreach ($divChildren as $child) {
    // echo $child->nodeName . ": " . $child->nodeValue . "\n";
    printDomNode($child);
}
?>
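
The path query in the listing above can also be split into two steps by passing a context node to query(), and an exact text match can be loosened with contains(). The following is a minimal sketch of both ideas, not part of the original listing; the sample markup, the id "box", and the variable names are made up for illustration.

```php
<?php
// Hypothetical sample markup, shorter than the article's example
$html = '<html><body>'
    . '<div id="box"><p id="xxx">Another paragraph</p><a href="#a">Link one</a></div>'
    . '<div><p>Outside</p><a href="#b">Other</a></div>'
    . '</body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Step 1: locate the container element
$box = $xpath->query('//div[@id="box"]')->item(0);

// Step 2: query relative to it; the leading "." anchors the search at $box
$insideBox = [];
foreach ($xpath->query('.//p', $box) as $p) {
    $insideBox[] = $p->nodeValue;
}

// contains() matches a substring, whereas text()="Link" requires an exact match
$partialLinks = [];
foreach ($xpath->query('//a[contains(text(), "Link")]') as $a) {
    $partialLinks[] = $a->getAttribute('href');
}

print_r($insideBox);
print_r($partialLinks);
```

Without the leading dot, an expression such as //p would search the whole document again even when a context node is supplied, so the dot matters for the relative query.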

Example: getting the p tags from an HTML document

Steps
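- Get the page HTML
  - The HTTP request is omitted here. To fetch the HTML from a URL: $html = file_get_contents($url);
- Build a DOM tree from the HTML
- Use the DOMXPath class to find the target element nodes
  - Create a DOMXPath instance
  - Run the search with the query() method
    - To find all p tags, use //p; this searches the whole tree and returns every p element
    - To find p tags with a specific attribute: //p[@xx="test_custom_key"]
      - xx here is a custom attribute
      - @ selects an attribute of the node
    - To find nodes under a specific path
      - //div[@id="custom_div_id"]//p[@id="xxx"]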
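
The steps above can be condensed into one small helper. This is only a sketch under a few assumptions: the function name queryHtmlText and the sample markup are hypothetical, and libxml_use_internal_errors() is swapped in for the @ operator as one possible way to keep parser warnings out of the output.

```php
<?php
// Hypothetical helper: returns the text of every node matched by an XPath expression.
function queryHtmlText(string $html, string $expression): array
{
    // Collect libxml parse warnings internally instead of silencing them with @
    $previous = libxml_use_internal_errors(true);

    // Build the DOM tree from the HTML string
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Create the DOMXPath instance and run the query
    $xpath = new DOMXPath($dom);
    $nodes = $xpath->query($expression);

    $texts = [];
    if ($nodes !== false) { // query() returns false for a malformed expression
        foreach ($nodes as $node) {
            $texts[] = $node->nodeValue;
        }
    }
    return $texts;
}

// To work from a live page, fetch the markup first, e.g.
// $html = file_get_contents($url); and check the result for false before parsing.
$html = '<html><body><p>Hello World!</p><p xx="test_custom_key">custom</p>'
    . '<div id="custom_div_id"><p id="xxx">Another paragraph</p></div></body></html>';

print_r(queryHtmlText($html, '//p'));
print_r(queryHtmlText($html, '//p[@xx="test_custom_key"]'));
print_r(queryHtmlText($html, '//div[@id="custom_div_id"]//p[@id="xxx"]'));
```

Returning plain strings keeps the helper easy to test; code that also needs attributes would return the DOMNode objects instead.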