开源情报之领英人脸情报收集，如何快速收集上亿张人脸情报

本文主要是介绍开源情报之领英人脸情报收集，如何快速收集上亿张人脸情报，希望对大家解决编程问题提供一定的参考价值，需要的开发者们随着小编来一起学习吧！

一.前言
先看应用例子：
残忍至极！乌克兰用人脸识别战死俄军，联系母亲打“心理战”
情报机构，所掌握的数据，可以是市面上流出的任何数据，比如市面上泄露的领英数据，facebook，twitter，这些数据可以作为开源情报的基础数据之一，用来将互联网与个人实体联系起来
所有的技术，第一服务目标是暴力，如果你是一个程序员，如何构建一个能联系起现实的庞大数据库，通过触手可及的互联网内容。先展示我的成果，再来讲述技术：
已经成功收集了几千万张这类头像

二.技术实现
SeetaFace6，爬虫
领英已经实现了严格的反爬措施，要爬取6亿条用户的头像，那就要找一个相对于好的弱项进行攻破；已知领英开发团队来之meta，meta程序员好给每个用户搞多个接口返回用户信息，例如badges页面，可以通过该页面，获取无穷无尽的用户头像
1.实现第一步，获取领英的账号地址，如果你是出色的情报人员，你手上应该有已经有了上亿的领英用户主页地址了，如果没有，你可以自己使用程序进行爬取，或者通过灰色渠道，这里写如何通过爬虫爬取：
爬虫实现技术，java selenium,使用现成领英账号登录后进行爬取
如何实现selenium的登录控制与特征抹除：

package com.util;
import org.openqa.selenium.Dimension;
import org.openqa.selenium.PageLoadStrategy;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;
import org.openqa.selenium.logging.LogType;
import org.openqa.selenium.logging.LoggingPreferences;
import org.openqa.selenium.net.ChromeDriverProxy;
public class WebDriverTool {/*** 获取web驱动* * @return 浏览器驱动*/public ChromeDriver getWebDriver(String username) {System.setProperty("webdriver.chrome.driver", com.util.PropertyUtil.getvalue("chromedriver"));// 指定驱动路径// 设置浏览器参数ChromeOptions options = new ChromeOptions();Map<String, Object> prefs = new HashMap<String, Object>();prefs.put("credentials_enable_service", false);prefs.put("profile.password_manager_enabled", false);prefs.put("profile.password_manager_enabled", false);options.addArguments("user-data-dir=C:\\chrome\\"+username);//指定浏览器的运行文件存储地，领英账号登录后就可以保持长久的会话了/**excludeSwitches", Arrays.asList("enable-automation")在高版本的谷歌浏览器是无法屏蔽window.navigator.webdriver 为false 的特征，这里写出来是为了配合其他参数来关闭浏览器上显示"正在收到自动测试软件控制"的提示**/options.setExperimentalOption("excludeSwitches", Arrays.asList("enable-automation"));options.addArguments("--disable-blink-features");options.addArguments("--disable-blink-features=AutomationControlled");options.setExperimentalOption("useAutomationExtension", false);//options.addArguments("blink-settings=imagesEnabled=false");options.setExperimentalOption("prefs", prefs);// 创建驱动对象ChromeDriver driver = new ChromeDriver(options);//ChromeDriverProxy driver=new ChromeDriverProxy(options);driver.manage().window().setSize(new Dimension(1280, 1024));// 去除seleium全部指纹特征FileReader fileReader = new FileReader("C:\\lurk.js");String js = fileReader.readString();// MapBuilder是依赖hutool工具包的apiMap<String, Object> commandMap = MapBuilder.create(new LinkedHashMap<String, Object>()).put("source", js).build();// executeCdpCommand这个api在selenium3中是没有的,请使用selenium4才能使用此api((ChromeDriver) driver).executeCdpCommand("Page.addScriptToEvaluateOnNewDocument", commandMap);return driver ;}}

lurk.js 文件是控制特征去除的js片段，下载地址：https://download.csdn.net/download/qq_19383667/88444628
使用selenium进行账号登录后，找到：https://sg.linkedin.com/in/li-hao-74581548 这个页面，你会发现，只需要知道领英用户主页地址，即可快速批量的获得用户的头像文件，而且访问一个地址，你就能获取几十张额外的头像与用户主页地址

同名推荐
url对应的正主
同公司地域行业的推荐
到这里基本上能完成很多头像的收集

2.SeetaFace6实现头像切割与特征收集
该项目java版地址：https://gitee.com/cnsugar/seetaface6JNI，特征识别方法为：

try {BufferedImage user = ImageIO.read(new File(downpath));if (user != null) {float[] s = FaceHelper.extractMaxFace(user);ArrayList<Float> list = new ArrayList<Float>();if (s != null) {for (int i = 0; i < s.length; i++) {list.add(s[i]);}JSONArray maxfacecode = JSONArray.fromObject(list);maxfacecode_str = maxfacecode.toString();//数字化的人脸特征值，后期直接可用用作人脸对比}}} catch (IOException e) {// TODO Auto-generated catch blocke.printStackTrace();}