This article walks through building a homemade search service: Elasticsearch installation, syncing data with mongo-connector, and operating it from Python.
The goal is a search service built on Elasticsearch, with the data stored in MongoDB.
1:Elasticsearch
Download:
Elasticsearch download page: https://www.elastic.co/downloads/elasticsearch
Install:
Edit elasticsearch-5.5.1/config/elasticsearch.yml:
# cluster name
cluster.name: myElasticsearch
# node name
node.name: node001
# 0.0.0.0 lets other machines access the node
network.host: 0.0.0.0
# HTTP port
http.port: 9200
Start it with: elasticsearch-5.5.1/bin/elasticsearch
Open 127.0.0.1:9200 in a browser to confirm it is running.
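As a quick check outside the browser, here is a minimal Python sketch (standard library only) that fetches the ES root endpoint; it assumes ES is reachable at 127.0.0.1:9200 as configured above.
# -*- coding: utf-8 -*-
# Sketch: fetch the Elasticsearch root endpoint and print the cluster info.
# Assumes ES is listening on 127.0.0.1:9200 as configured above.
import json
from urllib.request import urlopen

with urlopen("http://127.0.0.1:9200") as resp:
    info = json.loads(resp.read().decode("utf-8"))

print(info["cluster_name"], info["version"]["number"])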
2:Elasticsearch-head
Edit elasticsearch-5.5.1/config/elasticsearch.yml:
# add these settings so the head plugin can access ES
http.cors.enabled: true
http.cors.allow-origin: "*"
git clone git://github.com/mobz/elasticsearch-head.git
Install grunt (requires node and npm):
npm install -g grunt-cli
npm install -g grunt
Modify the head source code:
elasticsearch-head/Gruntfile.js
connect: {
    server: {
        options: {
            port: 9100,
            hostname: '*',
            base: '.',
            keepalive: true
        }
    }
}
Add hostname: '*',
elasticsearch-head/_site/app.js
# change head's connection address:
this.base_uri = this.config.base_uri || this.prefs.get("app-base_uri") || "http://localhost:9200";
# replace localhost with your ES server address, e.g.:
this.base_uri = this.config.base_uri || this.prefs.get("app-base_uri") || "http://x.x.x.x:9200";
Install elasticsearch-head (requires node and npm):
cd elasticsearch-head/
npm install
Start it:
grunt server
Open 127.0.0.1:9100 in a browser.
3:mongo-connector
mongo-connector requires MongoDB to run as a replica set.
Create three data directories, then start the nodes:
# --replSet is the replica set name, --port is the port, --dbpath is the data directory
# first node
sudo mongod --dbpath=/Users/zjl/mongodbdata/data1 --port 27018 --replSet rs0
# second node
sudo mongod --dbpath=/Users/zjl/mongodbdata/data2 --port 27019 --replSet rs0
# third node
sudo mongod --dbpath=/Users/zjl/mongodbdata/data3 --port 27020 --replSet rs0
Connect to the first node and define the replica set configuration:
mongo 127.0.0.1:27018
config = {"_id": "rs0", members: [{"_id": 0, "host": "127.0.0.1:27018"}, {"_id": 1, "host": "127.0.0.1:27019"}, {"_id": 2, "host": "127.0.0.1:27020", arbiterOnly: true}]}
# arbiterOnly: true makes that node an arbiter; it stores no data. Reportedly it is unnecessary as long as the number of nodes is odd.
# initialize the replica set
rs.initiate(config);
# view the configuration
rs.conf();
# view the status
rs.status();
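The same check can be done from Python; this is a small sketch (it assumes pymongo is installed and that the first node listens on 127.0.0.1:27018).
# -*- coding: utf-8 -*-
# Sketch: confirm the replica set is up via the replSetGetStatus admin command.
# Assumes pymongo is installed; 127.0.0.1:27018 is the first node started above.
from pymongo import MongoClient

client = MongoClient("127.0.0.1", 27018)
status = client.admin.command("replSetGetStatus")
for member in status["members"]:
    print(member["name"], member["stateStr"])  # e.g. PRIMARY / SECONDARY / ARBITER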
https://github.com/mongodb-labs/mongo-connector
pip install 'mongo-connector[elastic5]'
Sync command: mongo-connector -m 127.0.0.1:27018 -t 127.0.0.1:9200 -d elastic2_doc_manager
If it prints "Logging to /xx/xx/mongo-connector.log." it is running normally.
Now, when we create a database on the MongoDB primary, it is replicated to the secondaries right away and also synced to Elasticsearch.
Here 27018 is the primary node: a database created on the primary shows up on the secondaries immediately,
and it is also synced to Elasticsearch, where a MongoDB database becomes an Elasticsearch index and a collection becomes an Elasticsearch type, as sketched below.
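A quick end-to-end sketch of that behavior, assuming pymongo and the elasticsearch client are installed; the collection name "articles" and the sample document are made up for illustration, while "test99911", "title" and "body" match the names used later in flaskrun.py.
# -*- coding: utf-8 -*-
# Sketch: insert a document through the MongoDB primary, wait briefly,
# then look for it in the index/type that mongo-connector creates in ES.
# "articles" and the document contents are example values.
import time
from pymongo import MongoClient
from elasticsearch import Elasticsearch

mongo = MongoClient("127.0.0.1", 27018)  # primary node
mongo["test99911"]["articles"].insert_one({"title": "hello", "body": "elasticsearch sync test"})

time.sleep(2)  # give mongo-connector a moment to sync

es = Elasticsearch("127.0.0.1:9200")
res = es.search(index="test99911", doc_type="articles",
                body={"query": {"match": {"title": "hello"}}})
print(res["hits"]["total"])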
4:Operating Elasticsearch from Python
https://www.elastic.co/guide/en/elasticsearch/client/python-api/current/index.html
pip install elasticsearch
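A minimal sketch of the official client follows; it assumes ES is on 127.0.0.1:9200, and the index name "demo_index" and the document are made up for illustration.
# -*- coding: utf-8 -*-
# Sketch: index one document and search it back with the elasticsearch client.
# "demo_index" is an example name, not part of the article's data.
from elasticsearch import Elasticsearch

es = Elasticsearch("127.0.0.1:9200")
es.index(index="demo_index", doc_type="doc", id=1, body={"title": "hello elasticsearch"})
es.indices.refresh(index="demo_index")  # make the document searchable immediately
result = es.search(index="demo_index", body={"query": {"match": {"title": "hello"}}})
print(result["hits"]["hits"])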
Search engines like Baidu and Google extract keywords from user input, deduplicate them, and correct misspelled English words.
First add some data to MongoDB.
Elasticsearch's default string-similarity scoring is reportedly TF-IDF based, but it does not segment Chinese text on its own; the ik analyzer plugin handles that without any manual work. Here I use jieba for segmentation instead. Since Elasticsearch already encapsulates its ranking algorithms internally, and string-similarity methods all follow the same familiar patterns, post-processing the search results would not add much. What I can do is process the keywords before they reach Elasticsearch, such as keyword extraction and spelling correction. That feels fairly limited, but then again I do not know exactly how real search engines work internally.
estest
|----enchant_py.py (spelling correction, found online)
|----EsQuery.py (Elasticsearch operations)
|----flaskrun.py (Flask service)
|----dict.txt (jieba user dictionary; if segmentation results are poor, add words and set their frequencies here)
|----stop_words.txt (jieba stop words, used for keyword extraction)
|----big.txt (corpus used for spelling correction; use an English dictionary or an English novel, and add the words you need if results are poor)
enchant_py.py
# -*- coding: utf-8 -*-
# __author__ = "ZJL"

import re, collections

def words(text):
    return re.findall('[a-z]+', text.lower())

def train(features):
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

NWORDS = train(words(open('big.txt').read()))

alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
    n = len(word)
    return set([word[0:i] + word[i + 1:] for i in range(n)] +                              # deletion
               [word[0:i] + word[i + 1] + word[i] + word[i + 2:] for i in range(n - 1)] +  # transposition
               [word[0:i] + c + word[i + 1:] for i in range(n) for c in alphabet] +        # alteration
               [word[0:i] + c + word[i:] for i in range(n + 1) for c in alphabet])         # insertion

def known_edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)

def known(words):
    return set(w for w in words if w in NWORDS)

def correct(word):
    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
    return max(candidates, key=lambda w: NWORDS[w])

# print('thew => ' + correct('thew'))
# print('spak => ' + correct('spak'))
# print('goof => ' + correct('goof'))
# print('babyu => ' + correct('babyu'))
# print('spalling => ' + correct('spalling'))
# print("Hello =>"+ correct('Hello'))
EsQuery.py
# -*- coding: utf-8 -*-
# __author__ = "ZJL"

from elasticsearch import Elasticsearch
import jieba
import jieba.analyse
import re
import enchant_py
import json


class ESQuery(object):
    def __init__(self):
        self.es = Elasticsearch("127.0.0.1:9200")

    def ES_Query(self, es_index, es_doc_type, query_key_list, strs, num, size_num):
        from_num = (num - 1) * size_num
        size_num = num * size_num
        esstrs = " ".join(strs.get("key_list", ""))
        str_key = strs.get("key_str", "")
        re_nums = re.findall(r'[0-9]+', esstrs)
        re_nums_list = []
        if re_nums:
            for re_num in re_nums:
                re_nums_list.append({"match": {"age": re_num}})
        for query_key in query_key_list:
            re_nums_list.append({"match": {query_key: esstrs}})
        print(re_nums_list)
        body = {
            "query": {
                "bool": {
                    "must": [],
                    "must_not": [],
                    "should": re_nums_list
                }
            },
            "from": from_num,
            "size": size_num,
            "sort": [],
            "aggs": {},
            # keyword highlighting
            "highlight": {
                "fields": {
                    "school": {},
                    "name": {}
                }
            }
        }
        a = self.es.search(index=es_index, doc_type=es_doc_type, body=body)
        aa = a["hits"]
        aa["key_str"] = str_key
        data_json = json.dumps(aa)
        print(data_json)
        return data_json

    def Check_Keyword(self, key_str):
        # user dictionary
        file_name = "dict.txt"
        # stop words stop_words.txt
        stop_file_name = "stop_words.txt"
        # load the user dictionary
        jieba.load_userdict(file_name)
        # load the stop words
        jieba.analyse.set_stop_words(stop_file_name)
        key_str_copy = key_str
        # find all English words with a regex
        result_list = re.findall(r'[a-zA-Z]+', key_str_copy)
        # key_str_list = list(jieba.cut(key_str.strip()))
        # print(key_str_list)
        # with fewer than 3 English words (Baidu does not correct more than two either),
        # spell-correct them and map each original word to its correction
        corr_dict = {}
        if len(result_list) < 3 and len(result_list) > 0:
            for restr in result_list:
                strd = enchant_py.correct(restr)
                if restr != strd:
                    corr_dict[restr] = strd
            # replace the original words with the corrected ones
            for corr in corr_dict:
                key_str_copy = key_str_copy.replace(corr, corr_dict.get(corr, ""))
            # extract keywords with jieba's TF-IDF algorithm
            tagstr = jieba.analyse.extract_tags(key_str_copy, topK=20, withWeight=False, allowPOS=())
        # short English phrases are roughly in this range and rarely contain stop words,
        # so mixed Chinese/English input still has its Chinese stop words removed
        elif 3 <= len(result_list) <= 5:
            tagstr = jieba.analyse.extract_tags(key_str_copy, topK=20, withWeight=False, allowPOS=())
        # with too many English words, output them as they are
        else:
            # segmentation
            key_str_list = list(jieba.cut(key_str_copy))
            # strip special characters from pure-English input
            stop_key = [" ", "(", ")", ".", ",", "\'", "\"", "*", "+", "-", "\\", "/", "`", "~", "@",
                        "#", "$", "%", "^", "&", '[', ']', "{", "}", ";", "?", "!", "\t", "\n", ":"]
            for key in stop_key:
                if key in key_str_list:
                    key_str_list.remove(key)
            tagstr = key_str_list
        # only return the original query string if a spelling correction was made
        if corr_dict:
            data_str = key_str
        else:
            data_str = ""
        data = {"key_list": tagstr, "key_str": data_str}
        print(data)
        return data
flaskrun.py
# -*- coding: utf-8 -*-
# __author__ = "ZJL"

from flask import Flask
from flask import request
from EsQuery import ESQuery
from werkzeug.contrib.fixers import ProxyFix

app = Flask(__name__)

"""
@api {get} / homepage
@apiName index
@apiGroup indexx
"""
@app.route("/")
def index():
    return "hello world"

"""
@api {get} /query query
@apiName query
@apiGroup queryxx
@apiParam {string} strs keywords
@apiParam {string} num page number
@apiParam {string} size_num results per page
"""
@app.route('/query', methods=['GET'])
def es_query():
    if request.method == 'GET' and request.args['strs'] and request.args['num'] and request.args['size_num']:
        num = int(request.args['num'])
        size_num = int(request.args['size_num'])
        strs = request.args['strs']
        eq = ESQuery()
        key_str_dict = eq.Check_Keyword(strs)
        es_index = ["test99911"]
        es_type = []
        es_query_list = ["title", "body"]
        data_json = eq.ES_Query(es_index, es_type, es_query_list, key_str_dict, num, size_num)
        return data_json
    else:
        return "no"

app.wsgi_app = ProxyFix(app.wsgi_app)

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5123)  # ,debug=True,threaded=True
    # parameters can be read in three ways: request.form, request.args, request.values
    # postForm = request.form
    # getArgs = request.args
    # postValues = request.values
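To exercise the service, here is a small sketch that calls the /query endpoint; it assumes flaskrun.py is running on port 5123, and the query text and paging values are only examples.
# -*- coding: utf-8 -*-
# Sketch: call the Flask /query endpoint with a sample keyword, page 1, 10 results per page.
# Assumes flaskrun.py is running locally on port 5123; the query text is made up.
import json
from urllib.request import urlopen
from urllib.parse import urlencode

params = urlencode({"strs": "hello 相似度", "num": 1, "size_num": 10})
with urlopen("http://127.0.0.1:5123/query?" + params) as resp:
    print(json.loads(resp.read().decode("utf-8")))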
dict.txt
相似度 5
stop_words.txt
的
了
和
是
就
都
而
及
與
著
或
一個
沒有
我們
你們
妳們
他們
她們
是否
与
着
一个
没有
我们
你们
他们
她们
它们
big.txt is too long to include here.
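For reference, a small sketch of how dict.txt and stop_words.txt plug into jieba, the same way Check_Keyword loads them; the sample sentence is made up.
# -*- coding: utf-8 -*-
# Sketch: load the user dictionary and stop words shown above, then extract keywords.
# Assumes jieba is installed and the two files sit in the working directory.
import jieba
import jieba.analyse

jieba.load_userdict("dict.txt")                  # custom entries such as "相似度 5"
jieba.analyse.set_stop_words("stop_words.txt")   # stop words removed during extraction

sentence = "我们的字符串相似度算法是没有问题的"    # example input
print(jieba.analyse.extract_tags(sentence, topK=20, withWeight=False, allowPOS=()))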
That concludes this write-up on a homemade search setup: Elasticsearch installation, mongo-connector data syncing, and Python operations.