[RAG Beginner Tutorial 04] Document Splitting in LangChain

2024-06-10 04:28


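In LangChain, a document transformer is a tool that processes documents before they are passed to other LangChain components. By cleaning, processing, and transforming documents, these tools ensure that LLMs and other LangChain components receive data in a format that lets them perform well.

The previous chapter covered document loaders. Once documents are loaded, they still need to be transformed.

  • Text splitters
  • Integrations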

Text Splitters

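Text splitters are designed to break text documents into smaller, more manageable units.

Ideally, these chunks are sentences or paragraphs, so that the context and relationships within the text stay intact.

Splitters account for the limited processing capacity of LLMs. With smaller chunks, an LLM can analyze the information more effectively within its context window.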

  • CharacterTextSplitter
  • RecursiveCharacterTextSplitter
  • Split by tokens
  • Semantic Chunking
  • HTMLHeaderTextSplitter
  • MarkdownHeaderTextSplitter
  • RecursiveJsonSplitter
  • Split Code

CharacterTextSplitter

from langchain_text_splitters import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
    is_separator_regex=False,
)
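  • separator: the delimiter used to identify natural break points in the text. Here it is set to "\n\n", so the splitter looks for double newlines as potential split points.
  • chunk_size: the target size of each chunk, measured in characters. Here it is set to 1000, so the splitter aims for chunks of roughly 1000 characters at most.
  • chunk_overlap: the number of characters that consecutive chunks may share. Here it is set to 200, so each chunk can repeat up to about 200 characters from the end of the previous chunk. This overlap helps ensure that no important information is lost at chunk boundaries.
  • length_function: the function used to measure the length of a chunk. Here it is the built-in len function, which counts the characters in a string.
  • is_separator_regex: whether the separator is a regular expression. Here it is False, so the separator is treated as a plain string rather than a regex pattern.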
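As a rough illustration of what is_separator_regex changes, the sketch below splits text with Python's re module. The function name split_on_separator is hypothetical, and the code is a simplified stand-in rather than LangChain's implementation. It only shows a separator being matched literally versus being interpreted as a pattern.

```python
import re


def split_on_separator(text, separator, is_separator_regex=False):
    """Split text on a separator (hypothetical helper for illustration)."""
    # A plain-string separator is escaped so that characters such as
    # "." or "+" lose their special regex meaning.
    pattern = separator if is_separator_regex else re.escape(separator)
    return [piece for piece in re.split(pattern, text) if piece]


text = "first paragraph\n\n\nsecond paragraph"

# Interpreted as a regex, r"\n+" matches any run of newlines.
print(split_on_separator(text, r"\n+", is_separator_regex=True))

# Interpreted literally, the same separator never occurs in the text,
# so nothing is split.
print(split_on_separator(text, r"\n+"))
```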

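CharacterTextSplitter splits text on the specified separator, which defaults to '\n\n'. The chunk_size parameter sets the maximum size of each chunk, and a split happens only where it is feasible, that is, at a separator. If a string starts with n characters, followed by a separator, and then has m characters before the next separator, the first chunk will have size n whenever chunk_size is less than n + m + len(separator).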
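The n and m rule above can be reproduced with a small greedy merge. This is a minimal sketch under simplifying assumptions: simple_character_split is a hypothetical name, overlap is ignored, and the real splitter does more work than this.

```python
def simple_character_split(text, separator="\n\n", chunk_size=1000):
    """Split on separator, then greedily merge pieces up to chunk_size.

    Simplified sketch: no chunk_overlap, and a single piece longer than
    chunk_size is kept whole because there is no separator to cut at.
    """
    chunks = []
    current = ""
    for piece in text.split(separator):
        if not piece:
            continue
        candidate = piece if not current else current + separator + piece
        if not current or len(candidate) <= chunk_size:
            current = candidate
        else:
            # Adding this piece would exceed chunk_size: close the chunk.
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


n_part = "a" * 6   # n = 6
m_part = "b" * 5   # m = 5
text = n_part + "\n\n" + m_part

# chunk_size (10) < n + m + len(separator) (13): first chunk has size n.
print([len(c) for c in simple_character_split(text, chunk_size=10)])

# chunk_size (13) is large enough, so both parts stay in one chunk.
print([len(c) for c in simple_character_split(text, chunk_size=13)])
```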

from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("book.pdf")
pages = loader.load_and_split()

from langchain_text_splitters import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
    is_separator_regex=False,
)
texts = text_splitter.split_text(pages[0].page_content)
print(len(texts))
# 4

texts[0]
"""
'Our goal with this book is to provide the guidance and framework for you, the reader, to grow on \nthe path to being a truly excellent database reliability engineer (DBRE). When naming the book we \nchose to use the words reliability engineer , rather than administrator.  \nBen Treynor, VP of Engineering at Google, says the following about reliability engi‐ neering:  \nfundamentally doing work that has historically been done by an operations team, but using engineers with software \nexpertise, and banking on the fact that these engineers are inherently both predisposed to, and have the ability to, \nsubstitute automation for  human labor.  \nToday’s database professionals must be engineers, not administrators. We build things. We create \nthings. As engineers practicing devops, we are all in this together, and nothing is someone else’s \nproblem. As engineers, we apply repeatable processes, establ ished knowledge, and expert judgment'
"""

texts[1]
"""
'things. As engineers practicing devops, we are all in this together, and nothing is someone else’s \nproblem. As engineers, we apply repeatable processes, establ ished knowledge, and expert judgment \nto design, build, and operate production data stores and the data structures within. As database \nreliability engineers, we must take the operational principles and the depth of database expertise \nthat we possess one ste p further.  \nIf you look at the non -storage components of today’s infrastructures, you will see sys‐ tems that are \neasily built, run, and destroyed via programmatic and often automatic means. The lifetimes of these \ncomponents can be measured in days, and sometimes even  hours or minutes. When one goes away, \nthere is any number of others to step in and keep the quality of service at expected levels.  \nOur next goal is that you gain a framework of principles and practices for the design, building, and'
"""

RecursiveCharacterTextSplitter

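The key difference from CharacterTextSplitter is that if a resulting chunk is still larger than the desired chunk_size, this splitter keeps splitting it, so that all final chunks fall within the specified size limit. It is parameterized by a list of characters, which are tried in order as separators.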

from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    # Set a really small chunk size, just to show.
    separators=["\n\n", "\n", " ", ""],
    chunk_size=50,
    chunk_overlap=40,
    length_function=len,
    is_separator_regex=False,
)
texts = text_splitter.split_text(pages[0].page_content)
print(len(texts))

texts[2]
"""
'book is to provide the guidance and framework for'
"""

texts[3]
"""
'provide the guidance and framework for you, the'
"""

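In the context of text splitting, "recursive" means that the splitter repeatedly applies its splitting logic to the chunks it produces until they meet some criterion, such as being shorter than a specified maximum length. This is especially useful for very long texts that need to be broken into smaller, more manageable pieces, possibly at different levels of granularity.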
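The recursion described above can be sketched in a few lines of plain Python. This is an illustrative simplification and not the library's code: recursive_split is a hypothetical name, separators are dropped rather than kept, small neighboring pieces are not merged back together, and there is no overlap.

```python
def recursive_split(text, separators, chunk_size):
    """Split text into pieces of at most chunk_size characters.

    Separators are tried in order; a piece that is still too long is
    split again with the remaining, finer-grained separators.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # No usable separator left: fall back to fixed-width cuts.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    first, rest = separators[0], separators[1:]
    chunks = []
    for piece in text.split(first):
        chunks.extend(recursive_split(piece, rest, chunk_size))
    return chunks


sample = "alpha beta\n\ngamma delta epsilon"
for chunk in recursive_split(sample, ["\n\n", " ", ""], chunk_size=12):
    print(repr(chunk))
```

The first paragraph fits within 12 characters and is kept whole, while the second is too long and falls through to the space separator.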

Split By Tokens

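Original text: "The quick brown fox jumps over the lazy dog."

Tokens: ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]

In this example, the text is split into tokens on whitespace and punctuation, and each word becomes a separate token. In practice, tokenization can be more complex, especially for languages with different writing systems or for special cases (for example, "don't" may be split into "do" and "n't").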
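To make the idea concrete, here is a minimal sketch of chunking by token count with overlap. It assumes a toy regex tokenizer (words and punctuation marks) as a stand-in. Real tokenizers such as tiktoken use subword vocabularies, so their token boundaries differ. The names tokenize and chunk_tokens are hypothetical.

```python
import re


def tokenize(text):
    """Toy tokenizer: each word or punctuation mark is one token."""
    return re.findall(r"\w+|[^\w\s]", text)


def chunk_tokens(tokens, chunk_size, chunk_overlap):
    """Group tokens into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end
    return chunks


tokens = tokenize("The quick brown fox jumps over the lazy dog.")
for chunk in chunk_tokens(tokens, chunk_size=4, chunk_overlap=1):
    print(" ".join(chunk))
```

Each window starts on the last token of the previous one, which is the role chunk_overlap plays when sizes are measured in tokens instead of characters.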

Several tokenizers are available.

TokenTextSplitter uses the tiktoken library for tokenization.

from langchain_text_splitters import TokenTextSplitter

text_splitter = TokenTextSplitter(chunk_size=10, chunk_overlap=1)
texts = text_splitter.split_text(pages[0].page_content)

texts[0]
"""
'Our goal with this book is to provide the guidance'
"""

texts[1]
"""
' guidance and framework for you, the reader, to'
"""

SpacyTextSplitter uses the spaCy library.

from langchain_text_splitters import SpacyTextSplitter

text_splitter = SpacyTextSplitter(chunk_size=1000)
texts = text_splitter.split_text(pages[0].page_content)

NLTKTextSplitter uses the NLTK library.

from langchain_text_splitters import NLTKTextSplitter

text_splitter = NLTKTextSplitter(chunk_size=1000)
texts = text_splitter.split_text(pages[0].page_content)

We can even use a Hugging Face tokenizer.

from langchain_text_splitters import CharacterTextSplitter
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
text_splitter = CharacterTextSplitter.from_huggingface_tokenizer(
    tokenizer, chunk_size=100, chunk_overlap=10
)
texts = text_splitter.split_text(pages[0].page_content)

HTMLHeaderTextSplitter

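HTMLHeaderTextSplitter is a structure-aware chunker for web pages. It splits text at the HTML element level and attaches to each chunk the metadata of the headers that apply to it. It can return chunks element by element or combine elements that share the same metadata, which keeps related text grouped semantically and preserves the context expressed by the document structure. This splitter can be combined with other text splitters in a chunking pipeline.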

from langchain_text_splitters import HTMLHeaderTextSplitter

html_string = """
<!DOCTYPE html>
<html>
<body>
    <div>
        <h1>Foo</h1>
        <p>Some intro text about Foo.</p>
        <div>
            <h2>Bar main section</h2>
            <p>Some intro text about Bar.</p>
            <h3>Bar subsection 1</h3>
            <p>Some text about the first subtopic of Bar.</p>
            <h3>Bar subsection 2</h3>
            <p>Some text about the second subtopic of Bar.</p>
        </div>
        <div>
            <h2>Baz</h2>
            <p>Some text about Baz</p>
        </div>
        <br>
        <p>Some concluding text about Foo</p>
    </div>
</body>
</html>
"""

headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
]

html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
html_header_splits = html_splitter.split_text(html_string)
html_header_splits
"""
[Document(page_content='Foo'),
 Document(page_content='Some intro text about Foo.  \nBar main section Bar subsection 1 Bar subsection 2', metadata={'Header 1': 'Foo'}),
 Document(page_content='Some intro text about Bar.', metadata={'Header 1': 'Foo', 'Header 2': 'Bar main section'}),
 Document(page_content='Some text about the first subtopic of Bar.', metadata={'Header 1': 'Foo', 'Header 2': 'Bar main section', 'Header 3': 'Bar subsection 1'}),
 Document(page_content='Some text about the second subtopic of Bar.', metadata={'Header 1': 'Foo', 'Header 2': 'Bar main section', 'Header 3': 'Bar subsection 2'}),
 Document(page_content='Baz', metadata={'Header 1': 'Foo'}),
 Document(page_content='Some text about Baz', metadata={'Header 1': 'Foo', 'Header 2': 'Baz'}),
 Document(page_content='Some concluding text about Foo', metadata={'Header 1': 'Foo'})]
"""

MarkdownHeaderTextSplitter

Similar to HTMLHeaderTextSplitter, but dedicated to Markdown files.

from langchain_text_splitters import MarkdownHeaderTextSplitter

markdown_document = "# Foo\n\n    ## Bar\n\nHi this is Jim\n\nHi this is Joe\n\n ### Boo \n\n Hi this is Lance \n\n ## Baz\n\n Hi this is Molly"

headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
md_header_splits = markdown_splitter.split_text(markdown_document)
md_header_splits
"""
[Document(page_content='Hi this is Jim  \nHi this is Joe', metadata={'Header 1': 'Foo', 'Header 2': 'Bar'}),
 Document(page_content='Hi this is Lance', metadata={'Header 1': 'Foo', 'Header 2': 'Bar', 'Header 3': 'Boo'}),
 Document(page_content='Hi this is Molly', metadata={'Header 1': 'Foo', 'Header 2': 'Baz'})]
"""
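The way header metadata follows each chunk can be illustrated with a small pure-Python sketch. It is a simplified approximation under stated assumptions, not the library's implementation: split_markdown_by_headers is a hypothetical name, chunks are plain dicts rather than Document objects, and code fences or other Markdown edge cases are ignored.

```python
def split_markdown_by_headers(text, headers_to_split_on):
    """Group Markdown body lines under the headers that are active for them."""
    # Check longer markers first so "###" is never mistaken for "#".
    markers = sorted(headers_to_split_on, key=lambda pair: len(pair[0]), reverse=True)
    active = {}   # header level -> (metadata key, header title)
    buffer = []   # body lines collected since the last header
    chunks = []

    def flush():
        content = "\n".join(buffer).strip()
        if content:
            metadata = {name: title for _, (name, title) in sorted(active.items())}
            chunks.append({"content": content, "metadata": metadata})
        buffer.clear()

    for raw_line in text.splitlines():
        line = raw_line.strip()
        for marker, name in markers:
            if line.startswith(marker + " "):
                flush()  # close the chunk that belonged to the old headers
                level = len(marker)
                # A new header replaces headers at the same or deeper level.
                for old_level in [lvl for lvl in active if lvl >= level]:
                    del active[old_level]
                active[level] = (name, line[len(marker):].strip())
                break
        else:
            if line:
                buffer.append(line)
    flush()
    return chunks


doc = "# Foo\n\n## Bar\n\nHi this is Jim\n\n### Boo\n\nHi this is Lance\n\n## Baz\n\nHi this is Molly"
headers = [("#", "Header 1"), ("##", "Header 2"), ("###", "Header 3")]
for chunk in split_markdown_by_headers(doc, headers):
    print(chunk)
```

When "## Baz" appears, the sketch discards both "Bar" and the deeper "Boo", which is why the last chunk carries only Header 1 and Header 2.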

RecursiveJsonSplitter

import requests

# This is a large nested json object and will be loaded as a python dict
json_data = requests.get("https://api.smith.langchain.com/openapi.json").json()

from langchain_text_splitters import RecursiveJsonSplitter

splitter = RecursiveJsonSplitter(max_chunk_size=300)

# Recursively split json data - If you need to access/manipulate the smaller json chunks
json_chunks = splitter.split_json(json_data=json_data)
json_chunks
"""
[{'openapi': '3.0.2','info': {'title': 'LangSmith', 'version': '0.1.0'},'paths': {'/api/v1/sessions/{session_id}': {'get': {'tags': ['tracer-sessions'],'summary': 'Read Tracer Session','description': 'Get a specific session.'}}}},
 {'paths': {'/api/v1/sessions/{session_id}': {'get': {'operationId': 'read_tracer_session_api_v1_sessions__session_id__get'}}}},
 {'paths': {'/api/v1/sessions/{session_id}': {'get': {'parameters': [{'required': True,'schema': {'title': 'Session Id', 'type': 'string', 'format': 'uuid'},'name': 'session_id','in': 'path'},{'required': False,'schema': {'title': 'Include Stats','type': 'boolean','default': False},'name': 'include_stats','in': 'query'},{'required': False,'schema': {'title': 'Accept', 'type': 'string'},'name': 'accept','in': 'header'}]}}}},
 {'paths': {'/api/v1/sessions/{session_id}': {'get': {'responses': {'200': {'description': 'Successful Response','content': {'application/json': {'schema': {'$ref': '#/components/schemas/TracerSession'}}}}}}}}},
 {'paths': {'/api/v1/sessions/{session_id}': {'get': {'responses': {'422': {'description': 'Validation Error','content': {'application/json': {'schema': {'$ref': '#/components/schemas/HTTPValidationError'}}}}},'security': [{'API Key': []}, {'Tenant ID': []}, {'Bearer Auth': []}]}}}},
 ...
 {'components': {'securitySchemes': {'API Key': {'type': 'apiKey','in': 'header','name': 'X-API-Key'},'Tenant ID': {'type': 'apiKey', 'in': 'header', 'name': 'X-Tenant-Id'},'Bearer Auth': {'type': 'http', 'scheme': 'bearer'}}}}]
"""
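In the output above, every chunk repeats the full key path ('paths', then the endpoint, then 'get') down to the part it contains. The sketch below shows one way such path-preserving splitting could work. It is an approximation under stated assumptions, not the library's code: split_json_sketch and its helpers are hypothetical names, size is measured on the serialized JSON string, and a single oversized leaf value is kept whole.

```python
import json


def json_size(data):
    """Length of the serialized JSON, used as the chunk size measure."""
    return len(json.dumps(data))


def set_nested(target, path, value):
    """Store value in target under the given key path, creating dicts on the way."""
    for key in path[:-1]:
        target = target.setdefault(key, {})
    target[path[-1]] = value


def split_json_sketch(data, max_chunk_size, path=(), chunks=None):
    """Recursively distribute a nested dict over chunks, keeping key paths."""
    chunks = [{}] if chunks is None else chunks
    for key, value in data.items():
        new_path = path + (key,)
        remaining = max_chunk_size - json_size(chunks[-1])
        if json_size({key: value}) <= remaining:
            set_nested(chunks[-1], new_path, value)
        else:
            if chunks[-1]:
                chunks.append({})  # start a fresh chunk
            if isinstance(value, dict) and value:
                # Too big as a whole: descend and place its children.
                split_json_sketch(value, max_chunk_size, new_path, chunks)
            else:
                # A leaf cannot be divided further; keep it whole.
                set_nested(chunks[-1], new_path, value)
    return chunks


data = {"a": {"b": "x" * 10, "c": "y" * 10}, "d": "z" * 10}
for chunk in split_json_sketch(data, max_chunk_size=30):
    print(chunk)
```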

Split Code

In LangChain, "Split Code" refers to dividing source code into smaller, more manageable segments or chunks.

from langchain_text_splitters import Language

[e.value for e in Language]
"""
['cpp','go','java','kotlin','js','ts','php','proto','python','rst','ruby','rust','scala','swift','markdown','latex','html','sol','csharp','cobol','c','lua','perl']
"""

from langchain_text_splitters import (
    Language,
    RecursiveCharacterTextSplitter,
)

PYTHON_CODE = """
def hello_world():
    print("Hello, World!")

# Call the function
hello_world()
"""
python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON, chunk_size=50, chunk_overlap=0
)
python_docs = python_splitter.create_documents([PYTHON_CODE])
python_docs
"""
[Document(page_content='def hello_world():\n    print("Hello, World!")'),
 Document(page_content='# Call the function\nhello_world()')]
"""

JS_CODE = """
function helloWorld() {
  console.log("Hello, World!");
}

// Call the function
helloWorld();
"""

js_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.JS, chunk_size=60, chunk_overlap=0
)
js_docs = js_splitter.create_documents([JS_CODE])
js_docs
"""
[Document(page_content='function helloWorld() {\n  console.log("Hello, World!");\n}'),
 Document(page_content='// Call the function\nhelloWorld();')]
"""
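The outputs above show the code being cut at syntactic boundaries rather than at arbitrary character positions. As a rough sketch of that idea, the hypothetical helper below cuts Python source before each top-level def or class line. It ignores chunk_size entirely and assumes nothing about how from_language chooses its separators.

```python
import re


def split_python_source(code):
    """Cut Python source before every top-level def or class line."""
    # The lookahead keeps the "def"/"class" keyword with the part it starts.
    parts = re.split(r"\n(?=(?:def|class) )", code.strip("\n"))
    return [part.strip("\n") for part in parts if part.strip()]


source = """
import os

def greet(name):
    return "Hello, " + name

class Greeter:
    pass
"""

for part in split_python_source(source):
    print(repr(part))
```

Indented methods are left inside their class, because the pattern only matches def or class at the start of a line.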
