LlamaIndex 组件

本文主要是介绍LlamaIndex 组件 - Storing，希望对大家解决编程问题提供一定的参考价值，需要的开发者们随着小编来一起学习吧！

文章目录

- 一、储存概览
- - 1、概念
  - 2、使用模式
  - 3、模块
- 二、Vector Stores
- - 1、简单向量存储
  - 2、矢量存储选项和功能支持
  - 3、Example Notebooks
- 三、文件存储
- - 1、简单文档存储
  - 2、MongoDB 文档存储
  - 3、Redis 文档存储
  - 4、Firestore 文档存储
- 四、索引存储
- - 1、简单索引存储
  - 2、MongoDB 索引存储
  - 3、Redis索引存储
- 五、Chat Stores
- - 1、简单聊天商店
  - 2、Redis聊天商店
- 六、键值存储
- 七、保存和加载数据
- - 1、持久化数据
  - 2、加载数据中
  - 3、使用远程后端
- 八、自定义存储
- - 低级API
  - 矢量存储集成和存储

本文转载改编自： https://docs.llamaindex.ai/en/stable/module_guides/storing/

一、储存概览

1、概念

LlamaIndex 提供了一个用于摄取、索引和查询外部数据的高级接口。

在底层，LlamaIndex 还支持可交换存储组件，允许您自定义：

文档存储Node：存储摄取的文档（即对象）的地方，
索引存储：存储索引元数据的位置，
向量存储：存储嵌入向量的位置。
图存储：存储知识图的位置（即 for KnowledgeGraphIndex）。
聊天存储：存储和组织聊天消息的地方。

文档/索引存储依赖于通用的键值存储抽象，这也在下面详细介绍。

LlamaIndex 支持将数据持久保存到fsspec支持的任何存储后端。我们已确认支持以下存储后端：

本地文件系统
AWS S3
Cloudflare R2

在这里插入图片描述

2、使用模式

许多向量存储（FAISS 除外）将存储数据和索引（嵌入）。

这意味着您不需要使用单独的文档存储或索引存储。

这也意味着您不需要显式地保留这些数据——这会自动发生。用法如下所示来构建新索引/重新加载现有索引。

## build a new index
from llama_index.core import VectorStoreIndex, StorageContext
from llama_index.vector_stores.deeplake import DeepLakeVectorStore# construct vector store and customize storage context
vector_store = DeepLakeVectorStore(dataset_path="<dataset_path>")
storage_context = StorageContext.from_defaults(vector_store=vector_store)
# Load documents and build index
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context
)***
## reload an existing one
index = VectorStoreIndex.from_vector_store(vector_store=vector_store)

有关更多详细信息，请参阅下面的矢量存储模块指南。

请注意，通常要使用存储抽象，您需要定义一个StorageContext对象：

from llama_index.core.storage.docstore import SimpleDocumentStore
from llama_index.core.storage.index_store import SimpleIndexStore
from llama_index.core.vector_stores import SimpleVectorStore
from llama_index.core import StorageContext# create storage context using default stores
storage_context = StorageContext.from_defaults(docstore=SimpleDocumentStore(),vector_store=SimpleVectorStore(),index_store=SimpleIndexStore(),
)

有关自定义/持久性的更多详细信息可以在下面的指南中找到。

定制化
保存/加载

3、模块

我们提供有关不同存储组件的深入指南。

Vector Stores
Docstores
Index Stores
Key-Val Stores
Graph Stores
ChatStores

二、Vector Stores

向量存储包含所摄取文档块（有时也包含文档块）的嵌入向量。

1、简单向量存储

默认情况下，LlamaIndex 使用简单的内存向量存储，非常适合快速实验。它们可以通过调用vector_store.persist()（分别）保存到磁盘（并从磁盘加载）SimpleVectorStore.from_persist_path(...)。

2、矢量存储选项和功能支持

LlamaIndex 支持 20 多种不同的矢量存储选项。我们正在积极添加更多集成并提高每个集成的功能覆盖范围。

Vector Store	Type	Metadata Filtering	Hybrid Search	Delete	Store Documents	Async
Apache Cassandra®	self-hosted / cloud	✓		✓	✓
Astra DB	cloud	✓		✓	✓
Azure AI Search	cloud	✓	✓	✓	✓
Azure CosmosDB MongoDB	cloud			✓	✓
BaiduVectorDB	cloud	✓	✓		✓
ChatGPT Retrieval Plugin	aggregator			✓	✓
Chroma	self-hosted	✓		✓	✓
DashVector	cloud	✓	✓	✓	✓
Databricks	cloud	✓		✓	✓
Deeplake	self-hosted / cloud	✓		✓	✓
DocArray	aggregator	✓		✓	✓
DuckDB	in-memory / self-hosted	✓		✓	✓
DynamoDB	cloud			✓
Elasticsearch	self-hosted / cloud	✓	✓	✓	✓	✓
FAISS	in-memory
txtai	in-memory
Jaguar	self-hosted / cloud	✓	✓	✓	✓
LanceDB	cloud	✓		✓	✓
Lantern	self-hosted / cloud	✓	✓	✓	✓	✓
Metal	cloud	✓		✓	✓
MongoDB Atlas	self-hosted / cloud	✓		✓	✓
MyScale	cloud	✓	✓	✓	✓
Milvus / Zilliz	self-hosted / cloud	✓		✓	✓
Neo4jVector	self-hosted / cloud			✓	✓
OpenSearch	self-hosted / cloud	✓	✓	✓	✓	✓
Pinecone	cloud	✓	✓	✓	✓
Postgres	self-hosted / cloud	✓	✓	✓	✓	✓
pgvecto.rs	self-hosted / cloud	✓	✓	✓	✓
Qdrant	self-hosted / cloud	✓	✓	✓	✓	✓
Redis	self-hosted / cloud	✓		✓	✓
Simple	in-memory	✓		✓
SingleStore	self-hosted / cloud	✓		✓	✓
Supabase	self-hosted / cloud	✓		✓	✓
Tair	cloud	✓		✓	✓
TiDB	cloud	✓		✓	✓
TencentVectorDB	cloud	✓	✓	✓	✓
Timescale		✓		✓	✓	✓
Typesense	self-hosted / cloud	✓		✓	✓
Upstash	cloud				✓
Weaviate	self-hosted / cloud	✓	✓	✓	✓

有关更多详细信息，请参阅矢量存储集成。

3、Example Notebooks

Astra DB
Async Index Creation
Azure AI Search
Azure Cosmos DB
Baidu
Caasandra
Chromadb
Dash
Databricks
Deeplake
DocArray HNSW
DocArray in-Memory
DuckDB
Espilla
Jaguar
LanceDB
Lantern
Metal
Milvus
MyScale
ElsaticSearch
FAISS
MongoDB Atlas
Neo4j
OpenSearch
Pinecone
Pinecone Hybrid Search
PGvectoRS
Postgres
Redis
Qdrant
Qdrant Hybrid Search
Rockset
Simple
Supabase
Tair
TiDB
Tencent
Timesacle
Upstash
Weaviate
Weaviate Hybrid Search
Zep

三、文件存储

文档存储包含摄取的文档块，我们将其称为Node对象。

有关更多详细信息，请参阅API 参考。

1、简单文档存储

默认情况下，将对象SimpleDocumentStore存储Node在内存中。它们可以通过调用docstore.persist()（分别）保存到磁盘（并从磁盘加载）SimpleDocumentStore.from_persist_path(...)。

可以在这里找到更完整的示例

2、MongoDB 文档存储

我们支持 MongoDB 作为替代文档存储后端，在Node摄取对象时保留数据。

from llama_index.storage.docstore.mongodb import MongoDocumentStore
from llama_index.core.node_parser import SentenceSplitter# create parser and parse document into nodes
parser = SentenceSplitter()
nodes = parser.get_nodes_from_documents(documents)# create (or load) docstore and add nodes
docstore = MongoDocumentStore.from_uri(uri="<mongodb+srv://...>")
docstore.add_documents(nodes)# create storage context
storage_context = StorageContext.from_defaults(docstore=docstore)# build index
index = VectorStoreIndex(nodes, storage_context=storage_context)

在后台，MongoDocumentStore连接到固定的 MongoDB 数据库并为节点初始化新集合（或加载现有集合）。

注意：您可以在实例化时配置db_name和，否则它们默认为和。namespace``MongoDocumentStore``db_name="db_docstore"``namespace="docstore"

请注意，使用 an 时无需调用storage_context.persist()(或) ，因为默认情况下数据是持久保存的。docstore.persist()``MongoDocumentStore

MongoDocumentStore您可以轻松地重新连接到 MongoDB 集合并通过使用现有的db_name和重新初始化来重新加载索引collection_name。

可以在这里找到更完整的示例

3、Redis 文档存储

我们支持 Redis 作为替代文档存储后端，在Node摄取对象时保留数据。

from llama_index.storage.docstore.redis import RedisDocumentStore
from llama_index.core.node_parser import SentenceSplitter# create parser and parse document into nodes
parser = SentenceSplitter()
nodes = parser.get_nodes_from_documents(documents)# create (or load) docstore and add nodes
docstore = RedisDocumentStore.from_host_and_port(host="127.0.0.1", port="6379", namespace="llama_index"
)
docstore.add_documents(nodes)# create storage context
storage_context = StorageContext.from_defaults(docstore=docstore)# build index
index = VectorStoreIndex(nodes, storage_context=storage_context)

在后台，RedisDocumentStore连接到 redis 数据库并将节点添加到存储在{namespace}/docs.

namespace注意：实例化时可以配置RedisDocumentStore，否则默认namespace="docstore"。

RedisDocumentStore您可以轻松地重新连接到 Redis 客户端并通过使用现有的host、port和重新初始化来重新加载索引namespace。

可以在这里找到更完整的示例

4、Firestore 文档存储

我们支持 Firestore 作为替代文档存储后端，在Node摄取对象时保留数据。

from llama_index.storage.docstore.firestore import FirestoreDocumentStore
from llama_index.core.node_parser import SentenceSplitter# create parser and parse document into nodes
parser = SentenceSplitter()
nodes = parser.get_nodes_from_documents(documents)# create (or load) docstore and add nodes
docstore = FirestoreDocumentStore.from_database(project="project-id",database="(default)",
)
docstore.add_documents(nodes)# create storage context
storage_context = StorageContext.from_defaults(docstore=docstore)# build index
index = VectorStoreIndex(nodes, storage_context=storage_context)

在后台，FirestoreDocumentStore连接到 Google Cloud 中的 firestore 数据库，并将您的节点添加到存储在{namespace}/docs.

namespace注意：实例化时可以配置FirestoreDocumentStore，否则默认namespace="docstore"。

FirestoreDocumentStore您可以轻松地重新连接到 Firestore 数据库并通过使用现有的project、database和重新初始化来重新加载索引namespace。

可以在这里找到更完整的示例

四、索引存储

索引存储包含轻量级索引元数据（即构建索引时创建的附加状态信息）。

有关更多详细信息，请参阅API 参考。

1、简单索引存储

默认情况下，LlamaIndex 使用由内存中键值存储支持的简单索引存储。它们可以通过调用index_store.persist()（分别）保存到磁盘（并从磁盘加载）SimpleIndexStore.from_persist_path(...)。

2、MongoDB 索引存储

与文档存储类似，我们也可以将其用作MongoDB索引存储的存储后端。

from llama_index.storage.index_store.mongodb import MongoIndexStore
from llama_index.core import VectorStoreIndex# create (or load) index store
index_store = MongoIndexStore.from_uri(uri="<mongodb+srv://...>")# create storage context
storage_context = StorageContext.from_defaults(index_store=index_store)# build index
index = VectorStoreIndex(nodes, storage_context=storage_context)# or alternatively, load index
from llama_index.core import load_index_from_storageindex = load_index_from_storage(storage_context)

在后台，MongoIndexStore连接到固定的 MongoDB 数据库并为索引元数据初始化新集合（或加载现有集合）。

注意：您可以在实例化时配置db_name和，否则它们默认为和。namespace``MongoIndexStore``db_name="db_docstore"``namespace="docstore"

请注意，使用 an 时无需调用storage_context.persist()(或) ，因为默认情况下数据是持久保存的。index_store.persist()``MongoIndexStore

MongoIndexStore您可以轻松地重新连接到 MongoDB 集合并通过使用现有的db_name和重新初始化来重新加载索引collection_name。

可以在这里找到更完整的示例

3、Redis索引存储

我们支持 Redis 作为替代文档存储后端，在Node摄取对象时保留数据。

from llama_index.storage.index_store.redis import RedisIndexStore
from llama_index.core import VectorStoreIndex# create (or load) docstore and add nodes
index_store = RedisIndexStore.from_host_and_port(host="127.0.0.1", port="6379", namespace="llama_index"
)# create storage context
storage_context = StorageContext.from_defaults(index_store=index_store)# build index
index = VectorStoreIndex(nodes, storage_context=storage_context)# or alternatively, load index
from llama_index.core import load_index_from_storageindex = load_index_from_storage(storage_context)

在后台，RedisIndexStore连接到 redis 数据库并将节点添加到存储在{namespace}/index.

namespace注意：实例化时可以配置RedisIndexStore，否则默认namespace="index_store"。

RedisIndexStore您可以轻松地重新连接到 Redis 客户端并通过使用现有的host、port和重新初始化来重新加载索引namespace。

可以在这里找到更完整的示例

五、Chat Stores

聊天存储充当存储您的聊天历史记录的集中式界面。与其他存储格式相比，聊天历史记录是独一无二的，因为消息的顺序对于维持整体对话非常重要。

user_ids聊天存储可以通过键（如或其他唯一可识别字符串）组织聊天消息序列，并处理delete、insert、和get操作。

1、简单聊天商店

最基本的聊天存储是SimpleChatStore，它将消息存储在内存中，并且可以保存到磁盘或从磁盘保存，也可以序列化并存储在其他地方。

通常，您将实例化一个聊天存储并将其提供给内存模块。SimpleChatStore如果未提供，则使用聊天存储的内存模块将默认使用。

from llama_index.core.storage.chat_store import SimpleChatStore
from llama_index.core.memory import ChatMemoryBufferchat_store = SimpleChatStore()chat_memory = ChatMemoryBuffer.from_defaults(token_limit=3000,chat_store=chat_store,chat_store_key="user1",
)

创建内存后，您可以将其包含在代理或聊天引擎中：

agent = OpenAIAgent.from_tools(tools, memory=memory)
# OR
chat_engine = index.as_chat_engine(memory=memory)

要保存聊天存储供以后使用，您可以从磁盘保存/加载

chat_store.persist(persist_path="chat_store.json")
loaded_chat_store = SimpleChatStore.from_persist_path(persist_path="chat_store.json"
)

或者您可以与字符串进行转换，同时将字符串保存在其他位置

chat_store_string = chat_store.json()
loaded_chat_store = SimpleChatStore.parse_raw(chat_store_string)

2、Redis聊天商店

使用RedisChatStore，您可以远程存储聊天记录，而不必担心手动保存和加载聊天记录。

from llama_index.storage.chat_store.redis import RedisChatStore
from llama_index.core.memory import ChatMemoryBufferchat_store = RedisChatStore(redis_url="redis://localhost:6379", ttl=300)chat_memory = ChatMemoryBuffer.from_defaults(token_limit=3000,chat_store=chat_store,chat_store_key="user1",
)

六、键值存储

键值存储是为我们的文档存储和索引存储提供支持的底层存储抽象。

我们提供以下键值存储：

简单键值存储：内存中的 KV 存储。用户可以选择调用persist这个kv存储来将数据保存到磁盘。
MongoDB 键值存储：MongoDB KV 存储。

有关更多详细信息，请参阅API 参考。

注意：目前，这些存储抽象不面向外部。

七、保存和加载数据

1、持久化数据

默认情况下，LlamaIndex 将数据存储在内存中，如果需要，可以显式保留该数据：

storage_context.persist(persist_dir="<persist_dir>")

这将根据指定persist_dir（或./storage默认）将数据保存到磁盘。

假设您跟踪要加载的索引 ID，则可以保留多个索引并从同一目录加载多个索引。

用户还可以配置MongoDB默认保存数据的替代存储后端（例如）。在这种情况下，调用storage_context.persist()将不会执行任何操作。

2、加载数据中

要加载数据，用户只需使用相同的配置重新创建存储上下文（例如传入相同的persist_dir或向量存储客户端）。

storage_context = StorageContext.from_defaults(docstore=SimpleDocumentStore.from_persist_dir(persist_dir="<persist_dir>"),vector_store=SimpleVectorStore.from_persist_dir(persist_dir="<persist_dir>"),index_store=SimpleIndexStore.from_persist_dir(persist_dir="<persist_dir>"),
)

StorageContext然后我们可以通过下面的一些便利函数加载特定的索引。

from llama_index.core import (load_index_from_storage,load_indices_from_storage,load_graph_from_storage,
)# load a single index
# need to specify index_id if multiple indexes are persisted to the same directory
index = load_index_from_storage(storage_context, index_id="<index_id>")# don't need to specify index_id if there's only one index in storage context
index = load_index_from_storage(storage_context)# load multiple indices
indices = load_indices_from_storage(storage_context)  # loads all indices
indices = load_indices_from_storage(storage_context, index_ids=[index_id1, ...]
)  # loads specific indices# load composable graph
graph = load_graph_from_storage(storage_context, root_id="<root_id>"
)  # loads graph with the specified root_id

3、使用远程后端

默认情况下，LlamaIndex 使用本地文件系统来加载和保存文件。但是，您可以通过传递对象来覆盖它fsspec.AbstractFileSystem。

这是一个简单的例子，实例化一个向量存储：

import dotenv
import s3fs
import osdotenv.load_dotenv("../../../.env")# load documents
documents = SimpleDirectoryReader("../../../examples/paul_graham_essay/data/"
).load_data()
print(len(documents))
index = VectorStoreIndex.from_documents(documents)

至此，一切都已经是一样了。现在 - 让我们实例化一个 S3 文件系统并从那里保存/加载。

# set up s3fs
AWS_KEY = os.environ["AWS_ACCESS_KEY_ID"]
AWS_SECRET = os.environ["AWS_SECRET_ACCESS_KEY"]
R2_ACCOUNT_ID = os.environ["R2_ACCOUNT_ID"]assert AWS_KEY is not None and AWS_KEY != ""s3 = s3fs.S3FileSystem(key=AWS_KEY,secret=AWS_SECRET,endpoint_url=f"https://{R2_ACCOUNT_ID}.r2.cloudflarestorage.com",s3_additional_kwargs={"ACL": "public-read"},
)# If you're using 2+ indexes with the same StorageContext,
# run this to save the index to remote blob storage
index.set_index_id("vector_index")# persist index to s3
s3_bucket_name = "llama-index/storage_demo"  # {bucket_name}/{index_name}
index.storage_context.persist(persist_dir=s3_bucket_name, fs=s3)# load index from s3
index_from_s3 = load_index_from_storage(StorageContext.from_defaults(persist_dir=s3_bucket_name, fs=s3),index_id="vector_index",
)

默认情况下，如果您不传递文件系统，我们将假定本地文件系统。

八、自定义存储

默认情况下，LlamaIndex 隐藏了复杂性，让您可以用不到 5 行代码查询数据：

from llama_index.core import VectorStoreIndex, SimpleDirectoryReaderdocuments = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()
response = query_engine.query("Summarize the documents.")

在底层，LlamaIndex 还支持可交换的存储层，允许您自定义提取的文档（即Node对象）、嵌入向量和索引元数据的存储位置。

在这里插入图片描述

低级API

为此，无需使用高级 API，

index = VectorStoreIndex.from_documents(documents)

我们使用较低级别的 API 来提供更精细的控制：

from llama_index.core.storage.docstore import SimpleDocumentStore
from llama_index.core.storage.index_store import SimpleIndexStore
from llama_index.core.vector_stores import SimpleVectorStore
from llama_index.core.node_parser import SentenceSplitter# create parser and parse document into nodes
parser = SentenceSplitter()
nodes = parser.get_nodes_from_documents(documents)# create storage context using default stores
storage_context = StorageContext.from_defaults(docstore=SimpleDocumentStore(),vector_store=SimpleVectorStore(),index_store=SimpleIndexStore(),
)# create (or load) docstore and add nodes
storage_context.docstore.add_documents(nodes)# build index
index = VectorStoreIndex(nodes, storage_context=storage_context)# save index
index.storage_context.persist(persist_dir="<persist_dir>")# can also set index_id to save multiple indexes to the same folder
index.set_index_id("<index_id>")
index.storage_context.persist(persist_dir="<persist_dir>")# to load index later, make sure you setup the storage context
# this will loaded the persisted stores from persist_dir
storage_context = StorageContext.from_defaults(persist_dir="<persist_dir>")# then load the index object
from llama_index.core import load_index_from_storageloaded_index = load_index_from_storage(storage_context)# if loading an index from a persist_dir containing multiple indexes
loaded_index = load_index_from_storage(storage_context, index_id="<index_id>")# if loading multiple indexes from a persist dir
loaded_indicies = load_index_from_storage(storage_context, index_ids=["<index_id>", ...]
)

您可以通过一行更改来自定义底层存储，以实例化不同的文档存储、索引存储和向量存储。有关更多详细信息，请参阅文档存储、向量存储、索引存储指南。

矢量存储集成和存储

我们的大多数矢量存储集成将整个索引（矢量+文本）存储在矢量存储本身中。这样做的主要好处是不必显式地持久化索引，如上所示，因为矢量存储已经托管并将数据持久化在我们的索引中。

支持这种做法的矢量存储是：

AzureAISearchVectorStore
ChatGPTRetrievalPluginClient
CassandraVectorStore
ChromaVectorStore
EpsillaVectorStore
DocArrayHnswVectorStore
DocArrayInMemoryVectorStore
JaguarVectorStore
LanceDBVectorStore
MetalVectorStore
MilvusVectorStore
MyScaleVectorStore
OpensearchVectorStore
PineconeVectorStore
QdrantVectorStore
RedisVectorStore
UpstashVectorStore
WeaviateVectorStore

下面是一个使用 Pinecone 的小例子：

import pinecone
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.vector_stores.pinecone import PineconeVectorStore# Creating a Pinecone index
api_key = "api_key"
pinecone.init(api_key=api_key, environment="us-west1-gcp")
pinecone.create_index("quickstart", dimension=1536, metric="euclidean", pod_type="p1"
)
index = pinecone.Index("quickstart")# construct vector store
vector_store = PineconeVectorStore(pinecone_index=index)# create storage context
storage_context = StorageContext.from_defaults(vector_store=vector_store)# load documents
documents = SimpleDirectoryReader("./data").load_data()# create index, which will insert documents/vectors to pinecone
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context
)

如果您有一个已加载数据的现有矢量存储，您可以连接到它并直接创建一个VectorStoreIndex，如下所示：

index = pinecone.Index("quickstart")
vector_store = PineconeVectorStore(pinecone_index=index)
loaded_index = VectorStoreIndex.from_vector_store(vector_store=vector_store)