
Typically, the knowledge base behind an LLM-powered retrieval application contains large amounts of data in many formats. To give the LLM the most relevant context for answering questions about a specific part of the knowledge base, we rely on chunking the text in the knowledge base and storing the chunks where they can be retrieved efficiently.

Chunking

Chunking is the process of splitting text into meaningful units to improve information retrieval. By ensuring that each chunk represents a focused idea or point, chunking helps preserve the contextual integrity of the content.

In this article, we will look at three aspects of chunking:

· How poor chunking leads to less relevant results

· How good chunking leads to better results

· How good chunking combined with metadata leads to well-contextualized results

To demonstrate the importance of chunking effectively, we will take the same piece of text, apply three different chunking approaches to it, and examine how information is retrieved for a given query.

Chunking and Storing in Qdrant

Let's look at the code below, which shows three different ways of chunking the same text.

Python

import qdrant_client
from qdrant_client.models import PointStruct, Distance, VectorParams
import openai
import yaml

# Load configuration
with open('config.yaml', 'r') as file:
    config = yaml.safe_load(file)

# Initialize Qdrant client
client = qdrant_client.QdrantClient(config['qdrant']['url'], api_key=config['qdrant']['api_key'])

# Initialize OpenAI with the API key
openai.api_key = config['openai']['api_key']

def embed_text(text):
    print(f"Generating embedding for: '{text[:50]}'...")  # Show a snippet of the text being embedded
    response = openai.embeddings.create(
        input=[text],  # Input needs to be a list
        model=config['openai']['model_name']
    )
    embedding = response.data[0].embedding  # Access via attribute, not as a dictionary
    print(f"Generated embedding of length {len(embedding)}.")  # Confirm embedding generation
    return embedding

# Create a collection if it doesn't exist
def create_collection_if_not_exists(collection_name, vector_size):
    collections = client.get_collections().collections
    if collection_name not in [collection.name for collection in collections]:
        client.create_collection(
            collection_name=collection_name,
            vectors_config=VectorParams(size=vector_size, distance=Distance.COSINE)
        )
        print(f"Created collection: {collection_name} with vector size: {vector_size}")
    else:
        print(f"Collection {collection_name} already exists.")

# Sample text to be chunked (used purely for illustration)
text = """
Artificial intelligence is transforming industries across the globe. One of the key areas where AI is making a significant impact is healthcare. AI is being used to develop new drugs, personalize treatment plans, and even predict patient outcomes. Despite these advancements, there are challenges that must be addressed. The ethical implications of AI in healthcare, data privacy concerns, and the need for proper regulation are all critical issues. As AI continues to evolve, it is crucial that these challenges are not overlooked. By addressing these issues head-on, we can ensure that AI is used in a way that benefits everyone.
"""

# Poor chunking strategy: fixed 40-character chunks
def poor_chunking(text, chunk_size=40):
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    print(f"Poor Chunking produced {len(chunks)} chunks: {chunks}")
    return chunks

# Good chunking strategy: sentence-level chunks
def good_chunking(text):
    import re
    sentences = re.split(r'(?<=[.!?]) +', text)
    print(f"Good Chunking produced {len(sentences)} chunks: {sentences}")
    return sentences

# Good chunking with metadata
def good_chunking_with_metadata(text):
    chunks = good_chunking(text)
    metadata_chunks = []
    for chunk in chunks:
        if "healthcare" in chunk:
            metadata_chunks.append({"text": chunk, "source": "Healthcare Section", "topic": "AI in Healthcare"})
        elif "ethical implications" in chunk or "data privacy" in chunk:
            metadata_chunks.append({"text": chunk, "source": "Challenges Section", "topic": "AI Challenges"})
        else:
            metadata_chunks.append({"text": chunk, "source": "General", "topic": "AI Overview"})
    print(f"Good Chunking with Metadata produced {len(metadata_chunks)} chunks: {metadata_chunks}")
    return metadata_chunks

# Store chunks in Qdrant
def store_chunks(chunks, collection_name):
    if len(chunks) == 0:
        print(f"No chunks were generated for the collection '{collection_name}'.")
        return

    # Embed the first chunk to determine the vector size
    sample_text = chunks[0] if isinstance(chunks[0], str) else chunks[0]["text"]
    sample_embedding = embed_text(sample_text)
    vector_size = len(sample_embedding)
    create_collection_if_not_exists(collection_name, vector_size)

    for idx, chunk in enumerate(chunks):
        chunk_text = chunk if isinstance(chunk, str) else chunk["text"]  # Avoid shadowing the global 'text'
        embedding = embed_text(chunk_text)
        payload = chunk if isinstance(chunk, dict) else {"text": chunk_text}  # Always keep the text in the payload
        client.upsert(collection_name=collection_name, points=[
            PointStruct(id=idx, vector=embedding, payload=payload)
        ])
    print(f"Chunks successfully stored in the collection '{collection_name}'.")

# Execute chunking and storing for each strategy
print("Starting poor_chunking...")
store_chunks(poor_chunking(text), "poor_chunking")
print("Starting good_chunking...")
store_chunks(good_chunking(text), "good_chunking")
print("Starting good_chunking_with_metadata...")
store_chunks(good_chunking_with_metadata(text), "good_chunking_with_metadata")

The code above does the following:

· The embed_text function takes a piece of text, generates an embedding using the OpenAI embedding model, and returns the generated embedding.

· Initializes the text string used for chunking and later content retrieval

· Poor chunking strategy: splits the text into fixed 40-character chunks

· Good chunking strategy: splits the text into sentences for more meaningful context

· Good chunking with metadata: adds appropriate metadata to the sentence-level chunks

· Once embeddings have been generated for the chunks, they are stored in their respective collections in Qdrant Cloud.

Keep in mind that the poor chunks are created only to demonstrate how bad chunking affects retrieval.
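The contrast between the two splitting strategies is easy to reproduce on its own. Here is a minimal, self-contained sketch (using a made-up two-sentence sample rather than the article's text) that applies the same fixed-size and sentence-splitting logic as poor_chunking and good_chunking:

```python
import re

sample = "AI is transforming healthcare. Ethics and data privacy remain critical issues."

# Fixed-size 40-character chunks cut words and sentences at arbitrary points
poor = [sample[i:i + 40] for i in range(0, len(sample), 40)]

# Sentence-level chunks keep each complete idea together
good = re.split(r'(?<=[.!?]) +', sample)

print(poor)  # first chunk ends mid-word: '...Ethics an'
print(good)  # ['AI is transforming healthcare.', 'Ethics and data privacy remain critical issues.']
```

The lookbehind `(?<=[.!?]) +` splits on the whitespace after sentence-ending punctuation, so the punctuation stays attached to its sentence.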

Below is a screenshot of the chunks in Qdrant Cloud, where you can see the metadata added to the sentence-level chunks to indicate their source and topic.

Retrieval Results Based on Chunking Strategy

Now let's write some code to retrieve content from the Qdrant vector database based on a query.

Python

import qdrant_client
import openai
import yaml

# Load configuration
with open('config.yaml', 'r') as file:
    config = yaml.safe_load(file)

# Initialize Qdrant client
client = qdrant_client.QdrantClient(config['qdrant']['url'], api_key=config['qdrant']['api_key'])

# Initialize OpenAI with the API key
openai.api_key = config['openai']['api_key']

def embed_text(text):
    response = openai.embeddings.create(
        input=[text],  # Input needs to be a list
        model=config['openai']['model_name']
    )
    return response.data[0].embedding

# Search a collection and print the top 3 chunks closest to the query embedding
def retrieve_and_print(collection_name, query_embedding, query):
    print(f"\nResults from '{collection_name}' collection for the query: '{query}':")
    results = client.search(
        collection_name=collection_name,
        query_vector=query_embedding,
        limit=3
    )
    for i, result in enumerate(results, start=1):
        print(f"\nResult {i}:")
        print(f"Text: {result.payload.get('text', 'N/A')}")
        print(f"Source: {result.payload.get('source', 'N/A')}")
        print(f"Topic: {result.payload.get('topic', 'N/A')}")

# Define the query and generate its embedding
query = "ethical implications of AI in healthcare"
query_embedding = embed_text(query)

# Run the same query against each collection
for collection_name in ["poor_chunking", "good_chunking", "good_chunking_with_metadata"]:
    retrieve_and_print(collection_name, query_embedding, query)

The code above does the following:

· Defines the query and generates an embedding for it

· The search query is set to "ethical implications of AI in healthcare".

· The retrieve_and_print function searches a given Qdrant collection and retrieves the top 3 vectors closest to the query embedding.

Now let's look at the output:

python retrieval_test.py

Results from 'poor_chunking' collection for the query: 'ethical implications of AI in healthcare':

Result 1:

Text: . The ethical implications of AI in heal

Source: N/A

Topic: N/A

Result 2:

Text: ant impact is healthcare. AI is being us

Source: N/A

Topic: N/A

Result 3:

Text:

Artificial intelligence is transforming

Source: N/A

Topic: N/A

Results from 'good_chunking' collection for the query: 'ethical implications of AI in healthcare':

Result 1:

Text: The ethical implications of AI in healthcare, data privacy concerns, and the need for proper regulation are all critical issues.

Source: N/A

Topic: N/A

Result 2:

Text: One of the key areas where AI is making a significant impact is healthcare.

Source: N/A

Topic: N/A

Result 3:

Text: By addressing these issues head-on, we can ensure that AI is used in a way that benefits everyone.

Source: N/A

Topic: N/A

Results from 'good_chunking_with_metadata' collection for the query: 'ethical implications of AI in healthcare':

Result 1:

Text: The ethical implications of AI in healthcare, data privacy concerns, and the need for proper regulation are all critical issues.

Source: Healthcare Section

Topic: AI in Healthcare

Result 2:

Text: One of the key areas where AI is making a significant impact is healthcare.

Source: Healthcare Section

Topic: AI in Healthcare

Result 3:

Text: By addressing these issues head-on, we can ensure that AI is used in a way that benefits everyone.

Source: General

Topic: AI Overview

The output for the same search query differs depending on the chunking strategy used.

· Poor chunking strategy: Notice that the results here are less relevant, because the text was split into arbitrary small chunks.

· Good chunking strategy: The results here are more relevant, because the text was split into sentences that preserve semantic meaning.

· Good chunking with metadata: The results here are the most useful, because the text was chunked thoughtfully and enriched with metadata.

Inferences from the Experiment

· Chunking needs a carefully designed strategy, and chunks should be neither too small nor too large.

· Poor chunking shows up as chunks that are too small and cut sentences off at unnatural points, or chunks so large that multiple topics land in the same chunk, making retrieval very noisy.

· The whole point of chunking is to provide better context to the LLM.

· Metadata greatly enhances well-structured chunks by adding another layer of context. For example, we added the source and topic as metadata elements to our chunks.

· Retrieval systems benefit from this extra information. For instance, if the metadata indicates that a chunk belongs to the "Healthcare Section", the system can prioritize that chunk for healthcare-related queries.

· With good chunking, results can be structured and categorized. If a query matches multiple contexts within the same text, the chunk metadata tells us which context or section each piece of information belongs to.
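One simple way to act on this prioritization idea is a post-retrieval re-ranking step. The sketch below is purely illustrative (the rerank_by_metadata helper and the boost value are assumptions, not part of the article's code): it nudges chunks whose source metadata matches the query's section ahead of otherwise similarly scored chunks:

```python
def rerank_by_metadata(results, preferred_source, boost=0.1):
    """Re-rank (score, payload) pairs, boosting chunks from a preferred section."""
    boosted = [
        (score + boost if payload.get("source") == preferred_source else score, payload)
        for score, payload in results
    ]
    # Highest adjusted score first
    return sorted(boosted, key=lambda pair: pair[0], reverse=True)

# Simulated vector-search results: (similarity score, chunk payload)
results = [
    (0.82, {"text": "AI is transforming industries.", "source": "General"}),
    (0.80, {"text": "The ethical implications of AI in healthcare...", "source": "Healthcare Section"}),
]

# For a healthcare-related query, prefer chunks from the Healthcare Section
top = rerank_by_metadata(results, "Healthcare Section")
print(top[0][1]["source"])  # → Healthcare Section
```

Qdrant can also apply this kind of constraint natively at query time via payload filtering (passing a query_filter with a Filter condition on the source field to search), which avoids re-ranking on the client side.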

Keep these strategies in mind, and happy chunking in your LLM-based search applications.
