Improve Data Quality for Your LLMs with Unstructured.io and KDB.AI

High-quality data is essential for Large Language Models (LLMs) in a Retrieval Augmented Generation (RAG) workflow. The saying “Garbage In, Garbage Out” applies to generative AI as much as it does to traditional machine learning: accurate LLM-generated RAG responses depend on ingesting high-quality, semantically relevant data into your vector database.

While early RAG experiments often used simple text files, it has become clear that critical business information is frequently found in complex document types such as PDFs, .docx, .pptx, .rtf, .pages, .key, and .epub. These formats, though challenging to work with, are standard for publishing business content. 

The key is to efficiently parse and ingest these documents, accurately extracting embedded entities like images, tables, and graphs. Proper extraction allows all data within complex documents to be integrated into a RAG workflow, ensuring accurate and contextual responses for both users and businesses. 

Unstructured is designed to facilitate the ingestion and pre-processing of diverse data formats, such as images and text-based documents (PDFs, HTML files, Word documents, etc.), for use with LLMs. It is available as an open-source library, as API services, and as the Unstructured Platform for production scenarios. Its modular functions and connectors transform unstructured data into structured formats. In a RAG context, Unstructured handles the ingestion and pre-processing that prepare data for insertion into your pipeline. The KDB.AI vector database is now integrated with Unstructured as a destination connector, making it easy to ingest complex documents into the vector database.

Unstructured Features & Functionalities: 

  • Connectors: Facilitate the ingestion of data from various sources into processing pipelines. 
  • Partition: Decompose documents into their smallest logical units (e.g., titles, body text) to extract structured content from raw, unstructured documents. 
  • Clean: Remove irrelevant sections and headers, utilizing element metadata to curate datasets effectively, crucial for maintaining data integrity. 
  • Extract: Identify and isolate relevant information, including natural language and specific entities within documents. 
  • Structure: Organize partitioned elements into a normalized JSON format, essential for tasks like enterprise-scale RAG, model fine-tuning, and pretraining. 
  • Chunk: Implement smart-chunking to group documents into contextually relevant segments for more precise data retrieval. 
  • Generate Embeddings: Create embeddings using popular models to enhance data utility. 
  • Destination Connectors: Export processed data to vector databases like KDB.AI for further use. 
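
To make the Partition and Chunk stages concrete, here is a minimal toy sketch in plain Python. It is not Unstructured's implementation: the `Element` class, `toy_partition`, and `toy_chunk_by_title` are hypothetical names invented for illustration, and the title heuristic is deliberately crude. It only shows the general idea of splitting a document into typed elements and then regrouping them into chunks that start at section titles.

```python
from dataclasses import dataclass


@dataclass
class Element:
    """A toy stand-in for a partitioned document element (hypothetical)."""
    category: str  # "Title" or "NarrativeText"
    text: str


def toy_partition(raw: str) -> list[Element]:
    """Split on blank lines; treat short blocks without end punctuation as titles."""
    elements = []
    for block in raw.split("\n\n"):
        text = " ".join(block.split())  # collapse internal whitespace
        if not text:
            continue
        is_title = len(text) < 40 and text[-1] not in ".!?"
        elements.append(Element("Title" if is_title else "NarrativeText", text))
    return elements


def toy_chunk_by_title(elements: list[Element], max_chars: int = 200) -> list[str]:
    """Start a new chunk at each title, or when adding text would exceed max_chars."""
    chunks: list[str] = []
    current = ""
    for el in elements:
        too_long = len(current) + len(el.text) + 1 > max_chars
        if current and (el.category == "Title" or too_long):
            chunks.append(current)
            current = ""
        current = f"{current}\n{el.text}" if current else el.text
    if current:
        chunks.append(current)
    return chunks


doc = (
    "Overview\n\nRAG needs clean data.\n\nIt also needs context."
    "\n\nSetup\n\nInstall the client first."
)
elements = toy_partition(doc)
chunks = toy_chunk_by_title(elements)
for chunk in chunks:
    print(repr(chunk))
```

Keeping each title together with the text that follows it means a retrieved chunk carries its own section context, which is the motivation behind title-aware chunking.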

Ingesting Unstructured Elements into KDB.AI 

We can use Unstructured to ingest a dataset of interest, partition it into useful elements, chunk the elements, embed them, and write the embeddings and metadata into the KDB.AI vector database with the KDB.AI Destination Connector. These steps can be run separately or together as a single pipeline, using either Bash or Python.

Prerequisites: 

  • KDB.AI Endpoint and API Key (from KDB.AI)  
  • KDB.AI Table Deployed 

Step 0: Installs and Imports

%pip install "unstructured[kdbai]" 
%pip install kdbai_client 

import os 
from getpass import getpass 

import kdbai_client as kdbai 

Step 1: Connect to KDB.AI 

KDB.AI Cloud Version: 

# Set up KDB.AI endpoint and API key 
KDBAI_ENDPOINT = ( 
    os.environ["KDBAI_ENDPOINT"] 
    if "KDBAI_ENDPOINT" in os.environ 
    else input("KDB.AI endpoint: ") 
) 
KDBAI_API_KEY = ( 
    os.environ["KDBAI_API_KEY"] 
    if "KDBAI_API_KEY" in os.environ 
    else getpass("KDB.AI API key: ") 
) 

# Connect to KDB.AI 
session = kdbai.Session(api_key=KDBAI_API_KEY, endpoint=KDBAI_ENDPOINT) 

# Use the default database 
database = session.database('default') 

KDB.AI Server Version:

# Start a session with KDB.AI Server 
session = kdbai.Session("http://localhost:8082") 
 
# Use the default database 
database = session.database('default') 

Step 2: Create a KDB.AI Schema and Table

schema = [ 
    {'name': 'id', 'type': 'str'}, 
    {'name': 'text', 'type': 'bytes'}, 
    {'name': 'metadata', 'type': 'general'}, 
    {'name': 'embedding', 'type': 'float32s'} 
] 

indexes = [ 
    { 
        'name': 'flat_index',  
        'column': 'embedding',  
        'type': 'flat',  
        'params': {'dims': 1536, 'metric': 'L2'}  # dims must match your embedding model's output size 
    } 
] 

# Create the table 
table = database.create_table("Unstructured_Table", schema=schema, indexes=indexes) 
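
Because the index above is declared with `'dims': 1536`, every embedding written to the table needs exactly that many components. The following is a minimal sketch of a pre-insert check in plain Python. `validate_record` and `EXPECTED_DIMS` are hypothetical names for illustration and are not part of the KDB.AI client or Unstructured.

```python
import math

EXPECTED_DIMS = 1536  # assumption: equals 'dims' in the index params above


def validate_record(record: dict, dims: int = EXPECTED_DIMS) -> list[str]:
    """Return a list of problems found in one record; an empty list means it looks OK."""
    problems = []
    # The four columns declared in the schema above
    for column in ("id", "text", "metadata", "embedding"):
        if column not in record:
            problems.append(f"missing column: {column}")
    embedding = record.get("embedding")
    if embedding is not None:
        if len(embedding) != dims:
            problems.append(f"embedding has {len(embedding)} dims, expected {dims}")
        if any(not math.isfinite(x) for x in embedding):
            problems.append("embedding contains NaN or infinity")
    return problems


good = {"id": "a1", "text": b"hello", "metadata": {"page": 1}, "embedding": [0.0] * 1536}
bad = {"id": "a2", "text": b"hi", "embedding": [0.1] * 384}

print(validate_record(good))  # []
print(validate_record(bad))
```

A check like this catches a mismatch between the embedding model and the index definition early, for example when a model that emits 384-dimensional vectors is paired with a 1536-dimensional index.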

Step 3: Implement the Unstructured Pipeline with KDB.AI Destination Connector 

Method 1: Bash 

#!/usr/bin/env bash 

unstructured-ingest \ 
 local \ 
 --input-path example-docs/book-war-and-peace-1p.txt \ 
 --output-dir local-to-kdbai \ 
 --strategy fast \ 
 --chunk-elements \ 
 --embedding-provider "<unstructured embedding provider, e.g. langchain-huggingface>" \ 
 --num-processes 2 \ 
 --verbose \ 
 kdbai \ 
 --endpoint "KDB.AI endpoint" \ 
 --api-key "private api-key" \ 
 --table-name "table name" \ 
 --batch-size 80 

Method 2: Python 

import os 

from unstructured.ingest.connector.kdbai import ( 
    KDBAIAccessConfig, 
    KDBAIWriteConfig, 
    SimpleKDBAIConfig, 
) 
from unstructured.ingest.connector.local import SimpleLocalConfig 
from unstructured.ingest.interfaces import ( 
    ChunkingConfig, 
    EmbeddingConfig, 
    PartitionConfig, 
    ProcessorConfig, 
    ReadConfig, 
) 
from unstructured.ingest.runner import LocalRunner 
from unstructured.ingest.runner.writers.base_writer import Writer 
from unstructured.ingest.runner.writers.kdbai import ( 
    KDBAIWriter, 
) 

def get_writer() -> Writer: 
    return KDBAIWriter( 
        connector_config=SimpleKDBAIConfig( 
            access_config=KDBAIAccessConfig(api_key=os.getenv("KDBAI_API_KEY")), 
            endpoint="http://localhost:8082", # Or cloud endpoint 
            table_name="Unstructured_Table",  # the table created in Step 2 
        ), 
        write_config=KDBAIWriteConfig(), 
    ) 

if __name__ == "__main__": 
    writer = get_writer() 
    runner = LocalRunner( 
        processor_config=ProcessorConfig( 
            verbose=True, 
            output_dir="local-output-to-kdbai", 
            num_processes=2, 
        ), 
        connector_config=SimpleLocalConfig( 
            input_path="example-docs/book-war-and-peace-1225p.txt", 
        ), 
        read_config=ReadConfig(), 
        partition_config=PartitionConfig(), 
        chunking_config=ChunkingConfig(chunk_elements=True), 
        embedding_config=EmbeddingConfig( 
            provider="langchain-huggingface", 
        ), 
        writer=writer, 
        writer_kwargs={}, 
    ) 
    runner.run() 

Step 4: Query KDB.AI 

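from unstructured.embed.huggingface import ( 
    HuggingFaceEmbeddingConfig, 
    HuggingFaceEmbeddingEncoder, 
) 

# Use the same embedding model for queries that the pipeline used at ingestion time 
embedding_encoder = HuggingFaceEmbeddingEncoder( 
    config=HuggingFaceEmbeddingConfig() 
) 

query = "This is the query" 
query_embedding = embedding_encoder.embed_query(query=query) 

# Wrap the query vector in a list; [0] selects the results for the first (and only) query 
table.search(vectors={'flat_index': [query_embedding]}, n=3)[0] 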
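
For intuition about what a flat index with the L2 metric does at query time, here is a minimal pure-Python sketch of exhaustive nearest-neighbor search. It is an illustration only, not KDB.AI's implementation; `l2_distance`, `flat_search`, and the sample vectors are hypothetical.

```python
import math


def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def flat_search(query, vectors, n=3):
    """Score every stored vector against the query and return the n closest as (id, distance)."""
    scored = [(vid, l2_distance(query, vec)) for vid, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:n]


stored = {
    "a": [0.0, 0.0],
    "b": [1.0, 0.0],
    "c": [0.0, 3.0],
    "d": [5.0, 5.0],
}
results = flat_search([0.9, 0.1], stored, n=3)
for vid, dist in results:
    print(vid, round(dist, 3))
```

Exhaustive search compares the query against every stored vector, so results are exact but cost grows linearly with table size. That trade-off is usually acceptable for small collections.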

Try it Today 

Check out the sample notebook on GitHub and get started with KDB.AI and Unstructured today!