Enhance Retrieval with Fuzzy Filtering and Matching Algorithms in KDB.AI

KDB.AI provides a powerful suite of search functionalities, including similarity search, hybrid search, and Temporal Similarity Search (TSS). These advanced capabilities are further enhanced by metadata filtering, which improves both search speed and accuracy, and regex filtering, which allows for complex pattern matching. 

To further optimize search performance and flexibility, KDB.AI now introduces Fuzzy Filtering. This new feature allows for more robust and error-tolerant searches, ensuring that even imprecise or partially matched queries and filters return relevant results. 

What is a Fuzzy Filter? When is it useful?  

Data often contains errors such as typos and misspellings, which can hinder the accuracy of search results. Fuzzy filters address this issue by enabling the retrieval of documents that contain terms and metadata entries similar to the specified query term and filters, even if there are slight variations. 

Imagine a user wants to search for stock data for Apple Inc., whose ticker symbol is “AAPL”. However, due to a typo, the user enters “APPL” instead. Without fuzzy filtering, this typo might result in no relevant search results or incorrect data being retrieved. By leveraging KDB.AI’s fuzzy filtering capabilities, the system can intelligently handle such errors and still provide the correct stock data for “AAPL”. 

With KDB.AI, fuzzy filters can be applied to metadata columns during any similarity search, enhancing the flexibility and accuracy of your searches. Fuzzy filtering supports metadata columns of type string, symbol, or enumeration.

The Benefits of Fuzzy Filtering 

The addition of fuzzy filters to KDB.AI provides several significant benefits, enhancing the accuracy and user experience of data retrieval processes. Here are some key advantages: 

  • Handling Typos 
    • Fuzzy filters tolerate imprecise search terms, so typos and minor spelling errors still retrieve relevant results. For example, a search for “APPL” will still return data for “AAPL.”
  • International Spelling Variations 
    • Fuzzy filters accommodate different spelling conventions across languages and regions. Users find the same relevant information whether they use American English (e.g., “color”) or British English (e.g., “colour”) spellings, making for a more inclusive, global-friendly search experience.
  • User-Friendly Experience 
    • Fuzzy filters improve the user experience by allowing slight variations in search terms, ensuring users get meaningful results without needing exact inputs. 
  • Robustness 
    • Fuzzy filters enhance system robustness by accounting for real-world data imperfections, such as misspellings and abbreviations, ensuring comprehensive and reliable searches. 
  • Increased Recall 
    • Fuzzy filters increase recall by capturing relevant documents that might be missed due to minor spelling differences, ensuring a more complete retrieval of relevant data. 

Fuzzy Filter Implementation

A fuzzy filter is implemented within the filter expression of a given search. 

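table.search(
    vectors=query_vector,
    n=3,
    filter=[
        # Standard metadata filter: keep rows whose "years" value is between 2000 and 2010
        ("within", "years", [2000, 2010]),
        # Fuzzy filter on "colToScan": [search term, edit distance, distance metric];
        # replace "distance_metric" with the name of the metric you want to use
        ['fuzzy', 'colToScan', [["AAPL", 1, "distance_metric"]]]
    ]
)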

Supported arguments: 

Edit Distance:  

Fuzzy filters work by measuring the “edit distance” between strings. Common edit distance algorithms include Levenshtein distance and Damerau-Levenshtein distance. Levenshtein distance counts the minimum number of single-character insertions, deletions, and substitutions needed to change one string into another; Damerau-Levenshtein distance also counts a transposition of two adjacent characters as a single edit. The higher the distance, the greater the difference between the two strings.

To illustrate, consider the Levenshtein distance between the strings “kitten” and “sitting”. The Levenshtein distance here is 3, meaning three edits are required to transform “kitten” into “sitting”. 

Steps to Transform “kitten” into “sitting”: 

  1. Substitution: Replace “k” with “s”.
    • “kitten” → “sitten”
  2. Substitution: Replace “e” with “i”.
    • “sitten” → “sittin”
  3. Insertion: Add “g” at the end.
    • “sittin” → “sitting”

Thus, the transformation involves two substitutions and one insertion, resulting in a Levenshtein distance of 3. 
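The transformation above can be checked with a short dynamic-programming sketch. This is a textbook Levenshtein implementation written for illustration only; the function name `levenshtein` is our own, and it is not KDB.AI's internal code.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    # previous[j] = distance between the prefix of a processed so far and b[:j]
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep a matching character)
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
print(levenshtein("APPL", "AAPL"))       # 1
```

The second call shows why an edit distance of 1 is enough to recover “AAPL” from the mistyped “APPL”: a single substitution separates them.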

Definition of Edits 

  • Insertion: Adding a character to a string. 
    • Example: “sit” → “sitt” 
  • Deletion: Removing a character from a string. 
    • Example: “sitting” → “sittin” 
  • Substitution: Replacing one character with another. 
    • Example: “kitten” → “sitten” 
  • (For Damerau-Levenshtein) Transposition: Swapping two adjacent characters. 
    • Example: “abcdef” → “abcedf” 
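To see what the transposition edit changes in practice, here is a minimal sketch of the Optimal String Alignment (OSA) distance, the restricted form of Damerau-Levenshtein in which each substring is edited at most once. The function name `osa_distance` is hypothetical, and the code illustrates the general algorithm rather than KDB.AI's implementation.

```python
def osa_distance(a: str, b: str) -> int:
    """Edit distance allowing insert, delete, substitute, and adjacent transposition."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete every character of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert every character of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Two adjacent characters swapped: count the swap as one edit
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(osa_distance("abcdef", "abcedf"))  # 1
print(osa_distance("AAPL", "APAL"))      # 1
```

Plain Levenshtein distance would score both pairs as 2, because without a transposition edit a swap has to be expressed as two substitutions. A transposition-aware metric is therefore a natural fit when swapped letters are a common kind of typo in your data.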

Choosing a reasonable edit distance is important to balance accuracy and relevance. If the edit distance is set too high, the search may return a large number of irrelevant results, reducing the quality of the output. Ideally, the edit distance should be set just high enough to capture intended matches such as slight typos, while minimizing irrelevant results.
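The trade-off can be made concrete with a small experiment. The sketch below uses a made-up list of ticker-like strings (not real KDB.AI data) and a hypothetical helper, `matches_within`, to show how the match set grows as the allowed edit distance increases.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def edit_distance(a: str, b: str) -> int:
    """Recursive Levenshtein distance; fine for short strings such as tickers."""
    if not a:
        return len(b)
    if not b:
        return len(a)
    if a[0] == b[0]:
        return edit_distance(a[1:], b[1:])
    return 1 + min(
        edit_distance(a[1:], b),      # delete the first character of a
        edit_distance(a, b[1:]),      # insert the first character of b
        edit_distance(a[1:], b[1:]),  # substitute one for the other
    )


def matches_within(query, candidates, max_distance):
    """Return the candidates whose edit distance to query is at most max_distance."""
    return [c for c in candidates if edit_distance(query, c) <= max_distance]


# Illustrative values only
tickers = ["AAPL", "AMPL", "APLE", "AAL", "MSFT", "GOOG"]

for threshold in (0, 1, 2, 4):
    print(threshold, matches_within("APPL", tickers, threshold))
```

With a threshold of 1 the typo “APPL” matches “AAPL” and “AMPL”; at 2 it also picks up “APLE” and “AAL”; at 4 every four-letter string qualifies and the filter no longer filters anything. For short values like ticker symbols, a distance of 1 is usually the most that is useful.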

Distance Metric: 

Fuzzy filtering defaults to Levenshtein distance, but you can choose the distance metric from a variety of options: Levenshtein, Damerau-Levenshtein, Hamming, Indel, Jaro, JaroWinkler, Longest Common Subsequence, Optimal String Alignment (OSA), Prefix, or Postfix.
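These metrics differ in which edits they allow, so the same pair of strings can score differently under each. As a rough illustration, the sketch below implements textbook versions of two of them, Hamming and Indel, alongside the longest-common-subsequence length that Indel builds on. The function names are our own, and KDB.AI's implementations may differ in details such as normalization or how unequal lengths are handled.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions that differ; substitutions only, equal lengths required."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(char_a != char_b for char_a, char_b in zip(a, b))


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    previous = [0] * (len(b) + 1)
    for char_a in a:
        current = [0]
        for j, char_b in enumerate(b, start=1):
            if char_a == char_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def indel(a: str, b: str) -> int:
    """Insertions and deletions only: every character outside the LCS costs one edit."""
    return len(a) + len(b) - 2 * lcs_length(a, b)


print(hamming("APPL", "AAPL"))       # 1
print(indel("APPL", "AAPL"))         # 2
print(indel("kitten", "sitting"))    # 5
```

Because Indel has no substitution edit, replacing one character costs a deletion plus an insertion, which is why “APPL” and “AAPL” score 2 here but 1 under Levenshtein. An edit-distance threshold therefore needs to be chosen with the selected metric in mind.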

When to Use Fuzzy Filtering 

  • Similarity Search: Fuzzy filter on a metadata column 

A variety of metadata columns can be attached to vectors in the KDB.AI vector database. Metadata columns are extra fields that further describe the vector, for example dates, symbols, descriptions, author, title, coordinates, or duration. These fields can be used with KDB.AI’s metadata filtering during similarity search to improve the speed and accuracy of the search. Many of these fields are strings and could contain typos or other errors, and this is where fuzzy filtering can help!

Let’s consider a scenario where we’re identifying potentially fraudulent transactions using a “transaction_type” metadata field. The term “wire transfer” might be recorded with slight variations like “Wire Transfer,” “wire-transfer,” or “wiretransfer” due to different input methods.  

These minor differences can significantly impact metadata filtering in fraud detection. 

Without fuzzy filtering, a search for “Wire Transfer” would only return exact matches, missing transactions labeled as “wire-transfer” or “wiretransfer.” This could result in overlooking crucial data points in fraud analysis. 

By integrating fuzzy filtering, KDB.AI can effectively address this challenge. The algorithm recognizes these variations as similar terms, retrieving all relevant entries regardless of the specific spelling used. This ensures a more comprehensive fraud detection process, enhancing the accuracy and reliability of the system by reducing the risk of missing potentially fraudulent transactions due to minor data entry inconsistencies. 
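To get a feel for which edit distance this scenario needs, the following sketch simulates a fuzzy metadata filter in plain Python. The rows, the `fuzzy_filter` helper, and its `ignore_case` option are all hypothetical and exist only for illustration; the comparison is assumed to be case-sensitive by default, which may not match KDB.AI's behavior.

```python
def levenshtein(a: str, b: str) -> int:
    """Textbook Levenshtein distance using two rolling rows."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost))
        previous = current
    return previous[-1]


def fuzzy_filter(rows, column, term, max_distance, ignore_case=False):
    """Return ids of rows whose column value is within max_distance edits of term."""
    def prepare(text):
        return text.casefold() if ignore_case else text
    return [
        row["id"] for row in rows
        if levenshtein(prepare(row[column]), prepare(term)) <= max_distance
    ]


# Made-up transactions for illustration
rows = [
    {"id": 1, "transaction_type": "Wire Transfer"},
    {"id": 2, "transaction_type": "wire-transfer"},
    {"id": 3, "transaction_type": "wiretransfer"},
    {"id": 4, "transaction_type": "card payment"},
]

print(fuzzy_filter(rows, "transaction_type", "Wire Transfer", 1))                    # [1]
print(fuzzy_filter(rows, "transaction_type", "Wire Transfer", 3))                    # [1, 2, 3]
print(fuzzy_filter(rows, "transaction_type", "Wire Transfer", 1, ignore_case=True))  # [1, 2, 3]
```

Under a case-sensitive comparison, each variant is three edits away from “Wire Transfer” (two capitalization changes plus the separator), so a distance of 1 would miss them. If the stored values are normalized to a single case at ingestion, the same variants are only one edit away, which allows a tighter and safer threshold.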

To Explore Fuzzy Filtering in Depth: 

  1. Dive into our documentation to master the intricacies of fuzzy filtering.
  2. Get hands-on experience:

Whether you’re a curious beginner or an experienced developer, these resources will help you unlock the full potential of fuzzy filtering in your vector searches.