
Open-source Semantic Search

Use LLMs to search documents in Elasticsearch


Semantic search has been a trendy topic in the recent hype around Large Language Models (LLMs). A main reason is that traditional text-based search takes a lexical approach with known limitations: it looks for literal matches (or variants of them, such as stemmed forms) of the words the user typed. This approach misses context and doesn't understand what the query as a whole really means.

For example, when a user searches for “insurance”, a lexical search solution will fail to surface documents that contain “Medicaid” but don’t explicitly contain the word “insurance”. A quick and easy way to address this is to take advantage of the powerful open-source LLMs available on the Hugging Face model hub.

In this post and the accompanying notebook, I show how to use open-source LLMs to add semantic search on top of Elasticsearch.

Here are the high-level steps:

  • Pick an embedding granularity, i.e. whether word-, sentence-, or document-level embeddings suit your use case.
  • Create document-level embeddings; many foundation models are suitable for semantic search and sentence representation in general.
  • Start a single-node Elasticsearch cluster locally following the official Docker setup documentation.
  • Index the documents along with their embeddings.
  • Finally, at search time, encode the query with the same encoder and retrieve the vectors most similar to the query vector.

For the purposes of this post, I use the 20 Newsgroups dataset as my corpus, a T5-small model to get document-level embeddings, and a locally running Elasticsearch cluster. The complete notebook is on my GitHub.

Please note that you can always pick a better model (such as FLAN-T5) and fine-tune the selected model on your own corpus to get better results. You should also look into ranked retrieval metrics like nDCG to measure how well your semantic search system performs.
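
For context, here is a minimal sketch of nDCG@k, assuming graded relevance labels that you would collect for your own queries (the function names and example labels are illustrative, not from the original notebook):

import math

def dcg_at_k(relevances, k):
    # DCG: sum of rel_i / log2(rank + 1), with ranks starting at 1
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    # normalize by the DCG of the ideal (descending) ordering
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# e.g. graded relevance of the top-5 results returned for one query
print(ndcg_at_k([3, 2, 0, 1, 0], k=5))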

1. Set up a single-node Elasticsearch cluster on your local environment

You can skip this section if you already have one.

  1. Pull the Elasticsearch Docker image: docker pull docker.elastic.co/elasticsearch/elasticsearch:8.7.0
  2. Create a new Docker network for Elasticsearch and Kibana: docker network create elastic
  3. Start Elasticsearch in Docker, and keep the enrollment token and elastic password it prints on first startup: docker run --name es01 --net elastic -p 9200:9200 -it docker.elastic.co/elasticsearch/elasticsearch:8.7.0
  4. Copy the http_ca.crt security certificate from the Docker container to your local machine: docker cp es01:/usr/share/elasticsearch/config/certs/http_ca.crt .
  5. [optional] Set up Kibana (the version should match your Elasticsearch version): docker run --name kibana --net elastic -p 5601:5601 docker.elastic.co/kibana/kibana:8.7.0 (note that you will need the enrollment token from step 3)
  6. Test that you can reach the cluster from your notebook, as in the next section. See the Elasticsearch Python client documentation for more detail.

2. Create an Elasticsearch index

from elasticsearch import Elasticsearch

ELASTIC_PASSWORD = "PASSWORD"
ES_HOST = "https://localhost:9200/"
index_name = "semantic-search"

# create the client instance
client = Elasticsearch(
    hosts=ES_HOST,
    ca_certs='./http_ca.crt',
    basic_auth=("elastic", ELASTIC_PASSWORD)
)

# get cluster information
client.info()

# define the index configuration
config = {
    "mappings": {
        "properties": {
            "text": {"type": "text"},
            "embeddings": {
                "type": "dense_vector",
                "dims": 512,  # must match the encoder's hidden size (512 for t5-small)
                "index": True,
                "similarity": "cosine"  # similarity metric used by kNN search
            }
        }
    },
    "settings": {
        "number_of_shards": 2,
        "number_of_replicas": 1
    }
}

# create the index in Elasticsearch (skip if it already exists)
try:
    client.indices.create(
        index=index_name,
        settings=config["settings"],
        mappings=config["mappings"],
    )
except Exception:
    print(f"Index already exists: {client.indices.exists(index=[index_name])}")

3. Get the embedding vectors

import torch
from datasets import load_dataset
from transformers import T5Model, T5Tokenizer

# load the dataset
dataset = load_dataset('newsgroup', '18828_alt.atheism')
# check an example of the data
dataset['train'][0]['text']

# take a small subset of the training split to keep the demo fast
# (used for indexing below; the subset size is arbitrary)
small_dataset = dataset['train'].select(range(100))

# load the pre-trained model
tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5Model.from_pretrained("t5-small")

# function to get document-level embeddings via mean pooling
def get_embeddings(input_text, model=model, tokenizer=tokenizer, max_length=512):

    inputs = tokenizer(input_text,
                       max_length=max_length,
                       truncation=True,
                       padding="max_length",
                       return_tensors="pt")

    # T5 is an encoder-decoder model, so it expects decoder inputs;
    # here the encoder input ids are reused as decoder input ids
    with torch.no_grad():
        outputs = model(input_ids=inputs['input_ids'],
                        decoder_input_ids=inputs['input_ids'])

    # mean-pool the last hidden states into a single 512-dim vector
    last_hidden_states = torch.mean(outputs[0], dim=1)

    return last_hidden_states.tolist()

# index the documents and their embeddings in Elasticsearch
for i in range(small_dataset.num_rows):
    doc = {
        "text": small_dataset['text'][i],
        "embeddings": get_embeddings(small_dataset['text'][i])[0]
    }
    client.index(index=index_name, document=doc)

# check the number of saved documents
result = client.count(index=index_name)
print(result.body['count'])
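
Indexing one document per request is fine for this small demo, but it gets slow on larger corpora. Here is a minimal sketch of the same loop using the official bulk helper, assuming the client, index_name, small_dataset, and get_embeddings defined above:

from elasticsearch.helpers import bulk

# one action per document; _source carries the same fields as before
actions = (
    {
        "_index": index_name,
        "_source": {
            "text": small_dataset['text'][i],
            "embeddings": get_embeddings(small_dataset['text'][i])[0],
        },
    }
    for i in range(small_dataset.num_rows)
)
bulk(client, actions)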

4. Search!

After indexing finishes, we can search our data. Elasticsearch supports approximate k-nearest-neighbor (kNN) search over indexed dense_vector fields, using the similarity metric defined in the mapping (cosine here), and the Python client exposes it directly. You also have the option to score with a custom similarity function, as sketched after the kNN snippet below.

Here is a code snippet for kNN with k=5. Remember that you have to embed the query with the same encoder used for the documents.

# embed a query document with the same encoder used at indexing time
query_embedding = get_embeddings(dataset['train']['text'][20])[0]

query_dict = {
    "field": "embeddings",
    "query_vector": query_embedding,
    "k": 5,
    "num_candidates": 5
}
# note: newer clients deprecate knn_search in favor of client.search(knn=...)
res = client.knn_search(index=index_name, knn=query_dict, source=["text"])

# inspect the hits and their similarity scores
for hit in res['hits']['hits']:
    print(hit['_score'], hit['_source']['text'][:100])
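
If you would rather score every document exactly instead of using approximate kNN, a script_score query with Elasticsearch's built-in cosineSimilarity function is one option. A sketch reusing query_embedding from above (the + 1.0 offsets scores into the non-negative range Elasticsearch requires):

# exact (brute-force) cosine scoring over all documents
res = client.search(
    index=index_name,
    query={
        "script_score": {
            "query": {"match_all": {}},
            "script": {
                "source": "cosineSimilarity(params.query_vector, 'embeddings') + 1.0",
                "params": {"query_vector": query_embedding},
            },
        }
    },
    source=["text"],
)

This scans the whole index, so it is slower than kNN but exact; for a corpus of this size the difference is negligible.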



By Amir Imani. Originally published in Artificial Intelligence in Plain English on Medium, June 21, 2023: https://ai.plainenglish.io/open-source-semantic-search-a656e4c5483a
