Serverless Generative AI: How to Query Meta’s Llama 2 Model with Microsoft’s Semantic Kernel and AWS Services

Generative AI is a type of artificial intelligence that can create new content, such as text, images, or music, in response to prompts. Generative AI models learn the patterns and structure of their training data using neural networks, and then generate new data with similar characteristics.

They are all the rage these days. 😀

Some types of generative AI include:

Foundation models, which are complex machine learning systems trained on vast quantities of data (text, images, audio or a mix of data types) on a massive scale. Foundation models can be adapted quickly to a wide range of downstream tasks, often with little or no task-specific training. Examples of foundation models are GPT, LaMDA and Llama.

Generative adversarial networks (GANs), which are composed of two competing neural networks: a generator that creates fake data and a discriminator that tries to distinguish between real and fake data. The generator improves its ability to fool the discriminator over time. GANs can generate realistic images, videos, or audio. Examples of GANs are StyleGAN and CycleGAN.

Variational autoencoders (VAEs), which are neural networks that encode input data into a latent space and then decode it back into output data. VAEs can generate new data that is similar but not identical to the input data. VAEs can also perform tasks such as denoising, inpainting, or style transfer. Examples of VAEs are VQ-VAE and NVAE.

What is Microsoft Semantic Kernel?

Microsoft’s Semantic Kernel is an open-source SDK that lets you easily combine AI services like OpenAI, Azure OpenAI, and Hugging Face with conventional programming languages like C# and Python. Semantic Kernel enables you to create AI apps that combine natural language semantic functions, traditional code native functions, and embeddings-based memory.

Semantic Kernel supports and encapsulates several design patterns from the latest in AI research, such as prompt chaining, recursive reasoning, summarization, zero/few-shot learning, contextual memory, long-term memory, embeddings, semantic indexing, planning, retrieval-augmented generation and accessing external knowledge stores.

What is Llama 2?

Llama 2 is Meta's answer to the Large Language Model challenge. Llama 2 has models ranging from 7B to 70B parameters, trained on 2 trillion tokens, and fine-tuned on over 1 million human annotations. It is available on Amazon SageMaker JumpStart for anyone to use.

So what is the problem?

Semantic Kernel (referred to as 'SK' from here onwards) unsurprisingly supports OpenAI and Azure OpenAI models out of the box. It also has connectors for some of the HuggingFace and Oobabooga models. However, at the time of writing, it doesn't support Llama 2 models. Since I wanted to play around with Llama 2 models, I decided to modify SK and add one more connector for Llama 2.

Architecture

Using a model on Amazon SageMaker JumpStart is as simple as clicking a button and waiting for an endpoint to be ready. It doesn't take more than 5 minutes.

Implementation

Once the endpoints are ready, you will need to create an AWS Lambda function and an Amazon API Gateway. The Lambda function has the following IAM permissions attached so that it can query SageMaker endpoints.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "sagemaker:InvokeEndpointAsync",
                "sagemaker:InvokeEndpoint"
            ],
            "Resource": "*"
        }
    ]
}

API Gateway supports API-key-based security, which aligns with SK's approach to calling API endpoints.

Llama 2 expects its input in a JSON format, unlike GPT-based models, which expect a simple string. Example:

[
    {"role": "user", "content": "what is the recipe of mayonnaise?"}
]

This model supports only the 'system', 'user' and 'assistant' roles: an optional 'system' message first, then 'user' and 'assistant' messages alternating (u/a/u/a/u...). So I needed to modify SK to transform the string input into the format that Llama understands.
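The ordering rule above is easy to check before a request goes out. The following is a minimal Python sketch (the connector in this post is C#; Python is used here only to keep the example short). The function name validate_dialog and the choice to raise ValueError are my own assumptions, not part of SK or the Llama 2 endpoint.

```python
from typing import Dict, List

ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_dialog(messages: List[Dict[str, str]]) -> None:
    """Raise ValueError if the dialog breaks the role rules described above.

    Hypothetical helper for illustration; not part of SK or the endpoint.
    """
    if not messages:
        raise ValueError("dialog is empty")

    for message in messages:
        if message.get("role") not in ALLOWED_ROLES:
            raise ValueError(f"unsupported role: {message.get('role')!r}")

    # An optional system message may only appear in the first position.
    turns = messages[1:] if messages[0]["role"] == "system" else messages
    if not turns:
        raise ValueError("dialog needs at least one user message")

    # The remaining turns must alternate user, assistant, user, ...
    for index, message in enumerate(turns):
        expected = "user" if index % 2 == 0 else "assistant"
        if message["role"] != expected:
            raise ValueError(
                f"turn {index} should be {expected!r}, got {message['role']!r}"
            )


if __name__ == "__main__":
    validate_dialog([
        {"role": "system", "content": "You answer Medicare questions."},
        {"role": "user", "content": "Am I eligible for Medicare?"},
    ])
    print("dialog is well formed")
```

Failing fast on the client side gives a clearer error than waiting for the endpoint to reject a malformed dialog.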

Fortunately, SK supports a {{ }} based prompt template. We can use it to write a prompt that escapes these curly braces, and then run the rendered prompt through a simple regex parser that transforms it into role/content pairs.

Input:

{{ '{{ system }}' }} You are a bot which can answer queries on Medicare. Don't talk about anything else.

{{ '{{ user }}' }} {{$input}}

SK Transformation:

{{ system }} You are a bot which can answer queries on Medicare. Don't talk about anything else.

{{ user }} I am 38 years old. Am I eligible for Medicare?

Regex Parser Output:

Role: system

Content: You are a bot which can answer queries on Medicare. Don't talk about anything else.

Role: user

Content: I am 38 years old. Am I eligible for Medicare?
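The transformation above can be sketched in a few lines of Python. This is an illustration of the idea, not the LlamaTextParser from the repository; the name parse_prompt, the exact pattern, and the fallback of treating marker-free text as a single user message are assumptions.

```python
import re
from typing import Dict, List

# Matches role markers such as "{{ system }}" or "{{user}}".
ROLE_MARKER = re.compile(r"\{\{\s*(system|user|assistant)\s*\}\}")


def parse_prompt(text: str) -> List[Dict[str, str]]:
    """Split a rendered prompt into role/content messages.

    Hypothetical helper: each marker starts a message that runs until the
    next marker (or the end of the text).
    """
    matches = list(ROLE_MARKER.finditer(text))
    if not matches:
        # Assumption: text without any marker is one user message.
        return [{"role": "user", "content": text.strip()}]

    messages = []
    for index, match in enumerate(matches):
        is_last = index + 1 == len(matches)
        end = len(text) if is_last else matches[index + 1].start()
        messages.append({
            "role": match.group(1),
            "content": text[match.end():end].strip(),
        })
    return messages


if __name__ == "__main__":
    rendered = (
        "{{ system }} You are a bot which can answer queries on Medicare. "
        "Don't talk about anything else.\n\n"
        "{{ user }} I am 38 years old. Am I eligible for Medicare?"
    )
    for message in parse_prompt(rendered):
        print("Role:", message["role"])
        print("Content:", message["content"])
```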

Llama 2 also expects the custom attribute "accept_eula=true" to be sent with each inference request; otherwise it rejects the request. We can easily send it using the SageMaker Runtime client from the AWS SDK.

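// Requires: Amazon.SageMakerRuntime, Amazon.SageMakerRuntime.Model, System.IO, System.Text
// Forwards the request body to the SageMaker endpoint and returns the raw model output.
private async Task<IActionResult> CallSageMakerModel(string endpointName, byte[] body, string contentType)
{
    AmazonSageMakerRuntimeClient awsSageMakerRuntimeClient = new AmazonSageMakerRuntimeClient();

    InvokeEndpointRequest request = new InvokeEndpointRequest();
    request.EndpointName = endpointName;
    request.ContentType = contentType;
    request.Body = new MemoryStream(body);
    // Llama 2 rejects requests that do not accept the EULA.
    request.CustomAttributes = "accept_eula=true";

    var response = await awsSageMakerRuntimeClient.InvokeEndpointAsync(request);

    // The response body is a stream containing the model's JSON output.
    var result = Encoding.UTF8.GetString(response.Body.ToArray());
    return Ok(result);
}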

We don't have to start from scratch when adding support for Llama to SK. We can use one of the existing connectors as a base and build on it. I used the HuggingFace connector and modified it as I went along.

To add support for text completions, you have to implement the ITextCompletion interface and, for text embeddings, the ITextEmbeddingGeneration interface. Here's an example of a text completion implementation.

private async Task<IReadOnlyList<ITextStreamingResult>> ExecuteGetCompletionsAsync(string text, CancellationToken cancellationToken = default)
{
    try
    {
        LlamaTextParser textParser = new LlamaTextParser();
        var chats = textParser.Parse(text);

        var completionRequest = new TextCompletionRequest();
        completionRequest.Inputs = new List<List<Chat>> { chats };
        completionRequest.Parameters = new AIParameters();
        completionRequest.Parameters.MaxNewTokens = 1024;
        completionRequest.Parameters.Temperature = 0.6;
        completionRequest.Parameters.TopP = 0.9;

        using var httpRequestMessage = HttpRequest.CreatePostRequest(this.GetRequestUri(), completionRequest);
        httpRequestMessage.Headers.Add("User-Agent", HttpUserAgent);

        if (!string.IsNullOrEmpty(this._apiKey))
        {
            httpRequestMessage.Headers.Add("x-api-key", this._apiKey);
        }

        .....
}

You can play around with the max_new_tokens, temperature and top_p parameters to suit your needs.
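To make the role of these parameters concrete, here is a hedged Python sketch of the request the connector ends up sending. The "inputs"/"parameters" layout mirrors the Inputs and Parameters properties in the C# code above and the snake_case names used in this post; the function names and the example URL are hypothetical, and nothing is sent over the network.

```python
import json
import urllib.request
from typing import Dict, List, Optional

Dialog = List[Dict[str, str]]


def build_payload(dialog: Dialog, max_new_tokens: int = 1024,
                  temperature: float = 0.6, top_p: float = 0.9) -> bytes:
    """Serialize one dialog plus generation parameters as JSON bytes."""
    payload = {
        # A list of dialogs, matching List<List<Chat>> in the C# code.
        "inputs": [dialog],
        "parameters": {
            "max_new_tokens": max_new_tokens,
            "temperature": temperature,
            "top_p": top_p,
        },
    }
    return json.dumps(payload).encode("utf-8")


def build_request(url: str, dialog: Dialog, api_key: Optional[str] = None,
                  **parameters) -> urllib.request.Request:
    """Create (but do not send) a POST request for the API Gateway URL."""
    headers = {"Content-Type": "application/json"}
    if api_key:
        # API Gateway reads the key from the x-api-key header.
        headers["x-api-key"] = api_key
    return urllib.request.Request(
        url,
        data=build_payload(dialog, **parameters),
        headers=headers,
        method="POST",
    )


if __name__ == "__main__":
    request = build_request(
        "https://example.invalid/llama",  # hypothetical endpoint URL
        [{"role": "user", "content": "what is the recipe of mayonnaise?"}],
        api_key="my-api-key",
        temperature=0.2,
    )
    print(request.data.decode("utf-8"))
```

Lower temperature and top_p values make answers more deterministic, which tends to suit FAQ-style bots; max_new_tokens caps the length of the reply.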

What are Embeddings?

Embeddings are numerical representations of data that capture their semantic meaning and similarity. For example, words, sentences, images, or sounds can be converted into embeddings using machine learning models. Embeddings are useful for many tasks that involve natural language processing, computer vision, or audio processing.

Vector databases are specialized databases that store and search embeddings efficiently. They use algorithms such as approximate nearest neighbor search to find the most similar embeddings in a large collection. Vector databases can enable applications such as semantic search, recommendation systems, clustering, classification, and more.
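As a rough illustration of what "finding the most similar embeddings" means, the Python sketch below ranks stored vectors by cosine similarity with a plain linear scan. This is exact brute-force search, not the approximate nearest neighbor indexing a vector database actually uses; the function names and the tiny two-dimensional vectors are made up for the example.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: Sequence[float], store: Dict[str, Sequence[float]],
          k: int = 3) -> List[Tuple[str, float]]:
    """Return the k stored items most similar to the query, best first."""
    scored = [(key, cosine_similarity(query, vector))
              for key, vector in store.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    # Toy 2-D "embeddings"; real models produce hundreds of dimensions.
    store = {
        "eligibility": [1.0, 0.0],
        "premiums": [0.0, 1.0],
        "enrollment": [0.8, 0.6],
    }
    for key, score in top_k([1.0, 0.1], store, k=2):
        print(key, round(score, 3))
```

A linear scan like this costs time proportional to the collection size on every query, which is why vector databases build approximate indexes instead.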

Qdrant is an example of an open-source, cloud-native vector database. To run it locally:

docker pull qdrant/qdrant

docker run -p 6333:6333 -v "$(pwd)/qdrant:/qdrant/storage" qdrant/qdrant

Why are Embeddings important?

LLMs like Llama 2 are trained on a vast amount of data. But if you want the LLM to answer queries based on your own data, you have to find a way to provide that data to the model. There are two ways to do it:

1. You can copy and paste paragraphs of text into the chatbot/API that you prepared. But since these LLMs can keep only a limited amount of context in memory (the context window), you will soon run into hallucinations.

2. Transform the text into embeddings (arrays of numbers) using a model like all-MiniLM-L6-v2, store them in a vector database like Qdrant, and query the database for relevant passages while responding to the user's input.

I chose to do it the second way.

Testing

I copied the Medicare SSA FAQs into a text file and fed it to the API. Next, you have to "ground" the Llama 2 model so that it answers queries about Medicare only from the data we provided. I wrote the following prompt to do it, though admittedly it has plenty of room for improvement.

var prompt = @"
{{ '{{ system }}' }} You are a bot which can answer queries on Medicare.
Consider only 'Medicare FAQ' data while answering questions. Don't talk about anything else.
Medicare FAQ: {{recall $input}}
{{ '{{ user }}' }} {{$input}}
";

Here, recall is a skill that is built into SK and can query vector databases.
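Conceptually, the template is filled in two steps: recall fetches the most relevant stored text for the question, and the result is placed into the prompt next to the question itself. The Python sketch below imitates that flow with the standard library; it is not SK's template engine, and render_grounded_prompt and the recall callable are hypothetical stand-ins.

```python
from string import Template
from typing import Callable

# Same wording as the prompt above, with $recalled and $input as the two
# slots. The "{{ role }}" markers are left for the regex parser to pick up.
GROUNDED_PROMPT = Template(
    "{{ system }} You are a bot which can answer queries on Medicare.\n"
    "Consider only 'Medicare FAQ' data while answering questions. "
    "Don't talk about anything else.\n"
    "Medicare FAQ: $recalled\n"
    "{{ user }} $input"
)


def render_grounded_prompt(question: str, recall: Callable[[str], str]) -> str:
    """Look up supporting text for the question and fill in the prompt.

    `recall` stands in for a vector database lookup (hypothetical).
    """
    return GROUNDED_PROMPT.substitute(recalled=recall(question), input=question)


if __name__ == "__main__":
    def fake_recall(question: str) -> str:
        # Placeholder for the text a vector search would return.
        return "(FAQ passage returned by the vector search)"

    print(render_grounded_prompt(
        "I am 38 years old. Am I eligible for Medicare?", fake_recall))
```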

Query:

I am 38 years old. Am I eligible for Medicare?

Answer:

Medicare eligibility in the United States typically starts at age 65. So, if you are currently 38 years old, you would not be eligible for Medicare based on age alone..... (truncated)

Conclusion

In this post, we saw how Llama 2 models can be deployed using Amazon SageMaker and exposed as an endpoint. We also modified Microsoft Semantic Kernel to work with Llama 2 models, and finally we tested the result. Please note that since this field is evolving rapidly, the APIs may have changed by the time you read this post. So please check the official documentation if you run into issues.

Source Code: https://github.com/mayankthebest/GenerativeAILambda

Mayank's Blog
