
Serverless Generative AI: How to Query Meta’s Llama 2 Model with Microsoft’s Semantic Kernel and AWS Services

Generative AI is a type of artificial intelligence that can create new content such as text, images, music, etc. in response to prompts. Generative AI models learn the patterns and structure of their input training data by applying neural network machine learning techniques, and then generate new data that has similar characteristics.

They are all the rage these days. 😀

Some types of generative AI include:

Foundation models, which are complex machine learning systems trained on vast quantities of data (text, images, audio or a mix of data types) on a massive scale. Foundation models can be adapted quickly for a wide range of downstream tasks without needing task-specific training. Examples of foundation models are GPT, LaMDA and Llama.

Generative adversarial networks (GANs), which are composed of two competing neural networks: a generator that creates fake data and a discriminator that tries to distinguish between real and fake data. The generator improves its ability to fool the discriminator over time. GANs can generate realistic images, videos, or audio. Examples of GANs are StyleGAN and CycleGAN.

Variational autoencoders (VAEs), which are neural networks that encode input data into a latent space and then decode it back into output data. VAEs can generate new data that is similar but not identical to the input data. VAEs can also perform tasks such as denoising, inpainting, or style transfer. Examples of VAEs are VQ-VAE and NVAE.

What is Microsoft Semantic Kernel?

Microsoft’s Semantic Kernel is an open-source SDK that lets you easily combine AI services like OpenAI, Azure OpenAI, and Hugging Face with conventional programming languages like C# and Python. Semantic Kernel enables you to create AI apps that combine natural language semantic functions, traditional code native functions, and embeddings-based memory.

Semantic Kernel supports and encapsulates several design patterns from the latest in AI research, such as prompt chaining, recursive reasoning, summarization, zero/few-shot learning, contextual memory, long-term memory, embeddings, semantic indexing, planning, retrieval-augmented generation and accessing external knowledge stores.

What is Llama 2?

Llama 2 is Meta's answer to the Large Language Model challenge. Llama 2 has models ranging from 7B to 70B parameters, trained on 2 trillion tokens, and fine-tuned on over 1 million human annotations. It is available on Amazon SageMaker JumpStart for anyone to use.

So what is the problem?

Semantic Kernel (referred to as 'SK' from here onwards) unsurprisingly supports OpenAI and Azure OpenAI models out of the box. It also has connectors for some of the HuggingFace and Oobabooga models. However, it doesn't yet support Llama 2 models. Since I wanted to play around with Llama 2 models, I decided to modify SK and add one more connector for Llama 2.


Using a model on Amazon SageMaker JumpStart is as simple as clicking a button and waiting for an endpoint to be ready. In my case it didn't take more than 5 minutes.


Once the endpoints are ready, you will need to create an AWS Lambda function and an Amazon API Gateway API. The Lambda function has the following IAM permissions attached to it so that it can query SageMaker endpoints.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": ["sagemaker:InvokeEndpoint"],
            "Resource": "*"
        }
    ]
}

API Gateway supports API-key-based security, which aligns with the way SK calls API endpoints.
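As a rough illustration of what an API-key-protected call looks like from the client side, here is a minimal sketch that builds (but does not send) a POST request. The class name ApiGatewayRequest and the URL used in the usage note are hypothetical; the only part taken from the connector code in this post is the x-api-key header.

```csharp
using System;
using System.Net.Http;
using System.Text;

// Hypothetical helper, not part of SK or the AWS SDK.
public static class ApiGatewayRequest
{
    // Builds a JSON POST request and attaches the API key, if one is supplied,
    // in the x-api-key header that API Gateway checks for API keys.
    public static HttpRequestMessage Create(Uri endpoint, string apiKey, string jsonBody)
    {
        var request = new HttpRequestMessage(HttpMethod.Post, endpoint)
        {
            Content = new StringContent(jsonBody, Encoding.UTF8, "application/json")
        };

        if (!string.IsNullOrEmpty(apiKey))
        {
            request.Headers.Add("x-api-key", apiKey);
        }

        return request;
    }
}
```

You would pass the result to HttpClient.SendAsync, for example with ApiGatewayRequest.Create(new Uri("https://example.com/llama"), apiKey, json), where the URL stands in for your own API Gateway stage.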

Llama 2 expects its input as JSON, a list of messages that each carry a role and content, unlike GPT-based models, which expect a simple string. Example:

    {"inputs": [[
        {"role": "user", "content": "what is the recipe of mayonnaise?"}
    ]]}

This model only supports the 'system', 'user' and 'assistant' roles, starting with 'system', then 'user' and alternating (u/a/u/a/u...). So I needed to modify SK to somehow transform the string input into the format that Llama understands.
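The ordering rule above is easy to check before a request is sent. The following is a minimal sketch; ChatRoleValidator is a hypothetical helper, and it assumes the 'system' message is optional, which the post does not state.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper that checks the role order described above.
public static class ChatRoleValidator
{
    // Returns true when the roles are an optional "system" entry followed by
    // "user", "assistant", "user", ... with at least one "user" turn.
    public static bool IsValidOrder(IReadOnlyList<string> roles)
    {
        // Assumption: a leading "system" message may be omitted.
        int start = roles.Count > 0 && roles[0] == "system" ? 1 : 0;
        if (start == roles.Count)
        {
            return false; // nothing left after the optional system message
        }

        for (int i = start; i < roles.Count; i++)
        {
            string expected = (i - start) % 2 == 0 ? "user" : "assistant";
            if (roles[i] != expected)
            {
                return false;
            }
        }

        return true;
    }
}
```

Running a check like this inside the connector gives a clearer error than whatever the endpoint returns for a badly ordered conversation.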

Fortunately, SK supports {{ }} based input templates. We can use them to create a prompt that escapes these curly braces, and then run the rendered input through a simple Regex parser that transforms it.

Prompt Template:
{{ '{{ system }}' }} You are a bot which can answer queries on Medicare. Don't talk about anything else.
{{ '{{ user }}' }} {{$input}}

SK Transformation:
{{ system }} You are a bot which can answer queries on Medicare. Don't talk about anything else.
{{ user }} I am 38 years old. Am I eligible for Medicare?

Regex Parser Output:
Role: system
Content: You are a bot which can answer queries on Medicare. Don't talk about anything else.

Role: user
Content: I am 38 years old. Am I eligible for Medicare?
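The parser itself is not shown in this post, so here is a minimal sketch of how such a Regex parser could work, assuming the rendered prompt uses the {{ role }} markers shown above. The names ChatMessage and RoleTemplateParser are hypothetical and are not the classes used in the connector.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical message type for illustration.
public record ChatMessage(string Role, string Content);

public static class RoleTemplateParser
{
    // Matches a role marker such as "{{ system }}" and captures the text that
    // follows it, up to the next role marker or the end of the input.
    private static readonly Regex RoleBlock = new Regex(
        @"\{\{\s*(?<role>system|user|assistant)\s*\}\}(?<content>.*?)(?=\{\{\s*(?:system|user|assistant)\s*\}\}|\z)",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Splits the rendered prompt into one message per role marker.
    public static List<ChatMessage> Parse(string text)
    {
        var messages = new List<ChatMessage>();
        foreach (Match match in RoleBlock.Matches(text))
        {
            messages.Add(new ChatMessage(
                match.Groups["role"].Value.ToLowerInvariant(),
                match.Groups["content"].Value.Trim()));
        }

        return messages;
    }
}
```

Text that appears before the first marker is ignored by this sketch, so a prompt without any markers produces an empty list. A real connector would probably want to treat that case as a single 'user' message.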

Llama 2 also expects a custom attribute "accept_eula=true" to be sent with each inference request; otherwise it rejects the request. We can easily send this using the SageMaker Runtime client in the AWS SDK for .NET.

// Requires: using Amazon.SageMakerRuntime; using Amazon.SageMakerRuntime.Model; using System.Text;
private async Task<IActionResult> CallSageMakerModel(string endpointName, byte[] body, string contentType)
{
    AmazonSageMakerRuntimeClient awsSageMakerRuntimeClient = new AmazonSageMakerRuntimeClient();
    InvokeEndpointRequest request = new InvokeEndpointRequest();
    request.EndpointName = endpointName;
    request.ContentType = contentType;
    request.Body = new MemoryStream(body);
    // Llama 2 rejects requests that do not accept the EULA.
    request.CustomAttributes = "accept_eula=true";
    var response = await awsSageMakerRuntimeClient.InvokeEndpointAsync(request);
    // The response body is a stream of UTF-8 encoded JSON.
    var result = Encoding.UTF8.GetString(response.Body.ToArray());
    return Ok(result);
}

We don't have to start from scratch when adding support for Llama to SK. We can use one of the existing connectors as a base and build on it. I used the HuggingFace connector and modified it as I went along.

To add support for Text Completions, you have to implement the ITextCompletion interface, and for Text Embeddings, the ITextEmbeddingGeneration interface. Here's an excerpt from the Text Completions implementation.

private async Task<IReadOnlyList<ITextStreamingResult>> ExecuteGetCompletionsAsync(string text, CancellationToken cancellationToken = default)
{
    // Convert the rendered prompt into the role/content list that Llama 2 expects.
    LlamaTextParser textParser = new LlamaTextParser();
    var chats = textParser.Parse(text);
    var completionRequest = new TextCompletionRequest();
    completionRequest.Inputs = new List<List<Chat>> { chats };
    completionRequest.Parameters = new AIParameters();
    completionRequest.Parameters.MaxNewTokens = 1024;
    completionRequest.Parameters.Temperature = 0.6;
    completionRequest.Parameters.TopP = 0.9;

    using var httpRequestMessage = HttpRequest.CreatePostRequest(this.GetRequestUri(), completionRequest);

    httpRequestMessage.Headers.Add("User-Agent", HttpUserAgent);
    if (!string.IsNullOrEmpty(this._apiKey))
    {
        // API Gateway reads the API key from the x-api-key header.
        httpRequestMessage.Headers.Add("x-api-key", this._apiKey);
    }
    // ...
}


You can play around with the max_new_tokens, temperature and top_p parameters to suit your needs.
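The request classes used by the connector are not shown in this post. As a hedged sketch, this is roughly how such a request could be modelled and serialized with System.Text.Json. The class names (LlamaChat, LlamaParameters, LlamaRequest, LlamaPayload) are hypothetical, and the JSON field names (inputs, parameters, max_new_tokens, temperature, top_p) are an assumption about what the endpoint accepts, based on the parameter names mentioned above.

```csharp
using System.Collections.Generic;
using System.Text.Json;
using System.Text.Json.Serialization;

// Hypothetical DTOs; the field names on the wire are assumptions.
public sealed class LlamaChat
{
    [JsonPropertyName("role")] public string Role { get; set; } = "";
    [JsonPropertyName("content")] public string Content { get; set; } = "";
}

public sealed class LlamaParameters
{
    [JsonPropertyName("max_new_tokens")] public int MaxNewTokens { get; set; } = 1024;
    [JsonPropertyName("temperature")] public double Temperature { get; set; } = 0.6;
    [JsonPropertyName("top_p")] public double TopP { get; set; } = 0.9;
}

public sealed class LlamaRequest
{
    // One inner list per conversation; a single conversation is sent here.
    [JsonPropertyName("inputs")] public List<List<LlamaChat>> Inputs { get; set; } = new();
    [JsonPropertyName("parameters")] public LlamaParameters Parameters { get; set; } = new();
}

public static class LlamaPayload
{
    // Wraps a single user message in the request shape and serializes it.
    public static string Build(string userText)
    {
        var request = new LlamaRequest();
        request.Inputs.Add(new List<LlamaChat>
        {
            new LlamaChat { Role = "user", Content = userText }
        });
        return JsonSerializer.Serialize(request);
    }
}
```

The defaults mirror the values used in the connector excerpt (1024, 0.6 and 0.9), so tuning a parameter only means setting one property before serializing.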

What are Embeddings?
Embeddings are numerical representations of data that capture their semantic meaning and similarity. For example, words, sentences, images, or sounds can be converted into embeddings using machine learning models. Embeddings are useful for many tasks that involve natural language processing, computer vision, or audio processing.

Vector databases are specialized databases that store and search embeddings efficiently. They use algorithms such as approximate nearest neighbor search to find the most similar embeddings in a large collection. Vector databases can enable applications such as semantic search, recommendation systems, clustering, classification, and more.
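To make "most similar embeddings" concrete, here is a minimal sketch of an exact (brute-force) search using cosine similarity. This is not how Qdrant or any other vector database works internally, since they use approximate indexes to avoid comparing against every vector; the VectorSearch class is purely illustrative.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative brute-force search, not a vector database implementation.
public static class VectorSearch
{
    // Cosine similarity: 1 for the same direction, 0 for orthogonal vectors.
    public static double CosineSimilarity(float[] a, float[] b)
    {
        if (a.Length != b.Length)
        {
            throw new ArgumentException("Vectors must have the same length.");
        }

        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.Length; i++)
        {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }

        if (normA == 0 || normB == 0)
        {
            return 0; // a zero vector has no direction to compare
        }

        return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
    }

    // Scores every stored vector against the query and keeps the k best.
    public static List<(string Id, double Score)> TopK(
        float[] query, IReadOnlyDictionary<string, float[]> items, int k)
    {
        return items
            .Select(kv => (Id: kv.Key, Score: CosineSimilarity(query, kv.Value)))
            .OrderByDescending(x => x.Score)
            .Take(k)
            .ToList();
    }
}
```

The cost of this approach grows linearly with the number of stored vectors, which is exactly the problem that approximate nearest neighbor indexes are designed to avoid.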

Qdrant is an example of a vector database that is open-source and cloud-native. To run it locally:

docker pull qdrant/qdrant
docker run -p 6333:6333 -v $(pwd)/qdrant:/qdrant/storage qdrant/qdrant

Why are Embeddings important?
LLMs like Llama 2 are trained on a vast amount of data. But if you want the LLM to answer queries based on your own data, you have to find a way to provide that data to the model. There are two ways to do it:

1. You can copy and paste paragraphs of text into the chatbot/API that you prepared. But since these LLMs can keep only a limited amount of context in memory (a sliding window), text that falls outside the window is ignored and you will soon run into hallucinations.
2. Transform the text into embeddings (arrays of numbers) using a model like all-MiniLM-L6-v2, store them in a vector database like Qdrant, and have SK query them for relevant text while responding to the user's input.

I chose to do it the second way. 

I copied the Medicare SSA FAQs into a text file and fed it to the API. Next, you have to "ground" the Llama 2 model so that it answers queries about Medicare only from the data we provided. I wrote the following prompt to do it, though admittedly it has plenty of room for improvement.

var prompt = @"
        {{ '{{ system }}' }} You are a bot which can answer queries on Medicare.
        Consider only 'Medicare FAQ' data while answering questions. Don't talk about anything else.
        Medicare FAQ: {{recall $input}}
        {{ '{{ user }}' }} {{$input}}
";

Here, recall is a function of TextMemorySkill, which is built into SK and can query vector databases.
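Conceptually, the text found in the vector database is substituted into the prompt before it reaches Llama 2. The following sketch shows that substitution by hand; it does not call SK, GroundedPrompt is a hypothetical helper, and the real template rendering in SK is more involved.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper that mimics the rendered prompt, without using SK.
public static class GroundedPrompt
{
    // Places the retrieved passages in the system message and the question
    // in the user message, using the same role markers as the template.
    public static string Build(string question, IEnumerable<string> passages)
    {
        string faq = string.Join("\n", passages);
        return "{{ system }} You are a bot which can answer queries on Medicare.\n" +
               "Consider only 'Medicare FAQ' data while answering questions. Don't talk about anything else.\n" +
               "Medicare FAQ: " + faq + "\n" +
               "{{ user }} " + question;
    }
}
```

The output is the kind of string that the Regex parser then splits into a 'system' message containing the retrieved FAQ text and a 'user' message containing the question.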

Input:
I am 38 years old. Am I eligible for Medicare?

Output:
Medicare eligibility in the United States typically starts at age 65. So, if you are currently 38 years old, you would not be eligible for Medicare based on age alone..... (truncated)

In this post, we saw how Llama 2 models can be deployed using Amazon SageMaker and exposed as an endpoint. We also modified Microsoft Semantic Kernel to work with Llama 2 models, and finally we tested the result. Please note that since this field is evolving rapidly, the APIs may have changed by the time you read this post, so check the official documentation if you run into issues.


  1. Great content! I believe Llama 2 support is natively coming to Azure as per the announcement and it's already part of foundational models. So, SK would eventually provide native support for Llama 2?

    1. Yes it may eventually support it. If you want to use it today, follow this post. 😊

