
Serverless Generative AI: How to Query Meta’s Llama 2 Model with Microsoft’s Semantic Kernel and AWS Services

Generative AI is a type of artificial intelligence that can create new content such as text, images, music, etc. in response to prompts. Generative AI models learn the patterns and structure of their input training data by applying neural network machine learning techniques, and then generate new data that has similar characteristics.

They are all the rage these days. 😀

Some types of generative AI include:

Foundation models, which are complex machine learning systems trained on vast quantities of data (text, images, audio or a mix of data types) on a massive scale. Foundation models can be adapted quickly for a wide range of downstream tasks without needing task-specific training. Examples of foundation models are GPT, LaMDA and Llama.

Generative adversarial networks (GANs), which are composed of two competing neural networks: a generator that creates fake data and a discriminator that tries to distinguish between real and fake data. The generator improves its ability to fool the discriminator over time. GANs can generate realistic images, videos, or audio. Examples of GANs are StyleGAN and BigGAN.

Variational autoencoders (VAEs), which are neural networks that encode input data into a latent space and then decode it back into output data. VAEs can generate new data that is similar but not identical to the input data. VAEs can also perform tasks such as denoising or inpainting. VQ-VAE is a well-known variant, and Stable Diffusion uses a VAE to map images to and from its latent space.

What is Microsoft Semantic Kernel?

Microsoft’s Semantic Kernel is an open-source SDK that lets you easily combine AI services like OpenAI, Azure OpenAI, and Hugging Face with conventional programming languages like C# and Python. Semantic Kernel enables you to create AI apps that combine natural language semantic functions, traditional code native functions, and embeddings-based memory.

Semantic Kernel supports and encapsulates several design patterns from the latest in AI research, such as prompt chaining, recursive reasoning, summarization, zero/few-shot learning, contextual memory, long-term memory, embeddings, semantic indexing, planning, retrieval-augmented generation and accessing external knowledge stores.

What is Llama 2?

Llama 2 is Meta's answer to the Large Language Model challenge. Llama 2 models range from 7B to 70B parameters and are pretrained on 2 trillion tokens; the chat variants are fine-tuned on over 1 million human annotations. It is available on Amazon SageMaker JumpStart for anyone to use.

So what is the problem?

Semantic Kernel (referred to as 'SK' from here onwards) unsurprisingly supports OpenAI and Azure OpenAI models out of the box. It also has connectors for some HuggingFace and Oobabooga models. However, it doesn't yet support Llama 2 models. Since I wanted to play around with Llama 2, I decided to modify SK and add one more connector for it.


Using a model on Amazon SageMaker JumpStart is as simple as clicking a button and waiting for an endpoint to be ready. In my case it didn't take more than 5 minutes.


Once the endpoints are ready, you will need to create an AWS Lambda function and an Amazon API Gateway API in front of it. The Lambda function needs an IAM policy that allows it to invoke SageMaker endpoints (the sagemaker:InvokeEndpoint action):

    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "VisualEditor0",
                "Effect": "Allow",
                "Action": [
                    "sagemaker:InvokeEndpoint"
                ],
                "Resource": "*"
            }
        ]
    }

API Gateway supports API key based security (the key is sent in the x-api-key header), which aligns with SK's approach to calling API endpoints.

Llama 2 expects its input as JSON, unlike GPT-based models, which expect a simple string. Example:

    {
        "inputs": [[
            {"role": "user", "content": "what is the recipe of mayonnaise?"}
        ]]
    }

This model only supports the 'system', 'user' and 'assistant' roles, starting with 'system', then 'user', and alternating after that (u/a/u/a/u...). So I needed to modify SK to transform its plain string input into the format Llama understands.
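The ordering rule above is easy to get wrong when building a dialog by hand, so here is a minimal Python sketch of a validator. The function name `validate_dialog` is hypothetical, and two details are my assumptions rather than something stated in this post: the system message is treated as optional, and the dialog is required to end on a user turn.

```python
from typing import Dict, List

ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_dialog(messages: List[Dict[str, str]]) -> None:
    """Raise ValueError if the dialog breaks the role ordering rule."""
    if not messages:
        raise ValueError("dialog is empty")
    for message in messages:
        if message["role"] not in ALLOWED_ROLES:
            raise ValueError(f"unsupported role: {message['role']}")

    # Assumption: a system message is optional, but may only come first.
    turns = messages[1:] if messages[0]["role"] == "system" else messages
    if not turns:
        raise ValueError("dialog needs at least one user message")

    # The remaining turns must alternate user/assistant/user/...
    for index, message in enumerate(turns):
        expected = "user" if index % 2 == 0 else "assistant"
        if message["role"] != expected:
            raise ValueError(
                f"turn {index} should be '{expected}', got '{message['role']}'"
            )

    # Assumption: the model is asked to answer, so the last turn is a user turn.
    if turns[-1]["role"] != "user":
        raise ValueError("dialog must end with a user message")
```

Running a check like this before calling the endpoint gives a readable error locally instead of a rejected request.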

Fortunately, SK supports {{ }} based prompt templates. We can use them to write a prompt that escapes these curly braces and then run the rendered prompt through a simple Regex parser which transforms it.

{{ '{{ system }}' }} You are a bot which can answer queries on Medicare. Don't talk about anything else.
{{ '{{ user }}' }} {{$input}}

SK Transformation:
{{ system }} You are a bot which can answer queries on Medicare. Don't talk about anything else.
{{ user }} I am 38 years old. Am I eligible for Medicare?

Regex Parser Output:
Role: system
Content: You are a bot which can answer queries on Medicare. Don't talk about anything else.

Role: user
Content: I am 38 years old. Am I eligible for Medicare?
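The parser step can be sketched in a few lines. This is a Python illustration of the idea, not the LlamaTextParser used in the connector; the names `ROLE_MARKER` and `parse_prompt` are hypothetical, and it assumes role markers appear only as `{{ system }}`, `{{ user }}` or `{{ assistant }}`.

```python
import re
from typing import Dict, List

# Matches a role marker such as "{{ system }}" or "{{user}}".
ROLE_MARKER = re.compile(r"\{\{\s*(system|user|assistant)\s*\}\}")


def parse_prompt(text: str) -> List[Dict[str, str]]:
    """Split a rendered prompt into role/content messages."""
    messages = []
    markers = list(ROLE_MARKER.finditer(text))
    for index, marker in enumerate(markers):
        # A message runs from the end of its marker to the next marker.
        if index + 1 < len(markers):
            end = markers[index + 1].start()
        else:
            end = len(text)
        content = text[marker.end():end].strip()
        messages.append({"role": marker.group(1), "content": content})
    return messages


rendered = (
    "{{ system }} You are a bot which can answer queries on Medicare. "
    "Don't talk about anything else.\n"
    "{{ user }} I am 38 years old. Am I eligible for Medicare?"
)
for message in parse_prompt(rendered):
    print("Role:", message["role"])
    print("Content:", message["content"])
```

Any text before the first marker is ignored in this sketch; a stricter parser might reject it instead.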

Llama 2 also expects the custom attribute "accept_eula=true" to be sent with each inference request; otherwise it rejects the request. We can easily send this using the SageMaker Runtime client from the AWS SDK for .NET.

    // Requires the Amazon.SageMakerRuntime and Amazon.SageMakerRuntime.Model namespaces.
    private async Task<IActionResult> CallSageMakerModel(string endpointName, byte[] body, string contentType)
    {
        using var awsSageMakerRuntimeClient = new AmazonSageMakerRuntimeClient();
        var request = new InvokeEndpointRequest
        {
            EndpointName = endpointName,
            ContentType = contentType,
            Body = new MemoryStream(body),
            // Llama 2 rejects requests that do not accept the EULA.
            CustomAttributes = "accept_eula=true"
        };

        var response = await awsSageMakerRuntimeClient.InvokeEndpointAsync(request);
        var result = Encoding.UTF8.GetString(response.Body.ToArray());
        return Ok(result);
    }
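If your Lambda is written in Python rather than C#, the same call can be sketched as below. This is only an illustration under stated assumptions: `build_invoke_args` and `invoke_llama` are hypothetical helper names, and `client` is expected to be a `boto3.client("sagemaker-runtime")` instance, passed in so the function can be tried without AWS access.

```python
import json
from typing import Any, Dict


def build_invoke_args(endpoint_name: str, payload: Dict[str, Any]) -> Dict[str, Any]:
    """Keyword arguments for the SageMaker Runtime invoke_endpoint call."""
    return {
        "EndpointName": endpoint_name,
        "ContentType": "application/json",
        "Body": json.dumps(payload).encode("utf-8"),
        # Llama 2 rejects requests that do not accept the EULA.
        "CustomAttributes": "accept_eula=true",
    }


def invoke_llama(client: Any, endpoint_name: str, payload: Dict[str, Any]) -> Any:
    """Invoke the endpoint and decode its JSON response.

    `client` is assumed to be boto3.client("sagemaker-runtime").
    """
    response = client.invoke_endpoint(**build_invoke_args(endpoint_name, payload))
    # The response body is a stream, so read it before decoding.
    return json.loads(response["Body"].read())
```

Keeping the argument construction in its own function makes it easy to check that the EULA attribute is always sent.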

We don't have to start from scratch when adding support for Llama to SK. We can use one of the existing connectors as a base and build on it. I used the HuggingFace connector and modified it as I went along.

To add support for text completions, you have to implement the ITextCompletion interface; for text embeddings, implement the ITextEmbeddingGeneration interface. Here's an example of a text completion implementation.

    private async Task<IReadOnlyList<ITextStreamingResult>> ExecuteGetCompletionsAsync(string text, CancellationToken cancellationToken = default)
    {
        // Turn the rendered prompt into a list of role/content chat messages.
        var textParser = new LlamaTextParser();
        var chats = textParser.Parse(text);

        var completionRequest = new TextCompletionRequest
        {
            Inputs = new List<List<Chat>> { chats },
            Parameters = new AIParameters
            {
                MaxNewTokens = 1024,
                Temperature = 0.6,
                TopP = 0.9
            }
        };

        using var httpRequestMessage = HttpRequest.CreatePostRequest(this.GetRequestUri(), completionRequest);

        httpRequestMessage.Headers.Add("User-Agent", HttpUserAgent);
        if (!string.IsNullOrEmpty(this._apiKey))
        {
            // API Gateway expects the key in the x-api-key header.
            httpRequestMessage.Headers.Add("x-api-key", this._apiKey);
        }

        // ... (the rest of the method is not shown)
    }


You can play around with the max_new_tokens, temperature and top_p parameters to suit your needs.
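To see how these parameters travel with the request, here is a rough Python sketch that builds (but does not send) the HTTP request to the API Gateway endpoint. The function name `build_completion_request` and the example URL are hypothetical, and the JSON field names `inputs` and `parameters` are my assumption based on the C# property names above.

```python
import json
import urllib.request
from typing import Dict, List, Optional


def build_completion_request(
    url: str,
    dialog: List[Dict[str, str]],
    api_key: Optional[str] = None,
    max_new_tokens: int = 1024,
    temperature: float = 0.6,
    top_p: float = 0.9,
) -> urllib.request.Request:
    """Build a POST request carrying one dialog and its sampling parameters."""
    payload = {
        # One request can carry several dialogs; this sketch sends just one.
        "inputs": [dialog],
        "parameters": {
            "max_new_tokens": max_new_tokens,
            "temperature": temperature,
            "top_p": top_p,
        },
    }
    request = urllib.request.Request(
        url, data=json.dumps(payload).encode("utf-8"), method="POST"
    )
    request.add_header("Content-Type", "application/json")
    if api_key:
        # API Gateway reads the key from the x-api-key header.
        request.add_header("x-api-key", api_key)
    return request


request = build_completion_request(
    "https://example.invalid/llama",  # hypothetical API Gateway URL
    [{"role": "user", "content": "Am I eligible for Medicare?"}],
    api_key="my-secret-key",
    temperature=0.2,  # lower values make the output less random
)
print(request.data.decode("utf-8"))
```

Passing the request to `urllib.request.urlopen` would send it; that step is left out here because it needs a live endpoint.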

What are Embeddings?
Embeddings are numerical representations of data that capture their semantic meaning and similarity. For example, words, sentences, images, or sounds can be converted into embeddings using machine learning models. Embeddings are useful for many tasks that involve natural language processing, computer vision, or audio processing.

Vector databases are specialized databases that store and search embeddings efficiently. They use algorithms such as approximate nearest neighbor search to find the most similar embeddings in a large collection. Vector databases can enable applications such as semantic search, recommendation systems, clustering, classification, and more.
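To make the idea of "most similar embeddings" concrete, here is a small Python sketch that ranks stored vectors by cosine similarity. It does an exhaustive comparison, not the approximate nearest neighbor search a real vector database uses, and the tiny two-dimensional vectors and the names `cosine_similarity` and `nearest` are made up for illustration.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest(
    query: Sequence[float], store: Dict[str, Sequence[float]], k: int = 1
) -> List[Tuple[str, float]]:
    """Return the k stored items most similar to the query, best first."""
    scored = [(key, cosine_similarity(query, vector)) for key, vector in store.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy "embeddings"; real ones have hundreds of dimensions.
store = {
    "eligibility": [1.0, 0.0],
    "enrollment": [1.0, 1.0],
    "recipes": [0.0, 1.0],
}
print(nearest([1.0, 0.1], store, k=2))
```

A vector database does the same ranking, but with an index that avoids comparing the query against every stored vector.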

Qdrant is an example of a vector database that is open-source and cloud-native. To run it locally:

docker pull qdrant/qdrant
docker run -p 6333:6333 -v $(pwd)/qdrant:/qdrant/storage qdrant/qdrant

Why are Embeddings important?
LLMs like Llama 2 are trained on a vast amount of data. But if you want the LLM to answer queries based on your own data, you have to find a way to provide that data to the model. There are two ways to do it:

1. You can copy and paste paragraphs of text into the chatbot/API that you prepared. But since LLMs can keep only a limited amount of context in memory (the context window, 4,096 tokens for Llama 2), longer text gets cut off and you will soon run into hallucinations.
2. Transform the text into embeddings (arrays of numbers) using a model like all-MiniLM-L6-v2, store them in a vector database like Qdrant, and look up only the passages relevant to the user's input when responding.

I chose to do it the second way. 

I copied the Medicare SSA FAQs into a text file and fed it to the API. Next, you have to "ground" the Llama 2 model so that it only answers queries about Medicare from the data we provided. I wrote the following prompt to do it, though admittedly it has huge scope for improvement.

var prompt = @"
        {{ '{{ system }}' }} You are a bot which can answer queries on Medicare.
        Consider only 'Medicare FAQ' data while answering questions. Don't talk about anything else.
        Medicare FAQ: {{recall $input}}
        {{ '{{ user }}' }} {{$input}}";

Here, recall is a function from SK's built-in memory skill; it queries the vector database for stored text related to the input.
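To show what the template turns into before it reaches the parser, here is a toy Python imitation of the three template forms used in the prompt. It is not SK's template engine: `TOKEN` and `render` are hypothetical names, and `recall` is a plain callable standing in for the memory lookup.

```python
import re
from typing import Callable

# Recognises the three forms used in the prompt:
#   {{ '...' }}         emits the quoted text literally
#   {{recall $input}}   is replaced by whatever the memory lookup returns
#   {{$input}}          is replaced by the user's input
TOKEN = re.compile(
    r"\{\{\s*(?:'(?P<literal>.*?)'|(?P<recall>recall\s+\$input)|\$input)\s*\}\}"
)


def render(template: str, user_input: str, recall: Callable[[str], str]) -> str:
    """Expand the template in a single pass over its tokens."""

    def replace(match: "re.Match[str]") -> str:
        if match.group("literal") is not None:
            return match.group("literal")
        if match.group("recall"):
            return recall(user_input)
        return user_input

    return TOKEN.sub(replace, template)


template = (
    "{{ '{{ system }}' }} Consider only 'Medicare FAQ' data.\n"
    "Medicare FAQ: {{recall $input}}\n"
    "{{ '{{ user }}' }} {{$input}}"
)


def fake_recall(question: str) -> str:
    # Stand-in for the vector database lookup.
    return "Medicare eligibility typically starts at age 65."


print(render(template, "I am 38 years old. Am I eligible for Medicare?", fake_recall))
```

Because substitution happens in one pass, text returned by the lookup or typed by the user is never re-expanded as a template token.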

I am 38 years old. Am I eligible for Medicare?

Medicare eligibility in the United States typically starts at age 65. So, if you are currently 38 years old, you would not be eligible for Medicare based on age alone..... (truncated)

In this blog, we saw how Llama 2 models can be deployed using Amazon SageMaker and exposed as an endpoint. We also modified Microsoft Semantic Kernel to work with Llama 2 models and finally we tested the final result. Please note that since this field is rapidly evolving, the APIs may change by the time you are reading this blog post. So please verify the official documentation in case you run into issues.


  1. Great content! I believe Llama 2 support is natively coming to Azure as per the announcement and it's already part of foundational models. So, SK would eventually provide native support for Llama 2?

    1. Yes it may eventually support it. If you want to use it today, follow this post. 😊

