
Serve with FastAPI for deployment

Building an LLM bot and playing with it locally is simple enough; at some point, however, you will want to put this bot in production, generally serving it through an API that the frontend can talk to. In this example, we are going to reuse the Simple Bot with Weather Tool example, but serve it through a FastAPI endpoint.

If you are not using FastAPI but something like Flask or Quart, don't worry, the example should work pretty similarly (there is a rough Quart sketch right after the FastAPI endpoint below); we are just demonstrating it with FastAPI here because it is the more popular async-native option.

First, we are going to reimplement the same bot code, with a slight difference in the Memory class: since FastAPI will be serving multiple users, we need one memory history per user. Here it is:

import json
from typing import AsyncGenerator, List, Literal
from langstream import Stream

from langstream.contrib.llms.open_ai import (
    OpenAIChatStream,
    OpenAIChatDelta,
    OpenAIChatMessage,
)


class Memory:
    history: List[OpenAIChatMessage]

    def __init__(self) -> None:
        self.history = []

    def save_message(self, message: OpenAIChatMessage) -> OpenAIChatMessage:
        self.history.append(message)
        return message

    def update_delta(self, delta: OpenAIChatDelta) -> OpenAIChatDelta:
        # Start a new message when the role changes, otherwise append the
        # streamed delta content to the last message in the history
        if self.history[-1].role != delta.role and delta.role is not None:
            self.history.append(
                OpenAIChatMessage(
                    role=delta.role, content=delta.content, name=delta.name
                )
            )
        else:
            self.history[-1].content += delta.content
        return delta


def get_current_weather(
    location: str, format: Literal["celsius", "fahrenheit"] = "celsius"
) -> OpenAIChatDelta:
    # A hardcoded weather "API" for the sake of the example
    result = {
        "location": location,
        "forecast": "sunny",
        "temperature": "25 C" if format == "celsius" else "77 F",
    }

    return OpenAIChatDelta(
        role="function", name="get_current_weather", content=json.dumps(result)
    )


# Stream Definitions


def weather_bot(memory):
    weather_stream = (
        OpenAIChatStream[str, OpenAIChatDelta](
            "WeatherStream",
            lambda user_input: [
                *memory.history,
                memory.save_message(
                    OpenAIChatMessage(role="user", content=user_input),
                ),
            ],
            model="gpt-3.5-turbo-0613",
            functions=[
                {
                    "name": "get_current_weather",
                    "description": "Gets the current weather in a given location, use this function for any questions related to the weather",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "location": {
                                "description": "The city to get the weather for, e.g. San Francisco. Get the location from user messages",
                                "type": "string",
                            },
                            "format": {
                                "description": "The temperature unit to use. Infer this from the user's location",
                                "type": "string",
                                "enum": ["celsius", "fahrenheit"],
                            },
                        },
                        "required": ["location"],
                    },
                }
            ],
            temperature=0,
        )
        .map(
            # Call the function if the model produced a function call by parsing the json arguments
            lambda delta: get_current_weather(**json.loads(delta.content))
            if delta.role == "function" and delta.name == "get_current_weather"
            else delta
        )
        .map(memory.update_delta)
    )

    function_reply_stream = OpenAIChatStream[None, OpenAIChatDelta](
        "FunctionReplyStream",
        lambda _: memory.history,
        model="gpt-3.5-turbo-0613",
        temperature=0,
    )

    async def reply_function_call(stream: AsyncGenerator[OpenAIChatDelta, None]):
        # If the model called the function, send the function result back to the
        # model for a natural language reply, otherwise just pass the deltas through
        async for output in stream:
            if output.role == "function":
                async for output in function_reply_stream(None):
                    yield output
            else:
                yield output

    weather_bot: Stream[str, str] = (
        weather_stream.pipe(reply_function_call)
        .map(memory.update_delta)
        .map(lambda delta: delta.content)
    )

    return weather_bot
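
Before putting the bot behind an API, you can give it a quick try locally. This snippet is not part of the original example, just a minimal sketch assuming you run it with asyncio (in a notebook you could await the coroutine directly instead):

import asyncio
from langstream import filter_final_output


async def main():
    memory = Memory()
    bot = weather_bot(memory)
    # Print the bot reply token by token as it streams
    async for token in filter_final_output(bot("hi there")):
        print(token, end="", flush=True)


asyncio.run(main())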

Now we are going to create a FastAPI endpoint, which takes a user message, stores the history in the "database", and calls the bot, returning its streamed answer:

from typing import Dict
from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse
from langstream import filter_final_output

app = FastAPI()


in_memory_database: Dict[str, Memory] = {}


@app.post("/chat")
async def chat(request: Request):
    params = await request.json()
    user_input = params.get("input")
    user_id = params.get("user_id")

    if user_id not in in_memory_database:
        in_memory_database[user_id] = Memory()

    bot = weather_bot(in_memory_database[user_id])
    output_stream = filter_final_output(bot(user_input))

    return StreamingResponse(output_stream, media_type="text/plain")
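
If you are on Quart instead, as mentioned at the beginning, the endpoint looks very similar, since Quart can also stream a response by returning an async generator from the view function. The sketch below is untested and only for illustration; for the rest of the example we keep using the FastAPI version:

from quart import Quart, request

quart_app = Quart(__name__)


@quart_app.route("/chat", methods=["POST"])
async def quart_chat():
    params = await request.get_json()
    user_input = params.get("input")
    user_id = params.get("user_id")

    if user_id not in in_memory_database:
        in_memory_database[user_id] = Memory()

    bot = weather_bot(in_memory_database[user_id])

    # Returning an async generator makes Quart stream the tokens as they arrive
    return filter_final_output(bot(user_input)), {"Content-Type": "text/plain"}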

And then we start the server so we can make some requests to it:

from uvicorn import Config, Server
import threading

config = Config(app=app, host="0.0.0.0", port=8000)
server = Server(config)

threading.Thread(target=server.run).start()
INFO:     Started server process [96682]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
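
Starting uvicorn on a background thread like this is just a convenience so the server runs inside the notebook; in a standalone application you would typically launch it from the command line instead, for example (assuming the code above lives in a file called main.py):

uvicorn main:app --host 0.0.0.0 --port 8000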
curl -X POST http://localhost:8000/chat \
-H "Content-Type: application/json" \
-d '{"user_id":"1", "input":"hi there"}'
    INFO:     127.0.0.1:59288 - "POST /chat HTTP/1.1" 200 OK
Hello! How can I assist you today?

It's alive, and it replies!

curl -X POST http://localhost:8000/chat \
-H "Content-Type: application/json" \
-d '{"user_id":"1", "input":"I am traveling to Amsterdam, is it hot today there?"}'
    INFO:     127.0.0.1:59291 - "POST /chat HTTP/1.1" 200 OK
According to the current weather information, it is sunny in Amsterdam with a temperature of 25°C. Enjoy your trip to Amsterdam!

Cool, and we can ask the bot about the weather too! Now, does it remember the last thing I said?

curl -X POST http://localhost:8000/chat \
-H "Content-Type: application/json" \
-d '{"user_id":"1", "input":"where am I traveling to?"}'
    INFO:     127.0.0.1:59296 - "POST /chat HTTP/1.1" 200 OK
You mentioned that you are traveling to Amsterdam.

Yes, it does! What if a different user_id were talking to it, would it have access to the same chat history?

curl -X POST http://localhost:8000/chat \
-H "Content-Type: application/json" \
-d '{"user_id":"2", "input":"where am I traveling to?"}'
    INFO:     127.0.0.1:59298 - "POST /chat HTTP/1.1" 200 OK
I'm sorry, but as an AI assistant, I don't have access to personal information about users unless it is shared with me in the course of our conversation. Therefore, I don't know where you are traveling to.

No, it doesn't: users have separate chat histories.

That's it, we now have a fully working API that calls our bot and answers questions about the weather, while also keeping the conversation history in the "database" separate per user, allowing us to serve many users at the same time.
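
If you prefer to consume the streaming endpoint from Python rather than curl, here is a minimal client sketch using httpx (httpx is not part of this example and would need to be installed separately, it is just one option for a streaming HTTP client):

import httpx

with httpx.stream(
    "POST",
    "http://localhost:8000/chat",
    json={"user_id": "1", "input": "hi there"},
    timeout=None,
) as response:
    # Print each streamed chunk as soon as it arrives
    for token in response.iter_text():
        print(token, end="", flush=True)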

I hope this helps you get your LLM bot into production! If there is something you don't understand in this example, join our Discord channel to get help from the community!