Fine-Tuning : Involves retraining a pre-trained LLM on a smaller, domain-specific dataset to adapt it to particular tasks. This method requires substantial computational resources and expertise.
In-Context Learning : Leverages the LLM's existing capabilities by providing context through prompts, examples, or additional data at inference time, eliminating the need for retraining. This approach is more accessible and flexible for many applications. Purpose : Preprocesses and stores embeddings of private or proprietary data to facilitate efficient retrieval during inference.
Components : Vector Databases : Specialized databases optimized for storing and querying high-dimensional vector representations of data. They enable similarity searches to retrieve relevant information based on input queries.
Integration with Traditional Databases : Some systems enhance existing SQL or NoSQL databases with vector search capabilities, offering a balance between new functionality and familiar infrastructure. Function : Hosts the LLMs that generate responses based on input prompts and context.
Options : Proprietary Models : Such as OpenAI's GPT-4 and Anthropic's Claude, which offer robust performance but may come with usage restrictions and costs.
Open-Source Models : Including Meta's Llama 2 and various models available through Hugging Face, providing more flexibility and control over deployment.
Role : Acts as the central framework coordinating interactions between the data, model, and operational layers, as well as external components. Responsibilities : Prompt Construction : Builds prompts using templates and examples to guide the LLM's responses. Context Management : Retrieves relevant data from vector databases and integrates it into prompts.
API Interactions : Handles communication with LLM APIs and other external services. Tools : LangChain : An open-source framework providing libraries and interfaces for building LLM applications. Flowise : A GUI-based tool built on LangChain, allowing visual construction of LLM workflows.
Purpose : Ensures the reliability, efficiency, and security of LLM-powered applications in production environments. Key Functions : Monitoring : Tracks LLM outputs to assess performance and guide improvements. Caching : Stores frequent LLM responses to reduce latency and API usage costs.
Validation : Implements safeguards against prompt injection attacks and ensures output quality. Notable Tools : Commercial : Autoblocks, Helicone, HoneyHive, LangSmith, Weights & Biases. Open-Source : GPTCache, Redis, Guardrails AI, Rebuff.
The emerging LLM tech stack provides a structured approach to integrating large language models into applications, emphasizing modularity and scalability. By leveraging in-context learning and the layered architecture, developers can efficiently build and deploy AI-powered solutions tailored to specific use cases.
Pre-trained Large Language Models (LLMs), such as OpenAI’s GPT-4 and Meta’s Llama 2, have become increasingly prevalent in application development leveraging generative AI. As a software developer, how can you efficiently integrate LLM-powered capabilities into your application? An emerging tech stack, often referred to as the LLM application stack, is forming to facilitate interaction with these models via in-context learning.
In this article, we’ll define in-context learning and explore each layer of the emerging tech stack, serving as a reference architecture for AI startups and tech companies: Two general approaches for customizing pre-trained LLMs to unique use cases are:
Fine-Tuning : Involves additional training of a pre-trained LLM using a smaller, domain-specific, proprietary dataset. This process alters the model's parameters, making it more specialized.
In-Context Learning : Does not modify the underlying pre-trained model. Instead, it guides the LLM's output via structured prompting and relevant retrieved data, providing the model with the right information at the right time.
Fine-tuning involves additional training of a pre-trained LLM by providing it with a smaller, domain-specific, and proprietary dataset. This process will alter the parameters of the LLM, and thus modify the “ model's knowledge bank ” to be more specialized. Fine-tuning for GPT-3.5 Turbo is available via OpenAI's official API.
Fine-tuning for GPT-4 is offered through an experimental access program, with eligible users able to request access via the fine-tuning UI. Fine-tuning for Llama 2 can be performed using platforms like Google Colab and open-source libraries from Hugging Face. Typically yields higher-quality outputs than prompting.
Allows for more training examples compared to prompting. Results in lower costs and reduced latency post fine-tuning due to shorter prompts. Requires machine learning expertise and is resource-intensive. Risk of " Catastrophic Forgetting ," where the model loses previously learned skills.
Potential for overfitting, hindering generalization to new, unseen inputs.
In-context learning maintains the integrity of the pre-trained model, guiding its output through structured prompts and relevant retrieved data. This approach provides the model with pertinent information at the appropriate time .
To condition the LLM for specific tasks and desired output formats, few-shot prompting can be employed. This technique involves supplying the LLM with examples of expected input-output pairs as part of the input context. The LLM's context, comprising tokenized data, functions as the model's "attention span." These example pairs act as a targeted, mini training dataset.
A compiled prompt typically combines elements such as: Relevant documents retrieved from a vector database.
Given that pre-trained LLMs were trained on publicly available data with a cutoff date, they lack awareness of recent events and private data. To supplement the LLM's knowledge, the Retrieval Augmented Generation (RAG) technique can be utilized. This involves retrieving additional required information from various sources—such as vector or SQL databases, internal or external APIs, and document repositories—and including it as part of the input context.
GPT-4 Turbo offers a maximum context length of 128,000 tokens . Llama 2 supports a maximum context length of 4,096 tokens . Does not require machine learning expertise and is less resource-intensive than fine-tuning. No risk of altering the underlying pre-trained model.
Allows for separate management of specialized and proprietary data. Typically produces lower-quality outputs compared to fine-tuning. Limited by the LLM's maximum context length. Higher costs and increased latency due to longer prompts. The emerging LLM tech stack comprises three main layers and one supplementary layer:
Data Layer : Preprocessing and storing embeddings of private data. Orchestration Layer : Coordinating various components, retrieving relevant information, and constructing prompts. Operational Layer (Supplementary): Tooling for monitoring, caching, and validation to enhance functionality and efficiency .
Model Layer : The LLM accessed for prompt execution.
We will illustrate these layers using a simple application example: a customer service chatbot knowledgeable about a company's products, policies, and FAQs.
The data layer encompasses the full preprocessing and storage of private and supplementary information. The data processing involves three main steps: extracting, embedding, and storing. A table of available offerings for the Data Layer (as of December 2023, not exhaustive).
Source: Inspired by Emerging Architectures for LLM Applications written by Matt Bornstein and Rajko Radovanovic & The New Language Model Stack written by Michelle Fradin and Lauren Reeder.
Relevant data may originate from multiple sources in various formats. Connectors are established to ingest data from these sources for extraction. For a customer service chatbot, data sources might include: Client information from a CRM system accessed via an external API.
Product catalogs stored in an SQL database. Team processes documented in a collaborative wiki.
Optional steps include cleaning the extracted data by removing unnecessary or confidential parts and transforming it into a standardized format, such as JSON, for efficient downstream processing.
Document Loaders : Suitable for a small number of data sources with infrequent changes and common formats (e.g., text, CSV, HTML, JSON, XML).
Data Pipelines : Appropriate for aggregating diverse and massive data sources, including real-time streams requiring intensive processing.
An embedding is a numerical representation capturing semantic meaning, expressed as a vector. Embeddings enable quick classification and search of unstructured data by comparing their vector representations. Utilize embedding models like OpenAI's Ada V2 , which accepts input text and returns embedding outputs.
Ada V2 can process multiple inputs in a single request by passing an array of strings or token arrays. As of December 2023, Ada V2 is accessible via the API endpoint at https://api.openai.com/v1/embeddings .
Chunking : Breaking up large input text into smaller fragments to accommodate embedding model size limits. Libraries support various chunking methods, such as fixed-size or sentence-splitting. For instance, Ada V2 has a maximum input length of 8,192
Once data is embedded, the output along with the original content is stored in a vector database or in a traditional database enhanced with a vector search extension .
A vector database is built specifically for indexing and querying vectorized data. It supports CRUD operations and is optimized for high-performance similarity search , real-time updates , and data security .
The model layer contains the off-the-shelf LLM to be used for your application development, such as GPT-4 or Llama 2. Select the LLM suitable for your specific purposes as well as requirements on costs, performance, and complexity. For a customer service chatbot, we may use GPT-4, which is optimized for conversations, offers robust multilingual support, and has advanced reasoning capabilities.
Proprietary model APIs play a crucial role in the inference process, including submitting prompts to a pre-trained language model along with other systems like logging and caching.
The access method depends on the specific LLM, whether it is proprietary or open-source, and how the model is hosted. Typically, there will be an API endpoint for LLM inference, or prompt execution, which receives the input data and produces the output. At the time of writing, the API endpoint for GPT-4 is “https://api.openai.com/”
A table of available offerings for the Model Layer (as of December 2023, not exhaustive). Orchestration Layer in Emerging Architectures for LLM Applications
The orchestration layer is the core framework responsible for coordinating all other layers in the LLM application stack , as well as any external systems. It provides libraries, templates, and tools to handle key operations such as prompt construction and execution . Functionally, it resembles the controller in the Model-View-Controller (MVC) architecture.
With the in-context learning approach, the orchestration framework: Constructs the prompt based on a template and few-shot examples Retrieves relevant data through vector similarity search Sends the full data pipeline to the LLM API endpoint For a basic customer service chatbot , if a user asks about refund policies:
A prompt template is already configured with instructions and sample inputs/outputs. The instruction is “You are a helpful and courteous customer service representative that responds to the user's inquiry: {query}. Here are some example conversations.”
A couple of examples: [{input: “Where is your headquarters located?”, output: “Our company headquarters is located in Los Angeles, CA.”}, {input: “Can you check my order status?”, output: “Yes, I can help with checking the status of your order.”}]. The framework queries a vector database for content related to “refund policy.”
It integrates the result into the prompt and sends it to the selected LLM , such as GPT-4 .
GPT-4 returns the output “We allow refunds on new and unused items within 30 days of purchase. You should receive your refund back to the original form of payment within 3-5 business days.” The framework responds to the user with the LLM output. This illustrates how orchestration frameworks enable dynamic, context-aware LLM applications.
One example framework is LangChain (libraries available in JavaScript and Python), containing interfaces and integrations to common components as well as the ability to combine multiple steps into “chains.” Aside from programming frameworks, there are GUI frameworks available for orchestration. For instance, Flowise , built on top of LangChain, has a graphical layout for visually chaining together major components.
A table of available offerings for the Orchestration Layer (as of December 2023, not exhaustive).
As large language model applications reach production scale, an LLMOps layer can be introduced to manage reliability, efficiency, and security. This layer addresses: Monitoring : Track and analyze inputs/outputs to improve prompt design and model selection Caching : Reduce API calls and response latency by storing previous results using semantic caching
Validation : Detect and prevent prompt injection attacks or other harmful input, and apply rules-based corrections to outputs
In a chatbot context, frequent questions like refund policies can be served from cache. Queries can be validated before prompt execution, and outputs reviewed to ensure they meet performance and accuracy standards. A table of available offerings for the Operational Layer (as of December 2023, not exhaustive).
This article outlined how in-context learning provides an easier path to building LLM applications compared to model fine-tuning. The emerging LLM tech stack now includes: An orchestration layer for chaining and control An operational layer for observability and safety
Each layer offers a modular entry point into LLM app development and helps teams quickly build scalable and production-ready solutions.
Diana Cheung (ex-LinkedIn software engineer, USC MBA, and Codesmith alum) is a technical writer on technology and business. She is an avid learner and has a soft spot for tea and meows.
