LLMs Project Guide: Key Considerations

This guide provides a reference list of key considerations for projects that use large language models (LLMs).

These considerations are based on lessons learned from projects using language models across different industries. The list is intended to help you identify what might be relevant to your project; it is not exhaustive, and you may need to add considerations specific to your goals. Different portions of this guide will apply as your project proceeds from experimentation to production.

Problem Definition & Solution Planning

Problem Definition

  • Clearly define the problem in business terms. Identify the components where LLM capabilities, such as language generation and understanding, could apply.
  • Identify how project success will be evaluated by the business.
  • Identify the metrics that can be used to measure success. Choose metrics that relate to business key performance indicators (KPIs) when possible.
  • Identify the target personas of the use case and their intended modes of usage.
  • Identify the target languages to be supported by the solution.

Solution Planning

  • Describe the overall solution’s requirements and intended use cases.
  • Use an LLM playground for initial, handcrafted exploration. Confirm that LLMs can plausibly produce the desired results, and use the findings to set expectations about what is possible. Validate results with SMEs and end users.
  • Gather and document SLA requirements for the solution, ensuring clarity on performance expectations.
  • Describe high-level solution components, including needed data, any existing systems, and other ML/non-ML components. Confirm the availability of each of those components for the project.
  • Identify a community of potential end users and subject matter experts. This community can help build needed datasets and validate candidate solutions.
  • Survey existing AI/ML technologies, such as task-specific or domain-specific models, that might solve the problem without the use of an LLM.
  • Validate back-of-envelope cost estimates with project leadership. If cost is a concern, add cost reduction to the project scope.
  • Determine whether the proposed solution will be subject to requirements that influence major architectural or operational decisions. Incorporating these requirements as you develop the solution helps avoid major rework at later stages.
  • Determine whether there is elevated risk in presenting incorrect or malicious LLM-generated output directly to users, and design the system and user experience to mitigate those risks as necessary. Start by completing a Responsible AI impact assessment and determining whether the use case is sensitive. If it is, take care in how LLM-generated rewritten or summarized output is presented to end users. For example, instead of presenting LLM output to a user directly, consider a user experience in which the LLM orchestrates a process that returns information through a templated response with details about how the answer was produced (a minimal sketch follows).
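
As a minimal sketch of that last point: the helper extract_order_status below is hypothetical and stands in for an LLM call that is instructed to return only structured JSON, which the application validates and renders through a fixed, reviewed template instead of raw model text.

```python
def extract_order_status(question: str, order_record: dict) -> dict:
    """Hypothetical helper: prompt the LLM to return only a small JSON object
    (for example {"status": ..., "eta": ...}), then parse and validate it.
    For illustration, this stub simply echoes fields from the order record."""
    return {"status": order_record["status"], "eta": order_record["eta"]}


def answer_user(question: str, order_record: dict) -> str:
    fields = extract_order_status(question, order_record)
    # The user sees a fixed, reviewed template rather than raw model output,
    # including a note about how the answer was produced.
    return (
        f"Your order is currently '{fields['status']}' and is expected on "
        f"{fields['eta']}. (Answer generated from your order record.)"
    )


print(answer_user("Where is my package?", {"status": "shipped", "eta": "June 1"}))
```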

Responsible AI Considerations

During solution design, review the responsible AI considerations and ensure that the proposed solution identifies and mitigates potential harms.

  • Identify any potential risks or harms that may apply to the solution, including:
    • Data security and privacy
    • Low quality or ungrounded output
    • Misuse and excessive dependence on AI
    • Generation of harmful content
    • Susceptibility to adversarial attack
    • Unintended consequences of delayed or interrupted responses
  • Use Microsoft’s responsible AI tools and methodologies to help mitigate the risks associated with generative AI applications: Responsible AI tools and practices in your LLMOps | Microsoft Azure
  • Use a risk assessment framework such as the OWASP Top 10 for LLM Applications to review the solution for potential risks. Plan to address any risks that are identified.

Ensure that the solution plan addresses common responsible AI fundamentals:

  • Communicate the purpose and implications of AI applications to users.
  • Assess biases and fairness in training data, algorithms, and the overall solution design.
  • Ensure that the solution is accessible to users of varying abilities.

Ensure that the solution can be monitored and managed continuously once deployed:

  • Ensure your solution complies with relevant laws and regulations governing AI in the jurisdictions where it is deployed, and make sure you have a plan to stay in compliance.
  • Ensure that ongoing mechanisms for human oversight exist for critical decision-making processes, including after the development cycle has concluded and the solution is in production.
  • Ensure that humans can intervene in the solution in real time to prevent harm when it is detected, so that you can manage situations where the AI model does not perform as required.
  • Consider providing a feedback loop that allows users to report issues with the solution once deployed.

Solution Development (inner loop)

The solution development phase focuses on the iterative process to develop, test, and refine the solution.

Data Curation

Data Exploration

  • Obtain access to necessary data and dependencies, including any internal or external APIs. Only use data and information for which the team has consent from the company or the data owner.

  • Understand the data sources and adhere to relevant data access policies, including security, authentication, and compliance requirements.

  • Understand access requirements and methods for both development and production. For example, the use of role-based access control (RBAC) or attribute-based access control (ABAC).

  • Understand the data available to you and assess its quality through exploratory data analysis (EDA); a small exploration sketch follows this list. Address data quality issues early.

  • For solutions involving an unstructured corpus of documents, understand the following things:

    • Domain language
    • Document styles and formats
    • Identification of informative parts of documents
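
A small exploration sketch along these lines, assuming the corpus is a local folder of documents (the path and thresholds are placeholders), tallies file formats and text lengths so quality issues surface early:

```python
from collections import Counter
from pathlib import Path

corpus_dir = Path("data/raw_docs")  # placeholder location

if corpus_dir.exists():
    formats = Counter(p.suffix.lower() for p in corpus_dir.rglob("*") if p.is_file())
    lengths = sorted(len(p.read_text(errors="ignore")) for p in corpus_dir.rglob("*.txt"))
    print("File formats:", formats.most_common())
    if lengths:
        print(f"{len(lengths)} text docs, median length {lengths[len(lengths) // 2]} chars")
        print("Suspiciously short docs (<200 chars):", sum(1 for n in lengths if n < 200))
```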

Data Collection

  • Begin to collect data that can be used to create initial ground truth datasets for solution evaluation.
  • Ensure that the data collected is sufficient to compute your evaluation metrics with respect to the high-level success measures.
  • Design a schema for input and output data that supports consistent use with the tool sets used for iterative experimentation, and format initial datasets to match (an example follows).
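
A minimal sketch of one such schema, written as JSON Lines with illustrative field names (adapt the fields to your use case and tooling):

```python
import json

# Illustrative ground-truth record layout (field names are placeholders);
# one JSON object per line (JSON Lines) keeps the data easy to version and stream.
example_record = {
    "id": "q-0001",
    "query": "How do I reset my password?",
    "context_ids": ["kb-042"],  # source documents expected to support the answer
    "expected_answer": "Use the self-service reset portal.",
    "language": "en",
}

with open("ground_truth.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(example_record, ensure_ascii=False) + "\n")
```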

Data Governance

Safe adoption of LLMs will require comprehensive and reliable data governance across the data lifecycle.

  • Ensure that there is clear ownership and traceability for all data that the solution will utilize.

  • Identify the data governance processes and technologies that apply to the solution such as:

    • A data catalog, data tagging scheme, and glossary to facilitate the discovery and understanding of data
    • A data lineage tracking system that records the provenance of data through transformation and enrichment
    • A data steward (or council) that serves as the authority to review and grant data use requests
  • Validate that the solution adheres to applicable regulatory requirements (for example, GDPR, EU-AIA, and CCPA) with regular compliance checks, for areas including:

    • Consent management, especially for LLMs operating on PII or PHI data
    • Data retention, including notification for end users and processes to ensure adherence
    • Regular system auditing and reporting, confirming that systems are being used in described and intended ways
  • Ensure that prompt inputs and outputs are properly validated and masked to prevent unintended leakage of sensitive data (a minimal masking sketch follows this list).

  • Review recorded LLM outputs, including logs and conversation transcripts, for adherence to the following requirements:

    • Data privacy
    • Data retention
    • Data anonymization
    • Data masking
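
As a minimal sketch of the masking step noted above, the regex patterns below are for illustration only; production systems typically rely on a dedicated PII-detection service with much broader coverage.

```python
import re

# Illustrative patterns only; real deployments need far broader PII coverage.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def mask_sensitive(text: str) -> str:
    """Mask obvious PII before a prompt or response is logged or stored."""
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    return text


print(mask_sensitive("Contact jane.doe@contoso.com or +1 425 555 0100 for access."))
```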

Data Management

For supporting data that is used as part of the inference flow, such as few-shot banks, text corpora, and the like:

  • Establish methods to version datasets, track data lineage, and record the datasets used for each experiment (see the sketch after this list).
  • Pre-process all datasets and check their quality: ensure that the data was not corrupted during transfer, does not contain unexpected special characters, and is formatted suitably for ingestion and use.
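
One lightweight way to make dataset versions traceable per experiment, sketched under the assumption that datasets are files on disk (dedicated tools such as DVC or lakeFS can serve the same purpose):

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def dataset_fingerprint(path: str) -> dict:
    """Record a content hash so each experiment can name the exact data it used."""
    digest = hashlib.sha256(Path(path).read_bytes()).hexdigest()
    return {
        "path": path,
        "sha256": digest,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }


# Placeholder file list; write the manifest alongside the experiment record.
manifest = [dataset_fingerprint(p) for p in ["ground_truth.jsonl"]]
Path("experiment_manifest.json").write_text(json.dumps(manifest, indent=2), encoding="utf-8")
```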

For datasets and document corpora that will be used for search or retrieval:

  • Choose a baseline data-chunking strategy.
  • Include a simple keyword-based search baseline (a minimal sketch follows this list).
  • Identify target query intents and ensure that the datasets are representative of all intents in an unbiased distribution.
  • Based on the domain language used in the text corpus, identify the language characteristics that influence the selection of embedding models, text pre-processing, and tokenization.
  • Establish a mechanism to evaluate the quality of data retrieval.
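
A minimal sketch covering the chunking, keyword-baseline, and retrieval-evaluation bullets above; the sizes, scoring, and label format are illustrative placeholders rather than recommended settings.

```python
def chunk(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Baseline fixed-size character chunking with a small overlap."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]


def keyword_score(query: str, chunk_text: str) -> int:
    """Crude keyword-overlap score; serves as the non-embedding baseline."""
    return sum(1 for term in set(query.lower().split()) if term in chunk_text.lower())


def retrieve(query: str, chunks: list[str], k: int = 3) -> list[str]:
    """Return the k chunks with the highest keyword overlap."""
    return sorted(chunks, key=lambda c: keyword_score(query, c), reverse=True)[:k]


def recall_at_k(labels: list[tuple[str, str]], chunks: list[str], k: int = 3) -> float:
    """labels: (query, substring that a correct chunk must contain)."""
    hits = sum(1 for query, must in labels if any(must in c for c in retrieve(query, chunks, k)))
    return hits / len(labels)
```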

Experiment

Design

  • Survey existing reusable assets and solution accelerators that can speed up your progress. Evaluate the suitability of candidates based on the following characteristics:

    • Domain language characteristics
    • Target user intents of the use case
  • Identify candidate language models that align with business requirements, and use experiments to confirm suitability. Consider both proprietary and open-source pre-trained models. Also consider large and small language models, as well as fine-tuned models, as appropriate to the use case.

  • Identify applicable solution patterns, such as Retrieval-Augmented Generation (RAG), to augment the model with external knowledge sources where relevant (a minimal sketch follows this list).

  • Identify candidate tool sets (for example, PromptFlow, Semantic Kernel, and/or LangChain) and use experiments to confirm suitability to the solution.
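
To make the RAG pattern concrete, here is a minimal orchestration sketch; retrieve and generate are stand-ins for whichever search index and model client (or tool set) you select during experimentation.

```python
from typing import Callable


def answer_with_rag(
    question: str,
    retrieve: Callable[[str, int], list[str]],  # e.g. a search-index lookup returning the top-k chunks
    generate: Callable[[str], str],             # the selected LLM client or tool-set call
    k: int = 3,
) -> str:
    """Retrieval-Augmented Generation: ground the prompt in retrieved context."""
    context = "\n\n".join(retrieve(question, k))
    prompt = (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    return generate(prompt)
```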

Develop

  • Develop a basic prototype using candidate tool sets or by calling the LLM API directly (a minimal sketch follows this list).
  • Test various techniques to satisfy the cost, speed, and accuracy goals of the project:
    • Explore prompt engineering techniques (see also OpenAI Prompt Engineering Guide).
    • For search-based solutions, revisit chunking, caching, and search strategies based on evaluation results.
    • Evaluate both inner loop components and outer loop results to identify areas of improvement.
    • Consider alternative language models.
    • Consider whether fine-tuning a model is a feasible way to improve performance.
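
A minimal prototype sketch for the first bullet above, shown with the OpenAI Python SDK purely as an example client (any provider or tool set can fill the same role); the few-shot messages illustrate one prompt engineering technique, and the model name is a placeholder.

```python
from openai import OpenAI  # example client only; swap in your chosen provider or tool set

client = OpenAI()  # assumes an OPENAI_API_KEY environment variable

FEW_SHOT = [
    {"role": "system", "content": "Classify the support ticket as 'billing', 'technical', or 'other'."},
    {"role": "user", "content": "I was charged twice this month."},
    {"role": "assistant", "content": "billing"},
]


def classify_ticket(text: str, model: str = "gpt-4o-mini") -> str:
    """Few-shot classification prototype; the model name is a placeholder."""
    response = client.chat.completions.create(
        model=model,
        messages=FEW_SHOT + [{"role": "user", "content": text}],
        temperature=0,  # keep outputs stable for evaluation runs
    )
    return response.choices[0].message.content.strip()


print(classify_ticket("The app crashes when I upload a file."))
```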

Evaluate

  • Define your evaluation framework to evaluate and track your experiments.
  • Track each experiment with version numbers that associate changes in prompts, input data, and configuration parameters with the resulting output and performance evaluation (see the sketch after this list).
  • Measure the performance of outputs for each experiment against pre-defined metrics. Assess responses against ground truth for accuracy and effectiveness in both the inner and outer loops.
  • Establish a baseline to be used for model evaluation. This will provide a reference point for assessing the effectiveness and improvements of the LLM solution throughout development.
  • Instrument the prototype to report the cost of evaluation experiments, so that the cost impact of techniques under consideration is observable during development and experimentation.
  • Determine associated costs for all alternative approaches and optimizations.
  • Incorporate red teaming into weekly sprints over multiple weeks.
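
A minimal sketch that ties several of these bullets together: each run records the prompt version, the dataset fingerprint, a metric score, and an approximate cost (field names and pricing are placeholder assumptions; substitute your preferred tracking tool).

```python
import json
from datetime import datetime, timezone


def exact_match(predictions: list[str], references: list[str]) -> float:
    return sum(p.strip().lower() == r.strip().lower()
               for p, r in zip(predictions, references)) / len(references)


def log_experiment(run_id: str, prompt_version: str, dataset_sha: str,
                   predictions: list[str], references: list[str],
                   total_tokens: int, usd_per_1k_tokens: float = 0.002) -> dict:
    """Append one experiment record; pricing and field names are placeholders."""
    record = {
        "run_id": run_id,
        "prompt_version": prompt_version,  # ties results back to the prompt change
        "dataset_sha256": dataset_sha,     # ties results back to the exact data used
        "exact_match": exact_match(predictions, references),
        "approx_cost_usd": total_tokens / 1000 * usd_per_1k_tokens,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    with open("experiments.jsonl", "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record
```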

Solution Deployment (outer loop)

The solution deployment phase focuses on deploying and managing solutions in production.

Validate and deploy

Deployment Preparation

  • Develop LLM utilization projections and a capacity plan for LLM resources based on expected user throughput and usage characteristics (a back-of-envelope sketch follows this list).
  • Create a deployment architecture that meets requirements including:
    • Deployment of LLM models based on tenancy and resiliency requirements
    • Use of shared LLM models based on customer cost requirements
    • Use of intermediate API load-balancing architectures to facilitate observability and cross-region load shedding or failover if necessary
  • Create and test procedures to scale up capacity if necessary, or implement safeguards to prevent overloading. Confirm that subscriptions contain capacity quotas for any expected scale-up.
  • Conduct comprehensive solution reviews, covering design and evaluation approaches, and overall solution performance.
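
A back-of-envelope sketch of the utilization projection mentioned above; every figure is an illustrative assumption to be replaced with measured values.

```python
# Illustrative assumptions; replace each figure with measured values.
concurrent_users = 500
requests_per_user_per_hour = 6
tokens_per_request = 1_500 + 500  # prompt tokens + completion tokens

requests_per_minute = concurrent_users * requests_per_user_per_hour / 60
tokens_per_minute = requests_per_minute * tokens_per_request

print(f"~{requests_per_minute:.0f} requests/min, ~{tokens_per_minute:,.0f} tokens/min")
# Compare tokens_per_minute against the provisioned model quota (for example,
# tokens-per-minute limits) and add headroom for peaks before finalizing the plan.
```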

Deployment Process

  • Implement CI/CD pipelines to automate testing and deployment processes.
  • Deploy the model and solution to the QA environment for performance assessment before production deployment.
  • Establish a robust data pipeline for inferencing, including end-to-end tests for reliability.
  • Validate application security standards to safeguard against potential vulnerabilities.
  • Enable A/B testing or blue/green deployments to compare updated solutions with existing deployments (a traffic-splitting sketch follows this list).
  • Consider shadow testing where appropriate.
  • Implement an iterative red teaming approach to identify potential harms or problems.
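
A minimal sketch of the traffic-splitting idea behind A/B or blue/green rollouts: a deterministic hash of the user ID routes a fixed share of users to the candidate deployment (real rollouts usually rely on gateway or feature-flag tooling rather than application code).

```python
import hashlib


def assign_variant(user_id: str, candidate_share: float = 0.1) -> str:
    """Deterministically route a fixed share of users to the candidate deployment."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "candidate" if bucket < candidate_share * 100 else "current"


# Roughly 10% of users see the candidate deployment; each user always sees the same one.
print(assign_variant("user-1234"))
```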

Monitoring

  • Apply microservice monitoring best practices to your LLM solution. Use the customer's preferred monitoring infrastructure and tools.
  • Track and analyze solution performance, cost, and latency in the production environment.
  • Correlate performance metrics and signals with system changes such as app deployments, model updates, and configuration changes.
  • Implement dashboards tailored for different roles (for example, business roles, data science, and engineering, among others). Enable focused analysis of service performance and LLM responses, with relevant metrics for each role.
  • Monitor data feeds and model output to promptly identify, alert, and address unexpected model or system behavior.
  • Monitor for attempts by malicious users to “jailbreak” the system, for example by tracking rejected user queries or unusual query volume (see the sketch below).
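
A minimal sketch of the rejected-query tracking in the last bullet: count refusals or filtered requests in a sliding window and raise an alert when the rate exceeds a baseline (the window, threshold, and alert hook are placeholders).

```python
from collections import deque
from time import time

REJECTIONS: deque[float] = deque()  # timestamps of rejected or filtered requests
WINDOW_SECONDS = 300
ALERT_THRESHOLD = 20  # placeholder; tune from the observed baseline rejection rate


def alert(message: str) -> None:
    print("ALERT:", message)  # placeholder; forward to your monitoring system


def record_rejection(now: float | None = None) -> None:
    """Call whenever the solution rejects or filters a user query."""
    now = now or time()
    REJECTIONS.append(now)
    while REJECTIONS and REJECTIONS[0] < now - WINDOW_SECONDS:
        REJECTIONS.popleft()
    if len(REJECTIONS) > ALERT_THRESHOLD:
        alert(f"{len(REJECTIONS)} rejected queries in the last {WINDOW_SECONDS}s")
```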

Feedback & Data Collection

  • Incorporate manual or autogenerated user feedback into experiments to enhance solution performance.
  • Capture interaction data from the solution for insights into usage patterns and user expectations, and to improve evaluation datasets.

Operations

  • Develop a comprehensive runbook detailing standard operating procedures, troubleshooting guides, and escalation paths for effective incident management.
  • Onboard and train support teams in system-specific operations and best practices, ensuring readiness for handling live scenarios.
  • Execute regular mock live site drills to simulate real-world incidents, enhancing team preparedness and response capabilities.

References

The following resources and best practice guides provide more considerations that may be helpful.