Behind the Scenes: Testing and Validation

April 30, 2024
The development of large language model (LLM) assistants introduces a unique set of challenges, especially in dynamic, conversational contexts, so robust testing is crucial for delivering a seamless user experience. Testing is particularly important for intelligent virtual assistants (IVAs): it ensures they meet user expectations, handle ambiguity and variability, and avoid misinterpretations.

In this article, we are going to explore two layers of testing for LLM assistants.

  1. Deterministic:

    Before delving into the complexities of dynamic conversations, it's essential to ensure that the fundamental functionalities of the LLM assistant work as intended. This involves testing each function, or capability, to verify that it produces the correct output for a given input.

  2. Non-deterministic:

    Due to the nature of natural language, non-deterministic testing is essential. It examines scenarios where outcomes are unpredictable or variable: in natural language processing, the model's response can differ from run to run because of the inherent ambiguity in how language is structured and interpreted. The evaluation measures how effectively the LLM adapts and performs in situations where outcomes are hard to assess deterministically.


Let’s take a beverage machine assistant as an example. It lets users place orders for various beverages, customize their drinks, and manage their orders through a conversational interface.

The beverage machine has some basic functionalities that we have defined in code:  

  • add_item  
  • remove_item  
  • cancel_order
  • submit_order

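To ground the example, here is a minimal sketch of what these order functions might look like. The BeverageOrder class, its fields, and the method signatures are assumptions for illustration, not the actual implementation.

```python
# Hypothetical sketch of the beverage machine's order functions.
# Class name, fields, and signatures are assumptions, not the real code.
class BeverageOrder:
    def __init__(self):
        self.items = []        # items added so far
        self.cancelled = False
        self.submitted = False

    def add_item(self, drink: str, size: str = "medium") -> str:
        self.items.append({"drink": drink, "size": size})
        return f"Added {size} {drink}."

    def remove_item(self, drink: str) -> str:
        self.items = [i for i in self.items if i["drink"] != drink]
        return f"Removed {drink}."

    def cancel_order(self) -> str:
        self.items.clear()
        self.cancelled = True
        return "Order cancelled."

    def submit_order(self) -> str:
        self.submitted = True
        return f"Submitted order with {len(self.items)} item(s)."
```

The assistant's job is then to map free-form user utterances onto calls to these functions.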
First, we want to make sure that the assistant understands when to call these functions, and that's where unit tests come in handy. As we write them, we refine the assistant's ability to recognize exactly when to trigger a specific function, making sure it stays sharp and handles requests smoothly, without any hiccups!

The example code below ensures that the assistant accurately recognizes and triggers the "cancel_order" function under various input scenarios. This meticulous testing process helps refine the assistant's responsiveness and handling of user requests.

    @pytest.mark.parametrize("order_input", [
        "I think I don't want anything anymore! ciao!",
        "Cancel my order. Goodbye!",
        "I changed my mind. No more drinks for me. Bye.",
        "I'm not in the mood for coffee anymore. Cancel my order.",
        "Sorry, cancel the order. I've changed my mind.",
    ])
    def test_cancel_order_function_call(self, order_input):
        assistant = BeverageAssistant()
        reply = assistant.run_order(order_input)
        # run_order returns a pair; the second element is the triggered function
        assert reply[1] == "cancel_order"


Once we have ensured that the assistant comprehends its fundamental functions, the next step is to subject it to more dynamic scenarios. This phase is crucial for validating that the assistant's responses align with expectations. To accomplish this, we tested our beverage assistant by pairing it with a second LLM.

Distinct from the beverage assistant model, we introduced a separate validation model designed specifically to assess and validate the responses generated by the assistant. The aim is to define test scenarios that cover a range of interaction cases with the agent. For this purpose, we used the Gherkin syntax.

The input for our validation LLM includes the user's input prompt, the assistant's response, and the specific test scenario being evaluated. Below is an illustrative sample test scenario:
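Assembling that input can be as simple as filling a prompt template. The helper below is a hypothetical sketch; the function name, field labels, and prompt wording are assumptions for illustration, not the article's actual code.

```python
# Hypothetical helper that assembles the input for the validation LLM.
# Field labels and wording are assumptions, not the article's real prompt.
def build_validation_prompt(user_prompt: str, assistant_reply: str, scenario: str) -> str:
    return (
        "You are a validator. Judge whether the assistant's reply satisfies "
        "the scenario. Answer 'Valid' or 'Invalid' with a short justification.\n\n"
        f"SCENARIO:\n{scenario}\n\n"
        f"USER PROMPT:\n{user_prompt}\n\n"
        f"ASSISTANT REPLY:\n{assistant_reply}\n"
    )
```

The resulting string would then be sent to the validation model alongside whatever system instructions it needs.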

SCENARIO-3: User orders an unavailable drink

GIVEN: The user provides any size or extras.
WHEN: The user requests a specific drink.
THEN: The assistant refrains from adding the drink to the final order list.
AND: The assistant informs the user about the unavailability of the item.


Here's an actual example output of a scenario:

User Prompt: Can I get a milkshake?
Assistant: I'm sorry, but we currently don't have milkshakes available. Is there any other drink you would like to order?
Validation Result: Valid. The user requested a milkshake, aligning with the given scenario of ordering an unavailable drink. The assistant accurately identifies the unavailability of milkshakes and communicates this to the user. The final order result remains empty, indicating that the milkshake was not added to the order. Therefore, the assistant's response is correct within the context of the specified scenario.
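In an automated run, a test can assert on the validator's verdict. The helper below assumes the validation LLM is instructed to begin its answer with "Valid" or "Invalid"; that output convention is an assumption for illustration, not a documented contract.

```python
# Hypothetical verdict check, assuming the validation LLM starts its answer
# with "Valid" or "Invalid" as in the sample output above.
def verdict_is_valid(validation_result: str) -> bool:
    return validation_result.strip().lower().startswith("valid")
```

A scenario test would then simply assert `verdict_is_valid(...)` on the validation model's reply.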


In conclusion, rigorous testing, including basic functionality testing and dynamic scenario testing, is imperative when developing LLM assistants, ensuring their fundamental capabilities and responsiveness. The beverage machine assistant example illustrates the importance of meticulous testing: refining the assistant's understanding of user inputs and validating its dynamic responses. Adding a separate validation LLM further strengthens the assessment of the assistant's performance across diverse scenarios, ultimately contributing to a reliable and user-friendly conversational experience.


