UI Testing Unleashed: Frameworks, Streamlit, and the Testing of AI/ML Apps

Published by Karan Narula

In the ever-evolving landscape of software development, testing has remained a linchpin, playing a pivotal role in the creation of seamless, user-centred applications. The process not only ensures that products align with their intended vision but also safeguards against potential pitfalls that could mar the user experience. One central aspect of this landscape is User Interface (UI) testing, an indispensable facet that orchestrates the harmonious interaction between users and applications.

What is UI testing?

UI testing, as its name suggests, revolves around the meticulous evaluation of an application’s User Interface. However, its significance extends beyond the mere assessment of visual elements: it encompasses a thorough examination of the user pathways inherent in the app’s design. The objective is twofold: to ensure the seamless functionality of the application, thus elevating the user experience, and to safeguard against UI regression caused by software version updates.

Furthermore, UI testing functions as a crucial insurance measure before releasing an application. This practice empowers development teams to swiftly traverse a multitude of scenarios and interactions, all without relying on manual intervention. Its platform-agnostic nature is another advantage, facilitating its execution across various devices and operating systems. This adaptability significantly augments the reach and efficiency of the testing process.

Why test the UI?

In the evolving world of AI/ML, there is growing awareness that applications need a proper front end. The industry standard has been for AI/ML practitioners to use notebook-like environments to interact with their data, visualise their models, and fine-tune them. Notebooks are a great stopgap, but they do not scale in the long term because they are hard to integrate with testing methodologies. As we chase the pot of gold under the LLM rainbow, insurance policies that protect us from crash-and-burn outcomes are imperative – an AI/ML-driven application that is not coupled with a rigorous UI testing suite is a Ferrari driven without insurance.

At Flectēre, we place ourselves at the forefront of user-centric AI/ML development. Our development endeavours utilise Streamlit as the cornerstone of our User Interface, facilitating seamless interaction with our applications. Streamlit stands out as a potent and user-friendly Python web framework, offering exceptional utility to data scientists and AI/ML practitioners by providing effortless data interaction and visualisation capabilities. It is rapidly becoming a widely used platform as illustrated by the graph below:

Usage metrics of different data visualisation tools based on GitHub Stars (source: https://www.datarevenue.com/en-blog/data-dashboarding-streamlit-vs-dash-vs-shiny-vs-voila)

While Streamlit is a great framework, we were unable to find a way to test and verify the UI. We have been writing robust automated tests that validate our ‘backend’ and integrate with our CI/CD pipelines, but the gap in providing the same level of assurance for our UI was becoming too wide to ignore. At the time, there was limited knowledge available on what we could use to test the functionality of our applications. So, to fill this vacuum, we went on a hunt to discover which platform would work best for validating Streamlit’s UI.

The criteria for a good UI testing platform

Our selection criteria encompassed the following aspects:

1. Seamless Integration into our Software Development Process: The chosen framework should seamlessly integrate with our existing development workflow.

2. Compatibility with GitHub Actions: Integration with GitHub Actions is paramount to streamlining our testing and deployment pipeline.

3. Testing Framework Convenience: The framework’s ease of use and implementation were critical factors in our decision-making process.

4. Scalability for Complexity: The framework’s scalability, particularly in handling increasingly complex testing scenarios, was a key consideration.

5. Streamlit-Specific Support: Given our reliance on Streamlit, we sought to determine if the framework offered tailored support for this technology.

6. Maturity of the Platform: We investigated the support options (community, enterprise, paid) and the long-term plans for each platform.

After a thorough evaluation, we narrowed down our search to the two most prominent web testing frameworks compatible with Python: Selenium and Playwright.

Selenium, an established player in the field, has been in development since 2004. Its longevity has enabled it to amass a wide range of browser and language support. Being an open-source solution, it boasts a vibrant community of developers actively contributing to its evolution.

Playwright, in contrast, emerged as a more recent entrant into the scene, introduced by Microsoft in 2020. While it may support fewer languages and browsers, it compensates with impressive speed advantages over Selenium.

How did each platform do?

| Criterion / Platform                              | Selenium | Playwright |
| Integration into the software development process | 🟰       | 🟰         |
| Integration into GitHub Actions                   | 🟰       | 🟰         |
| Ease of testing framework                         | 👎       | 👍         |
| Scalability of testing                            | 👎       | 👍         |
| Bespoke support for Streamlit                     | 🟰       | 🟰         |
| Maturity of the platform                          | 🟰       | 🟰         |

Integration into the software development process

Both platforms align seamlessly with our software development workflows, earning each a pass on this criterion. Our biggest concern was compatibility with Pytest, our designated testing framework. Here, both Selenium and Playwright are strong contenders, each equipped with plugins that integrate harmoniously with Pytest. These plugins offer a valuable array of capabilities, including fixtures and command-line arguments that enhance testing functionality.

These plugins not only facilitate the manual execution of tests but also extend the option to run tests in both headed and headless modes. This duality is instrumental in simplifying test progress tracking. Moreover, it proves invaluable in swiftly pinpointing breakpoints within the codebase, thereby expediting the identification of potential issues.
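As a minimal sketch of what that integration looks like in practice (assuming the pytest-playwright plugin is installed; the URL and title below are placeholders):

```python
# test_smoke.py
# pytest-playwright injects a ready-made `page` fixture; headed vs headless
# is a command-line toggle rather than code:
#   pytest test_smoke.py            # headless (default)
#   pytest test_smoke.py --headed   # watch the browser while debugging
from playwright.sync_api import Page, expect

def test_homepage_loads(page: Page):
    page.goto("https://example.com")  # placeholder URL
    expect(page).to_have_title("Example Domain")  # placeholder title
```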

Integration into GitHub Actions

Fortunately, neither framework needs anything extra to run on GitHub Actions. As long as the packages are correctly set up, the tests run without a problem. The only point worth noting is that for a Python project, in addition to installing the packages in the workflow, we also need to install the web drivers for the browsers we want the tests to use.
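As a minimal sketch of what this looks like for the Playwright case (the workflow file name and test path are hypothetical; `playwright install --with-deps` is the step that pulls in the browsers and their system dependencies):

```yaml
# .github/workflows/ui-tests.yml (hypothetical name)
name: UI tests
on: [push]

jobs:
  ui-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install pytest pytest-playwright
      - run: playwright install --with-deps  # browsers + system dependencies
      - run: pytest tests/                   # hypothetical test directory
```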

Ease of testing framework

At this juncture, the distinctions between the two platforms became markedly pronounced. While Selenium and Playwright are both widely acclaimed automation testing frameworks, they diverge sharply in ease of setup and usability.

The initial divergence manifested during the setup phase. Setting up Playwright proved remarkably straightforward.

  1. A simple installation of the package via pip, followed by a single Playwright command, installed all necessary dependencies and web drivers, configuring the essential environment in one fell swoop (see the commands after this list).
  2. Selenium, by contrast, required a more intricate approach: each web driver necessitated a separate installation, inevitably extending the setup timeline. Furthermore, I grappled with persistent conflicts between my Selenium installation and Pytest, demanding a substantial investment of time in debugging and resolution.
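Concretely, the entire Playwright setup boils down to two commands (the pytest-playwright package bundles the Pytest plugin and Playwright itself):

```
pip install pytest-playwright   # Playwright plus its Pytest plugin
playwright install              # download the browser binaries and drivers
```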

Scalability of testing

On this front, both frameworks are similarly scalable. They interact with websites through much the same primitives: clicks, locating elements via HTML/CSS/XPath, and text entry. Neither platform had a real advantage or disadvantage here.
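To make the parity concrete, here is the same trio of primitives, click, text entry, and element location, sketched in each framework (the URL and selectors are placeholders):

```python
# Playwright (sync API): click, type, locate
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com")           # placeholder URL
    page.get_by_text("Blog").click()           # click by visible text
    page.fill("#search", "streamlit")          # text entry (placeholder selector)
    count = page.locator("//article").count()  # locate via XPath
    browser.close()

# Selenium: the same three primitives
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://example.com")              # placeholder URL
driver.find_element(By.LINK_TEXT, "Blog").click()
driver.find_element(By.ID, "search").send_keys("streamlit")
count = len(driver.find_elements(By.XPATH, "//article"))
driver.quit()
```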

When it came to the speed of the platforms, Playwright unequivocally outshone its counterpart. While existing research supported the prevailing consensus that Playwright is faster than Selenium for website automation, I undertook empirical tests to substantiate this assertion. Employing a designated website as the testing arena, I homed in on two pivotal objectives:

  1. Validating interaction capabilities with all tabs present on the website.
  2. Verifying the accuracy of article counts within the blog section of the website.

With these objectives guiding me, I harnessed the code generators built into both frameworks to script the test cases. Below, I present the generated code for your reference:

Playwright:
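A representative sketch of the codegen-style test follows; the tab names, URL, and expected article count are placeholders standing in for our site’s actual values:

```python
from playwright.sync_api import Page, expect

# Placeholder tab names standing in for our site's actual navigation.
TABS = ["Home", "Services", "Projects", "About", "Contact", "Blog"]

def test_tabs_and_article_count(page: Page):
    page.goto("https://example.com")  # placeholder URL
    # Codegen records each click against the element's role and visible name.
    for tab in TABS:
        page.get_by_role("link", name=tab).click()
    # On the final (Blog) tab, assert the expected number of articles.
    expect(page.locator("article")).to_have_count(12)  # placeholder count
```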

Selenium:
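And a representative sketch of the Selenium IDE-style equivalent, again with placeholder selectors and values, including the driver-management fixture mentioned below:

```python
import pytest
from selenium import webdriver
from selenium.webdriver.common.by import By

# Placeholder CSS selectors in the style Selenium IDE records them.
TAB_SELECTORS = [f".nav > li:nth-child({i}) > a" for i in range(1, 7)]

@pytest.fixture
def driver():
    # Fixture managing browser setup and teardown, as in the IDE's export.
    drv = webdriver.Chrome()
    yield drv
    drv.quit()

def test_tabs_and_article_count(driver):
    driver.get("https://example.com")  # placeholder URL
    for selector in TAB_SELECTORS:
        driver.find_element(By.CSS_SELECTOR, selector).click()
    # On the final (Blog) tab, count the rendered articles.
    assert len(driver.find_elements(By.TAG_NAME, "article")) == 12  # placeholder count
```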

Before even running the tests, a few observations are worth making.

  1. The length of code and setup requirements:
    1. Playwright generated all of its code using the codegen feature, which can be invoked through the CLI (e.g. playwright codegen <url>). It opens an incognito window alongside a Playwright Inspector in which all clicks and interactions get recorded. Additionally, Playwright’s Pytest plugin manages the setup of the web driver and browser on its own, dramatically reducing the amount of code needed to run a test.
    2. Selenium’s code generation required me to use one of the browsers on my laptop together with its browser extension, Selenium IDE. The test it wrote included a fixture that managed the setup of the driver and browser.
  2. Test readability:
    1. The test generated by Playwright’s codegen is more readable: it identifies tabs by their HTML roles and assigned names before clicking on them.
    2. Selenium IDE instead refers to elements by their CSS selectors, which makes the generated code far less human-readable. While Selenium can refer to elements in a way similar to Playwright, its IDE does not do this.

Now, moving on to a speed comparison: I ran both tests with Python 3.11 and Pytest 7.4.0 on the same machine, using the latest versions of Playwright and Selenium at the time of writing (Playwright 1.37.0 and Selenium 4.11.2). I ran each test 100 times sequentially using the pytest-repeat plugin, and here were my findings:

| Framework  | Total Time for 100 Tests (seconds) | Average Time (seconds) |
| Playwright | 361.65                             | 3.61                   |
| Selenium   | 786.27                             | 7.86                   |

Playwright ran the 100 tests more than twice as fast as Selenium! Moreover, these were not particularly demanding tests: they only clicked on 6 buttons and then checked the number of articles on the last tab.
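For anyone reproducing the benchmark, the repetition is a single pytest-repeat flag rather than a shell loop (the file names here are hypothetical):

```
pytest test_tabs_playwright.py --count=100
pytest test_tabs_selenium.py --count=100
```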

Bespoke support for Streamlit

Currently, neither framework has any bespoke support for Streamlit, so once again they were on a level playing field in this matter.
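That said, nothing stops either framework from driving a Streamlit app through generic selectors. A minimal sketch with Playwright, assuming an app running locally on Streamlit’s default port and relying on the data-testid attributes Streamlit currently renders (these are internal and may change between versions):

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("http://localhost:8501")  # Streamlit's default port
    # Streamlit widgets expose data-testid attributes, e.g. stButton.
    page.get_by_test_id("stButton").locator("button").click()
    page.wait_for_selector("[data-testid='stMarkdownContainer']")
    browser.close()
```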

Maturity of the Platform

Selenium is the more mature platform when it comes to broad compatibility and a strong user base. Having been established in 2004, it has a huge community of active users and continues to receive regular updates. Its age also means there exists a rich repository of problems faced by users that have already been addressed by the community. Community support runs through a Slack channel and an official user group.

Playwright is very much the new kid on the block. Having been released in 2020, it lacks a comparable community, but it makes up for this with detailed documentation: the website explains each of its tools with an example. Further, it has an active Discord server for help with the framework.

Conclusion

UI testing has become essential for bridging the gap between developers’ intentions and user expectations. At Flectēre, where we emphasise Test Driven Development (TDD), selecting the right UI testing platform was crucial. After evaluating options, Selenium and Playwright emerged as contenders. Both integrated smoothly into our workflow, supported Pytest, and worked well with GitHub Actions.

However, Playwright stood out in ease of use and setup. Its CLI-based code generation streamlined installation and produced more readable test scripts. Speed-wise, Playwright was the clear winner, running tests more than twice as fast on average as Selenium.

Both frameworks proved scalable and compatible while lacking specific Streamlit support. Automated UI testing with these platforms ensures a seamless user experience and strengthens project integrity, aligning development with user needs and technological advancement.