
Visual regression testing on the web: a viable alternative to DOM-based E2E tests?

Visual regression tests (VRT) should make it easier for software developers to make reliable statements about the quality of their web front ends. Screenshots of the user interface are captured and compared against an existing reference set. In this way, CSS and template regressions can be detected that DOM-based E2E tests cannot catch, for example unwanted overriding of CSS classes or graphics that are accidentally no longer visible. This test procedure thus closes a gap in the series of automated software tests: tedious manual testing effort is reduced and some problems of DOM-based E2E tests are overcome. In this article I try to clarify what visual regression tests can do in practice and whether they can actually replace DOM-based E2E tests.

As part of my bachelor thesis "Visual regression tests for web frontends", I examined the suitability of this form of testing. To do so, I evaluated existing tools and compared the most promising solutions with one another. That work forms the basis of this article.

Visual regression tests

As described at the beginning, visual regression tests (VRT) aim to detect CSS and template regressions. Automated pixel-by-pixel comparisons should enable front-end developers to visually trace changes in the layout of a web application and to recognize unwanted changes more easily.

In addition to the usual test strategies such as unit, integration, and end-to-end testing, web applications are mostly tested manually. In other words, developers or testers click through an application by hand and thereby simulate the end user's use cases. This method is time-consuming and difficult to scale: the larger the application, the more time, and therefore money, has to be invested in manual testing. It is also error-prone and not very efficient. Humans can simply overlook marginal changes, which cannot happen to a machine; this phenomenon is called change blindness. In the following illustration of our application under test (the TodoMVC app), for example, the input element for ticking off the to-do entry is missing in the newly captured screenshot (center). Such marginal changes are difficult for the human eye to perceive.

For the VRT it must be possible to automatically create screenshots of a web application and then to compare these with a reference set of existing screenshots in order to detect regressions.


The process, or workflow, of visual regression tests is divided into four stages, with stages three and four being optional; they depend on whether the screenshots differ. The application to be tested is referred to here as "AUT" (Application Under Test):

  1. In the first step, the AUT is run in a controlled test setting, using either real data or dedicated test data. The parts of the application described in the tests are then captured and saved. This means that different states of different pages of the AUT are recorded in the form of screenshots.
  2. Then, in step two, the captured screenshots are compared with an already existing set of reference screenshots. These reference screenshots are known as the baseline. They are recorded before the first test and manually checked for correctness. After this initial test, an automated, machine comparison of the images recorded in the test with the baseline is possible.

    In most cases, the screenshots are checked without any further logic, pixel by pixel. The corresponding images are overlaid and their RGB color values are then compared with one another row by row, pixel by pixel.

  3. The third step involves recording and reporting any differences found in the compared screenshots. A copy of the respective baseline is made and the differing pixels are marked in red (or in another conspicuous color) in the newly created image. In addition to the newly created differential image, metadata such as the number of different pixels can also be saved for further use.
  4. The last step is to manually check the screenshots marked as "failed" in the comparison. At this point it can be determined whether the changes in the AUT were intentional or not. If the changes were wanted, for example because a new feature was implemented, the newly captured screenshot is used as the new baseline. The baseline can be easily versioned via Git. If the changes were not wanted, the test was able to successfully identify an error and this can now be corrected.
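The core of steps two and three can be sketched in a few lines of JavaScript. The following is a minimal illustration, not production code: it assumes both screenshots have already been decoded into raw RGBA byte arrays of identical dimensions, and the function name `diffScreenshots` is my own choice for this sketch (real tools add PNG decoding, reporting, and baseline handling on top):

```javascript
// Minimal sketch of workflow steps 2 and 3: compare two screenshots
// pixel by pixel and mark differing pixels in red in a copy of the baseline.
// `baseline` and `current` are raw RGBA byte arrays of the same dimensions.
function diffScreenshots(baseline, current, width, height) {
  const diff = Uint8ClampedArray.from(baseline); // copy of the baseline
  let differingPixels = 0;

  for (let i = 0; i < width * height * 4; i += 4) {
    const same =
      baseline[i]     === current[i]     && // R
      baseline[i + 1] === current[i + 1] && // G
      baseline[i + 2] === current[i + 2] && // B
      baseline[i + 3] === current[i + 3];   // A
    if (!same) {
      differingPixels++;
      // mark the differing pixel in a conspicuous color (red, fully opaque)
      diff[i] = 255; diff[i + 1] = 0; diff[i + 2] = 0; diff[i + 3] = 255;
    }
  }
  // metadata (here: the pixel count) can be saved alongside the diff image
  return { differingPixels, diff };
}
```

If `differingPixels` is greater than zero, the test is marked as "failed" and the generated diff image is handed to a human for the manual check in step four.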

Classification and criticism

Visual regression tests belong to the E2E tests. Since we restrict ourselves to the web front-end area, UI tests are treated as synonymous with E2E tests in this article. UI tests are one of the black-box approaches in quality assurance. Black-box testing techniques perceive the system under test as a whole; they have no knowledge of any details of the underlying source code. The only two factors that can be influenced or compared are the input and output values of the software. Black-box tests largely correspond to the end user's view of the software product, which makes user stories particularly suitable for identifying suitable test scenarios.

In contrast to unit and integration tests, however, the black-box approach reveals some problems with UI tests. It is worth taking a closer look at Mike Cohn's test automation pyramid.

Mike Cohn described a three-level pyramid for effective and efficient automation of software tests in his 2010 book "Succeeding with Agile". The lowest level, and thus the base of the pyramid, are the unit tests. They represent the largest proportion of tests, which, according to Cohn, is also the level where automation is most comfortable: they have extremely short runtimes and deliver very granular error messages about the source of a problem. The middle level represents the service or integration tests, i.e. those test cases that test several modules "below" the user interface in an integrated manner. The top, third level of the pyramid are the UI tests, of which, according to Cohn, as many as necessary and as few as possible should be written. He criticizes them as unstable, expensive, and time-consuming. In addition, they create redundancies: since the complete code base is always exercised by UI tests, they repeatedly cover functionality that can be tested more easily and quickly on the levels below.

The instability of UI tests mentioned by Cohn stems from the difficulty of controlling the state of the AUT. If, for example, the CSS, images, or DOM elements have not yet been fully loaded at the time the test runs, the tests will fail even though the application works correctly. These and similar errors are called false negatives; in practice, however, the term "false positives" is often used for them as well. Depending on the interpretation, both are correct; this article uses the term false positives. In "classic" or "DOM-based" E2E tests, false positives are mostly caused by missing DOM elements. In principle, the execution of these tests does not differ from that of visual regression tests: for both, the complete application is loaded, and then the existence or absence of DOM elements or their content is tested in the browser. False positives in VRT are additionally induced by differences in how the browser renders the content. This is discussed in more detail below.

The longer time required for UI tests is not only due to the longer execution time, but also to less comprehensible sources of error. In contrast to unit tests, UI tests usually cannot refer to a specific point in the code in order to fix the cause. With the black box approach, only the final result is viewed in the browser. However, this in turn is usually the result of an interaction between templates, controllers and services.

The longer the tests take to run, the harder they are to maintain and the more unstable they are, the less likely it is that developers will run UI tests repeatedly. This in turn tends to lead to a higher error rate in development and thus to a longer development time.

For the reasons listed, the productive value of visual regression tests is somewhat controversial. Some advocates, e.g. Pete Hunt, author of Huxley, the VRT tool built for the Instagram web application (its development has since been discontinued), hold the opinion that with the existing solutions such tests work well only as a code-review aid, not as a full-fledged test tool. In other words, they do not see them integrated into an automated test setup, but rather used as a local tool for visually comparing the changes made.

In order to be able to establish visual regression tests as a serious test method, the goal must therefore be to find out which scenarios are sensible to test, while at the same time ensuring the stability of the tests and keeping their false positive rate as low as possible.

Evaluation of existing VRT tools

In total, a little more than 30 testing tools were examined as part of my thesis, covering both open-source and commercial solutions. Tools that are either no longer actively maintained or do not meet the requirements (see next section) were sorted out over several filter stages. In the end, three VRT tools remained, which were used in the implementation part to write tests for a real AngularJS application. In this blog article, the tests are implemented against the TodoMVC app shown above.


The following requirements for the testing tools were developed as part of the thesis:

  * Functional requirements, basic functions: coverage of the four workflow steps described above (capturing screenshots, comparison against the baseline, reporting of differences, updating the baseline)
  * Functional requirements, extended functions: interaction with the AUT, capturing individual page elements, detection of false positives, automation
  * Non-functional requirements: robustness

The basic functions among the functional requirements refer to the four steps of the workflow described above. Many tools cover only part of these basic functions. Specter, for example, is just a web application for comparing images: it offers an interface through which image data can be passed to the application, compared with one another, and then evaluated. It therefore makes sense to combine the strengths of the individual tools.

The extended functions are aimed at meaningful productive use. It should be possible to interact with the application in order to put it into the desired state to be tested (e.g. a correctly completed form in an order process). It should also be possible to capture only parts of the application as a screenshot, for example via a CSS selector; the otherwise common approach is to record the entire page, either just the area visible in the browser window or the entire scrollable height. The most important points among the extended functional requirements are the last two: the detection of false positives should guarantee the stability of the tests, and automation should ensure integration into an existing test setup.

In the case of non-functional requirements, particular focus must be placed on robustness. This is closely linked to the detection of false positives and is intended to improve acceptance among developers.

False positives

False positives can have many causes. In addition to intended changes (features), which lead to expected differences when the screenshots are compared, there are causes of error that are unintentional, barely or not at all visible to the human eye, and of a technical nature. These include, for example, anti-aliasing and image scaling. Furthermore, dynamic content, shifts in content, and delays in loading individual components of the AUT lead to incorrect test results. It is important to recognize and avoid these false positives.

There are different approaches to preventing false positives. Many tools, like LooksSame from Yandex, orient themselves towards human perception of the content: color deviations that are barely visible to a person are computed using the CIE DE2000 model and tolerated. Differences in the pixel rasterization of edges (anti-aliasing) are also recognized well by most tools and therefore do not cause false positives. However, it becomes problematic if, for example, the content shifts down by only one pixel. Even though nothing has changed in terms of content and the shift cannot be perceived by the human eye, all of the shifted pixels are marked as "error pixels". To get this type of false positive under control, some testing tools offer the option of specifying a mismatch tolerance: the percentage of pixels that may differ before the newly captured screenshot is marked as "different" from the baseline. However, this harbors a problem: the higher the resolution of the screenshot, the more pixels may differ in absolute terms without the alarm being sounded. The mismatch tolerance must therefore be treated with caution. A more intelligent approach would be to detect content (e.g. classify it as objects via OCR or as layout elements) and then determine whether the state of the AUT has changed in terms of content or structure. This requires significantly more effort in image processing and is currently only offered by the commercial service Applitools Eyes.
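As a sketch, a percentage-based mismatch tolerance might look like the following; the function and its parameter names are illustrative and not taken from any specific tool. Note how the same percentage admits far more absolute error pixels at a higher resolution, which is exactly the caveat described above:

```javascript
// Hypothetical sketch of a percentage-based mismatch tolerance.
// Returns true if the share of differing pixels stays within the tolerance.
function withinTolerance(differingPixels, width, height, tolerancePercent) {
  const totalPixels = width * height;
  const mismatchPercent = (differingPixels / totalPixels) * 100;
  return mismatchPercent <= tolerancePercent;
}
```

With a 0.1 % tolerance, a 100x100 screenshot may differ in at most 10 pixels, while a Full HD screenshot (1920x1080) may differ in roughly 2,000 pixels before the alarm is sounded.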

Tool selection

In addition to the requirements listed above, two factors in particular were given great importance when selecting the test tools. On the one hand, the technology stack for front-end developers should not be unnecessarily expanded; on the other hand, the tests should be executable in real browsers (not only in headless browsers such as PhantomJS). The former favors tools with a JavaScript API, the latter Selenium-based tools. The Selenium basis makes it possible to test the peculiarities of the individual browsers. If you do not have your own infrastructure with test devices, you can use services such as BrowserStack or Sauce Labs.

With its tool Gemini, the Yandex team has developed a solid solution to the visual regression problem. Thanks to its good community and performance, it was part of the selection for the thesis. The original selection also included the WebdriverCSS tool in conjunction with WebdriverIO for browser automation and Applitools Eyes for image comparison. However, WebdriverCSS is no longer actively developed; instead, the SDK eyes.webdriverio.javascript developed by Applitools is used with WebdriverIO. Applitools Eyes is the only commercial tool in the selection and offers the possibility of inspecting the results of the image comparisons via its web application. It was included mainly to show where the advantages of intelligent image processing lie when comparing images.

Implementation of the tests

At this point, two things in particular should be shown:

  1. Difference between DOM-based and visual regression tests
  2. The problem of false positives in visual regression tests

The “tick” function of the to-do list of our test app is used as the test scenario. The following figure shows the scenario: the initial state of the AUT can be seen on the left and the target state to be tested on the right.

DOM-based vs. visual regression tests

For the first part of the implementation, the test scenario is implemented both as a DOM-based test with Protractor and as a visual test with Gemini.

```javascript
describe('todo-check', () => {
  beforeEach(() => {
    browser.get('http://localhost:8080/examples/angular2/');
  });

  it('should mark "Write UI tests" as done and update UI accordingly', () => {
    let todoCount = element(by.css('todo-app .todoapp .footer span.todo-count strong'));
    let todoEntry = element(by.css('ul.todo-list li:last-child'));
    let todoEntryCheckbox = todoEntry.element(by.css('input'));

    todoEntryCheckbox.click().then(() => {
      expect(todoCount.getText()).toBe('0');
      expect(todoEntryCheckbox.isSelected()).toBe(true);
      expect(todoEntry.getAttribute('class')).toMatch('completed');
    });
  });
});
```

The corresponding visual test with Gemini:

```javascript
gemini.suite('todo-check', (suite) => {
  suite.setUrl('http://localhost:8080/examples/angular2/')
    .setCaptureElements('html')
    .before((actions, find) => {
      this.todoEntryCheckbox = find('ul.todo-list li:last-child input');
      this.todoEntryLabel = find('ul.todo-list li:last-child label');
    })
    .capture('todo-ui_tests-done', (actions, find) => {
      actions.click(this.todoEntryCheckbox);
    });
});
```

After the baseline has been created for this test, a change is made in the CSS that has "unwanted" effects on the display: the padding of all input elements is overridden with 50px (see the following illustration of the result of the visual regression test). Since this is a pure CSS change, the DOM-based test does not notice anything and continues to pass. However, because the input elements now protrude into other clickable areas, the UI is no longer easy to use for an end user, and its functionality is therefore restricted.

The result of a new test run with Gemini shows that the tool recognizes the regression and the test fails. At this point, the developer has the option of correcting the errors or applying changes to the new baseline using the "Accept" button.

At first, the two implementations differ only minimally in code length. However, if you also want the DOM-based tests to check whether the other to-dos are in the correct state after the test run and have not been influenced, more code is added for each element. With the visual tests you get this additional coverage of all visible UI elements for free (see the following figure).

The problem of false positives

In the second part of the implementation, in addition to the Gemini test, another visual test is to be created using WebdriverIO and Applitools Eyes. WebdriverIO is a Node.js-based implementation of the W3C WebDriver protocol and is intended to help bring the AUT into the state to be tested. This state is then recorded in the form of a screenshot and sent via HTTP request to Applitools Eyes, where the image comparison takes place. The baseline is recorded again in advance and automatically saved in Applitools Eyes. At this point the two tool setups (Gemini vs. WebdriverIO + Applitools Eyes) should be compared.

Nothing has changed in the Gemini test. The only difference is the deliberately introduced change in the To-Do app, which should be recognized in the second pass. Again, this is a pure CSS change: the margin-top of section.todoapp (the to-do list; the title also depends on its position) is enlarged by one pixel. That is, the entire content slides down by one pixel.

The result of the Gemini test can be seen in the figure below. The test fails and marks all non-white pixels as error pixels, although the difference between the baseline (called "Reference" in Gemini) and the current capture is not visible to humans. A further difficulty with this type of false positive is that the source of the error is not immediately apparent. If the background of the to-do list were not white but had a texture (with a height > 1 pixel), not only the content would be highlighted in the difference image, but the entire area from the height of the shifted pixel downwards.
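The one-pixel shift can be reproduced with a toy model. In this hypothetical sketch (both function names are my own), images are plain arrays of grayscale rows; shifting the content down by one row leaves the image visually almost unchanged, yet a naive comparison flags every row that touches content as differing:

```javascript
// Naive pixel-by-pixel comparison on a toy grayscale image model
// (arrays of rows of brightness values).
function countDifferingPixels(a, b) {
  let diff = 0;
  for (let y = 0; y < a.length; y++) {
    for (let x = 0; x < a[y].length; x++) {
      if (a[y][x] !== b[y][x]) diff++;
    }
  }
  return diff;
}

// Simulate the CSS regression: push the whole content down by one row,
// filling the top with the (white) background and dropping the last row.
function shiftDownByOneRow(image, background = 255) {
  return [Array(image[0].length).fill(background), ...image.slice(0, -1)];
}
```

A single black content row on a white background already causes every pixel of that row, plus its old position, to be counted as an error pixel, which mirrors the Gemini result described above.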

The implementation of the test with WebdriverIO in conjunction with Applitools Eyes looks like this:

```javascript
const wdio = require('webdriverio');
const { Eyes, Target } = require('@applitools/eyes.webdriverio');

const browserOptions = {
  remoteHost: 'http://localhost:4444',
  desiredCapabilities: {
    browserName: 'chrome',
    chromeOptions: {
      args: []
    }
  }
};

const driver = wdio.remote(browserOptions);
let browser = driver.init();
```