The WebDriver specification came to life as a blueprint description of Selenium’s behaviour, with the aim of turning the de facto browser automation solution into a specification that would become a de jure standard. Along the way we have corrected quirks and oddities in the existing work to make commands cohesive units that form part of a larger, more consistent protocol.
Now that almost all of the formal remote end steps are defined, we are looking closer at the relationship among different commands. A part of this is questioning the command primitives, and a current burning issue is the approximation to visibility.
For looking up and interacting with elements, a series of preconditions must be met, one of which is that the element is deemed visible. The visibility of an element is said to be guided by what is “perceptually visible to the human eye”. This is a tough mark to hit since the other web standards we rely on refuse to go anywhere near this topic. It turns out that giving an exhaustive definition of what it means to be visible to the user is extremely difficult.
From Selenium the specification has inherited a long and complex algorithm that gives a crude approximation of visibility based on the element’s nature and its relationships in the tree. This was further developed to take into account more things that we knew Selenium was missing.
The specification gives a highly non-normative summary of what it means by element visibility:
An element is in general to be considered visible if any part of it is drawn on the canvas within the [boundaries] of the viewport.
Because many HTML features carry special meaning, and because the ancestral relationships between elements affect their visibility, it goes on to describe the steps of an algorithm that traverses the tree up and down, starting from the element whose visibility it is trying to determine.
Practically, it looks at certain computed (resolved) style properties that are generally known to make an element invisible.
The visibility of certain elements depends upon the characteristics of an ancestor: for these it traverses up the document until it finds that ancestor, then applies the same checks there.
Because many HTML elements need special treatment, each has separate rules defined for it. These include rules for <area>, text nodes, and <img>, but also many more.
Following the explicit hiding rules, if the element has more than a single direct descendant element, its own visibility will depend upon the visibility of its children. The same is true if any of its ancestors in tree order fail the visibility test, in which case the invisibility cascades, or trickles down, to all child elements.
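The up-and-down walk described above can be sketched on a toy tree. This is a minimal illustration, not the specification’s algorithm: the `Node` class, the `hidden_by_style` helper, and the three style properties it checks are all assumptions made for the example, and none of the per-element special cases are modelled.

```python
class Node:
    """Toy stand-in for a DOM element (hypothetical, not a browser API)."""

    def __init__(self, tag, style=None):
        self.tag = tag
        self.style = dict(style or {})
        self.children = []
        self.parent = None

    def append(self, child):
        child.parent = self
        self.children.append(child)
        return child


def hidden_by_style(node):
    # Assumed set of hiding properties, chosen only for illustration.
    return (
        node.style.get("display") == "none"
        or node.style.get("visibility") == "hidden"
        or node.style.get("opacity") == "0"
    )


def subtree_displayed(node):
    # Downward walk: a node with more than one child is only shown
    # if at least one of those children is shown.
    if hidden_by_style(node):
        return False
    if len(node.children) > 1:
        return any(subtree_displayed(child) for child in node.children)
    return True


def is_displayed(node):
    # Upward walk: a hidden ancestor hides everything beneath it.
    ancestor = node.parent
    while ancestor is not None:
        if hidden_by_style(ancestor):
            return False
        ancestor = ancestor.parent
    return subtree_displayed(node)


# Example tree: a button inside a hidden panel, and a list whose items are all hidden.
body = Node("body")
panel = body.append(Node("div", {"display": "none"}))
button = panel.append(Node("button"))
menu = body.append(Node("ul"))
menu.append(Node("li", {"visibility": "hidden"}))
menu.append(Node("li", {"visibility": "hidden"}))
```

Here `is_displayed(button)` is false because of its ancestor, and `is_displayed(menu)` is false because none of its children pass. Each call may visit every ancestor and every descendant, which hints at why the recursion is costly.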
What we arrive at is a recursive algorithm that, although extremely inefficient, tells if a node is visible within the constraints of the tree it is part of. But because it looks solely at the part of the tree an element is in, overlapping elements from other parts of the document are not considered.
The tree-traversal approach also entirely avoids addressing issues around absolute positioning, custom elements, overflow, and device concepts such as the initial and actual viewport. Owing to some of the provisions it makes around how border widths influence a parent element’s effective dimensions, or preconceived ideas on how input widgets are styled by user agents, it furthermore ascribes meaning, or interpretation, to areas where the web platform is bad at exposing the primitives.
A suggested alternative to tree traversal is a form of hit-testing that involves testing which element is under the cursor, and then doing this for each coordinate of the element’s bounding box that is inside the viewport. This has the upside of avoiding the problems associated with tree traversal altogether, but the downside of being extremely inefficient: in the worst case it needs O(n) hit tests, one for every coordinate that is tested.
It is also complicated by the fact that the element inside the bounding box does not necessarily fill the entire rectangle. This is true if the element is clipped, has a degree of rotation, or similar. The primitives offered to us give little guidance for tracking the exact path of an element’s shape.
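To make the hit-testing idea concrete, here is a minimal sketch on a toy model in which every element is an axis-aligned rectangle with a stacking order. The `Box` class, `element_from_point`, and `is_hit_visible` are hypothetical names invented for this example; a real implementation would rely on the browser’s own hit-testing and would have to cope with clipped and rotated shapes, which this model ignores.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Toy element: an axis-aligned rectangle with a stacking order."""
    name: str
    x: int
    y: int
    width: int
    height: int
    z: int = 0

    def contains(self, px, py):
        return (self.x <= px < self.x + self.width
                and self.y <= py < self.y + self.height)


def element_from_point(boxes, px, py):
    # Return the topmost box covering the point, or None if nothing is there.
    hits = [box for box in boxes if box.contains(px, py)]
    return max(hits, key=lambda box: box.z) if hits else None


def is_hit_visible(target, boxes, viewport_width, viewport_height):
    # Clamp the bounding box to the viewport, then hit-test every coordinate.
    left = max(target.x, 0)
    top = max(target.y, 0)
    right = min(target.x + target.width, viewport_width)
    bottom = min(target.y + target.height, viewport_height)
    for py in range(top, bottom):
        for px in range(left, right):
            if element_from_point(boxes, px, py) is target:
                return True
    return False


button = Box("button", 10, 10, 20, 10, z=1)
overlay = Box("overlay", 0, 0, 100, 100, z=2)
```

With the full-page `overlay` stacked above it, `is_hit_visible(button, [button, overlay], 100, 100)` has to test all 200 coordinates of the button before it can answer false, which illustrates the cost mentioned above.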
The tree traversal algorithm is also limited to HTML. Other document types such as SVG and canvas have entirely different visibility primitives, and the idea of implicit ancestor visibility simply makes no sense in these contexts.
Shadow DOM is similarly problematic because it introduces the concept of multiple DOMs. Elements that have a Shadow DOM attached do not expose the same standard DOM traversal and query APIs as regular elements. They also have the idea of scoped stylesheets, whereby it’s possible to style implementation details using a <style> element that applies only to the local scope. Matching rules are also constrained to the same DOM, meaning style selectors in the host document do not match inside a Shadow DOM.
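The scoping problem can be illustrated with a toy model in which a shadow tree is stored separately from an element’s ordinary children. The `Element` class and `query_all` function below are hypothetical and only mimic the behaviour described; they are not the DOM API.

```python
class Element:
    """Toy element whose shadow tree is kept apart from its children."""

    def __init__(self, tag, children=None, shadow_root=None):
        self.tag = tag
        self.children = list(children or [])
        # Assumption for this sketch: a shadow tree is just a separate list of roots.
        self.shadow_root = list(shadow_root or [])


def query_all(roots, tag):
    # Match by tag name within one tree; never crosses into a shadow tree.
    found = []
    stack = list(reversed(roots))
    while stack:
        element = stack.pop()
        if element.tag == tag:
            found.append(element)
        stack.extend(reversed(element.children))
    return found


host = Element("video-player", shadow_root=[Element("button")])
document = [Element("button"), host]
```

A query from the document finds only the outer `button`; the one inside the host’s shadow tree is reachable only by querying `host.shadow_root` explicitly. A visibility walk that starts from the document would miss it in the same way.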
CSS Transforms are not an issue in themselves, but since the tree traversal algorithm is not a single atomic action, it does not take into account that the state of the document may change whilst it is executing. It should be possible to work around this issue by evaluating the entire algorithm on a sandboxed snapshot of the tree, but again, this exposes us to non-standardised realms of the web platform.
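The snapshot idea can be sketched as follows. This is only an analogy: `copy.deepcopy` stands in for a snapshot primitive that, as noted, the web platform does not offer in any standardised form, and `Node` and `any_hidden` are hypothetical names for a toy tree and a multi-step check.

```python
import copy


class Node:
    """Toy element with a style dictionary and child nodes."""

    def __init__(self, tag, style=None, children=None):
        self.tag = tag
        self.style = dict(style or {})
        self.children = list(children or [])


def any_hidden(node):
    # Stand-in for a multi-step walk: is anything in this subtree hidden?
    if node.style.get("display") == "none":
        return True
    return any(any_hidden(child) for child in node.children)


live = Node("body", children=[Node("div", children=[Node("button")])])

# Freeze the state first, then let the live tree change underneath us.
snapshot = copy.deepcopy(live)
live.children[0].style["display"] = "none"
```

Evaluating `any_hidden(snapshot)` gives an answer that is consistent with the moment the copy was taken, whereas `any_hidden(live)` reflects whatever the document has become since.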
By this point it should almost go without saying that providing a consistent, future-proof method of ascertaining visibility is futile. Whilst WebDriver does have access to the browser internals required to solve this problem, the solution likely lies outside the scope of specifying a remote control protocol.
As we cannot guarantee that new platform features will be taken into account, the tree-traversal approach offered by Selenium is not a building block for the future. It’s a hacky approach that may work reasonably well given the naïve narrative of the simple, static web document world of 10 years ago.
To provide a future-proof notion of naked-eye visibility, there’s a need to push for separate, foundational platform APIs that other standards, in turn, must relate to. WebDriver’s primary role is exposing a privileged remote control interface that enables introspection and control of user agents, not defining element or textual visibility. When such primitives are made available, WebDriver should be an obvious consumer of them.