This is the final installment of my three-part saga on the technical details of how screen readers communicate with other software. Part 1 introduced accessibility APIs (application programming interfaces) and the accessibility tree, and Part 2 provided a history lesson of how screen readers learned to cope with graphical user interfaces (GUIs).
Although I thought I understood the topic, a little reading turned up nagging gaps and discrepancies among accounts circulating within the accessibility community: the APIs are part of the operating system; no, the APIs are hard-coded into the browser; screen readers do or do not read the DOM directly; different accessibility APIs are supported by different browsers, with no explanation why one would be used over another. Even how application programming interfaces actually work was at issue. Have you ever ducked down a rabbit hole only to fall feet-first out the other side surrounded by people speaking Hindustani? I have.
I needed a developer's perspective and therefore reached out to Matt Campbell, whose resume includes authorship of the System Access screen reader, as well as work at Microsoft on the built-in Narrator screen reader and the latest version of its accessibility API, UI Automation (UIA). Matt has been incredibly generous with his time and detailed explanations. We had the most luck when we pretended I was a four-year-old learning English as a second language.
The most surprising thing I learned was that, when screen readers process web pages, they very often inject their own code into the running web browser application process in order to extract information. Know what else does this? Malware. Know what else? Basically nothing.
Below, we'll dig into assistive technology support within web browsers, past and present.
Browse Mode, MSAA, and the DOM
Windows screen readers very early hit upon a terrific strategy for highly-efficient web page review that has endured, largely unchanged, for over twenty years. It centers around creating a buffered copy of a web page that the user can review like a standard text document. This may in part be for historical reasons, since most browsers themselves now offer a limited caret navigation mode. Each screen reader calls its review mode something different, but NVDA's "browse mode" tends to be used as the generic term. In addition to virtual cursor movement, users can jump among headings, lists, tables, figures, forms, specific form controls, and dozens of other element types using single keypresses while in browse mode. To interact with most controls on the page, one exits browse mode—a state often referred to as forms mode for historical reasons, even though a more accurate description would be "the absence of browse mode" (Narrator takes this tack). VoiceOver does not require a special review mode but has long featured QuickNav for single-key element navigation, a fact often overlooked by (sighted) accessibility professionals. Finally, users can instantly review a web page's structure by pulling up lists of certain common elements found on the page—testers love that one. These features underscore why it is critical for web developers to use available markup that reflects the meaning of page elements, a practice known as semantic HTML.
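For readers who think in code, here is a toy sketch of the browse mode idea: flatten a page into a document-order buffer, then jump a cursor among elements of one type. Everything in it (the Node class, build_buffer, next_of_role, the role names) is hypothetical and invented for illustration; it is not how any particular screen reader is implemented.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """A hypothetical page element: a role, some text, and child elements."""
    role: str
    text: str = ""
    children: list["Node"] = field(default_factory=list)


def build_buffer(root: Node) -> list[tuple[str, str]]:
    """Flatten the tree into document-order (role, text) lines."""
    lines: list[tuple[str, str]] = []

    def walk(node: Node) -> None:
        if node.text:
            lines.append((node.role, node.text))
        for child in node.children:
            walk(child)

    walk(root)
    return lines


def next_of_role(buffer: list[tuple[str, str]], cursor: int, role: str) -> Optional[int]:
    """Index of the next line with the given role after the cursor, or None."""
    for i in range(cursor + 1, len(buffer)):
        if buffer[i][0] == role:
            return i
    return None


page = Node("document", children=[
    Node("heading", "Welcome"),
    Node("paragraph", "Some introductory text."),
    Node("link", "Read more"),
    Node("heading", "Contact"),
    Node("paragraph", "Write to us."),
])

buffer = build_buffer(page)
cursor = next_of_role(buffer, -1, "heading")      # first heading: "Welcome"
cursor = next_of_role(buffer, cursor, "heading")  # second heading: "Contact"

# A crude "elements list": every heading on the page, in order.
headings = [text for role, text in buffer if role == "heading"]
```

Single-key navigation and element lists both depend on each line carrying a meaningful role. If those headings were marked up as styled paragraphs, neither feature would find them, which is the practical argument for semantic HTML.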
The story often told is that screen reader access to page semantics came all at once with Internet Explorer 5's Microsoft Active Accessibility (MSAA) support, at which point we partied like it was 1999, because it was. We already had a virtual document we could arrow around, and the off-screen models (OSMs) discussed in Part 2 could generally recognize links and standard form controls using heuristics. On the other hand, OSMs supposedly could not make sense of columns, tables, frames, or even headings, much less other markup.
Two factors complicate this picture. One is purely of historical interest, while the other remains very pertinent today.
First, just for the record, browse mode features were not necessarily tied to MSAA. Although Window-Eyes immediately went all-in with its "MSAA Mode," JAWS support came more slowly and perhaps even grudgingly. JAWS users instead touted the far greater speed of the "reformat page" hotkey that removed columns, among other things I can't recall. JAWS already offered navigation by heading and a links list through direct parsing of the HTML source code. When JAWS introduced its single-key element navigation in 2002, the release notes made a point of stating, "information comes right from the HTML used to create the page". Window-Eyes added the same feature within a year, relying on MSAA. OSMs meanwhile continued to play a role in Internet Explorer for both screen readers.
Second, at least as important as MSAA was the DOM API, introduced with Internet Explorer 5. As Matt Campbell told me, "The original MSAA API was so simplistic that IE couldn't expose very much that way, and we all had to use the DOM API. With the rise of ARIA, those of us that relied exclusively on the IE DOM had to retrofit some use of MSAA after all." The shiny new DOM specification provided the means for platform-independent dynamic scripting of web pages, but the same object-oriented information empowered screen readers: both needed object roles and their properties to be programmatically determinable.
Today, Chrome and Firefox implement the ISimpleDOM API, which JAWS and NVDA use to access information unavailable through accessibility APIs. MathML, for instance, is a W3C standard that provides semantic and presentational tags for math content. It is XML, not HTML, and isn't currently part of the accessibility tree. (Safari similarly exposes MathML through its WebKit engine. You can use a MathML test page to evaluate the state of support on other platforms.)
Other software applications can have their own DOMs, and screen readers tap them. Window-Eyes release notes from 2005 announced "100% text accuracy, 100% of the time thanks to the use of the Microsoft Word DOM … This is a first in the screen reader industry." Multi-column documents likewise became accessible, though all this in my experience rendered Window-Eyes sluggish because of information being passed one piece at a time through inter-process communication. By the end of that year, JAWS introduced virtual cursor mode and navigation quick keys into Word, which could not have been done through an OSM. Unlike the W3C standard, however, each application's DOM is different, requiring special-case programming by screen reader developers.
The combination and proportions of OSM, DOM, HTML parsing, and MSAA constituted each screen reader's secret sauce. Taken together, these tools provided fairly stable web access through the early 2000s. Then, despite having died shortly after the Trojan War, AJAX came along and broke it.
IAccessible2: The API Strikes Back
Two factors at the heart of "Web 2.0" rendered MSAA and the DOM API inadequate. The first was the host of new control types invented by web developers and packed into <div> and <span> elements, among other techniques. MSAA simply had no vocabulary for their roles or properties. As noted in Part 1, ARIA was the solution from the source-code side.
Second, recall that the browse mode buffer is constructed when the page loads. MSAA event notifications were very often unable to report dynamic changes thereafter. One work-around was the introduction of a screen reader hotkey to refresh the virtual buffer; sometimes, this worked. Press a submit button, apparently nothing happens, explore the entire page to find nothing, refresh the browse mode buffer, explore the page again, possibly finding the form submission confirmation or error if one is lucky. Note that, unless the developer alerts the user through keyboard focus management, an ARIA live region, or pop-up dialog, all of these steps, other than refreshing the buffer, remain very much ongoing issues today—but at least the dynamic content generally exists now…someplace… if one keeps arrowing down a few hundred more times to find it.
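The stale-buffer problem is easy to reproduce in miniature. The sketch below is a toy model under loud assumptions: the page is just a list of strings, and snapshot and find are hypothetical helpers standing in for buffer construction and for the user arrowing through the page.

```python
def snapshot(page: list[str]) -> list[str]:
    """Copy the page's current lines into a buffer (hypothetical helper)."""
    return list(page)


def find(buffer: list[str], needle: str) -> bool:
    """Simulate the user reading the whole buffer looking for some text."""
    return any(needle in line for line in buffer)


# The "live" page, standing in for the browser's current content.
page = ["Name: [edit]", "Submit [button]"]

# The buffer is built once, when the page loads.
buffer = snapshot(page)

# The user presses Submit. A script inserts an error message, but in this
# model no event reaches the screen reader, so the buffer is not updated.
page.insert(1, "Error: name is required")

found_before = find(buffer, "Error")  # the buffer is stale
buffer = snapshot(page)               # the manual "refresh buffer" hotkey
found_after = find(buffer, "Error")   # the message was on the page all along
```

Here found_before is False and found_after is True. The content was present in the page the whole time; only the copy being read was out of date.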
Microsoft introduced its newer UIA accessibility API in 2006 with Windows Vista. Because UIA was a considerable departure from the MSAA code, however, its scope and performance required years to evolve. That process continues today. In late 2006, the IAccessible2 API arrived, a platform-independent open standard developed by IBM, working closely with screen reader developers and corporate stakeholders. Unlike UIA, IAccessible2 extended MSAA's IAccessible code library to fix its shortcomings. It therefore constituted a mature solution almost immediately. According to Matt, Firefox quickly implemented it alongside its existing MSAA support, while Google Chrome followed suit in the early 2010s. Meanwhile, Internet Explorer would ultimately rely on a scaled-down version of UIA. To clear up one of the puzzles noted at the outset, IAccessible2, which is what JAWS and NVDA use today, is not a Windows platform API: the libraries are part of the browser.
Screen reader developers have preferred IAccessible2 for another reason as well: it allows direct access to the browser's API implementation using the same low-level hooking techniques employed by off-screen models (see Part 2).
Remember inter-process communication (IPC) from Part 1? That's the polite, secure, reliable process of handing information back and forth between applications one by one through an operating system API acting as intermediary. Conversely, insertion of low-level hooks directly into running applications is an example of what is politely known as in-process communication and more descriptively as code injection. Matt provided a very clear explanation, so I'll just let it rip.
- ATs are basically the only non-malicious programs that use this technique, and it really is as invasive as the word "injection" sounds, in that the AT has forced some of its code to run inside the other application's space (though with the operating system's permission). But what the code does once it's injected depends on the application.
- For applications that use GDI to render text, NVDA still intercepts those graphics function calls and creates an off-screen model. This is the most invasive use of code injection, as NVDA is redirecting some operating system functions inside the application to call into code provided by NVDA first.
- When you access web content, none of the screen readers are using an off-screen model (though they used to a long time ago). The screen readers are using IAccessible2 and ISimpleDOM according to their documented interfaces. However, the injected code allows the screen reader to access those interfaces without repeatedly going back and forth between the screen reader and browser processes.
- When a web page loads, JAWS and NVDA need to go through every element on the page to create a virtual buffer. If they were to use IAccessible2 only through IPC, which is the most robust and secure way of doing it, then they'd have to send many, many messages back and forth between the screen reader and browser processes; and, even as fast as computers are, that's relatively slow. But with code injection, some of the screen reader's code can run directly inside the browser, gather all the information it needs for a virtual buffer (which requires some complex logic specific to the screen reader), then communicate back to the main screen reader process at the end.
In fact, a remarkable passage from early Window-Eyes release notes suggests that they may have made this same discovery and switched from IPC to code injection in Internet Explorer: "a web page that once took seven minutes to load now takes 12 seconds. It is important to note that this is the MSAA load time, not the download time." Anyone who used Narrator prior to the version of UIA available in Windows 11 will likely have found it sluggish compared to JAWS or NVDA. IPC was a major culprit.
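To put rough numbers on Matt's explanation, here is a toy model that only counts messages crossing the process boundary. It assumes a made-up Browser class in which every outside query costs one round trip; the names ipc_get_name, ipc_get_children, and ipc_run_in_process are invented for this sketch and are not IAccessible2 or ISimpleDOM calls. It also says nothing about how injection itself is performed.

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    """A hypothetical node in the browser's accessibility tree."""
    name: str
    children: list["Element"] = field(default_factory=list)


class Browser:
    """Stand-in for a browser process that counts cross-process round trips."""

    def __init__(self, root: Element) -> None:
        self.root = root
        self.round_trips = 0

    # Queries made from outside the process: one message each.
    def ipc_get_name(self, el: Element) -> str:
        self.round_trips += 1
        return el.name

    def ipc_get_children(self, el: Element) -> list[Element]:
        self.round_trips += 1
        return el.children

    # Code running inside the process: one message to hand back the result.
    def ipc_run_in_process(self, func):
        self.round_trips += 1
        return func(self.root)


def buffer_via_ipc(browser: Browser) -> list[str]:
    """Build the buffer by asking the browser about each element in turn."""
    lines: list[str] = []

    def walk(el: Element) -> None:
        lines.append(browser.ipc_get_name(el))
        for child in browser.ipc_get_children(el):
            walk(child)

    walk(browser.root)
    return lines


def buffer_in_process(browser: Browser) -> list[str]:
    """Build the same buffer with the walk running 'inside' the browser."""
    def gather(root: Element) -> list[str]:
        lines: list[str] = []

        def walk(el: Element) -> None:
            lines.append(el.name)  # direct access, no message needed
            for child in el.children:
                walk(child)

        walk(root)
        return lines

    return browser.ipc_run_in_process(gather)


def make_page(sections: int, items: int) -> Element:
    """A page with the given number of sections, each holding some items."""
    return Element("page", [
        Element(f"section {s}", [Element(f"item {s}.{i}") for i in range(items)])
        for s in range(sections)
    ])


slow = Browser(make_page(10, 20))
fast = Browser(make_page(10, 20))
same_result = buffer_via_ipc(slow) == buffer_in_process(fast)
print(slow.round_trips, fast.round_trips)
```

Both approaches produce the same buffer, but in this model the first pays two messages per element while the second pays one in total. On a real page with thousands of elements and many properties per element, that gap is the difference the Window-Eyes release notes were describing.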
Apple and Google operating systems don't allow code injection. Windows-based Firefox and Chrome increasingly keep their doors locked while continuing to give assistive technology a pass. Code injection's days are numbered. However, while the use of code injection to access browser content isn't as unreliable as an off-screen model, it's still a concern for both reliability and security. A bug in the screen reader code that runs inside the browser process can crash or hang the browser; maybe you've seen this happen. It's also plausible that a bug in the injected screen reader code could cause a security vulnerability in the browser for screen reader users. I'm not aware of an actual instance of this, but it could happen. Unfortunately, the developers of NVDA and JAWS don't have a strong incentive to do all the work to completely replace code injection with IPC, and it's not entirely clear how they'd go about doing that, even with the new APIs in Windows 11. If they're going to preserve all the current functionality with anything close to the current speed, it's a lot more complicated than flipping a switch. The AT developers might still need more help from the browser developers and even Microsoft. Unless some kind of security incident forces Windows or the browser developers to crack down on code injection, it's easier for everyone to stick with the status quo.
The injected code can continue to process dynamic page events as well. However, according to Matt, NVDA uses IPC when it can, and user actions like activating controls or typing text would normally be sent to the browser using IPC.
NVDA is open source and is therefore easy to examine—not to mention that its creators, Jamie Teh and Mic Curran, freely share their expertise. The same concepts apply to JAWS, however. Given its age, in fact, JAWS almost inevitably relies much more on code injection than NVDA across the system. Many of its user-configurable heuristics lead me to suspect it gathers much more from the DOM than NVDA as well. For that matter, if some need came along to once again parse HTML source code, I have no doubt that bits of long-dormant JAWS code would awaken like prehistoric bacteria rising from thawed permafrost.
The State of Play in 2023
Well, now that this series has dispensed with the preamble [audience members turn to one another in horror], here are the facts as I now understand them about accessibility API implementations.
Windows screen readers rely on MSAA, as well as a few other Windows APIs, in older areas of Windows like the desktop and taskbar, while UI Automation provides access to components added since Windows 8.
JAWS and NVDA use IAccessible2 in Chrome, Firefox, and Chromium-based Edge. They additionally use ISimpleDOM when they need information not able to be plucked from the accessibility tree. These are code libraries incorporated into the browsers, not Windows.
According to Matt, "Both Firefox and Chrome have more or less ignored UI Automation for all this time. The Edge accessibility team have contributed their UIA implementation to Chromium, but it's still not turned on by default in Chrome."
A few years ago, Microsoft incorporated a bridge that allows ATs that rely on UIA in web browsers—i.e., Narrator—to communicate with applications that use IAccessible2—i.e., Chrome and Firefox. In a case of "tail wags dog," this bridge continues to interact with ATs solely through IPC but injects its code into the browser whenever possible for the performance boost. This is what's happening under the hood when using Narrator in those browsers. On the other hand, Narrator predictably uses UIA in Microsoft Edge.
Mac, iOS, and Android all implement their platform APIs throughout their systems, including third-party browsers. If VoiceOver began to support IAccessible2 or UIA, other Mac and iOS browsers would be ready. But that's not likely to happen, and Apple's UIAccessibility API performs well. Is anyone here surprised that Windows is way more complicated than anything else?
The nice thing about the past—other than the fact that I was younger, thinner, and could eat carbs like it was the eve of Judgment Day—is that it helps us make educated guesses about the future. What seems likely is that Windows will sooner or later fall in line with other operating systems by shutting down third-party code injection. Screen reader developers will then be forced to undertake the work Matt mentioned, and everyone will indeed use the Windows platform API, the performance of which will by then very likely be up to the task.
And once that happens, dear reader, likely only you and I will notice.