Improving the Accessibility of Scientific Documents: Current State, User Needs, and a System Solution to Enhance Scientific PDF Accessibility for Blind and Low Vision Users
The majority of scientific papers are distributed in PDF, which poses challenges for accessibility, especially for blind and low vision (BLV) readers. We characterize the scope of this problem by assessing the accessibility of 11,397 PDFs published 2010-2019 sampled across various fields of study, finding that only 2.4% of these PDFs satisfy all of our defined accessibility criteria. We introduce the SciA11y system to offset some of the issues around inaccessibility. SciA11y incorporates several machine learning models to extract the content of scientific PDFs and render this content as accessible HTML, with added novel navigational features to support screen reader users. An intrinsic evaluation of extraction quality indicates that the majority of HTML renders (87%) produced by our system have no or only some readability issues. We perform a qualitative user study to understand the needs of BLV researchers when reading papers, and to assess whether the SciA11y system could address these needs. We summarize our user study findings into a set of five design recommendations for accessible scientific reader systems. User response to SciA11y was positive, with all users saying they would be likely to use the system in the future, and some stating that the system, if available, would become their primary workflow. We successfully produce HTML renders for over 12M papers, of which an open access subset of 1.5M are available for browsing at https://scia11y.org/.
When researchers, students, and other individuals who are blind or low vision (BLV) interact with scientific PDFs through screen readers, document structure tags, labeled reading order, labeled headers, and image alt-text are necessary to facilitate these interactions. However, these features must be painstakingly added by authors using proprietary software tools, and as a result, are often missing from papers. Low vision or dyslexic readers who interact with PDFs through screen magnification or text-to-speech may also find the complexity of certain academic paper PDF formats challenging, e.g., non-linear layout can interrupt the flow of text in a magnifying tool. Inaccessible paper PDFs can lead to cognitive overload, frustration, and abandonment of reading for BLV readers.
Unfortunately, we find that the majority of scientific PDFs lack basic accessibility features. We estimate based on a sample of 11,397 PDFs from multiple fields of study that only around 2.4% of paper PDFs released in the last decade satisfy all of the aforementioned accessibility requirements. Accessibility challenges for academic PDFs are largely due to three factors: (1) the complexity of the PDF file format, which makes it less amenable to certain accessibility features, (2) the dearth of tools, especially non-proprietary tools, for creating accessible PDFs, and (3) the dependency on volunteerism from the community with minimal support or enforcement. The intent of the PDF file format is to support faithful visual representation of a document for printing, a goal that is inherently divergent from that of document representation for the purposes of accessibility. Though some professional organizations like the Association for Computing Machinery (ACM) have encouraged PDF accessibility through standards and writing guidelines, uptake among academic publishers and disciplines more broadly has been limited.
While policy changes help, the fact remains that most academic PDFs produced today, and historically, are inaccessible, yet PDF remains the dominant way to read those papers. A long-range solution will necessitate buy-in from multiple stakeholders: publishers, authors, readers, technologists, granting agencies, and the like. But in the interim, there are technological solutions that can be offered as a sort of "band-aid" to the problem. We use this paper to offer an in-depth qualitative and quantitative description of the problem as it stands, and to introduce one such technological solution: the SciA11y system, which automatically extracts semantic information from paper PDFs and re-renders this content in the form of an accessible HTML document. Though the process is imperfect and can introduce errors, we demonstrate the ability of the rendered HTML to reduce cognitive load and facilitate in-paper navigation and interactions for BLV users.
The goals and contributions of this paper are three-fold:
(1) We characterize the state of academic-paper PDF accessibility by estimating the degree of adherence to accessibility criteria for papers published in the last decade (2010-2019), and describe correlations between year, field of study, PDF typesetting software, and PDF accessibility.
(2) We propose an automated approach for extracting the content of academic PDFs and displaying this content in a more accessible HTML document format. We build a prototype that re-renders 12 million PDFs in HTML, and describe the design decisions, features, and quality of the renders (assessed as faithfulness to the source PDF). We perform expert grading of the rendered HTML and report an error analysis. A demo of our system is available at scia11y.org, which makes available 1.5M HTML renders of open access PDFs.
(3) We conduct an exploratory user study with six BLV scholars to better understand the challenges they experience when reading academic papers and how our proposed tool might augment their current workflow. During the study, we ask users to interact with the prototype and offer feedback for its improvement. We perform open coding of interviews to identify existing reading challenges, coping mechanisms, as well as positive and negative responses to prototype features. We summarize the findings of this user study into a set of design recommendations.
Fig. 1. A schematic for creating the SciA11y HTML render from a paper PDF. Starting with the raw two-column PDF on the left, S2ORC is used to extract title, authors, abstract, section headers, body text, and references. S2ORC also identifies links between inline citations and references to figures and table objects. DeepFigures is used to extract figures and tables, along with their captions. The outputs of these two models are merged with metadata from the Semantic Scholar API. Heuristics are used to construct a table of contents, to insert figures and tables in the appropriate places in the text, and to repair broken URLs. We add HTML headers as illustrated (header tags for sections, paragraph tags for body text, and figure tags for figures and tables); highlighted components (table of contents and links in references) are not in the PDF and are novel navigational features that we introduce to the HTML render. An example HTML render of parts of a paper document is shown to the right (the actual render is single column, which is split here for presentation).
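The rendering step described in the Fig. 1 caption (header tags for sections, paragraph tags for body text, and a generated table of contents) can be illustrated with a minimal sketch. This is not the SciA11y implementation: the function name `render_paper`, the `sec-N` anchor scheme, and the input shape (a title plus a list of heading/paragraph pairs) are assumptions made only for illustration.

```python
from html import escape
from typing import List, Sequence, Tuple


def render_paper(title: str, sections: Sequence[Tuple[str, List[str]]]) -> str:
    """Render extracted sections as HTML with a linked table of contents.

    `sections` is a hypothetical input shape: (heading, paragraphs) pairs.
    """
    toc_items = []
    body = []
    for index, (heading, paragraphs) in enumerate(sections, start=1):
        anchor = f"sec-{index}"  # assumed anchor naming scheme
        # Table-of-contents entry linking to the section heading.
        toc_items.append(f'<li><a href="#{anchor}">{escape(heading)}</a></li>')
        # Header tag for the section, paragraph tags for its body text.
        body.append(f'<h2 id="{anchor}">{escape(heading)}</h2>')
        body.extend(f"<p>{escape(text)}</p>" for text in paragraphs)
    toc = (
        '<nav aria-label="Table of contents"><ol>'
        + "".join(toc_items)
        + "</ol></nav>"
    )
    return f"<h1>{escape(title)}</h1>\n{toc}\n" + "\n".join(body)


if __name__ == "__main__":
    print(render_paper("Example paper", [("Introduction", ["First paragraph."])]))
```

Heading elements and in-page links are what let a screen reader user jump between sections, which is why the sketch emits both for every extracted section header.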
Our analysis reveals that PDF accessibility adherence is low across all fields of study. Only 2.4% of the PDFs we assess comply with all five of our accessibility criteria. Though compliance for several criteria seems to be increasing over time, author awareness and contribution to accessibility remain low, as Alt-text has the lowest compliance of the five criteria at between 5% and 10% (Alt-text is the only criterion of the five that requires author intervention in all cases using current tools). We also find that typesetting software is strongly associated with accessibility compliance, with LaTeX and publishing software like Arbortext APP producing low compliance PDFs, while Microsoft Word is generally associated with higher compliance.
To offset the reading challenges of inaccessible papers for BLV researchers, we propose and test the SciA11y system for rendering academic PDFs into accessible HTML documents. As shown in Figure 1, our prototype integrates several machine learning text and vision models to extract the structure and semantic content of papers. The content is represented as an HTML document with headings and links for navigation, figures and tables, as well as other novel features to assist in document structure understanding. Our evaluation of the SciA11y system identifies common classes of extraction problems, and finds that though many papers exhibit some extraction errors, the majority (55%) have no major problems that impact readability, and another 32% have only some problems that impact readability. Through our user study, we identify numerous challenges faced by BLV users when reading paper PDFs, including some that affect the whole document or limit navigation, and many that affect the ability of the reader to understand text or various elements of a paper like math content or tables. Responses to SciA11y were positive; participants especially liked navigation features such as headings, the table of contents, and bidirectional links between inline citations and references. Of the extraction errors in SciA11y, missed or incorrectly extracted headings were the most problematic, as these impact the user's ability to navigate between sections and fully trust the system. All users reported being likely to use the system in the future. When asked how the system might be integrated into their workflow, one participant replied "I think it would become the workflow." Another participant said, "for unaccessible PDFs, this is life-changing."
We condense these findings into a set of recommendations for designing and engineering accessible reading systems (Section 6.3). Most importantly, documents should be structured to match a reader's mental model, objects should be properly tagged, and care should be taken to reduce the reader's cognitive load and increase trust in the system. Features that emulate the external memory that visual layout provides to sighted users can be especially beneficial.
This paper is organized as follows. Following a description of related work in Section 2, we first provide a metascientific analysis of the current state of academic PDF accessibility in Section 3. In Section 4, we document our pipeline for converting PDF to HTML and describe the SciA11y prototype for rendering papers. An evaluation of HTML render quality and faithfulness is provided in Section 5. Section 6 describes our user study and findings. We recognize that no PDF extraction system is perfect, and many open research challenges remain in improving these systems. However, based on our findings, we believe SciA11y can dramatically improve screen reader navigation of most papers compared to PDFs, and is well-positioned to assist BLV researchers with many of their most common reading use cases. Our hope is that a system such as SciA11y can improve BLV researcher access to the content of academic papers, and that these design recommendations can be leveraged by others to create better, more faithful, and ultimately more usable tools and systems for scholars in the BLV community.
2 Related Work
Accessibility is an essential component of computing, which aims to make technology broadly accessible to as many users as possible, including those with differing sets of abilities. Improvements in usability and accessibility fall to the community, to better understand the needs of users with differing abilities, and to design technologies that play to this spectrum of abilities. In computing, significant strides have been made to increase the accessibility of web content. For example, various versions of the Web Content Accessibility Guidelines (WCAG) [8, 10] and the in-progress working draft for WCAG 3.0, or standards such as ARIA from the W3C's Web Accessibility Initiative (WAI), have been released and used to guide web accessibility design and implementation. Similarly, positive steps have been made to improve the accessibility of user interfaces and user experience [5, 35, 36, 46], as well as various types of media content [19, 29, 32].
We take inspiration from accessibility design principles in our effort to make research publications more accessible to users who are blind and low vision. Blindness and low vision are some of the most common forms of disability, affecting an estimated 3-10% of Americans depending on how visual impairment is defined. BLV researchers also make up a representative sample of researchers in the United States and worldwide. A recent Nature editorial pushes the scientific community to better support researchers with visual impairments, since existing tools and resources can be limited. There are many inherent accessibility challenges to performing research. In this paper, we engage with one of these challenges that affects all domains of study: accessing and reading the content of academic publications.
BLV users interact with papers using screen readers, braille displays, text-to-speech, and other assistive tools. A WebAIM survey of screen reader users found that the vast majority (75.1%) of respondents indicate that PDF documents are very or somewhat likely to pose significant accessibility issues (footnote 4). Most papers are published in PDF, which is inherently inaccessible, due in large part to its conflation of visual layout information with semantic content [6, 34]. Bigham et al. describe the historical reasons we use PDF as the standard document format for scientific publications, as well as the barriers the format itself presents to accessibility. Prior work on scientific accessibility has made recommendations for how to make PDFs more accessible [11, 38], including greater awareness of what constitutes an accessible PDF and better tooling for generating accessible PDFs. Some work has focused on addressing components of paper accessibility, such as how screen readers should interpret and read mathematical equations [1, 4, 16, 17, 26, 44, 45], how to describe charts and figures, how to automatically generate figure captions [9, 37], or how to automatically classify the content of figures. Other work applicable to all types of PDF documents aims to improve automatic text and layout detection of scanned documents and extract table content [15, 39]. In this work, we focus on the issue of representing overall document structure, and navigation within that structure. Being able to quickly navigate the contents of a paper through skimming and scanning is an essential reading technique, which is currently under-supported by PDF documents and PDF readers when these documents are read with a screen reader.
There also exist a variety of automatic and manual tools that assess and fix accessibility compliance issues in PDFs, including the Adobe Acrobat Pro Accessibility Checker (footnote 5), CommonLook (footnote 6), ABBYY FineReader (footnote 7), PAVE (footnote 8), and PDFA Inspector (footnote 9). To our knowledge, PAVE and PDFA Inspector are the only non-proprietary, open-source tools for this purpose. Based on our experiences, however, all of these tools require some degree of human intervention to properly tag a scientific document, and tagging and fixing must be performed for each new version of a PDF, regardless of how minor the change may be.
Guidelines and policy changes have been introduced in the past decade to ameliorate some of the issues around scientific PDF accessibility. Some conferences, such as The ACM CHI Virtual Conference on Human Factors in Computing Systems (CHI) and The ACM SIGACCESS Conference on Computers and Accessibility (ASSETS), have released guidelines for creating accessible submissions (footnote 10). The ACM Digital Library (footnote 11) provides some publications in HTML format, which is easier to make accessible than PDF. Ribera et al. conducted a case study on DSAI 2016 (Software Development and Technologies for Enhancing Accessibility and Fighting Infoexclusion). The authors of DSAI were responsible for creating accessible proceedings and identified barriers to doing so, including lack of sufficient tooling and lack of awareness of accessibility. The authors recommended creating a new role in the organizing committee dedicated to accessible publishing. These policy changes have led to improvements in localized communities, but have not been widely adopted by all academic publishers and conference organizers. Table 1 lists prior studies that have analyzed PDF accessibility of academic papers, and shows how our study compares.
Prior work has primarily focused on papers published in Human-Computer Interaction and related fields, specific to certain publication venues, while our analysis tries to quantify paper accessibility more broadly. Brady et al. quantified the accessibility of 1,811 papers from CHI 2010-2016, ASSETS 2014, and W4A, assessing the presence of document tags, headers, and language. They found that compliance improved over time in response to conference organizers offering to make papers accessible as a service to any author upon request. Lazar et al. conducted a study quantifying accessibility compliance at CHI from 2010 to 2016 as well as ASSETS 2015, confirming the results of Brady et al. They found that across 5 accessibility criteria, the rate of compliance was less than 30% for CHI papers in each of the 7 years that were studied. The study also analyzed papers from ASSETS 2015, an ACM conference explicitly focused on accessibility, and found that those papers had significantly higher rates of compliance, with over 90% of the papers being tagged for correct reading order and no criteria having less than 50% compliance. This finding indicates that community buy-in is an important contributor to paper accessibility. Nganji conducted a study of 200 PDFs of papers published in four disability studies journals, finding that accessibility compliance was between 15% and 30% for the four journals analyzed, with some publishers having higher adherence than others. To date, no large-scale analysis of scientific PDF accessibility has been conducted outside of disability studies and HCI, due in part to the challenge of scaling such an analysis. We believe such an analysis is useful for establishing a baseline and characterizing routes for future improvement. Consequently, as part of this work, we conduct an analysis of scientific PDF accessibility across various fields of study, and report our findings relative to prior work.
4 https://webaim.org/projects/screenreadersurvey8/
5 https://www.adobe.com/accessibility/products/acrobat/using-acrobat-pro-accessibility-checker.html
6 https://monsido.com/monsido-commonlook-partnership
7 https://pdf.abbyy.com/
8 https://pave-pdf.org/faq.html
9 https://github.com/pdfae/PDFAInspector
10 See http://chi2019.acm.org/authors/papers/guide-to-an-accessible-submission/ and https://assets19.sigaccess.org/creating_accessible_pdfs.html
11 https://dl.acm.org/
Our Analysis: 11,397 PDFs; venues across various fields of study; 2010-2019; Adobe Acrobat Accessibility Plug-in Version 21.001.20145.
Table 1. Prior work has investigated PDF accessibility for papers published in specific venues such as CHI, ASSETS, W4A, DSAI, or various disability journals. Several of these works were conducted manually, and were limited to a small number of papers, while the more thorough analyses were conducted for CHI and ASSETS, two conference venues focused on accessibility and HCI. Our study expands on this prior work to investigate accessibility over 11,397 PDFs sampled from across different fields of study.
3 Analysis Of Academic PDF Accessibility
To capture and better characterize the scope and depth of the problems around academic PDF accessibility, we perform a broad meta-scientific analysis. We aim to measure the extent of the problem (e.g., what proportion of papers have accessible PDFs?), whether the state of PDF accessibility is improving over time (e.g., are papers published in 2019 more likely to be accessible than those published in 2010?), and whether the typesetting software used to create a paper is associated with the accessibility of its PDF (e.g., are papers created using Microsoft Word more or less accessible than papers created with other software?).
Prior studies on PDF accessibility have been limited to papers from specific publication venues such as CHI, ASSETS, W4A, DSAI, and journals in disability research. Notably, these venues are closer to the field of accessible computing, and are consequently more invested in accessibility (footnote 12). We expand upon this work by investigating accessibility trends across various fields of study and publication venues. Our goal is to characterize the overall state of paper PDF accessibility and identify ongoing challenges to accessibility going forward.
3.1 Data & Methods
We sample PDFs from the Semantic Scholar literature corpus for analysis. We construct a dataset of papers by sampling PDFs published in the years 2010-2019, stratified across the 19 top-level fields of study defined by Microsoft Academic Graph [42, 47]. Examples of fields include Biology, Computer Science, Physics, and Sociology. This dataset allows us to investigate the overall state of PDF accessibility for academic papers, and to study the relationship between field of study and PDF accessibility.
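A minimal sketch of stratified sampling by field of study is given below. It only illustrates the general technique; the record keys `field` and `year`, the fixed per-field sample size, and the function name `stratified_sample` are assumptions and do not describe the actual sampling procedure or data schema used in this study.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence


def stratified_sample(
    papers: Sequence[Dict],
    per_field: int,
    first_year: int = 2010,
    last_year: int = 2019,
    seed: int = 0,
) -> List[Dict]:
    """Draw up to `per_field` papers from each field within the year range."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    by_field = defaultdict(list)
    for paper in papers:
        # "field" and "year" are hypothetical keys for this sketch.
        if first_year <= paper["year"] <= last_year:
            by_field[paper["field"]].append(paper)
    sample = []
    for field in sorted(by_field):
        candidates = by_field[field]
        # Take every candidate when a field has fewer than `per_field` papers.
        sample.extend(rng.sample(candidates, min(per_field, len(candidates))))
    return sample
```

Sampling within each field separately keeps small fields from being crowded out by large ones, which is what allows per-field compliance rates to be compared.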
For each field of study, we sample papers from the top venues by total citation count, along with some documents without venue information, which include things like books and book chapters. The resulting papers come from 1,058 unique publication venues; for each field of study, between 29 [...]
We analyze the PDFs in our dataset using the Adobe Acrobat Pro DC PDF accessibility checker (footnote 13). Though this checker is proprietary and requires a paid license, it is the most comprehensive accessibility checker available and has been used in prior work on accessibility [23, 33, 40]. By contrast, non-proprietary PDF parsers such as PDFBox do not consistently extract accessibility criteria from sample PDFs, even when the criteria are met. We also prefer Adobe's checker to PDFA Inspector, used by Brady et al., because PDFA Inspector only analyzes three criteria, whereas we are interested in other accessibility attributes as well, like the presence of alt-text.
For each PDF, the Adobe accessibility checker generates a report that includes whether or not the PDF passes or fails tests for certain accessibility features, such as the inclusion of figure alt-text or properly tagged headings for navigation.
Because there is no API or standalone application for the Adobe accessibility checker, it can only be accessed through the user interface of a licensed version of Adobe Acrobat Pro. We develop an AppleScript program that enables us to automatically process papers through the Adobe checker. Our program requires a dedicated computer running macOS and a licensed version of Adobe Acrobat Pro. It takes 10 seconds on average to download and process each PDF, which enables us to scale up our analysis to tens of thousands of papers. Accessibility reports from the checker are saved in HTML format for subsequent analysis.
12 See submission and accessibility guidelines for ASSETS (https://assets19.sigaccess.org/creating_accessible_pdfs.html), CHI (https://chi2021.acm.org/for-authors/presenting/papers/guide-to-an-accessible-submission), W4A (http://www.w4a.info/2021/submissions/technical-papers/) and DSAI (http://dsai.ws/2020/submissions/).
13 https://www.adobe.com/accessibility/products/acrobat/using-acrobat-pro-accessibility-checker.html
Table 2. We reproduce the analysis conducted by Lazar et al. on PDFs of papers published in CHI, showing the percentage of papers that satisfy each of the five accessibility criteria. We find similar compliance rates, indicating that our automated accessibility checker pipeline is comparable to previous analysis methods. We also show the percentage of papers in our full dataset of 11,397 PDFs that satisfy each criterion, along with the percent that satisfy Adobe-5 Compliance.
• Alt-text: Figures have alternate text.
• Table headers: Tables have headers.
• Tagged PDF: The document is tagged to specify the correct reading order.
• Default language: The document has a specified reading language.
• Tab order: The document is tagged with correct reading order, used for navigation with the tab key.
For our analysis, we also report Total Compliance, which refers to the number of accessibility criteria met (e.g., if a paper has met 3 out of the 5 criteria we specify, then Total Compliance is 3). In some cases, we report the Normalized Total Compliance, which is computed as the Total Compliance divided by 5, and can be interpreted as the proportion of the 5 criteria which are satisfied. We also report Adobe-5 Compliance, a binary value of whether a paper has met all 5 criteria we specify (1 if all 5 criteria are met, 0 if any are not met), and the rate of Adobe-5 Compliance for papers in our dataset.
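The three measures defined above can be written down directly. The sketch below assumes each accessibility report has already been reduced to a mapping from criterion name to pass/fail; the key names in `CRITERIA` and the function names are illustrative and are not taken from the Adobe report format.

```python
from typing import Mapping, Sequence

# The five criteria listed in the text; the key names are illustrative.
CRITERIA = ("alt_text", "table_headers", "tagged_pdf", "default_language", "tab_order")


def total_compliance(report: Mapping[str, bool]) -> int:
    """Number of the five criteria that the PDF satisfies (0 to 5)."""
    return sum(1 for criterion in CRITERIA if report.get(criterion, False))


def normalized_total_compliance(report: Mapping[str, bool]) -> float:
    """Total Compliance divided by 5: the proportion of criteria satisfied."""
    return total_compliance(report) / len(CRITERIA)


def adobe5_compliance(report: Mapping[str, bool]) -> int:
    """1 if all five criteria are met, 0 if any are not met."""
    return 1 if total_compliance(report) == len(CRITERIA) else 0


def adobe5_rate(reports: Sequence[Mapping[str, bool]]) -> float:
    """Fraction of PDFs in a collection that meet all five criteria."""
    if not reports:
        return 0.0
    return sum(adobe5_compliance(r) for r in reports) / len(reports)


# A paper that meets 3 of the 5 criteria, as in the example in the text.
example = {
    "alt_text": False,
    "table_headers": False,
    "tagged_pdf": True,
    "default_language": True,
    "tab_order": True,
}
```

For `example`, Total Compliance is 3, Normalized Total Compliance is 0.6, and Adobe-5 Compliance is 0.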
In addition to running the accessibility checker, we also extract metadata for each PDF, focusing on metadata related to the PDF creation process. PDF metadata are generated by the software used to create each file, and we analyze the associations between different PDF creation software and the accessibility of the resulting PDF document. Our hypothesis is that some classes of software (such as Microsoft Word) produce more accessible PDFs.
3.2 Accuracy Of Our Automated Accessibility Checker
Previous work employed different versions of the Adobe Accessibility Checker to generate paper accessibility reports. To confirm the accuracy of our checker, as well as the automated script we create to perform the analysis, we run our checker on CHI 2010 papers to reproduce the results of Lazar et al. We identify CHI papers using DOIs reported by the ACM, and resolve these to PDFs in the Semantic Scholar corpus. We identify 3,248 CHI papers in the corpus, and generate accessibility reports for these using our automated checker. Our results show similar rates of compliance compared to what was measured by Lazar et al. (see Table 2 for results). This indicates that our automated accessibility checker produces comparable results to previous studies.
3.3 Proportion Of Papers With Accessible PDFs
Around 1.6% of PDFs we attempted to process failed in the Adobe checker (i.e., we could not generate an accessibility report). The accessibility checker most commonly fails because the PDF file is password protected, or the PDF file is corrupt. In both of these cases, the PDF is inaccessible to the user. We exclude these PDFs from subsequent analysis.
Accessibility compliance over all papers is low (Table 2). In fact, only 854 PDFs (7.5%) in the whole dataset have alt-text for figures. This is intuitive, as Alt-text is the only criterion that always requires author input to achieve, while the other four criteria can be derived from the document or automatically inferred, depending on the software used to generate the PDF.
As shown in Figure 3, all fields have an Adobe-5 Compliance of less than 7%. The fields with the highest rates of compliance are Philosophy (6.3%), Art (6.2%), Business (5.7%), Psychology (5.7%), and History (5.3%), while the fields with the lowest rates of compliance are Geology (0.2%), Mathematics (0.3%), and Biology (0.6%). Fields associated with higher compliance tend to be closer to the humanities, and those with lower levels of compliance tend to be science and engineering fields. The prevalence of different document editing and typesetting software by field of study may explain some of these differences, and we explore these associations in Section 3.5.
3.4 Trends In Paper Accessibility Over Time
We show changes in compliance for all fields of study over time in Figure 4. [...] 10% compliance in 2010 to more than 25% in 2019. This may be due to changes in PDF generation defaults in various typesetting software. Though this improvement is good, Default Language is the easiest of the five criteria to bring into compliance, and arguably the least valuable in terms of improving the accessible reading experience. The criterion with the lowest rate of compliance is Alt-text, which has remained stable between 5% and 10% and has been lower in recent years.
Table 3. Count of papers per Typesetting Software. "Other" includes PDFs created with an additional 24 unique software programs, each with counts of less than 350, as well as those created with an unknown typesetting software.
Since Alt-text is the only criterion of the five which always necessitates author intervention, we believe this is a sign that authors have not become more attuned to accessibility needs, and that at least some of the improvements we see over time can be attributed to typesetting software or publisher-level changes.
3.5 Association Between Typesetting Software And Paper Accessibility
Typesetting software is extracted from PDF metadata and manually canonicalized. We extract values for three metadata fields: xmp:CreatorTool, pdf:docinfo:creator_tool, and producer. All unique PDF creation tools associated with more than 20 PDFs in our dataset are reviewed and mapped to a canonical typesetting software. For example, the values (latex, pdftex, tex live, tex, vtex pdf, xetex) are mapped to the LaTeX cluster, while the values (microsoft, for word, word) and other variants are mapped to the Microsoft Word cluster. We realize that not all Microsoft Word versions, LaTeX distributions, or other versions of typesetting software within a cluster are equal, but this normalization allows us to generalize over these software clusters. For analysis, we compare the five most commonly observed typesetting software clusters in our dataset, grouping all others into a cluster called Other.
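The mapping from raw metadata values to clusters could be sketched as follows. The field names and the LaTeX and Microsoft Word patterns come from the text; everything else is an assumption made for illustration: the function name `canonicalize`, whole-word matching, the order in which fields are consulted, and the omission of the remaining clusters, whose patterns are not listed here. The actual canonicalization was reviewed manually and may differ.

```python
import re
from typing import Mapping, Optional

# Metadata fields named in the text; the lookup order is an assumption.
METADATA_FIELDS = ("xmp:CreatorTool", "pdf:docinfo:creator_tool", "producer")

# Patterns for the two clusters whose example values appear in the text.
# Other clusters (e.g. Adobe InDesign) would need their own pattern lists.
CLUSTER_PATTERNS = {
    "LaTeX": ("latex", "pdftex", "tex live", "tex", "vtex pdf", "xetex"),
    "Microsoft Word": ("microsoft", "for word", "word"),
}


def match_cluster(value: str) -> Optional[str]:
    """Return the first cluster with a pattern that occurs as a whole word."""
    lowered = value.lower()
    for cluster, patterns in CLUSTER_PATTERNS.items():
        for pattern in patterns:
            # Word boundaries keep "tex" from matching inside "arbortext".
            if re.search(r"\b" + re.escape(pattern) + r"\b", lowered):
                return cluster
    return None


def canonicalize(metadata: Mapping[str, str]) -> str:
    """Map a PDF's creator metadata to a typesetting software cluster."""
    for field in METADATA_FIELDS:
        raw = metadata.get(field)
        if raw:
            cluster = match_cluster(raw)
            if cluster is not None:
                return cluster
    return "Other"
```

Whole-word matching is used here because a plain substring test on a short pattern such as "tex" would also match unrelated tool names such as Arbortext.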
We report the distribution of typesetting software in Table 3. The most popular PDF creators are Adobe InDesign, LaTeX, Arbortext APP, Microsoft Word, and Printer. "Printer" refers to PDFs generated by a printer driver (by selecting "Print" → "Save as PDF" in most operating systems); unfortunately, creating a PDF through printing provides no indicator of what software was used to typeset the document, and is generally associated with very low accessibility compliance. The "Other" category aggregates papers created by all other clusters of typesetting software; each of these clusters is associated with fewer than 350 PDFs, i.e., the falloff is steep after the Printer cluster. For the following analysis, we present a comparison between the five most common PDF creator clusters.
Figure 5 shows histograms of the Total Compliance score for PDFs in the five most common typesetting software clusters. While the vast majority of papers do not meet any accessibility criteria, it is clear that Microsoft Word produces the most accessible PDFs, followed by Adobe InDesign. To determine the significance of this difference, we compute the ANOVA and Kruskal-Wallis statistics with the PDF typesetting software clusters as the sample groups and the Total Compliance as the measurements for the groups. We compute an ANOVA statistic of 2587.1 (p < 0.001) and [...] Compliance scores between the five most common PDF typesetting software.
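For readers unfamiliar with the test, the one-way ANOVA statistic compares the variance between group means to the variance within groups. The sketch below computes that F statistic in plain Python, with each typesetting software cluster as a group of Total Compliance scores. The function name and the toy scores are invented for illustration and are not the study's data; the p-value and the Kruskal-Wallis statistic are not computed here.

```python
from typing import Sequence


def one_way_anova_f(groups: Sequence[Sequence[float]]) -> float:
    """F statistic for a one-way ANOVA over two or more groups of scores."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    if k < 2 or n_total <= k:
        raise ValueError("need at least two groups and more scores than groups")
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]
    # Between-group sum of squares: spread of the group means.
    ss_between = sum(
        len(g) * (mean - grand_mean) ** 2 for g, mean in zip(groups, group_means)
    )
    # Within-group sum of squares: spread of scores around their group mean.
    ss_within = sum(
        (x - mean) ** 2 for g, mean in zip(groups, group_means) for x in g
    )
    if ss_within == 0:
        raise ValueError("F is undefined when there is no within-group variance")
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


# Invented Total Compliance scores for three hypothetical clusters.
toy_groups = [[0, 1, 2], [2, 3, 4], [4, 5, 6]]
f_statistic = one_way_anova_f(toy_groups)
```

In practice a statistics library would normally be used instead, for example `scipy.stats.f_oneway` and `scipy.stats.kruskal`, which also return p-values.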
In Figure 6, we observe again that usage of Microsoft Word is highly correlated with accessibility compliance. Here, we plot the proportion of Microsoft Word usage per field of study and the corresponding mean normalized Total Compliance rates for those fields. Higher rates of Microsoft Word usage are statistically correlated with higher mean normalized Total Compliance (correlation coefficient = 0.89, p < 0.001).
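A correlation of this kind can be computed from one pair of values per field of study: the proportion of Microsoft Word usage and the mean normalized Total Compliance. The text does not state which correlation coefficient was used, so the sketch below assumes Pearson's r; the function name and the toy values are invented for illustration.

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance_x = sum((x - mean_x) ** 2 for x in xs)
    variance_y = sum((y - mean_y) ** 2 for y in ys)
    if variance_x == 0 or variance_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / math.sqrt(variance_x * variance_y)


# Invented per-field values: Word usage share and mean normalized compliance.
word_share = [0.05, 0.10, 0.20, 0.40]
mean_compliance = [0.02, 0.05, 0.06, 0.15]
r = pearson_r(word_share, mean_compliance)
```

A rank-based coefficient such as Spearman's would be an alternative if the relationship were monotonic but not linear.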
In Figure 7 [...] show modest improvements over time, though only 10-15% of papers in our sample meet any one of these individual criteria. Alt-text compliance is the lowest of our measured criteria, and as the only criterion of the five requiring author intervention in all cases, the lack of alt-text may be indicative of the general lack of author awareness and contribution to accessibility efforts for scientific papers.