Go To:

Paper Title Paper Authors Table Of Contents Abstract References
Home
Report a problem with this paper

Improving the accessibility of scientific documents

Authors

  • Lucy Lu Wang
  • Jonathan Bragg

Abstract

The majority of scientific papers are distributed in PDF, which pose accessibility challenges for blind and low vision (BLV) readers. We characterize the scope of this problem by assessing the accessibility of 11K PDFs published 2010-2019 sampled across various fields of study, finding that less than 2.4% of these PDFs satisfy defined accessibility criteria. We conduct a user study to better understand the needs and mitigation strategies of BLV researchers when reading papers, and to solicit design and usability feedback on a proposed system solution. We iterate on the design of our system, PaperToHTML, which uses several machine learning models to extract content from PDFs and render this content as accessible HTML. Our prototype focuses on providing high-level navigational support and paper content for users of screen readers. An intrinsic evaluation of extraction quality indicates that the majority of HTML renders (87%) produced by our system have no or only some readability issues. Our system is publicly available at anonymized_url, where users can upload and render scientific documents as HTML on-demand. CCS Concepts: • Human-centered computing → Empirical studies in accessibility; Accessibility systems and tools; HCI design and evaluation methods; Accessibility design and evaluation methods.

1. Introduction

Fig. 1. The PaperToHTML system for converting scientific documents to HTML. A user begins the process by uploading a scientific PDF into our system via our web interface (step 1). Our system then runs a series of models and API calls on the uploaded document (step 2): metadata is extracted and linked to entries in the Semantic Scholar corpus; section headers, body text, and references are extracted using the S2ORC pipeline [40], which leverages Grobid [41]; and DeepFigures [64] is used to extract figures and tables, along with their captions. The system then generates the final HTML render (step 3), with inferred logical reading order and added navigational features. Features include a) a heuristically generated table of contents, b) figures and tables inserted in the appropriate places in the text, near first mentions, and c) and d) bidirectional links between inline citations and entries in the reference list. We add HTML tags: header tags for sections, paragraph tags for body text, and figure tags for figures and tables.

Scientific literature is most commonly available in the form of PDFs, which pose significant challenges for accessibility [13, 51] . To enable reading by blind and low vision (BLV) researchers or users of screen readers, these PDFs must be annotated with proper reading order, headings, tags, table structure, and image alt-text. These annotations are laborious, require proprietary tooling, and require that the PDF authors have both the motivation and know-how to make their Fig. 1 . The PaperToHTML system for converting scientific documents to HTML. A user begins the process by uploading a scientific PDF into our system via our web interface (step 1). Our system then runs a series of models and API calls on the uploaded document (step 2): metadata is extracted and linked to entries in the Semantic Scholar corpus; section headers, body text, and references are extracted using the S2ORC pipeline [40] , which leverages Grobid [41] ; and DeepFigures [64] is used to extract figures and tables, along with their captions. The system then generates the final HTML render (step 3), with inferred logical reading order and added navigational features. Features include a) a heuristically generated table of contents, b) figures and tables inserted in the appropriate places in the text, near first mentions, and c) and d) bidirectional links between inline citations and entries in the reference list. We add HTML tags: header tags for sections, paragraph tags for body text, and figure tags for figures and tables.

PDFs accessible [13] . The process must also be repeated for each variant of a PDF produced, regardless of how small the change. As a result of these barriers, the vast majority of paper PDFs are not accessible, leading to high cognitive load and frustration for BLV researchers trying to read these papers.

While poor paper PDF accessibility has been documented in prior work [13, 14, 38, 50, 60] , these studies have been focused on fields adjacent to accessible computing-human-computer interaction, disability studies, etc-and do not necessarily generalize to other fields of study. In this work, we aim to provide a broad description of the scope of the problem for BLV scholars. Employing both quantitative and qualitative techniques, we perform (1) a large-scale corpus-level analysis of scientific PDF accessibility over multiple fields of study, and (2) a formative user study to characterize the challenges faced by BLV researchers when reading inaccessible PDFs. We explore the feasibility of a technological solution, a system we develop called PaperToHTML that automatically converts paper PDFs into tagged HTML, and how such a system might impact the experience of BLV researchers when reading scientific papers.

Our corpus-level analysis reveals that accessibility adherence is low across all fields of study; however, we also identified differences across fields of study, and limitations with current measurement techniques, which may explain some of these differences. For example, fields closer to the humanities tended to have higher compliance numbers.

However, these differences may be explained in part by differences in typesetting software, which we find to be strongly associated with accessibility compliance but may not indicate meaningful human intervention (e.g., software may automatically set the default language). Further, we discovered that the vast majority of figure alt-text that passed automated accessibility checkers is in fact devoid of any meaningful content (e.g., is a file path; details in Section 3.4), leading overall to inflated estimates for that criterion. Still, not all estimates are inflated to the same degree; papers published at CHI, for example, are inflated to a much lesser degree. Overall, our analysis suggests that human intervention is still low, and automatic measurement of accessibility rates may give a false sense of progress; we provide methodology for more nuanced analysis and baselines to assist future measurement efforts.

Results from our user study reveal complementary qualitative insights into the challenges faced by BLV users when reading papers. We condense these findings into a set of recommendations for designing and engineering accessible reading systems (Section 4.6), and use these recommendations to iterate on the design of our proposed system. Participants responded positively to our prototype, emphasizing the benefits of having access to navigational features such as headings, the table of contents, and bidirectional links between inline citations and references. All users reported being likely to use such a system in the future were it to be broadly available for scientific papers.

We describe the architecture and design of the PaperToHTML system in Section 5. As shown in Figure 1 , PaperToHTML integrates several machine learning text and vision models to extract the structure and semantic content of papers.

The content is then represented as an HTML document with headings and links added for navigation, figures and tables inserted in logical locations, as well as other novel features to assist in document structure understanding. We incorporate feedback received during the user study to iterate upon and arrive at the system design presented in this paper. We perform an intrinsic evaluation of the quality of HTML renders produced by our system and identify common classes of extraction problems. We find that though many papers exhibit some extraction errors, the majority (55%) have no major problems that impact readability, and another 32% have only some problems that impact readability. This result suggests that current models for document understanding may already be sufficient for improving BLV reader experience in a majority of settings. In user interviews, we also observed the ability of rendered HTMLs to facilitate better in-paper navigation and interactions for BLV users. Going forward, we hope to make further improvements to our document understanding models, especially those focused on specific components of papers like equations, tables, algorithms, and figures, which are the most challenging elements for current models.

The goals and contributions of this paper are three-fold:

(1) We characterize the state of scientific-paper PDF accessibility by estimating the degree of adherence to accessibility criteria for papers published in the last decade (2010-2019), and describe correlations between year, field of study, PDF typesetting software, and PDF accessibility (Section 3). We highlight issues with automatic measurement and introduce heuristics to mitigate measurement error.

(2) We conduct a formative user study with BLV scholars to understand the challenges they currently experience when reading scientific papers. During each interview session, we ask the user to discuss their reading experiences, demonstrate their current workflow, and interact with initial prototypes of our system to offer feedback for its improvement. We summarize the findings of this user study into a set of design recommendations, which we use to refine our system.

(3) We introduce a system, PaperToHTML, which automatically extracts the content of scientific PDFs and rerenders the content as tagged HTML on-demand. We perform a quantitative and qualitative evaluation of the HTML renders produced by our system, through expert grading of the accuracy of the HTML compared to the source PDF. Our system is publicly available at anonymized_url. This paper is organized as follows. We begin with a description of related work in Section 2. We provide a metascientific analysis of the current state of scientific PDF accessibility in Section 3. In Section 4, we detail our user study and findings. In Section 5, we describe our pipeline for converting PDF to HTML, and the technical components and UI features in PaperToHTML. An evaluation of HTML render quality and faithfulness is provided in Section 6.

We recognize that no PDF understanding system is perfect, and many open research challenges remain in improving these systems. However, based on our findings, we believe PaperToHTML can dramatically improve screen reader navigation of most papers compared to reading the raw PDFs, and is well-positioned to assist BLV researchers with many of their common reading use cases. Our hope is that a system such as ours can improve BLV researcher access to the content of scientific papers, and that our design recommendations and learnings can be leveraged by others to create better, more faithful, and ultimately more usable tools and systems for scholars in the BLV community.

2. Related Work

Accessibility is an essential component of computing, which aims to make technology broadly accessible to as many users as possible, including those with differing sets of abilities. Improvements in usability and accessibility falls to the community, to better understand the needs of users, and to design technologies that play to a spectrum of abilities [71] .

In computing, significant strides have been made to increase the accessibility of web content. For example, various versions of the Web Content Accessibility Guidelines (WCAG) [15, 17] and the in-progress working draft for WCAG 3.0, 1 or standards such as ARIA from the W3C's Web Accessibility Initiative (WAI) 2 have been released and used to guide web accessibility design and implementation. Similarly, positive steps have been made to improve the accessibility of user interfaces and user experience [12, 52, 53, 67] , as well as various types of media content [28, 46, 49] .

We take inspiration from accessibility design principles in our effort to make research publications more accessible to users who are blind and low vision. Blindness and low vision are some of the most common forms of disability, affecting an estimated 3-10% of Americans depending on how visual impairment is defined [27] . BLV researchers also make up a representative sample of researchers in the United States and worldwide. A recent Nature editorial pushes the scientific community to better support researchers with visual impairments [61] , since existing tools and resources are limited. In this work, we engage with the challenge of accessing and reading the content of academic publications.

2.1 Accessibility And Scientific Publishing

As summarized in Bigham et al. [13] , accessibility challenges for scientific PDFs are largely due to three factors: (1) the complexity of the PDF file format, which make it less amenable to certain accessibility features, (2) the dearth of tools, especially non-proprietary tools, for creating accessible PDFs, and (3) the dependency on volunteerism from the community with minimal support or enforcement. The intent of the PDF file format is to support faithful visual representation of a document for printing, a goal that is inherently divergent from that of document representation for the purposes of accessibility. Though some professional organizations like the Association for Computing Machinery (ACM) have encouraged PDF accessibility through standards and writing guidelines, 3 uptake among academic publishers and disciplines more broadly has been limited.

Guidelines and policy changes have been introduced in the past decade to ameliorate some of the issues around scientific PDF accessibility. Conferences such as The ACM CHI Virtual Conference on Human Factors in Computing Systems (CHI) and The ACM SIGACCESS Conference on Computers and Accessibility (ASSETS), have released guidelines for creating accessible submissions. 4 The ACM Digital Library 5 provides some publications in HTML format, which is easier to make accessible than PDF [32] . Ribera et al. [60] conducted a case study on DSAI 2016 (Software Development and Technologies for Enhancing Accessibility and Fighting Infoexclusion). The authors of DSAI were responsible for creating accessible proceedings and identified barriers to creating accessible proceedings, including lack of sufficient tooling and lack of awareness of accessibility. The authors recommended creating a new role in the organizing committee dedicated to accessible publishing. In recent years, some publishers (including Science, Nature, and PLoS) now provide HTML reading experiences for their papers, which can dramatically mitigate challenges for BLV researchers. These policy changes have led to improvements in localized communities, but have not been widely adopted by academic publishers and conference organizers.

In many fields outside of computing, such as Biology and Medicine, versions of record (the final published versions of papers) are produced by publishers from an author's submitted manuscript, which moves the control of paper PDF accessibility from authors to publishers. In Section 3.6, we show how the choice of typesetting software (e.g. commonly used publisher software such as Adobe InDesign and Arbortext APP) can impact PDF accessibility, sometimes inflating our perceptions of compliance.

Though progress is trending in the right direction, a large proportion of papers published now and historically are still not accessible. In this work, we address this challenge by introducing a system that converts paper PDFs into HTML documents, preserving high level structure and organization, allowing BLV readers to more easily navigate the paper. Being able to quickly navigate the contents of a paper through skimming and scanning is an essential reading technique [45] , which is currently under-supported by PDF documents and PDF readers when reading these documents using screen readers.

2.2 Accessibility Tools For Scientific Pdfs

BLV users interact with papers using screen readers, braille displays, text-to-speech, and other assistive tools. A WebAIM survey of screen reader users found that the vast majority (75.1%) of respondents indicate that PDF documents are very or somewhat likely to pose significant accessibility issues. 6 Prior work on scientific document accessibility have made recommendations for how to make PDFs more accessible [20, 58] , including greater awareness for what constitutes an accessible PDF and better tooling for generating accessible PDFs. Some work has focused on addressing components of paper accessibility, such as the correct way for screen readers to interpret and read mathematical equations [6, 10, 25, 26, 43, 65, 66] , describe charts and figures [21] [22] [23] , automatically generate figure captions [16, 56, 57] , or automatically classify the content of figures [35] . Other work applicable to all types of PDF documents aims to improve automatic text and layout detection of scanned documents [48] and extract table content [24, 59] .

There also exists a variety of automatic and manual tools that assess and fix accessibility compliance issues in PDFs, including the Adobe Acrobat Pro Accessibility Checker 7 , Common Look 8 , ABBYY FineReader 9 , PAVE 10 , and PDFA

Our Analysis

Table 1. Prior work has investigated PDF accessibility for papers published in specific venues such as CHI, ASSETS, W4A, DSAI, or various disability journals. Several of these works were conducted manually, and were limited to a small number of papers, while the more thorough analysis was conducted for CHI and ASSETS, two conference venues focused on accessibility and HCI. Our study expands on this prior work to investigate accessibility over 11,397 PDFs sampled from across different fields of study.
Prior work PDFs analyzed Venues Year Accessibility checker
Brady et al. [14] 1,811 CHI, ASSETS and W4A 2011-2014 PDFA Inspector
Lazar et al. [38] 465 + 32 CHI and ASSETS 2014-2015 Adobe Acrobat Action Wizard
Ribera et al. [60] 59 DSAI 2016 Adobe PDF Accessibility Checker 2.0
Nganji [50] 200 Disability & Society, Journal of Developmental and Physical Disabilities, Journal of Learning Disabilities, and Research in Developmental Disabilities 2009-2013 Adobe PDF Accessibility Checker 1.3
Our analysis 11,397 Venues across various fields of study 2010-2019 Adobe Acrobat Accessibility Plug-in Version 21.001.20145

11,397 Venues across various fields of study 2010-2019 Adobe Acrobat Accessibility Plug-in Version 21.001.20145 Table 1 . Prior work has investigated PDF accessibility for papers published in specific venues such as CHI, ASSETS, W4A, DSAI, or various disability journals. Several of these works were conducted manually, and were limited to a small number of papers, while the more thorough analysis was conducted for CHI and ASSETS, two conference venues focused on accessibility and HCI. Our study expands on this prior work to investigate accessibility over 11,397 PDFs sampled from across different fields of study.

Inspector 11 . To our knowledge, PAVE and PDFA Inspector are the only non-proprietary, open-source tools for this purpose. Based on our experiences, however, all of these tools require some degree of human intervention to properly tag a scientific document, and tagging and fixing must be performed for each new version of a PDF, regardless of how minor the change may be. Alt-text, in particular, requires significant detail to be meaningful [42] , necessitating author intervention as no current tools are capable of automatically generating suitable alt-text for the complex scientific figures that occur in a realistic setting [57] . Table 1 lists previous studies that have analyzed PDF accessibility of academic papers, and shows how our study compares. Prior work has primarily focused on papers published in Human-Computer Interaction and related fields, specific to certain publication venues, while our analysis tries to quantify paper accessibility more broadly.

2.3 Quantitative Studies Of Academic Pdf Accessibility

Brady et al. [14] quantified the accessibility of 1,811 papers from CHI 2010-2016, ASSETS 2014, and W4A, assessing the presence of document tags, headers, and language. They found that compliance improved over time as a response to conference organizers offering to make papers accessible as a service to any author upon request. Lazar et al. [38] conducted a study quantifying accessibility compliance at CHI from 2010 to 2016 as well as ASSETS 2015, confirming the results of Brady et al. [14] . They found that across 5 accessibility criteria, the rate of compliance was less than 30% for CHI papers in each of the 7 years that were studied. The study also analyzed papers from ASSETS 2015, an ACM conference explicitly focused on accessibility, and found that those papers had significantly higher rates of compliance, with over 90% of the papers being tagged for correct reading order and no criteria having less than 50% compliance. This finding indicates that community buy-in is an important contributor to paper accessibility. Nganji [50] conducted a study of 200 PDFs of papers published in four disability studies journals, finding that accessibility compliance was between 15-30% for the four journals analyzed, with some publishers having higher adherence than others.

To date, no large scale analysis of scientific PDF accessibility has been conducted outside of disability studies and HCI, due in part to the challenge of scaling such an analysis. We believe such an analysis is useful for establishing a 11 https://github.com/pdfae/PDFAInspector baseline and characterizing routes for future improvement. Consequently, as part of this work, we conduct an analysis of scientific PDF accessibility across various fields of study, and report our findings relative to prior work. We also identify limitations with past automatic evaluation methodology, and provide new methodology for interpreting results while considering typesetting software and alt-text content.

2.4 Scientific Pdf Understanding

A first step to automatically making a paper PDF more accessible is being able to understand its layout and semantic content. Scientific document understanding systems that rely on a combination of PDF parsers, visual models, and text classifiers can be used to perform structured content extraction from scientific PDFs. These include text-based methods such as Grobid [41] , ScienceParse [8] , CERMINE [69] , Neural ParsCit [55] , and others [68] . These tools generally work by using a PDF parser to extract a sequence of tokens from the PDF, then applying a trained machine learning model like Conditional Random Field [37] to classify the tokens into categories such as title, author, section header, body text, references, etc. Many of these tools were developed initially with a focus on extracting metadata and references from papers at scale to better construct the bibliographic network, and have been enhanced with the ability to extract full text and other document components. Other methods rely on visual signals [78] , performing image object detection on each PDF page and classifying the objects into paper semantic categories. Some approaches leveraging visual signals also focus on extracting specific components of papers, such as figures and tables [64] .

Our PaperToHTML system relies on the S2ORC paper parsing pipeline [40] , which converts papers to a structured JSON object containing metadata fields, section headers, body text split into paragraphs, references, figure and table captions, as well as links between inline citations and reference items. The S2ORC library uses Grobid [41] to process PDFs into TEI XML, and has additional provisions for converting TEI XML, LaTeX source, and JATS XML documents into a unified JSON representation. 12 PaperToHTML also uses the DeepFigures model [64] to extract figure and table images, combining these images with the JSON output of S2ORC before rendering as HTML. We elect to use DeepFigures as it achieves state-of-the-art performance over the previous baseline of PDFFigures 2.0 [19] and because its code base is openly available. 13 Recent advances in document understanding integrate and jointly model text and visual signals. Layout-aware language models such as LayoutLM [74] , LayoutLMv2 [75] , SelfDoc [39] , and VILA [62] have established new state-ofthe-art on document understanding tasks. However, these models are expensive for deployment at scale, and do not provide a representation of the full-text document containing all of the features needed to reinterpret that document as HTML. For example, classes relevant to scientific documents (affiliations, references, etc) must be added during further training, additional logic is necessary to infer reading order and merge line breaks at page boundaries, and inline citations must be identified. VILA [62] was developed specifically for scientific documents, and is most relevant to our needs, though significant work remains to achieve feature parity with our current system. We leave to future work the adapting of more performant layout-aware language models for the task of converting scientific papers to HTML.

2.5 Tools For Reading Papers

Scientific papers can be difficult to read due to their technical content and the increasing presence of field-specific jargon [9] . Users also employ many different browsing [11] and reading [31] behaviors, not all of which are well supported by current reading formats. Many tools and interfaces have arisen to help simplify parts of the process. For example, reading environments like those first introduced by Graham [29] or Zhang et al. [76] can help users make sense of collections of web documents or papers, facilitating actions like link-chasing or note-taking. Reading interfaces like eLife Lens [2] , Pubmed's PubReader [4] , or arXiv Vanity [1] allow users to read certain papers in a browser in web format with additional navigation and search support. For example, inline citations may be more easily resolved, and footnotes and other annotations may be placed closer to their origin. Other tools like ScholarPhi [30] , Semantic Reader [5] , and Paperly [3] provide augmented PDF reading experiences, introducing additional features for term [33] and equation [30] understanding, or highlighting and note-taking [3] . There have also been efforts to improve the experience of reading older PDFs by overlaying annotations [54] , or intergrating neural models for text understanding or question answering into paper reading interfaces [30, 33, 77] .

BLV users experience unique challenges with the most common tools for finding, reading, and organizing academic papers. Digital library tools may lack the necessary information for blind users to properly assess the content or suitability of files before opening them; linear documents can make it such that BLV users must take more time to evaluate whether the content of a document matches their goals [72, 73] , especially when limited facilities are available to support auditory skimming [34] . Similarly, many of the interface developments described above may not translate directly to improved experiences for BLV users. For the design of PaperToHTML, we attempt to support BLV users in the oft-observed behaviors of skimming, scanning, and fragmented reading of scientific papers [31] . Our system provides a high-level overview of document structure (through a table of contents and section headings), allowing users to quickly determine the relevance of the paper to their needs and navigate to the most pertinent content.

3. Quantitative Analysis Of Academic Pdf Accessibility

To capture and better characterize the scope and depth of the problems around academic PDF accessibility, we perform a broad meta-scientific analysis. We aim to measure the extent of the problem (e.g., what proportion of papers have accessible PDFs?), whether the state of PDF accessibility is improving over time (e.g., are papers published now more likely to be accessible than those published in 2010?), and whether the typesetting software used to create a paper is associated with the accessibility of its PDF (e.g., are papers created using Microsoft Word more or less accessible than papers created with other software?).

Prior studies on PDF accessibility have been limited to papers from specific publication venues such as CHI, ASSETS, W4A, DSAI, and journals in disability research. Notably, these venues are closer to the field of accessible computing, and are consequently more invested in accessibility. 14 We expand upon this work by investigating accessibility trends across various fields of study and publication venues. Our goal is to characterize the overall state of paper PDF accessibility and identify ongoing challenges to accessibility going forward. Further, we introduce analysis methodology that newly considers typesetting software and alt-text information content; we hope these methods will guide more accurate monitoring of PDF accessibility in the future.

3.1 Data & Methods

We sample PDFs from the Semantic Scholar literature corpus [8] for analysis. We construct a dataset of papers by sampling PDFs published in the years of 2010-2019 stratified across the 19 top level fields of study (e.g. Biology, Computer Science, Sociology, etc) defined by Microsoft Academic Graph [63, 70] . For each field of study, we sample papers from the top venues by total citation count, along with some documents without venue information such as books and book chapters. The resulting documents come from 1058 unique publication venues; for each field of study, between 29 We analyze the PDFs in this dataset using the Adobe Acrobat Pro DC PDF accessibility checker. 15 Though this checker is proprietary and requires a paid license, it is the most comprehensive accessibility checker available and has been used in prior work on accessibility [38, 50, 60] . Alternatively, non-proprietary PDF parsers such as PDFBox 16 do not consistently extract accessibility information from PDFs, even when we found those criteria to be met. We also prefer Adobe's checker to PDFA Inspector, used by Brady et al. [14] , because PDFA Inspector only analyzes three criteria, whereas we are interested in other accessibility attributes like the presence of alt-text on figures.

For each PDF, the Adobe accessibility checker generates a report that includes whether or not the PDF passes or fails tests for certain accessibility features, such as the inclusion of figure alt-text or properly tagged headings for navigation.

Because there is no API or standalone application for the Adobe accessibility checker, it can only be accessed through the user interface of a licensed version of Adobe Acrobat Pro. We developed an AppleScript program that enables us to automatically process papers through the Adobe checker. Our program requires a dedicated computer running MacOS and a licensed version of Adobe Acrobat Pro. It takes 10 seconds on average to download and process each PDF, which enables us to scale up our analysis to tens of thousands of papers. Accessibility reports from the checker are saved in HTML format for subsequent analysis.

Each report contains a total of 32 accessibility criteria, marked as "Passed," "Failed," or "Needs manual check." 17 Following Lazar et al. [38] , we analyze the following five criteria 18 :

• Alt-text: Figures have alternate text.

Table headers: Tables have headers. • Tagged PDF: The document is tagged to specify the correct reading order.

• Default language: The document has a specified reading language.

• Tab order: The document is tagged with correct reading order, used for navigation with the tab key.

For our analysis, we report the pass rate for each of the 5 criteria, as well as Total Compliance, the sum number of accessibility criteria met (e.g., if a paper meets 3 out of 5 criteria, Total Compliance is 3). In some cases, we report the Normalized Total Compliance, which is the proportion of the 5 criteria which are satisfied. We also report Adobe-5 15 https://www.adobe.com/accessibility/products/acrobat/using-acrobat-pro-accessibility-checker.html 16 https://github.com/apache/pdfbox 17 Please see https://helpx.adobe.com/acrobat/using/create-verify-pdf-accessibility.html for a description of the accessibility report. 18 For papers containing no tables and/or figures, we observe that the Adobe checker can still return either pass or fail for the Table header and Alt-text criteria. When objects in the PDF are not tagged, the checker appears to fail these criteria even when the paper has no tables and/or no figures. When objects in the PDF are tagged and the PDF is accessible, the checker appears to pass these criteria even when the paper has no tables or no figures. Additionally, we note that if tables are not tagged as tables (and interpreted as paragraphs instead), the Table headers criteria may also pass.

Criterion

Table 2. We reproduce the analysis conducted by Lazar et al. [38] on PDFs of papers published in CHI, showing the percentage of papers that satisfy each of the five accessibility criteria. We find similar compliance rates, indicating that our automated accessibility checker pipeline is comparable to previous analysis methods. We also show the percentage of papers in our full dataset of 11,397 PDFs that satisfy each criterion, along with the percent that satisfy Adobe-5 Compliance.
Criterion CHI 2010 [38] Ours-CHI 2010 Ours-FOS All (11,397)
Alt-text 3.6% 4.0% 7.5%
Table headers 0.7% 1.0% 13.3%
Tagged PDF 6.3% 7.4% 13.4%
Default language 2.3% 3.0% 17.2%
Tab order 0.3% 1.0% 9.3%
Adobe-5 Compliance - - 2.4%

CHI 2010 [38] Ours-CHI 2010 Ours-FOS All (11, 397) Alt-text 3.6% 4.0% 7.5% Table 2 . We reproduce the analysis conducted by Lazar et al. [38] on PDFs of papers published in CHI, showing the percentage of papers that satisfy each of the five accessibility criteria. We find similar compliance rates, indicating that our automated accessibility checker pipeline is comparable to previous analysis methods. We also show the percentage of papers in our full dataset of 11, 397 PDFs that satisfy each criterion, along with the percent that satisfy Adobe-5 Compliance.

Compliance, a binary value of whether a paper has met all 5 criteria (1 if all 5 criteria are met, 0 if any are not met), and the rate of Adobe-5 Compliance for papers in our dataset.

In addition to running the accessibility checker, we also extract metadata for each PDF, focusing on metadata related to the PDF creation process. PDF metadata are generated by the software used to create each file, and we analyze the associations between different PDF creation software and the accessibility of the resulting PDF document. Our hypothesis is that some classes of software (such as Microsoft Word) produce more accessible PDFs.

For PDFs that "Passed" the alt-text criteria, we further process a sample of these documents to extract the authorwritten alt-text. Upon examination of these PDFs, we realized that passing the alt-text criteria is not equivalent to a document containing informative alt-text, e.g., many alt-text are auto-generated rather than author-written, and contain text such as "Image" or a filepath, without any concrete description of the actual figure contents. PDFs with this sort of alt-text will pass the Adobe checker yet do not have meaningfully accessible figures. To extract alt-text, we use the Adobe Acrobat Pro PDF to HTML conversion utility to convert a sample of documents into HTML, from which we can access the alt-text associated with each figure. Given that significant information content is required of figure alt-text to satisfy BLV user needs [42] , we analyze these alt-texts to determine whether they contain any kind of meaningful description of the figure content. Because the proportion of PDFs containing meaningful alt-text is fantastically low, essentially 0% in practice, we present a separate discussion of this in Section 3.4 separately from our baseline criteria.

3.2 Accuracy Of Automated Accessibility Checker

Previous work employed different accessibility checkers (Table 1) to generate accessibility reports. To confirm the accuracy of our checker, as well as the automated AppleScript we develop to perform the analysis, we run our checker on CHI 2010 papers to reproduce the results of Lazar et al. [38] . We identify CHI 2010 papers using DOIs reported by the ACM, and resolve these to PDFs in the Semantic Scholar corpus [8] . We generate accessibility reports for these papers using our automated checker and report compliance in Table 2 .

Our results shows similar rates of compliance compared to what was measured by Lazar et al. [38] . For all criteria, the difference amounts to an additional 1-3 papers passing the select criteria (among the total 302 papers published at CHI 2010). This variation can be explained by differences in the accessibility checker used, and from our having reconstructed the CHI 2010 corpus from the Semantic Scholar corpus, which may contain different versions of PDFs for those papers. We believe these results confirm that our accessibility checker results are reliable; if anything, the checker we use errs towards slightly higher accessibility compliance than the variant used by Lazar et al. [38] , indicating that the compliance rates we report may be on the high side, though are still very low.

3.3 Proportion Of Papers With Accessible Pdfs

Around 1.6% of PDFs we attempted to process failed in the Adobe checker (i.e., we could not generate an accessibility report). The accessibility checker most commonly fails because the PDF file is password protected, or the PDF file is corrupt. In both of these cases, the PDF is inaccessible to the user. We exclude these PDFs from subsequent analysis.

Fig. 2. The distribution of numbers of PDFs in our dataset that meet our defined accessibility compliance criteria. A large majority (8519) of PDFs in our sample meet 0 out of 5 accessibility criteria. Of those meeting 1 criterion (Total Compliance = 1), the most commonly met criterion is Default Language (793 of 1010, 78.5%). Of those meeting 4 criteria (Total Compliance = 4), the most common missing criterion is Alt-text (396 of 494, 80.2%).

Accessibility compliance over all papers is low. Table 2 shows the percent of papers meeting each of the five criteria, as well as the Adobe-5 Compliance rate associated with this sample of papers. Figure 2 shows that the vast majority of papers do not meet any of the five accessibility criteria (8519 papers, 74.7% do not meet any criteria) and very few (275 papers, 2.4%) meet all five. Of those PDFs meeting 1 criterion, the most commonly met criterion is Default Language (793 of 1010, 78.5%). Of those PDFs meeting 4 criteria, the most common missing criterion is Alt-text (396 of 494, 80.2%).

In fact, only 854 PDFs (7.5%) in the whole dataset have alt-text for figures. This is intuitive as Alt-text is the only criterion that always requires author input to achieve, while the other four criteria can be derived from the document or automatically inferred, depending on the software used to generate the PDF.

Fig. 3. Percent of papers per field of study that meet all 5 accessibility criteria defined in Adobe-5 Compliance. Philosophy, Art, and Psychology have the highest rates of Adobe-5 Compliance satisfaction while Biology, Mathematics, and Geology have the lowest rates. None of the fields had more than 6.5% of PDFs satisfying Adobe-5 Compliance.

As shown in Figure 3 , all fields have an Adobe-5 Compliance of less than 7%. The fields with the highest rates of compliance are Philosophy (6.3%), Art (6.2%), Business (5.7%), Psychology (5.7%), and History (5.3%) while the fields with the lowest rates of compliance are Geology (0.2%), Mathematics (0.3%), and Biology (0.6%). Fields associated with higher compliance tend to be closer to the humanities, and those with lower levels of compliance tend to be science and engineering fields. The prevalence of different document editing and typesetting software by field of study may explain some of these differences, and we explore these associations in Section 3.6.

3.4 Accounting For Meangingful Alt-Text

The Adobe accessibility checker is able to identify the presence of alt-text, but lacks the ability to assess the quality of alt-text. We successfully convert and extract alt-texts from 773 of 854 PDFs that "Passed" the Alt-text criteria. We define a series of heuristics to filter the extracted alt-text to identify meaningful author-written alt-text. These heuristic filters include removing alt-text that say nondescript things like "Image, " "Figure, " or "Logo, " and removing alt-text that are file paths or URLs to the images. These types of alt-text are likely auto-generated by typesetting software during PDF creation.

Following filtering, only 8 PDFs contain alt-text that could be considered meaningful, offering some description of the figure content. For reference, the best alt-text among these papers include text like "Image result for six sigma graph, " "Product-Limit Survival Curves with Number of Subjects at Risk, " and "European Parliament. " Based on these results, less than 1% of PDFs that passed the Alt-text criterion (already only 7.5% among our total sample of 11,397

PDFs) actually have something akin to useful alt-text. Extrapolating to the entire sample, this equates to around 0.07% of papers passing the more stringent criteria of having "meaningful" alt-text. This number, in practice, is essentially 0.

Table 3. Count of papers per Typesetting Software. “Other” includes PDFs created with an additional 24 unique software programs, each with counts of less than 350, as well as those created with an unknown typesetting software.
Typesetting Software Count (%)
Adobe InDesign 1591 (14.0%)
LaTeX 1431 (12.6%)
Arbortext APP 1374 (12.1%)
Microsoft Word 1318 (11.6%)
Printer 1021 (9.0%)

We note that this observation is different for papers published at CHI. When we perform the same extraction and filtering for papers and extended abstracts published at CHI 2019, more than 40% of PDFs that "Passed" the Alt-text criteria also pass our more stringent "meaningful" alt-text filters. Our conclusion is that PDF accessibility checkers are imperfect tools and additional analysis may be necessary to accurately gauge the status of accessibility for any particular field, venue, or other group of publications. We ask the reader to bear this in mind when interpreting the remaining analysis, or when considering the use of automated accessibility checkers for validating accessibility. Table 3 . Count of papers per Typesetting Software. "Other" includes PDFs created with an additional 24 unique software programs, each with counts of less than 350, as well as those created with an unknown typesetting software.

3.5 Trends In Paper Accessibility Over Time

We show changes in compliance for all fields of study over time in with the lowest rate of compliance is Alt-text, which has remained stable between 5-10% and has been lower in recent years. Since Alt-text is the only criterion of the five which always necessitates author intervention, and based on our prior observation that most alt-text among papers that passed the checker lacked sufficient information about the associated figure, we believe this is a sign that authors have not become more attuned to accessibility needs, and that at least some of the improvements we see over time can be attributed to typesetting software or publisher-level changes.

3.6 Association Between Typesetting Software And Paper Accessibility

Typesetting software is extracted from PDF metadata and manually canonicalized. We extract values for three metadata fields:

xmp:CreatorTool, pdf:docinfo:creator_tool, and producer. All unique PDF creation tools associated with more than 20 PDFs in our dataset are reviewed and mapped to a canonical typesetting software. For example, the values (latex, pdftex, tex live, tex, vtex pdf, xetex) are mapped to the LaTeX cluster, while the values (microsoft, for word, word) and other variants are mapped to the Microsoft Word cluster. We realize that not all Microsoft Word versions, LaTeX distributions, or other versions of typesetting software within a cluster are equal, but this normalization allows us to generalize over these software clusters. For analysis, we compare the five most commonly observed typesetting software clusters in our dataset, grouping all others into a cluster called Other.

Fig. 4. Accessibility compliance over time (2010-2019). The rate of Adobe-5 Compliance has remained relatively stable over the last decade, at around 2–3%. Compliance along several criteria have improved over time, though the largest improvements have been in Default Language, the simplistic criteria to meet. Modest improvements are seen for Table headers, Tagged PDFs, and Tab order. The presence of alt-text has remained stable and lower, around 5–10%.
Fig. 5. Histograms showing the distribution of Total Compliance scores for each of the top 5 typesetting software, ordered by decreasing mean Total Compliance. Microsoft Word stands out as producing PDFs with significantly higher Total Compliance than other typesetting software. Three of the top five PDF typesetting software clusters, Arbortext APP, Printer, and LaTeX, produce PDFs with low Total Compliance, with the majority of PDFs at 0 compliance.

We report the distribution of typesetting software in Table 3 . The most popular PDF creators are Adobe InDesign, LaTeX, Arbortext APP, Microsoft Word, and Printer. "Printer" refers to PDFs generated by a printer driver (by selecting "Print" → "Save as PDF" in most operating systems); unfortunately, creating a PDF through printing provides no indicator of what software was used to typeset the document, and is generally associated with very low accessibility compliance. The "Other" category aggregates papers created by all other clusters of typesetting software; each of these clusters is associated with fewer than 350 PDFs, i.e., the falloff is steep after the Printer cluster. For the following analysis, we present a comparison between the five most common PDF creator clusters. Figure 5 shows histograms of the Total Compliance score for PDFs in the five most common typesetting software clusters. While the vast majority of papers do not meet any accessibility criteria, it is clear that Microsoft Word produces the most accessible PDFs, followed by Adobe InDesign. To determine the significance of this difference, we apply the Kruskal-Wallis -test [36] , a non-parametric method for analysis of variance that can be applied to non-normally distributed data. With the PDF typesetting software clusters as the sample groups and the Total Compliance as the measurements for the groups, we compute a Kruskal-Wallis statistic of 4422.0 ( < 0.001). This indicates a significant difference in the distribution of Total Compliance scores between the five most common PDF typesetting software.

Microsoft Word in particular, demonstrates significantly higher accessibility compliance than other typesetting software;

additional analysis supporting this association and further interpretation of trends in typesetting software usage are given in Appendix A.

3.7 Summary Of Analyses

Overall, accessibility compliance over the past decade and across all fields of study have slowly improved. Full compliance based on Adobe-5 Compliance, however, has remained around 2.4% on average and does not show trends towards improving. Alt-text compliance is the lowest of our measured criteria, and its absence may be indicative of the general lack of author awareness and contribution to accessibility efforts for scientific papers.

Typesetting software may play an increasing role in document accessibility. Of the most common PDF creator software, Microsoft Word appears to produce the most accessibility-compliant PDFs, while LaTeX produces PDFs with the lowest compliance. Microsoft has recently made investments in the accessibility of their Office 365 Suite. 19 Software can clearly help to increase accessibility compliance by prioritizing accessibility concerns during document creation, and we encourage other developers of typesetting and publishing software to priortize accessibility concerns in their development process. However, we also caution that not all parts of PDF accessibility can be automated through software. Our assessment of alt-text quality for PDFs that passed the Alt-text criteria reveals that the vast majority of alt-text are not meaningful and may be auto-generated. Rather than addressing accessibility in a meaningful way, the increasing presence of this type of auto-generated alt-text may actually increase the difficulty of measuring and benchmarking accessibility improvements in the future.

Improvements in accessibility compliance have been limited in the past decade, likely because accessibility concerns are considered marginal, and are outside of the awareness of most publishing authors and researchers. Significant changes in the authorial and publication processes are needed to change this status quo, and to increase the accessibility of scientific papers for BLV readers. Though we believe and encourage change in the academic paper authorial and publication process in relation to accessibility, the likelihood of rapid improvement is low and these changes will not impact the many millions of academic PDFs that have already been published. For a fuller characterization of the problem, we conduct a user study to understand these challenges from a user perspective, before introducing a technological solution that could serve some of the immediate needs of the BLV research community.

4. Formative Study And Design Recommendations

We conduct a formative user study to better understand the needs of BLV scientists when reading papers, and to iterate upon the design of an automated system that might better support these needs. During the study, we discuss the user's current challenges reading papers and examine the users' typical PDF reading workflow. We then introduce an early prototype of our system, and observe how the user interacts with the converted HTML document. Based on observations and user feedback, we iterate upon the design of our system.

4.1 Study Design

The study consists of a preliminary questionnaire and semi-structured video interview. Interviews are conducted remotely on Zoom. 20 All recruitment materials, questionnaires, and the interview plan are reviewed and approved by the internal review board at anonymized. We recruited and interviewed six participants. Following several groups of participants, we made design modifications to our prototype as detailed in Section 5.

The inclusion criteria for participants are:

• The participant is over 18 years of age;

• The participant identifies as blind or low vision;

• The participant reads scientific papers regularly (more than 5 per year);

• The participant must have used a screen reader to read a paper in the last year; and

• The participant must complete the pre-interview questionnaire.

Participants were recruited through mailing lists, word-of-mouth, and snowball sampling. Prior to each interview, the participant was asked to provide several keywords corresponding to their subject areas of interest, and between 3-5 papers where they experienced difficulty reading the PDF. Among the 3-5 papers, we selected one paper to use for the study, based on the availability of the PDF, and maximizing the features that could be observed during the user study The primary research questions we investigate in this phase are:

-What methods and/or tools do BLV researchers use to assist in reading the literature?

-What main accessibility challenges do BLV researchers face?

-How do BLV researchers cope with these challenges?

We first asked the participant to describe their current workflow and the challenges they face when reading papers, clarifying how the user copes with challenges when their workflow does not adequately address the problem. We then asked the participant to demonstrate how they currently read a paper, by opening a paper PDF and walking us through the usage of their tools (PDF viewer, screen reader, magnifier, speech-to-text, etc).

Participants kept their computer audio on so we could hear the output of their reader tools. The participant was asked to think aloud and describe their actions when reading the paper. We asked the participant to demonstrate any reading challenges they described in their pre-interview questionnaire. At the end of this phase, we asked the participant to assess how easy or difficult it was to read the paper with their current reading pipeline.

Phase Ii: Interaction With Prototype

The primary research questions we investigate in this phase are:

-How do participants interact with our system?

-What system features resonate positively/negatively with the participant?

The goal of this phase was to understand whether our proposed system could be helpful to the participant, and to iterate on the design of our prototype system. The participant was asked to interact with an early prototype of our system, reading the same paper they read in Phase I. We first provided a brief introduction to the prototype, then allowed the participant to proceed uninterrupted for several minutes interacting with the paper. The participant was asked to think aloud during their interactions. Towards the end of this phase, we prompted the participant to interact with any features in our prototype they may have missed. At the end of this phase, we asked the participant to assess how easy or difficult it was to read the paper with our prototype.

Phase III: Q&A and discussion

The primary objectives of this phase are to answer the questions:

-How can our system be improved to best meet the participant's needs moving forward?

-How likely is the participant to use our system were it to be available in the future?

The participant was asked to describe their perceived pros and cons of the prototype system, and to provide suggestions of missing features, ordered by priority. We asked the participant whether they would use this system were it to be available, and if not, what features would need to be implemented to change that decision.

All interviews were conducted by one author, with two other authors observing the entire session and participating during Phase III. All interviews were recorded for followup analysis, and participants were compensated with a $150 USD gift card for their time. Questions used to guide the semi-structured interview are provided in Appendix F.3.

We identify themes and concepts from the participant interviews. We first perform open coding to identify relevant concepts, then axial coding to group these concepts under broad themes. These themes are 1) the technologies employed by users, 2) challenges in their current reading pipeline, and 3) mitigation or coping strategies, and in relation to our system: 4) positive features, 5) negative features or issues with the prototype, and 6) feature requests.

One author conducted open coding on recorded interviews to identify concepts and themes. Two authors then met several times to iterate upon these themes and concepts. In these meetings, the authors further defined attributes associated with each concept, such as defining whether the technologies used were in relation to opening PDFs, screen reading, or other tasks; or whether the challenges identified affect the whole document, navigation, text, or a particular in-paper element. Following discussion and freezing of the themes and concepts, all interviews were selectively coded a second time to identify all concepts and attributes. The authors also separately coded issues raised by participants in their pre-interview questionnaires. We report results for themes 1-3 in Section 4.4 and themes 4-6 in Section 4.5.

4.2 System Prototype For User Study

Table 4. User study participants, the prototype versions they interacted with, and the tools they currently use for reading papers. *P2 is low vision and uses sighted navigation tools in conjunction with a screen reader.
ID Prototype Version Current Tools
P1 v0.1 NVDA Screen Reader, Adobe Acrobat Reader
P2* v0.2 Mac Text-to-speech, Mac Magnifying Glass (sighted navigation), Mac Preview
P3 v0.3 Braille display, Mac VoiceOver, JAWS/NVDA on Windows, Mac Preview, Adobe Acrobat Reader
P4 v0.3 Mac VoiceOver, Mac Preview or Adobe Acrobat Reader
P5 v0.3 Microsoft Narrator, Adobe Acrobat Reader
P6 v0.3 Braille display, InftyReader, Mac VoiceOver, Mac Preview

The prototype of PaperToHTML provided to users during user study sessions is a minimal version of the system presented in Section 5. The prototype did not convert papers on demand. Prior to each study session, we converted PDFs provided to us by users in preliminary questionnaires, and allowed users to access these conversions during the interview using static hyperlinks. Table 4 . User study participants, the prototype versions they interacted with, and the tools they currently use for reading papers. *P2 is low vision and uses sighted navigation tools in conjunction with a screen reader.

under section headers, with figures and tables inserted at inferred locations between paragraphs; and references, with links between inline citations and reference entries. Detailed descriptions of HTML generation and system UI features are provided in Section 5. Features introduced based on user feedback are described in Section 5.3.

4.3 Study Participants

Participants are graduate students, PhD students, and faculty members from predominantly English-speaking countries, whose primary research areas are in computer science, though also spanning neuroscience and mathematics. We report findings from all participants for all themes captured, making note of features that changed in our system between versions used during the study. Three of six participants study human-computer interaction and accessibility, which may be due in part to our sampling methodology, but may also reflect the relevance of accessibility research to BLV researchers. Other study participants conduct research in the areas of machine learning, neuroscience, software engineering, and blockchain. All but one participant reported having more than one year of experience using screen readers. The tools employed by participants are summarized in Table 4 along with the version of the prototype system with which they interacted.

4.4 Study Findings: Current Pipeline

Summary of current experience. Of the six participants, three users have experience with screen readers on the Windows OS, such as NVDA, JAWS, and Microsoft Narrator, and three users use VoiceOver on MacOS. Two users use braille display in conjunction with their screen reader. One participant (P2) is low vision and uses a combination of text-to-speech and a magnifying glass to perform sighted navigation; P2's primary reading interaction involves selecting blocks of text in the PDF and using text-to-speech. Adobe Acrobat Reader is the most common software for opening PDFs; though several participants use Preview in MacOS, with one participant (P4) explicitly stating a preference for Preview over Acrobat. One participant uses a proprietary tool called InftyReader, 21 which converts PDFs into ASCII text and math formulas into MathML, which is accessible.

Table 5. Challenges to PDF reading identified by participants during interviews. *Only identified as an issue during pre-interview questionnaire.
Issue description Affects Raised by user
Scanned PDFs cannot be read without remediation Document P3, P4, P5*
No headings/sub-headings for navigation Navigation P1, P3, P5
Figures are not annotated as figures Navigation P1, P5
Losing cursor focus when switching away from the PDF Navigation P1
Headings are not hierarchical (no sub-headings) Navigation P5
Text is read as single string (no spaces or punctuation) Text P1, P4, P5
Headers/footers/footnotes mixed into text Text P1, P4, P5
Words with ligatures are mispronounced Text P1, P3
Words split at line breaks are mispronounced Text P2, P3
Reading order is incorrect Text P3, P5
Text before and after figures sometimes skipped Text P4
Text on some pages not recognized at all Text P4
Math content is inaccessible Element P1, P2, P3, P4, P5, P6
Tables are inaccessible Element P1, P2*, P3, P5, P6
Figures lack alt-text Element P1, P3, P5, P6
Figure captions are not associated with figures Element P1, P5
Characters or words in figures are read and do not make sense Element P4, P5
Figure alt-text (when provided) is not descriptive Element P5
Code blocks are inaccessible Element P2, P4
Table 6. Coping mechanisms discussed by users for dealing with challenging papers.
Coping mechanism Raised by user What users said
Give up, abandon the paper P1, P3, P5 P3: when asked how often they abandon papers, an- swers "60-70% of the time" P5: sometimes the only option is to "sit down and start crying" (jokingly, though the sentiment is true)
Try other conversion tools P1,P3,P6
Download LaTeX source or Word document if available P3, P4, P6
Ask sighted colleagues or family members to read P3, P5, P6
Ask for remediation / convert to braille P4, P5, P6 P4: 10 day turnaround is on the quick side, which is not good enough for research P5: process takes a long time, around 1-2 weeks
Try other PDF readers or browsers P1, P6 P1: may try Microsoft Edge browser even though it usually does not help, but he feels "hopeful"
Message authors to get source document P3, P4 P4: sometimes the author manuscript is accessible but the camera-ready version is not; fault of the confer- ences and publishers, not the authors

Challenges of current PDF reading pipeline. Coping mechanisms. The coping mechanisms employed by BLV researchers to read inaccessible PDFs are wide-ranging, often involving trying tools outside of their primary workflow, soliciting help from others, or in the worst case, giving up and moving on. We describe these in Table 6 . Several users reported trying certain tools like alternate PDF readers, browsers, or optical character recognition (OCR), even though the tools usually do not result in a significant improvement over their standard pipeline; when asked why, several participants reported feeling "hopeful" that a tool might work (P1) or hoping to "get lucky" (P3).

Several of these coping mechanisms involve other people. For example, three participants reported needing to ask sighted colleagues or family members to copy text, or to explain select paper content, especially figures and equations.

Asking for PDF remediation was also a possibility for several participants; in this process, workers at the researcher's host institution convert a PDF into an accessible format, manually assigning reading order, correcting equations, and writing descriptions for figures. The output of the remediation process is seen as "ideal" (P4), but the process takes significant time (several weeks for any PDF) and may not fit into a researcher's schedule and timeline. Additionally, this process may only be available to researchers affiliated with a significantly large and resourced institution, and as P6

discusses, may no longer be a viable option for those who work outside of academia. In some cases, BLV researchers

Coping mechanism Raised by user What users said

Give up, abandon the paper P1, P3, P5 P3: when asked how often they abandon papers, answers "60-70% of the time" P5: sometimes the only option is to "sit down and start crying" (jokingly, though the sentiment is true) may also message authors directly to gain access to the source documents (P3 and P4). Both LaTeX source and Word documents are more accessible than PDFs, and access to these source documents can greatly improve the ability to read these papers.

Perhaps most disheartening is how often BLV researchers may simply give up in the face of an inaccessible paper.

P1 says that by the time he has spent several hours making a paper readable, he may have already lost interest and motivation to read it. When asked how often papers are abandoned, P3 responds 60-70% of the time. Though P4 does not discuss abandonment directly, P4 shares the following relevant sentiment: "reading papers is the hardest part of research" for a BLV researcher, and if papers were more accessible, there would be more blind researchers.

4.5 Study Findings: Papertohtml Prototype

Table 7. Positive and negative features identified in the prototype. *The feature was implemented or the issue addressed in v0.2 following v0.1 feedback. **The issue was addressed in v0.3 following v0.2 feedback.
Feature Raised by user What users said
POSITIVE
Bidirectional links between inline citations and references P1, P2, P3, P4, P5, P6 P3: "very few research teams actually get this and get this right, so well done"; 'crucial piece of the puzzle"
Headings for easy navigation P1, P2, P3, P4, P6 P4: "Headings are the best thing ever"; makes it very clear what section you are in
Table of contents* P2, P3, P5, P6
Figures are tagged as figures, and captions are associated P4, P5, P6
Can use browser and os features like find/copy/paste P1, P4
Simple typography for reading P2
Can interact with headings word-by-word or letter-by-letter P4
Not extracted items are noted as missing P4 P4: "at least I know there was an equation here"
NEGATIVE
Some headings extracted incorrectly P1, P3, P5
Some headings missed in extraction P3, P5 P5: "it's really important that i trust it"; "there [should be] *no* false negatives"
Code block not extracted P2, P4
Tables are extracted as figures P2, P6
Equations not extracted P4, P6 P6: Not sure if this system extracts equations because some-
times there is some math in the body text
Figures placed away from text* P1
No alt-text extracted P1
URLs missing from bibliography entries** P2
Some information not surfaced (keywords, footnotes) P3
Some headers/footers/footnotes mixed in text P4
Headings are not hierarchical P5

Positive and negative features. All user interviews were analyzed to extract positive and negative responses to various features or flaws of the prototype. We summarize these features and flaws in Table 7 . Among the participants' favorite features are links between inline citations and references (all 6 participants), section headings for navigation (5 participants), the table of contents (4 participants), and figures tagged as figures with associated figure captions (3 participants). Regarding links between inline citations and references, several participants were especially supportive of the return links that allow the reader to return back to their reading context after following a citation link. P3 said that the links acted as external memory, allowing BLV users to essentially "glance" at the bibliography and back, like a sighted user might. Similar sentiments were shared by P5 and P6, although P5 also proposed the possibility of preserving the context even further by providing bibliography information inline rather than navigating back and forth between the main text and references section.

Among the negative features observed by participants, most have to do with imperfect extraction, for example, incorrectly extracted headings (3 participants), missed headings (2 participants), and various extraction issues with code blocks, tables, equations, and more. Many of these issues are described further and quantified in Section 6. Of these

Feature

Raised by user What users said

Positive

Table 8. Feature recommendations made by users during study sessions.
Feature Raised by user What users said
Easier access to papers P1, P3, P4, P6 P3: detect if a page is a paper, and automatically generate HTML P6: "if you could upload a PDF and create structure, that'd be great"
Reduce verbosity of back links P1, P4, P5
Make links optional P1, P5 P1: links are too verbose, make it optional
Infer hierarchical structure of Section headings P3, P5
Annotate lists as
    or
      HTML lists
P1, P5
Use MathML or MathJax representation for equations P1, P6
Provide alt-text for figures P3 P3: automatic alt-text would be good, though the "best we can do now is maybe 'this is a graph'"
Export for offline reading P3

Bidirectional links between inline citations and references P1, P2, P3, P4, P5, P6 P3: "very few research teams actually get this and get this right, so well done"; "crucial piece of the puzzle" Headings for easy navigation P1, P2, P3, P4, P6 P4: "Headings are the best thing ever"; makes it very clear what section you are in : automatic alt-text would be good, though the "best we can do now is maybe 'this is a graph'" Export for offline reading P3 Table 8 . Feature recommendations made by users during study sessions.

issues, problems with heading extraction were most notable, likely because the heading structure is the first element of the document with which the participants interact, and it provides a mental model of the overall document structure.

Mistakes in heading extraction are obvious and erode trust in our overall system. As P5 says, "it's really important that I trust it, " and errors of this nature, both false positive and false negative extractions, can reduce trust. Similarly, though we describe in our introductory material that our system currently does not extract equations, P6 points out that it is unclear whether the system extracts equations because occasionally math can be found in the body text. This type of conflict between what is described and what is reality can also reduce trust. However, one may be able to build trust even in the face of extraction errors by indicating to the user when content is not extracted; as P4 says regarding the placeholders for not extracted items, "at least I know there was an equation here. "

Feature requests. Feature recommendations made by participants are summarized in Table 8 . By far the most common feature request related to making more papers available in our system. Suggestions included integrating with a scholarly search engine (P1), creating a tool to automatically detect and convert papers to HTML in the browser (P3), linking the system with university libraries for access to more paper PDFs (P4), and allowing the user to upload any PDF and performing the conversion (P6). For the public version of the system described in Section 5, we opt for the last of these recommendations.

Several requests related to the verbosity of links in our system. For screen reader navigation, all links and spaces between links require extra button clicks for navigation, reducing reading speed. We received several recommendations on ways to reduce the number of keystrokes necessary to navigate through both forward and back links between inline citations and references. Additionally, both P1 and P5 suggested making inline citation links optional, creating both a skimming mode (without links) and deep reading mode (with links for navigating the citation graph). In the version of the system presented in this paper, we attempt to reduce the verbosity of links, and we intend to explore the notion of optional link configurations in future work.

We describe how some feature requests are integrated during iterative design and development of our system (Section 5.3). For issues and suggestions related to specific paper content such as figure alt-text and equations, significant work remains to accurately and efficiently extract this content from papers and/or generate the content if missing (e.g. alt-text). We assess the accuracy of extraction of specific paper components in our current system (Section 6) and discuss possible approaches in Future Work (Section 7.1).

Future usage. At the end of each session, we ask users whether they would be likely to use the prototype in the future.

We ask specifically: On a scale of 1 to 5, how likely are you to use the HTML render, if it is available to you in the future? (Answers: 1 = Very unlikely, 2 = Unlikely, 3 = Neutral, 4 = Likely, 5 = Very likely) If the answer is unlikely or neutral, we ask what changes would need to be made to the tool such that they would use it.

All users reported that they would use the prototype in the future. Five users responded 5, that they would be very likely to use it; one user (P5) responded 3 to the prototype as it currently is, and 5 if some of the issues for heading extraction were addressed. P1, who interacted with an early prototype with fewer implemented features, said that this would become a tool in the toolbox, but he would not be able to rely solely on it due to incomplete extractions. P5 expressed a similar sentiment, that in its current state, he may try the prototype system when his current workflow fails, but if issues around heading extraction were addressed, he would be very likely to use it. P3 replies when asked how the system might be integrated into their workflow, "I think it would become the workflow. " P4 says "for unaccessible PDFs, this is life-changing. "

4.6 Design Recommendations

Fig. 6. Design recommendations for screen reader friendly paper reading systems. A system should aim to provide the document structure in a way that matches the mental model of the user, and to tag all elements appropriately. These aspects are achievable through proper tagging of a paper, including in PDF format. Additionally, a system should aim to act as external memory for the user, minimizing the amount of cognitive load needed to return to their reading context. To improve trust, a system should indicate when there is known missing data in the extraction or a possibility of missing or incorrect data. Finally, a system should reduce verbosity, ensuring that as few keystrokes as possible are necessary for the user to perform their desired task.

We distill our learnings into a set of five design recommendations for BLV user-friendly paper reading systems. Figure 6 summarizes the following recommendations:

1. Document structure should match the mental model of the user. Structure is necessary for providing an overview of a document and is essential to navigation. For example, headings in a paper should be tagged as such and the hierarchy of the headings should match the mental model of the user, i.e., top level headings should be tagged

or

, and lower level headings

through

accordingly.

2.

Objects in the paper should be tagged appropriately. Regarding user trust: this should a priority in any AI-based system. Because PDF extraction and document rendering are imperfect processes, some degree of error is expected. Though all participants in our user study expressed that some error is tolerable, one can mitigate the conversion of errors to distrust by clearly indicating known errors and missing content in the system. For example, in some cases our system is unable to extract a figure caption; if the caption for Figure 3 is not extracted, rather than skipping from Figure 2 to Figure 4 and causing confusion for the reader, it is better to indicate that Figure 3 is missing in the extraction.

A system that responds quickly to user requests is obviously more desirable. However, several participants indicated that some wait time is acceptable, especially if a longer wait time corresponds to a higher quality reading experience.

Though we report this finding, it may not hold for all or even a majority of users in practice. While our system is significantly faster than the typical PDF remediation process (which takes 1-2 weeks), the balance between speed and quality and their effects on usability require further exploration.

Though we derive these design recommendations in the scope of paper reading, they may be generalizable to other classes of documents. In fact, several of these design principles echo available guidelines for human-AI interaction [7] , especially in indicating the capabilities and limitations of the system (recommendation 4). Other recommendations focus on emulating the types of advantages that sighted users derive from layout and visual information, but to implement them in such a way that BLV users can benefit, e.g., using the system as a source of external memory.

5. The Papertohtml System

To address the accessibility challenges described in Sections 3 and 4, we prototype and develop the PaperToHTML system for extracting semantic content from paper PDFs and re-rendering this content as accessible HTML. HTML is widely accepted as a more accessible document format than PDFs. In the 2019 Access SIGCHI Report, the authors discuss the reasoning behind switching CHI publications to a new HTML5 proceedings format to improve accessibility [44] . By rendering the content of paper PDFs as HTML, and introducing proper reading order and accessibility features such as section headings, links, and figure tags, we can offset many of the issues of reading from an inaccessible PDF.

We describe the conversion pipeline and UI features of our system. Figure 1 provides a schematic for our approach. PaperToHTML leverages two open source PDF processing tools, Grobid [41] via the S2ORC [40] library and DeepFigures [64] , the Semantic Scholar API, 22 and a custom Flask application for rendering the extracted content of the PDF as HTML. The S2ORC project [40] integrates the Grobid machine learning library [41] and a custom XML to JSON parser 23 to produce a structured representation of paper text. We use a version of the S2ORC pipeline using Grobid v0.6.0 [41] . In the current iteration of PaperToHTML, we do not display author affiliations, footnotes, and most mathematical equations due to the difficulty of extracting these pieces of information accurately from the PDF. Though some of the elements are extracted in S2ORC, the overall quality of the extractions for these elements is lower, and is currently insufficient for surfacing in the prototype (see Section 6 for details). Future work includes investigating the possibility of extracting and exposing these elements, either by improving current models or training new models targeted towards the extraction of specific paper elements. following paragraph 2. This ensures that the layout for the HTML render closely approximates the intended reading order. We justify this decision based on user feedback discussed in Section 4. Back links between bibliography and inline citations. Following each bibliography entry, we provide links back to the first mention of that entry in each section of the paper in which it was mentioned. For example, if bibliography entry [1] is cited in the "II. Related Works" section and the "III. Methods" section, we provide two links following the entry in the bibliography to the corresponding citation locations in sections II and III, as in:

5.2 Papertohtml Ui Features

[1] Last name et al. Paper title. Venue. DOI.

Link To Return To Section Ii, Link To Return To Section Iii

This allows users to navigate back to their reading location in the document after clicking through to a bibliography entry. A user may otherwise hesitate to resolve a link, because it may result in losing their place and train of thought.

5.3 Integrating User Feedback Into Papertohtml

We leverage the feedback we received during our user studies (see Section 4) to make improvements to PaperToHTML.

We denote the versions of the prototype as v0.1 (version visited by P1), v0. Following the user study, for version v1.0, we implemented a system to allow users to upload PDFs and generate HTML renders on demand. This was in response to the most common feature request of desiring easier access to papers (essentially making the system practically usable). We opt for P6's suggestion of allowing the user to upload

PDFs (therefore allowing a user to process any paper PDF to which they have access), and automatically processing the PDF and creating an HTML document on demand (allowing access to the conversion from any browser). PDF processing in v1.0 takes around 30 seconds-2 minutes. The system does not retain uploaded PDFs, but it caches the HTML render for faster subsequent access (following initial processing, reloading a document takes less than seconds).

We also incorporate metadata from the Semantic Scholar API into v1.0 to improve the quality of the displayed metadata.

By sourcing paper titles, authors, venue, year, and abstract information from Semantic Scholar, we are able to take advantage of high-quality and occasionally curated metadata from publishers and paper aggregators.

We evaluate the quality of HTML renders generated by PaperToHTML in Section 6. Based on these results and positive user response (Sections 4 and 7), we believe our approach can dramatically increase the screen reader navigability and accessibility of scientific papers by providing alternate and more accessible HTML versions of these papers on-demand.

6. System Evaluation

Extracting semantic content from PDF is an imperfect process. Though re-rendering a PDF as HTML can increase a document's accessibility, the process relies on machine learning models that can make mistakes when extracting information. As we glean from user studies, BLV users may have some tolerance for error, but there is an inherent trade-off between errors and perceived trust in the system. We conduct a study to estimate (1) the faithfulness of the HTML renders to the source PDFs, and (2) the overall readability of the resulting HTML renders produced by PaperToHTML. We define faithfulness as how accurately the HTML render represents different facets of the PDF document, such as displaying the correct title, section headers, and figure captions. These facets are measured as the number of errors that are made in rendering, e.g., mistakenly parsing one figure caption into the body text is counted as one error towards that facet. Readability, on the other hand, is an ordinal variable meant to capture the overall qualitative usability of the HTML document. Each document is given one of three grades, those with no major problems, some problems, and many problems impacting readability.

To evaluate readability and faithfulness, we first perform open coding (Section 6.1) on a small sample of paper PDFs and their corresponding PaperToHTML HTML renders. The purpose of this exercise is to identify facets of extraction that impact the ability to read a paper. A rubric is then designed based on these identified facets. The process taken to design the evaluation rubric, the rubric's content, and annotation instructions are detailed in Section 6.2. We then annotate a sample of 385 papers across different fields of study using this rubric. For each type of error identified during open coding, we compute the overall error rates observed in our sample. We also present the overall assessed readability, reported in aggregate over our sample and by fields of study (Section 6.3).

6.1 Open Coding Of Document Facets

One author performed open coding on a sample of papers, comparing the PDF and HTML renders to identify inconsistencies and facets that impact the faithfulness of document representation. Papers are sampled from the Semantic Scholar API 24 using various search terms, and selecting the top 3 results for each search term for which a PDF is available. Search terms are selected to achieve coverage of different domains, and top papers are sampled to select for high-relevance publications. The author stopped sampling papers after no new facets could be identified, resulting in 8 search terms and 24 papers. The search terms used were: human computer interaction, epilepsy, quasars, language model, influenza epidemiology, anabolic steroids, social networks, and arctic snow cover.

Table 9. Paper facets identified for evaluation along with classes of common errors.
Facet Description Common errors
TITLE The title and subtitle of the paper Missing words Extra words
AUTHORS A list of authors who wrote the paper; this includes affiliation, though we do not explicitly evaluate affiliation in this study Missing authors Extra authors Misspellings
ABSTRACT The abstract of the paper Some text not extracted Other text incorrectly extracted as abstract
SECTION HEADINGS The text of section headings Some headings not extracted (part of body text) Other text incorrectly extracted as headings
BODY TEXT The main text of the paper, organized by paragraph under each section heading Some paragraphs not extracted (missing) Some text not extracted Other text incorrectly extracted as body text
FIGURES Images, captions, and alt-text of each figure Figure not extracted Caption text not extracted (part of body text) Other text incorrectly extracted as caption text
TABLES Caption/title and content of each table Table not extracted (not part of body text) Table not extracted (part of body text) Caption text not extracted (part of body text) Other text incorrectly extracted as caption text
EQUATIONS Mathematical formulas, represented in TeX or Math ML; note: our current pipeline does not extract math Some equations not extracted Some equations incorrectly extracted
BIBLIOGRAPHY Bibliography entries in the reference section Some bibliography entries not extracted Some bibliography entries incorrectly extracted Other text incorrectly extracted as bibliography
INLINE CITATIONS Inline citations from the body text to papers in the bibliography section Some inline citations not detected Some inline citations incorrectly linked
HEADERS, FOOTERS 0 FOOTNOTES Page headers and footers, footnotes, endnotes, and other text that is not a part of the main body of the document Some headers and footers incorrectly extracted into body text

For each paper, the author evaluated the PDF and HTML render side-by-side, scanning through the document to identify facets which differ between the two document representations. Specifically, the author looked for any text in the PDF that is not shown in the HTML, any text from the PDF that is not where it belongs in the HTML (e.g. figure captions , headers, or footnotes that should be separate from the main text but are mixed in, interrupting reading flow), and other parsing mistakes (e.g. errors with math, missing lists and tables etc). The observed extraction errors are grouped by facet in Table 9 .

6.2 Evaluation Rubric

We develop evaluation rubrics and forms for grading the quality and faithfulness of the HTML render. The evaluation form attempts to capture errors in PDF extraction that affect each of the primary facets identified in Table 9 . We also ask annotators to provide an overall assessment of the HTML's readability. Instructions for completing the annotation form are given in Appendix B.1. The final version of the form is replicated in Appendix B.1, and the rubric for evaluating overall readability is given in Appendix B.3.

Three authors iterated twice on the content of the evaluation form, until consensus was reached that all paper facets were adequately assessed using a minimum set of questions. Two authors then participated in pilot annotations, where each person independently annotated the same set of five papers sampled from the set labeled by the third author during open coding. Answers to all numeric questions were within ±1 for these five papers when comparing the two authors' annotations. All three authors discussed discrepancies in overall readability score, iterating on the rubric defined in Appendix B.3 and coming to a consensus. The finalized form and rubric are used for evaluation.

Fig. 7. A successfully extracted figure from Nascimento and Bioucas-Dias [47] is shown with its corresponding figure caption (top left). When figures are not extracted and inferred to exist (handle mentioned in text or number between two extracted figures), a placeholder image is shown along with a message referencing the failed extraction (top right). Similarly, when an equation is detected to be present in the PDF and not extracted, we insert text signaling the failed extraction and refer the user to the source document (bottom).

Of the facets and errors described in Table 9, our current pipeline does not extract table content and equations. Tables are extracted as images by DeepFigures [64] , which do not contain table semantic information. Regarding equations, we distinguish between inline equations (math written in the body text) and display equations (independent line items that can usually be referenced by number); for this work, we evaluated a small sample of papers for successful extraction of display equations. Though some display equations are recognized, the quality of equation extraction is low, usually resulting in missing tokens or improper math formatting. Therefore, we decided to replace display equations in the prototype with the equation placeholder shown in Figure 7 . Since problems with mathematical formulae are among those most mentioned by users in our study, equation extraction is among our most urgent future goals, and we discuss some options going forward in Section 7.1. Table 9 . Paper facets identified for evaluation along with classes of common errors.

6.3 Evaluation Results

Table 10. Inter-rater agreement for evaluation. For categorical questions, such as title, author, abstract, bibliography, inline citation, and overall score, we report the number of classes available for annotation, along with annotator agreement and Cohen’s Kappa. For numerical questions, such as the number of each type of extraction error, we report agreement, the intraclass correlation coefficient (ICC), and the average difference and standard deviation of the values between the two annotators.
Evaluation criteria Number of classes Agreement Cohen's Kappa ICC Mean Difference ( SD)
Title 3 0.87 0.33 - -
Authors 3 1.00 1.00 - -
Abstract 3 0.95 0.64 - -
Number of figures - 1.00 - 1.00 0.00 0.00
Figure extraction errors - 0.89 - 1.00 0.11 0.31
Figure caption errors - 0.89 - 1.00 0.11 0.31
Number of tables - 0.92 - 0.98 0.12 0.43
Table extraction errors - 0.89 - 0.98 0.17 0.50
Table caption errors - 0.78 - 0.94 0.33 0.67
Header/footer/footnote errors - 0.40 - 0.60 1.88 2.12
Section heading errors - 0.71 - 0.79 0.71 1.70
Body paragraph errors - 0.46 - 0.66 1.50 2.22
Bibliography extraction 4 0.94 0.82 - -
Inline citation linking 4 0.80 0.11 - -
Overall score 3 0.55 0.07 - -

We start with the dataset of 11,397 papers analyzed in Section 3, and subsample 535 documents stratified by field of study. Two authors, both with undergraduate science training, code papers from this sample, with an aim of annotating around 20 papers per field of study. Though we achieve the target number for most fields, we missed this target for some fields closer to the humanities because more of these documents are difficult to manually annotate within our time and resource constraints. For example, documents are deemed unsuitable for annotation if they are not papers (i.e., they are books, posters, abstracts, etc), if they are too long, or if they are not in English. In these cases, these document can be skipped. Detailed guidance on suitability is provided in the annotation instructions (Appendix B.1). Table 10 . Inter-rater agreement for evaluation. For categorical questions, such as title, author, abstract, bibliography, inline citation, and overall score, we report the number of classes available for annotation, along with annotator agreement and Cohen's Kappa. For numerical questions, such as the number of each type of extraction error, we report agreement, the intraclass correlation coefficient (ICC), and the average difference and standard deviation of the values between the two annotators.

as those on the extraction of title, authors, abstract, and bibliography. For numerical questions such as counting the occurrence of extraction errors related to figures, tables, section headings, and body paragraphs etc, we report the intraclass correlation coefficient (ICC) as well as the average difference of values between the two annotators. See Table 10 for these results. 25 Agreement was high for most element-level annotator questions. Annotators had the highest levels of disagreement on the evaluation of header/footer/footnote errors, section heading errors, and body paragraph errors, likely due to these being text-based and the most numerous; though the average differences reported between annotators on these questions are only between 1-2. Likewise, agreement on overall readability score is modest, at 0.55; we note, however, that neither annotator labeled any paper as having no major readability problems when the other annotator labeled it as having lots of readability problems.

Fig. 8. Evaluation results for various document components. Corresponding numbers are provided in Table 13 in Appendix C.
Table 11. Evaluation questions. Optional questions are in italics.
Answer Questions
y/p/n text Is the TITLE correctly extracted? Comment (clarify if answer is "partially" or "no")
y/p/n text Are the AUTHOR(S) correctly extracted? Comment (clarify if answer is "partially" or "no")
y/p/n text Is the ABSTRACT correctly extracted? Comment (clarify if answer is "partially" or "no")
y/n Does this paper contain a substantial number of math EQUATIONS (more than 5 display equations)?
number number number number text How many FIGURES are in the PDF? (Enter '0' if none) How many FIGURES are correctly extracted? (Enter '0' if no figures in paper) How many FIGURE CAPTIONS are correctly extracted? (Enter '0' if no figures in paper) Approximately how many FIGURE captions are **INCORRECTLY** parsed into the body text (should be a figure caption but is mixed in with the body text)? (Enter '0' if they are all correct or if no figures) Comment (Optional - Note anything here about FIGURES or FIGURE CAPTIONS, e.g. which figures are not extracted, which figure captions are not extracted, which figure captions are incorrectly extracted into the body text etc.)
number number number How many TABLES are in the PDF? (Enter '0' if none) How many TABLES are correctly extracted? (Enter '0' if no tables in paper) How many TABLE TITLES / CAPTIONS are correctly extracted? (Enter '0' if no tables in paper or if those tables do not have titles / captions)
number Approximately how many TABLE titles are **INCORRECTLY** parsed into the body text (should be a table title / caption but is mixed in with the body text)? (Enter '0' if they are all correct or if no tables or table titles / captions)
number Approximately how many TABLES have content that is **INCORRECTLY** parsed into the body text (content of table is mixed with the body text)? (Enter '0' if they are all correct or if no tables)
text Comment (Optional - Note anything here about TABLES or TABLE TITLES / CAPTIONS, e.g. which tables are not extracted, which table captions are not extracted, which table title / captions / content are incorrectly extracted into the body text etc.)
number text Approximately how many times are page HEADERS or FOOTERS *INCORRECTLY mixed into the body text?This also includes margin content such as arXiv watermarks. (Enter '0' if all okay or no headers or footers) Comment (Optional - Note anything interesting here about incorrectly parsed headers or footers; no need to provide page numbers)
number text Approximately how many SECTION HEADINGS are *INCORRECTLY* extracted? (Enter '0' if they are all correct or no section headings) Comment (Optional - Note anything interesting about the section heading extractions, no need to list exhaustively)
number text Approximately how many BODY TEXT PARAGRAPHS are **MISSING** from the extraction? (Enter '0' if they are all there or there is no body text) Comment (Optional - Note anything interesting about the body text extractions)
choice Are BIBLIOGRAPHY entries extracted correctly? (options: all correct, mostly correct, half correct, mostly incorrect, incorrect, no bibliography)
text Comment (Optional - Note anything interesting about the bibliography extractions; no need to list exhaustively)
choice Are INLINE CITATIONS linked to bibliography entries? (Please answer this questions considering only the bibliog- raphy entries that were extracted) (options: all linked, majority linked, half linked, most unlinked, none linked, no bibliography)
text Comment (Optional - Note anything interesting about the inline citation linking; no need to list exhaustively)
text Are there any other problems with the HTML parse that are not covered by one of the above questions? Please describe. (Optional)
choice Please rate the overall full text quality in the HTML render (options: no major problems, some problems, lots of problems - see rubric in Section B.3)
Table 12. Rubric for HTML parse quality assessment (final question in evaluation questionnaire).
Table 13. Assessment count for all evaluation paper elements. Corresponds to distributions shown in Figure 8.

All results and statistics are reported on the set of 385 annotated papers. Similarly, the majority of tables and table captions are correctly identified as tables and table captions, and are not incorrectly mixed into the body text. We note that the lack of an error here does not indicate that the table is extracted correctly in an accessible manner, just that it is not incorrectly parsed as body text. Fig. 8 . Evaluation results for various document components. Corresponding numbers are provided in Table 13 in Appendix C.

Unsurprisingly, errors in text element parsing are the most prevalent, especially for headers/footers/footnotes and section headings. The most common type of header/footer/footnote error observed are when these texts are mixed into the body text around page breaks, interrupting reading flow. These types of errors are also observed frequently during screen reader use when reading directly from an untagged PDF. For section headings, in particular, the majority of papers have errors; around 67% of papers have between 1-5 errors (either missed headings or extraneous headings), and 9% have more than 5 errors. Due to the large number of section headings in papers, parsing errors are more frequent, and unfortunately, these errors impact the ability to properly navigate the HTML parse. Errors in body text extraction also negatively impact readability, in this case, select text in the document is being missed completely in the HTML render.

Though the majority of parses have no body text errors, around 33% of papers have between 1-5 missing paragraphs. We are encouraged that a majority of HTML renders have no major problems, though our results necessitate further understanding of the papers with which our extraction pipeline has difficulty. If papers with lots of problems can be identified a priori, we can prevent surfacing these low quality parses to the user. We perform preliminary experiments to identify errors that are correlated with readability problems; though no error stood out as being predictive, we present these results in Appendix E.

7. Discussion

In this work, we present the results of studies that aim to quantitatively and qualitatively characterize the current state of academic paper PDF accessibility for readers who are blind and low vision. We also introduce the PaperToHTML system as a potential technological solution that can be used to mitigate some of the challenges faced by BLV readers.

Based on our analysis, the current state of paper accessibility is grim, with an average of 2.4% of papers satisfying our defined accessibility criteria. Though we observe some improvement over time, we are not optimistic that these improvements are due to author prioritization of accessibility, since the presence of figure alt-text (the only of the five criteria that requires author intervention) remains low. Rather, the commitment to accessibility made by certain typesetting software such as Microsoft Word may be responsible for a portion of these improvements. Given the strong correlation between PDF creation software and accessibility compliance, we encourage conferences, publishers, and authors to consider the tools they are using to generate PDFs, and to integrate accessibility requirements during the publication process. Though we also issue this warning: passing accessibility checkers is not the same thing as creating an accessible document; many of the PDFs that "Passed" our Alt-text checker lacked any semblance of meaningful alt-text. Relatedly, we may also need to reconsider our processes around measurement, and ensure that what we measure is correlated to meaningful improvements in document accessibility.

Given the scope and magnitude of the problem, and how PDF is still the dominant file type used for distributing scientific papers, there are clear needs for immediate technological solutions. We introduce the PaperToHTML system, which integrates several text and vision machine learning models to extract content from paper PDFs and render this content as HTML. The system adds tags and infers reading order, thereby improving the navigational capabilities of BLV users. Of course, no extractive pipeline is perfect, and we quantify and qualify extraction quality through an evaluation study and user study. Our evaluation results, coupled with user assertions of future engagement, suggest that though imperfect, the system may already be useful for BLV readers in its current form. We are also optimistic that the quality of PDF extraction and understanding will improve in the coming years, further addressing many of the problems identified by participants in our study.

For any machine learning-based document understanding system, errors are inevitable. Managing user expectations for these systems is crucial. This echoes previously published guidelines for human-AI interaction, which suggest communicating to the user the capabilities and limitations of the AI system [7] . Setting expectations correctly and referring the user back to the original source document when the extractive procedure fails can help mitigate inappropriate reliance on the system.

Our findings also confirm that paper reading interfaces for BLV users will require different affordances than reading interfaces for sighted individuals. For example, some users desired the option to turn off basic features like within-paper links. Though features like citation-reference links are easy to ignore when a sighted reader reads a paper, they can cause unnecessary interruption and friction in a BLV user's workflow. Haptic nudges like those introduced by Khan et al. [34] in their auditory skimming interfaces could be a useful alternative to HTML links, though would need to be adapted for web-based interfaces.

We hope our design recommendations will facilitate further conversations around the needs of BLV users, and that they may result in systems that ease the reading burden for these users. As one participant puts it, "reading papers is the hardest part of research" for researchers who are blind or low vision, and if papers were more accessible, "there would be more blind researchers. " It is a duty of the entire community to facilitate this, and to design, prototype, and build systems to support the needs of the BLV research community.

7.1 Limitations & Future Work

This work focuses on rendering PDF papers in HTML to improve document navigation and provide a more intuitive reading order. There are many other aspects of accessibility with which we do not contend, such as providing figure

alt-text, accessible math, or tagging tables. Future work involves investigating various ways to improve or provide these features automatically, or by harnessing the power of the community to provide some of these features for papers as they are requested. For example, we may integrate element-specific reading features for mathematical equations [10, 26, 43, 66] or graphs and charts [21] [22] [23] , or create a crowd-sourcing pipeline to solicit alt-text annotations for figures that lack descriptions. PDF parsing remains an open research problem with many challenges. Our reliance on these technologies necessarily introduce error into our pipeline and system. We attempt to describe and quantify these errors in Appendix E, but found no strong correlation between any particular type of error and the overall quality assessment. Unfortunately, this means that there is no obvious mitigation strategy for identifying low-quality extractions before they are shown to users. Further work remains to automatically or semi-automatically identify low-quality parses prior to surfacing them.

For example, we could investigate other paper features as predictors of parse quality. With more labeled data, we may also be able to train a neural classifier to identify low-quality parses. Additionally, we emphasize that many different PDF parsers and PDF understanding systems can be used to extract document content and create an HTML render (more in Section 2.4). In future work, we intend to investigate whether new state-of-the-art systems like Shen et al. [62] could be adapted to perform scientific PDF content extraction with reasonable effectiveness and efficiency.

In this work, we focus on processing PDFs and making them accessible. Some papers are available in XML, HTML, or other structured markup languages; and LaTeX or Word document source can be found for others. We could automatically take advantage of these alternatives to PDFs when they are publicly available, for example, by rendering the semantic content of the paper as extracted from these other document representations as in arXiv Vanity 26 for arXiv LaTeX source or Pubmed Central's PubReader, 27 which renders JATS XML. Though PaperToHTML already has the ability to process LaTeX and JATS XML documents, further study is necessary to determine whether these HTML parses are suitable for reading.

To more fully assess the benefits and flaws of our system, a longer testing period and broad collection of user feedback is needed. We hope to achieve this in future work through the public deployment of the PaperToHTML prototype system.

Lastly, PDFs have been repeatedly called out as being inaccessible, not only for screen readers, but broadly for reading, especially on mobile and other devices with small screen sizes [51] . Dissociating publishing from PDFs continues to be a good goal for the future. In recent years, alternative publication formats have risen in popularity, such as eLife's dual publication in PDF and HTML, 28 the interactive HTML papers at distill.pub, 29 or the ACM Digital Library's very own dual publication (PDF and HTML) process, 30 which is now available for many of the ACM's computing conferences and journals. We have no doubt that viable alternatives to PDF have and will arise, and encourage the community to explore these options when making publication decisions.

8. Conclusion

Based on our findings, most academic papers are inaccessible, and significant challenges remain for BLV researchers when interacting with and reading these papers. Though some improvements in accessibility have occurred over time, these changes may not be reflective of author or publisher actions directly, and should be interpreted using our proposed methodology, which considers the typesetting software used and the information content of alt-text. In the meantime,

we offer a potential solution for the millions of PDFs that have already been published and which still remain the dominant form of distribution for academic papers. Our system extracts the content of PDFs, tagging headings and objects and inferring reading order, producing a more navigable and accessible HTML document. Participants described our system as likely to "become the workflow" or "life-changing," indicating both a strong favorable response and particular need for these types of solutions.

We do not claim that PaperToHTML solves all (or even most) accessibility problems for BLV researchers, but it is a step in that direction. A more complete solution would require significant dialogue between all stakeholders and a potential revolution in the way in which scholars publish and distribute their research findings. While those conversation are happening, we encourage researchers to prioritize and address the existing challenges with whatever tools they have in their toolbox right now. We especially encourage others to take into account our findings on the needs and challenges of BLV researchers when designing and engineering new systems and tools for reading the scholarly literature.

A Trends In Typesetting Software Usage

Fig. 9. There is a strong correlation (𝑟 = 0.89, 𝑝 < 0.001, 95% CI shown) between the proportion of PDFs typeset using Microsoft Word and the mean normalized Total Compliance of papers by field of study. Fields such as Business, Philosophy, Sociology, Materials science, and Psychology use Microsoft Word around or over 20% of the time, and have correspondingly higher mean accessibility compliance. On the other end of the spectrum are fields like Mathematics, Physics, and Medicine, where Microsoft Word is rarely used, and which have very low levels of mean compliance.

In Figure 9 , we observe that usage of Microsoft Word is highly correlated with accessibility compliance. Here, we plot the proportion of Microsoft Word usage per field of study and the corresponding mean normalized Total Compliance rates for those fields. Higher rates of Microsoft Word usage are statistically correlated with higher mean normalized Total Compliance ( = 0.89, < 0.001).

Fig. 10. The proportion of PDFs typeset by the five most common typesetting software over time. Software such as Adobe InDesign, LaTeX, and Microsoft Word are increasing in popularity over time.

In Figure 10 Compliance over time, since these typesetting software are the most associated with higher accessibility compliance.

B Evaluation Forms

This section contains forms and documents used to evaluate the quality of HTML renders produced by our system.

B.1 Evaluation Instructions

Instructions for annotators are reproduced verbatim below.

Goal: Identify and quantify the prevalence of different parse issues in S2ORC parses to assess their suitability for accessibility purposes. This will help us decide whether S2ORC parses can help meet screen reacher accessibility needs.

Number of papers: 500 papers sampled across different domains of science You will be presented with a spreadsheet of scientific papers, each with a pair of links. One link goes to a PDF of the paper. One link goes to an HTML representation of the same article. For each pair of links, we would like to know how faithfully the HTML representation captures the information on the PDF.

Instructions:

1. Open the two links side by side.

2. If the two links do not seem to correspond to the same paper, STOP. Make a note in the spreadsheet and SKIP.

3. If the PDF shows a paper that is not suitable, STOP. Make a note in the spreadsheet and SKIP. Nonsuitable may include:

-It is not a scientific paper.

-It is spam or a fake paper.

-It is slides, a poster, or other such non-paper document.

-It is just an abstract.

-It is a series of articles (e.g. conference proceedings, journal issue etc).

-It is a book.

-It is supplementary material;

note: some supplementary material is solely made up of figure or images.

-Something else that makes you pause. If you're not sure, SKIP it.

4. Copy the paper identifier corresponding to this paper into the first question on this form. Please make sure the identifier matches the paper you are evaluating.

5.

Answer each of the questions in this form as best as you can, treating the PDF as gold. There is no need to review every word or line of text. We are just trying to get an overall assessment of parse quality. For any question that asks for a number, enter '0' if there are no obvious problems with those extractions.

6. Submit the form and mark the row in the spreadsheet as complete.

Note: Display equations (those that are in their own paragraph) are currently not preserved in S2ORC, so we ask the annotator to ignore issues around missing display equations. Inline equations (those that are inline within a paragraph) are converted to token streams, which may not be faithful to the original PDF (e.g. fractions may not be preserved). The annotator can provide a description of issues around equation parsing when there is a notable issue.

B.2 Evaluation Questions

Questions asked in the evaluation form are reproduced in Table 11 .

B.3 Quality Rubric

The quality rubric for the final question in the evaluation form is given in Table 12 . This rating attempts to capture the overall readability and usability of the HTML render. Three authors discussed and converged upon this rubric following initial pilot annotations.

C Evaluation Results

Raw counts for each type of error detected during the evaluation of HTML renders are provided in Table 13 . The overall quality score split by field of study is shown in Table 14 .

Table 14. Distribution of overall quality scores for readability, split by field of study. Corresponds to distributions shown in Figure 11.
Overall Readability Number of papers Good Okay Bad
All papers 385 210 122 53
Art 13 6 1 6
Biology 23 12 7 4
Business 14 6 2 6
Chemistry 19 12 5 2
Computer science 21 10 7 4
Economics 20 6 8 6
Engineering 23 15 7 1
Environmental science 18 7 8 3
Geography 17 9 6 2
Geology 21 12 8 1
History 7 5 1 1
Materials science 24 15 8 1
Mathematics 25 13 8 4
Medicine 26 14 12 0
Other 8 6 2 0
Philosophy 12 7 5 0
Physics 39 25 10 4
Political science 13 6 6 1
Psychology 22 11 7 4
Sociology 20 13 4 3

D Html Readability By Field Of Study

Fig. 11. Overall readability results as proportion of total split by field of study, sorted by the percentage of papers with no major problems. The number of documents analyzed in each field is given, ranging from N=7 (History) to N=39 (Physics). The fields of study with the worst parse quality (Economics, Environmental science, Business, Art, and Political science) tend to be closer to the humanities, and may be due to the under-representation of papers from these fields in the data used to train the PDF parsers we use in our extraction pipeline. Corresponding numbers are provided in Table 14 in Appendix C.

In Figure 11 , we show the breakdown of overall readability by field of study, plotting the proportion of papers per field that are classified as having no major problems, some problems, and lots of problems impacting readability. Many fields have similar distributions compared to the overall evaluation set. However, we note that some fields such as Art, Business, Economics, and Environmental science to some degree, have significantly lower quality extraction results.

We posit that this may be due to biases in our PDF extraction pipeline. Some of the machine learning modules we use are primarily trained on paper data from the biomedical and Computer Science domains, where large scale labeled PDF extraction datasets can be found. Humanities-adjacent fields like Art and Business have very different publication norms, and the different layouts and content of papers and documents in these fields may provide additional challenges to our system, resulting in lower quality extraction and rendering.

E Association Between Extraction Errors And Html Readability

To investigate the possibility of identifying paper extractions with major problems, we fit a Logistic Regression classifier using element specific evaluation results as input features, and whether or not a paper has major problems as the target class for classification. Element specific questions are converted into 43 binary input variables; for example, the title element is mapped to three binary variables, whether the title is extracted correctly (title_yes), extracted partially (title_partially), or extracted incorrectly (title_no). We collapse the targets into two binary classes, 1 if the paper has major problems, and 0 if it has no major problems or some problems. The classifier is trained using 5-fold cross validation, with balanced class weights, and achieves a mean accuracy of 0.69, and area under the ROC of 0.65. This performance is not particularly notable or good; the labeled training sample is small, and due to the complexity of Fig. 11 . Overall readability results as proportion of total split by field of study, sorted by the percentage of papers with no major problems. The number of documents analyzed in each field is given, ranging from N=7 (History) to N=39 (Physics). The fields of study with the worst parse quality (Economics, Environmental science, Business, Art, and Political science) tend to be closer to the humanities, and may be due to the under-representation of papers from these fields in the data used to train the PDF parsers we use in our extraction pipeline. Corresponding numbers are provided in Table 14 in Appendix C. what makes a document problematic to read, we did not expect there to be a clear way to predict extractions with major problems based on a small number of element-level features. Something we aim to explore more in the future is whether the raw tokens on the PDF or publisher metadata can be leveraged to better predict when our extractive parse has failed. The most predictive feature is when abstracts are extracted incorrectly. Given the prevalence of abstracts in various literature databases, abstract quality could be easily assessed through external verification. In other words, if the abstract we extract is different from the abstract found for the same paper in other databases on in the publisher metadata, perhaps we can avoid surfacing this paper. However, the distribution of weights among various other element-level features suggests that this feature alone would be insufficient, and that the contributions of these various features are complex, denying us an easy way of identifying paper parses with major problems.

F User Study Materials

Documents used for the user study are provided in this Appendix.

F.1 Recruitment Email

The following email was sent and forwarded to several mailing lists to recruit participants.

The anon_corpus Research Team at the anonymized is conducting an experiment to evaluate the screen reader accessibility of scientific papers.

We are looking for participants who are age 18 or older, who identify as blind or low vision, and who have experience using screen readers to interact with scientific papers. If you are interested in participating, please complete the following form to determine eligibility: link Participation in this study is entirely voluntary. If you do decide to participate, your individual data will be kept strictly confidential and will be stored without personal identifiers.

The study involves an informational interview to better understand screen reader needs around scientific papers. Each participant will also be asked to interact with papers on a web interface developed by the team. The study will take approximately 75 minutes, and participants will receive a $150 Amazon gift card for their time.

Location: Online (Zoom)

Please contact the authors if you have any questions or concerns about this study. Thank you in advance for your time! Please help us spread the word by forwarding as appropriate.

F.2 Pre-Interview Questionnaire

Prior to each user study interview, the participant was asked to complete the following form:

Share 3 to 5 scientific papers that are difficult to read due to accessibility issues Thank you for volunteering to take part in this study! Please take a few minutes to supply us with some subject keywords you are interested in, and a list of 3 to 5 scientific papers you have found difficult to read due to accessibility issues. This would help us better plan the study based on your experience. -On a scale of 1 to 5, how easy or difficult was it for you to read this paper?

(1 = very easy, 5 = very difficult)

-Briefly describe why you chose the rating 4-7. Repeat 3.

F.3 Interview Questions

The following discussion guide is used to provide structure for user interviews.

Phase I -Warmup:

• Can you tell us a little bit about yourself? (Background, what kind of research do you do)

• Tell us about how you normally read papers -What is your workflow like? What tools do you use?

-Do you usually read PDFs directly or do you read papers in other ways? • If you need to read a paper and it is not accessible, what do you do now?

-How long does the process take?

-How often is it successful?

• Can you give a few examples of the main challenges you face when reading papers? (For example, are there certain features or attributes of papers that make them particularly difficult to read?)

• In your opinion, are there any resources that provide papers that are easier to read by screen readers? (For example, any journals, conferences, or search engines?)

• Overall, how do you feel about your current experience of reading papers?

Phase I -Current workflow:

• Based on the list of papers you provided, walk us through how you would read the paper [abc] . Use the screen reader of your choice, and any additional tools or extensions that are part of your usual process.

• Instructions: Please share your whole screen, think aloud and walk me through your thinking process

• What kind of information were you looking for, and how did you explore the page to find the information?

• On a scale of 1 to 5, how easy or difficult was it to read this paper with the tools, and why? (1 = very easy, 5 = very difficult)

• If you could change anything, how could this best meet your needs?

Phase II:

• We are currently working on an experimental prototype to make papers more easily read by screen readers.

Please take a minute to read the about page first: link • Based on the list of papers you provided, walk us through how you would read the paper [abc] using this HTML render. You can also use the screen reader of your choice, and any additional tools or extensions that are part of your usual process.

• Instructions:

-We are working with prototypes so not everything works -Please think aloud and walk me through your thinking process -Feel free to provide as many feedback as you can, good or bad • Please take a few more minutes to explore the other parts of this prototype. (e.g. References)

• On a scale of 1 to 5, how easy or difficult was it to read this paper with the tools, and why? (1 = very easy, 5 = very difficult)

Phase III:

• On a scale of 1 to 5, how likely are you to use the HTML render, if it is available to you in the future? (1 = very unlikely, 5 = very likely)

• Which features do you consider to be most helpful?

• Is there anything it would need to have, or change to convince you to use it?

• How do you envision yourself using this tool? How might it fit into your workflow? (For example, would it be an additional extension that is part of your usual process?)

• If you could search for papers and view them in this format, what do you think?

• If you could upload any PDF and create an HTML page like this, what do you think? (Would that be helpful for you, or something you might use, why or why not?)

• Are you aware of any other tools that display papers in any way besides PDF?

• Do you have any additional feedback about the HTML render or anything else that you would like to share?

• Thank you Answer Questions y/p/n Is the TITLE correctly extracted? text Comment (clarify if answer is "partially" or "no") y/p/n Are the AUTHOR(S) correctly extracted? text Comment (clarify if answer is "partially" or "no") y/p/n Is the ABSTRACT correctly extracted? text Comment (clarify if answer is "partially" or "no") y/n Does this paper contain a substantial number of math EQUATIONS (more than 5 display equations)? Table 13 . Assessment count for all evaluation paper elements. Corresponds to distributions shown in Figure 8 .

Overall Readability

Number Table 14 . Distribution of overall quality scores for readability, split by field of study. Corresponds to distributions shown in Figure 11 .

Class Precision Recall F1-Score Support

Table 15. Precision, recall, and F1-scores for classification. The classifier does not perform well at identifying papers with major problems from element-based features (F1 = 0.32).
Class Precision Recall F1-score Support
No major problems / Some problems 0.91 0.71 0.80 332
Major problems 0.23 0.55 0.32 53

No major problems / Some problems 0.91 0.71 0.80 332 Major problems 0.23 0.55 0.32 53 Table 15 . Precision, recall, and F1-scores for classification. The classifier does not perform well at identifying papers with major problems from element-based features (F1 = 0.32).

SECTION

https://www.w3.org/TR/wcag-3.0/ 2 https://www.w3.org/WAI/standards-guidelines/aria/ 3 https://www.acm.org/publications/authors/submissions

https://github.com/allenai/s2orc-doc2json 13 https://github.com/allenai/deepfigures-open/

See submission and accessibility guidelines for ASSETS (https://assets19.sigaccess.org/creating_accessible_pdfs.html), CHI (https://chi2021.acm.org/ for-authors/presenting/papers/guide-to-an-accessible-submission), W4A (http://www.w4a.info/2021/submissions/technical-papers/) and DSAI (http://dsai.ws/2020/submissions/).

https://www.microsoft.com/en-us/accessibility/microsoft-365

https://zoom.us/

http://www.inftyreader.org/

https://api.semanticscholar.org/23 Available at https://github.com/allenai/s2orc-doc2json

https://api.semanticscholar.org/

Several criteria have relatively high Agreement scores and low Cohen's Kappa. This is a known limitation of Kappa for skewed data[18]. We report multiple agreement metrics to provide a more complete picture of inter-rater reliability. For "Overall score, " we acknowledge that inter-rater agreement is modest.

http://www.arxiv-vanity.com/