Improving the Accessibility of Scientific Documents: Current State, User Needs, and a System Solution to Enhance Scientific PDF Accessibility for Blind and Low Vision Users
Authors
Abstract
The majority of scientific papers are distributed as PDFs, a format that poses challenges for accessibility, especially for blind and low vision (BLV) readers. We characterize the scope of this problem by assessing the accessibility of 11,397 PDFs published 2010--2019 sampled across various fields of study, finding that only 2.4% of these PDFs satisfy all of our defined accessibility criteria. We introduce the SciA11y system to offset some of the issues around inaccessibility. SciA11y incorporates several machine learning models to extract the content of scientific PDFs and render this content as accessible HTML, with added novel navigational features to support screen reader users. An intrinsic evaluation of extraction quality indicates that the majority of HTML renders (87%) produced by our system have no or only some readability issues. We perform a qualitative user study to understand the needs of BLV researchers when reading papers, and to assess whether the SciA11y system could address these needs. We summarize our user study findings into a set of five design recommendations for accessible scientific reader systems. User response to SciA11y was positive, with all users saying they would be likely to use the system in the future, and some stating that the system, if available, would become their primary workflow. We successfully produce HTML renders for over 12M papers, of which an open access subset of 1.5M are available for browsing at https://scia11y.org/.
1. Introduction
Scientific literature is most commonly available in the form of PDFs, which pose challenges for accessibility [6, 34].
When researchers, students, and other individuals who are blind or low vision (BLV) interact with scientific PDFs through screen readers, the availability of document structure tags, labeled reading order, labeled headers, and image alt-text is necessary to facilitate these interactions. However, these features must be painstakingly added by authors using proprietary software tools, and as a result, are often missing from papers. Low vision or dyslexic readers who interact with PDFs through screen magnification or text-to-speech may also find the complexity of certain academic paper PDF formats challenging; e.g., non-linear layout can interrupt the flow of text in a magnifying tool. Inaccessible paper PDFs can lead to high cognitive overload, frustration, and abandonment of reading for BLV readers.
Unfortunately, we find that the majority of scientific PDFs lack basic accessibility features. We estimate, based on a sample of 11,397 PDFs from multiple fields of study, that only around 2.4% of paper PDFs released in the last decade satisfy all of the aforementioned accessibility requirements. Accessibility challenges for academic PDFs are largely due to three factors: (1) the complexity of the PDF file format, which makes it less amenable to certain accessibility features, (2) the dearth of tools, especially non-proprietary tools, for creating accessible PDFs, and (3) the dependency on volunteerism from the community with minimal support or enforcement [6]. The intent of the PDF file format is to support faithful visual representation of a document for printing, a goal that is inherently divergent from that of document representation for the purposes of accessibility. Though some professional organizations like the Association for Computing Machinery (ACM) have encouraged PDF accessibility through standards and writing guidelines, uptake among academic publishers and disciplines more broadly has been limited.
While policy changes help, the fact remains that most academic PDFs produced today, and historically, are inaccessible, yet remain the dominant way to read those papers. A long-range solution will necessitate buy-in from multiple stakeholders: publishers, authors, readers, technologists, granting agencies, and the like. But in the interim, there are technological solutions that can be offered as a sort of "band-aid" to the problem. We use this paper to offer an in-depth qualitative and quantitative description of the problem as it stands, and to introduce one such technological solution:
the SciA11y system that automatically extracts semantic information from paper PDFs and re-renders this content in the form of an accessible HTML document. Though the process is imperfect and can introduce errors, we demonstrate the ability of the rendered HTMLs to reduce cognitive load and facilitate in-paper navigation and interactions for BLV users.
The goals and contributions of this paper are three-fold:
(1) We characterize the state of academic-paper PDF accessibility by estimating the degree of adherence to accessibility criteria for papers published in the last decade (2010-2019), and describe correlations between year, field of study, PDF typesetting software, and PDF accessibility.
(2) We propose an automated approach for extracting the content of academic PDFs and displaying this content in a more accessible HTML document format. We build a prototype that re-renders 12 million PDFs in HTML, and describe the design decisions, features, and quality of the renders (assessed as faithfulness to the source PDF). We perform expert grading of the rendered HTML and report an error analysis. A demo of our system is available at scia11y.org, which makes available 1.5M HTML renders of open access PDFs.
(3) We conduct an exploratory user study with six BLV scholars to better understand the challenges they experience when reading academic papers and how our proposed tool might augment their current workflow. During the study, we ask users to interact with the prototype and offer feedback for its improvement. We perform open coding of interviews to identify existing reading challenges and coping mechanisms, as well as positive and negative responses to prototype features. We summarize the findings of this user study into a set of design recommendations.

Fig. 1. A schematic for creating the SciA11y HTML render from a paper PDF. Starting with the raw two-column PDF on the left, S2ORC [24] is used to extract the title, authors, abstract, section headers, body text, and references. S2ORC also identifies links between inline citations and references, as well as references to figure and table objects. DeepFigures [43] is used to extract figures and tables, along with their captions. The outputs of these two models are merged with metadata from the Semantic Scholar API. Heuristics are used to construct a table of contents, to insert figures and tables in the appropriate places in the text, and to repair broken URLs. We add HTML tags as illustrated (header tags for sections, paragraph tags for body text, and figure tags for figures and tables); highlighted components (the table of contents and links in references) are not in the PDF and are novel navigational features that we introduce to the HTML render. An example HTML render of parts of a paper is shown to the right (the actual render is single column; it is split here for presentation).
Our analysis reveals that PDF accessibility adherence is low across all fields of study. Only 2.4% of the PDFs we assess demonstrate full compliance with all five of our accessibility criteria. Though compliance for several criteria seems to be increasing over time, author awareness of and contribution to accessibility remains low: Alt-text has the lowest compliance of the five criteria, at between 5-10% (Alt-text is the only criterion of the five that requires author intervention in all cases using current tools). We also find that typesetting software is strongly associated with accessibility compliance, with LaTeX and publishing software like Arbortext APP producing low-compliance PDFs, while Microsoft Word is generally associated with higher compliance.
To offset the reading challenges of inaccessible papers for BLV researchers, we propose and test the SciA11y system for rendering academic PDFs into accessible HTML documents. As shown in Figure 1, our prototype integrates several machine learning text and vision models to extract the structure and semantic content of papers. The content is represented as an HTML document with headings and links for navigation, figures and tables, as well as other novel features to assist in document structure understanding. Our evaluation of the SciA11y system identifies common classes of extraction problems, and finds that though many papers exhibit some extraction errors, the majority (55%) have no major problems that impact readability, and another 32% have only some problems that impact readability. Through our user study, we identify numerous challenges faced by BLV users when reading paper PDFs, including some that affect the whole document or limit navigation, and many that affect the ability of the reader to understand text or various elements of a paper like math content or tables. Responses to SciA11y were positive; participants especially liked navigation features such as headings, the table of contents, and bidirectional links between inline citations and references. Of the extraction errors in SciA11y, missed or incorrectly extracted headings were the most problematic, as these impact the user's ability to navigate between sections and fully trust the system. All users reported being likely to use the system in the future. When asked how the system might be integrated into their workflow, one participant replied, "I think it would become the workflow." Another participant said, "for unaccessible PDFs, this is life-changing."
We condense these findings into a set of recommendations for designing and engineering accessible reading systems (Section 6.3). Most importantly, documents should be structured to match a reader's mental model, objects should be properly tagged, and care should be taken to reduce the reader's cognitive load and increase trust in the system.
Features that emulate the external memory that visual layout provides to sighted users can be especially beneficial.
This paper is organized as follows. Following a description of related work in Section 2, we first provide a metascientific analysis of the current state of academic PDF accessibility in Section 3. In Section 4, we document our pipeline for converting PDF to HTML and describe the SciA11y prototype for rendering papers. An evaluation of HTML render quality and faithfulness is provided in Section 5. Section 6 describes our user study and findings. We recognize that no PDF extraction system is perfect, and many open research challenges remain in improving these systems. However, based on our findings, we believe SciA11y can dramatically improve screen reader navigation of most papers compared to PDFs, and is well-positioned to assist BLV researchers with many of their most common reading use cases. Our hope is that a system such as SciA11y can improve BLV researcher access to the content of academic papers, and that these design recommendations can be leveraged by others to create better, more faithful, and ultimately more usable tools and systems for scholars in the BLV community.
2. Related Work
Accessibility is an essential component of computing, which aims to make technology broadly accessible to as many users as possible, including those with differing sets of abilities. Improving usability and accessibility falls to the community: better understanding the needs of users with differing abilities, and designing technologies that play to this spectrum of abilities [48]. In computing, significant strides have been made to increase the accessibility of web content.
For example, various versions of the Web Content Accessibility Guidelines (WCAG) [8, 10], the in-progress working draft for WCAG 3.0, and standards such as ARIA from the W3C's Web Accessibility Initiative (WAI) have been released and used to guide web accessibility design and implementation. Similarly, positive steps have been made to improve the accessibility of user interfaces and user experience [5, 35, 36, 46], as well as various types of media content [19, 29, 32].
We take inspiration from accessibility design principles in our effort to make research publications more accessible to users who are blind and low vision. Blindness and low vision are some of the most common forms of disability, affecting an estimated 3-10% of Americans depending on how visual impairment is defined [18]. BLV researchers also make up a representative sample of researchers in the United States and worldwide. A recent Nature editorial pushes the scientific community to better support researchers with visual impairments [41], since existing tools and resources can be limited. There are many inherent accessibility challenges to performing research. In this paper, we engage with one of these challenges that affects all domains of study: accessing and reading the content of academic publications.
BLV users interact with papers using screen readers, braille displays, text-to-speech, and other assistive tools. A WebAIM survey of screen reader users (https://webaim.org/projects/screenreadersurvey8/) found that the vast majority (75.1%) of respondents indicate that PDF documents are very or somewhat likely to pose significant accessibility issues. Most papers are published in PDF, which is inherently inaccessible, due in large part to its conflation of visual layout information with semantic content [6, 34]. Bigham et al. [6] describe the historical reasons we use PDF as the standard document format for scientific publications, as well as the barriers the format itself presents to accessibility. Prior work on scientific accessibility has made recommendations for how to make PDFs more accessible [11, 38], including greater awareness of what constitutes an accessible PDF and better tooling for generating accessible PDFs. Some work has focused on addressing components of paper accessibility, such as the correct way for screen readers to interpret and read mathematical equations [1, 4, 16, 17, 26, 44, 45], how to describe charts and figures [12, 13, 14], how to automatically generate figure captions [9, 37], and how to automatically classify the content of figures [21]. Other work applicable to all types of PDF documents aims to improve automatic text and layout detection for scanned documents [31] and extract table content [15, 39]. In this work, we focus on the issue of representing overall document structure, and navigation within that structure. Being able to quickly navigate the contents of a paper through skimming and scanning is an essential reading technique [28], one that is currently under-supported when reading PDF documents with a screen reader.
There also exists a variety of automatic and manual tools that assess and fix accessibility compliance issues in PDFs, including the Adobe Acrobat Pro Accessibility Checker (https://www.adobe.com/accessibility/products/acrobat/using-acrobat-pro-accessibility-checker.html), CommonLook (https://monsido.com/monsido-commonlook-partnership), ABBYY FineReader (https://pdf.abbyy.com/), PAVE (https://pave-pdf.org/faq.html), and PDFA Inspector (https://github.com/pdfae/PDFAInspector). To our knowledge, PAVE and PDFA Inspector are the only non-proprietary, open-source tools for this purpose. Based on our experiences, however, all of these tools require some degree of human intervention to properly tag a scientific document, and tagging and fixing must be performed for each new version of a PDF, regardless of how minor the change may be.
Prior work | PDFs analyzed | Venues | Year | Accessibility checker |
---|---|---|---|---|
Brady et al. [7] | 1811 | CHI, ASSETS and W4A | 2011-2014 | PDFA Inspector |
Lazar et al. [23] | 465 + 32 | CHI and ASSETS | 2014-2015 | Adobe Acrobat Action Wizard |
Ribera et al. [40] | 59 | DSAI | 2016 | Adobe PDF Accessibility Checker 2.0 |
Nganji [33] | 200 | Disability & Society, Journal of Developmental and Physical Disabilities, Journal of Learning Disabilities, and Research in Developmental Disabilities | 2009-2013 | Adobe PDF Accessibility Checker 1.3 |
Our analysis | 11,397 | Venues across various fields of study | 2010-2019 | Adobe Acrobat Accessibility Plug-in Version 21.001.20145 |

Table 1. Prior work has investigated PDF accessibility for papers published in specific venues such as CHI, ASSETS, W4A, DSAI, or various disability journals. Several of these analyses were conducted manually and were limited to a small number of papers, while the most thorough analysis covered CHI and ASSETS, two conference venues focused on accessibility and HCI. Our study expands on this prior work to investigate accessibility over 11,397 PDFs sampled from across different fields of study.
Guidelines and policy changes have been introduced in the past decade to ameliorate some of the issues around scientific PDF accessibility. Some conferences, such as The ACM CHI Virtual Conference on Human Factors in Computing Systems (CHI) and The ACM SIGACCESS Conference on Computers and Accessibility (ASSETS), have released guidelines for creating accessible submissions (see http://chi2019.acm.org/authors/papers/guide-to-an-accessible-submission/ and https://assets19.sigaccess.org/creating_accessible_pdfs.html). The ACM Digital Library (https://dl.acm.org/) provides some publications in HTML format, which is easier to make accessible than PDF [20]. Ribera et al. [40] conducted a case study on DSAI 2016 (Software Development and Technologies for Enhancing Accessibility and Fighting Info-exclusion). The authors were responsible for creating accessible proceedings for DSAI and identified barriers to doing so, including lack of sufficient tooling and lack of awareness of accessibility. The authors recommended creating a new role in the organizing committee dedicated to accessible publishing. These policy changes have led to improvements in localized communities, but have not been widely adopted by all academic publishers and conference organizers. Table 1 lists prior studies that have analyzed PDF accessibility of academic papers, and shows how our study compares.
Prior work has primarily focused on papers published in Human-Computer Interaction and related fields, specific to certain publication venues, while our analysis tries to quantify paper accessibility more broadly. Brady et al. [7] quantified the accessibility of 1,811 papers from CHI 2010-2016, ASSETS 2014, and W4A, assessing the presence of
document tags, headers, and language. They found that compliance improved over time as a response to conference organizers offering to make papers accessible as a service to any author upon request. Lazar et al. [23] conducted a study quantifying accessibility compliance at CHI from 2010 to 2016 as well as ASSETS 2015, confirming the results of Brady et al. [7]. They found that across 5 accessibility criteria, the rate of compliance was less than 30% for CHI papers in each of the 7 years that were studied. The study also analyzed papers from ASSETS 2015, an ACM conference explicitly focused on accessibility, and found that those papers had significantly higher rates of compliance, with over 90% of the papers being tagged for correct reading order and no criterion having less than 50% compliance. This finding indicates that community buy-in is an important contributor to paper accessibility. Nganji [33] conducted a study of 200 PDFs of papers published in four disability studies journals, finding that accessibility compliance was between 15-30% for the four journals analyzed, with some publishers having higher adherence than others. To date, no large-scale analysis of scientific PDF accessibility has been conducted outside of disability studies and HCI, due in part to the challenge of scaling such an analysis. We believe such an analysis is useful for establishing a baseline and characterizing routes for future improvement. Consequently, as part of this work, we conduct an analysis of scientific PDF accessibility across various fields of study, and report our findings relative to prior work.
3. Analysis of Academic PDF Accessibility
To capture and better characterize the scope and depth of the problems around academic PDF accessibility, we perform a broad meta-scientific analysis. We aim to measure the extent of the problem (e.g., what proportion of papers have accessible PDFs?), whether the state of PDF accessibility is improving over time (e.g., are papers published in 2019 more likely to be accessible than those published in 2010?), and whether the typesetting software used to create a paper is associated with the accessibility of its PDF (e.g., are papers created using Microsoft Word more or less accessible than papers created with other software?).
Prior studies on PDF accessibility have been limited to papers from specific publication venues such as CHI, ASSETS, W4A, DSAI, and journals in disability research. Notably, these venues are closer to the field of accessible computing, and are consequently more invested in accessibility (see, e.g., the submission and accessibility guidelines for ASSETS: https://assets19.sigaccess.org/creating_accessible_pdfs.html, CHI: https://chi2021.acm.org/for-authors/presenting/papers/guide-to-an-accessible-submission, W4A: http://www.w4a.info/2021/submissions/technical-papers/, and DSAI: http://dsai.ws/2020/submissions/). We expand upon this work by investigating accessibility trends across various fields of study and publication venues. Our goal is to characterize the overall state of paper PDF accessibility and identify ongoing challenges to accessibility.
3.1 Data & Methods
We sample PDFs from the Semantic Scholar literature corpus [3] for analysis. We construct a dataset of papers by sampling PDFs published in the years 2010-2019, stratified across the 19 top-level fields of study defined by the Microsoft Academic Graph [42, 47]. Examples of fields include Biology, Computer Science, Physics, Sociology, and others. This dataset allows us to investigate the overall state of PDF accessibility for academic papers, and to study the relationship between field of study and PDF accessibility.
For each field of study, we sample papers from the top venues by total citation count, along with some documents without venue information, which include things like books and book chapters. The resulting papers come from 1058 unique publication venues. We analyze the PDFs in our dataset using the Adobe Acrobat Pro DC PDF accessibility checker. Though this checker is proprietary and requires a paid license, it is the most comprehensive accessibility checker available and has been used in prior work on accessibility [23, 33, 40]. In contrast, non-proprietary PDF parsers such as PDFBox do not consistently extract accessibility criteria from sample PDFs, even when the criteria are met. We also prefer Adobe's checker to PDFA Inspector, used by Brady et al. [7], because PDFA Inspector only analyzes three criteria, whereas we are interested in other accessibility attributes as well, like the presence of alt-text.
For each PDF, the Adobe accessibility checker generates a report that includes whether or not the PDF passes or fails tests for certain accessibility features, such as the inclusion of figure alt-text or properly tagged headings for navigation.
Criterion | Lazar et al. [23] (CHI) | Our checker (CHI) | Full dataset |
---|---|---|---|
Alt-text | 3.6% | 4.0% | 7.5% |
Table headers | 0.7% | 1.0% | 13.3% |
Tagged PDF | 6.3% | 7.4% | 13.4% |
Default language | 2.3% | 3.0% | 17.2% |
Tab order | 0.3% | 1.0% | 9.3% |
Adobe-5 Compliance | - | - | 2.4% |

Table 2. We reproduce the analysis conducted by Lazar et al. [23] on PDFs of papers published in CHI, showing the percentage of papers that satisfy each of the five accessibility criteria. We find similar compliance rates, indicating that our automated accessibility checker pipeline is comparable to previous analysis methods. We also show the percentage of papers in our full dataset of 11,397 PDFs that satisfy each criterion, along with the percent that satisfy Adobe-5 Compliance.
Because there is no API or standalone application for the Adobe accessibility checker, it can only be accessed through the user interface of a licensed version of Adobe Acrobat Pro. We develop an AppleScript program that enables us to automatically process papers through the Adobe checker. Our program requires a dedicated computer running MacOS and a licensed version of Adobe Acrobat Pro. It takes 10 seconds on average to download and process each PDF, which
enables us to scale up our analysis to tens of thousands of papers. Accessibility reports from the checker are saved in HTML format for subsequent analysis.
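Because each report is saved as HTML, scoring can be automated with a small post-processing script. The sketch below is illustrative only: it assumes each rule in a saved report appears as a "rule - status" run of text, which is an assumption about the report layout rather than Adobe's documented format.

```python
# Hypothetical post-processing of a saved Adobe accessibility report:
# strip HTML tags, then scan for rule names followed by a status.
# The "rule - status" text layout is an assumption, not Adobe's documented format.
import re
from pathlib import Path

STATUSES = ("Passed", "Failed", "Needs manual check")
RULE_RE = re.compile(r"([A-Za-z][\w /-]*?)\s*[-:]\s*(" + "|".join(STATUSES) + r")")

def parse_report(path: str) -> dict:
    """Map each accessibility rule found in one report to its status."""
    text = re.sub(r"<[^>]+>", " ", Path(path).read_text(errors="ignore"))
    return {m.group(1).strip(): m.group(2) for m in RULE_RE.finditer(text)}

# Usage: statuses = parse_report("reports/paper_0001.html")
```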
Each report contains a total of 32 accessibility criteria, marked as "Passed," "Failed," or "Needs manual check." Following Lazar et al. [23], we analyze the following five criteria:
• Alt-text: Figures have alternate text.
• Table headers: Tables have headers.
• Tagged PDF: The document is tagged to specify the correct reading order.
• Default language: The document has a specified reading language.
• Tab order: The document is tagged with correct reading order, used for navigation with the tab key.
For our analysis, we also report Total Compliance, which refers to the sum number of accessibility criteria met (e.g. if a paper has met 3 out of the 5 criteria we specify, then Total Compliance is 3). In some cases, we report the Normalized Total Compliance, which is computed as the Total Compliance divided by 5, and can be interpreted as the proportion of the 5 criteria which are satisfied. We also report Adobe-5 Compliance, a binary value of whether a paper has met all 5 criteria we specify (1 if all 5 criteria are met, 0 if any are not met), and the rate of Adobe-5 Compliance for papers in our dataset.
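A minimal sketch of these three measures, assuming each paper's checker output has been reduced to five booleans (the key names here are illustrative, not identifiers from our pipeline):

```python
# Sketch of the three compliance measures defined above, computed from a
# per-paper dict of the five criteria (True means the Adobe check passed).
CRITERIA = ["alt_text", "table_headers", "tagged_pdf", "default_language", "tab_order"]

def total_compliance(checks: dict) -> int:
    return sum(checks[c] for c in CRITERIA)          # number of criteria met, 0-5

def normalized_total_compliance(checks: dict) -> float:
    return total_compliance(checks) / len(CRITERIA)  # proportion of criteria met

def adobe5_compliance(checks: dict) -> int:
    return int(total_compliance(checks) == len(CRITERIA))  # 1 only if all five met

paper = {"alt_text": False, "table_headers": True, "tagged_pdf": True,
         "default_language": True, "tab_order": False}
assert total_compliance(paper) == 3
assert normalized_total_compliance(paper) == 0.6
assert adobe5_compliance(paper) == 0
```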
In addition to running the accessibility checker, we also extract metadata for each PDF, focusing on metadata related to the PDF creation process. PDF metadata are generated by the software used to create each file, and we analyze the associations between different PDF creation software and the accessibility of the resulting PDF document. Our hypothesis is that some classes of software (such as Microsoft Word) produce more accessible PDFs.
3.2 Accuracy of Our Automated Accessibility Checker
Previous work employed different versions of the Adobe Accessibility Checker to generate paper accessibility reports.
To confirm the accuracy of our checker, as well as the automated script we create to perform the analysis, we run our checker on CHI 2010-2016 papers to reproduce the results of Lazar et al. [23]. We identify CHI papers using DOIs reported by the ACM, and resolve these to PDFs in the Semantic Scholar corpus [3]. We identify 3,248 CHI papers in the corpus, and generate accessibility reports for these using our automated checker. Our results show similar rates of compliance compared to what was measured by Lazar et al. [23] (see Table 2 for results). This indicates that our automated accessibility checker produces comparable results to previous studies.
3.3 Proportion of Papers with Accessible PDFs
Around 1.6% of PDFs we attempted to process failed in the Adobe checker (i.e., we could not generate an accessibility report). The accessibility checker most commonly fails because the PDF file is password protected, or the PDF file is corrupt. In both of these cases, the PDF is inaccessible to the user. We exclude these PDFs from subsequent analysis.
Accessibility compliance over all papers is low; per-criterion compliance rates over the full dataset are shown in Table 2. Alt-text has the lowest compliance: only 854 PDFs (7.5%) in the whole dataset have alt-text for figures. This is intuitive, as Alt-text is the only criterion that always requires author input to achieve, while the other four criteria can be derived from the document or automatically inferred, depending on the software used to generate the PDF.
As shown in Figure 3 , all fields have an Adobe-5 Compliance of less than 7%. The fields with the highest rates of compliance are Philosophy (6.3%), Art (6.2%), Business (5.7%), Psychology (5.7%), and History (5.3%) while the fields with the lowest rates of compliance are Geology (0.2%), Mathematics (0.3%), and Biology (0.6%). Fields associated with higher compliance tend to be closer to the humanities, and those with lower levels of compliance tend to be science and engineering fields. The prevalence of different document editing and typesetting software by field of study may explain some of these differences, and we explore these associations in Section 3.5.
3.4 Trends in Paper Accessibility Over Time
Typesetting Software | Count (%) |
---|---|
Adobe InDesign | 1591 (14.0%) |
LaTeX | 1431 (12.6%) |
Arbortext APP | 1374 (12.1%) |
Microsoft Word | 1318 (11.6%) |
Printer | 1021 (9.0%) |
Table 3. Count of papers per typesetting software. "Other" includes PDFs created with an additional 24 unique software programs, each with counts of less than 350, as well as those created with an unknown typesetting software.

We show changes in compliance for all fields of study over time in Figure 4. Compliance with the Default Language criterion rose from around 10% in 2010 to more than 25% in 2019. This may be due to changes in PDF generation defaults in various typesetting software. Though this improvement is good, Default Language is the easiest of the five criteria to bring into compliance, and arguably the least valuable in terms of improving the accessible reading experience. The criterion with the lowest rate of compliance is Alt-text, which has remained stable between 5-10% and has been lower in recent years.
Since Alt-text is the only criterion of the five which always necessitates author intervention, we believe this is a sign that authors have not become more attuned to accessibility needs, and that at least some of the improvements we see over time can be attributed to typesetting software or publisher-level changes.
3.5 Association Between Typesetting Software and Paper Accessibility
Typesetting software is extracted from PDF metadata and manually canonicalized. We extract values for three metadata fields:
xmp:CreatorTool, pdf:docinfo:creator_tool, and producer. All unique PDF creation tools associated with more than 20 PDFs in our dataset are reviewed and mapped to a canonical typesetting software. For example, the values (latex, pdftex, tex live, tex, vtex pdf, xetex) are mapped to the LaTeX cluster, while the values (microsoft, for word, word) and other variants are mapped to the Microsoft Word cluster. We realize that not all Microsoft Word versions, LaTeX distributions, or other versions of typesetting software within a cluster are equal, but this normalization allows us to generalize over these software clusters. For analysis, we compare the five most commonly observed typesetting software clusters in our dataset, grouping all others into a cluster called Other.
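The sketch below illustrates this kind of substring-based canonicalization; the cluster lists are abbreviated examples drawn from the mappings described above, not our full mapping:

```python
# Illustrative canonicalization of raw PDF creator strings into software clusters.
# Order matters: more specific names are checked first ("arbortext" contains "tex").
RAW_TO_CLUSTER = {
    "Arbortext APP": ["arbortext"],
    "Adobe InDesign": ["indesign"],
    "Microsoft Word": ["microsoft", "for word", "word"],
    "LaTeX": ["latex", "pdftex", "tex live", "vtex pdf", "xetex", "tex"],
}

def canonicalize(creator_tool: str) -> str:
    value = creator_tool.lower()
    for cluster, substrings in RAW_TO_CLUSTER.items():
        if any(s in value for s in substrings):
            return cluster
    return "Other"  # the long tail of rare and unknown creators

assert canonicalize("pdfTeX-1.40.21") == "LaTeX"
assert canonicalize("Microsoft Word 2016") == "Microsoft Word"
```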
We report the distribution of typesetting software in Table 3. The most popular PDF creators are Adobe InDesign, LaTeX, Arbortext APP, Microsoft Word, and Printer. "Printer" refers to PDFs generated by a printer driver (by selecting "Print" → "Save as PDF" in most operating systems); unfortunately, creating a PDF through printing provides no indicator of what software was used to typeset the document, and is generally associated with very low accessibility compliance. The "Other" category aggregates papers created by all other clusters of typesetting software; each of these clusters is associated with less than 350 PDFs, i.e., the falloff is steep after the Printer cluster. For the following analysis, we present a comparison between the five most common PDF creator clusters. Figure 5 shows histograms of the Total Compliance score for PDFs in the five most common typesetting software clusters. While the vast majority of papers do not meet any accessibility criteria, it is clear that Microsoft Word produces the most accessible PDFs, followed by Adobe InDesign. To determine the significance of this difference, we compute the ANOVA and Kruskal-Wallis [22] statistics with the PDF typesetting software clusters as the sample groups and the Total Compliance as the measurements for the groups. We compute an ANOVA statistic of 2587.1 (p < 0.001); the Kruskal-Wallis test likewise indicates a significant difference in Total Compliance scores between the five most common PDF typesetting software clusters.
In Figure 6, we observe again that usage of Microsoft Word is highly correlated with accessibility compliance. Here, we plot the proportion of Microsoft Word usage per field of study against the corresponding mean normalized Total Compliance rates for those fields. Higher rates of Microsoft Word usage are statistically correlated with higher mean normalized Total Compliance (ρ = 0.89, p < 0.001).
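Tests of this kind can be reproduced with SciPy. The sketch below uses made-up per-cluster Total Compliance scores, and assumes a Spearman rank correlation for the field-level analysis (substitute `stats.pearsonr` if Pearson's r is intended):

```python
# Significance tests over Total Compliance scores, grouped by typesetting
# software cluster; the score lists here are hypothetical stand-ins for our data.
from scipy import stats

groups = {
    "Microsoft Word": [5, 3, 4, 2, 3],
    "Adobe InDesign": [2, 1, 3, 0, 1],
    "LaTeX":          [0, 0, 1, 0, 0],
}
f_stat, p_anova = stats.f_oneway(*groups.values())
h_stat, p_kw = stats.kruskal(*groups.values())
print(f"ANOVA F={f_stat:.1f} (p={p_anova:.3g}); Kruskal-Wallis H={h_stat:.1f} (p={p_kw:.3g})")

# Field-level association between Word usage and mean normalized compliance.
word_usage = [0.10, 0.25, 0.40, 0.60]       # proportion of papers typeset in Word
mean_compliance = [0.05, 0.12, 0.20, 0.33]  # mean normalized Total Compliance
rho, p_rho = stats.spearmanr(word_usage, mean_compliance)
print(f"Spearman rho={rho:.2f} (p={p_rho:.3g})")
```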
The compliance criteria shown in Figure 7 show modest improvements over time, though only between 10-15% of papers in our sample meet any one of these individual criteria. Alt-text compliance is the lowest of our measured criteria, and as the only criterion of the five requiring author intervention in all cases, the lack of alt-text may be indicative of the general lack of author awareness of and contribution to accessibility efforts for scientific papers.
Based on our analysis, typesetting software plays a large role in document accessibility. Of the most common PDF creator software, Microsoft Word appears to produce the most accessibility-compliant PDFs, while LaTeX produces PDFs with the lowest compliance. Microsoft has recently made investments in the accessibility of their Office 365 Suite. It is clear that software can help increase accessibility compliance by prioritizing accessibility concerns during document creation, and we encourage other developers of typesetting and publishing software to do the same.
Improvements in accessibility compliance have stalled over the past decade, likely because accessibility concerns are considered marginal and are outside the awareness of most publishing authors and researchers. Significant changes in the authorial and publication processes are needed to change this status quo and to increase the accessibility of scientific papers for BLV users. Though we encourage such changes, the likelihood of rapid improvement is low, and these changes will not impact the many millions of academic PDFs that have already been published. Therefore, we introduce a technological solution that may mitigate some of the accessibility challenges of existing paper PDFs, and aim to understand how this solution and others like it could serve the immediate needs of the BLV research community.
4. Converting PDF to HTML: The SciA11y Pipeline
To address the broad accessibility challenges described in Section 3, we propose and prototype a system for extracting semantic content from paper PDFs and re-rendering this content as accessible HTML. HTML is widely accepted as a more accessible document format than PDF. In the 2019 Access SIGCHI Report, the authors discuss the reasoning behind switching CHI publications to a new HTML5 proceedings format to improve accessibility [27]. By rendering the content of paper PDFs as HTML, and introducing proper reading order and accessibility features such as section headings, links, and figure tags, we can offset many of the issues of reading from an inaccessible PDF. Our PDF-to-HTML rendering system is named SciA11y after the community-adopted numeronym for digital accessibility. Figure 1 provides a schematic for the approach. SciA11y leverages the two open-source PDF processing projects S2ORC [24] and DeepFigures [43], the Semantic Scholar API, and a custom Flask application for rendering the extracted content of the PDF as HTML. The S2ORC project [24] integrates the Grobid machine learning library [25] and a custom XML-to-JSON parser to produce a structured representation of paper text. We use a version of this pipeline to extract the content of each PDF. Figures and tables are inserted into the body text directly following the paragraph in which they are first mentioned; for example, if Figure 2 is first mentioned in paragraph 1 and Figure 3 in paragraph 2, Figure 2 will be placed directly following paragraph 1 and Figure 3 following paragraph 2. This ensures that the layout for the HTML render closely approximates the intended reading order. We justify this decision based on user feedback from our pilot study, which is discussed in Section 6.
In some cases, we are able to successfully process a PDF through S2ORC to extract textual content, but DeepFigures either fails to process the PDF or fails to extract some or all figures from the PDF. To mitigate the cognitive dissonance around figure or table mentions without corresponding figure or table objects, we insert placeholder objects into the HTML render as in Figure 8. For example, if "Figure 2" is mentioned in the text but is not successfully extracted by DeepFigures, we would insert a placeholder image for the figure based on the placement logic described in the previous paragraph, along with the text "Figure 2. Not extracted; please refer to original document." Similarly, mathematical equations that we cannot currently extract are acknowledged with the same placeholder text.
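A minimal sketch of this insertion logic, assuming hypothetical intermediate structures for the extracted paragraphs and figure/table objects:

```python
# Place each figure/table after the paragraph that first mentions it; emit a
# placeholder when no object was extracted for a mentioned figure or table.
# `paragraphs` and `figures` are hypothetical intermediate structures.
import re

PLACEHOLDER = "{name}. Not extracted; please refer to original document."
MENTION_RE = re.compile(r"(Figure \d+|Table \d+)")

def interleave(paragraphs: list, figures: dict) -> list:
    placed, output = set(), []
    for para in paragraphs:
        output.append(para)
        for name in MENTION_RE.findall(para):
            if name in placed:
                continue  # only insert at the first mention
            placed.add(name)
            output.append(figures.get(name, PLACEHOLDER.format(name=name)))
    return output

paras = ["As shown in Figure 2, compliance is low.", "Figure 3 breaks this down."]
figs = {"Figure 2": "<figure>...extracted Figure 2...</figure>"}
print(interleave(paras, figs))  # Figure 3 gets the placeholder text
```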
We add links between inline citations and the corresponding reference entry where possible. We insert links at each inline citation in the body text that link to the corresponding bibliography entry. Following each bibliography entry, we provide links back to the first mention of that entry in each section of the paper in which it was mentioned. For example, if bibliography entry [1] is cited in the "II. Related Works" section and the "III. Methods" section, we provide two links following the entry in the bibliography to the corresponding citation locations in sections II and III.

In the current iteration of the HTML render, we do not display author affiliations, footnotes, or mathematical equations due to the difficulty of extracting these pieces of information from the PDF. Though some of these elements are extracted in S2ORC, the overall quality of the extractions for these elements is lower, and is currently insufficient for surfacing in the prototype (see Section 5 for details). Future work includes investigating the possibility of extracting and exposing these elements, either by improving current models or training new models targeted towards the extraction of specific paper elements.
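The bidirectional link structure described above can be sketched as follows; the id and anchor naming here is hypothetical rather than the exact markup our prototype emits:

```python
# Each inline citation links forward to its bibliography entry; each entry
# links back to the first citing location in every section that cites it.
def inline_citation(ref_id: str, section_id: str, label: str) -> str:
    return f'<a id="cite-{ref_id}-{section_id}" href="#ref-{ref_id}">{label}</a>'

def bib_entry(ref_id: str, text: str, citing_sections: list) -> str:
    backlinks = " ".join(
        f'<a href="#cite-{ref_id}-{sec}">{sec}</a>' for sec in citing_sections
    )
    return f'<p id="ref-{ref_id}">{text} {backlinks}</p>'

print(inline_citation("1", "II", "[1]"))  # a citation in "II. Related Works"
print(bib_entry("1", "[1] Author et al. 2020.", ["II", "III"]))
```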
We leverage the feedback we received during our pilot studies (see Section 6) to make improvements prior to the main user study. We denote the versions of the prototype as v0.1 (the initial version, seen by P1), followed by incremental revisions. P2 signaled during his session that URLs in the bibliography were not being correctly extracted, so we patched the data to correctly extract and display URLs in bibliography entries.
Based on our evaluation of the quality of these HTML renders (Section 5) and user feedback and response (Section 6), we believe our approach can dramatically increase the screen reader navigability and accessibility of scientific papers across all disciplines by providing an alternate and more accessible HTML version of these papers. Properly tagged section headings allow for quick navigation and skimming of a paper, links between inline citations and bibliography entries allow users to browse to cited papers without losing their place, and figure tags for figure and table objects allow for direct navigation to these in-paper objects. We now discuss the quality of our PDF extractions (Section 5) and user response to the prototype (Section 6) in detail.
5. HTML Render Quality Evaluation
Extracting semantic content from PDF is an imperfect process. Though re-rendering a PDF as HTML can increase a document's accessibility, the process relies on machine learning models that can make mistakes when extracting information. As we glean from our user study, BLV users may have some tolerance for error, but there is an inherent trade-off between errors and perceived trust in the system. We conduct a study to estimate (1) the faithfulness of the HTML renders to the source PDFs, and (2) the overall readability of the resulting HTML renders. We define faithfulness as how accurately the HTML render represents different facets of the PDF document, such as displaying the correct title, section headers, and figure captions. These facets are measured as the number of errors that are made in rendering; e.g., mistakenly parsing one figure caption into the body text is counted as one error towards that facet. Readability, on the other hand, is an ordinal variable meant to capture the overall usability of the parse. Documents are given one of three grades: no major problems, some problems, or many problems impacting readability.
To evaluate readability and faithfulness, we first perform open coding on a small sample of document PDFs and corresponding SciA11y HTML renders. The purpose of this exercise is to identify facets of extraction that impact the ability to read a paper. A rubric is then designed based on these identified facets. The process taken to design the evaluation rubric, the rubric's content, and annotation instructions are detailed in Section 5.2. We then annotate a sample of 385 papers across different fields of study using this rubric. For each category of errors identified during open coding, we compute the overall error rates seen in our sample. We also present the overall assessed readability, reported in aggregate over our sample and by fields of study. Results of this evaluation are presented in Section 5.3.
5.1 Open Coding of Document Facets
One author performed open coding on a sample of papers, comparing the PDF and SciA11y HTML renders to identify inconsistencies and facets that impact the faithfulness of document representation. Papers were sampled from the Semantic Scholar API using various search terms, selecting the top 3 results for each search term for which a PDF and S2ORC parse are available. Search terms were selected to achieve coverage over different domains, and the top papers are sampled to select for relevant publications. The author stopped sampling papers upon reaching saturation, resulting in 8 search terms and 24 papers. The search terms used were: human computer interaction, epilepsy, quasars, language model, influenza epidemiology, anabolic steroids, social networks, and arctic snow cover.
For each paper, the author evaluated the PDF and HTML render side-by-side, scanning through the document to identify points of difference between the two document representations. Specifically, the author looked for any text in the PDF that is not shown in the HTML, any text from the PDF that is mixed into the main text of the HTML (e.g., figure captions, headers, or footnotes that should be separate from the main text but are mixed in, interrupting the reading flow), and other parsing mistakes (e.g., errors with math, missing lists and tables, etc.). These observations are detailed qualitatively, and each facet is assessed for its faithfulness to the original PDF document as well as its overall impact on readability.
5.2 Evaluation Rubric
Category | Description | Common errors |
---|---|---|
TITLE | The title and subtitle of the paper | Missing words; extra words |
AUTHORS | A list of authors who wrote the paper; this includes affiliation, though we do not explicitly evaluate affiliation in this study | Missing authors; extra authors; misspellings |
ABSTRACT | The abstract of the paper | Some text not extracted; other text incorrectly extracted as abstract |
SECTION HEADINGS | The text of section headings | Some headings not extracted (part of body text); other text incorrectly extracted as headings |
BODY TEXT | The main text of the paper, organized by paragraph under each section heading | Some paragraphs not extracted (missing); some text not extracted; other text incorrectly extracted as body text |
FIGURES | Images, captions, and alt-text of each figure | Figure not extracted; caption text not extracted (part of body text); other text incorrectly extracted as caption text |
TABLES | Caption/title and content of each table | Table not extracted (not part of body text); table not extracted (part of body text); caption text not extracted (part of body text); other text incorrectly extracted as caption text |
EQUATIONS | Mathematical formulas, represented in TeX or MathML; note: our current pipeline does not extract math | Some equations not extracted; some equations incorrectly extracted |
BIBLIOGRAPHY | Bibliography entries in the reference section | Some bibliography entries not extracted; some bibliography entries incorrectly extracted; other text incorrectly extracted as bibliography |
INLINE CITATIONS | Inline citations from the body text to papers in the bibliography section | Some inline citations not detected; some inline citations incorrectly linked |
HEADERS, FOOTERS, FOOTNOTES | Page headers and footers, footnotes, endnotes, and other text that is not a part of the main body of the document | Some headers and footers incorrectly extracted into body text |

Table 4. Categories of paper objects identified for evaluation along with the common errors seen for each category.
Observations from open coding are coalesced into an evaluation rubric and form for grading the quality and faithfulness of the HTML render. The evaluation form attempts to capture errors in PDF extraction that affect each of the primary semantic categories identified for proper reading. These semantic categories and common extraction errors are given in Table 4.
Questions in the form are designed to capture each type of faithfulness error, while allowing annotators to qualify their responses. We also include a question to capture the overall readability of the HTML render. Instructions for completing the annotation form are provided in Appendix A.1; the final version of the form is replicated in Appendix A.2; and the rubric for overall readability evaluation is given in Appendix A.3.
Three authors iterated twice on the content of the evaluation form, until they came to a consensus that all evaluation categories were adequately addressed using a minimum set of questions. Two authors then participated in pilot annotations, where each person independently annotated the same set of five papers sampled from the set labeled by the third author during open coding. Answers to all numeric questions were within ±1 for these five papers when comparing the two authors' annotations. All three authors discussed discrepancies in overall readability score, iterating on the rubric defined in Appendix A.3 and coming to a consensus. The finalized form and rubric are used for evaluation.
Of the categories and errors described in Table 4, our current pipeline does not extract table content and equations. Tables are extracted as images by DeepFigures [43], which do not contain table semantic information. Regarding equations, we distinguish between inline equations (math written in the body text) and display equations (independent line items that can usually be referenced by number); for this work, we evaluated a small sample of papers for successful extraction of display equations. Though some display equations are recognized, the quality of equation extraction is low, usually resulting in missing tokens or improper math formatting. Therefore, we decided to replace display equations in the prototype with the equation placeholder shown in Figure 8 . Since problems with mathematical formulae are among those most mentioned by users in our study, equation extraction is among our most urgent future goals, and we discuss some options in Section 7.1.
5.3 Evaluation Results
Evaluation criteria | Number of classes | Agreement | Cohen's Kappa | ICC | Mean Difference (± SD) |
---|---|---|---|---|---|
Title | 3 | 0.87 | 0.33 | - | - |
Authors | 3 | 1.00 | 1.00 | - | - |
Abstract | 3 | 0.95 | 0.64 | - | - |
Number of figures | - | 1.00 | - | 1.00 | 0.00 ± 0.00 |
Figure extraction errors | - | 0.89 | - | 1.00 | 0.11 ± 0.31 |
Figure caption errors | - | 0.89 | - | 1.00 | 0.11 ± 0.31 |
Number of tables | - | 0.92 | - | 0.98 | 0.12 ± 0.43 |
Table extraction errors | - | 0.89 | - | 0.98 | 0.17 ± 0.50 |
Table caption errors | - | 0.78 | - | 0.94 | 0.33 ± 0.67 |
Header/footer/footnote errors | - | 0.40 | - | 0.60 | 1.88 ± 2.12 |
Section heading errors | - | 0.71 | - | 0.79 | 0.71 ± 1.70 |
Body paragraph errors | - | 0.46 | - | 0.66 | 1.50 ± 2.22 |
Bibliography extraction | 4 | 0.94 | 0.82 | - | - |
Inline citation linking | 4 | 0.80 | 0.11 | - | - |
Overall score | 3 | 0.55 | 0.07 | - | - |
We assess inter-annotator agreement for each question in the evaluation form; see Table 5 for these results. Agreement was high for most element-level annotator questions. Annotators had the highest levels of disagreement on the evaluation of header/footer/footnote errors, section heading errors, and body paragraph errors, likely due to these being text-based and the most numerous, though the average differences reported between annotators on these questions are only between 1-2. Likewise, agreement on the overall readability score is modest, at 0.55; we note, however, that neither annotator labeled any paper as having no major readability problems when the other annotator labeled it as having lots of readability problems.
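Agreement figures like those in Table 5 can be computed as follows; the ratings below are hypothetical, and Cohen's kappa comes from scikit-learn (the intraclass correlation for the numeric questions can be computed with a package such as pingouin):

```python
# Raw agreement and Cohen's kappa between two annotators on the 3-class
# overall readability question; the labels are made-up examples.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["no_problems", "some_problems", "no_problems", "many_problems"]
annotator_b = ["no_problems", "some_problems", "some_problems", "many_problems"]

raw_agreement = sum(a == b for a, b in zip(annotator_a, annotator_b)) / len(annotator_a)
kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"agreement={raw_agreement:.2f}, kappa={kappa:.2f}")
```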