About Paper to HTML

What is Paper to HTML?

This is an experimental prototype created by Semantic Scholar that aims to render scientific papers in HTML so they can be more easily read by screen readers or on mobile devices. Our system is currently able to process PDFs, LaTeX source, and PubMed Central XMLs. We rely on statistical machine learning techniques to extract content from papers, so errors are inevitable. We are working on ways to improve extraction quality.

How do I use the Converter?

To use this system, upload a scientific PDF (or LaTeX zip file or JATS XML) on the main page. When you press upload, the system will process the document in the background and return the HTML when it is done. This usually takes around 1-2 minutes per PDF, depending on the size of the file, and is much faster for LaTeX or XML. This time cost is incurred only the first time we process your PDF; on subsequent uploads, the same file will display much faster. A gallery of pre-processed AI2 papers is also available here.

How does it work?

We use a number of models to extract different components of the paper. The technology is based on previous work done for the SciA11y project. To find out more about our models, please read our preprint: "Improving the Accessibility of Scientific Documents: Current state, user needs, and a system solution to enhance scientific PDF accessibility for blind and low vision users" [PDF, HTML]

What are the limitations?

There are several known limitations. Tables are currently extracted from PDFs as images, which are not accessible. Mathematical content in PDFs is either extracted with low fidelity or not extracted at all. Processing of LaTeX source and PubMed Central XML may lack some of the features implemented for PDF processing. We are working to improve these components, but please let us know if you would like some of these features prioritized over others.

What data do we keep?

We cache a copy of the extracted content as well as the extracted images. This allows us to serve the results more quickly when a user uploads the same file again. We do not retain the uploaded files themselves. Cached content is never served to a user who has not provided the exact same document. We do not share your uploaded files or the content of the uploaded files in any way.
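
The caching behavior described above can be illustrated with a small sketch. This is not our actual implementation: it assumes the cache is keyed by a SHA-256 digest of the uploaded bytes, and the names ExtractionCache, key_for, and get_or_extract are hypothetical. It only shows how a cache can return stored results for an identical file without retaining the file itself.

```python
import hashlib
from typing import Callable, Dict


class ExtractionCache:
    """Hypothetical in-memory cache keyed by a digest of the uploaded bytes.

    Assumption: the real service may key and store its cache differently.
    """

    def __init__(self) -> None:
        # Maps digest -> extracted HTML. The uploaded bytes are never stored.
        self._store: Dict[str, str] = {}

    @staticmethod
    def key_for(file_bytes: bytes) -> str:
        # Identical files produce identical digests; any change to the
        # file produces a different key.
        return hashlib.sha256(file_bytes).hexdigest()

    def get_or_extract(
        self, file_bytes: bytes, extract: Callable[[bytes], str]
    ) -> str:
        key = self.key_for(file_bytes)
        if key not in self._store:
            # First upload of this exact file: run the slow extraction once.
            self._store[key] = extract(file_bytes)
        # Later uploads of the same bytes are served from the cache.
        return self._store[key]
```

Under this assumption, a cached result can only be looked up by someone who supplies the same bytes, since the key is derived from the file content rather than from a file name or a user account.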

Will this be open-sourced?

We are investigating a path to open sourcing our code. In the meantime, the library we use to extract textual content is open source and available here.

Will this feature be available in Semantic Scholar?

The team plans to introduce accessibility features in Semantic Scholar. However, additional development and testing are needed to determine how best to put these tools into practice. If this system is valuable to you, please reach out and let us know!


Please send questions or feedback to accessibility@semanticscholar.org. You can also complete our feedback survey. If you are interested in contributing to this project, please contact Lucy Lu Wang!