Use the application here.
Making extractives data as open and accessible as possible means finding existing data and using it in analyses and visualizations. More often than not, this data is published in a PDF report.
PDFs are not an ideal format for publishing data. Data tables in PDFs are difficult to translate into a machine-readable format for use in a spreadsheet application like Microsoft Excel. Copying and pasting rarely preserves the rows and columns of the table.
For this reason, over the course of a large data collection project, NRGI data staff members developed an application that simplifies the process of extracting a table from a PDF. This tool is now available online.
The application builds on the open-source software Tabula, which does the heavy lifting of identifying tables in the PDF and extracting them to tabular format. Unlike Tabula, the application runs entirely in the web browser, with no download or installation required.
The application is designed around the common challenges of table scraping, such as the need to compare extracted values against the source to ensure accuracy. With the PDF displayed in the application window alongside a fully editable spreadsheet of the extracted data, this vital step is much more convenient. Additionally, users can scrape tables from multiple pages in a single click, then download them as a CSV file.
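To illustrate the multi-page step, here is a minimal Python sketch of merging tables scraped from several pages into one CSV. It is not the application's own code: the function name `tables_to_csv`, the assumption that each page repeats the same header row, and the sample values are all hypothetical.

```python
import csv
import io


def tables_to_csv(tables):
    """Merge per-page tables (lists of rows) into a single CSV string.

    Assumption: the first row of the first table is the header, and any
    later row identical to it is a repeated header that can be dropped.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    header = None
    for table in tables:
        for row in table:
            if header is None:
                header = row
                writer.writerow(row)
            elif row == header:
                continue  # skip a header repeated on a later page
            else:
                writer.writerow(row)
    return buffer.getvalue()


# Hypothetical values standing in for tables extracted from two PDF pages.
page_1 = [["Company", "Payment (USD)"], ["Alpha Mining", "1,200,000"]]
page_2 = [["Company", "Payment (USD)"], ["Beta Oil", "850,000"]]

csv_text = tables_to_csv([page_1, page_2])
print(csv_text)
```

The `csv` module quotes any value that contains a comma, so figures such as "1,200,000" stay in a single cell when the file is opened in a spreadsheet application.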
This application is built on open-source technology and all code is available in the GitHub repo. Suggestions can be made there or by emailing [email protected].
This application was developed with the help of Publish What You Pay - Canada, Kate Vang at ONE, and numerous NRGI colleagues. The application would not be possible without the open-source contributions of the Tabula team and the rOpenSci team.