# Copy-Paste PDF

[![Gem Version](https://badge.fury.io/rb/copy_paste_pdf.svg)](https://badge.fury.io/rb/copy_paste_pdf)
[![Code Climate](https://codeclimate.com/github/jpmckinney/copy_paste_pdf.png)](https://codeclimate.com/github/jpmckinney/copy_paste_pdf)

[Tabula](https://github.com/jazzido/tabula) was written for those cases where you can’t easily copy-and-paste tables from a PDF to a spreadsheet. Surprisingly, Tabula sometimes fails where copy-and-pasting succeeds. This project is for [those cases](http://www.atipp.gov.nl.ca/info/coordinators.html) when copy-and-pasting is all you need (and where nothing else works).

This gem only works on OS X. It works best on PDFs whose source materials are Excel spreadsheets.

## Getting Started

### PDF to CSV

Install with:

    gem install --no-wrappers copy_paste_pdf

The `--no-wrappers` switch is required: if you omit it, the AppleScript will not install properly. Once installed, run the script with:

    copy-paste-pdf.applescript /path/to/input.pdf /path/to/output.csv

* The script will open the PDF in Preview and copy the contents of the PDF
* The script will open Microsoft Excel, paste the contents and save as CSV

If you want the script to quit Preview and Excel once it's done, pass a third argument, like:

    copy-paste-pdf.applescript /path/to/input.pdf /path/to/output.csv true

The script may [pinwheel](https://en.wikipedia.org/wiki/Spinning_pinwheel) while copying the contents of the PDF and while pasting the contents to the spreadsheet. If it looks like nothing is happening, wait a few seconds.

You can work in other applications while the script is running, but don't use the clipboard, as doing so may interfere with the script.

This method is admittedly not very efficient. Running time averages under 2 seconds per page but varies considerably depending on your system's load.
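If you have a folder of PDFs, the command above can be driven from Ruby. The following is a minimal sketch, not part of the gem: the helper names `conversion_command`, `csv_path_for` and `convert_all` are hypothetical, and it assumes `copy-paste-pdf.applescript` is on your `PATH`. It runs conversions one at a time, because the script relies on the clipboard.

```ruby
# Hypothetical helper: builds the argument list for one conversion.
# The optional third argument asks the script to quit Preview and Excel.
def conversion_command(pdf_path, csv_path, quit_apps: false)
  command = ['copy-paste-pdf.applescript', pdf_path, csv_path]
  command << 'true' if quit_apps
  command
end

# Hypothetical helper: derives the output path by swapping the extension.
def csv_path_for(pdf_path)
  pdf_path.sub(/\.pdf\z/i, '.csv')
end

# Hypothetical helper: converts every PDF in a directory, sequentially.
# Only the last run asks the script to quit Preview and Excel.
def convert_all(directory)
  pdfs = Dir.glob(File.join(File.expand_path(directory), '*.pdf')).sort
  pdfs.each_with_index do |pdf, index|
    last = index == pdfs.size - 1
    ok = system(*conversion_command(pdf, csv_path_for(pdf), quit_apps: last))
    warn "conversion failed for #{pdf}" unless ok
  end
end
```

Calling `convert_all('/path/to/pdfs')` would then write one CSV next to each PDF.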

### Data Cleaning

The Ruby gem defines helper methods for cleaning the CSV. In most cases, the PDF to CSV conversion will create many empty rows. You can easily remove those rows with:

```ruby
require 'csv'

require 'copy_paste_pdf'

table = CopyPastePDF::Table.new(CSV.read('/path/to/output.csv'))

table.remove_empty_rows!

CSV.open('/path/to/clean.csv', 'w') do |csv|
  table.each do |row|
    csv << row
  end
end
```
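To make "empty row" concrete, here is a plain-Ruby sketch of the same idea. It is an illustration only, not the gem's implementation: the helpers `blank_cell?` and `without_empty_rows` are hypothetical, and treating whitespace-only cells as empty is an assumption.

```ruby
# Illustrative only: not the gem's implementation.
# A cell is blank if it is nil or contains only whitespace (assumption).
def blank_cell?(cell)
  cell.nil? || cell.to_s.strip.empty?
end

# Returns a new array without the rows whose cells are all blank.
# A row with no cells at all also counts as empty.
def without_empty_rows(rows)
  rows.reject { |row| row.all? { |cell| blank_cell?(cell) } }
end

rows = [
  ['Name', 'Phone'],
  [nil, nil],
  ['Alice', '555-0100'],
  ['', '  '],
  []
]
cleaned = without_empty_rows(rows)
# cleaned keeps only the header row and Alice's row
```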

If the table in the PDF contained vertically merged cells, then in the CSV the first of the merged cells will have a value and the others will be empty. To copy the value of the first cell to the others, use the `copy_into_cell_below` method, which accepts the indices of columns containing merged cells:

```ruby
table.copy_into_cell_below(0, 3, 4)
```
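The effect on the data can be sketched in plain Ruby. This is a hedged illustration of the "fill down" idea, not the gem's code: `fill_down` is a hypothetical name, and it assumes a cell counts as empty when it is `nil` or whitespace-only.

```ruby
# Illustrative only: not the gem's implementation.
# Copies the last non-blank value in each given column into the blank
# cells below it, returning new rows and leaving the input untouched.
def fill_down(rows, *columns)
  previous = {}
  rows.map do |row|
    filled = row.dup
    columns.each do |index|
      if filled[index].nil? || filled[index].to_s.strip.empty?
        # Only fill when a value has already been seen in this column.
        filled[index] = previous[index] if previous.key?(index)
      else
        previous[index] = filled[index]
      end
    end
    filled
  end
end

rows = [
  ['Dept A', 'Alice'],
  [nil, 'Bob'],
  ['Dept B', 'Carol'],
  ['', 'Dan']
]
filled = fill_down(rows, 0)
# Bob's row now carries 'Dept A' and Dan's row carries 'Dept B'
```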

Sometimes, if a cell contains multiple lines of text, the PDF to CSV conversion will incorrectly break the cell into multiple rows. To remove the spurious row and merge its values into the row above, use the `merge_into_cell_above` method, which accepts the indices of columns in which this error occurs:

```ruby
table.merge_into_cell_above(1, 2)
```

With additional time and effort, these two methods can be made to operate without needing columns as hints.
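The row-merging repair described above can also be sketched in plain Ruby. This is not the gem's implementation, and the rule for spotting a spurious row is an assumption made for the example: a row is treated as spurious when it has text in at least one hinted column and nothing in any other column. The names `merge_up` and `spurious_row?` are hypothetical.

```ruby
# Illustrative only: not the gem's implementation.
def blank?(cell)
  cell.nil? || cell.to_s.strip.empty?
end

# Assumption: a spurious row has text in at least one hinted column
# and is blank in every other column.
def spurious_row?(row, columns)
  hinted, others = row.each_with_index.partition { |_cell, index| columns.include?(index) }
  hinted.any? { |cell, _index| !blank?(cell) } && others.all? { |cell, _index| blank?(cell) }
end

# Folds each spurious row into the row above it, joining the cell
# values with a newline, and returns the repaired rows.
def merge_up(rows, *columns)
  rows.each_with_object([]) do |row, merged|
    if merged.any? && spurious_row?(row, columns)
      above = merged.last
      columns.each do |index|
        next if blank?(row[index])
        above[index] = [above[index], row[index]].reject { |cell| blank?(cell) }.join("\n")
      end
    else
      merged << row.dup
    end
  end
end

rows = [
  ['Health', 'Jane Doe,', '709-555-0100'],
  [nil, 'Coordinator', nil],
  ['Justice', 'John Roe', '709-555-0101']
]
repaired = merge_up(rows, 1, 2)
# The second row is folded into the first; two rows remain.
```

A rule like this misfires on rows that are legitimately sparse, which is why column hints are needed.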

## Troubleshooting

If you see warnings on the command line like:

    2013-10-09 14:30:03.704 osascript[2056:707] Error loading /Library/ScriptingAdditions/Adobe Unit Types.osax/Contents/MacOS/Adobe Unit Types:  dlopen(/Library/ScriptingAdditions/Adobe Unit Types.osax/Contents/MacOS/Adobe Unit Types, 262): no suitable image found.  Did find:
      /Library/ScriptingAdditions/Adobe Unit Types.osax/Contents/MacOS/Adobe Unit Types: no matching architecture in universal wrapper
    osascript: OpenScripting.framework - scripting addition "/Library/ScriptingAdditions/Adobe Unit Types.osax" declares no loadable handlers.

See [this Adobe help article](http://helpx.adobe.com/photoshop/kb/unit-type-conversion-error-applescript.html).

## Developers

If, like me, you almost never write AppleScript, you can access much of AppleScript's documentation through Apple's AppleScript Editor. See, for example, how to access [the entries about Microsoft Excel](http://support.microsoft.com/kb/113891).

## Why?

Most of the PDFs I work with contain no tables. In those cases, I do one of the following:

* Run `pdftotext filename.pdf` to convert the PDF to text, and write a script using regular expressions to parse the output.
* Run `pdftotext -layout filename.pdf` to convert the PDF to text while preserving the text layout, which is very useful when working with two-column layouts.
* Use [commercial software](http://reviews.reporterslab.org/search?q=&type=products&category=pdf-tools-2011-11-09) like Adobe Acrobat Pro to save the PDF to another format, usually Excel.
* Use the `Extract PDF Text` action in Apple's Automator, which I recently learned about and which performs well.
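As an illustration of the first approach, here is a minimal sketch of parsing `pdftotext` output with a regular expression. The `Label: value` line format, the sample text and the helper names `pdf_text` and `parse_fields` are all assumptions made for the example; real PDFs will need their own patterns. It also assumes `pdftotext` is on your `PATH`.

```ruby
require 'open3'

# Hypothetical helper: returns the text of a PDF by running
# `pdftotext -layout`, with `-` sending the output to standard output.
def pdf_text(pdf_path)
  text, status = Open3.capture2('pdftotext', '-layout', pdf_path, '-')
  raise "pdftotext failed for #{pdf_path}" unless status.success?
  text
end

# Hypothetical helper: collects "Label: value" lines into a hash.
# Lines that do not match the pattern are skipped.
def parse_fields(text)
  pattern = /\A\s*(?<label>[^:]+?)\s*:\s*(?<value>.+?)\s*\z/
  text.each_line.each_with_object({}) do |line, fields|
    match = pattern.match(line)
    fields[match[:label]] = match[:value] if match
  end
end

sample = <<~TEXT
  Department: Health and Community Services
  Coordinator:   Jane Doe

  Telephone: 709-555-0100
TEXT
fields = parse_fields(sample)
# fields maps 'Department', 'Coordinator' and 'Telephone' to their values
```

With a real file, `parse_fields(pdf_text('/path/to/input.pdf'))` would combine the two steps.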

For PDFs containing tables, I discovered that copy-pasting from Apple's Preview to Microsoft Excel worked better than all alternatives tested, for the PDFs I was interested in.

## Related projects

* If you need to extract tables from text-based PDFs, see [Tabula](https://github.com/jazzido/tabula)
* If you need to extract text from text-based PDFs, use [docsplit](http://documentcloud.github.io/docsplit/) or `pdftotext`
* If you need to extract tables from image-based PDFs, see [Carpenter](https://github.com/stefanw/carpenter) or [DocHive](https://github.com/raleighpublicrecord/dochive)
* If you need to extract text from image-based PDFs, use [docsplit](http://documentcloud.github.io/docsplit/) or Tesseract
* If you have many volunteers, consider using [Crowdcrafting to transcribe tables](http://www.youtube.com/watch?v=yfnJHALzlZc)

You may also be interested in the Open Knowledge Foundation's [messytables](https://github.com/okfn/messytables) and [pdftables](https://github.com/okfn/pdftables).

Copyright (c) 2013 James McKinney, released under the MIT license