toasterlovin/csv_party

View on GitHub
ROADMAP.md

Summary

Maintainability
Test Coverage
Roadmap
-

- [1.1 Early Return While Parsing](#11-early-return-while-parsing)
- [1.2 Rows to Hash](#12-rows-to-hash)
- [1.3 Generate Unimported Rows CSV](#13-generate-unimported-rows-csv)
- [1.4 Batch API](#14-batch-api)
- [1.5 Runtime Configuration](#15-runtime-configuration)
- [1.6 CSV Parse Error Handling](#16-csv-parse-error-handling)
- [Someday Features](#someday-features)
    - [Parse Row Access](#parse-row-access)
    - [Deferred Parsing](#deferred-parsing)
    - [Columns Macro](#columns-macro)
    - [Column Numbers](#column-numbers)
    - [Multi-column Parsing](#multi-column-parsing)
    - [Parse Dependencies](#parse-dependencies)
    - [Rails Generator](#rails-generator0

#### 1.1 Early Return While Parsing

Currently, CSVParty is pretty well thought out about what should happen when
either 1) one of the built in flow control methods (`next_row`, `skip_row`,
`abort_row`, and `abort_import`) is used, or 2) an error is raised while
the row importer block is being executed. However, all of these things can also
happen when the columns for a row are being parsed. When/if it does, most of the
flow control and error handling kind of assumes that the row has been fully
parsed. So some design work should go into deciding what should happen in these
cases. And then tests should be written for all of the various scenarios.

#### 1.2 Rows to Hash

One of the primary use cases for importing CSV files is to insert their contents
into a database. Apparently this is common enough that the
[csv-importer](https://github.com/pcreux/csv-importer) gem, which almost
completely automates this process without much room for customization, is very
popular. So, in the case where there is a pretty simple correspondence between
the contents of a CSV file and ActiveRecord models, it should be dead simple to
get the job done.

What I have in mind is something like:

    class MyImporter < CSVParty::Importer
      column :product_id
      column :quantity
      column :price

      rows do |row|
        LineItem.create(row.attributes)
      end
    end

Where `row.attributes` returns a hash with all of the column names as keys and
all of the parsed values as values. So, with an importer like the one above,
`row.attributes` would return a hash like so:

    { product_id: 42, quantity: 3, price: 9.99 }

#### 1.3 Generate Unimported Rows CSV

Most user inputs to an application are relatively constrained. CSV files, on the
other hand, are not. Users can, and will, put all kinds of erroneous data into
their CSV files. So, it is useful to be able to provide a user with a list of
the rows in their file that could not be imported, so that they can re-import
these rows after they have resolved whatever issues existed. And CSV is a
natural format for this, since the user can open the file in Excel and make
edits.

A motivated user of CSVParty can already achieve this by accessing the
`skipped_rows`, `aborted_rows`, and `error_rows` arrays and constructing one or
more CSV files from these, but it would be nice to provide a default
implementation that is only a method call away. What I have in mind is for the
CSV file that is created to have the exact same column structure as the original
file, but with three additional columns:

  - The original row number
  - The status (skipped, aborted, errored)
  - A message explaining the reason for the status

Conveniently, all of these pieces of data are available for skipped, aborted,
and errored rows. Then, the file would be generated with a method, like so:

    # all three combined
    importer.unimported_rows_as_csv
    # or separate
    importer.skipped_rows_as_csv
    importer.aborted_rows_as_csv
    importer.error_rows_as_csv

#### 1.4 Batch API

It can be way more performant to batch imports so that expensive operations,
like persisting data, are only done every so often. This would add an API to
accumulate data, execute some logic every X number of rows, reset the
accumulators, then repeat. Here's a rough sketch of what that API might look
like:

    rows do |row|
      customers[row.customer_id] = { name: row.customer_name, phone: row.phone }
      orders[row.order_id] = { customer_id: row.customer_id, invoice_number: row.invoice_number }
    end

    batch 50, customers: {}, orders: {} do
      # insert customers into database
      # insert orders into database
    end

The first argument is how often the batch logic should be executed. In this
case, every 50 rows. Then there is a hash of accumulators, where the keys are
the names of the accumulators and the values are the initial values. Declaring
the accumulators accomplished two things:

1. It provides accessor methods so that the accumulators can be accessed from
   within the row import block.
2. It automatically resets the accumulators to their initial values each time
   the batch block is executed.

So, it is essentially functionally identical to doing the following:

    class MyImporter < CSVParty::Importer
      attr_accessor :customers, :orders

      def customers
        @customers ||= {}
      end

      def orders
        @orders ||= {}
      end

      rows do |row|
        # add customer to customers accumulator
        # add order to orders accumulator
      end

      batch 50 do
        # insert customers into database
        # insert orders into database
        customers = {}
        orders = {}
      end
    end

_Note:_ The following is a rough sketch of an API that would handle a use case
that has come up. However, some research should be done first to figure out if
the use case it addresses is common.

One use case that has been mentioned is when rows are grouped by their
relationship to a parent record and those rows need to be acted on as a group.
So, imagine a CSV file like so:

    Customer,Address,Product,Quantity,Price
    Joe Smith,123 Main St.,Birkenstocks,1,74.99
    Joe Smith,123 Main St.,Air Jordans,1,129.99
    Joe Smith,123 Main St.,Tevas,3,59.99
    Jane Doe,713 Broadway,Converse All-Star,1,39.99
    Jane Doe,713 Broadway,Toms,1,59.99

It might be useful to be able to specify the batch interval in terms of one of
the columns in the CSV file, rather than as a number of rows. So, you would be
able to do:

    class MyImporter < CSVParty::Importer
      column :customer
      column :address
      column :product
      column :quantity, as: :integer
      column :price, as: :decimal

      rows do |row|
        line_items << { product: row.product, quantity: row.quantity, price: row.price }
      end

      batch :customer, line_items: [] do |current_row|
        Customer.create(name: current_row.customer, address: current_row.address)
        line_items.each do |li|
          LineItem.create(li)
        end
      end
    end

In this case, the batch logic gets executed everytime there is a change in the
`:customer` column from one row to the next, rather than every X number of rows.
The accumulator works the same way: accessors are made available for adding
records to the accumulator and then the accumulator is automatically reset to
its initial value each time the batch logic is executed.

#### 1.5 Runtime Configuration

Sometimes it useful to be able to configure an importer at runtime, rather than
at code writing time. An obvious example of when this would be useful is in the
case of user defined column header names. So, imagine a UI in which the user
uploads their CSV file, then specifies which column is, for example, the product
column, which is the quantity column, and which is the price column. In a case
like this, there is no way to specify the column definitions ahead of time; we
have to wait for the header names from the user.

Here is a sketch of what the API for runtime configuration would look like:

    class MyImporter < CSVParty::Importer
      rows do |row|
        # persist data
      end
    end

    # then:

    my_importer = MyImporter.new
    my_importer.configure do
      column :product, header: user_product_header
      column :quantity, header: user_quantity_header, as: :integer
      column :price, header: user_price_header, as: :decimal
    end

An open question is whether all DSL methods should be configurable at runtime.

#### 1.6 CSV Parse Error Handling

Sometimes it is useful to be able to completely ignore parsing and encoding
errors raised by the `CSV` class. To be clear, doing so is dangerous, since the
parsing logic in the `CSV` class is not designed to continue operating after it
encounters an error and raises. But sometimes you don't want to let a single
improperly encoded character prevent you from importing an entire CSV file. So,
this feature would be an optional way to either ignore those errors or respond
to them, and then continue importing. The API would probably be similar to the
error handling API for non-parse errors. So:

    parse_errors :ignore # silently continue importing the next row

    parse_errors do |line_number|
      # handle parse error
    end

    my_import.parse_error_rows # returns array of parse error rows

## Someday Features

#### Parse Row Access

This feature would allow access to the `CSV::Row` object when parsing a column.
It could work something like this:

    column product do |value, row|
      Product.find_by(name: row['Product'])
    end

In theory, the CSVParty API would cover all use cases where somebody would need
to access the raw row data, but perhaps not. Sometimes it's nice to be able to
cut through the stuff in your way and just get at the raw internals.

Additionally, perhaps deferred parsing would allow access to parsed row values,
which would possibly enable some of the features below, Multi Column Parsing and
Parse Dependencies, without requiring additional code. That might look like:

    column product do |value, row|
      Product.find_by(name: row.unparsed.row)
    end

#### Deferred Parsing

Currently, CSVParty parses all columns before the row import logic is executed.
There are situations where parsing columns is expensive and you may want to
defer parsing columns until and unless you actually need them for that a given
row. For example, say you have an import where you are either:

1. Updating a value on existing records, or
2. Setting a value on and creating new records

And when you create new records, you have to also set a bunch of other values
and some of those values require database queries.

In a situation like this, you would want to defer all of the database queries
that need to run when creating a new record so that they aren't done in cases
where you are updating an existing record.

#### Columns Macro

This feature would allow you to declare multiple columns in a single line. So,
rather than:

    column :product
    column :price
    column :color

You could do:

    columns :product, :price, :color

This is probably most useful when there are a bunch of columns that should all
be parsed as text. Though it might make sense to allow specifying parsers and
other options:

    columns product: { as: :raw }, price: { as: :decimal }

It should probably also be possible to combine `columns` and `column` macros:

    columns :product, :price, :color
    column :size

#### Column Numbers

CSVParty is entirely oriented around a CSV file having a header. This is not
always the case, though. This would add the ability to specify columns using a
column number, rather than a header. A rough sketch of the API might look like:

    class MyImporter < CSVParty::Importer
      column :product, number: 7
      column :quantity, number: 8, as: :integer
      column :price, number: 9, as: :decimal
    end

#### Multi-column Parsing

The whole idea behind custom parsers is that it makes for much cleaner code to
get all the logic related to parsing a raw value into a useful intermediate
object in one place, away from the larger logic of what needs to happen to each
row. Sometimes, though, you need access to multiple column values to create a
useful parsed value. Here is what an API for that might look like:

    column :total, header: ['Price', 'Quantity'] do |price, quantity|
      BigDecimal.new(price) * BigDecimal.new(quantity)
    end

#### Parse Dependencies

Sometimes, while parsing a column, it would be useful to have access to the
parsed value from another column. This would make that possible. Here is what
that might look like:

    class MyImporter < CSVParty::Importer
      column :customer do |customer_id|
        Customer.find(customer_id)
      end

      column :order, depends_on: :customer do |order_id, customer|
        customer.orders.find(order_id)
      end
    end

#### Rails Generator

This feature would add a generator for Rails which creates an importer file. So
doing:

    rails generate importer Product

Would generate the following file at `app/importers/product_importer.rb`:

    class ProductImporter
      include CSVParty

      import do |row|
        # import logic goes here
      end
    end