18F/reheader

View on GitHub
README.md

Summary

Maintainability
Test Coverage
# reheader

[![PyPI Status](https://img.shields.io/pypi/v/reheader.svg)](https://pypi.python.org/pypi/reheader)
[![Build Status](https://img.shields.io/travis/18F/reheader.svg?branch=master)](https://travis-ci.org/18F/reheader)
[![Coverage Status](https://coveralls.io/repos/github/18F/reheader.svg?branch=master)](https://coveralls.io/github/18f/reheader?branch=master)
[![Code Climate](https://codeclimate.com/github/18F/reheader.svg)](https://codeclimate.com/github/18F/reheader)

Forces data's headers to match an expected template, using fuzzy column
name matching and/or regular expressions to recognize columns.

Designed for import of CSVs from uncontrolled sources.

* Documentation: https://reheader.readthedocs.io.


## Features

* Process lists of dictionaries or lists of lists, including those
  generated from CSVs

## Using

### Fuzzily matching header names

If `headers` is a `list`, reheader uses a fuzzy match between desired and
actual header values to identify columns.

    $ head -2 data.csv
    name,mail,profession,zip
    Nellie Newsock,nellie@sox.com,Adventuress,45309
    $ python
    >>> import csv
    >>> from reheader.reheader import reheadered
    >>> infile = open('data.csv')
    >>> data_source = csv.DictReader(infile)
    >>> for row in reheadered(data_source, ['email', 'zipcode', 'name']):
    ...     print(row)
    {'email': 'nellie@sox.com', 'name': 'Nellie Newsock', 'zipcode': '45309'}


### Identifying columns by regex

If `headers` is a `dict`, the values are used as regular expression matches
against the data values.

    >>> from commonregex import email
    >>> headers = {'Full Name': None, 'addr': email, 'postal_code': r'^[0-9-+]$'}
    >>> for row in reheadered(data_source, ['email', 'zip_code', 'name']):
    ...     print(row)

The values may be either strings (which will be compiled to regexes) or
compiled regexes (to include regex compilation flags of your choice).
In this instance, we have used a compiled regex supplied by the
separate `commonregex` package.

For a regex of `None`, reheader falls back on a fuzzy match of header name.

### Optional arguments

* `keep_extra` (default `False`): Columns missing from `headers` should
  be included in results

* `minimum_score` (default 60): Fuzzy header match must meet this score
  (of 100) to be considered a hit

* `prefer_fuzzy` (default `False`): Even if a header has a regex, prefer
  a fuzzy match of header name where possible

* `header_present` (default `None`): For a list of lists only, `True`
  indicates that the first non-empty row contains headers.  `False`
  indicates that there is no header row, and regexes must be used to
  identify columns.  An integer indicates multi-row headers.
  By default (`None`), reheader guesses whether the first row is a
  header based on its rough similarity in form to subsequent rows.


## Credits

This package was created with [Cookiecutter](https://github.com/audreyr/cookiecutter)
and the [18F/cookiecutter-pypackage](https://github.com/audreyr/cookiecutter-pypackage)
project template.

## Public domain

This project is in the worldwide [public domain](LICENSE.md). As stated in [CONTRIBUTING](CONTRIBUTING.md):

> This project is in the public domain within the United States, and copyright and related rights in the work worldwide are waived through the [CC0 1.0 Universal public domain dedication](https://creativecommons.org/publicdomain/zero/1.0/).
>
> All contributions to this project will be released under the CC0 dedication. By submitting a pull request, you are agreeing to comply with this waiver of copyright interest.