---
title: 127 examples of Red Amber
author: heronshoes
date: '2023-08-11'
jupyter: ruby
format:
  pdf:
    toc: true
    code-fold: false
    fontfamily: libertinus
    colorlinks: true
---

For RedAmber versions 0.5.1 and 0.5.2, and Arrow versions 12.0.1 and 13.0.0.

## 1. Install

Install requirements before you install RedAmber.

- Ruby (>= 3.0)

- Apache Arrow (>= 12.0.0)
- Apache Arrow GLib (>= 12.0.0)
- Apache Parquet GLib (>= 12.0.0)  # if you need IO from/to Parquet resource.

  See [Apache Arrow install document](https://arrow.apache.org/install/).

  - Minimum installation example for the latest Ubuntu:
    ```shell
    sudo apt update
    sudo apt install -y -V ca-certificates lsb-release wget
    wget https://apache.jfrog.io/artifactory/arrow/$(lsb_release --id --short | tr 'A-Z' 'a-z')/apache-arrow-apt-source-latest-$(lsb_release --codename --short).deb
    sudo apt install -y -V ./apache-arrow-apt-source-latest-$(lsb_release --codename --short).deb
    sudo apt update
    sudo apt install -y -V libarrow-dev
    sudo apt install -y -V libarrow-glib-dev
    ```
  - On Fedora 38 (Rawhide):
    ```shell
    sudo dnf update
    sudo dnf -y install gcc-c++ libarrow-devel libarrow-glib-devel ruby-devel libyaml-devel
    ```

  - On macOS, you can install the Apache Arrow C++ library using Homebrew:
    ```shell
    brew install apache-arrow
    ```

    and the GLib (C) package with:
    ```shell
    brew install apache-arrow-glib
    ```

If you prepared Apache Arrow, add these lines to your Gemfile:

```ruby
gem 'red-arrow',   '>= 12.0.0'
gem 'red_amber'
gem 'red-arrow-numo-narray'    # Optional, recommended if you use inputs from Numo::NArray
                               # or use random sampling feature.
gem 'red-parquet', '>= 12.0.0' # Optional, if you use IO from/to parquet
gem 'red-datasets-arrow'       # Optional, recommended if you use Red Datasets
gem 'red-arrow-activerecord'   # Optional, if you use Active Record
gem 'rover-df'                 # Optional, if you use IO from/to Rover::DataFrame.
```

And then execute `bundle install` or install it yourself as `gem install red_amber`.

## 2. Require

```{ruby}
#| tags: []
require 'red_amber' # require 'red-amber' is also OK
include RedAmber
{RedAmber: VERSION, Arrow: Arrow::VERSION}
```

## 3. Initialize

There are several ways to initialize a DataFrame.

```{ruby}
#| tags: []
# From a Hash
DataFrame.new(x: [1, 2, 3], y: %w[A B C])
```

```{ruby}
#| tags: []
# From a schema and a row-oriented array
DataFrame.new({ x: :uint8, y: :string }, [[1, 'A'], [2, 'B'], [3, 'C']])
```

```{ruby}
#| tags: []
# From an Arrow::Table
table = Arrow::Table.new(x: [1, 2, 3], y: %w[A B C])
DataFrame.new(table)
```

```{ruby}
#| tags: []
# From a Rover::DataFrame
require 'rover'
rover = Rover::DataFrame.new(x: [1, 2, 3], y: %w[A B C])
DataFrame.new(rover)
```

```{ruby}
#| tags: []
# From a dataset in Red Datasets
require 'datasets-arrow'
dataset = Datasets::Penguins.new
penguins = DataFrame.new(dataset) # Since 0.2.2. With older versions, use `dataset.to_arrow`.
```

```{ruby}
#| tags: []
dataset = Datasets::Rdatasets.new('datasets', 'mtcars')
mtcars = DataFrame.new(dataset)
```

(New from 0.2.3 with Arrow 10.0.0) It is possible to initialize from objects that respond to `to_arrow`. Numo::NArray arrays respond to `to_arrow` with the `red-arrow-numo-narray` gem. This feature was proposed by Red Data Tools member @kojix2 and implemented by @kou in Arrow 10.0.0 and Red Arrow Numo::NArray 0.0.6. Thanks!

```{ruby}
#| tags: []
require 'arrow-numo-narray'

DataFrame.new(numo: Numo::DFloat.new(3).rand)
```

Another Numo::NArray example is [#77. Introduce columns from numo/narray](#77.-Introduce-columns-from-numo/narray).

## 4. Load

`RedAmber::DataFrame` delegates `#load` to `Arrow::Table#load`. We can load from `[.arrow, .arrows, .csv, .csv.gz, .tsv]` files.

`load` accepts the following options (a combined sketch follows the list):

`load(input, format: nil, compression: nil, schema: nil, skip_lines: nil)`

- `format` [:arrow_file, :batch, :arrows, :arrow_stream, :stream, :csv, :tsv]
- `compression` [:gzip, nil]
- `schema` [Arrow::Schema]
- `skip_lines` [Regexp]
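
These options can be combined. A minimal sketch, assuming a gzipped CSV file `comecome.csv.gz` exists (the file name is hypothetical):

```ruby
# Load a gzipped CSV while skipping comment lines (a sketch; the path is an assumption)
DataFrame.load('comecome.csv.gz', format: :csv, compression: :gzip, skip_lines: /^#/)
```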

Load from a file 'comecome.csv';

```{ruby}
#| tags: []
file = Tempfile.open(['comecome', '.csv']) do |f|
  f.puts(<<~CSV)
    name,age
    Yasuko,68
    Rui,49
    Hinata,28
  CSV
  f
end

DataFrame.load(file)
```

Load from a Buffer;

```{ruby}
#| tags: []
DataFrame.load(Arrow::Buffer.new(<<~BUFFER), format: :csv)
  name,age
  Yasuko,68
  Rui,49
  Hinata,28
BUFFER
```

Load from a Buffer skipping comment line;

```{ruby}
#| tags: []
DataFrame.load(Arrow::Buffer.new(<<~BUFFER), format: :csv, skip_lines: /^#/)
  # comment
  name,age
  Yasuko,68
  Rui,49
  Hinata,28
BUFFER
```

## 5. Load from a URI

```{ruby}
#| tags: []
uri = URI("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv")
DataFrame.load(uri)
```

## 6. Save

`#save` accepts same options as `#load`. See [#4. Load](#4.-Load).

```{ruby}
#| tags: []
penguins.save("penguins.arrow")
penguins.save("penguins.arrows")
penguins.save("penguins.csv")
penguins.save("penguins.csv.gz")
penguins.save("penguins.tsv")
penguins.save("penguins.feather")
```

(Since 0.3.0) `DataFrame#save` returns self.
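
Because `#save` returns self, it can sit in the middle of a method chain. A minimal sketch (the output file name is arbitrary):

```ruby
# Save and keep working with the same DataFrame (a sketch)
penguins.save("penguins_copy.csv").shape
```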

## 7. to_s/inspect

`to_s` or `inspect` (which uses `to_s` internally) shows a preview of the DataFrame.

It shows the first 5 and last 3 rows if there are many rows. Columns are also omitted if the line exceeds 80 characters.

```{ruby}
#| tags: []
df = DataFrame.new(
  x: [1, 2, 3, 4, 5],
  y: [1, 2, 3, 0/0.0, nil],
  s: %w[A B C D] << nil,
  b: [true, false, true, false, nil]
)
```

```{ruby}
#| tags: []
p penguins; nil
```

## 8. Show table

`#table` returns the underlying Arrow::Table object. The alias is `#to_arrow`.

```{ruby}
#| tags: []
df.table
```

```{ruby}
#| tags: []
penguins.to_arrow
```

```{ruby}
#| tags: []
# This is a Red Arrow's feature
puts df.table.to_s(format: :column)
```

```{ruby}
#| tags: []
# This is also a Red Arrow's feature
puts df.table.to_s(format: :list)
```

## 9. TDR

TDR means 'Transposed DataFrame Representation'. It shows columns laterally, in the same shape as when initializing with a Hash. TDR carries information useful for exploratory data processing:

- DataFrame shape: n_rows x n_columns
- Data types
- Levels: number of unique elements
- Data preview: data is aggregated when the level is small (tally mode)
- Counts of abnormal elements: NaN and nil

It is similar to dplyr's (or Polars's) `glimpse()` so we have an alias `#glimpse` (since 0.4.0).

```{ruby}
#| tags: []
df.tdr
```

```{ruby}
#| tags: []
penguins.tdr
```

`#tdr` has some options:

`limit` : limits the number of variables to show. The default is `limit: 10`.

```{ruby}
#| tags: []
penguins.tdr(3)
```

By default `#tdr` shows 9 variables at maximum. `#tdr(:all)` will show all variables.

```{ruby}
#| tags: []
mtcars.tdr(:all)
```

(Since 0.4.0) The `#tdra` method is a shortcut for `#tdr(:all)`.

```{ruby}
#| tags: []
mtcars.tdra
```

`elements` : the maximum number of elements to show for each variable. The default is `elements: 5`.

```{ruby}
#| tags: []
penguins.tdr(elements: 3) # Show first 3 items in data
```

`tally` : the maximum level at which tally mode is used. Level means the size of the `tally`ed hash. The default is `tally: 5`.

```{ruby}
#| tags: []
penguins.tdr(tally: 0) # Don't use tally mode
```

`#tdr_str` returns a String. `#tdr` does the same thing as `puts tdr_str`.

```{ruby}
#| tags: []
puts penguins.tdr_str
```

(Since 0.4.0) `#glimpse` is an alias for `#tdr`.

```{ruby}
#| tags: []
mtcars.glimpse(:all, elements: 10)
```

## 10. Size and shape

```{ruby}
#| tags: []
# same as n_rows, n_obs
df.size
```

```{ruby}
#| tags: []
# same as n_cols, n_vars
df.n_keys
```

```{ruby}
#| tags: []
# [df.size, df.n_keys], [df.n_rows, df.n_cols]
df.shape
```

## 11. Keys

```{ruby}
#| tags: []
df.keys
```

```{ruby}
#| tags: []
penguins.keys
```

## 12. Types

```{ruby}
#| tags: []
df.types
```

```{ruby}
#| tags: []
penguins.types
```

## 13. Data type classes

```{ruby}
#| tags: []
df.type_classes
```

```{ruby}
#| tags: []
penguins.type_classes
```

## 14. Indices

Another example of `indices` is in [66. Custom index](#66.-Custom-index).

```{ruby}
#| tags: []
df.indexes
# or
df.indices
```

(Since 0.2.3) `#indices` returns a Vector.
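
Since the result is a Vector, Vector arithmetic can be applied to it directly. A minimal sketch:

```ruby
# Offset the indices by one (a sketch)
df.indices + 1
```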

## 15. To an Array or a Hash

`DataFrame#to_a` returns an Array of row-oriented data without a header.

```{ruby}
#| tags: []
df.to_a
```

If you need a column-oriented array with keys, use `.to_h.to_a`.

```{ruby}
#| tags: []
df.to_h
```

```{ruby}
#| tags: []
df.to_h.to_a
```

## 16. Schema

`#schema` returns key and value-type pairs as a Hash.

```{ruby}
#| tags: []
df.schema
```

## 17. Vector

Each variable (column in the table) is represented by a Vector object.

```{ruby}
#| tags: []
df[:x] # This syntax is explained later
```

Or create a new Vector with the constructor.

```{ruby}
#| tags: []
Vector.new(1, 2, 3, 4, 5)
```

```{ruby}
#| tags: []
Vector.new(1..5)
```

```{ruby}
#| tags: []
Vector.new([1, 2, 3], [4, 5])
```

```{ruby}
#| tags: []
array = Arrow::Array.new([1, 2, 3, 4, 5])
Vector.new(array)
```

(Since 0.4.2) The new constructor `Vector[*array_like]` has been introduced.

```{ruby}
#| tags: []
Vector[1, 2, 3, 4, 5]
```

## 18. Vectors

`#vectors` returns the Vectors in a DataFrame as an Array.

```{ruby}
#| tags: []
df.vectors
```

## 19. Variables

`#variables` returns key and Vector pairs as a Hash.

```{ruby}
#| tags: []
df.variables
```

## 20. Select columns by #[ ]

`DataFrame#[]` overloads column operations and row operations.

- For columns (variables)
  - Key as a Symbol: `df[:symbol]`
  - Key as a String: `df["string"]`
  - Keys in an Array: `df[:symbol1, "string", :symbol2]`
  - Keys by indices: `df[df.keys[0]]`, `df[df.keys[1,2]]`, `df[df.keys[1..]]`

```{ruby}
#| tags: []
# Keys in a Symbol and a String
df[:x, 'y']
```

```{ruby}
#| tags: []
# Keys in a Range
df[:x..:y]
```

```{ruby}
#| tags: []
# Keys with an index Range and a Symbol
df[df.keys[2..], :x]
```

## 21. Select rows by #[ ]

`DataFrame#[]` overloads column operations and row operations.

- For rows (observations)
  - Select rows by an index: `df[index]`
  - Select rows by indices: `df[indices]` # an Array, Arrow::Array or Vector is acceptable for indices
  - Select rows by Ranges: `df[range]`
  - Select rows by booleans: `df[booleans]` # an Array, Arrow::Array or Vector is acceptable for booleans

```{ruby}
#| tags: []
# indices
df[0, 2, 1]
```

```{ruby}
#| tags: []
# including a Range
# negative indices are also acceptable
df[1..2, -1]
```

```{ruby}
#| tags: []
# booleans
# length of booleans should be the same as the size of self
df[false, true, true, false, true]
```

```{ruby}
#| tags: []
# Arrow::Array
indices = Arrow::UInt8Array.new([0,2,4])
df[indices]
```

```{ruby}
#| tags: []
# By a Vector as indices
indices = Vector.new(df.indices)
# indices > 1 returns a boolean Vector
df[indices > 1]
```

```{ruby}
#| tags: []
# By a Vector as booleans
booleans = df[:b]
```

```{ruby}
#| tags: []
df[booleans]
```

## 22. empty?

```{ruby}
#| tags: []
df.empty?
```

```{ruby}
#| tags: []
DataFrame.new
```

```{ruby}
#| tags: []
DataFrame.new.empty?
```

## 23. Select columns by pick

`DataFrame#pick` accepts an Array of keys to pick up columns (variables) and creates a new DataFrame. You can change the order of the columns at the same time.

The name `pick` comes from the action of picking variables (columns) by their label keys.

```{ruby}
#| tags: []
df.pick(:s, :y)
# or
df.pick([:s, :y]) # OK too.
```

Or pass a boolean Array of length `n_keys` to `pick`. This style preserves the order of variables.

```{ruby}
#| tags: []
df.pick(false, true, true, false)
# or
df.pick([false, true, true, false])
# or
df.pick(Vector.new([false, true, true, false]))
```

`#pick` also accepts a block in the context of self.

The next example picks up numeric variables.

```{ruby}
#| tags: []
# receiver is required with the argument style
df.pick(df.vectors.map(&:numeric?))

# with a block
df.pick { vectors.map(&:numeric?) }
```

`pick` also accepts numeric indexes.

(Since 0.2.1)

```{ruby}
#| tags: []
df.pick(0, 3)
```

## 24. Reject columns by drop

`DataFrame#drop` accepts an Array of keys to drop columns (variables), creating a remainder DataFrame.

The name `drop` comes from being the pair word of `pick`.

```{ruby}
#| tags: []
df.drop(:x, :b)
# df.drop([:x, :b]) #is OK too.
```

Or pass a boolean Array of length `n_keys` to `drop`.

```{ruby}
#| tags: []
df.drop(true, false, false, true)
# df.drop([true, false, false, true]) # is OK too
```

`#drop` also accepts a block in the context of self.

The next example drops variables which have nil or NaN values.

```{ruby}
#| tags: []
df.drop { vectors.map { |v| v.is_na.any } }
```

Argument style is also acceptable, but it requires the receiver `df`.

```{ruby}
#| tags: []
df.drop(df.vectors.map { |v| v.is_na.any })
```

`drop` also accepts numeric indexes.

(Since 0.2.1)

```{ruby}
#| tags: []
df.drop(0, 3)
```

## 25. Pick/drop and nil

When `pick` or `drop` is used with booleans, nil in the booleans is treated as false. This behavior is aligned with Ruby's `BasicObject#!`.

```{ruby}
#| tags: []
booleans = [true, true, false, nil]
booleans_invert = booleans.map(&:!) # => [false, false, true, true] because nil.! is true
df.pick(booleans) == df.drop(booleans_invert)
```

## 26. Vector#invert, #primitive_invert

For the boolean Vector;

```{ruby}
#| tags: []
vector = Vector.new(booleans)
```

nil remains nil with `Vector#invert`.

```{ruby}
#| tags: []
vector.invert
# or
!vector
```

So `df.pick(booleans) != df.drop(booleans.invert)` when the booleans contain any nils.
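
A quick check with the Vector above; a sketch only, contrasting with the `primitive_invert` version below:

```ruby
# nil stays nil in #invert, so the two sides select different columns
df.pick(vector) == df.drop(vector.invert)  # => false
```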

On the other hand, `Vector#primitive_invert` follows Ruby's `BasicObject#!` behavior, so `pick` and `drop` keep their 'MECE' (mutually exclusive, collectively exhaustive) relationship.

```{ruby}
#| tags: []
vector.primitive_invert
```

```{ruby}
#| tags: []
df.pick(vector) == df.drop(vector.primitive_invert)
```

## 27. Pick/drop, #[] and #v

When `pick` or `drop` selects a single column (variable), it returns a `DataFrame` with one column (variable).

```{ruby}
#| tags: []
df.pick(:x) # or
df.drop(:y, :s, :b)
```

In contrast, when `[]` selects a single column (variable), it returns a `Vector`.

```{ruby}
#| tags: []
df[:x]
```

This behavior is useful with DataFrame manipulation verbs (like pick, drop, slice, remove, assign, rename).

```{ruby}
#| tags: []
df.pick { keys.select { |key| df[key].numeric? } }
```

The `DataFrame#v` method does the same as `#[]` for picking a Vector, but it is a little faster and easier to use in a block.

```{ruby}
#| tags: []
df.v(:x)
```

## 28. Slice

Another example of `slice` is [#70. Row index label by slice_by](#70.-Row-index-label-by-slice_by).

`slice` selects rows (records) to create a subset of a DataFrame.

`slice(indices)` accepts indices as arguments. Indices should be Integers, Floats or Ranges of Integers. Negative indices from the tail, as in Ruby's Array, are also acceptable.

```{ruby}
#| tags: []
# returns 5 rows from the start and 5 rows from the end
penguins.slice(0...5, -5..-1)
```

```{ruby}
#| tags: []
# slice accepts Float index
# 33% of 344 observations in index => 113.52 th data ??
indexed_penguins = penguins.assign_left { [:index, indexes] } # #assign_left and assigner by Array is 0.2.0 feature
indexed_penguins.slice(penguins.size * 0.33)
```

Indices in Vectors or Arrow::Arrays are also acceptable.

Another way to select in `slice` is to use booleans. An alias for this feature is `filter`.
- Booleans are an Array, Arrow::Array, Vector or an Array of them.
- Each data type must be boolean.
- The size of the booleans must be the same as the size of self.

```{ruby}
#| tags: []
# make boolean Vector to check over 40
booleans = penguins[:bill_length_mm] > 40
```

```{ruby}
#| tags: []
penguins.slice(booleans)
```

`slice` accepts a block.
- We can't use both arguments and a block at the same time.
- The block should return indices of any length, or a boolean Array with the same length as `size`.
- The block is called in the context of self, so the receiver `self` can be omitted in the block.

```{ruby}
#| tags: []
# return a DataFrame where bill_length_mm is within a 2*std range around the mean
penguins.slice do
  min = bill_length_mm.mean - bill_length_mm.std
  max = bill_length_mm.mean + bill_length_mm.std
  bill_length_mm.to_a.map { |e| (min..max).include? e }
end
```

## 29. Slice and nil option

`Arrow::Table#slice` uses the `#filter` method with the option `Arrow::FilterOptions.null_selection_behavior = :emit_null`. This propagates nil at the same row.

```{ruby}
#| tags: []
hash = { a: [1, 2, 3], b: %w[A B C], c: [1.0, 2, 3] }
table = Arrow::Table.new(hash)
table.slice([true, false, nil])
```

Whereas in RedAmber, nil in the booleans given to `DataFrame#slice` is treated as false. This behavior comes from `Arrow::FilterOptions.null_selection_behavior = :drop`, which is the default value for the `Arrow::Table#filter` method.

```{ruby}
#| tags: []
RedAmber::DataFrame.new(table).slice([true, false, nil]).table
```

## 30. Remove

Slice and reject rows (observations) to create a remainder DataFrame.

`#remove(indices)` accepts indices as arguments. Indices should be Integers or Ranges of Integers.

```{ruby}
#| tags: []
# returns 6th to 339th obs. Remainder of penguins.slice(0...5, -5..-1)
penguins.remove(0...5, -5..-1)
```

`remove(booleans)` accepts booleans as an argument in an Array, a Vector or an Arrow::BooleanArray. The booleans must be the same length as `#size`.

```{ruby}
#| tags: []
# remove all observations that contain nil
removed = penguins.remove { vectors.map(&:is_nil).reduce(&:|) }
```

`remove {block}` is also acceptable. We can't use both arguments and a block at the same time. The block should return indices, or a boolean Array with the same length as size. The block is called in the context of self.

```{ruby}
#| tags: []
# Remove data in 2*std range around mean
penguins.remove do
  vector = self[:bill_length_mm]
  min = vector.mean - vector.std
  max = vector.mean + vector.std
  vector.to_a.map { |e| (min..max).include? e }
end
```

## 31. Remove and nil

When `remove` is used with booleans, nil in the booleans is treated as false. This behavior is aligned with Ruby's `nil#!`.

```{ruby}
#| tags: []
df = RedAmber::DataFrame.new(a: [1, 2, nil], b: %w[A B C], c: [1.0, 2, 3])
```

```{ruby}
#| tags: []
booleans = df[:a] < 2
```

```{ruby}
#| tags: []
booleans_invert = booleans.to_a.map(&:!)
```

```{ruby}
#| tags: []
df.slice(booleans) == df.remove(booleans_invert)
```

Whereas `Vector#invert` returns nil for nil elements. This brings a different result. (See #26)

```{ruby}
#| tags: []
booleans.invert
```

```{ruby}
#| tags: []
df.remove(booleans.invert)
```

We have `#primitive_invert` method in Vector. This method returns the same result as `.to_a.map(&:!)` above.

```{ruby}
#| tags: []
booleans.primitive_invert
```

```{ruby}
#| tags: []
df.remove(booleans.primitive_invert)
```

```{ruby}
#| tags: []
df.slice(booleans) == df.remove(booleans.primitive_invert)
```

## 32. Remove nil

Remove any observations containing nil.

```{ruby}
#| tags: []
penguins.remove_nil
```

The roundabout way for this is to use `#remove`.

```{ruby}
#| tags: []
penguins.remove { vectors.map(&:is_nil).reduce(&:|) }
```

## 33. Rename

Rename keys (column names) to create an updated DataFrame.

`#rename(key_pairs)` accepts key_pairs as arguments. key_pairs should be a Hash of `{existing_key => new_key}` or an Array of Arrays `[[existing_key, new_key], ...]` .

```{ruby}
#| tags: []
h = { name: %w[Yasuko Rui Hinata], age: [68, 49, 28] }
comecome = RedAmber::DataFrame.new(h)
```

```{ruby}
#| tags: []
comecome.rename(age: :age_in_1993)
# comecome.rename(:age, :age_in_1993) # is also OK
# comecome.rename([:age, :age_in_1993]) # is also OK
```

`#rename {block}` is also acceptable. We can't use both arguments and a block at the same time. The block should return key_pairs as a Hash of `{existing_key => new_key}` or an Array of Arrays `[[existing_key, new_key], ...]`. The block is called in the context of self.

Symbol keys and String keys are distinguished.
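
A minimal sketch of the block style, equivalent to the argument style above:

```ruby
# The block is called in the context of self and returns key pairs (a sketch)
comecome.rename { { age: :age_in_1993 } }
```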

## 34. Assign

Other examples of `assign` are [68. Assign revised](#68.-Assign-revised) and [#69. Variations of assign](#69.-Variations-of-assign).

Assign new or updated columns (variables) and create an updated DataFrame.

- Columns with new keys will append new variables at the right (bottom in TDR).
- Columns with existing keys will update the corresponding vectors.

`#assign(key_pairs)` accepts pairs of keys and array-like values as arguments. The pairs should be a Hash of `{key => array_like}` or an Array of Arrays `[[key, array_like], ... ]`. `array_like` is a `Vector`, `Array` or `Arrow::Array`.

```{ruby}
#| tags: []
comecome = RedAmber::DataFrame.new( name: %w[Yasuko Rui Hinata], age: [68, 49, 28] )
```

```{ruby}
#| tags: []
# update :age and add :brother
assigner = { age: [97, 78, 57], brother: ['Santa', nil, 'Momotaro'] }
comecome.assign(assigner)
```

`#assign {block}` is also acceptable. We can't use both arguments and a block at the same time. The block should return pairs of keys and array-like values as a Hash of `{key => array_like}` or an Array of Arrays `[[key, array_like], ... ]`. `array_like` is a `Vector`, `Array` or `Arrow::Array`. The block is called in the context of self.

```{ruby}
#| tags: []
df = RedAmber::DataFrame.new(
  index: [0, 1, 2, 3, nil],
  float: [0.0, 1.1,  2.2, Float::NAN, nil],
  string: ['A', 'B', 'C', 'D', nil])
```

```{ruby}
#| tags: []
# update numeric variables
df.assign do
  vectors.select(&:numeric?).map { |v| [v.key, -v] }
end
```

In this example, the numeric columns :index and :float are updated. Column :index returns complements for the #negate method because :index is :uint8 type.

```{ruby}
#| tags: []
df.types
```

## 35. Coerce in Vector

Vector has a `coerce` method.

```{ruby}
#| tags: []
vector = RedAmber::Vector.new(1,2,3)
```

```{ruby}
#| tags: []
# Vector's `#*` method
vector * -1
```

```{ruby}
#| tags: []
# coerced calculation
-1 * vector
```

```{ruby}
#| tags: []
# `-@` unary minus operator
-vector
```

## 36. Vector#to_ary

`Vector#to_ary` will enable implicit conversion to an Array.

```{ruby}
#| tags: []
Array(Vector.new([3, 4, 5]))
```

```{ruby}
#| tags: []
[1, 2] + Vector.new([3, 4, 5])
```

```{ruby}
#| tags: []
[1, 2, Vector.new([3, 4, 5])].flatten
```

## 37. Vector#fill_nil

`Vector#fill_nil_forward` or `Vector#fill_nil_backward` will propagate the last valid observation forward (or backward), or preserve nil if all previous values are nil or it is at the end.

```{ruby}
#| tags: []
integer = Vector.new([0, 1, nil, 3, nil])
integer.fill_nil_forward
```

```{ruby}
#| tags: []
integer.fill_nil_backward
```

(Since 0.4.2) `Vector#fill_nil(value)` fills nils in self with `value`.

```{ruby}
#| tags: []
integer.fill_nil(-1)
```

If the value has a wider type, self will be automatically upcast. The integer Vector is cast to double in the next example.

```{ruby}
#| tags: []
integer.fill_nil(0.1)
```

## 38. Vector#all?/any?

`Vector#all?` returns true if all elements are true.

`Vector#any?` returns true if any element is true.

These are unary aggregation functions.

```{ruby}
#| tags: []
booleans = Vector.new([true, true, nil])
booleans.all?
```

```{ruby}
#| tags: []
booleans.any?
```

If these methods are used with the option `skip_nulls: false`, nils are taken into account.

```{ruby}
#| tags: []
booleans.all?(skip_nulls: false)
```

```{ruby}
#| tags: []
booleans.any?(skip_nulls: false)
```

## 39. Vector#count/count_uniq

`Vector#count` counts the elements.

`Vector#count_uniq` counts the unique elements. `#count_distinct` is an alias (Arrow's name).

These are unary aggregation functions.

```{ruby}
#| tags: []
string = Vector.new(%w[A B A])
string.count
```

```{ruby}
#| tags: []
string.count_uniq # count_distinct is also OK
```

## 40. Vector#stddev/variance

These are unary aggregation functions.

For biased standard deviation;

```{ruby}
#| tags: []
integers = Vector.new([1, 2, 3, nil])
integers.stddev
```

For unbiased standard deviation;

```{ruby}
#| tags: []
integers.sd
```

For biased variance;

```{ruby}
#| tags: []
integers.variance
```

For unbiased variance;

```{ruby}
#| tags: []
integers.var
```

## 41. Vector#negate

This is a unary element-wise function.

```{ruby}
#| tags: []
double = Vector.new([1.0, -2, 3])
double.negate
```

Same as #negate;

```{ruby}
#| tags: []
-double
```

## 42. Vector#round

Options for `#round`;

- `n_digits:` the number of digits to show.
- `mode:` specifies the rounding mode.

This is a unary element-wise function.

```{ruby}
#| tags: []
double = RedAmber::Vector.new([15.15, 2.5, 3.5, -4.5, -5.5])
```

```{ruby}
#| tags: []
double.round
```

```{ruby}
#| tags: []
double.round(mode: :half_to_even)
```

```{ruby}
#| tags: []
double.round(mode: :towards_infinity)
```

```{ruby}
#| tags: []
double.round(mode: :half_up)
```

```{ruby}
#| tags: []
double.round(mode: :half_towards_zero)
```

```{ruby}
#| tags: []
double.round(mode: :half_towards_infinity)
```

```{ruby}
#| tags: []
double.round(mode: :half_to_odd)
```

```{ruby}
#| tags: []
double.round(n_digits: 0)
```

```{ruby}
#| tags: []
double.round(n_digits: 1)
```

```{ruby}
#| tags: []
double.round(n_digits: -1)
```

## 43. Vector#and/or

RedAmber selects `and_kleene`/`or_kleene` as the default `&`/`|` methods.

These are binary element-wise functions.

```{ruby}
#| tags: []
bool_self  = Vector.new([true, true, true, false, false, false, nil, nil, nil])
bool_other = Vector.new([true, false, nil, true, false, nil, true, false, nil])

bool_self & bool_other  # same as bool_self.and_kleene(bool_other)
```

```{ruby}
#| tags: []
# Ruby's primitive `&&`
bool_self && bool_other
```

```{ruby}
#| tags: []
# Arrow's default `and`
bool_self.and_org(bool_other)
```

```{ruby}
#| tags: []
bool_self | bool_other  # same as bool_self.or_kleene(bool_other)
```

```{ruby}
#| tags: []
# Ruby's primitive `||`
bool_self || bool_other
```

```{ruby}
#| tags: []
# Arrow's default `or`
bool_self.or_org(bool_other)
```

## 44. Vector#is_finite/is_nan/is_nil/is_na

These are unary element-wise functions.

```{ruby}
#| tags: []
double = Vector.new([Math::PI, Float::INFINITY, -Float::INFINITY, Float::NAN, nil])
```

```{ruby}
#| tags: []
double.is_finite
```

```{ruby}
#| tags: []
double.is_inf
```

```{ruby}
#| tags: []
double.is_na
```

```{ruby}
#| tags: []
double.is_nil
```

```{ruby}
#| tags: []
double.is_valid
```

## 45. Prime-th rows

```{ruby}
#| tags: []
# prime-th rows ... Don't ask me what it means.
require 'prime'
penguins.assign_left(:index, penguins.indices + 1) # since 0.2.0
        .slice { Vector.new(Prime.each(size).to_a) - 1 }
```

## 46. Slice by Enumerator

`slice` accepts an Enumerator.

```{ruby}
#| tags: []
# Select every 10th sample
penguins.assign_left(index: penguins.indices) # 0.2.0 feature
        .slice(0.step(by: 10, to: 340))
```

```{ruby}
#| tags: []
# Select pairs of samples every 100 rows
penguins.assign_left(index: penguins.indices) # 0.2.0 feature
        .slice { 0.step(by: 100, to: 300).map { |i| i..(i+1) } }
```

## 47. Output mode

The output mode of `DataFrame#inspect` and `DataFrame#to_iruby` is Table mode by default. If you prefer another mode, set the environment variable `RED_AMBER_OUTPUT_MODE` .

```{ruby}
#| tags: []
ENV['RED_AMBER_OUTPUT_MODE'] = 'Table' # or nil (default)
penguins  # Almost same as `puts penguins.to_s` in any mode
```

```{ruby}
#| tags: []
penguins[:species]
```

```{ruby}
#| tags: []
ENV['RED_AMBER_OUTPUT_MODE'] = 'Plain' # Since 0.2.2
penguins
```

```{ruby}
#| tags: []
penguins[:species]
```

```{ruby}
#| tags: []
ENV['RED_AMBER_OUTPUT_MODE'] = 'Minimum'  # Since 0.2.2
penguins
```

```{ruby}
#| tags: []
penguins[:species]
```

```{ruby}
#| tags: []
ENV['RED_AMBER_OUTPUT_MODE'] = 'TDR'
penguins
```

```{ruby}
#| tags: []
penguins[:species]
```

```{ruby}
#| tags: []
ENV['RED_AMBER_OUTPUT_MODE'] = nil
```

## 48. Empty key

An empty key `:""` will be automatically renamed to `:unnamed1`.

If `:unnamed1` is already used, `:unnamed1.succ` (i.e. `:unnamed2`) will be used.

(Since 0.1.8)

```{ruby}
#| tags: []
df = DataFrame.new("": [1, 2], unnamed1: [3, 4])
```

## 49. Grouping

`DataFrame#group` takes group keys as arguments and creates a `Group` object.

The Group inspects the counts of each unique element.

(Since 0.1.7)

```{ruby}
#| tags: []
group = penguins.group(:species)
```

The instance of the `Group` class has summary function methods.

They return summarized columns named in `function(key)` style.

```{ruby}
#| tags: []
group.count
```

If the count results are the same across multiple columns, the count columns are aggregated into one column `:count`.

```{ruby}
#| tags: []
penguins.pick(:species, :bill_length_mm, :bill_depth_mm).group(:species).count
```

The grouping key comes first (leftmost) in the columns.

## 50. Grouping with a block

`DataFrame#group` takes a block in which we can specify multiple functions.

The block is evaluated in the context of the Group instance, so we can use summary functions without the receiver.

(Since 0.1.8)

```{ruby}
#| tags: []
penguins.group(:species) { [count(:species), mean(:body_mass_g)] }
```

`Group#summarize` accepts the same block as `DataFrame#group`.

```{ruby}
#| tags: []
group.summarize { [count(:species), mean] }
```

## 51. Group#count family

`Group#count` counts the number of non-nil values in each group.
If counts are the same (and do not include NaN or nil), columns for counts are unified.

```{ruby}
dataframe = DataFrame.new(
  x: [*1..6],
  y: %w[A A B B B C],
  z: [false, true, false, nil, true, false])
```

The non-nil counts in columns y and z are different.

```{ruby}
dataframe.group(:y).count
```

The non-nil counts in columns x and y are the same, so only one column is emitted.

```{ruby}
dataframe.group(:z).count
```

`Group#count_all` returns the size of each group as a DataFrame. `Group#group_count` is an alias.

```{ruby}
dataframe.group(:y).count_all
```

`Group#count_uniq` counts the unique values in each group and returns them as a DataFrame. `Group#count_distinct` is an alias.

```{ruby}
dataframe.group(:y).count_uniq
```

## 52. Group#one

`Group#one` gets one value from each group.

```{ruby}
dataframe.group(:y).one
```

## 53. Group aggregation functions

`Group#all` emits aggregated booleans of whether all elements in each group evaluate to true.

```{ruby}
dataframe.group(:y).all
```

`Group#any` emits aggregated booleans of whether any element in each group evaluates to true.

```{ruby}
dataframe.group(:y).any
```

`Group#max` computes maximum of values in each group for numeric columns.

```{ruby}
dataframe.group(:y).max
```

`Group#mean` computes mean of values in each group for numeric columns.

```{ruby}
dataframe.group(:y).mean
```

`Group#median` computes median of values in each group for numeric columns.

```{ruby}
dataframe.group(:y).median
```

`Group#min` computes minimum of values in each group for numeric columns.

```{ruby}
dataframe.group(:y).min
```

`Group#product` computes product of values in each group for numeric columns.

```{ruby}
dataframe.group(:y).product
```

`Group#stddev` computes the standard deviation of values in each group for numeric columns.

```{ruby}
dataframe.group(:y).stddev
```

`Group#sum` computes sum of values in each group for numeric columns.

```{ruby}
dataframe.group(:y).sum
```

`Group#variance` computes variance of values in each group for numeric columns.

```{ruby}
dataframe.group(:y).variance
```

## 54. Group#grouped_frame

`Group#grouped_frame` returns the grouped DataFrame with only the group keys. The alias is `#none`.

```{ruby}
dataframe.group(:y).grouped_frame
```

## 55. Vector#shift

`Vector#shift(amount = 1, fill: nil)`

Shifts the Vector's values by the specified `amount`. The vacated space is filled with the value `fill`.

(Since 0.1.8)

```{ruby}
#| tags: []
vector = RedAmber::Vector.new([1, 2, 3, 4, 5])
vector.shift
```

```{ruby}
#| tags: []
vector.shift(-2)
```

```{ruby}
#| tags: []
vector.shift(fill: Float::NAN)
```

## 56. From the Pandas cookbook - if-then

https://pandas.pydata.org/docs/user_guide/cookbook.html#if-then

```python
# by Python Pandas
df = pd.DataFrame(
    {"AAA": [4, 5, 6, 7], "BBB": [10, 20, 30, 40], "CCC": [100, 50, -30, -50]}
)
df.loc[df.AAA >= 5, "BBB"] = -1

# returns =>
   AAA  BBB  CCC
0    4   10  100
1    5   -1   50
2    6   -1  -30
3    7   -1  -50
```

```{ruby}
#| tags: []
# RedAmber
df = DataFrame.new(
  "AAA": [4, 5, 6, 7], "BBB": [10, 20, 30, 40], "CCC": [100, 50, -30, -50]  # You can omit {}
)

df.assign(BBB: df[:BBB].replace(df[:AAA] >= 5, -1))
```

If you want to replace both :BBB and :CCC;

```{ruby}
#| tags: []
df.assign do
  replacer = v(:AAA) >= 5  # Boolean Vector
  {
    BBB: v(:BBB).replace(replacer, -1),
    CCC: v(:CCC).replace(replacer, -2)
  }
end
```

## 57. From the Pandas cookbook - Splitting

Split a frame with a boolean criterion

https://pandas.pydata.org/docs/user_guide/cookbook.html#splitting

```python
# by Python Pandas
df = pd.DataFrame(
    {"AAA": [4, 5, 6, 7], "BBB": [10, 20, 30, 40], "CCC": [100, 50, -30, -50]}
)
df[df.AAA <= 5]

# returns =>
   AAA  BBB  CCC
0    4   10  100
1    5   20   50

df[df.AAA > 5]

# returns =>
   AAA  BBB  CCC
2    6   30  -30
3    7   40  -50
```

```{ruby}
#| tags: []
# RedAmber
df = DataFrame.new(
  # You can omit outer {}
  "AAA": [4, 5, 6, 7], "BBB": [10, 20, 30, 40], "CCC": [100, 50, -30, -50]
)

df.slice(df[:AAA] <= 5)
# df[df[:AAA] <= 5] # is also OK
```

```{ruby}
#| tags: []
df.remove(df[:AAA] <= 5)
# df.slice(df[:AAA] > 5) # do the same thing
```

## 58. From the Pandas cookbook - Building criteria

Select rows by building boolean criteria

https://pandas.pydata.org/docs/user_guide/cookbook.html#building-criteria

```python
# by Python Pandas
df = pd.DataFrame(
    {"AAA": [4, 5, 6, 7], "BBB": [10, 20, 30, 40], "CCC": [100, 50, -30, -50]}
)

# and
df.loc[(df["BBB"] < 25) & (df["CCC"] >= -40), "AAA"]

# returns a series =>
0    4
1    5
Name: AAA, dtype: int64

# or
df.loc[(df["BBB"] > 25) | (df["CCC"] >= -40), "AAA"]

# returns a series =>
0    4
1    5
2    6
3    7
Name: AAA, dtype: int64
```

```{ruby}
#| tags: []
# RedAmber
df = DataFrame.new(
  # You can omit {}
  "AAA": [4, 5, 6, 7], "BBB": [10, 20, 30, 40], "CCC": [100, 50, -30, -50]
)

df.slice( (df[:BBB] < 25) & (df[:CCC] >= -40) ).pick(:AAA)
```

```{ruby}
#| tags: []
df.slice( (df[:BBB] > 25) | (df[:CCC] >= -40) ).pick(:AAA)
# df[ (df[:BBB] > 25) | (df[:CCC] >= -40) ][:AAA] # also OK
```

```python
# by Python Pandas
# or (with assignment)
df.loc[(df["BBB"] > 25) | (df["CCC"] >= 75), "AAA"] = 0.1
df

# returns a dataframe =>
   AAA  BBB  CCC
0  0.1   10  100
1  5.0   20   50
2  0.1   30  -30
3  0.1   40  -50
```

```{ruby}
#| tags: []
# df.assign(AAA: df[:AAA].replace((df[:BBB] > 25) | (df[:CCC] >= 75), 0.1)) # by one liner

booleans = (df[:BBB] > 25) | (df[:CCC] >= 75)
replaced = df[:AAA].replace(booleans, 0.1)
df.assign(AAA: replaced)
```

```python
# by Python Pandas
# Select rows with data closest to certain value using argsort
df = pd.DataFrame(
    {"AAA": [4, 5, 6, 7], "BBB": [10, 20, 30, 40], "CCC": [100, 50, -30, -50]}
)
aValue = 43.0
df.loc[(df.CCC - aValue).abs().argsort()]

# returns a dataframe =>
   AAA  BBB  CCC
1    5   20   50
0    4   10  100
2    6   30  -30
3    7   40  -50
```

```{ruby}
#| tags: []
a_value = 43
df[(df[:CCC] - a_value).abs.sort_indexes]
# df.slice (df[:CCC] - a_value).abs.sort_indexes # also OK
```

```python
# by Python Pandas
# Dynamically reduce a list of criteria using a binary operators
df = pd.DataFrame(
    {"AAA": [4, 5, 6, 7], "BBB": [10, 20, 30, 40], "CCC": [100, 50, -30, -50]}
)
Crit1 = df.AAA <= 5.5
Crit2 = df.BBB == 10.0
Crit3 = df.CCC > -40.0
AllCrit = Crit1 & Crit2 & Crit3

import functools

CritList = [Crit1, Crit2, Crit3]
AllCrit = functools.reduce(lambda x, y: x & y, CritList)
df[AllCrit]

# returns a dataframe =>
   AAA  BBB  CCC
0    4   10  100
```

```{ruby}
#| tags: []
crit1 = df[:AAA] <= 5.5
crit2 = df[:BBB] == 10.0
crit3 = df[:CCC] >= -40.0
df[crit1 & crit2 & crit3]
```

## 59. From the Pandas cookbook - Dataframes

https://pandas.pydata.org/docs/user_guide/cookbook.html#dataframes

```python
# by Python Pandas
# Using both row labels and value conditionals
df = pd.DataFrame(
    {"AAA": [4, 5, 6, 7], "BBB": [10, 20, 30, 40], "CCC": [100, 50, -30, -50]}
)
df[(df.AAA <= 6) & (df.index.isin([0, 2, 4]))]

# returns =>
   AAA  BBB  CCC
0    4   10  100
2    6   30  -30
```

```{ruby}
#| tags: []
# RedAmber
df = DataFrame.new(
  "AAA": [4, 5, 6, 7], "BBB": [10, 20, 30, 40], "CCC": [100, 50, -30, -50]
)

df[(df[:AAA] <= 6) & df.indices.map { |i| [0, 2, 4].include? i }]
```

```python
# by Python Pandas
# Use loc for label-oriented slicing and iloc positional slicing GH2904
df = pd.DataFrame(
    {"AAA": [4, 5, 6, 7], "BBB": [10, 20, 30, 40], "CCC": [100, 50, -30, -50]},
    index=["foo", "bar", "boo", "kar"],
)

# There are 2 explicit slicing methods, with a third general case
# 1. Positional-oriented (Python slicing style : exclusive of end)
# 2. Label-oriented (Non-Python slicing style : inclusive of end)
# 3. General (Either slicing style : depends on if the slice contains labels or positions)

df.loc["bar":"kar"]  # Label
# returns =>
     AAA  BBB  CCC
bar    5   20   50
boo    6   30  -30
kar    7   40  -50

# Generic
df[0:3]
# returns =>
     AAA  BBB  CCC
foo    4   10  100
bar    5   20   50
boo    6   30  -30

df["bar":"kar"]
# returns =>
     AAA  BBB  CCC
bar    5   20   50
boo    6   30  -30
kar    7   40  -50
```

```{ruby}
#| tags: []
# RedAmber does not have a row index. Use a new column as the index.
labeled = df.assign_left(index: %w[foo bar boo kar])
# labeled = df.assign(index: %w[foo bar boo kar]).pick { [keys[-1], keys[0...-1]] } # until v0.1.8
```

```{ruby}
#| tags: []
labeled[1..3]
```

```{ruby}
#| tags: []
labeled.slice do
  v = v(:index)
  v.index("bar")..v.index("kar")
end
```

`slice_by` returns the same result as above.

(Since 0.2.1)

```{ruby}
#| tags: []
labeled.slice_by(:index, keep_key: true) { "bar".."kar"}
```

```python
# by Python Pandas
# Ambiguity arises when an index consists of integers with a non-zero start or non-unit increment.
df2 = pd.DataFrame(data=data, index=[1, 2, 3, 4])  # Note index starts at 1.

df2.iloc[1:3]  # Position-oriented
# returns =>
   AAA  BBB  CCC
2    5   20   50
3    6   30  -30

df2.loc[1:3]  # Label-oriented
# returns =>
   AAA  BBB  CCC
1    4   10  100
2    5   20   50
3    6   30  -30
```

```{ruby}
#| tags: []
# RedAmber has only an implicit integer index (0...size), so no such ambiguity
# arises unless you create a new column and use it as an index :-).
```

```python
# by Python Pandas
# Using inverse operator (~) to take the complement of a mask
df[~((df.AAA <= 6) & (df.index.isin([0, 2, 4])))]

# returns =>
   AAA  BBB  CCC
1    5   20   50
3    7   40  -50
```

```{ruby}
#| tags: []
# RedAmber offers the #! method for boolean Vectors.
df[!((df[:AAA] <= 6) & df.indices.map { |i| [0, 2, 4].include? i })]

# or
# df[((df[:AAA] <= 6) & df.indices.map { |i| [0, 2, 4].include? i }).invert]
```

If you have `nil` in your data, consider `#primitive_invert` for consistent results. See example #26.

## 60. From the Pandas cookbook - New columns

https://pandas.pydata.org/docs/user_guide/cookbook.html#new-columns

```python
# by Python Pandas
# Efficiently and dynamically creating new columns using applymap
df = pd.DataFrame({"AAA": [1, 2, 1, 3], "BBB": [1, 1, 2, 2], "CCC": [2, 1, 3, 1]})
df

# returns =>
   AAA  BBB  CCC
0    1    1    2
1    2    1    1
2    1    2    3
3    3    2    1

source_cols = df.columns  # Or some subset would work too
new_cols = [str(x) + "_cat" for x in source_cols]
categories = {1: "Alpha", 2: "Beta", 3: "Charlie"}
df[new_cols] = df[source_cols].applymap(categories.get)
df

# returns =>
   AAA  BBB  CCC  AAA_cat BBB_cat  CCC_cat
0    1    1    2    Alpha   Alpha     Beta
1    2    1    1     Beta   Alpha    Alpha
2    1    2    3    Alpha    Beta  Charlie
3    3    2    1  Charlie    Beta    Alpha
```

```{ruby}
#| tags: []
# RedAmber
df = DataFrame.new({"AAA": [1, 2, 1, 3], "BBB": [1, 1, 2, 2], "CCC": [2, 1, 3, 1]})
```

```{ruby}
#| tags: []
categories = {1 => "Alpha", 2 => "Beta", 3 => "Charlie"}

# Creating a Hash from keys
df.assign do
  keys.each_with_object({}) do |key, h|
    h["#{key}_cat"] = v(key).to_a.map { |x| categories[x] }
  end
end

# Creating an Array from vectors, from v0.2.0
df.assign do
  vectors.map do |v|
    ["#{v.key}_cat", v.to_a.map { |x| categories[x] } ]
  end
end
```

```python
# by Python Pandas
# Keep other columns when using min() with groupby
df = pd.DataFrame(
    {"AAA": [1, 1, 1, 2, 2, 2, 3, 3], "BBB": [2, 1, 3, 4, 5, 1, 2, 3]}
)
df

# returns =>
   AAA  BBB
0    1    2
1    1    1
2    1    3
3    2    4
4    2    5
5    2    1
6    3    2
7    3    3

# Method 1 : idxmin() to get the index of the minimums
df.loc[df.groupby("AAA")["BBB"].idxmin()]

# returns =>
   AAA  BBB
1    1    1
5    2    1
6    3    2

# Method 2 : sort then take first of each
df.sort_values(by="BBB").groupby("AAA", as_index=False).first()

# returns =>
   AAA  BBB
0    1    1
1    2    1
2    3    2

# Notice the same results, with the exception of the index.
```

```{ruby}
#| tags: []
# RedAmber
df = DataFrame.new(AAA: [1, 1, 1, 2, 2, 2, 3, 3], BBB: [2, 1, 3, 4, 5, 1, 2, 3])
```

```{ruby}
#| tags: []
df.group(:AAA).min

# Add `.rename { [keys[-1], :BBB] }` if you want.
```

## 61. Summary/describe

```{ruby}
#| tags: []
penguins.summary
# or
penguins.describe
```

If you need the variables in rows, use `transpose`. (Since 0.2.0)

```{ruby}
#| tags: []
penguins.summary.transpose(name: :stats)
```

## 62. Quantile/Quantiles

`Vector#quantile(prob)` returns the quantile at probability `prob`.

(Since 0.2.0)

```{ruby}
#| tags: []
penguins[:bill_depth_mm].quantile # default is prob = 0.5
```

`Vector#quantiles` accepts an Array for multiple quantiles. Returns a DataFrame.

```{ruby}
#| tags: []
penguins[:bill_depth_mm].quantiles([0.05, 0.95])
```

## 63. Transpose

`DataFrame#transpose` creates a transposed DataFrame from a wide DataFrame.

(Since 0.2.0)

```{ruby}
#| tags: []
uri = URI("https://raw.githubusercontent.com/heronshoes/red_amber/master/test/entity/import_cars.tsv")
import_cars = RedAmber::DataFrame.load(uri)
```

```{ruby}
#| tags: []
import_cars.transpose
```

The default name of the created column is `:NAME`.

We can name the column after the keys in the original with the option `name:`.

```{ruby}
#| tags: []
import_cars.transpose(key: :Year, name: :Manufacturer)
```

You can specify the index column with the option `key:` even if it is in the middle of the original DataFrame.

```{ruby}
#| tags: []
# locate `:Year` in the middle
df = import_cars.pick(1..2, 0, 3..)
```

```{ruby}
#| tags: []
df.transpose(key: :Year)
```

## 64. To_long

`DataFrame#to_long(*keep_keys)` reshapes a wide DataFrame into a long DataFrame.

- The parameter `keep_keys` specifies the key names to keep.

(Since 0.2.0)

```{ruby}
#| tags: []
uri = URI("https://raw.githubusercontent.com/heronshoes/red_amber/master/test/entity/import_cars.tsv")
import_cars = RedAmber::DataFrame.load(uri)
```

```{ruby}
#| tags: []
import_cars.to_long(:Year)
```

- Option `:name` specifies the key of the column which comes **from the key names**. Default is `:NAME`.
- Option `:value` specifies the key of the column which comes **from the values**. Default is `:VALUE`.

```{ruby}
#| tags: []
import_cars.to_long(:Year, name: :Manufacturer, value: :Num_of_imported)
```

## 65. To_wide

`DataFrame#to_wide(*keep_keys)` reshapes a long DataFrame into a wide DataFrame.

- Option `:name` specifies the key of the column which will be expanded **to key names**. Default is `:NAME`.
- Option `:value` specifies the key of the column which will be expanded **to values**. Default is `:VALUE`.

(Since 0.2.0)

```{ruby}
#| tags: []
import_cars.to_long(:Year).to_wide
```

```{ruby}
#| tags: []
import_cars.to_long(:Year).to_wide(name: :NAME, value: :VALUE)
# is also OK
```

## 66. Custom index

Another example of `indices` is [14. Indices](#14.-Indices).

We can set the start of the indices by giving an argument.

(Since 0.2.1)

```{ruby}
#| tags: []
df = DataFrame.new(x: [0, 1, 2, 3, 4])
df.indices
```

```{ruby}
#| tags: []
df.indices(1)
```

You can give any first value that responds to the `#succ` method.

```{ruby}
#| tags: []
df.indices("a")
```

## 67. Method missing

`RedAmber::DataFrame` has `#method_missing` to enable calling key names as methods.

This feature is limited to keys that can be called as a method (`:key` is OK; keys like `:Key`, `:"key.1"` and `:"1key"` are not allowed). But it is convenient in many cases.

(Since 0.2.1)

```{ruby}
#| tags: []
df = DataFrame.new(x: [1, 2, 3])
df.x.sum
```

```{ruby}
#| tags: []
# Some ways to pull a Vector
df[:x] # Formal style

df.v(:x) # #v method

df.x # method
```

```{ruby}
#| tags: []
df.x.sum
```

## 68. Assign revised

Other examples of `assign` are [#34. Assign](#34.-Assign) and [#69. Variations of assign](#69.-Variations-of-assign).

```{ruby}
#| tags: []
df = DataFrame.new(x: [1, 2, 3])

# Assign by a Hash
df.assign(y: df.x / 10.0)
```

```{ruby}
#| tags: []
# Assign by separated key and value
df.assign(:y) { x / 10.0 }
```

```{ruby}
#| tags: []
# Separated keys and values
df.assign(:y, :z) { [x * 10, x / 10.0] }
```

## 69. Variations of assign

Other examples of `assign` are [#34. Assign](#34.-Assign) and [#68. Assign revised](#68.-Assign-revised).

```{ruby}
#| tags: []
df = DataFrame.new(x: [1, 2, 3])
```

```{ruby}
#| tags: []
# Hash args
df.assign(y: df[:x] * 10, z: df[:x] / 10.0)

# Hash
hash = {y: df[:x] * 10, z: df[:x] / 10.0}
df.assign(hash)

# Array
array = [[:y, df[:x] * 10], [:z, df[:x] / 10.0]]
df.assign(array)

# Array
df.assign [
  [:y, df[:x] * 10],
  [:z, df[:x] / 10.0]
]

# Hash
df.assign({
  y: df[:x] * 10,
  z: df[:x] / 10.0
})

# Block, Hash
df.assign { {y: df[:x] * 10, z: df[:x] / 10.0} }

# Block, Array
df.assign { [[:y, df[:x] * 10], [:z, df[:x] / 10.0]] }

# Block, Array, method
#df.assign { [[:y, x * 10], [:z, x / 10.0]] }

# Separated
#df.assign(:y, :z) { [x * 10, x / 10.0] }
```

## 70. Row index label by slice_by

Another example of `slice` is [#28. Slice](#28.-Slice).

(Since 0.2.1)

```{ruby}
#| tags: []
df = DataFrame.new(num: [1.1, 2.2, 3.3, 4.4, 5.5])
              .assign_left(:label) { indices("a") }
```

`slice_by(key) { row_selector }` selects rows in column `key` with `row_selector`.

```{ruby}
#| tags: []
df.slice_by(:label) { "b".."d" }
```

```{ruby}
#| tags: []
df.slice_by(:label) { ["c", "b", "e"] }
```

If the option `keep_key:` is set to `true`, the index label column is preserved.

```{ruby}
#| tags: []
df.slice_by(:label, keep_key: true) { "b".."d" }
```

## 71. Simpson's paradox in COVID-19 data

https://www.rdocumentation.org/packages/openintro/versions/2.3.0/topics/simpsons_paradox_covid

```{ruby}
#| tags: []
require 'datasets-arrow'

ds = Datasets::Rdatasets.new('openintro', 'simpsons_paradox_covid')
df = RedAmber::DataFrame.new(ds.to_arrow)
```

Group by vaccine status and outcome, then count.

```{ruby}
#| tags: []
count = df.group(:vaccine_status, :outcome).count
```

Reshape into a human-readable wide table.

```{ruby}
#| tags: []
all_count = count.to_wide(name: :vaccine_status, value: :count)
```

Compute the death/survived ratio for each vaccine status.

```{ruby}
#| tags: []
all_count.assign do
  {
    "vaccinated_%": 100.0 * vaccinated / vaccinated.sum,
    "unvaccinated_%": 100.0 * unvaccinated / unvaccinated.sum
  }
end
```

The death ratio for the vaccinated is higher than for the unvaccinated. Is it true?

Next, do the same thing for each age group. Temporarily define a helper method.

```{ruby}
#| tags: []
def make_covid_table(df)
  df.group(:vaccine_status, :outcome)
    .count
    .to_wide(name: :vaccine_status, value: :count)
    .assign do
      {
        "vaccinated_%": (100.0 * vaccinated / vaccinated.sum).round(n_digits: 3),
        "unvaccinated_%": (100.0 * unvaccinated / unvaccinated.sum).round(n_digits: 3)
      }
     end
end
```

```{ruby}
#| tags: []
# under 50
make_covid_table(df[df[:age_group] == "under 50"])
```

```{ruby}
#| tags: []
# 50 +
make_covid_table(df[df[:age_group] == "50 +"])
```

The death ratio for the vaccinated is lower than for the unvaccinated in each age-grouped subset. This is an example of "Simpson's paradox" .

```{ruby}
#| tags: []
# Vaccine status vs age
# 50+ is highly vaccinated.
df.group(:vaccine_status, :age_group).count.to_wide(name: :age_group, value: :count)
```

```{ruby}
#| tags: []
# Outcome vs age
# 50+ also has higher death rate.
df.group(:outcome, :age_group).count.to_wide(name: :age_group, value: :count)
```

## 72. Clean up dirty data

```{ruby}
#| tags: []
file = Tempfile.open(['dirty_data', '.csv']) do |f|
  f.puts(<<~CSV)
    height,weight
    154.9,52.2
    156.8cm,51.1kg
    152,49
    148.5cm,45.4kg
    155cm,
    ,49.9kg
    1.58m,49.8kg
    166.8cm,53.6kg
  CSV
  f
end

df = DataFrame.load(file)
```

It was loaded as String Vectors.

```{ruby}
#| tags: []
df.schema
```

First, handle the `:weight` column. Replacing "" with NaN casts the column to Float.

```{ruby}
#| tags: []
df.assign do
  {
    weight: weight.replace(weight == "", Float::NAN)
  }
end
```

Apply the same conversion to `:height`, followed by unit conversion with `if_else`.

```{ruby}
#| tags: []
df = df.assign do
  {
    weight: weight.replace(weight == '', Float::NAN),
    height: height.replace(height == '', Float::NAN)
                  .then { |h| (h < 10).if_else(h * 100, h) }
  }
end
puts df.schema
df
```

Now that we have clean data, compute BMI as a new column.

```{ruby}
#| tags: []
df.assign(:BMI) { (weight / height ** 2 * 10000).round(n_digits: 1) }
```

## 73. From the Pandas cookbook - Multiindexing

(Updated on v0.3.0)
https://pandas.pydata.org/docs/user_guide/cookbook.html#multiindexing

```python
# by Python Pandas
# Efficiently and dynamically creating new columns using applymap

df = pd.DataFrame(
    {
        "row": [0, 1, 2],
        "One_X": [1.1, 1.1, 1.1],
        "One_Y": [1.2, 1.2, 1.2],
        "Two_X": [1.11, 1.11, 1.11],
        "Two_Y": [1.22, 1.22, 1.22],
    }
)
df

# =>
   row  One_X  One_Y  Two_X  Two_Y
0    0    1.1    1.2   1.11   1.22
1    1    1.1    1.2   1.11   1.22
2    2    1.1    1.2   1.11   1.22

# As Labelled Index
df = df.set_index("row")
df

# =>
     One_X  One_Y  Two_X  Two_Y
row
0      1.1    1.2   1.11   1.22
1      1.1    1.2   1.11   1.22
2      1.1    1.2   1.11   1.22

# With Hierarchical Columns
df.columns = pd.MultiIndex.from_tuples([tuple(c.split("_")) for c in df.columns])
df

# =>
     One        Two
       X    Y     X     Y
row
0    1.1  1.2  1.11  1.22
1    1.1  1.2  1.11  1.22
2    1.1  1.2  1.11  1.22

# Now stack & Reset
df = df.stack(0).reset_index(1)
df

# =>
    level_1     X     Y
row
0       One  1.10  1.20
0       Two  1.11  1.22
1       One  1.10  1.20
1       Two  1.11  1.22
2       One  1.10  1.20
2       Two  1.11  1.22

# And fix the labels (Notice the label 'level_1' got added automatically)
df.columns = ["Sample", "All_X", "All_Y"]
df

# =>
    Sample  All_X  All_Y
row
0      One   1.10   1.20
0      Two   1.11   1.22
1      One   1.10   1.20
1      Two   1.11   1.22
2      One   1.10   1.20
2      Two   1.11   1.22
```

(Until 0.2.3)
This is an example from before `Vector#split_*` was introduced. See [88. Vector#split_to_columns](#88.-Vector#split_to_columns) .

```{ruby}
#| tags: []
df = RedAmber::DataFrame.new(
        "row": [0, 1, 2],
        "One_X": [1.1, 1.1, 1.1],
        "One_Y": [1.2, 1.2, 1.2],
        "Two_X": [1.11, 1.11, 1.11],
        "Two_Y": [1.22, 1.22, 1.22],
)
```

```{ruby}
#| tags: []
df_x = df.pick(:row, :One_X, :Two_X)
  .to_long(:row, name: :Sample, value: :All_X)
```

```{ruby}
#| tags: []
df_y = df.pick(:row, :One_Y, :Two_Y)
  .to_long(:row, name: :Sample, value: :All_Y)
```

```{ruby}
#| tags: []
df_x.pick(:row)
 .assign [
   [:Sample, df_x[:Sample].each.map { |x| x.split("_").first }],
   [:All_X, df_x[:All_X]],
   [:All_Y, df_y[:All_Y]]
 ]
```

(Since 0.3.0)
This example uses `Vector#split_to_columns`.

```{ruby}
#| tags: []
df = RedAmber::DataFrame.new(
        "row": [0, 1, 2],
        "One_X": [1.1, 1.1, 1.1],
        "One_Y": [1.2, 1.2, 1.2],
        "Two_X": [1.11, 1.11, 1.11],
        "Two_Y": [1.22, 1.22, 1.22],
)
```

```{ruby}
#| tags: []
df.to_long(:row)
```

`Vector#split_to_columns` returns two split Vectors.
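
For instance, applied directly to a Vector of underscore-joined labels; a minimal sketch:

```ruby
# Split each element at '_' into two Vectors (a sketch)
Vector.new(%w[One_X One_Y Two_X]).split_to_columns('_')
```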

```{ruby}
#| tags: []
df.to_long(:row, name: :Sample)
  .assign(:Sample, :xy) { v(:Sample).split_to_columns('_') }
```

```{ruby}
#| tags: []
df.to_long(:row, name: :Sample)
  .assign(:Sample, :xy) { v(:Sample).split_to_columns('_') }
  .to_wide(name: :xy, value: :VALUE)
```

## 74. From the Pandas cookbook - Arithmetic

https://pandas.pydata.org/docs/user_guide/cookbook.html#arithmetic

```python
# by Python Pandas
cols = pd.MultiIndex.from_tuples(
    [(x, y) for x in ["A", "B", "C"] for y in ["O", "I"]]
)

df = pd.DataFrame(np.random.randn(2, 6), index=["n", "m"], columns=cols)
df

# =>
          A                   B                   C
          O         I         O         I         O         I
n  0.469112 -0.282863 -1.509059 -1.135632  1.212112 -0.173215
m  0.119209 -1.044236 -0.861849 -2.104569 -0.494929  1.071804

df = df.div(df["C"], level=1)
df

# =>
          A                   B              C
          O         I         O         I    O    I
n  0.387021  1.633022 -1.244983  6.556214  1.0  1.0
m -0.240860 -0.974279  1.741358 -1.963577  1.0  1.0
```

This is a tentative example. It may be refined by a coming feature which treats multiple-key headers more easily.

```{ruby}
#| tags: []
require "arrow-numo-narray"

values = Numo::DFloat.new(6, 2).rand_norm
```

For consistency with the pandas result, we use the same data.

```{ruby}
#| tags: []
values = [
  [0.469112, -0.282863, -1.509059, -1.135632,  1.212112, -0.173215],
  [0.119209, -1.044236, -0.861849, -2.104569, -0.494929,  1.071804]
].transpose
```

```{ruby}
#| tags: []
keys = %w[A B C].product(%w[O I]).map(&:join)
```

```{ruby}
#| tags: []
df = RedAmber::DataFrame.new(index: %w[n m])
                        .assign(*keys) { values }
```

```{ruby}
#| tags: []
df.assign do
  assigner = {}
  %w[A B C].each do |abc|
    %w[O I].each do |oi|
      key = "#{abc}#{oi}".to_sym
      assigner[key] = v(key) / v("C#{oi}".to_sym)
    end
  end
  assigner
end
```

```{ruby}
#| tags: []
coords = [["AA", "one"], ["AA", "six"], ["BB", "one"], ["BB", "two"], ["BB", "six"]].transpose
df = RedAmber::DataFrame.new(MyData: [11, 22, 33, 44, 55])
                        .assign_left(:label1, :label2) { coords }
```

## 75. From the Pandas cookbook - Slicing

https://pandas.pydata.org/docs/user_guide/cookbook.html#slicing

```python
# by Python Pandas
coords = [("AA", "one"), ("AA", "six"), ("BB", "one"), ("BB", "two"), ("BB", "six")]
index = pd.MultiIndex.from_tuples(coords)
df = pd.DataFrame([11, 22, 33, 44, 55], index, ["MyData"])
df

# =>
        MyData
AA one      11
   six      22
BB one      33
   two      44
   six      55
```

To take the cross section of the 1st level of the 1st axis of the index:

```python
# by Python Pandas
# Note : level and axis are optional, and default to zero
df.xs("BB", level=0, axis=0)

# =>
     MyData
one      33
two      44
six      55
```

```{ruby}
#| tags: []
df.slice { label1 == "BB" }.drop(:label1)
```

…and now the 2nd level of the 1st axis.

```python
# by Python Pandas
df.xs("six", level=1, axis=0)

# =>
    MyData
AA      22
BB      55
```

```{ruby}
#| tags: []
df.slice { label2 == "six" }.drop(:label2)
```

```python
# by Python Pandas
import itertools

index = list(itertools.product(["Ada", "Quinn", "Violet"], ["Comp", "Math", "Sci"]))
headr = list(itertools.product(["Exams", "Labs"], ["I", "II"]))
indx = pd.MultiIndex.from_tuples(index, names=["Student", "Course"])
cols = pd.MultiIndex.from_tuples(headr)  # Notice these are un-named
data = [[70 + x + y + (x * y) % 3 for x in range(4)] for y in range(9)]
df = pd.DataFrame(data, indx, cols)
df

# =>
               Exams     Labs
                   I  II    I  II
Student Course
Ada     Comp      70  71   72  73
        Math      71  73   75  74
        Sci       72  75   75  75
Quinn   Comp      73  74   75  76
        Math      74  76   78  77
        Sci       75  78   78  78
Violet  Comp      76  77   78  79
        Math      77  79   81  80
        Sci       78  81   81  81
```

```{ruby}
#| tags: []
indexes = %w[Ada Quinn Violet].product(%w[Comp Math Sci]).transpose
df = RedAmber::DataFrame.new(%w[Student Course].zip(indexes))
                        .assign do
                          assigner = {}
                          keys = %w[Exams Labs].product(%w[I II]).map { |a| a.join("/") }
                          keys.each.with_index do |key, x|
                            assigner[key] = (0...9).map { |y| 70 + x + y + (x * y) % 3 }
                          end
                          assigner
                        end
```

```python
# by Python Pandas
All = slice(None)

df.loc["Violet"]

# =>
       Exams     Labs
           I  II    I  II
Course
Comp      76  77   78  79
Math      77  79   81  80
Sci       78  81   81  81
```

```{ruby}
#| tags: []
df.slice(df[:Student] == "Violet").drop(:Student)
```

```python
# by Python Pandas
df.loc[(All, "Math"), All]

# =>
               Exams     Labs
                   I  II    I  II
Student Course
Ada     Math      71  73   75  74
Quinn   Math      74  76   78  77
Violet  Math      77  79   81  80
```

```{ruby}
#| tags: []
df.slice(df[:Course] == "Math")
```

```python
# by Python Pandas
df.loc[(slice("Ada", "Quinn"), "Math"), All]

# =>
               Exams     Labs
                   I  II    I  II
Student Course
Ada     Math      71  73   75  74
Quinn   Math      74  76   78  77
```

```{ruby}
#| tags: []
df.slice(df[:Course] == "Math")
  .slice { (v(:Student) == "Ada") | (v(:Student) == "Quinn") }
```

```python
# by Python Pandas
df.loc[(All, "Math"), ("Exams")]

# =>
                 I  II
Student Course
Ada     Math    71  73
Quinn   Math    74  76
Violet  Math    77  79
```

```{ruby}
#| tags: []
df.slice(df[:Course] == "Math")
  .pick {
    [:Student, :Course].concat keys.select { |key| key.to_s.start_with?("Exams") }
  }
```

```python
# by Python Pandas
df.loc[(All, "Math"), (All, "II")]

# =>
               Exams Labs
                  II   II
Student Course
Ada     Math      73   74
Quinn   Math      76   77
Violet  Math      79   80
```

```{ruby}
#| tags: []
df.slice(df[:Course] == "Math")
  .pick {
    [:Student, :Course].concat keys.select { |key| key.to_s.end_with?("II") }
  }
```

## 76. Vector#map

The `Vector#map` method accepts a block and returns the yielded results from the block as a Vector.

```{ruby}
#| tags: []
v = Vector.new(1, 2, 3, 4)
v.map { |x| x / 100.0 }
```

If no block is given, it returns an Enumerator.

```{ruby}
#| tags: []
v.map
```

If you need Ruby's `map` on a Vector, try `.each.map` .

```{ruby}
#| tags: []
v.each.map { |x| x / 100.0 }
```

The alias for `#map` is `#collect`.

A similar method is `Vector#filter`/`#select`.
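As a rough sketch (an assumption: `#select`/`#filter` yield each element to the block, as `#map` does, and keep the elements for which the block returns true):

```ruby
v = Vector.new(1, 2, 3, 4)
v.select { |x| x > 2 }  # expected to keep only the elements greater than 2
```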

## 77. Introduce columns from numo/narray

(Until 0.2.2 with Arrow 9.0.0) We couldn't construct a DataFrame directly from Numo::NArray, but the following trick makes it possible.

```{ruby}
#| tags: []
DataFrame.new(index: Array(1..10))
  .assign do
    {
      x0: Numo::DFloat.new(size).rand_norm(0, 2),
      x1: Numo::DFloat.new(size).rand_norm(5, 2),
      x2: Numo::DFloat.new(size).rand_norm(10, 2),
      y0: Numo::DFloat.new(size).rand_norm(100, 10),
      y1: Numo::DFloat.new(size).rand_norm(200, 10),
      y2: Numo::DFloat.new(size).rand_norm(300, 10)
    }
  end
```

If you do not need the index column, try this.

```{ruby}
#| tags: []
DataFrame.new(_: Array(1..10))
  .assign do
    {
      x0: Numo::DFloat.new(size).rand_norm(0, 2),
      x1: Numo::DFloat.new(size).rand_norm(5, 2),
      x2: Numo::DFloat.new(size).rand_norm(10, 2),
      y0: Numo::DFloat.new(size).rand_norm(100, 10),
      y1: Numo::DFloat.new(size).rand_norm(200, 10),
      y2: Numo::DFloat.new(size).rand_norm(300, 10)
    }
  end
  .drop(:_)
```

(New from 0.2.3 with Arrow 10.0.0) It is possible to initialize by objects that respond to `to_arrow` since 0.2.3. Numo::NArray arrays respond to `to_arrow` with the `red-arrow-numo-narray` gem. This feature was proposed by the Red Data Tools member @kojix2 and implemented by @kou in Arrow 10.0.0 and Red Arrow Numo::NArray 0.0.6. Thanks!

```{ruby}
#| tags: []
require 'arrow-numo-narray'

size = 10
DataFrame.new(
  x0: Numo::DFloat.new(size).rand_norm(0, 2),
  x1: Numo::DFloat.new(size).rand_norm(5, 2),
  x2: Numo::DFloat.new(size).rand_norm(10, 2),
  y0: Numo::DFloat.new(size).rand_norm(100, 10),
  y1: Numo::DFloat.new(size).rand_norm(200, 10),
  y2: Numo::DFloat.new(size).rand_norm(300, 10)
)
```

## 78. Join (mutating joins)

(Since 0.2.3)

```{ruby}
#| tags: []
df = DataFrame.new(
  KEY: %w[A B C],
  X1: [1, 2, 3]
)
```

```{ruby}
#| tags: []
other = DataFrame.new(
  KEY: %w[A B D],
  X2: [true, false, nil]
)
```

Inner join will join data leaving only the matching records.

```{ruby}
#| tags: []
df.inner_join(other, :KEY)
```

If we omit join keys, common keys are automatically chosen (natural key).

```{ruby}
#| tags: []
df.inner_join(other)
```

Full join will join data leaving all records.

```{ruby}
#| tags: []
df.full_join(other)
```

Left join will join matching values to self from other (type: left_outer).

```{ruby}
#| tags: []
df.left_join(other)
```

Right join will join matching values from self to other (type: right_outer).

```{ruby}
#| tags: []
df.right_join(other)
```

## 79. Join (filtering joins)

(Since 0.2.3)

Semi join will return records of self that have a match in other.

```{ruby}
#| tags: []
df.semi_join(other)
```

Anti join will return records of self that do not have a match in other.

```{ruby}
#| tags: []
df.anti_join(other)
```

## 80. Partial joins

(Since 0.2.3)

```{ruby}
#| tags: []
df2 = DataFrame.new(
  KEY1: %w[A B C],
  KEY2: %w[s t u],
  X: [1, 2, 3]
)
```

```{ruby}
#| tags: []
other2 = DataFrame.new(
  KEY1: %w[A B D],
  KEY2: %w[s u v],
  Y: [3, 2, 1]
)
```

```{ruby}
#| tags: []
# natural join
df2.inner_join(other2)
# Same as df2.inner_join(other2, [:KEY1, :KEY2])
```

Partial join uses only part of the common keys as join keys.

Common keys of other that are not used as join keys will be renamed with `suffix`. The default suffix is '.1'.

```{ruby}
#| tags: []
# partial join
df2.inner_join(other2, :KEY1)
```

```{ruby}
#| tags: []
df2.inner_join(other2, :KEY1, suffix: '_')
```

## 81. Order of record in join

The order of records is not guaranteed to be preserved across a join. This is similar to an RDB: records behave like a set.

If you want to preserve the order of records, it is recommended to add an index or to sort; a sketch follows the example below.

(Since 0.2.3)

```{ruby}
#| tags: []
df2
```

```{ruby}
#| tags: []
other2
```

```{ruby}
#| tags: []
df2.full_join(other2, :KEY2)
```
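A minimal sketch of that recommendation, assuming `DataFrame#sort` accepts a sort key such as `:idx`:

```ruby
# Add an explicit row index before joining, then sort on it afterwards
# to restore the original order of df2's records.
indexed = df2.assign_left(:idx) { indices }
indexed.full_join(other2, :KEY2)
       .sort(:idx)
```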

## 82. Set operations

Keys in self and other must be the same in set operations.

(Since 0.2.3)

```{ruby}
#| tags: []
df = DataFrame.new(
  KEY1: %w[A B C],
  KEY2: [1, 2, 3]
)
```

```{ruby}
#| tags: []
other = DataFrame.new(
  KEY1: %w[A B D],
  KEY2: [1, 4, 5]
)
```

Intersect will select records appearing in both self and other.

```{ruby}
#| tags: []
df.intersect(other)
```

Union will select records appearing in either self or other.

```{ruby}
#| tags: []
df.union(other)
```

Difference will select records appearing in self but not in other.

It has an alias `#setdiff`.

```{ruby}
#| tags: []
df.difference(other)
```

## 83. Join (big method)

The undocumented, generic `join` method supports all mutating joins, filtering joins and set operations.

|category|method of RedAmber|:type in join method|requirement|
|-|-|-|-|
|mutating joins|#inner_join|:inner||
|mutating joins|#full_join|:full_outer||
|mutating joins|#left_join|:left_outer||
|mutating joins|#right_join|:right_outer||
|-|-|:right_semi||
|-|-|:right_anti||
|filtering joins|#semi_join|:left_semi||
|filtering joins|#anti_join|:left_anti||
|set operations|#intersect|:inner|must have same keys with self and other|
|set operations|#union|:full_outer|must have same keys with self and other|
|set operations|#difference|:left_anti|must have same keys with self and other|

(Since 0.2.3)

```{ruby}
#| tags: []
df = DataFrame.new(
  KEY: %w[A B C],
  X1: [1, 2, 3]
)
```

```{ruby}
#| tags: []
other = DataFrame.new(
  KEY: %w[A B D],
  X2: [true, false, nil]
)
```

```{ruby}
#| tags: []
df.join(other, :KEY, type: :inner)
# Same as df.inner_join(other)
```
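Along the same lines, the filtering joins in the table above can also be written with `#join` (a sketch, assuming `:type` accepts the symbols listed in the table):

```ruby
df.join(other, :KEY, type: :left_semi)  # expected to match df.semi_join(other, :KEY)
df.join(other, :KEY, type: :left_anti)  # expected to match df.anti_join(other, :KEY)
```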

(Since 0.5.0) `#join` does not force the original record order by default.

## 84. Force order for #join

We can use the `:force_order` option to ensure a deterministic record order for the `join` family of methods.
This option is true by default in `#inner_join`, `#full_join`, `#left_join`, `#right_join`, `#semi_join` and `#anti_join`.
It appends an index to the source and sorts after joining, which causes some performance degradation.
(Since 0.4.0)

(Since 0.5.0) `#join` does not force the original record order by default.

```{ruby}
#| tags: []
df2 = DataFrame.new(
  KEY1: %w[A B C],
  KEY2: %w[s t u],
  X: [1, 2, 3]
)
```

```{ruby}
#| tags: []
right2 = DataFrame.new(
  KEY1: %w[A B D],
  KEY2: %w[s u v],
  Y: [3, 2, 1]
)
```

```{ruby}
#| tags: []
df2.full_join(right2, :KEY2)
```

```{ruby}
#| tags: []
df2.full_join(right2, :KEY2, force_order: false)
```

```{ruby}
#| tags: []
df2.full_join(right2, { left: :KEY2, right: 'KEY2' })
```

```{ruby}
#| tags: []
df2.full_join(right2, { left: :KEY2, right: 'KEY2' }, force_order: false)
```

## 85. Binding DataFrames in vertical (concatenate)

Concatenate another DataFrame or Table onto the bottom of self. The shape and data type of other must be the same as self.

The alias is `concat`.
(Since 0.2.3)

```{ruby}
#| tags: []
df = DataFrame.new(x: [1, 2], y: ['A', 'B'])
```

```{ruby}
#| tags: []
other = DataFrame.new(x: [3, 4], y: ['C', 'D'])
```

```{ruby}
#| tags: []
df.concatenate(other)
```

## 86. Binding DataFrames in lateral (merge)

Concatenate another DataFrame or Table onto the right of self (bind columns laterally). The number of rows in other must be the same as in self, and key names must not overlap.

(Since 0.2.3)

```{ruby}
#| tags: []
df = DataFrame.new(x: [1, 2], y: [3, 4])
```

```{ruby}
#| tags: []
other = DataFrame.new(a: ['A', 'B'], b: ['C', 'D'])
```

```{ruby}
#| tags: []
df.merge(other)
```

## 87. Join - larger example by nycflight13

(Since 0.2.3)

The 'nycflights13' dataset is large. The first run will take a while to fetch the data and prepare the red-datasets cache.

```{ruby}
#| tags: []
require 'datasets-arrow'

package = 'nycflights13'

airlines = DataFrame.new(Datasets::Rdatasets.new(package, 'airlines'))
airlines
```

Creating the 'flights' table with `Datasets::Rdatasets` is very slow because Red Datasets uses Ruby's standard CSV library as its parser. We can parse the CSV with Arrow's faster parser instead.

```{ruby}
uri = URI('https://vincentarelbundock.github.io/Rdatasets/csv/nycflights13/flights.csv')
flights  = DataFrame.load(uri)
  .pick(%i[month day carrier flight tailnum origin dest air_time distance])
flights
```

```{ruby}
# inner join
flights.inner_join(airlines, :carrier)
# flights.inner_join(airlines) # natural join (same result)
```

## 88. Vector#split_to_columns

Another example used in a DataFrame operation is in [73. From the Pandas cookbook - Multiindexing](#73.-From-the-Pandas-cookbook---Multiindexing).

`self` must be a String type Vector.

(Since 0.3.0)

```{ruby}
#| tags: []
v = Vector.new(['a b', 'c d', 'e f'])
```

```{ruby}
#| tags: []
v.split_to_columns
```

`#split` accepts a `sep` argument as a separator; `sep` is passed to `String#split(sep)`.

```{ruby}
#| tags: []
Vector.new('ab', 'cd', 'ef')
      .split_to_columns('')
```

A nil element is separated as nil.

```{ruby}
#| tags: []
Vector.new(nil, 'c d', 'e f')
      .split_to_columns
```

## 89. Vector#split_to_rows

`#split_to_rows` will separate strings and flatten them into rows.

(Since 0.3.0)

```{ruby}
#| tags: []
v = Vector.new(['a b', 'c d', 'e f'])
```

```{ruby}
#| tags: []
v.split_to_rows
```

## 90. Vector#merge
(Since 0.3.0)

`Vector#merge(other)` merges `self` and `other` if they are String Vectors.

```{ruby}
#| tags: []
vector = Vector.new(%w[a c e])
other = Vector.new(%w[b d f])
vector.merge(other)
```

If `other` is a scalar, it will be appended to each element of `self`.

```{ruby}
#| tags: []
vector.merge('x')
```

The option `:sep` is used when concatenating elements. Its default value is ' '.

```{ruby}
#| tags: []
vector.merge('x', sep: '')
```

## 91. Separate a variable (column) in a DataFrame
(Since 0.3.0)

R's separate operation.

https://tidyr.tidyverse.org/reference/separate.html

```{ruby}
#| tags: []
df = DataFrame.new(xyz: [nil, 'x.y', 'x.z', 'y.z'])
```

Instead of `separate(:xyz, [:a, :b])` we will do:

```{ruby}
#| tags: []
df.assign(:A, :B) { xyz.split_to_columns('.') }
  .drop(:xyz)
```

If you need :B only, instead of `separate(:xyz, [nil, :B])` we will do:

```{ruby}
#| tags: []
df.assign(:A, :B) { xyz.split_to_columns('.') }
  .pick(:B)
```

When the split lengths are not equal, split returns an Array of Vectors of the maximum length, padded with nil.

```{ruby}
#| tags: []
df = DataFrame.new(xyz: ['x', 'x y', 'x y z', nil])
df.assign(:x, :y, :z) { xyz.split_to_columns }
```

Split limiting to a maximum of 2 elements.

```{ruby}
#| tags: []
df.assign(:x, :yz) { xyz.split_to_columns(' ', 2) }
```

Another example:

```{ruby}
#| tags: []
df = DataFrame.new(id: 1..3, 'month-year': %w[8-2022 9-2022 10-2022])
  .assign(:month, :year) { v(:'month-year').split_to_columns('-') }
```

Split between the letters.

```{ruby}
#| tags: []
df = DataFrame.new(id: 1..3, yearmonth: %w[202209 202210 202211])
  .assign(:year, :month) { yearmonth.split_to_columns(/(?=..$)/) }
```

## 92. Unite variables (columns) in a DataFrame
(Since 0.3.0)

R's unite operation.

```{ruby}
#| tags: []
df = DataFrame.new(id: 1..3, year: %w[2022 2022 2022], month: %w[09 10 11])
```

```{ruby}
#| tags: []
df.assign(:yearmonth) { year.merge(month, sep: '') }
  .pick(:id, :yearmonth)
```

```{ruby}
#| tags: []
# Or directly create:
DataFrame.new(id: 1..3, yearmonth: df.year.merge(df.month, sep: ''))
```

## 93. Separate a variable and lengthen into several rows
(Since 0.3.0)

R's separate_rows operation.

```{ruby}
#| tags: []
df = DataFrame.new(id: 1..3, yearmonth: %w[202209 202210 202211])
  .assign(:year, :month) { yearmonth.split_to_columns(/(?=..$)/) }
  .drop(:yearmonth)
  .to_long(:id)
```

Another example with different list sizes.

```{ruby}
#| tags: []
df = DataFrame.new(
  x: 1..3,
  y: ['a', 'd,e,f', 'g,h'],
  z: ['1', '2,3,4', '5,6'],
)
```

```{ruby}
#| tags: []
sizes = df.y.split(',').list_sizes
a = sizes.to_a.map.with_index(1) { |n, i| [i] * n }.flatten
```

```{ruby}
#| tags: []
DataFrame.new(
  x: a,
  y: df.y.split_to_rows(','),
  z: df.z.split_to_rows(',')
)
```

Another way is to use `#split_to_columns`.

```{ruby}
#| tags: []
xy = df.pick(:x, :y)
  .assign(:y, :y1, :y2) { v(:y).split_to_columns(',') }
  .to_long(:x, value: :y)
  .remove_nil
```

```{ruby}
#| tags: []
xz = df.pick(:x, :z)
  .assign(:z, :z1, :z2) { v(:z).split_to_columns(',') }
  .to_long(:x, value: :z)
  .remove_nil
```

```{ruby}
#| tags: []
xy.pick(:x, :y).merge(xz.pick(:z))
```

Get all combinations of :y and :z.

```{ruby}
#| tags: []
df.assign(:y, :y1, :y2) { v(:y).split_to_columns(',') }
  .to_long(:x, :z, value: :y)
  .drop(:NAME)
  .assign(:z, :z1, :z2) { v(:z).split_to_columns(',') }
  .to_long(:x, :y, value: :z)
  .drop(:NAME)
  .drop_nil
```

## 94. Vector#propagate

Spread the return value of an aggregate function as if it were an element-wise function.

It has an alias `#expand`.

(Since 0.4.0)

```{ruby}
#| tags: []
vec = Vector.new(1, 2, 3, 4)
vec.propagate(:mean)
```

A block is also available.

```{ruby}
#| tags: []
vec.propagate { |v| v.mean.round }
```

## 95. DataFrame#propagate

Returns a Vector in which every element has the value `scalar` and which has the same size (number of rows) as self.

(Since 0.5.0)

```{ruby}
#| tags: []
df
```

```{ruby}
#| tags: []
df.assign(:sum_x) { propagate(x.sum) }
```

With a block.

```{ruby}
#| tags: []
df.assign(:range) { propagate { x.max - x.min } }
```

## 96. Vector#sort / #sort_indices

`#sort` arranges the values in a Vector.

It accepts a sort order option:
   - `:+`, `:ascending` or without argument will sort in increasing order.
   - `:-` or `:descending` will sort in decreasing order.

(Since 0.4.0)

```{ruby}
#| tags: []
vector = Vector.new(%w[B D A E C])
vector.sort
# same as vector.sort(:+)
# same as vector.sort(:ascending)
```

Sort in decreasing order;

```{ruby}
#| tags: []
vector.sort(:-)
# same as vector.sort(:descending)
```

## 97. Vector#rank

Returns 1-based numerical rank of self.

- Nil values are considered greater than any value.
- NaN values are considered greater than any value but smaller than nil values.
- Sort order can be controlled by the option `order`.
  * `:ascending` or `+` will compute rank in ascending order (default).
  * `:descending` or `-` will compute rank in descending order.
- Tiebreakers will configure how ties between equal values are handled.
  * `tie: :first` : Ranks are assigned in order of when ties appear in the input (default).
  * `tie: :min` : Ties get the smallest possible rank in the sorted order.
  * `tie: :max` : Ties get the largest possible rank in the sorted order.
  * `tie: :dense` : The ranks span a dense [1, M] interval where M is the number of distinct values in the input.
- Placement of nil and NaN is controlled by the option `null_placement`.
  * `null_placement: :at_end` : place nulls at end (default).
  * `null_placement: :at_start` : place nulls at the top of Vector.

(Since 0.4.0, revised in 0.5.1)

Rank of float Vector;

```{ruby}
#| tags: []
float = Vector[1, 0, nil, Float::NAN, Float::INFINITY, -Float::INFINITY, 3, 2]
```

```{ruby}
#| tags: []
# Same as float.rank(:ascending, tie: :first, null_placement: :at_end)
float.rank
```

With sort order;

```{ruby}
#| tags: []
float.rank(:descending) # or float.rank('-')
```

With null placement;

```{ruby}
#| tags: []
float.rank(null_placement: :at_start)
```

Rank of string Vector with tiebreakers;

```{ruby}
#| tags: []
string = Vector['A', 'A', nil, nil, 'C', 'B']
```

```{ruby}
#| tags: []
string.rank # same as string.rank(tie: :first)
```

```{ruby}
#| tags: []
string.rank(tie: :min)
```

```{ruby}
#| tags: []
string.rank(tie: :max)
```

```{ruby}
#| tags: []
string.rank(tie: :dense)
```

## 98. Vector#sample
Pick up elements at random.

(Since 0.4.0)

Return a randomly selected element. This is one of the aggregation functions.

```{ruby}
#| tags: []
v = Vector.new('A'..'H')
```

Returns a scalar when called without arguments.

```{ruby}
#| tags: []
v.sample
```

`sample(n)` will pick up `n` elements at random. `n` is a positive integer number of elements to pick.

If n is smaller than or equal to size, elements are picked without repetition.

If n == 1 (i.e. `sample(1)`), it returns a Vector of size 1, not a scalar.

```{ruby}
#| tags: []
v.sample(1)
```

Sampling the same size as self: every element is picked once, in random order.

```{ruby}
#| tags: []
v.sample(8)
```

If n is greater than `size`, some elements are picked repeatedly.

```{ruby}
#| tags: []
v.sample(9)
```

`sample(prop)` will pick up elements by the proportion `prop` at random. `prop` must be a positive Float.
 - The number of elements to pick is `prop * size`, truncated to an integer.
 - If prop is smaller than or equal to 1.0, elements are picked without repetition.

```{ruby}
#| tags: []
v.sample(0.7)
```

If only one element is picked, it returns a Vector of size 1, not a scalar.

```{ruby}
#| tags: []
v.sample(0.1)
```

Sampling the same size as self: every element is picked once, in random order.

```{ruby}
#| tags: []
v.sample(1.0)
```

If prop is greater than 1.0, some elements are picked repeatedly.

```{ruby}
#| tags: []
# 2 times over sampling
sampled = v.sample(2.0)
```

```{ruby}
#| tags: []
sampled.tally
```

## 99. DataFrame#sample/shuffle

(Since 0.5.0)

Select records randomly to create a DataFrame.

```{ruby}
#| tags: []
penguins.sample(0.1)
```

Returns a DataFrame with shuffled rows.

```{ruby}
#| tags: []
penguins.shuffle
```

## 100. Vector#concatenate

Concatenate another array-like to self.

(Since 0.4.0)

Concatenate to a string Vector;

```{ruby}
#| tags: []
string = Vector.new(%w[A B])
```

```{ruby}
#| tags: []
string.concatenate([1, 2])
```

Concatenate to an integer Vector;

```{ruby}
#| tags: []
integer = Vector.new(1, 2)
```

```{ruby}
#| tags: []
integer.concatenate(["A", "B"])
```

## 101. Vector#resolve

Return other as a Vector with the same data type as self.

(Since 0.4.0)

Integer to String;

```{ruby}
#| tags: []
Vector.new('A').resolve([1, 2])
```

String to Integer;

```{ruby}
#| tags: []
Vector.new(1).resolve(['A'])
```

Upcast to uint16;

```{ruby}
#| tags: []
vector = Vector.new(256)
```

Not a uint8 Vector;

```{ruby}
#| tags: []
vector.resolve([1, 2])
```

## 102. Vector#cast

Cast self to `type`.

(since 0.4.2)

```{ruby}
#| tags: []
vector = Vector.new(1, 2, nil)
vector.cast(:int16)
```

```{ruby}
#| tags: []
vector.cast(:double)
```

```{ruby}
#| tags: []
vector.cast(:string)
```

## 103. Vector#one

Get a non-nil element in self. If all elements are nil, return nil.

(since 0.4.2)

```{ruby}
#| tags: []
vector = Vector.new([nil, 1, 3])
vector.one
```

## 104. SubFrames

`SubFrames` is a new concept of DataFrame collection. It represents ordered subsets of a DataFrame collected by some rules. It includes both grouping and windowing concepts in a unified manner, and also covers broader cases more flexibly.

(Since 0.4.0)

```{ruby}
#| tags: []
dataframe = DataFrame.new(
  x: [*1..6],
  y: %w[A A B B B C],
  z: [false, true, false, nil, true, false]
)
p dataframe; nil
```

```{ruby}
#| tags: []
sf = SubFrames.new(dataframe, [[0, 1], [2, 3, 4], [5]])
```

Source DataFrame (universal set).

```{ruby}
#| tags: []
sf.baseframe
```

The number of subsets.

```{ruby}
#| tags: []
sf.size
```

The sizes of each subset.

```{ruby}
#| tags: []
sf.sizes
```

`#each` returns an Enumerator, or iterates over each subset as a DataFrame when a block is given.

```{ruby}
#| tags: []
sf.each
```

```{ruby}
#| tags: []
sf.each.next
```
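With a block, `#each` yields each subset as a DataFrame. A small sketch:

```ruby
# Print the number of rows of each subset.
sf.each { |df| puts df.size }
```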

`SubFrames.new` also accepts a block.

```{ruby}
#| tags: []
usf = SubFrames.new(dataframe) { |df| [df.indices] }
```

`#universal?` tests if self is a universal set.

```{ruby}
#| tags: []
usf.universal?
```

`#empty?` tests if self is an empty set.

```{ruby}
#| tags: []
esf = SubFrames.new(dataframe, [])
```

```{ruby}
#| scrolled: true
#| tags: []
esf.empty?
```

`#take(n)` takes n sub DataFrames and returns them as a SubFrames. If n >= size, it returns self.

```{ruby}
sf.take(2)
```

`#offset_indices` returns the index at the top of each sub DataFrame.

```{ruby}
sf.offset_indices
```

`#frames` returns an Array of sub DataFrames.

```{ruby}
sf.frames
```

`SubFrames.new` also accepts boolean filters even from the block.

```{ruby}
#| tags: []
small = dataframe.x < 4
large = !small
small_large = SubFrames.new(dataframe) { [small, large] }
```

## 105. SubFrames#concatenate

`SubFrames#concatenate` (or alias `#concat`) will concatenate SubFrames to create a DataFrame.

(Since 0.4.0)

```{ruby}
#| tags: []
sf.concatenate
```

## 106. SubFrames.by_group

Create SubFrames by Group object.

(Since 0.4.0)

```{ruby}
#| tags: []
p dataframe; nil
```

```{ruby}
#| tags: []
group = Group.new(dataframe, [:y])
sf = SubFrames.by_group(group)
```

## 107. SubFrames.by_indices/.by_filters

`SubFrames.by_indices(dataframe, subset_indices)` creates a new SubFrames object from a DataFrame and an array of indices.

```{ruby}
SubFrames.by_indices(dataframe, [[0, 2, 4], [1, 3, 5]])
```

`SubFrames.by_filters(dataframe, subset_filters)` creates a new SubFrames object from a DataFrame and an array of filters.

```{ruby}
#| scrolled: true
SubFrames.by_filters(dataframe, [[true, false, true, false, nil, false], [true, true, false, false, nil, false]])
```

## 108. SubFrames.by_dataframes

`SubFrames.by_dataframes(dataframes)` creates a new SubFrames from an Array of DataFrames.

```{ruby}
dataframes = [
    DataFrame.new(x: [1, 2, 3], y: %w[A A B], z: [false, true, false]),
    DataFrame.new(x: [4, 5, 6], y: %w[B B C], z: [nil, true, false])
]
```

```{ruby}
SubFrames.by_dataframes(dataframes)
```

## 109. DataFrame#sub_by_value

`sub_by_value(*keys)` makes subframes by value; it corresponds to Group processing.

It creates SubFrames grouped by the values in the columns specified by `keys`.

(Since 0.4.0)

```{ruby}
#| tags: []
dataframe.sub_by_value(:y)
```

## 110. DataFrame#sub_by_window

Create SubFrames by a window of `size` rows, rolling from `from` by `step` (see the sketch after the example below).

The default values are `from: 0`, `size: nil` and `step: 1`.

(Since 0.4.0)

```{ruby}
#| tags: []
dataframe.sub_by_window(size: 4, step: 2)
```
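With the defaults `from: 0` and `step: 1`, only `size` needs to be given (a sketch):

```ruby
# Rolling windows of 3 rows, starting at row 0 and advancing by 1 row.
dataframe.sub_by_window(size: 3)
```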

## 111. DataFrame#sub_by_enum

Create SubFrames by grouping/windowing by position. The position is specified by an `Array` enumerator method such as `each_slice` or `each_cons`.

(Since 0.4.0)

Create a SubFrames object sliced every 3 rows. This is a MECE (Mutually Exclusive and Collectively Exhaustive) SubFrames.

```{ruby}
#| tags: []
dataframe.sub_by_enum(:each_slice, 3)
```

Create a SubFrames object for every 4 consecutive rows.

```{ruby}
#| tags: []
dataframe.sub_by_enum(:each_cons, 4)
```

## 112. DataFrame#sub_by_kernel

Create SubFrames by windowing with a kernel and step.
Kernel is a boolean Array and it behaves like a masked window.

(Since 0.4.0)

```{ruby}
#| tags: []
kernel = [true, false, false, true]
dataframe.sub_by_kernel(kernel, step: 2)
```

## 113. DataFrame#build_subframes

Generic builder of sub DataFrames from self.

(Since 0.4.0)

```{ruby}
#| tags: []
dataframe.build_subframes([[0, 2, 4], [1, 3, 5]])
```

`#build_subframes` also accepts a block.

```{ruby}
#| tags: []
dataframe.build_subframes do |df|
  even = df.indices.map(&:even?)
  [even, !even]
end
```

## 114. SubFrames#aggregate

Aggregate SubFrames to create a DataFrame. There are 4 APIs in this method.

(Since 0.4.0)

- `#aggregate(keys) { columns }`

Aggregate SubFrames, creating a DataFrame with the labels `keys` and their column values from the block.

```{ruby}
#| tags: []
sf = dataframe.sub_by_value(:y)
```

```{ruby}
sf.aggregate(:y, :sum_x) { [y.one, x.sum] }  # sf.aggregate([:y, :sum_x]) { [y.one, x.sum] } is also acceptable
```

- `#aggregate { key_and_aggregated_values }`

Aggregate SubFrames, creating a DataFrame from pairs of keys and aggregated values in a Hash returned by the block.

```{ruby}
sf.aggregate do
  { y: y.one, sum_x: x.sum }
end
```

- `#aggregate { [keys, values] }`

Aggregate SubFrames, creating a DataFrame from an Array of [key, aggregated value] pairs returned by the block.

```{ruby}
#| tags: []
sf.aggregate do
  [[:y, y.one], [:sum_x, x.sum]]
end
```

- `#aggregate(group_keys, aggregations)`

Aggregate SubFrames using the first values of the `group_keys` columns and the aggregated results of the key-function pairs.
([Experimental] This API may be changed in the future.)

```{ruby}
#| tags: []
sf.aggregate(:y, { x: :sum, z: :count })
```

## 115. SubFrames#map/#assign

`#map` returns a SubFrames containing the DataFrames returned by the block. It has an alias `collect`.

```{ruby}
sf
```

This example assigns a new column.

```{ruby}
sf.map { |df| df.assign(x_plus1: df[:x] + 1) }
```

There is a shortcut of `map { assign }`. We can use `assign(key) { updated_column }`.

```{ruby}
sf.assign(:x_plus1) { x + 1 }
```

We can use `assign(keys) { updated_columns }` for multiple columns.

```{ruby}
sf.assign(:sum_x, :flac_x) do
  group_sum = x.sum
  [[group_sum] * x.size, x / group_sum.to_f]
end
```

Also `assign { keys_and_columns }` is possible.

```{ruby}
sf.assign do
  { 'x*z': x * z.if_else(1, 0) }
end
```

(Notice) `SubFrames#assign` has the same syntax as `DataFrame#assign`.

If you need an Array of DataFrames (not a SubFrames), use `each.map` instead.

```{ruby}
sf.each.map { |df| df.assign(x_plus1: df[:x] + 1) }
```

## 116. SubFrames#select/#reject

`#select` returns a SubFrames containing the DataFrames selected by the block.

```{ruby}
sf.select { |df| df[:z].any? }
```

`#select` has aliases `#filter` and `#find_all`.

`#reject` returns a SubFrames containing the DataFrames for which the block returns nil or false.

```{ruby}
sf.reject { |df| df[:z].any? }
```

## 117. SubFrames#filter_map

`#filter_map` returns a SubFrames containing the truthy DataFrames returned by the block.

```{ruby}
sf.filter_map do |df|
  if df.size > 1
    df.assign(:y) do
      y.merge(indices('1'), sep: '')
    end
  end
end
```

## 118. Vector#modulo

(Since 0.4.1)

`#%` is an alias of `#modulo`.

```{ruby}
#| tags: []
vector = Vector.new(5, -3, 1)
vector % 3
```

`#%` and `#modulo` are equivalent to `self - divisor * (self / divisor).floor`.

```{ruby}
#| tags: []
vector.modulo(-2)
```
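A worked scalar check of this identity in plain Ruby (a sketch; `fdiv` is used for floating-point division):

```ruby
x = 5
divisor = -2
x - divisor * x.fdiv(divisor).floor  # => -1, matching Ruby's 5.modulo(-2)
```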

## 119. Vector#mode

Compute the most common value and its occurrence count.

(Since 0.5.0) ModeOptions are not supported in 0.5.0; only one mode value is returned.

```{ruby}
#| tags: []
Vector[true, true, false, nil].mode
```

```{ruby}
#| tags: []
Vector[0, 1, 1, 2, nil].mode
```

```{ruby}
#| tags: []
Vector[1, 0/0.0, -1/0.0, 1/0.0, nil].mode
```

## 120. Vector#end_with/start_with

Check if elements in self end/start with a literal pattern.

(since 0.5.0)

```{ruby}
#| tags: []
v = Vector['array', 'Arrow', 'carrot', nil, 'window']
```

Emits true if the element ends/starts with the given string, false otherwise. Nil inputs emit nil.

```{ruby}
#| tags: []
v.end_with('ow')
```

```{ruby}
#| tags: []
v.start_with('arr')
```

## 121. Vector#match_substring

For each string in self, emit true if it contains a given pattern.

(since 0.5.0)

```{ruby}
#| tags: []
v = Vector['array', 'Arrow', 'carrot', nil, 'window']
```

Emits true if the element contains the given string, false if not found. Nil inputs emit nil.

```{ruby}
#| tags: []
v.match_substring('arr')
```

Otherwise use it with a Regexp pattern. It calls `match_substring_regex` in the Arrow compute functions, which uses the re2 library.

```{ruby}
#| tags: []
v.match_substring(/arr/)
```

You can ignore case if you use a regexp with the `i` option, or `ignore_case: true`.

```{ruby}
#| tags: []
v.match_substring(/arr/i)  # same as v.match_substring(/arr/, ignore_case: true)
```

## 122. Vector#match_like

Match elements of self against SQL-style LIKE pattern. The pattern matches a given pattern at any position.

- '%' will match any number of characters,
- '_' will match exactly one character, and any other character matches itself.
- To match a literal '%', '_', or '\', precede the character with a backslash.

(since 0.5.0)

```{ruby}
#| tags: []
v = Vector['array', 'Arrow', 'carrot', nil, 'window']
```

Emits true for matching elements, false otherwise. Nil inputs emit nil.

```{ruby}
#| tags: []
v.match_like('_arr%')
```

You can ignore case if you use the option `ignore_case: true`.

```{ruby}
#| tags: []
v.match_like('arr%', ignore_case: true)
```

## 123. Vector#find_substring

Find first occurrence of substring in string Vector.

(since 0.5.1)

```{ruby}
#| tags: []
v = Vector['array', 'Arrow', 'carrot', nil, 'window']
```

You can find the index of a literal substring in each element. Emits -1 if not found. Nil inputs emit nil.

```{ruby}
#| tags: []
v.find_substring('arr')
```

Otherwise use it with a Regexp pattern. It calls `find_substring_regex` in the Arrow compute functions, which uses the re2 library.

```{ruby}
#| tags: []
v.find_substring(/arr/)
```

You can ignore case if you use a regexp with the `i` option, or `ignore_case: true`.

```{ruby}
#| tags: []
v.find_substring(/arr/i)  # same as v.find_substring(/arr/, ignore_case: true)
```

## 124. Vector#count_substring

For each string in self, count occurrences of the given pattern.

(since 0.5.0)

```{ruby}
#| tags: []
v = Vector['amber', 'Amazon', 'banana', nil]
```

You can count occurrences of a literal string. Nil inputs emit nil.

```{ruby}
#| tags: []
v.count_substring('an')
```

Otherwise use it with a Regexp pattern. It calls `count_substring_regex` in the Arrow compute functions, which uses the re2 library.

```{ruby}
#| tags: []
v.count_substring(/a[mn]/)
```

You can ignore case if you use a regexp with the `i` option, or `ignore_case: true`.

```{ruby}
#| tags: []
v.count_substring(/a[mn]/i)  # same as v.count_substring(/a[mn]/, ignore_case: true)
```

## 125. Grouped DataFrame as a list

This API was introduced in 0.2.3 and supplies a new way of grouping a DataFrame (experimental).

This additional API will treat a grouped DataFrame as a list of DataFrames. I think this API has pros such as:

- API is easy to understand and flexible.
- It has good compatibility with Ruby's primitive Enumerables.
- We can use plain, non hash-ed aggregation functions.
- Do not need grouped DataFrame state, nor `#ungroup` method.
- May be useful for concurrent operations.

This feature is implemented in Ruby, so it is pretty slow and experimental. Use the original Group API for practical purposes.

(Since 0.2.3, experimental feature => This was upgraded to SubFrames feature)

```{ruby}
enum = penguins.group(:island).each
```

```{ruby}
enum.to_a
```

```{ruby}
array = enum.map do |df|
  DataFrame.new(island: [df.island[0]]).assign do
    df.variables.each_with_object({}) do |(key, vec), hash|
      next unless vec.numeric?
      hash["mean(#{key})"] = [vec.mean]
    end
  end
end
```

```{ruby}
array.reduce { |a, df| a.concat df }
```

## 126. ArrowFunction helpers

The `ArrowFunction` module adds two helper methods.

`ArrowFunction.find(function_name)` returns an Arrow Function object from the Arrow C++ compute functions.

```{ruby}
ArrowFunction.find(:mean)
```

To execute this function,

```{ruby}
ArrowFunction.find(:mean).execute([[1, 2, 3, 4]]).value.value
```

`ArrowFunction.arrow_doc(function_name)` returns the documentation of an Arrow C++ compute function as a string.

```{ruby}
puts ArrowFunction.arrow_doc(:mean)
```

## 127. DataFrame.auto_cast

A planetary data set from https://nssdc.gsfc.nasa.gov/planetary/factsheet/ is used here. Let's manually copy the data from the HTML table to get tab-separated text values.

```{ruby}
tsv = '      MERCURY      VENUS      EARTH      MOON      MARS      JUPITER      SATURN      URANUS      NEPTUNE      PLUTO
Mass (1024kg)    0.330    4.87    5.97    0.073    0.642    1898    568    86.8    102    0.0130
Diameter (km)    4879    12,104    12,756    3475    6792    142,984    120,536    51,118    49,528    2376
Density (kg/m3)    5429    5243    5514    3340    3934    1326    687    1270    1638    1850
Gravity (m/s2)    3.7    8.9    9.8    1.6    3.7    23.1    9.0    8.7    11.0    0.7
Escape Velocity (km/s)    4.3    10.4    11.2    2.4    5.0    59.5    35.5    21.3    23.5    1.3
Rotation Period (hours)    1407.6    -5832.5    23.9    655.7    24.6    9.9    10.7    -17.2    16.1    -153.3
Length of Day (hours)    4222.6    2802.0    24.0    708.7    24.7    9.9    10.7    17.2    16.1    153.3
Distance from Sun (106 km)    57.9    108.2    149.6    0.384*    228.0    778.5    1432.0    2867.0    4515.0    5906.4
Perihelion (106 km)    46.0    107.5    147.1    0.363*    206.7    740.6    1357.6    2732.7    4471.1    4436.8
Aphelion (106 km)    69.8    108.9    152.1    0.406*    249.3    816.4    1506.5    3001.4    4558.9    7375.9
Orbital Period (days)    88.0    224.7    365.2    27.3*    687.0    4331    10,747    30,589    59,800    90,560
Orbital Velocity (km/s)    47.4    35.0    29.8    1.0*    24.1    13.1    9.7    6.8    5.4    4.7
Orbital Inclination (degrees)    7.0    3.4    0.0    5.1    1.8    1.3    2.5    0.8    1.8    17.2
Orbital Eccentricity    0.206    0.007    0.017    0.055    0.094    0.049    0.052    0.047    0.010    0.244
Obliquity to Orbit (degrees)    0.034    177.4    23.4    6.7    25.2    3.1    26.7    97.8    28.3    122.5
Mean Temperature (C)    167    464    15    -20    -65    -110    -140    -195    -200    -225
Surface Pressure (bars)    0    92    1    0    0.01    Unknown*    Unknown*    Unknown*    Unknown*    0.00001
Number of Moons    0    0    1    0    2    79    82    27    14    5
Ring System?    No    No    No    No    No    Yes    Yes    Yes    Yes    No
Global Magnetic Field?    Yes    No    Yes    No    No    Yes    Yes    Yes    Yes    Unknown
'

raw_dataframe = DataFrame.load(Arrow::Buffer.new(tsv), format: :tsv)

ENV['RED_AMBER_OUTPUT_MODE'] = 'plain'
raw_dataframe
```

This dataframe has row-oriented values, so we must transpose it.

```{ruby}
transposed = raw_dataframe.transpose
```

This dataframe has string columns. We could cast each numeric column individually, but the recommended way is to use `#auto_cast`. `#auto_cast` saves the data to a temporary TSV file and re-opens it to get a dataframe with casted columns.

```{ruby}
transposed.auto_cast
```

There is still some dirty data to be cleaned in this dataframe; we don't touch it here. If you are interested, give it a try! A partial sketch of the first few items follows the checklist.

- Rename a column 'NAME' to 'Planet_name'.
- Remove preceding/trailing spaces in 'Planet_name' values.
- Capitalize 'Planet_name' values.
- Remove data for 'Moon' and 'Pluto' to create the Table for planets.
- Convert 'Unknown*' to nil.
- Change 'Yes' / 'No' values to true / false (change column type to boolean).
- Remove commas in numeric values; they prevent the columns from being numeric.
- Correct cell values that contain '*'; they also prevent the columns from being numeric.
- Add missing '^' to unit in labels.
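As a starting point, here is a partial sketch of the first three items. It assumes `#rename` accepts old-to-new key pairs and that an assign block may return a plain Array as the new column values:

```ruby
# Rename 'NAME', then strip surrounding spaces and capitalize the planet names.
cleaned = transposed.auto_cast
  .rename(NAME: :Planet_name)
  .assign(:Planet_name) { v(:Planet_name).each.map { |s| s.strip.capitalize } }
```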