= Node Pattern

Node pattern is a DSL to help find specific nodes in the Abstract Syntax Tree
using a simple string.

It is reminiscent of regular expressions in its simplicity, but it is used to
find specific nodes in Ruby code.

== History

The Node Pattern was introduced by https://github.com/alexdowad[Alex Dowad]
and solves a problem that RuboCop contributors were facing for a long time:

* Ability to declaratively define rules for node search, matching, and capture.

The code below belongs to the https://www.rubydoc.info/gems/rubocop/RuboCop/Cop/Style/ArrayJoin[Style/ArrayJoin]
cop, which favors `Array#join` over `Array#*`. It looks for code like
`%w(one two three) * ", "` and suggests using `#join` instead.

The receiver can also be an array of integers; the code does not check the
element types. It does, however, check that the argument is a string.

[source,ruby]
----
def on_send(node)
  receiver_node, method_name, *arg_nodes = *node
  return unless receiver_node && receiver_node.array_type? &&
    method_name == :* && arg_nodes.first.str_type?

  add_offense(node, location: :selector)
end
----

In the cop, this code was replaced by a matcher definition that performs the same check:

[source,ruby]
----
def_node_matcher :join_candidate?, '(send $array :* $str)'
----

The `on_send` method is then reduced to a call to that matcher:

[source,ruby]
----
def on_send(node)
  join_candidate?(node) { add_offense(node, location: :selector) }
end
----

== Ruby Abstract Syntax Tree (AST)

Parser translates Ruby source code to a tree structure represented in text.
A simple integer literal like `1` is represented by `(int 1)` in the AST.
A method call with two integer literals:

[source,ruby]
----
foo(1, 2)
----

is represented with:

[source]
----
(send nil :foo
  (int 1)
  (int 2)
)
----

Every node is represented with a sequence.
The first element is the node type.
Other elements are the children. They are optionally present and depend on the node type.
E.g.:

* `nil` is just `(nil)`
* `1` is `(int 1)`
* `[1]` is `(array (int 1))`
* `[1, 2]` is `(array (int 1) (int 2))`
* `foo` is `(send nil :foo)`
* `foo(1)` is `(send nil :foo (int 1))`

=== Getting the AST representation

==== From the command-line with `ruby-parse`

[source,sh]
----
$ ruby-parse --legacy -e 'foo(1)'
(send nil :foo
  (int 1))
----

NOTE: Use the `--legacy` `ruby-parse` flag to get https://github.com/whitequark/parser/#usage[the same AST that RuboCop AST returns].
There are several differences, e.g. without `--legacy`, `foo(a: 1)` would return `kwargs`, and with `--legacy` it returns `hash`.

==== From REPL

[source,ruby]
----
> puts RuboCop::AST::ProcessedSource.new('foo(1)', RUBY_VERSION.to_f).ast.to_s
(send nil :foo
  (int 1))
----

== Basic Node Pattern Structure

The simplest Node Pattern would match just the node type.
E.g. the `int` node pattern would match the `(int 1)` AST (literal `1` in Ruby code).
More sophisticated node patterns match more than one child.

== `(` and `)` to Match Elements

Several matchers surrounded by parentheses match a node whose elements each match the corresponding matcher, in order.
For example, `[1, 2]`, an array of two integer literals, is represented in the AST as `(array (int 1) (int 2))` and can be matched with the `(array int int)` node pattern.

For a literal integer, e.g. `1` Ruby code represented by `(int 1)` in AST:

* `int` node pattern will match the node, looking only at the node type
* `(int 1)` node pattern will match precisely the node
* `(int 2)` node pattern will not match

== `(` and `)` for Nested Matching

Ruby code with a method call with two integer literals as arguments, `foo(1, 2)` represented in AST as `(send nil :foo (int 1) (int 2))` could be matched with `(send nil? :foo int int)` node pattern.
To match just those method calls where the first argument is a literal `1`, use `(send nil? :foo (int 1) int)`.
Any child that is a node can be a target for nested matching.

== `_` for any single node

`_` will check if there's something present in the specific position, no matter the
value:

* `(int _)` will match any number
* `(int _ _)` will not match because `int` types have just one child that
contains the value.

== `+...+` for several subsequent nodes

Where `_` matches any single node, `+...+` matches any number of nodes.

Say for example you want to find instances of calls to the method `sum` with any
number of arguments, be it `sum(1, 2)` or `sum(1, 2, 3, n)`.
First, let's check what it looks like in the AST:

[source,sh]
----
$ ruby-parse -e 'sum(1, 2)'
(send nil :sum
  (int 1)
  (int 2))
----

Or with more children:

[source,sh]
----
$ ruby-parse -e 'sum(1, 2, 3, n)'
(send nil :sum
  (int 1)
  (int 2)
  (int 3)
  (send nil :n))
----

The following expression would only match a call with 2 arguments:

----
(send nil? :sum _ _)
----

Instead, the following expression will match any number of arguments (and thus both examples above):

----
(send nil? :sum ...)
----

Note that `+...+` can appear anywhere in a sequence, for example `+(send nil? :sum ... int)+`
would no longer match the second example, as the last argument is not an integer.

Nesting `+...+` is also supported; the only limitation is that `+...+` and
other "variable length" patterns can only appear once within a sequence.
For example `+(send ... :sum ...)+` is not supported.

== `*`, `+`, `?` for repetitions

Another way to handle a variable number of nodes is by using `*`, `+`, `?` to signify
a particular pattern should match any number of times, at least once and at most once respectively.

Following on the previous example, to find sums of integer literals, we could use:

----
(send nil? :sum int*)
----

This would match our first example `sum(1, 2)` but not the other, `sum(1, 2, 3, n)`.

This pattern would also match a call to `sum` without any argument, which might not be desirable.

Using `+` would ensure that only sums with at least one argument are matched.

----
(send nil? :sum int+)
----

The `?` limits the match to 0 or 1 nodes.
The following example would match any sum of three integer literals
optionally followed by a method call:

----
(send nil? :sum int int int send ?)
----

Note that we have to put a space between `send` and `?`,
since `send?` would be considered as a predicate (described below).

== `<>` for match in any order

You may not care about the exact order of the nodes you want to match.
In this case you can put the nodes within angle brackets:

----
(send nil? :sum <(int 2) int>)
----

This will match our first example (`sum(1, 2)`).

It won't match our second example though, as it specifies that there must be
exactly two arguments to the method call `sum`.

You can add `+...+` before the closing bracket to allow for additional parameters:

----
(send nil? :sum <(int 2) int ...>)
----

This will match both our examples, but not `sum(1.0, 2)` or `sum(2)`,
since the first node in the brackets is found, but not the second (`int`).

== `{}` for "OR" (union)

Let's make it a bit more complex and introduce floats:

[source,sh]
----
$ ruby-parse -e '1'
(int 1)
$ ruby-parse -e '1.0'
(float 1.0)
----

* `({int | float} _)` - int or float types, no matter the value

Branches of the union can contain more than one term:

* `(array {int int | range})` - matches an array with two integers or a single range element

If all the branches have a single term, you can omit the `|`, so `{int | float}` can be
simplified to `{int float}`.

When checking for symbols or strings, you can use regexp literals for a similar effect:

[source,sh]
----
(send _ /to_s|inspect/) # => matches calls to `to_s` or `inspect`
----

== `[]` for "AND"

Imagine you want to check that a number is both odd and positive:

`(int [odd? positive?])` - is an int and the value should be odd and positive.

NOTE: Refer to <<Predicate methods>> to see how `odd?` works.
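
At the value level, `[odd? positive?]` amounts to calling both predicates on the integer and requiring both to be true. A plain-Ruby sketch of that check follows; `odd_and_positive?` and the sample values are illustrative and not part of rubocop-ast:

```ruby
# Hypothetical helper mirroring what `[odd? positive?]` asks of the value.
def odd_and_positive?(value)
  value.odd? && value.positive?
end

# Only 5 satisfies both predicates.
puts [-3, 0, 2, 5].select { |value| odd_and_positive?(value) }.inspect
```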

== `!` for Negation

Node pattern `(send nil? :sum !int _)` would match a `sum` call where the first argument is *not* a literal integer.
E.g.:

* it will match `sum(2.0, 3)`, as the first argument is of a `float` type
* it will not match `sum(2, 3)`, as the first argument is of an `int` type

NOTE: Negation operator works with other node pattern syntax elements, `{}`, `[]`, `()`, `$`, but only with those that target a single element. E.g. `$!(int 1)`, `!{false nil}`, `![#positive? #even?]` will work, while `!{int int | sym}`, `!{int int | sym sym}`, and any use of `<>` won't.

== `$` for captures

You can capture elements or nodes along with your search, prefixing the expression
with `$`. For example, in a tuple like `(int 1)`, you can capture the value using `(int $_)`.

You can also capture multiple things like:

----
(${int float} $_)
----

The tuple can be entirely captured using the `$` before the open parens:

----
$({int float} _)
----

Or remove the parens and match directly from node head:

----
${int float}
----

All variable length patterns (`+...+`, `*`, `+`, `?`, `<>`) are captured as arrays.

The following pattern will have two captures, both arrays:

----
(send nil? $int+ (send $...))
----

== `^` for parent

One may use the `^` character to check against a parent.

For example, the following pattern would find any node with two children and
with a parent that is a hash:

----
(^hash _key $_value)
----

It is possible to use `^` elsewhere than at the head of a sequence; in that
case it is relative to that child (i.e. the current node). One can also use
multiple `^` to go up multiple levels.
For example, the previous example is basically the same as:

----
(pair ^^hash $_value)
----

== ``` for descendants

The ``` character can be used to search a node and all its descendants.
For example if looking for a `return` statement anywhere within a method definition,
we can write:

----
(def _method_name _args `return)
----

This would match both of the methods `foo` and `bar` below, even though
the `return` nodes in `foo` and `bar` are not at the same level.

----
def foo              # (def :foo
  return 42          #   (args)
end                  #   (return
                     #     (int 42)))

def bar              # (def :bar
  return 42 if foo   #   (args)
  nil                #   (begin
end                  #     (if
                     #       (send nil :foo)
                     #       (return
                     #         (int 42)) nil)
                     #     (nil)))
----

== Predicate methods

Words that end with a `?` are predicate methods; they are called on the target
to see if it matches. Any Ruby method that the matched object supports can be
used.

Example:

* `int_type?` can be used here in place of `(int _)`.

And refactoring the expression to allow both int or float types:

* `{int_type? float_type?}` can be used here in place of `({int float} _)`

You can also use a predicate inside a node, where it is called on the corresponding child:

* `(int odd?)` will match only odd numbers, since `odd?` is called on the
integer value itself.

== `#` to call functions

Sometimes, we want to add extra logic. Let's imagine we're searching for
prime numbers, so we have a method to detect them:

[source,ruby]
----
def prime?(n)
  if n <= 1
    false
  elsif n == 2
    true
  else
    (2..n/2).none? { |i| n % i == 0 }
  end
end
----

We can use the `#prime?` function directly in the expression:

----
(int #prime?)
----
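
The function receives the matched element as its argument, here the integer value held by the `int` node. Because it is an ordinary Ruby method, it can be checked on its own before being used in a pattern. A small sketch, where `primes_in` and the sample list are made up for illustration:

```ruby
def prime?(n)
  if n <= 1
    false
  elsif n == 2
    true
  else
    (2..n / 2).none? { |i| n % i == 0 }
  end
end

# Hypothetical helper: keep only the values that `#prime?` would accept.
def primes_in(values)
  values.select { |value| prime?(value) }
end

puts primes_in([1, 2, 3, 4, 5, 7, 9]).inspect
```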

You may call a method on a constant too. Let's say you define:

[source,ruby]
----
module Util
  def self.palindrome?(str)
    str == str.reverse
  end
end
----

You can refer to it like this:
----
(str #Util.palindrome?)
----

== Arguments for predicate and function calls

Arguments, such as literals or parameters, can be passed to predicates and function calls:

[source,ruby]
----
def divisible_by?(value, divisor)
  value % divisor == 0
end
----

Example patterns using this function:
----
(int #divisible_by?(42))
(send (int _value) :+ (int #divisible_by?(_value)))
----
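
Judging by the signature of `divisible_by?`, the element being matched arrives as the first argument and the arguments written in the pattern follow it. A plain-Ruby sketch of the calls these two patterns would presumably make; the concrete numbers are illustrative:

```ruby
def divisible_by?(value, divisor)
  value % divisor == 0
end

# `(int #divisible_by?(42))` applied to the literal 84:
# the matched value is 84 and the pattern argument is 42.
puts divisible_by?(84, 42)

# The second pattern applied to `3 + 9` (assumption: `_value` is bound
# to 3 by the receiver, then passed along as the divisor for 9).
puts divisible_by?(9, 3)

# With `3 + 10` the right-hand literal is not divisible by 3.
puts divisible_by?(10, 3)
```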

The arguments can be patterns themselves, in which case a matcher responding to `===` will be passed. This makes patterns composable:

```ruby
def_node_matcher :global_const?, '(const {nil? cbase} %1)'
def_node_matcher :class_creator, '(send #global_const?({:Class :Module}) :new ...)'
```

== Using node matcher macros

The RuboCop base includes two useful methods to use the node pattern with Ruby in a
simple way. You can use the macros to define methods. The basics are
https://www.rubydoc.info/gems/rubocop-ast/RuboCop/AST/NodePattern/Macros#def_node_matcher-instance_method[def_node_matcher]
and https://www.rubydoc.info/gems/rubocop-ast/RuboCop/AST/NodePattern/Macros#def_node_search-instance_method[def_node_search].

When you define a pattern, it creates a method that accepts a node and tries to match.

Let's create an example where we're trying to find the symbols `user` and
`current_user` in expressions like `user: current_user` or
`current_user: User.first`, so the objective here is to pick out all the keys:

[source,sh]
----
$ ruby-parse -e ':current_user'
(sym :current_user)
$ ruby-parse -e ':user'
(sym :user)
$ ruby-parse -e '{ user: current_user }'
(hash
  (pair
    (sym :user)
    (send nil :current_user)))
----

Our minimal matcher can find them by matching a simple `sym` node:

[source,ruby]
----
def_node_matcher :user_symbol?, '(sym {:current_user :user})'
----

=== Composing complex expressions with multiple matchers

Now let's go deeper by combining the previous expression with a check that the
symbol is used as a key in a call to an initialization method, like:

[source,sh]
----
$ ruby-parse --legacy -e 'Comment.new(user: current_user)'
(send
  (const nil :Comment) :new
  (hash
    (pair
      (sym :user)
      (send nil :current_user))))
----

And we can also reuse this and check if it's a constructor:

[source,ruby]
----
def_node_matcher :initializing_with_user?, <<~PATTERN
  (send _ :new (hash (pair #user_symbol? _)))
PATTERN
----

== `%` for arguments

Arguments can be passed to matchers, either as external method arguments,
or to be used to compare elements. An example of method argument:

[source,ruby]
----
def multiple_of?(n, factor)
  n % factor == 0
end

def_node_matcher :int_node_multiple?, '(int #multiple_of?(%1))'

# ...

int_node_multiple?(node, 10) # => true if node is an 'int' node with a multiple of 10
----

Arguments can be used to match nodes directly:

[source,ruby]
----
def_node_matcher :has_sensitive_data?, '(hash <(pair (_ %1) $_) ...>)'

# ...

has_sensitive_data?(node, :password) # => true if node is a hash with a key +:password+

# matching uses ===, so to match strings or symbols, 'pass' or 'password' one can:
has_sensitive_data?(node, /^pass(word)?$/i)

# one can also pass lambdas that return true or false depending on the key
# (the body below is only an illustration)
has_sensitive_data?(node, ->(key) { key.to_s.include?('secret') })
----

NOTE: `Array#===` will never match a single node element (so don't pass arrays),
but `Set#===` is an alias to `Set#include?` (Ruby 2.5+ only), and so can be
very useful to match within many possible literals / Nodes.
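
Since arguments are compared with `===`, the effect of each kind of argument can be previewed in plain Ruby. A small sketch, with `:password` standing in for the key extracted from the node; `matches?` and the sample values are illustrative:

```ruby
require 'set'

# Hypothetical helper: the comparison applied to a pattern argument.
def matches?(argument, key)
  argument === key
end

key = :password

puts matches?(:password, key)                  # plain equality
puts matches?(/^pass(word)?$/i, key)           # Regexp#=== also accepts symbols
puts matches?(Set[:password, :secret], key)    # Set#=== behaves like include?
puts matches?(->(k) { k.to_s.size > 4 }, key)  # Proc#=== calls the lambda
puts matches?([:password], key)                # Array#=== is ==, so this is false
```

Only the last call prints `false`, which is why a `Set` is the better container for a list of accepted literals.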

== `%param_name` for named parameters

Arguments can be passed as named parameters. They will be matched using `===`
(see `%` above).

Contrary to positional arguments, default values can be passed to
`def_node_matcher` and `def_node_search`:

[source,ruby]
----
def_node_matcher :interesting_call?, '(send _ %method ...)',
                 method: Set[:transform_values, :transform_keys,
                             :transform_values!, :transform_keys!,
                             :to_h].freeze

# Usage:

interesting_call?(node) # use the default methods
interesting_call?(node, method: /^transform/) # match anything starting with 'transform'
----

Named parameters as arguments to custom methods are also supported.

== `CONST` or `%CONST` for constants

Constants can be included in patterns. They will be matched using `===`, so
+Regexp+ / +Set+ / +Proc+ can be used in addition to literals and +Nodes+:

[source,ruby]
----
SOME_CALLS = Set[:transform_values, :transform_keys,
                 :transform_values!, :transform_keys!,
                 :to_h].freeze

def_node_matcher :interesting_call?, '(send _ SOME_CALLS ...)'

----

Constants as arguments to custom methods are also supported.

== Comments

You may have comments in node patterns at the end of lines
by preceding them with `'# '`:

[source,ruby]
----
def_node_matcher :complex_stuff, <<~PATTERN
  (send
    {#global_const?(:Kernel) nil?}  # check for explicit call like Kernel.p too
    {:p :pp}                        # let's consider `pp` also
    $...                            # capture all arguments
  )
PATTERN
----

== `nil` or `nil?`

Pay special attention to the behavior of `nil`:

[source,sh]
----
$ ruby-parse -e 'nil'
(nil)
----

In this case, the `nil` literal can be matched with expressions like `nil`, `(nil)`, or `nil_type?`.

However, `nil` is also used to represent the missing receiver of a simple method call:

[source,sh]
----
$ ruby-parse -e 'method'
(send nil :method)
----

For such cases, use the predicate `nil?`. The code can then be
matched with an expression like:

----
(send nil? :method)
----

== More resources

Curious about how it works?

Check more details in the
https://www.rubydoc.info/gems/rubocop-ast/RuboCop/AST/NodePattern[documentation]
or browse the https://github.com/rubocop/rubocop-ast/blob/master/lib/rubocop/ast/node_pattern.rb[source code]
directly. It's easy to read and hack on.

The https://github.com/rubocop/rubocop-ast/blob/master/spec/rubocop/ast/node_pattern_spec.rb[specs]
are also very useful to comprehend each feature.