.. _mdp_dataset:

Replay Buffer
=============

.. module:: d3rlpy.dataset

You can also check out advanced use cases in the `examples <https://github.com/takuseno/d3rlpy/tree/master/examples>`_ directory.

MDPDataset
~~~~~~~~~~

d3rlpy provides a useful dataset structure for data-driven deep reinforcement
learning.
In supervised learning, the training script iterates over input data :math:`X` and
label data :math:`Y`.
In reinforcement learning, however, mini-batches consist of tuples of
:math:`(s_t, a_t, r_t, s_{t+1})` and episode terminal flags.
Converting a set of observations, actions, rewards and terminal flags into
these tuples is tedious and requires extra coding.

Therefore, d3rlpy provides the ``MDPDataset`` class, which lets you handle
reinforcement learning datasets without any extra effort.

.. code-block:: python

    import numpy as np

    import d3rlpy

    # 1000 steps of observations with shape (100,)
    observations = np.random.random((1000, 100))
    # 1000 steps of actions with shape (4,)
    actions = np.random.random((1000, 4))
    # 1000 steps of rewards
    rewards = np.random.random(1000)
    # 1000 steps of terminal flags
    terminals = np.random.randint(2, size=1000)

    dataset = d3rlpy.dataset.MDPDataset(observations, actions, rewards, terminals)

    # save as HDF5
    with open("dataset.h5", "w+b") as f:
        dataset.dump(f)

    # load from HDF5
    with open("dataset.h5", "rb") as f:
        new_dataset = d3rlpy.dataset.ReplayBuffer.load(f, d3rlpy.dataset.InfiniteBuffer())


Note that ``observations``, ``actions``, ``rewards`` and ``terminals`` must be
aligned so that the entries at each index belong to the same timestep.

.. code-block:: python

  observations = [s1, s2, s3, ...]
  actions      = [a1, a2, a3, ...]
  rewards      = [r1, r2, r3, ...]  # r1 = r(s1, a1)
  terminals    = [t1, t2, t3, ...]  # t1 = t(s1, a1)

``MDPDataset`` is actually a shortcut for the ``ReplayBuffer`` class.
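
Under the hood, this shortcut is roughly equivalent to building a ``ReplayBuffer``
from the same episodes with the default components (a sketch; the exact defaults
may differ between versions):

.. code-block:: python

    # roughly what the MDPDataset shortcut above expands to
    equivalent_buffer = d3rlpy.dataset.ReplayBuffer(
        buffer=d3rlpy.dataset.InfiniteBuffer(),
        transition_picker=d3rlpy.dataset.BasicTransitionPicker(),
        trajectory_slicer=d3rlpy.dataset.BasicTrajectorySlicer(),
        writer_preprocessor=d3rlpy.dataset.BasicWriterPreprocess(),
        episodes=dataset.episodes,
    )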


.. autosummary::
   :toctree: generated/
   :nosignatures:

   d3rlpy.dataset.MDPDataset


Replay Buffer
~~~~~~~~~~~~~

``ReplayBuffer`` is a class that represents an experience replay buffer in d3rlpy.
In d3rlpy, ``ReplayBuffer`` is a highly modularized interface for flexibility.
You can compose its sub-components ``Buffer``, ``TransitionPicker``, ``TrajectorySlicer`` and ``WriterPreprocess`` to customize experiments.

.. code-block:: python

   import numpy as np

   import d3rlpy

   # Buffer component
   buffer = d3rlpy.dataset.FIFOBuffer(limit=100000)

   # TransitionPicker component
   transition_picker = d3rlpy.dataset.BasicTransitionPicker()

   # TrajectorySlicer component
   trajectory_slicer = d3rlpy.dataset.BasicTrajectorySlicer()

   # WriterPreprocess component
   writer_preprocessor = d3rlpy.dataset.BasicWriterPreprocess()

   # The replay buffer needs signatures of observations, actions and rewards.
   # They can be inferred from an environment, taken from episodes, or given manually.

   # Option 1: Initialize with Gym environment
   import gym
   env = gym.make("Pendulum-v1")
   replay_buffer = d3rlpy.dataset.ReplayBuffer(
      buffer=buffer,
      transition_picker=transition_picker,
      trajectory_slicer=trajectory_slicer,
      writer_preprocessor=writer_preprocessor,
      env=env,
   )

   # Option 2: Initialize with pre-collected dataset
   dataset, _ = d3rlpy.datasets.get_pendulum()
   replay_buffer = d3rlpy.dataset.ReplayBuffer(
      buffer=buffer,
      transition_picker=transition_picker,
      trajectory_slicer=trajectory_slicer,
      writer_preprocessor=writer_preprocessor,
      episodes=dataset.episodes,
   )

   # Option 3: Initialize with manually specified signatures
   observation_signature = d3rlpy.dataset.Signature(shape=[(3,)], dtype=[np.float32])
   action_signature = d3rlpy.dataset.Signature(shape=[(1,)], dtype=[np.float32])
   reward_signature = d3rlpy.dataset.Signature(shape=[(1,)], dtype=[np.float32])
   replay_buffer = d3rlpy.dataset.ReplayBuffer(
      buffer=buffer,
      transition_picker=transition_picker,
      trajectory_slicer=trajectory_slicer,
      writer_preprocessor=writer_preprocessor,
      observation_signature=observation_signature,
      action_signature=action_signature,
      reward_signature=reward_signature,
   )

   # shortcut for the composition above with FIFOBuffer and default components
   replay_buffer = d3rlpy.dataset.create_fifo_replay_buffer(limit=100000, env=env)

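As a quick usage sketch, a replay buffer created this way can both record new
experience and serve mini-batches. The method names used below (``append``,
``clip_episode`` and ``sample_transition_batch``) are stated as assumptions;
see the API reference below for the authoritative signatures.

.. code-block:: python

   import gym

   import d3rlpy

   env = gym.make("Pendulum-v1")
   replay_buffer = d3rlpy.dataset.create_fifo_replay_buffer(limit=100000, env=env)

   # record one episode of random experience (Gym >= 0.26 step API assumed)
   observation, _ = env.reset()
   while True:
       action = env.action_space.sample()
       next_observation, reward, terminated, truncated, _ = env.step(action)
       # observation, action and reward must belong to the same timestep
       replay_buffer.append(observation, action, reward)
       observation = next_observation
       if terminated or truncated:
           # close the episode; terminated=False marks a time-limit truncation
           replay_buffer.clip_episode(terminated)
           break

   # sample a mini-batch of transitions for Q-learning-based algorithms
   batch = replay_buffer.sample_transition_batch(batch_size=256)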

.. autosummary::
   :toctree: generated/
   :nosignatures:

   d3rlpy.dataset.ReplayBufferBase
   d3rlpy.dataset.ReplayBuffer
   d3rlpy.dataset.MixedReplayBuffer
   d3rlpy.dataset.create_infinite_replay_buffer
   d3rlpy.dataset.create_fifo_replay_buffer


Buffer
~~~~~~

``Buffer`` is a list-like component that stores and drops transitions.
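
For example, the built-in buffers differ in their retention policy:

.. code-block:: python

   import d3rlpy

   # keeps every transition; a natural choice for fixed offline datasets
   infinite_buffer = d3rlpy.dataset.InfiniteBuffer()

   # drops the oldest transitions once the limit is exceeded; a natural
   # choice for online training with bounded memory
   fifo_buffer = d3rlpy.dataset.FIFOBuffer(limit=100000)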

.. autosummary::
   :toctree: generated/
   :nosignatures:

   d3rlpy.dataset.BufferProtocol
   d3rlpy.dataset.InfiniteBuffer
   d3rlpy.dataset.FIFOBuffer



TransitionPicker
~~~~~~~~~~~~~~~~

``TransitionPicker`` is a component that defines how to pick transition data used for Q-learning-based algorithms.
You can also implement your own ``TransitionPicker`` for custom experiments.

.. code-block:: python

   import d3rlpy

   # Example TransitionPicker that picks a standard single-step transition
   class CustomTransitionPicker(d3rlpy.dataset.TransitionPickerProtocol):
       def __call__(self, episode: d3rlpy.dataset.EpisodeBase, index: int) -> d3rlpy.dataset.Transition:
          observation = episode.observations[index]
          is_terminal = episode.terminated and index == episode.size() - 1
          if is_terminal:
              next_observation = d3rlpy.dataset.create_zero_observation(observation)
          else:
              next_observation = episode.observations[index + 1]
          return d3rlpy.dataset.Transition(
              observation=observation,
              action=episode.actions[index],
              reward=episode.rewards[index],
              next_observation=next_observation,
              terminal=float(is_terminal),
              interval=1,
          )

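A custom picker can be plugged into ``ReplayBuffer`` in place of
``BasicTransitionPicker``. A sketch, reusing ``CustomTransitionPicker`` from
above and assuming the remaining components fall back to their ``Basic*``
defaults:

.. code-block:: python

   import gym

   import d3rlpy

   env = gym.make("Pendulum-v1")

   replay_buffer = d3rlpy.dataset.ReplayBuffer(
       buffer=d3rlpy.dataset.FIFOBuffer(limit=100000),
       transition_picker=CustomTransitionPicker(),
       env=env,
   )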

.. autosummary::
   :toctree: generated/
   :nosignatures:

   d3rlpy.dataset.TransitionPickerProtocol
   d3rlpy.dataset.BasicTransitionPicker
   d3rlpy.dataset.FrameStackTransitionPicker
   d3rlpy.dataset.MultiStepTransitionPicker
   d3rlpy.dataset.SparseRewardTransitionPicker


TrajectorySlicer
~~~~~~~~~~~~~~~~

``TrajectorySlicer`` is a component that defines how to slice trajectory data used for Decision Transformer-based algorithms.
You can also implement your own ``TrajectorySlicer`` for custom experiments.

.. code-block:: python

   import numpy as np

   import d3rlpy

   class CustomTrajectorySlicer(d3rlpy.dataset.TrajectorySlicerProtocol):
       def __call__(
           self, episode: d3rlpy.dataset.EpisodeBase, end_index: int, size: int
       ) -> d3rlpy.dataset.PartialTrajectory:
           end = end_index + 1
           start = max(end - size, 0)
           actual_size = end - start

           # prepare terminal flags
           terminals = np.zeros((actual_size, 1), dtype=np.float32)
           if episode.terminated and end_index == episode.size() - 1:
               terminals[-1][0] = 1.0

           # slice data
           observations = episode.observations[start:end]
           actions = episode.actions[start:end]
           rewards = episode.rewards[start:end]
           ret = np.sum(episode.rewards[start:])
           all_returns_to_go = ret - np.cumsum(episode.rewards[start:], axis=0)
           returns_to_go = all_returns_to_go[:actual_size].reshape((-1, 1))

           # prepare metadata
           timesteps = np.arange(start, end)
           masks = np.ones(end - start, dtype=np.float32)

           # compute backward padding size
           pad_size = size - actual_size

           if pad_size == 0:
               return d3rlpy.dataset.PartialTrajectory(
                   observations=observations,
                   actions=actions,
                   rewards=rewards,
                   returns_to_go=returns_to_go,
                   terminals=terminals,
                   timesteps=timesteps,
                   masks=masks,
                   length=size,
               )

           return d3rlpy.dataset.PartialTrajectory(
               observations=d3rlpy.dataset.batch_pad_observations(observations, pad_size),
               actions=d3rlpy.dataset.batch_pad_array(actions, pad_size),
               rewards=d3rlpy.dataset.batch_pad_array(rewards, pad_size),
               returns_to_go=d3rlpy.dataset.batch_pad_array(returns_to_go, pad_size),
               terminals=d3rlpy.dataset.batch_pad_array(terminals, pad_size),
               timesteps=d3rlpy.dataset.batch_pad_array(timesteps, pad_size),
               masks=d3rlpy.dataset.batch_pad_array(masks, pad_size),
               length=size,
           )

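Like the transition picker, a custom slicer is simply passed to ``ReplayBuffer``.
A sketch, reusing ``CustomTrajectorySlicer`` from above:

.. code-block:: python

   import d3rlpy

   dataset, _ = d3rlpy.datasets.get_pendulum()

   replay_buffer = d3rlpy.dataset.ReplayBuffer(
       buffer=d3rlpy.dataset.InfiniteBuffer(),
       trajectory_slicer=CustomTrajectorySlicer(),
       episodes=dataset.episodes,
   )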

.. autosummary::
   :toctree: generated/
   :nosignatures:

   d3rlpy.dataset.TrajectorySlicerProtocol
   d3rlpy.dataset.BasicTrajectorySlicer
   d3rlpy.dataset.FrameStackTrajectorySlicer



WriterPreprocess
~~~~~~~~~~~~~~~~

``WriterPreprocess`` is a component that defines how experiences are preprocessed before they are written to the replay buffer.
You can also implement your own ``WriterPreprocess`` for custom experiments.


.. code-block:: python

   import numpy as np

   import d3rlpy

   class CustomWriterPreprocess(d3rlpy.dataset.WriterPreprocessProtocol):
       def process_observation(self, observation: d3rlpy.dataset.Observation) -> d3rlpy.dataset.Observation:
           return observation

       def process_action(self, action: np.ndarray) -> np.ndarray:
           return action

       def process_reward(self, reward: np.ndarray) -> np.ndarray:
           return reward

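For instance, a writer that clips rewards before they are stored only needs to
override ``process_reward``. This is a hypothetical sketch, not a built-in
component:

.. code-block:: python

   import gym
   import numpy as np

   import d3rlpy

   # hypothetical example: clip rewards to [-1, 1] at write time
   class RewardClipWriterPreprocess(d3rlpy.dataset.BasicWriterPreprocess):
       def process_reward(self, reward: np.ndarray) -> np.ndarray:
           return np.clip(reward, -1.0, 1.0)

   replay_buffer = d3rlpy.dataset.ReplayBuffer(
       buffer=d3rlpy.dataset.FIFOBuffer(limit=100000),
       writer_preprocessor=RewardClipWriterPreprocess(),
       env=gym.make("Pendulum-v1"),
   )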


.. autosummary::
   :toctree: generated/
   :nosignatures:

   d3rlpy.dataset.WriterPreprocessProtocol
   d3rlpy.dataset.BasicWriterPreprocess
   d3rlpy.dataset.LastFrameWriterPreprocess