Configuration
-------------

Description
+++++++++++++++++++

airflow.yml file
~~~~~~~~~~~~~~~~~~~~~~~
.. list-table::
   :widths: 25 20 2 13 40
   :header-rows: 1

   * - Parameter
     - Data type
     - Required
     - Default
     - Description
   * - default_args
     - dictionary
     - x
     -
     - Values passed to the DAG as ``default_args`` (see the Airflow documentation for details)
   * - dag
     - dictionary
     - x
     -
     - Values used for DAG creation. Currently ``dag_id``, ``description``, ``schedule_interval`` and ``catchup`` are supported; see the Airflow documentation for details about each of them.
   * - seed_task
     - boolean
     -
     - False
     - Enables a first task in the DAG responsible for executing the ``dbt seed`` command to load seed data.
   * - manifest_file_name
     - string
     -
     - manifest.json
     - Name of the file containing the dbt manifest.
   * - use_task_group
     - boolean
     -
     - False
     - Enables grouping ``dbt run`` and ``dbt test`` into Airflow Task Groups. Available only in Airflow 2+ (see the Airflow documentation for details).
   * - show_ephemeral_models
     - boolean
     -
     - True
     - Enables/disables separate tasks for dbt's ephemeral models. Such tasks finish in seconds, as they have nothing to do.
   * - failure_handlers
     - list
     -
     - empty list
     - Each item of the list configures a notification handler fired when a task or the DAG fails. Each item is a dictionary with the following fields:
       ``type`` (type of handler, e.g. *slack* or *teams*), ``webserver_url`` (Airflow webserver URL), ``connection_id`` (id of the Airflow connection) and ``message_template`` (the message that will be sent).
       More on how to configure the webhooks can be found here: `Slack <https://airflow.apache.org/docs/apache-airflow-providers-slack/stable/_api/airflow/providers/slack/operators/slack_webhook/index.html>`_ or `MS Teams <https://code.mendhak.com/Airflow-MS-Teams-Operator/#copy-hook-and-operator>`_
   * - enable_project_dependencies
     - boolean
     -
     - False
     - When ``True``, sensors are created for all sources that have a DAG name in their metadata. The sensors wait for the selected DAGs to finish.
   * - save_points
     - list of strings
     -
     - empty list
     - List of schemas between which the gateway should be created.
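
Putting these parameters together, a minimal ``airflow.yml`` sketch might look as follows. All values below are placeholders, not recommendations; see the example files linked at the end of this page for tested configurations.

.. code-block:: yaml

   default_args:
     owner: analytics-team        # placeholder owner
     retries: 1
   dag:
     dag_id: my-dbt-dag           # placeholder DAG id
     description: Daily dbt run
     schedule_interval: "0 6 * * *"
     catchup: false
   seed_task: true
   use_task_group: true           # requires Airflow 2+
   show_ephemeral_models: false
   failure_handlers:
     - type: slack
       webserver_url: https://airflow.example.com   # placeholder URL
       connection_id: slack_failure_notifications   # placeholder Airflow connection id
       message_template: "DAG run failed"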

dbt.yml file
~~~~~~~~~~~~~~~~~~~~~~~
.. list-table::
   :widths: 25 20 2 13 40
   :header-rows: 1

   * - Parameter
     - Data type
     - Required
     - Default
     - Description
   * - target
     - string
     - x
     -
     - Name of the dbt target to use, as defined in dbt's ``profiles.yml``.
   * - project_dir_path
     - string
     -
     - /dbt
     - Path to the directory containing the dbt project.
   * - profile_dir_path
     - string
     -
     - /root/.dbt
     - Path to the directory containing dbt's ``profiles.yml`` file.
   * - vars
     - dictionary
     -
     -
     - Dictionary of variables passed to dbt tasks.
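
An illustrative ``dbt.yml``; the target name and the variable are placeholders:

.. code-block:: yaml

   target: dev                     # placeholder dbt target
   project_dir_path: /dbt          # default
   profile_dir_path: /root/.dbt    # default
   vars:
     ingestion_date: "2021-01-01"  # example variable passed to dbt tasks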

execution_env.yml file
~~~~~~~~~~~~~~~~~~~~~~~

.. list-table::
   :widths: 25 20 2 53
   :header-rows: 1

   * - Parameter
     - Data type
     - Required
     - Description
   * - image.repository
     - string
     - x
     - Docker image repository URL
   * - image.tag
     - string
     - x
     - Docker image tag
   * - type
     - string
     - x
     - Selects the type of execution environment. Currently only ``k8s`` is available.
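
An example ``execution_env.yml``, assuming the dotted parameter names above map to nested YAML keys (the repository URL and tag are placeholders):

.. code-block:: yaml

   image:
     repository: registry.example.com/my-dbt-project  # placeholder Docker repository
     tag: "1.0.0"                                     # placeholder image tag
   type: k8s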

k8s.yml file
~~~~~~~~~~~~~~~~~~~~~~~

.. list-table::
   :widths: 25 20 2 53
   :header-rows: 1

   * - Parameter
     - Data type
     - Required
     - Description
   * - image_pull_policy
     - string
     - x
     - See kubernetes documentation for details: https://kubernetes.io/docs/concepts/containers/images/#image-pull-policy
   * - namespace
     - string
     - x
     - Name of the Kubernetes namespace in which the processing pods run
   * - envs
     - dictionary
     -
     - Environment variables that will be passed to the container
   * - secrets
     - list of dictionaries
     -
     - List of secrets mounted to each container. Each item must set ``secret`` (the secret's name), ``deploy_type`` (``env`` or ``volume``) and ``deploy_target``, which is a path for the ``volume`` type and an environment variable name for ``env``.
   * - labels
     - dictionary
     -
     - Dictionary of labels set on created pods
   * - annotations
     - dictionary
     -
     - Annotations applied to created pods
   * - is_delete_operator_pod
     - boolean
     -
     - If set to ``True``, finished pods will be deleted
   * - config_file
     - string
     -
     - Path to the k8s configuration available in Airflow
   * - resources.node_selectors
     - dictionary
     -
     - See more details in Kubernetes documentation: https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#nodeselector
   * - resources.tolerations
     - list of dictionaries
     -
     - See more details in Kubernetes documentation: https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/
   * - resources.limit
     - dictionary
     -
     - See more details in Kubernetes documentation: https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/
   * - resources.requests
     - dictionary
     -
     - See more details in Kubernetes documentation: https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/
   * - execution_script
     - string
     -
     - Script that will be executed inside the pod
   * - in_cluster
     - boolean
     -
     - Run the Kubernetes client with in-cluster configuration
   * - cluster_context
     - string
     -
     - Context that points to the Kubernetes cluster; ignored when ``in_cluster`` is ``True``. If ``None``, the current-context is used.
   * - startup_timeout_seconds
     - integer
     -
     - Timeout in seconds for the pod startup.
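
A sketch of a ``k8s.yml`` combining the most commonly used parameters. All names and values below are placeholders, and the nested ``resources`` keys follow the dotted parameter names above:

.. code-block:: yaml

   image_pull_policy: IfNotPresent
   namespace: dbt-workloads            # placeholder namespace
   envs:
     DBT_TARGET: dev                   # example environment variable
   secrets:
     - secret: warehouse-credentials   # name of the Kubernetes secret (placeholder)
       deploy_type: env                # "env" or "volume"
       deploy_target: DB_PASSWORD      # env variable name (a path for the "volume" type)
   labels:
     app: dbt
   is_delete_operator_pod: true
   resources:
     requests:
       memory: 1Gi
       cpu: 500m
     limit:
       memory: 2Gi
       cpu: "1"
   startup_timeout_seconds: 300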

airbyte.yml file
~~~~~~~~~~~~~~~~~~~~~~~
.. list-table::
   :widths: 25 20 2 13 40
   :header-rows: 1

   * - Parameter
     - Data type
     - Required
     - Default
     - Description
   * - airbyte_connection_id
     - string
     - x
     -
     - ID of the Airbyte connection in the Airflow instance. Remember to add ``apache-airflow-providers-airbyte``
       to Airflow's dependencies to be able to create such a connection.
   * - tasks
     - list of objects
     - x
     -
     - Each task consists of the following fields:

       **task_id**: string - name of the task as shown in Airflow

       **connection_id**: string - id of the Airbyte connection

       **asynchronous**: boolean - flag to get the ``job_id`` after submitting the job to the Airbyte API

       **api_version**: string - Airbyte API version

       **wait_seconds**: integer - number of seconds between checks; only used when ``asynchronous`` is ``False``

       **timeout**: float - the amount of time, in seconds, to wait for the request to complete
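
An illustrative ``airbyte.yml``; both connection ids below are placeholders:

.. code-block:: yaml

   airbyte_connection_id: airbyte_default                  # Airflow connection id (placeholder)
   tasks:
     - task_id: ingest_orders                              # placeholder task name
       connection_id: 3fa85f64-5717-4562-b3fc-2c963f66afa6 # Airbyte connection id (placeholder)
       asynchronous: false
       api_version: v1
       wait_seconds: 3
       timeout: 110.0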


ingestion.yml file
~~~~~~~~~~~~~~~~~~~~~~~
.. list-table::
   :widths: 25 20 2 13 40
   :header-rows: 1

   * - Parameter
     - Data type
     - Required
     - Default
     - Description
   * - enable
     - boolean
     - x
     -
     - Specifies whether ingestion tasks should be added to the Airflow DAG.
   * - engine
     - string
     - x
     -
     - Enumeration-based option; currently the only supported value is ``airbyte``
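
A corresponding ``ingestion.yml`` that enables Airbyte-based ingestion could look like this:

.. code-block:: yaml

   enable: true
   engine: airbyte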


Example files
+++++++++++++++++++

To see complete, correct configurations, look at the example configuration files in the
`tests directory <https://github.com/getindata/dbt-airflow-factory/tree/develop/tests/config>`_.