{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "9c71b839",
   "metadata": {},
   "source": [
    "# Glossary\n",
    "\n",
    "### Namespace \n",
    "\n",
    "> The atomic unit of the knowledge system containing data and metadata\n",
    "\n",
    "- defines\n",
    "  - tables\n",
    "  - composite types\n",
    "  - entity classes\n",
    "  - code to build data based on these\n",
    "- represented by\n",
    "  - a module in a data project (as datascript) - nested right below the main (src) module\n",
    "  - a set of YAML files in `{namespace name}/**.yaml` as serialized metadata in the released sdist\n",
    "    - automatically generated from the code\n",
    "  - an exported .py file with basic datascript in `{namespace name}/__init__.py` in the released sdist\n",
    "- can import other namespaces, either to use\n",
    "  - data (even for foreign keys in tables)\n",
    "  - defined composite types / entity classes\n",
    "\n",
    "### Data Project\n",
    "\n",
    "> A versioned set of interconnected namespaces with metadata and different environments\n",
    "\n",
    "- defines\n",
    "  - namespaces\n",
    "  - different environments where (usually) the same code runs for different data\n",
    "- represented by\n",
    "  - a git repository\n",
    "    - is a DVC repository\n",
    "    - based on a [template]\n",
    "    - has fixed form tags representing the releases and data versions\n",
    "\n",
    "\n",
    "### Registry\n",
    "\n",
    "> A repository containing data about the releases and dependencies of projects to make importing namespaces straightforward\n",
    "\n",
    "- represented by\n",
    "  - a git repository (either local or remote)\n",
    "  - write access needed to the repo to release to it\n",
    "- contains data about \n",
    "  - (named) projects\n",
    "    - URI\n",
    "    - versions\n",
    "    - environment->dvc remote mapping\n",
    "- contains sdist forms of metadata of projects release there\n",
    "  - to set up a special PyPI index so that installation and dependency resolution is outsourced\n",
    "\n",
    "\n",
    "### Metadata\n",
    "\n",
    "> Information about the data contained in projects\n",
    "\n",
    "- defines\n",
    "  - for each namespace\n",
    "    - tables\n",
    "    - composite types\n",
    "    - entity classes\n",
    "- represented\n",
    "  - in a project repository\n",
    "    - defined in code (datascript object)\n",
    "      - scrutable\n",
    "      - entitybase\n",
    "      - compositetypebase\n",
    "    - serialized  (generated from code)\n",
    "      - YAML files\n",
    "  - in runtime\n",
    "    - converted as soon as possible to dataclasses in bedrock module\n",
    "  - some even in data output in parquet\n",
    "\n",
    "\n",
    "### Config\n",
    "\n",
    "\n",
    "- defines\n",
    "  - name\n",
    "  - version (this is the metadata version, the data version is determined at release)\n",
    "  - default-environment name (the first environment in envs config by default)\n",
    "  - validation-environments (the default-environment by default)\n",
    "  - registry address (the SSCUB registry by default)  TODO: link\n",
    "  - imported_projects\n",
    "    - either a list of project names to be imported, where other than name, all default values are used\n",
    "    - or a dictionary, where the key is the project name (in the registry), and values are:\n",
    "      - version (metadata version)\n",
    "      - data_namespaces - the namespaces where loading the data is required\n",
    "  - in `envs` for each environment  (one empty env named complete by default)\n",
    "    - params for all local namespaces and global params (namespace params default to these if not defined)\n",
    "      - logged to DVC from here\n",
    "    - environments of imported projects (where data is needed)\n",
    "    - specific DVC remote\n",
    "      - where to push data generated as outputs of running the code from namespaces (TODO - find a proper name, e.g. namespace processor) - identified by the name of a remote defined in DVC config\n",
    "    - parent env (default-environment by default)\n",
    "      - all missing keys of parameters or imported ns \n",
    "- represented as `zimmer.yaml` in project root\n",
    "\n",
    "\n",
    "### Environment\n",
    "\n",
 A complete run of the code">
    "> A complete run of the code in a project with its values for parameters and environments for imported data\n",
    "\n",
    "- defined by config\n",
    "- ...\n",
    "\n",
    "\n",
    "### Tabular data related phrases\n",
    "\n",
    "- feature: *A named set of columns in a table*\n",
    "    - can be a primitive feature, foreign key or composite feature\n",
    "- the subject of records: entity class that is represented in a table\n",
    "\n",
    "\n",
    "[template]: TODO"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}