notebooks/doc-002-glossary.ipynb
{
"cells": [
{
"cell_type": "markdown",
"id": "9c71b839",
"metadata": {},
"source": [
"# Glossary\n",
"\n",
"### Namespace \n",
"\n",
"> The atomic unit of the knowledge system containing data and metadata\n",
"\n",
"- defines\n",
" - tables\n",
" - composite types\n",
" - entity classes\n",
" - code to build data based on these\n",
"- represented by\n",
" - a module in a data project (as datascript) - nested right below the main (src) module\n",
" - a set of YAML files in `{namespace name}/**.yaml` as serialized metadata in the released sdist\n",
" - automatically generated from the code\n",
" - an exported .py file with basic datascript in `{namespace name}/__init__.py` in the released sdist\n",
"- can import other namespaces, either to use\n",
" - data (even for foreign keys in tables)\n",
" - defined composite types / entity classes\n",
"\n",
"### Data Project\n",
"\n",
"> A versioned set of interconnected namespaces with metadata and different environments\n",
"\n",
"- defines\n",
" - namespaces\n",
" - different environments where (usually) the same code runs for different data\n",
"- represented by\n",
" - a git repository\n",
" - is a DVC repository\n",
" - based on a [template]\n",
" - has fixed form tags representing the releases and data versions\n",
"\n",
"\n",
"### Registry\n",
"\n",
"> A repository containing data about the releases and dependencies of projects to make importing namespaces straightforward\n",
"\n",
"- represented by\n",
" - a git repository (either local or remote)\n",
" - write access needed to the repo to release to it\n",
"- contains data about \n",
" - (named) projects\n",
" - URI\n",
" - versions\n",
" - environment->dvc remote mapping\n",
"- contains sdist forms of metadata of projects release there\n",
" - to set up a special PyPI index so that installation and dependency resolution is outsourced\n",
"\n",
"\n",
"### Metadata\n",
"\n",
"> Information about the data contained in projects\n",
"\n",
"- defines\n",
" - for each namespace\n",
" - tables\n",
" - composite types\n",
" - entity classes\n",
"- represented\n",
" - in a project repository\n",
" - defined in code (datascript object)\n",
" - scrutable\n",
" - entitybase\n",
" - compositetypebase\n",
" - serialized (generated from code)\n",
" - YAML files\n",
" - in runtime\n",
" - converted as soon as possible to dataclasses in bedrock module\n",
" - some even in data output in parquet\n",
"\n",
"\n",
"### Config\n",
"\n",
"\n",
"- defines\n",
" - name\n",
" - version (this is the metadata version, the data version is determined at release)\n",
" - default-environment name (the first environment in envs config by default)\n",
" - validation-environments (the default-environment by default)\n",
" - registry address (the SSCUB registry by default) TODO: link\n",
" - imported_projects\n",
" - either a list of project names to be imported, where other than name, all default values are used\n",
" - or a dictionary, where the key is the project name (in the registry), and values are:\n",
" - version (metadata version)\n",
" - data_namespaces - the namespaces where loading the data is required\n",
" - in `envs` for each environment (one empty env named complete by default)\n",
" - params for all local namespaces and global params (namespace params default to these if not defined)\n",
" - logged to DVC from here\n",
" - environments of imported projects (where data is needed)\n",
" - specific DVC remote\n",
" - where to push data generated as outputs of running the code from namespaces (TODO - find a proper name, e.g. namespace processor) - identified by the name of a remote defined in DVC config\n",
" - parent env (default-environment by default)\n",
" - all missing keys of parameters or imported ns \n",
"- represented as `zimmer.yaml` in project root\n",
"\n",
"\n",
"### Environment\n",
"\n",
 A complete run">
"> A complete run of the code in a project with its values for parameters and environments for imported data\n",
"\n",
"- defined by config\n",
"- ...\n",
"\n",
"\n",
"### Tabular data related phrases\n",
"\n",
"- feature: *A named set of columns in a table*\n",
" - can be a primitive feature, foreign key or composite feature\n",
"- the subject of records: entity class that is represented in a table\n",
"\n",
"\n",
"[template]: TODO"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.10"
}
},
"nbformat": 4,
"nbformat_minor": 5
}