bio-miga/miga

View on GitHub
manual/part5/metadata.md

Summary

Maintainability
Test Coverage
# Metadata

## All objects

All metadata objects support the following fields:

| Field        | Supported values                  | Description
| ------------:|:---------------------------------:|:-------------------
| created*     | [Date](../glossary.md#miga-dates) | Date of creation
| updated*     | [Date](../glossary.md#miga-dates) | Date of last update

> **\*** Mandatory

## Projects

The following metadata fields are recognized by different interfaces for
**Projects**:

### Project Features

Metadata with additional information and features about the project:

| Field        | Supported values | Description
| ------------:|:----------------:|:------------------------------------
| comments     | String           | Free-form comments
| description  | String           | Free-form description
| name*        | [Name](../glossary.md#miga-names) | Name‡

> **\*** Mandatory


### Project System Metadata

Metadata entries automatically set by MiGA:

| Field        | Supported values | Description
| ------------:|:----------------:|:------------------------------------
| datasets*    | Array of String  | List of datasets in the project
| type*        | String           | [Type](../part2/types.md#project-types)

> **\*** Mandatory
> 
> **‡** By default the base name of the project path


### Project Flags

Metadata entries that trigger specific behaviors in MiGA:

| Field        | Supported values | Description
| ------------:|:----------------:|:------------------------------------
| ref_project  | Path             | Project with reference taxonomy {1}
| db_proj_dir  | Path             | Directory containing database projects {1} {2}
| tax_pvalue   | Float [0,1]      | Max p-value to transfer taxonomy (def: 0.1)
| haai_p       | String           | hAAI engine {3} (def: fastaai)
| aai_p        | String           | AAI engine {3} (def: diamond)
| ani_p        | String           | ANI engine {3} (def: fastani)
| max_try      | Integer          | Max number of task attempts (def: 10)
| aai_save_rbm | Boolean          | Should RBMs be saved for OGS analysis?
| ogs_identity | Float [0,100]    | Min RBM identity for OGS (def: 80)
| clean_ogs    | Boolean          | If false, keeps ABC (clades only)
| run_clades   | Boolean          | Should clades be estimated from distances?
| gsp_ani      | Float [0,100]    | ANI limit to propose gsp clades (def: 95)
| gsp_aai      | Float [0,100]    | AAI limit to propose gsp clades (def: 90)
| gsp_metric   | String           | Metric to propose clades: `ani` (def), `aai`
| ess_coll     | String           | Collection of essential genes to use {4}
| min_qual     | Float (or 'no')  | Min. genome quality (or no filter; def: 25)
| distances_checkpoint | Integer  | Comparisons before storing data (def: 10)

> **{1}** This path can be either absolute or relative to the project's path.
>
> **{2}** This is the location of the databases used by
> [db_project](#dataset-flags). If not set, it is assumed to be the parent
> folder of the current project.
>
> **{3}** Supported values: `blast`, `blat`, `diamond`
> (only for hAAI and AAI), `fastani` (only for ANI), `no` (only for hAAI),
> and `fastaai` (only for hAAI).
>
> **{4}** One of: `dupont_2012` (default), or `lee_2019`


### Project Hooks

Additionally, hooks can be defined for projects as arrays of arrays containing
the action name and the arguments (if any). For example, one can define:

```
on_processing_ready: [
  ['run_cmd', 'date > {{project}}/ALL_DONE.txt'],
  ['run_cmd', 'sendmail ...']
]
```

or

```
on_add_dataset: [
  ['run_cmd', 'echo {{object}} > {{project}}/LATEST_DATASET.txt']
]
```

Supported events:
- `on_create()`: When created
- `on_load()`: When loaded
- `on_save()`: When saved
- `on_add_dataset(object)`: When a dataset is added, with name `object`
- `on_unlink_dataset(object)`: When dataset with name `object` is unlinked
- `on_result_ready(object)`: When any result is ready, with key `object`
- `on_result_ready_{result}()`: When `result` is ready
- `on_processing_ready()`: When processing is complete

Supported hooks:
- `run_lambda(lambda, args...)`
- `run_cmd(cmd)`


## Datasets

The following metadata fields are recognized by different interfaces for
**Datasets**:

### Dataset Features

Metadata with additional information and features about the dataset:

| Field        | Supported values | Description
| ------------:|:----------------:|:----------------------------------
| tax          | MiGA::Taxonomy   | Taxonomy of the dataset
| quality      | String           | Description of genome quality
| trna_count   | Integer          | Number of tRNA elements detected
| trna_aa      | Integer          | Number of distinct AA with tRNA elements
| dprotologue  | String           | Taxonumber in the Digital Protologue DB
| ncbi_tax_id  | String           | Linking ID(s) {1} for NCBI Taxonomy
| ncbi_nuccore | String           | Linking ID(s) {1} for NCBI Nucleotide
| ncbi_asm     | String           | Linking ID(s) {1} for NCBI Assembly
| ebi_embl     | String           | Linking ID(s) {1} for EBI EMBL
| ebi_ena      | String           | Linking ID(s) {1} for EBI ENA
| web_assembly | String           | URL to download assembly
| web_assembly_gz | String        | URL to download gzipped assembly
| see_also     | String           | Link(s) {1} in the format text:url
| is_type      | Boolean          | If it is type material
| is_ref_type  | Boolean          | If it is reference material {2}
| type_rel     | String           | Relationship to type material
| suspect      | Array(String)    | Flags indicating a suspect dataset

> **{1}** Multiple values can be provided separated by commas or colons
> 
> **{2}** This is not a valid type, but it represents the closest
> available dataset to material that is unavailable and unlikely to ever become
> available. See also [Federhen, 2015, NAR](https://doi.org/10.1093/nar/gku1127)


### Dataset System Metadata

Metadata entries automatically set by MiGA:

| Field        | Supported values | Description
| ------------:|:----------------:|:----------------------------------
| type*        | String           | [Type](../part2/types.md#dataset-types)
| ref          | Boolean          | [Reference](../part2/types.md#reference)
| inactive     | Boolean          | If auto-processing should stop
| metadata_only | Boolean         | Dataset with metadata but without input data
| status       | String           | Proc. status: complete, incomplete, inactive
| _step        | String           | For internal control of processing
| \_try_`step` | Integer          | For internal control of processing
| ~~user~~     | String           | Deprecated

> **\*** Mandatory


### Dataset Flags

Metadata entries that trigger specific behaviors in MiGA:

| Field        | Supported values | Description
| ------------:|:----------------:|:----------------------------------
| run_`step`   | Boolean          | Forces running or not `step`
| db_project   | Path             | Project to use as database
| dist_req     | Array of String  | Run distances against these datasets*

> **\*** When searching best-matching datasets, include these datasets even if
> they are not visited using the medoid tree


### Dataset Hooks

Additionally, hooks can be defined for datasets as arrays of arrays containing
the action name and the arguments. See above ([project hooks](#project-hooks))
for examples.

Supported events:
- `on_load()`: When loaded
- `on_save()`: When saved
- `on_remove()`: When removed
- `on_inactivate()`: When inactivated
- `on_activate()`: When activated
- `on_result_ready(object)`: When any result is ready, with key `object`
- `on_result_ready_{result}()`: When `result` is ready
- `on_preprocessing_ready()`: When preprocessing is complete

Supported hooks:
- `run_lambda(lambda, args...)`
- `clear_run_counts()`
- `run_cmd(cmd)`