whylabs/whylogs-python

View on GitHub
python/docs/features/data_validation.md

Summary

Maintainability
Test Coverage

Line length
Open

Conditions also enable you to trigger actions when the condition is met while data is being logged. To do that, you need to specify a set of actions in your Condition object, along with the Relation that will trigger those actions. This is the same Relation that will be used to track the Condition Count Metrics in the profile.

MD013 - Line length

Tags: line_length

Aliases: line-length Parameters: linelength, codeblocks, tables (number; default 80, boolean; default true)

This rule is triggered when there are lines that are longer than the configured line length (default: 80 characters). To fix this, split the line up into multiple lines.

This rule has an exception where there is no whitespace beyond the configured line length. This allows you to still include items such as long URLs without being forced to break them in the middle.

You also have the option to exclude this rule for code blocks and tables. To do this, set the code_blocks and/or tables parameters to false.

Code blocks are included in this rule by default since it is often a requirement for document readability, and tentatively compatible with code rules. Still, some languages do not lend themselves to short lines.

Multiple consecutive blank lines
Open


MD012 - Multiple consecutive blank lines

Tags: whitespace, blank_lines

Aliases: no-multiple-blanks

This rule is triggered when there are multiple consecutive blank lines in the document:

Some text here


Some more text here

To fix this, delete the offending lines:

Some text here

Some more text here

Note: this rule will not be triggered if there are multiple consecutive blank lines inside code blocks.

Line length
Open

- Creating existing, but non-standard metrics for specific data types, such as [String Tracking](https://nbviewer.org/github/whylabs/whylogs/blob/mainline/python/examples/advanced/String_Tracking.ipynb) and [Image](https://nbviewer.org/github/whylabs/whylogs/blob/mainline/python/examples/advanced/Image_Logging.ipynb) metrics

MD013 - Line length

Tags: line_length

Aliases: line-length Parameters: linelength, codeblocks, tables (number; default 80, boolean; default true)

This rule is triggered when there are lines that are longer than the configured line length (default: 80 characters). To fix this, split the line up into multiple lines.

This rule has an exception where there is no whitespace beyond the configured line length. This allows you to still include items such as long URLs without being forced to break them in the middle.

You also have the option to exclude this rule for code blocks and tables. To do this, set the code_blocks and/or tables parameters to false.

Code blocks are included in this rule by default since it is often a requirement for document readability, and tentatively compatible with code rules. Still, some languages do not lend themselves to short lines.

Line length
Open

A whylogs __profile__ contains a set of metrics that summarize the original data. __Metric Constraints__ validate __Metrics__ present on the __Profile__ to generate a Report. This report will tell us whether the data meets the data quality constraints we defined.

MD013 - Line length

Tags: line_length

Aliases: line-length Parameters: linelength, codeblocks, tables (number; default 80, boolean; default true)

This rule is triggered when there are lines that are longer than the configured line length (default: 80 characters). To fix this, split the line up into multiple lines.

This rule has an exception where there is no whitespace beyond the configured line length. This allows you to still include items such as long URLs without being forced to break them in the middle.

You also have the option to exclude this rule for code blocks and tables. To do this, set the code_blocks and/or tables parameters to false.

Code blocks are included in this rule by default since it is often a requirement for document readability, and tentatively compatible with code rules. Still, some languages do not lend themselves to short lines.

Line length
Open

You can validate your profile by calling `constraints.validate()`, which will return __True__ if all of the constraints passed, and __False__ otherwise.

MD013 - Line length

Tags: line_length

Aliases: line-length Parameters: linelength, codeblocks, tables (number; default 80, boolean; default true)

This rule is triggered when there are lines that are longer than the configured line length (default: 80 characters). To fix this, split the line up into multiple lines.

This rule has an exception where there is no whitespace beyond the configured line length. This allows you to still include items such as long URLs without being forced to break them in the middle.

You also have the option to exclude this rule for code blocks and tables. To do this, set the code_blocks and/or tables parameters to false.

Code blocks are included in this rule by default since it is often a requirement for document readability, and tentatively compatible with code rules. Still, some languages do not lend themselves to short lines.

Line length
Open

Of the examples above, the __Condition Count Metric__ has some particularities, so let's use it to explain some details.

MD013 - Line length

Tags: line_length

Aliases: line-length Parameters: linelength, codeblocks, tables (number; default 80, boolean; default true)

This rule is triggered when there are lines that are longer than the configured line length (default: 80 characters). To fix this, split the line up into multiple lines.

This rule has an exception where there is no whitespace beyond the configured line length. This allows you to still include items such as long URLs without being forced to break them in the middle.

You also have the option to exclude this rule for code blocks and tables. To do this, set the code_blocks and/or tables parameters to false.

Code blocks are included in this rule by default since it is often a requirement for document readability, and tentatively compatible with code rules. Still, some languages do not lend themselves to short lines.

Line length
Open

The constraints report can also be displayed in a more human-friendly way by using the __NotebookProfileVisualizer__ integration:

MD013 - Line length

Tags: line_length

Aliases: line-length Parameters: linelength, codeblocks, tables (number; default 80, boolean; default true)

This rule is triggered when there are lines that are longer than the configured line length (default: 80 characters). To fix this, split the line up into multiple lines.

This rule has an exception where there is no whitespace beyond the configured line length. This allows you to still include items such as long URLs without being forced to break them in the middle.

You also have the option to exclude this rule for code blocks and tables. To do this, set the code_blocks and/or tables parameters to false.

Code blocks are included in this rule by default since it is often a requirement for document readability, and tentatively compatible with code rules. Still, some languages do not lend themselves to short lines.

Line length
Open

Constraints are a powerful feature built on top of whylogs profiles that enable you to quickly and easily validate that your data looks the way that it should. There are numerous types of constraints that you can set on your data (that numerical data will always fall within a certain range, that text data will always be in a JSON format, etc) and, if your dataset fails to satisfy a constraint, you can fail your unit tests or your CI/CD pipeline.

MD013 - Line length

Tags: line_length

Aliases: line-length Parameters: linelength, codeblocks, tables (number; default 80, boolean; default true)

This rule is triggered when there are lines that are longer than the configured line length (default: 80 characters). To fix this, split the line up into multiple lines.

This rule has an exception where there is no whitespace beyond the configured line length. This allows you to still include items such as long URLs without being forced to break them in the middle.

You also have the option to exclude this rule for code blocks and tables. To do this, set the code_blocks and/or tables parameters to false.

Code blocks are included in this rule by default since it is often a requirement for document readability, and tentatively compatible with code rules. Still, some languages do not lend themselves to short lines.

Line length
Open

In whylogs, you validate data by creating __Metric Constraints__ and validating those constraints against a whylogs profile.

MD013 - Line length

Tags: line_length

Aliases: line-length Parameters: linelength, codeblocks, tables (number; default 80, boolean; default true)

This rule is triggered when there are lines that are longer than the configured line length (default: 80 characters). To fix this, split the line up into multiple lines.

This rule has an exception where there is no whitespace beyond the configured line length. This allows you to still include items such as long URLs without being forced to break them in the middle.

You also have the option to exclude this rule for code blocks and tables. To do this, set the code_blocks and/or tables parameters to false.

Code blocks are included in this rule by default since it is often a requirement for document readability, and tentatively compatible with code rules. Still, some languages do not lend themselves to short lines.

Line length
Open

- By directly using out-of-the-box helper constraints, such as _greater\_than\_number_, _no\_missing\_values_, or _is\_non\_negative_. You can see the full list of existing helper constraints, and how to use them, in the [Constraints Suite Notebook](https://nbviewer.org/github/whylabs/whylogs/blob/mainline/python/examples/basic/Constraints_Suite.ipynb)

MD013 - Line length

Tags: line_length

Aliases: line-length Parameters: linelength, codeblocks, tables (number; default 80, boolean; default true)

This rule is triggered when there are lines that are longer than the configured line length (default: 80 characters). To fix this, split the line up into multiple lines.

This rule has an exception where there is no whitespace beyond the configured line length. This allows you to still include items such as long URLs without being forced to break them in the middle.

You also have the option to exclude this rule for code blocks and tables. To do this, set the code_blocks and/or tables parameters to false.

Code blocks are included in this rule by default since it is often a requirement for document readability, and tentatively compatible with code rules. Still, some languages do not lend themselves to short lines.

Line length
Open

- By assembling your own set of tailored constraints: You can see how to do this in the [Metric Constraints Notebook](https://nbviewer.org/github/whylabs/whylogs/blob/mainline/python/examples/advanced/Metric_Constraints.ipynb).

MD013 - Line length

Tags: line_length

Aliases: line-length Parameters: linelength, codeblocks, tables (number; default 80, boolean; default true)

This rule is triggered when there are lines that are longer than the configured line length (default: 80 characters). To fix this, split the line up into multiple lines.

This rule has an exception where there is no whitespace beyond the configured line length. This allows you to still include items such as long URLs without being forced to break them in the middle.

You also have the option to exclude this rule for code blocks and tables. To do this, set the code_blocks and/or tables parameters to false.

Code blocks are included in this rule by default since it is often a requirement for document readability, and tentatively compatible with code rules. Still, some languages do not lend themselves to short lines.

Multiple consecutive blank lines
Open


MD012 - Multiple consecutive blank lines

Tags: whitespace, blank_lines

Aliases: no-multiple-blanks

This rule is triggered when there are multiple consecutive blank lines in the document:

Some text here


Some more text here

To fix this, delete the offending lines:

Some text here

Some more text here

Note: this rule will not be triggered if there are multiple consecutive blank lines inside code blocks.

Line length
Open

Those metrics can be standard metrics, such as Distribution, Frequent Items, or Types metrics. They can also be __Condition Count Metrics__. Condition Count Metrics count the number of occurrences a certain relation passed/failed. For example, the number of times the rows for the column `Name` was equal to `Bob`.

MD013 - Line length

Tags: line_length

Aliases: line-length Parameters: linelength, codeblocks, tables (number; default 80, boolean; default true)

This rule is triggered when there are lines that are longer than the configured line length (default: 80 characters). To fix this, split the line up into multiple lines.

This rule has an exception where there is no whitespace beyond the configured line length. This allows you to still include items such as long URLs without being forced to break them in the middle.

You also have the option to exclude this rule for code blocks and tables. To do this, set the code_blocks and/or tables parameters to false.

Code blocks are included in this rule by default since it is often a requirement for document readability, and tentatively compatible with code rules. Still, some languages do not lend themselves to short lines.

Line length
Open

For a profile to contain Condition Count Metrics, you have first to specify a Condition __before__ logging your data and pass that to your Dataset Schema. The Dataset Schema configures the behavior for tracking metrics in whylogs.

MD013 - Line length

Tags: line_length

Aliases: line-length Parameters: linelength, codeblocks, tables (number; default 80, boolean; default true)

This rule is triggered when there are lines that are longer than the configured line length (default: 80 characters). To fix this, split the line up into multiple lines.

This rule has an exception where there is no whitespace beyond the configured line length. This allows you to still include items such as long URLs without being forced to break them in the middle.

You also have the option to exclude this rule for code blocks and tables. To do this, set the code_blocks and/or tables parameters to false.

Code blocks are included in this rule by default since it is often a requirement for document readability, and tentatively compatible with code rules. Still, some languages do not lend themselves to short lines.

Line length
Open

- Creating Condition Count Metrics - See the [Condition Count Metrics](https://nbviewer.org/github/whylabs/whylogs/blob/mainline/python/examples/advanced/Condition_Count_Metrics.ipynb) and [Metric Constraints with Condition Count Metrics](https://nbviewer.org/github/whylabs/whylogs/blob/mainline/python/examples/advanced/Metric_Constraints_with_Condition_Count_Metrics.ipynb) examples

MD013 - Line length

Tags: line_length

Aliases: line-length Parameters: linelength, codeblocks, tables (number; default 80, boolean; default true)

This rule is triggered when there are lines that are longer than the configured line length (default: 80 characters). To fix this, split the line up into multiple lines.

This rule has an exception where there is no whitespace beyond the configured line length. This allows you to still include items such as long URLs without being forced to break them in the middle.

You also have the option to exclude this rule for code blocks and tables. To do this, set the code_blocks and/or tables parameters to false.

Code blocks are included in this rule by default since it is often a requirement for document readability, and tentatively compatible with code rules. Still, some languages do not lend themselves to short lines.

Line length
Open

Once you have a set of constraints, you can use it to validate or generate a report of a Profile.

MD013 - Line length

Tags: line_length

Aliases: line-length Parameters: linelength, codeblocks, tables (number; default 80, boolean; default true)

This rule is triggered when there are lines that are longer than the configured line length (default: 80 characters). To fix this, split the line up into multiple lines.

This rule has an exception where there is no whitespace beyond the configured line length. This allows you to still include items such as long URLs without being forced to break them in the middle.

You also have the option to exclude this rule for code blocks and tables. To do this, set the code_blocks and/or tables parameters to false.

Code blocks are included in this rule by default since it is often a requirement for document readability, and tentatively compatible with code rules. Still, some languages do not lend themselves to short lines.

Line length
Open

As its name suggest, a metrics constraint uses metrics in a profile to perform the validation. With no additional customization, the constraint will have at its disposal standard metrics captured in the logging process, such as Distribution, Frequent Items and Cardinality metrics:

MD013 - Line length

Tags: line_length

Aliases: line-length Parameters: linelength, codeblocks, tables (number; default 80, boolean; default true)

This rule is triggered when there are lines that are longer than the configured line length (default: 80 characters). To fix this, split the line up into multiple lines.

This rule has an exception where there is no whitespace beyond the configured line length. This allows you to still include items such as long URLs without being forced to break them in the middle.

You also have the option to exclude this rule for code blocks and tables. To do this, set the code_blocks and/or tables parameters to false.

Code blocks are included in this rule by default since it is often a requirement for document readability, and tentatively compatible with code rules. Still, some languages do not lend themselves to short lines.

Multiple consecutive blank lines
Open


MD012 - Multiple consecutive blank lines

Tags: whitespace, blank_lines

Aliases: no-multiple-blanks

This rule is triggered when there are multiple consecutive blank lines in the document:

Some text here


Some more text here

To fix this, delete the offending lines:

Some text here

Some more text here

Note: this rule will not be triggered if there are multiple consecutive blank lines inside code blocks.

Multiple consecutive blank lines
Open


MD012 - Multiple consecutive blank lines

Tags: whitespace, blank_lines

Aliases: no-multiple-blanks

This rule is triggered when there are multiple consecutive blank lines in the document:

Some text here


Some more text here

To fix this, delete the offending lines:

Some text here

Some more text here

Note: this rule will not be triggered if there are multiple consecutive blank lines inside code blocks.

Multiple consecutive blank lines
Open


MD012 - Multiple consecutive blank lines

Tags: whitespace, blank_lines

Aliases: no-multiple-blanks

This rule is triggered when there are multiple consecutive blank lines in the document:

Some text here


Some more text here

To fix this, delete the offending lines:

Some text here

Some more text here

Note: this rule will not be triggered if there are multiple consecutive blank lines inside code blocks.

Line length
Open

whylogs profiles contain summarized information about our data. This means that it's a __lossy__ process, and once we get the profiles, we don't have access anymore to the complete set of data.

MD013 - Line length

Tags: line_length

Aliases: line-length Parameters: linelength, codeblocks, tables (number; default 80, boolean; default true)

This rule is triggered when there are lines that are longer than the configured line length (default: 80 characters). To fix this, split the line up into multiple lines.

This rule has an exception where there is no whitespace beyond the configured line length. This allows you to still include items such as long URLs without being forced to break them in the middle.

You also have the option to exclude this rule for code blocks and tables. To do this, set the code_blocks and/or tables parameters to false.

Code blocks are included in this rule by default since it is often a requirement for document readability, and tentatively compatible with code rules. Still, some languages do not lend themselves to short lines.

Line length
Open

The answer is that you need to define a __Condition Count Metric__ to be tracked __before__ logging your data. This metric will count the number of times the values of a given column meet a user-defined condition. When the profile is generated, you'll have that information to check against the constraints you'll create.

MD013 - Line length

Tags: line_length

Aliases: line-length Parameters: linelength, codeblocks, tables (number; default 80, boolean; default true)

This rule is triggered when there are lines that are longer than the configured line length (default: 80 characters). To fix this, split the line up into multiple lines.

This rule has an exception where there is no whitespace beyond the configured line length. This allows you to still include items such as long URLs without being forced to break them in the middle.

You also have the option to exclude this rule for code blocks and tables. To do this, set the code_blocks and/or tables parameters to false.

Code blocks are included in this rule by default since it is often a requirement for document readability, and tentatively compatible with code rules. Still, some languages do not lend themselves to short lines.

Line length
Open

To enable additional use cases for Metric Constraints, you need to define additional metrics, which will then be used by the constraints. This can be done, for example, by:

MD013 - Line length

Tags: line_length

Aliases: line-length Parameters: linelength, codeblocks, tables (number; default 80, boolean; default true)

This rule is triggered when there are lines that are longer than the configured line length (default: 80 characters). To fix this, split the line up into multiple lines.

This rule has an exception where there is no whitespace beyond the configured line length. This allows you to still include items such as long URLs without being forced to break them in the middle.

You also have the option to exclude this rule for code blocks and tables. To do this, set the code_blocks and/or tables parameters to false.

Code blocks are included in this rule by default since it is often a requirement for document readability, and tentatively compatible with code rules. Still, some languages do not lend themselves to short lines.

Multiple consecutive blank lines
Open


MD012 - Multiple consecutive blank lines

Tags: whitespace, blank_lines

Aliases: no-multiple-blanks

This rule is triggered when there are multiple consecutive blank lines in the document:

Some text here


Some more text here

To fix this, delete the offending lines:

Some text here

Some more text here

Note: this rule will not be triggered if there are multiple consecutive blank lines inside code blocks.

Multiple consecutive blank lines
Open


MD012 - Multiple consecutive blank lines

Tags: whitespace, blank_lines

Aliases: no-multiple-blanks

This rule is triggered when there are multiple consecutive blank lines in the document:

Some text here


Some more text here

To fix this, delete the offending lines:

Some text here

Some more text here

Note: this rule will not be triggered if there are multiple consecutive blank lines inside code blocks.

Line length
Open

Once the additional metric is added to the profile, you can use Metric Constraints the very same way as you would for standard metrics.

MD013 - Line length

Tags: line_length

Aliases: line-length Parameters: linelength, codeblocks, tables (number; default 80, boolean; default true)

This rule is triggered when there are lines that are longer than the configured line length (default: 80 characters). To fix this, split the line up into multiple lines.

This rule has an exception where there is no whitespace beyond the configured line length. This allows you to still include items such as long URLs without being forced to break them in the middle.

You also have the option to exclude this rule for code blocks and tables. To do this, set the code_blocks and/or tables parameters to false.

Code blocks are included in this rule by default since it is often a requirement for document readability, and tentatively compatible with code rules. Still, some languages do not lend themselves to short lines.

Headers should be surrounded by blank lines
Open

### Constraints Report

MD022 - Headers should be surrounded by blank lines

Tags: headers, blank_lines

Aliases: blanks-around-headers

This rule is triggered when headers (any style) are either not preceded or not followed by a blank line:

# Header 1
Some text

Some more text
## Header 2

To fix this, ensure that all headers have a blank line both before and after (except where the header is at the beginning or end of the document):

# Header 1

Some text

Some more text

## Header 2

Rationale: Aside from aesthetic reasons, some parsers, including kramdown, will not parse headers that don't have a blank line before, and will parse them as regular text.

Line length
Open

This makes some types of constraints impossible to be created from standard metrics itself. For example, suppose you need to check every row of a column to check that there is no textual information that matches a credit card number or email information. Or maybe you're interested in ensuring that there are no even numbers in a certain column. How do we do that if we don't have access to the complete data?

MD013 - Line length

Tags: line_length

Aliases: line-length Parameters: linelength, codeblocks, tables (number; default 80, boolean; default true)

This rule is triggered when there are lines that are longer than the configured line length (default: 80 characters). To fix this, split the line up into multiple lines.

This rule has an exception where there is no whitespace beyond the configured line length. This allows you to still include items such as long URLs without being forced to break them in the middle.

You also have the option to exclude this rule for code blocks and tables. To do this, set the code_blocks and/or tables parameters to false.

Code blocks are included in this rule by default since it is often a requirement for document readability, and tentatively compatible with code rules. Still, some languages do not lend themselves to short lines.

Headers should be surrounded by blank lines
Open

### Validate

MD022 - Headers should be surrounded by blank lines

Tags: headers, blank_lines

Aliases: blanks-around-headers

This rule is triggered when headers (any style) are either not preceded or not followed by a blank line:

# Header 1
Some text

Some more text
## Header 2

To fix this, ensure that all headers have a blank line both before and after (except where the header is at the beginning or end of the document):

# Header 1

Some text

Some more text

## Header 2

Rationale: Aside from aesthetic reasons, some parsers, including kramdown, will not parse headers that don't have a blank line before, and will parse them as regular text.

Line length
Open

Data validation is the process of ensuring that data is accurate, complete, and consistent. It involves checking data for errors or inconsistencies, and ensuring that it meets the specified requirements or constraints. Data validation is important because it helps to ensure the integrity and quality of data, and helps to prevent errors or inaccuracies in data from propagating and causing problems downstream.

MD013 - Line length

Tags: line_length

Aliases: line-length Parameters: linelength, codeblocks, tables (number; default 80, boolean; default true)

This rule is triggered when there are lines that are longer than the configured line length (default: 80 characters). To fix this, split the line up into multiple lines.

This rule has an exception where there is no whitespace beyond the configured line length. This allows you to still include items such as long URLs without being forced to break them in the middle.

You also have the option to exclude this rule for code blocks and tables. To do this, set the code_blocks and/or tables parameters to false.

Code blocks are included in this rule by default since it is often a requirement for document readability, and tentatively compatible with code rules. Still, some languages do not lend themselves to short lines.

Line length
Open

The report supports in-notebook visualization and it can also be downloaded in html format.

MD013 - Line length

Tags: line_length

Aliases: line-length Parameters: linelength, codeblocks, tables (number; default 80, boolean; default true)

This rule is triggered when there are lines that are longer than the configured line length (default: 80 characters). To fix this, split the line up into multiple lines.

This rule has an exception where there is no whitespace beyond the configured line length. This allows you to still include items such as long URLs without being forced to break them in the middle.

You also have the option to exclude this rule for code blocks and tables. To do this, set the code_blocks and/or tables parameters to false.

Code blocks are included in this rule by default since it is often a requirement for document readability, and tentatively compatible with code rules. Still, some languages do not lend themselves to short lines.

Line length
Open

Additionally, with condition count metrics you can define actions that will be triggered whenever the specified conditions are met during the logging process. This allows you to take action in critical situations without having to wait for the logging to be completed. These actions are customizable, so you can use them in any way you need: send alerts/notifications, raise errors, halt processes in your pipeline, etc.

MD013 - Line length

Tags: line_length

Aliases: line-length Parameters: linelength, codeblocks, tables (number; default 80, boolean; default true)

This rule is triggered when there are lines that are longer than the configured line length (default: 80 characters). To fix this, split the line up into multiple lines.

This rule has an exception where there is no whitespace beyond the configured line length. This allows you to still include items such as long URLs without being forced to break them in the middle.

You also have the option to exclude this rule for code blocks and tables. To do this, set the code_blocks and/or tables parameters to false.

Code blocks are included in this rule by default since it is often a requirement for document readability, and tentatively compatible with code rules. Still, some languages do not lend themselves to short lines.

Line length
Open

For example, the `legs greater than number 2` would use the metric component `min` of the `distribution` metric of the `legs` column to perform the validation.

MD013 - Line length

Tags: line_length

Aliases: line-length Parameters: linelength, codeblocks, tables (number; default 80, boolean; default true)

This rule is triggered when there are lines that are longer than the configured line length (default: 80 characters). To fix this, split the line up into multiple lines.

This rule has an exception where there is no whitespace beyond the configured line length. This allows you to still include items such as long URLs without being forced to break them in the middle.

You also have the option to exclude this rule for code blocks and tables. To do this, set the code_blocks and/or tables parameters to false.

Code blocks are included in this rule by default since it is often a requirement for document readability, and tentatively compatible with code rules. Still, some languages do not lend themselves to short lines.

There are no issues that match your filters.

Category
Status