relationships Package
Classes
Multiplicity | PowerBI relationship cardinality descriptor stored in FabricDataFrame. |
Functions
find_relationships
Suggest possible relationships based on coverage threshold.
By default, include_many_to_many is False, which covers the most common case. Generated relationships are m:1 (i.e. the "to" attribute is the primary key), and 1:1 relationships are also included.
If include_many_to_many is set to True (the less common case), the search also looks for many-to-many relationships. The results are then a superset of the default m:1 case.
Empty dataframes are not considered for relationships.
find_relationships(tables: Dict[str, DataFrame] | List[DataFrame], coverage_threshold: float = 1.0, name_similarity_threshold: float = 0.8, exclude: List[Tuple[str]] | DataFrame | None = None, include_many_to_many: bool = False, verbose: int = 0) -> DataFrame
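To make the m:1, 1:1 and m:m distinction above concrete, here is a minimal plain-Python sketch that classifies a column pair by uniqueness. It only illustrates the concept and is not how find_relationships is implemented; the helper name `classify_multiplicity` and the sample data are hypothetical.

```python
from typing import Hashable, List


def classify_multiplicity(from_values: List[Hashable], to_values: List[Hashable]) -> str:
    """Classify a candidate relationship by the uniqueness of each column.

    Hypothetical helper for illustration only (not part of sempy).
    """
    from_unique = len(set(from_values)) == len(from_values)
    to_unique = len(set(to_values)) == len(to_values)
    if from_unique and to_unique:
        return "1:1"  # both sides behave like keys
    if to_unique:
        return "m:1"  # the "to" column behaves like a primary key
    if from_unique:
        return "1:m"  # swapping "from" and "to" would give m:1
    return "m:m"      # neither side is unique


# Hypothetical sample columns: orders.customer_id and customers.id
orders_customer_id = [1, 1, 2, 3, 3]
customers_id = [1, 2, 3]

print(classify_multiplicity(orders_customer_id, customers_id))  # m:1
```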
Parameters
Name | Description |
---|---|
tables (Required) | A dictionary that maps table names to the dataframes with table content. If a list of dataframes is provided, the function tries to infer the names from the session variables; if it cannot, it uses the positional index to describe them in the results. |
coverage_threshold | Minimum coverage required to report a potential relationship. Coverage is the ratio of unique values in the "from" column that are also found in (covered by) the "to" (key) column. Default value: 1.0 |
name_similarity_threshold | Minimum similarity of column names required before a column pair is analyzed for a relationship. A value of 0 means that any two columns are considered. A value of 1 means that only columns whose names match exactly are considered. Default value: 0.8 |
exclude | A dataframe with relationships to exclude. It should contain the columns "From Table", "From Column", "To Table", "To Column", which match the output of find_relationships. Default value: None |
include_many_to_many | Whether to also search for m:m relationships. Default value: False |
verbose | Verbosity. 0 means no verbosity. Default value: 0 |
Returns
Type | Description |
---|---|
DataFrame | A dataframe with candidate relationships identified by: from_table, from_column, to_table, to_column. Also provides auxiliary statistics to help with evaluation. If no suitable candidates are found, returns an empty DataFrame. |
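The two thresholds can be illustrated with a short plain-Python sketch. The `coverage` function follows the definition given for coverage_threshold. The `name_similarity` function is an assumption: this page does not say which similarity metric the library uses, so `difflib.SequenceMatcher` serves purely as a stand-in. Both helper names and the sample data are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Hashable, Iterable


def coverage(from_values: Iterable[Hashable], to_values: Iterable[Hashable]) -> float:
    """Share of unique "from" values that also occur in the "to" column."""
    from_unique = set(from_values)
    if not from_unique:
        return 0.0  # nothing to cover in an empty column
    return len(from_unique & set(to_values)) / len(from_unique)


def name_similarity(a: str, b: str) -> float:
    """Stand-in similarity score in [0, 1] (assumed metric, not sempy's)."""
    return SequenceMatcher(None, a, b).ratio()


# Hypothetical sample columns: value 4 has no match in the key column.
orders_customer_id = [1, 1, 2, 4]
customers_id = [1, 2, 3]

print(round(coverage(orders_customer_id, customers_id), 2))  # 0.67
print(name_similarity("customer_id", "customer_id"))         # 1.0
```

With the default coverage_threshold of 1.0, a pair like this one (two of three unique values covered) would not be reported; lowering the threshold would admit it.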
list_relationship_violations
Validate whether the content of the tables is consistent with the given relationships.
The function examines the results of joins for the provided relationships and searches for inconsistencies with the specified relationship multiplicity.
Relationships from empty tables (dataframes) are assumed to be valid.
list_relationship_violations(tables: Dict[str, DataFrame] | List[DataFrame], relationships: DataFrame, missing_key_errors='raise', coverage_threshold: float = 1.0, n_keys: int = 10) -> DataFrame
Parameters
Name | Description |
---|---|
tables (Required) | A dictionary that maps table names to the dataframes with table content. If a list of dataframes is provided, the function tries to infer the names from the session variables; if it cannot, it uses the positional index to describe them in the results. |
relationships (Required) | A dataframe with relationships to use for validation. It should contain the columns "Multiplicity", "From Table", "From Column", "To Table", "To Column", which match the output of find_relationships. |
missing_key_errors | One of 'raise', 'warn', 'ignore'. Action to take when a table or column named in a relationship is not found in tables. Default value: 'raise' |
coverage_threshold | Minimum fraction of rows in the "from" table that must find a match in an inner join with the "to" table. Default value: 1.0 |
n_keys | Number of missing keys to report. The reported keys may be a random sample. Default value: 10 |
Returns
Type | Description |
---|---|
DataFrame | A dataframe with relationships, error type and error message. If there are no violations, returns an empty DataFrame. |
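The kind of checks described for list_relationship_violations can be sketched in plain Python for a single m:1 relationship. This is an illustration under stated assumptions, not the library's implementation: the function name `list_m1_violations`, the error labels and the message wording are all hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def list_m1_violations(
    from_values: List[int],
    to_values: List[int],
    coverage_threshold: float = 1.0,
    n_keys: int = 10,
) -> List[Tuple[str, str]]:
    """Return (error type, message) pairs for one m:1 relationship.

    Hypothetical helper for illustration only (not part of sempy).
    """
    violations: List[Tuple[str, str]] = []
    if not from_values:
        return violations  # an empty "from" table is treated as valid

    # m:1 expects every "to" value to be unique (a key).
    to_counts = Counter(to_values)
    duplicated = sorted(k for k, n in to_counts.items() if n > 1)
    if duplicated:
        violations.append(
            ("duplicate key", f"'to' values occur more than once: {duplicated[:n_keys]}")
        )

    # Fraction of "from" rows that would survive an inner join.
    joined = sum(1 for v in from_values if v in to_counts)
    if joined / len(from_values) < coverage_threshold:
        missing = sorted(set(from_values) - set(to_counts))
        violations.append(
            ("missing key", f"'from' values not found in 'to': {missing[:n_keys]}")
        )
    return violations


# Hypothetical data: key 2 is duplicated in "to", key 4 is missing from it.
for error_type, message in list_m1_violations([1, 1, 2, 4], [1, 2, 2, 3]):
    print(error_type, "|", message)
```

Here three of the four "from" rows join, so lowering coverage_threshold to 0.7 would suppress the missing-key report while the duplicate-key report would remain.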
plot_relationship_metadata
Plot a graph of relationships based on metadata contained in the provided dataframe.
The input "metadata" dataframe should contain one row per relationship. Each row names the "from" and "to" table/columns that participate in the relationship, and their multiplicity as defined by Multiplicity.
plot_relationship_metadata(metadata_df: DataFrame, tables: Dict[str, DataFrame] | List[DataFrame] | None = None, include_columns: str = 'keys', missing_key_errors='raise', *, graph_attributes: Dict | None = None) -> Digraph
Parameters
Name | Description |
---|---|
metadata_df (Required) | A "metadata" dataframe with relationships to plot. It should contain the columns "Multiplicity", "From Table", "From Column", "To Table", "To Column", which match the output of find_relationships. |
tables | Needs to be provided only when include_columns = 'all'; it is used to map table names from the relationships to the dataframe columns. Default value: None |
include_columns | One of 'keys', 'all', 'none'. Indicates which columns should be included in the graph. Default value: 'keys' |
missing_key_errors | One of 'raise', 'warn', 'ignore'. Action to take when a table or column named in a relationship is not found in tables. Default value: 'raise' |
graph_attributes | Attributes passed to graphviz. Note that all values need to be strings. Default value: None |
Returns
Type | Description |
---|---|
Digraph | Graph object containing all relationships. If columns are included (see include_columns), they are represented as ports in the graph. |
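To show how one row per relationship maps onto a directed graph, the sketch below builds Graphviz DOT text by hand from rows carrying the column names listed above. It is only an approximation of the idea: the real function returns a Digraph object, and its node and edge layout is not documented here. The helper name `relationships_to_dot` and the sample row are hypothetical.

```python
from typing import Dict, List


def relationships_to_dot(rows: List[Dict[str, str]]) -> str:
    """Render relationship rows as DOT source text.

    Hypothetical helper for illustration only (not part of sempy).
    """
    lines = ["digraph relationships {", "  rankdir=LR;"]
    for row in rows:
        # One edge per relationship, labelled with columns and multiplicity.
        label = f'{row["From Column"]} -> {row["To Column"]} ({row["Multiplicity"]})'
        lines.append(
            f'  "{row["From Table"]}" -> "{row["To Table"]}" [label="{label}"];'
        )
    lines.append("}")
    return "\n".join(lines)


# Hypothetical metadata row in the documented column layout.
metadata = [
    {
        "Multiplicity": "m:1",
        "From Table": "orders",
        "From Column": "customer_id",
        "To Table": "customers",
        "To Column": "id",
    }
]

print(relationships_to_dot(metadata))
```

The resulting text can be pasted into any DOT renderer to preview the relationship graph.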