sempy.relationships package

find_relationships

Suggest possible relationships based on coverage threshold.

By default, include_many_to_many is False, which covers the most common case. Generated relationships are m:1 (that is, the "to" attribute is the primary key) and also include 1:1 relationships.

If include_many_to_many is set to True (the less common case), the function also searches for many-to-many relationships. The results are a superset of the default m:1 case.

Empty dataframes are not considered for relationships.

find_relationships(tables: Dict[str, DataFrame] | List[DataFrame], coverage_threshold: float = 1.0, name_similarity_threshold: float = 0.8, exclude: List[Tuple[str]] | DataFrame | None = None, include_many_to_many: bool = False, verbose: int = 0) -> DataFrame

Parameters

Name	Description
tables Required	dict[str, DataFrame] or list[DataFrame] A dictionary that maps table names to the dataframes with table content. If a list of dataframes is provided, the function will try to infer the names from the session variables and if it cannot, it will use the positional index to describe them in the results.
coverage_threshold	float The minimum coverage required to report a potential relationship. Coverage is the ratio of unique values in the "from" column that are found in (covered by) the "to" (key) column. Default value: 1.0
name_similarity_threshold	float Minimum similarity of column names required before a pair of columns is analyzed for a relationship. A value of 0 means that any two columns are considered. A value of 1 means that only columns whose names match exactly are considered. Default value: 0.8
exclude	DataFrame or list[tuple[str]] Relationships to exclude. If a dataframe is given, it should contain the columns "From Table", "From Column", "To Table", "To Column", which match the output of find_relationships. Default value: None
include_many_to_many	bool Whether to also search for m:m relationships. Default value: False
verbose	int Verbosity. 0 means no verbosity. Default value: 0

Returns

Type	Description
DataFrame	A dataframe with candidate relationships identified by: from_table, from_column, to_table, to_column. Also provides auxiliary statistics to help with evaluation. If no suitable candidates are found, returns an empty DataFrame.
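
To make the coverage and multiplicity definitions above concrete, here is a minimal pure-Python sketch. It is not the library's implementation: the helper names coverage and suggest_multiplicity are hypothetical, columns are modeled as plain lists rather than dataframe columns, name similarity is not modeled, and skipping empty columns is an assumption based on the note about empty dataframes.

```python
def coverage(from_values, to_values):
    """Ratio of unique "from" values that also appear in the "to" column."""
    from_unique = set(from_values)
    if not from_unique:
        return 0.0  # assumption: an empty "from" column covers nothing
    return len(from_unique & set(to_values)) / len(from_unique)


def suggest_multiplicity(from_values, to_values,
                         coverage_threshold=1.0,
                         include_many_to_many=False):
    """Return "m:1", "1:1", "m:m", or None if the pair is not a candidate."""
    from_list, to_list = list(from_values), list(to_values)
    if not from_list or not to_list:
        return None  # assumption: empty columns are never candidates
    if coverage(from_list, to_list) < coverage_threshold:
        return None  # too few "from" values are covered by the "to" column

    # A column behaves like a key when it has no repeated values.
    from_is_key = len(set(from_list)) == len(from_list)
    to_is_key = len(set(to_list)) == len(to_list)

    if to_is_key:
        # The "to" column acts as a primary key: m:1, or 1:1 if both are keys.
        return "1:1" if from_is_key else "m:1"
    if include_many_to_many and not from_is_key:
        return "m:m"
    # Only the "from" side is a key: this direction is not m:1
    # (the reversed pair might be).
    return None


orders_customer_id = [1, 1, 2, 3]
customers_id = [1, 2, 3, 4]
print(coverage(orders_customer_id, customers_id))
print(suggest_multiplicity(orders_customer_id, customers_id))
```

In this sketch, lowering coverage_threshold admits pairs where some "from" values have no match in the "to" column, which mirrors how the parameter is described above.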

list_relationship_violations

Validate whether the content of the tables matches the relationships.

The function examines results of joins for provided relationships and searches for inconsistencies with the specified relationship multiplicity.

Relationships from empty tables (dataframes) are assumed to be valid.

list_relationship_violations(tables: Dict[str, DataFrame] | List[DataFrame], relationships: DataFrame, missing_key_errors='raise', coverage_threshold: float = 1.0, n_keys: int = 10) -> DataFrame

Parameters

Name	Description
tables Required	dict[str, DataFrame] or list[DataFrame] A dictionary that maps table names to the dataframes with table content. If a list of dataframes is provided, the function will try to infer the names from the session variables and if it cannot, it will use the positional index to describe them in the results.
relationships Required	DataFrame A dataframe with relationships to use for validation. It should contain the columns "Multiplicity", "From Table", "From Column", "To Table", "To Column", which match the output of find_relationships.
missing_key_errors	str One of 'raise', 'warn', 'ignore'. Action to take when either table or column of the relationship is not found in the elements of the argument tables. Default value: raise
coverage_threshold	float Fraction of rows in the "from" table that must join in an inner join. Default value: 1.0
n_keys	int Maximum number of missing keys to report. The reported keys may be an arbitrary (random) selection. Default value: 10

Returns

Type	Description
DataFrame	Dataframe with relationships, error type and error message. If there are no violations, returns an empty DataFrame.
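
The checks described above can be illustrated with a small pure-Python sketch. This is only an approximation under stated assumptions, not the library's implementation: the function name find_violations, the error type strings "duplicate_to_keys", "duplicate_from_keys" and "partial_join", and the message wording are all hypothetical, and columns are modeled as plain lists.

```python
from collections import Counter


def find_violations(from_rows, to_rows, multiplicity="m:1",
                    coverage_threshold=1.0, n_keys=10):
    """Return a list of {"type", "message"} dicts for one relationship."""
    violations = []
    if not from_rows:
        return violations  # relationships from empty tables are treated as valid

    # For m:1 and 1:1 the "to" column must behave like a primary key.
    if multiplicity in ("m:1", "1:1"):
        duplicated = [k for k, n in Counter(to_rows).items() if n > 1]
        if duplicated:
            violations.append({
                "type": "duplicate_to_keys",  # hypothetical error type
                "message": f"duplicated 'to' keys: {duplicated[:n_keys]}",
            })

    # For 1:1 the "from" column must be unique as well.
    if multiplicity == "1:1":
        duplicated = [k for k, n in Counter(from_rows).items() if n > 1]
        if duplicated:
            violations.append({
                "type": "duplicate_from_keys",  # hypothetical error type
                "message": f"duplicated 'from' keys: {duplicated[:n_keys]}",
            })

    # Fraction of "from" rows that would survive an inner join.
    to_keys = set(to_rows)
    joined = sum(1 for value in from_rows if value in to_keys)
    if joined / len(from_rows) < coverage_threshold:
        # Report at most n_keys distinct missing keys, in first-seen order.
        missing = list(dict.fromkeys(
            value for value in from_rows if value not in to_keys))[:n_keys]
        violations.append({
            "type": "partial_join",  # hypothetical error type
            "message": f"'from' keys missing in 'to': {missing}",
        })
    return violations


for violation in find_violations([1, 1, 2, 9, 8], [1, 2, 3]):
    print(violation["type"], violation["message"])
```

The sketch reports missing keys in first-seen order for determinism; the documented function does not promise which keys are selected.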

plot_relationship_metadata

Plot a graph of relationships based on metadata contained in the provided dataframe.

The input "metadata" dataframe should contain one row per relationship. Each row names the "from" and "to" table/columns that participate in the relationship, and their multiplicity as defined by Multiplicity.

plot_relationship_metadata(metadata_df: DataFrame, tables: Dict[str, DataFrame] | List[DataFrame] | None = None, include_columns: str = 'keys', missing_key_errors='raise', *, graph_attributes: Dict | None = None) -> Digraph

Parameters

Name	Description
metadata_df Required	DataFrame A "metadata" dataframe with relationships to plot. It should contain the columns "Multiplicity", "From Table", "From Column", "To Table", "To Column", which match the output of find_relationships.
tables	dict[str, DataFrame] or list[DataFrame] Needs to be provided only when include_columns = 'all'; it is used to map table names from the relationships to the dataframe columns. Default value: None
include_columns	str One of 'keys', 'all', 'none'. Indicates which columns should be included in the graph. Default value: keys
missing_key_errors	str One of 'raise', 'warn', 'ignore'. Action to take when either table or column of the relationship is not found in the elements of the argument tables. Default value: raise
graph_attributes	dict Attributes passed to graphviz. Note that all values need to be strings. Useful attributes are: rankdir: "TB" (top-bottom) or "LR" (left-right); dpi: "100", "30", etc. (dots per inch); splines: "ortho", "compound", "line", "curved", "spline" (line shape). Default value: None

Keyword-Only Parameters

Name	Description
graph_attributes	Default value: None

Returns

Type	Description
Digraph	Graph object containing all relationships. If columns are included (see include_columns), they are represented as ports in the graph.
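
As a rough illustration of how one metadata row per relationship maps to a directed graph, the sketch below builds Graphviz DOT text by hand from plain dictionaries. It is an assumption-laden stand-in, not the library's rendering: the function name to_dot is hypothetical, rows are modeled as dicts keyed by the documented column names, columns are not drawn as ports, and names are assumed not to contain double quotes. The documented function returns a graphviz Digraph object instead of text.

```python
def to_dot(metadata_rows, graph_attributes=None):
    """Build DOT source with one edge per relationship row."""
    lines = ["digraph relationships {"]

    # Graph-level attributes such as rankdir or dpi; values must be strings.
    for name, value in (graph_attributes or {}).items():
        if not isinstance(value, str):
            raise TypeError(
                f"graph attribute {name!r} must be a string, "
                f"got {type(value).__name__}")
        lines.append(f'  {name}="{value}";')

    # One edge from the "from" table to the "to" table per relationship.
    for row in metadata_rows:
        label = (f'{row["From Column"]} -> {row["To Column"]} '
                 f'({row["Multiplicity"]})')
        lines.append(
            f'  "{row["From Table"]}" -> "{row["To Table"]}" '
            f'[label="{label}"];')

    lines.append("}")
    return "\n".join(lines)


rows = [{
    "Multiplicity": "m:1",
    "From Table": "orders",
    "From Column": "customer_id",
    "To Table": "customers",
    "To Column": "id",
}]
print(to_dot(rows, graph_attributes={"rankdir": "LR", "dpi": "100"}))
```

Passing a non-string value such as {"dpi": 100} raises a TypeError in this sketch, which reflects the note that all attribute values need to be strings.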
