Azure Databricks GraphFrames recursive child

Question

Pankaj Joshi 411

I am looking for Databricks GraphFrames breadth-first search (BFS) code in PySpark that implements the logic explained below:

(EXAMPLE)

For each EMP_ROWID in df_data (for example "6780001"), look for matching records in df_rltnshp where ROWID2 equals that value. Here it matches and returns ROWID1 values "6669300", "6661974" and "6661975", with level "1".

Next, look up each returned ROWID1 ("6669300", "6661974" and "6661975") in ROWID2 again. Only "6669300" matches, and it returns ROWID1 "1239300" with level "2".

This should continue recursively until no more matching records are found.

(PLEASE SEE INPUT DATA AND EXPECTED OUTPUT DATA BELOW)

(1)

co_data = [
    ["6780001"],
    ["6063024"],
    ["6780002"],
    ["6780011"]
]
columns_co = ['EMP_ROWID']
df_data = spark.createDataFrame(co_data, columns_co)

(2)

rltnshp = [
    ["6669300", "6780001"],
    ["6661974", "6780001"],
    ["6661975", "6780001"],
    ["1239300", "6669300"],
    ["5555555", "6063024"],
    ["6666666", "6780002"],
    ["4444444", "6780011"],
    ["3333333", "4444444"]
]
columns_rl = ['ROWID1', 'ROWID2']
df_rltnshp = spark.createDataFrame(rltnshp, columns_rl)

EXPECTED OUTPUT

+--------+--------+-----+
|parent  |child   |level|
+--------+--------+-----+
|6780001 |6669300 |1    |
|6780001 |6661974 |1    |
|6780001 |6661975 |1    |
|6669300 |1239300 |2    |
|6063024 |5555555 |1    |
|6780002 |6666666 |1    |
|6780011 |4444444 |1    |
|4444444 |3333333 |2    |
+--------+--------+-----+
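
The lookup described above can be sketched in plain Python before moving it to Spark. This is only an illustration on the sample data: the function name `expand_levels` and the dictionary index are assumptions made for the example, not part of GraphFrames or PySpark, and the sketch assumes the relationship data contains no cycles.

```python
from collections import defaultdict


def expand_levels(start_ids, relationships):
    """Return (parent, child, level) rows, one level at a time.

    relationships holds [ROWID1, ROWID2] pairs, where ROWID2 is the parent
    and ROWID1 is the child. Assumes the data has no cycles.
    """
    # Index children by parent so every lookup is a dictionary access.
    children_of = defaultdict(list)
    for child, parent in relationships:
        children_of[parent].append(child)

    rows = []
    frontier = list(start_ids)  # nodes whose children are looked up next
    level = 1
    while frontier:
        next_frontier = []
        for parent in frontier:
            for child in children_of.get(parent, []):
                rows.append((parent, child, level))
                next_frontier.append(child)
        frontier = next_frontier
        level += 1
    return rows


co_data = ["6780001", "6063024", "6780002", "6780011"]
rltnshp = [
    ["6669300", "6780001"],
    ["6661974", "6780001"],
    ["6661975", "6780001"],
    ["1239300", "6669300"],
    ["5555555", "6063024"],
    ["6666666", "6780002"],
    ["4444444", "6780011"],
    ["3333333", "4444444"],
]

result = expand_levels(co_data, rltnshp)
for row in result:
    print(row)
```

Each pass of the `while` loop corresponds to one level: the children found in one pass become the parents of the next, which is the same idea a level-by-level join in Spark would follow.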

Answer accepted by question author

Answer 1

Pratyush Vashistha 5,135 Microsoft External Staff Moderator

Hello Pankaj Joshi, thanks for reaching out.

Here is a solution for traversing your hierarchical relationship data breadth-first. It builds a GraphFrame and also includes a simpler iterative join approach, which is the one executed at the end:

from graphframes import GraphFrame
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
# Initialize Spark Session (if not already initialized)
spark = SparkSession.builder.appName("GraphFrameBFS").getOrCreate()
# Input Data
co_data = [
    ["6780001"],
    ["6063024"],
    ["6780002"],
    ["6780011"]
]
columns_co = ['EMP_ROWID']
df_data = spark.createDataFrame(co_data, columns_co)
rltnshp = [
    ["6669300", "6780001"],
    ["6661974", "6780001"],
    ["6661975", "6780001"],
    ["1239300", "6669300"],
    ["5555555", "6063024"],
    ["6666666", "6780002"],
    ["4444444", "6780011"],
    ["3333333", "4444444"]
]
columns_rl = ['ROWID1', 'ROWID2']
df_rltnshp = spark.createDataFrame(rltnshp, columns_rl)
# Create vertices DataFrame (all unique nodes)
vertices_from_data = df_data.select(col("EMP_ROWID").alias("id"))
vertices_from_rltnshp1 = df_rltnshp.select(col("ROWID1").alias("id"))
vertices_from_rltnshp2 = df_rltnshp.select(col("ROWID2").alias("id"))
vertices = vertices_from_data.union(vertices_from_rltnshp1).union(vertices_from_rltnshp2).distinct()
# Create edges DataFrame (ROWID2 -> ROWID1, representing parent -> child relationship)
edges = df_rltnshp.select(
    col("ROWID2").alias("src"),  # parent
    col("ROWID1").alias("dst")   # child
)
# Create GraphFrame
graph = GraphFrame(vertices, edges)
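# Function to perform BFS for each starting node
# Note: this function is defined for reference but is not called below;
# only iterative_bfs_approach is executed. graph.bfs looks for shortest paths
# to vertices matching toExpr, and "id IS NOT NULL" also matches the start
# vertex itself, so the v1, v2, ... columns used below may be missing.
def perform_bfs_for_all_nodes(graph, start_nodes_df):
    """
    Perform BFS for all starting nodes and return hierarchical results
    """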
    all_results = []
    
    # Convert start_nodes_df to list for iteration
    start_nodes_list = [row['EMP_ROWID'] for row in start_nodes_df.collect()]
    
    for start_node in start_nodes_list:
        # Perform BFS from current start node
        bfs_result = graph.bfs(
            fromExpr=f"id = '{start_node}'",
            toExpr="id IS NOT NULL",  # Find all reachable nodes
            maxPathLength=10  # Adjust based on your expected maximum depth
        )
        
        if bfs_result.count() > 0:
            # Extract parent-child relationships with levels
            # BFS result contains columns like: from, e0, v1, e1, v2, e2, v3, ...
            # We need to extract the path information
            
            # Get the schema to understand the structure
            bfs_columns = bfs_result.columns
            
            # Process each level of the BFS result
            current_results = []
            
            # Level 1: from -> v1
            level_1 = bfs_result.filter(col("v1").isNotNull()).select(
                col("from.id").alias("parent"),
                col("v1.id").alias("child"),
                lit(1).alias("level")
            )
            if level_1.count() > 0:
                current_results.append(level_1)
            
            # Level 2: v1 -> v2
            level_2 = bfs_result.filter(col("v2").isNotNull()).select(
                col("v1.id").alias("parent"),
                col("v2.id").alias("child"),
                lit(2).alias("level")
            )
            if level_2.count() > 0:
                current_results.append(level_2)
            
            # Level 3: v2 -> v3 (add more levels as needed)
            level_3 = bfs_result.filter(col("v3").isNotNull()).select(
                col("v2.id").alias("parent"),
                col("v3.id").alias("child"),
                lit(3).alias("level")
            )
            if level_3.count() > 0:
                current_results.append(level_3)
            
            # Union all levels for current start node
            if current_results:
                node_result = current_results[0]
                for i in range(1, len(current_results)):
                    node_result = node_result.union(current_results[i])
                all_results.append(node_result)
    
    # Union all results from all start nodes
    if all_results:
        final_result = all_results[0]
        for i in range(1, len(all_results)):
            final_result = final_result.union(all_results[i])
        return final_result
    else:
        # Return empty DataFrame with correct schema
        schema = StructType([
            StructField("parent", StringType(), True),
            StructField("child", StringType(), True),
            StructField("level", IntegerType(), True)
        ])
        return spark.createDataFrame([], schema)
# Alternative Simpler Approach using iterative BFS
def iterative_bfs_approach(df_data, df_rltnshp):
    """
    Iterative approach to build hierarchy levels
    """
    # Initialize result DataFrame
    result_schema = StructType([
        StructField("parent", StringType(), True),
        StructField("child", StringType(), True),
        StructField("level", IntegerType(), True)
    ])
    result_df = spark.createDataFrame([], result_schema)
    
    # Start with level 1: direct children of start nodes
    current_parents = df_data.select(col("EMP_ROWID").alias("parent"))
    level = 1
    
    while current_parents.count() > 0:
        # Find children for current parents
        level_result = current_parents.join(
            df_rltnshp, 
            current_parents.parent == df_rltnshp.ROWID2, 
            "inner"
        ).select(
            col("parent"),
            col("ROWID1").alias("child"),
            lit(level).alias("level")
        )
        
        if level_result.count() == 0:
            break
            
        # Add to result
        result_df = result_df.union(level_result)
        
        # Prepare for next level
        current_parents = level_result.select(col("child").alias("parent")).distinct()
        level += 1
        
        # Safety check to prevent infinite loops
        if level > 10:
            break
    
    return result_df
# Execute the BFS traversal
print("=== Using Iterative BFS Approach ===")
final_result = iterative_bfs_approach(df_data, df_rltnshp)
# Sort by parent and level for better readability
final_result = final_result.orderBy("parent", "level", "child")
# Display results
final_result.show()
# Verify the count and structure
print(f"Total relationships found: {final_result.count()}")
print("\nSchema:")
final_result.printSchema()

Output:

User's image

Please "Accept as Answer" if the answer provided is useful, so that you can help others in the community looking for remediation for similar issues.

Thanks

Pratyush

Pankaj Joshi 411

If I add more levels (rows 5-8 below), the output shows duplicate records and is not correct.

The relationship data can go to any depth, so the code should handle that.

rltnshp = [
    ["6669300", "6780001"],
    ["6661974", "6780001"],
    ["6661975", "6780001"],
    ["1239300", "6669300"],
    ["2239300", "1239300"], 
    ["3239300", "2239300"],
    ["1239300", "3239300"], 
    ["0239300", "1239300"],           
    ["5555555", "6063024"],
    ["6666666", "6780002"],
    ["4444444", "6780011"],
    ["3333333", "4444444"]
]
columns_rl = ['ROWID1', 'ROWID2']
df_rltnshp = spark.createDataFrame(rltnshp, columns_rl)

Pratyush Vashistha 5,135 Reputation points Microsoft External Staff Moderator

2025-09-23T09:45:25.4566667+00:00
Thanks for your further inputs, Pankaj Joshi.

I see the issue with duplicate records, but I need to understand the requirements better before changing the code.

Looking at your updated relationship data, I see a potential cycle:

1239300 → 2239300 (level 3)

2239300 → 3239300 (level 4)

3239300 → 1239300 (level 5) ← This creates a cycle back to level 2

My question is: how should we handle cycles in the relationship graph? Should we:

Stop traversal when we detect a cycle?

Continue for a maximum number of levels?

Track visited nodes to avoid infinite loops?

Given the cycle 1239300 → 2239300 → 3239300 → 1239300, what should the expected output be?

Should it be:

|6780001 |6669300 |1 |
|6780001 |6661974 |1 |
|6780001 |6661975 |1 |
|6669300 |1239300 |2 |
|1239300 |2239300 |3 |
|2239300 |3239300 |4 |
|3239300 |1239300 |5 | ← This creates a cycle
|1239300 |0239300 |2 | or 6? ← Which level for this?

I also noticed that 1239300 appears as both:

Child of 6669300 (would be level 2)

Child of 3239300 (would be level 5 due to cycle)

So, when a node has multiple parents at different levels, should we:

Show it at all levels where it appears?

Show it only at the first/minimum level?

Show it only at the last/maximum level?

Could you please provide the exact expected output for your updated relationship data? This will help me understand:

How cycles should be handled

Which level nodes with multiple parents should appear at

Whether we should show duplicate relationships at different levels

The duplicates are likely occurring because:

The algorithm revisits nodes, creating duplicate paths

Nodes reachable via multiple paths are being processed multiple times

It is unclear how levels should be assigned when cycles exist

Please clarify these points so I can provide the most accurate solution for your use case.
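
One of the options listed above, tracking visited nodes, can be sketched in plain Python as follows. It is a sketch under assumptions rather than a recommendation: the function name `expand_levels_no_revisit` is hypothetical, each node is expanded only at the first (minimum) level where it is reached, and an edge that leads back to an already visited node is still reported once so the cycle stays visible in the output.

```python
from collections import defaultdict


def expand_levels_no_revisit(start_ids, relationships, max_levels=100):
    """Level-by-level traversal that expands every node at most once."""
    children_of = defaultdict(list)
    for child, parent in relationships:  # pairs are [ROWID1, ROWID2]
        children_of[parent].append(child)

    visited = set(start_ids)  # nodes that are already queued or expanded
    frontier = list(start_ids)
    rows = []
    level = 1
    while frontier and level <= max_levels:
        next_frontier = []
        for parent in frontier:
            for child in children_of.get(parent, []):
                rows.append((parent, child, level))
                if child not in visited:
                    # First time this node is reached: expand it next level.
                    visited.add(child)
                    next_frontier.append(child)
        frontier = next_frontier
        level += 1
    return rows


co_data = ["6780001", "6063024", "6780002", "6780011"]
rltnshp = [
    ["6669300", "6780001"],
    ["6661974", "6780001"],
    ["6661975", "6780001"],
    ["1239300", "6669300"],
    ["2239300", "1239300"],
    ["3239300", "2239300"],
    ["1239300", "3239300"],  # closes the cycle back to 1239300
    ["0239300", "1239300"],
    ["5555555", "6063024"],
    ["6666666", "6780002"],
    ["4444444", "6780011"],
    ["3333333", "4444444"],
]

rows = expand_levels_no_revisit(co_data, rltnshp)
for row in sorted(rows, key=lambda r: (r[2], r[0], r[1])):
    print(row)
```

Because a node enters the frontier only once, the loop terminates even with the cycle present, and every relationship appears exactly once. Whether the first level is the right one to report is exactly the open question in this post.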
Pankaj Joshi 411 Reputation points

2025-09-23T10:26:39.4233333+00:00

Here is the correct tree structure.

1239300 appeared twice earlier because of a typo; I have corrected it in the tree below.

For example, 6780001 is the direct parent of 6669300, 6661974 and 6661975 (level 1).

6669300 is the direct parent of 1239300 (level 2); similarly, 1239300 is the direct parent of 2239300 (level 3), and so on.

The depth can go to any level (n = 50, 100, etc.).

I hope this clarifies it. Let me know if you need more clarification.
Pratyush Vashistha 5,135 Reputation points Microsoft External Staff Moderator

2025-09-23T12:06:36.5733333+00:00
Thank you for the clarification! Now I understand: this is a simple hierarchical tree structure without cycles, and the levels can go arbitrarily deep. Let me work on this and come up with updated code.

From your tree diagram, I can see:

6780001 → 6669300, 6661974, 6661975 (Level 1)

6669300 → 1239300 (Level 2)

1239300 → 2239300 (Level 3)

2239300 → 3239300 (Level 4)

3239300 → 1239301 (Level 5)

1239301 → 2239301 (Level 6)

And so on...

Let me know if my understanding is correct. I will proceed with this understanding. Thanks for sharing these inputs.
Pankaj Joshi 411 Reputation points

2025-09-23T13:18:03.1833333+00:00

Yes, it is correct.

Pratyush Vashistha 5,135 Microsoft External Staff Moderator

Hello Pankaj Joshi.

Could you please try this updated code and let me know if it works? I tried it on my end and it works as per the requirement. I have added some extra steps for tracking and debugging.

from graphframes import GraphFrame
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
# Initialize Spark Session
spark = SparkSession.builder.appName("GraphFrameBFS").getOrCreate()
# Input Data
co_data = [
    ["6780001"],
    ["6063024"],
    ["6780002"],
    ["6780011"]
]
columns_co = ['EMP_ROWID']
df_data = spark.createDataFrame(co_data, columns_co)
# Corrected relationship data
rltnshp = [
    ["6669300", "6780001"],
    ["6661974", "6780001"],
    ["6661975", "6780001"],
    ["1239300", "6669300"],
    ["2239300", "1239300"], 
    ["3239300", "2239300"],
    ["1239301", "3239300"],  # Corrected: different from 1239300
    ["2239301", "1239301"],  # Next level           
    ["5555555", "6063024"],
    ["6666666", "6780002"],
    ["4444444", "6780011"],
    ["3333333", "4444444"]
]
columns_rl = ['ROWID1', 'ROWID2']
df_rltnshp = spark.createDataFrame(rltnshp, columns_rl)
def build_hierarchy_iterative(df_start_nodes, df_relationships, max_levels=100):
    """
    Build complete hierarchy using iterative approach
    Can handle any number of levels (up to max_levels)
    """
    # Initialize result DataFrame
    result_schema = StructType([
        StructField("parent", StringType(), True),
        StructField("child", StringType(), True),
        StructField("level", IntegerType(), True)
    ])
    result_df = spark.createDataFrame([], result_schema)
    
    # Start with root nodes as current parents
    current_parents = df_start_nodes.select(col("EMP_ROWID").alias("parent"))
    level = 1
    
    print("Starting BFS traversal...")
    print(f"Initial parents count: {current_parents.count()}")
    
    while current_parents.count() > 0 and level <= max_levels:
        print(f"Processing Level {level}...")
        
        # Find children for current parents
        level_result = current_parents.join(
            df_relationships, 
            current_parents.parent == df_relationships.ROWID2, 
            "inner"
        ).select(
            col("parent"),
            col("ROWID1").alias("child"),
            lit(level).alias("level")
        )
        
        level_count = level_result.count()
        print(f"Level {level} found {level_count} relationships")
        
        if level_count == 0:
            print(f"No more relationships found at level {level}. Stopping.")
            break
            
        # Add current level results to final result
        result_df = result_df.union(level_result)
        
        # Prepare for next level: children become parents
        current_parents = level_result.select(col("child").alias("parent")).distinct()
        level += 1
        
        # Show progress for deep hierarchies
        if level % 10 == 0:
            print(f"Processed {level-1} levels so far...")
    
    print(f"BFS traversal completed. Total levels processed: {level-1}")
    return result_df
# Execute the hierarchy building
print("=== Building Complete Hierarchy ===")
final_result = build_hierarchy_iterative(df_data, df_rltnshp)
# Sort results for better readability
final_result_sorted = final_result.orderBy("parent", "level", "child")
# Display results
print("\n=== Final Results ===")
final_result_sorted.show(50, truncate=False)
# Show statistics
print(f"\nTotal parent-child relationships found: {final_result.count()}")
# Show level distribution
print("\n=== Level Distribution ===")
level_stats = final_result.groupBy("level").count().orderBy("level")
level_stats.show()
# Show hierarchy by starting node
print("\n=== Hierarchy by Starting Node ===")
for start_node in [row['EMP_ROWID'] for row in df_data.collect()]:
    print(f"\nHierarchy starting from {start_node}:")
    
    # Start from the direct children of the start node
    start_node_descendants = final_result_sorted.filter(col("parent") == start_node)
    
    # Collect all descendants iteratively
    current_level_children = [row['child'] for row in start_node_descendants.collect()]
    all_descendants = set(current_level_children)
    
    while current_level_children:
        next_level_children = []
        for child in current_level_children:
            child_relationships = final_result_sorted.filter(col("parent") == child).collect()
            for rel in child_relationships:
                if rel['child'] not in all_descendants:
                    next_level_children.append(rel['child'])
                    all_descendants.add(rel['child'])
        current_level_children = next_level_children
    
    # Show complete hierarchy for this start node
    all_subtree_parents = [start_node] + list(all_descendants)
    
    start_node_hierarchy = final_result_sorted.filter(
        col("parent").isin(all_subtree_parents)
    ).orderBy("level", "parent", "child")
    
    if start_node_hierarchy.count() > 0:
        start_node_hierarchy.show(truncate=False)
    else:
        print("No hierarchy found")
# Alternative simpler approach to show hierarchy by root node
print("\n=== Alternative: Direct Root Node Hierarchies ===")
def get_hierarchy_for_root(root_node, result_df):
    """
    Get complete hierarchy starting from a root node
    """
    print(f"\nComplete hierarchy for root node: {root_node}")
    
    # Start with direct children of root
    current_parents = [root_node]
    
    level = 1
    while current_parents:
        # Get relationships for current parents
        level_relationships = result_df.filter(
            col("parent").isin(current_parents) & (col("level") == level)
        ).orderBy("parent", "child")
        
        if level_relationships.count() == 0:
            break
            
        print(f"Level {level}:")
        level_relationships.show(truncate=False)
        
        # Get children for next level
        current_parents = [row['child'] for row in level_relationships.collect()]
        level += 1
# Show hierarchy for each root node
for root_node in [row['EMP_ROWID'] for row in df_data.collect()]:
    get_hierarchy_for_root(root_node, final_result_sorted)

Expected output:

User's image


Thanks

Pratyush
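
The per-root display in the code above walks the result again for every start node. An alternative, sketched here in plain Python under the assumption that the edge list fits in memory, is to carry the root id along during the traversal so that each row already records which start node it belongs to. The name `expand_with_root` and the four-field row layout are illustrative choices, not something taken from the code above, and the sketch assumes the corrected, cycle-free data.

```python
from collections import defaultdict


def expand_with_root(start_ids, relationships, max_levels=100):
    """Return (root, parent, child, level) rows for every start node."""
    children_of = defaultdict(list)
    for child, parent in relationships:  # pairs are [ROWID1, ROWID2]
        children_of[parent].append(child)

    rows = []
    # Each frontier entry is (root, node): the root travels with the node.
    frontier = [(root, root) for root in start_ids]
    level = 1
    while frontier and level <= max_levels:
        next_frontier = []
        for root, parent in frontier:
            for child in children_of.get(parent, []):
                rows.append((root, parent, child, level))
                next_frontier.append((root, child))
        frontier = next_frontier
        level += 1
    return rows


co_data = ["6780001", "6063024", "6780002", "6780011"]
rltnshp = [
    ["6669300", "6780001"],
    ["6661974", "6780001"],
    ["6661975", "6780001"],
    ["1239300", "6669300"],
    ["2239300", "1239300"],
    ["3239300", "2239300"],
    ["1239301", "3239300"],
    ["2239301", "1239301"],
    ["5555555", "6063024"],
    ["6666666", "6780002"],
    ["4444444", "6780011"],
    ["3333333", "4444444"],
]

rows = expand_with_root(co_data, rltnshp)

# Group the rows by root to print one hierarchy per start node.
by_root = defaultdict(list)
for root, parent, child, level in rows:
    by_root[root].append((parent, child, level))

for root in co_data:
    print(f"Hierarchy starting from {root}:")
    for parent, child, level in by_root[root]:
        print(f"  level {level}: {parent} -> {child}")
```

The same idea carries over to a DataFrame version by keeping a `root` column alongside `parent` in the frontier, so a single filter on `root` returns a complete subtree.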

Pankaj Joshi 411 Reputation points

2025-09-24T11:23:19.47+00:00

Hi Pratyush,

The code is working fine. However, it is not using the GraphFrames library, right?

It took 4 minutes to complete.

If the whole tree has 3 lakh (300,000) child relationships, is there any other library I can use for this use case for maximum performance?
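
On the performance question: 3 lakh (300,000) relationships is a small edge list by Spark standards, so one option is to bring the edges to the driver once and traverse them in plain Python instead of running a Spark job per level. The following is a sketch under that assumption, not a benchmark. `traverse_in_memory` is a hypothetical helper, the chain of 300,000 nodes is synthetic test data, and in a notebook the pairs could come from something like `df_rltnshp.collect()`.

```python
from collections import defaultdict, deque


def traverse_in_memory(start_ids, edge_pairs):
    """Breadth-first traversal over (child, parent) pairs held in memory."""
    # Build the parent -> children index once; lookups are then O(1).
    children_of = defaultdict(list)
    for child, parent in edge_pairs:
        children_of[parent].append(child)

    rows = []
    seen = set(start_ids)
    # deque gives O(1) pops from the left, unlike list.pop(0).
    queue = deque((node, 0) for node in start_ids)
    while queue:
        parent, depth = queue.popleft()
        for child in children_of.get(parent, []):
            rows.append((parent, child, depth + 1))
            if child not in seen:
                seen.add(child)
                queue.append((child, depth + 1))
    return rows


# Synthetic worst case for depth: a single chain 0 -> 1 -> ... -> 300000.
edges = [(str(i + 1), str(i)) for i in range(300_000)]
rows = traverse_in_memory(["0"], edges)
print(len(rows), rows[-1])
```

The traversal is iterative, so depth is not limited by Python's recursion limit, and the result can be turned back into a DataFrame with `spark.createDataFrame` if later steps need Spark.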

Pratyush Vashistha 5,135 Microsoft External Staff Moderator

Hey Pankaj, yes, I was also working on a GraphFrames alternative for deep hierarchies. I did not get a chance to test it in my own environment, but I am sharing it with you. The code follows.

def graphframe_bfs_deep_hierarchy(df_start_nodes, df_relationships, max_depth=50):
    """
    GraphFrame approach for very deep hierarchies
    """
    # Create vertices
    vertices_from_data = df_start_nodes.select(col("EMP_ROWID").alias("id"))
    vertices_from_rltnshp1 = df_relationships.select(col("ROWID1").alias("id"))
    vertices_from_rltnshp2 = df_relationships.select(col("ROWID2").alias("id"))
    vertices = vertices_from_data.union(vertices_from_rltnshp1).union(vertices_from_rltnshp2).distinct()
    
    # Create edges (parent -> child)
    edges = df_relationships.select(
        col("ROWID2").alias("src"),  # parent
        col("ROWID1").alias("dst")   # child
    )
    
    # Create GraphFrame
    graph = GraphFrame(vertices, edges)
    
    # Get all start nodes
    start_nodes = [row['EMP_ROWID'] for row in df_start_nodes.collect()]
    
    all_results = []
    
    for start_node in start_nodes:
        print(f"Processing GraphFrame BFS for start node: {start_node}")
        
        # Perform BFS with sufficient max path length
        bfs_result = graph.bfs(
            fromExpr=f"id = '{start_node}'",
            toExpr="id IS NOT NULL",
            maxPathLength=max_depth * 2  # generous upper bound; maxPathLength limits path length in edges
        )
        
        if bfs_result.count() > 0:
            # Extract hierarchical relationships from BFS result
            current_results = []
            
            # Dynamically process all levels found in BFS result
            bfs_columns = bfs_result.columns
            max_vertex_num = 0
            
            # Find maximum vertex number in columns (v1, v2, v3, etc.)
            for col_name in bfs_columns:
                if col_name.startswith('v') and col_name[1:].isdigit():
                    vertex_num = int(col_name[1:])
                    max_vertex_num = max(max_vertex_num, vertex_num)
            
            # Extract relationships for each level
            for level in range(1, max_vertex_num + 1):
                if level == 1:
                    parent_col = "from.id"
                    child_col = f"v{level}.id"
                else:
                    parent_col = f"v{level-1}.id"
                    child_col = f"v{level}.id"
                
                level_df = bfs_result.filter(col(f"v{level}").isNotNull()).select(
                    col(parent_col).alias("parent"),
                    col(child_col).alias("child"),
                    lit(level).alias("level")
                )
                
                if level_df.count() > 0:
                    current_results.append(level_df)
            
            # Union all levels for current start node
            if current_results:
                node_result = current_results[0]
                for i in range(1, len(current_results)):
                    node_result = node_result.union(current_results[i])
                all_results.append(node_result)
    
    # Union results from all start nodes
    if all_results:
        final_result = all_results[0]
        for i in range(1, len(all_results)):
            final_result = final_result.union(all_results[i])
        return final_result.distinct()
    else:
        # Return empty DataFrame with correct schema
        schema = StructType([
            StructField("parent", StringType(), True),
            StructField("child", StringType(), True),
            StructField("level", IntegerType(), True)
        ])
        return spark.createDataFrame([], schema)
# Execute GraphFrame approach for comparison
print("\n=== GraphFrame BFS Approach ===")
gf_result = graphframe_bfs_deep_hierarchy(df_data, df_rltnshp, max_depth=20)
gf_result.orderBy("parent", "level", "child").show(50, truncate=False)

Let me know if this works. I request you to kindly accept the answer if the requirement is fulfilled.

Pankaj Joshi 411 Reputation points

2025-09-24T12:18:14.2266667+00:00

No, it is returning zero rows.

Pratyush Vashistha 5,135 Microsoft External Staff Moderator

Here we go with the updated code. Sharing the output as well.

from graphframes import GraphFrame
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

def graphframe_bfs_fixed(df_start_nodes, df_relationships, max_depth=20):
    """
    Fixed GraphFrame approach for hierarchical traversal
    """
    print("=== Starting GraphFrame BFS (Fixed Version) ===")
    
    # Create vertices - all unique nodes
    vertices_from_data = df_start_nodes.select(col("EMP_ROWID").alias("id"))
    vertices_from_rltnshp1 = df_relationships.select(col("ROWID1").alias("id"))
    vertices_from_rltnshp2 = df_relationships.select(col("ROWID2").alias("id"))
    vertices = vertices_from_data.union(vertices_from_rltnshp1).union(vertices_from_rltnshp2).distinct()
    
    print(f"Total vertices: {vertices.count()}")
    print("Vertices:")
    vertices.show()
    
    # Create edges (parent -> child relationship)
    edges = df_relationships.select(
        col("ROWID2").alias("src"),  # parent
        col("ROWID1").alias("dst")   # child
    )
    
    print(f"Total edges: {edges.count()}")
    print("Edges:")
    edges.show()
    
    # Create GraphFrame
    graph = GraphFrame(vertices, edges)
    
    # Get all start nodes
    start_nodes = [row['EMP_ROWID'] for row in df_start_nodes.collect()]
    print(f"Start nodes: {start_nodes}")
    
    all_results = []
    
    for start_node in start_nodes:
        print(f"\nProcessing GraphFrame BFS for start node: {start_node}")
        
        # Method 1: Use BFS to find paths
        try:
            bfs_result = graph.bfs(
                fromExpr=f"id = '{start_node}'",
                toExpr="id != ''",  # Find all reachable nodes
                maxPathLength=max_depth
            )
            
            print(f"BFS result count: {bfs_result.count()}")
            
            if bfs_result.count() > 0:
                print("BFS Result Schema:")
                bfs_result.printSchema()
                print("Sample BFS Results:")
                bfs_result.show(5, truncate=False)
                
                # Process BFS results to extract parent-child relationships
                node_relationships = extract_relationships_from_bfs(bfs_result)
                if node_relationships.count() > 0:
                    all_results.append(node_relationships)
            
        except Exception as e:
            print(f"BFS failed for {start_node}: {str(e)}")
            print("Trying alternative approach...")
            
            # Alternative: Use motifs to find patterns
            alt_result = find_paths_with_motifs(graph, start_node, max_depth)
            if alt_result.count() > 0:
                all_results.append(alt_result)
    
    # Combine all results
    if all_results:
        final_result = all_results[0]
        for i in range(1, len(all_results)):
            final_result = final_result.union(all_results[i])
        return final_result.distinct()
    else:
        # Return empty DataFrame with correct schema
        schema = StructType([
            StructField("parent", StringType(), True),
            StructField("child", StringType(), True),
            StructField("level", IntegerType(), True)
        ])
        return spark.createDataFrame([], schema)

def extract_relationships_from_bfs(bfs_df):
    """
    Extract parent-child relationships from BFS result DataFrame
    """
    print("Extracting relationships from BFS result...")
    
    # Initialize result list
    relationships = []
    
    # Get column names to understand the BFS structure
    columns = bfs_df.columns
    print(f"BFS columns: {columns}")
    
    # Collect BFS results for processing
    bfs_data = bfs_df.collect()
    
    for row in bfs_data:
        row_dict = row.asDict()
        
        # Extract path information
        # BFS typically returns: from, e0, v1, e1, v2, e2, v3, ...
        
        current_parent = row_dict['from']['id'] if row_dict.get('from') else None
        level = 1
        
        # Process each vertex in the path
        vertex_index = 1
        while f'v{vertex_index}' in row_dict and row_dict[f'v{vertex_index}'] is not None:
            current_child = row_dict[f'v{vertex_index}']['id']
            
            # Add relationship
            relationships.append((current_parent, current_child, level))
            
            # Move to next level
            current_parent = current_child
            level += 1
            vertex_index += 1
    
    # Convert to DataFrame (the explicit schema also covers the empty case)
    schema = StructType([
        StructField("parent", StringType(), True),
        StructField("child", StringType(), True),
        StructField("level", IntegerType(), True)
    ])
    return spark.createDataFrame(relationships, schema)

def find_paths_with_motifs(graph, start_node, max_depth=20):
    """
    Alternative approach using motifs to find hierarchical patterns
    """
    print(f"Using motifs approach for {start_node}")
    
    all_relationships = []
    
    # Find direct children (level 1)
    level_1_motif = "(a)-[e]->(b)"
    level_1_result = graph.find(level_1_motif).filter(col("a.id") == start_node)
    
    if level_1_result.count() > 0:
        level_1_relationships = level_1_result.select(
            col("a.id").alias("parent"),
            col("b.id").alias("child"),
            lit(1).alias("level")
        )
        all_relationships.append(level_1_relationships)
        
        # Get children for next levels
        current_children = [row['child'] for row in level_1_relationships.collect()]
        
        # Find deeper levels
        for level in range(2, min(max_depth + 1, 10)):  # Limit to prevent excessive processing
            if not current_children:
                break
                
            level_relationships = []
            for parent in current_children:
                parent_children = graph.find(level_1_motif).filter(col("a.id") == parent)
                if parent_children.count() > 0:
                    parent_rels = parent_children.select(
                        col("a.id").alias("parent"),
                        col("b.id").alias("child"),
                        lit(level).alias("level")
                    )
                    level_relationships.append(parent_rels)
            
            if level_relationships:
                # Union all relationships at this level
                level_df = level_relationships[0]
                for i in range(1, len(level_relationships)):
                    level_df = level_df.union(level_relationships[i])
                
                all_relationships.append(level_df)
                
                # Get children for next level
                current_children = [row['child'] for row in level_df.collect()]
            else:
                break
    
    # Combine all levels
    if all_relationships:
        result = all_relationships[0]
        for i in range(1, len(all_relationships)):
            result = result.union(all_relationships[i])
        return result.distinct()
    else:
        schema = StructType([
            StructField("parent", StringType(), True),
            StructField("child", StringType(), True),
            StructField("level", IntegerType(), True)
        ])
        return spark.createDataFrame([], schema)

# Simplified and More Reliable GraphFrame Approach
def simple_graphframe_approach(df_start_nodes, df_relationships):
    """
    Simplified GraphFrame approach using motifs level by level
    """
    print("=== Simple GraphFrame Approach ===")
    
    # Create vertices and edges
    vertices_from_data = df_start_nodes.select(col("EMP_ROWID").alias("id"))
    vertices_from_rltnshp1 = df_relationships.select(col("ROWID1").alias("id"))
    vertices_from_rltnshp2 = df_relationships.select(col("ROWID2").alias("id"))
    vertices = vertices_from_data.union(vertices_from_rltnshp1).union(vertices_from_rltnshp2).distinct()
    
    edges = df_relationships.select(
        col("ROWID2").alias("src"),
        col("ROWID1").alias("dst")
    )
    
    graph = GraphFrame(vertices, edges)
    
    # Use simple motif to find all parent-child relationships
    parent_child_motif = "(parent)-[edge]->(child)"
    all_edges_df = graph.find(parent_child_motif)
    
    edge_count = all_edges_df.count()  # count once and reuse the result
    print(f"Found {edge_count} direct parent-child relationships")
    
    if edge_count > 0:
        # Convert to our desired format and determine levels
        direct_relationships = all_edges_df.select(
            col("parent.id").alias("parent_id"),
            col("child.id").alias("child_id")
        )
        
        # Now determine levels using iterative approach
        start_nodes = [row['EMP_ROWID'] for row in df_start_nodes.collect()]

        # Collect the edges once and index them by parent, so the traversal
        # below does not launch a Spark job for every visited node.
        # This assumes the edge list fits in driver memory.
        children_by_parent = {}
        for row in direct_relationships.collect():
            children_by_parent.setdefault(row['parent_id'], []).append(row['child_id'])

        result_data = []

        for start_node in start_nodes:
            # Find all paths from this start node
            visited = set()
            queue = [(start_node, 0)]  # (node, level)

            while queue:
                current_node, current_level = queue.pop(0)

                if current_node in visited:
                    continue
                visited.add(current_node)

                # Look up the children of the current node
                for child_id in children_by_parent.get(current_node, []):
                    child_level = current_level + 1

                    # Add relationship
                    result_data.append((current_node, child_id, child_level))

                    # Add child to queue for next level processing
                    if child_id not in visited:
                        queue.append((child_id, child_level))
        
        # Create result DataFrame
        if result_data:
            schema = StructType([
                StructField("parent", StringType(), True),
                StructField("child", StringType(), True),
                StructField("level", IntegerType(), True)
            ])
            return spark.createDataFrame(result_data, schema).distinct()
    
    # Return empty DataFrame if no relationships found
    schema = StructType([
        StructField("parent", StringType(), True),
        StructField("child", StringType(), True),
        StructField("level", IntegerType(), True)
    ])
    return spark.createDataFrame([], schema)

# Execute the fixed GraphFrame approaches
print("\n=== Testing Fixed GraphFrame BFS ===")
try:
    gf_result = graphframe_bfs_fixed(df_data, df_rltnshp, max_depth=10)
    print("Fixed GraphFrame BFS Results:")
    gf_result.orderBy("parent", "level", "child").show(50, truncate=False)
except Exception as e:
    print(f"Fixed GraphFrame BFS failed: {str(e)}")

print("\n=== Testing Simple GraphFrame Approach ===")
try:
    simple_result = simple_graphframe_approach(df_data, df_rltnshp)
    print("Simple GraphFrame Results:")
    simple_result.orderBy("parent", "level", "child").show(50, truncate=False)
except Exception as e:
    print(f"Simple GraphFrame approach failed: {str(e)}")

Output: (shown as a screenshot in the original post; the image is not reproduced here)
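
The level assignment done on the driver in simple_graphframe_approach can be tried out without a Spark cluster. The following is a minimal sketch in plain Python, assuming the relationships fit in memory; the function name bfs_levels and its arguments are hypothetical and not part of the answer above. It takes the (ROWID2, ROWID1) pairs from the question as (parent, child) edges and records the depth of each child relative to its start node.

```python
from collections import defaultdict, deque


def bfs_levels(start_nodes, edges):
    """Return (parent, child, level) tuples found by BFS from each start node.

    edges is an iterable of (parent, child) pairs. bfs_levels is a
    hypothetical helper used only for illustration.
    """
    # Index children by parent so each lookup is a dictionary access
    children = defaultdict(list)
    for parent, child in edges:
        children[parent].append(child)

    rows = []
    for start in start_nodes:
        visited = {start}            # guards against cycles in the data
        queue = deque([(start, 0)])  # (node, level of that node)
        while queue:
            node, level = queue.popleft()
            for child in children.get(node, []):
                rows.append((node, child, level + 1))
                if child not in visited:
                    visited.add(child)
                    queue.append((child, level + 1))
    return rows


# Sample data from the question: (ROWID1, ROWID2) means ROWID2 is the parent
rltnshp = [
    ("6669300", "6780001"), ("6661974", "6780001"), ("6661975", "6780001"),
    ("1239300", "6669300"), ("5555555", "6063024"), ("6666666", "6780002"),
    ("4444444", "6780011"), ("3333333", "4444444"),
]
start_nodes = ["6780001", "6063024", "6780002", "6780011"]
edges = [(rowid2, rowid1) for rowid1, rowid2 in rltnshp]

for parent, child, level in bfs_levels(start_nodes, edges):
    print(parent, child, level)
```

The resulting tuples can be passed to spark.createDataFrame with the parent/child/level schema used above if a DataFrame is needed.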

For performance, you can consider one of the following approaches:

Approach	Performance	Memory Usage	Scalability	Pros	Cons
PySpark Iterative	⭐⭐⭐⭐	Medium	Excellent	Simple, reliable	Multiple iterations
GraphFrames	⭐⭐	High	Poor	Graph algorithms	Memory intensive, slow
Neo4j + PySpark	⭐⭐⭐⭐⭐	Low	Excellent	Optimized for graphs	Additional infrastructure
NetworkX	⭐	Very High	Poor	Rich algorithms	Single machine, slow

These are the best suggestions I know of. I hope this works for you. All the best and happy learning! Kindly accept the answer so that others can benefit too.

Thanks

Pratyush

Pratyush Vashistha 5,135 Reputation points Microsoft External Staff Moderator

2025-09-25T07:34:27.2333333+00:00

Hello Pankaj Joshi.

Just checking in to see if the above answer helped. If it answers your query, please click Accept Answer and select Yes for "Was this answer helpful". If you have any further queries, do let us know.

Thanks

Pratyush
