GraphFrames user guide - Scala

05/28/2024

This article shows examples from the GraphFrames User Guide.

import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.graphframes._

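Create GraphFrames

You can create a GraphFrame from a vertex DataFrame and an edge DataFrame.

Vertex DataFrame: A vertex DataFrame must contain a special column named id, which specifies a unique ID for each vertex in the graph.
Edge DataFrame: An edge DataFrame must contain two special columns: src (the source vertex ID of the edge) and dst (the destination vertex ID of the edge).

Both DataFrames can have arbitrary other columns. Those columns can represent vertex and edge attributes.

Create the vertices and edges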

// Vertex DataFrame
val v = spark.createDataFrame(List(
  ("a", "Alice", 34),
  ("b", "Bob", 36),
  ("c", "Charlie", 30),
  ("d", "David", 29),
  ("e", "Esther", 32),
  ("f", "Fanny", 36),
  ("g", "Gabby", 60)
)).toDF("id", "name", "age")
// Edge DataFrame
val e = spark.createDataFrame(List(
  ("a", "b", "friend"),
  ("b", "c", "follow"),
  ("c", "b", "follow"),
  ("f", "c", "follow"),
  ("e", "f", "follow"),
  ("e", "d", "friend"),
  ("d", "a", "friend"),
  ("a", "e", "friend")
)).toDF("src", "dst", "relationship")

Create a graph from these vertices and edges:

val g = GraphFrame(v, e)

// This example graph also comes with the GraphFrames package.
// val g = examples.Graphs.friends

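Basic graph and DataFrame queries

GraphFrames provides simple graph queries, such as node degree.

Also, because a GraphFrame represents a graph as a pair of vertex and edge DataFrames, it is easy to run powerful queries directly on those DataFrames. They are available as the vertices and edges fields of the GraphFrame.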

display(g.vertices)

display(g.edges)

The incoming degree of the vertices:

display(g.inDegrees)

The outgoing degree of the vertices:

display(g.outDegrees)

The degree of the vertices:

display(g.degrees)
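
To make the three degree queries concrete, here is a minimal plain-Scala sketch (no Spark required) that computes the same quantities from the (src, dst) pairs of the example edges. The names edgePairs, inDeg, outDeg, and deg are illustrative and not part of the GraphFrames API.

```scala
// Plain-Scala sketch (no Spark): the (src, dst) pairs of the example edge DataFrame.
val edgePairs = Seq(
  ("a", "b"), ("b", "c"), ("c", "b"), ("f", "c"),
  ("e", "f"), ("e", "d"), ("d", "a"), ("a", "e")
)

// In-degree: number of edges whose dst is the vertex.
val inDeg: Map[String, Int] =
  edgePairs.groupBy(_._2).map { case (v, es) => v -> es.size }

// Out-degree: number of edges whose src is the vertex.
val outDeg: Map[String, Int] =
  edgePairs.groupBy(_._1).map { case (v, es) => v -> es.size }

// Degree: in-degree plus out-degree.
val deg: Map[String, Int] =
  (inDeg.keySet ++ outDeg.keySet).map { v =>
    v -> (inDeg.getOrElse(v, 0) + outDeg.getOrElse(v, 0))
  }.toMap

println(deg.toSeq.sorted.mkString(", "))
// prints "(a,3), (b,3), (c,3), (d,2), (e,3), (f,2)"
```

Vertex g has no edges, so it does not appear in any of the three maps. As far as I know, the GraphFrames degree DataFrames likewise omit vertices whose degree is zero.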

You can run queries directly on the vertices DataFrame. For example, find the age of the youngest person in the graph:

val youngest = g.vertices.groupBy().min("age")
display(youngest)

Likewise, you can run queries on the edges DataFrame. For example, count the number of 'follow' relationships in the graph:

val numFollows = g.edges.filter("relationship = 'follow'").count()

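Motif finding

Use motifs to build more complex relationships involving edges and vertices. The following cell finds pairs of vertices with edges in both directions between them. The result is a DataFrame whose column names are the motif keys.

For more details on the API, see the GraphFrame User Guide.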

// Search for pairs of vertices with edges in both directions between them.
val motifs = g.find("(a)-[e]->(b); (b)-[e2]->(a)")
display(motifs)
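
As a rough illustration of what this motif matches, the following plain-Scala sketch (no Spark) looks for edges whose reverse edge also exists. The names motifEdges and reciprocal are hypothetical, and the sketch models only this one pattern, not the motif language in general.

```scala
// Plain-Scala sketch (no Spark): the (src, dst) pairs of the example edges.
val motifEdges = Set(
  ("a", "b"), ("b", "c"), ("c", "b"), ("f", "c"),
  ("e", "f"), ("e", "d"), ("d", "a"), ("a", "e")
)

// An edge (s, d) is part of a match when the reverse edge (d, s) is also present.
val reciprocal: Seq[(String, String)] =
  motifEdges.filter { case (s, d) => motifEdges.contains((d, s)) }.toSeq.sorted

println(reciprocal.mkString(", "))
// prints "(b,c), (c,b)"
```

Each pair shows up once per direction, which corresponds to one result row for each assignment of the motif keys a and b.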

Because the result is a DataFrame, you can build more complex queries on top of the motif. Find all the reciprocal relationships in which one person is older than 30:

val filtered = motifs.filter("b.age > 30")
display(filtered)

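Stateful queries

Most motif queries are stateless and simple to express, as in the examples above. The next example demonstrates a more complex query that carries state along a path in the motif. Express such queries by combining GraphFrame motif finding with filters on the result, where the filters use sequence operations to construct a series of DataFrame columns.

For example, suppose you want to identify chains of 4 vertices with some property defined by a sequence of functions. That is, among chains of 4 vertices a->b->c->d, identify the subset of chains that match this complex filter:

Initialize state on the path.
Update state based on vertex a.
Update state based on vertex b.
And so on for c and d.
If the final state matches some condition, the filter accepts the chain.

The following code snippet demonstrates this process by identifying chains of 4 vertices in which at least 2 of the 3 edges are "friend" relationships. In this example, the state is the running count of "friend" edges; in general, it could be any DataFrame column.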

// Find chains of 4 vertices.
val chain4 = g.find("(a)-[ab]->(b); (b)-[bc]->(c); (c)-[cd]->(d)")

// Query on sequence, with state (cnt)
//  (a) Define method for updating state given the next element of the motif.
def sumFriends(cnt: Column, relationship: Column): Column = {
  when(relationship === "friend", cnt + 1).otherwise(cnt)
}
//  (b) Use sequence operation to apply method to sequence of elements in motif.
//      In this case, the elements are the 3 edges.
val condition = Seq("ab", "bc", "cd").
  foldLeft(lit(0))((cnt, e) => sumFriends(cnt, col(e)("relationship")))
//  (c) Apply filter to DataFrame.
val chainWith2Friends2 = chain4.where(condition >= 2)
display(chainWith2Friends2)
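
The fold above builds a Column expression, which can be hard to picture. The following plain-Scala sketch (no Spark) applies the same initialize, update, and test steps to a single concrete chain, d -> a -> e -> f from the example graph. The names sumFriendsLocal, chainRelationships, friendCount, and accepted are illustrative only.

```scala
// Plain-Scala analogue of the Column-based fold (no Spark needed).
// The state is an Int here instead of a Column.
def sumFriendsLocal(cnt: Int, relationship: String): Int =
  if (relationship == "friend") cnt + 1 else cnt

// Relationships along the chain d -> a -> e -> f in the example graph.
val chainRelationships = Seq("friend", "friend", "follow")

// Initialize the state to 0, then update it once per edge.
val friendCount = chainRelationships.foldLeft(0)(sumFriendsLocal)

// The filter accepts the chain when the final state is at least 2.
val accepted = friendCount >= 2

println(s"friendCount=$friendCount accepted=$accepted")
// prints "friendCount=2 accepted=true"
```

In the DataFrame version the same fold runs once at query-construction time and yields a single Column, which Spark then evaluates for every row of chain4.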

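Subgraphs

GraphFrames provides APIs for building subgraphs by filtering on edges and vertices. These filters can be composed. For example, the following subgraph contains only users who are older than 30 and connected by "friend" edges.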

// Select subgraph of users older than 30, and edges of type "friend"
val g2 = g
  .filterEdges("relationship = 'friend'")
  .filterVertices("age > 30")
  .dropIsolatedVertices()

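Complex triplet filters

The following example shows how to select a subgraph based on triplet filters that operate on an edge and its "src" and "dst" vertices. Extending this example beyond triplets by using more complex motifs is straightforward.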

// Select subgraph based on edges "e" of type "follow"
// pointing from a younger user "a" to an older user "b".
val paths = g.find("(a)-[e]->(b)")
  .filter("e.relationship = 'follow'")
  .filter("a.age < b.age")
// "paths" contains vertex info. Extract the edges.
val e2 = paths.select("e.src", "e.dst", "e.relationship")
// In Spark 1.5+, the user may simplify this call:
//  val e2 = paths.select("e.*")

// Construct the subgraph
val g2 = GraphFrame(g.vertices, e2)

display(g2.vertices)

display(g2.edges)

Standard graph algorithms

This section describes the standard graph algorithms built into GraphFrames.

Breadth-first search (BFS)

Search from "Esther" for users of age < 32.

val paths: DataFrame = g.bfs.fromExpr("name = 'Esther'").toExpr("age < 32").run()
display(paths)

The search can also be restricted with an edge filter and a maximum path length.

val filteredPaths = g.bfs.fromExpr("name = 'Esther'").toExpr("age < 32")
  .edgeFilter("relationship != 'friend'")
  .maxPathLength(3)
  .run()
display(filteredPaths)

Connected components

Compute the connected component membership of each vertex and return a graph with each vertex assigned a component ID.

val result = g.connectedComponents.run() // doesn't work on Spark 1.4
display(result)

Strongly connected components

Compute the strongly connected component (SCC) of each vertex and return a graph with each vertex assigned to the SCC that contains it.

val result = g.stronglyConnectedComponents.maxIter(10).run()
display(result.orderBy("component"))

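Label propagation

Run the static Label Propagation Algorithm (LPA) to detect communities in a network.

Each node in the network is initially assigned to its own community. At every superstep, nodes send their community affiliation to all neighbors and update their state to the mode community affiliation of the incoming messages.

LPA is a standard community detection algorithm for graphs. It is computationally inexpensive, although (1) convergence is not guaranteed and (2) it can end up with a trivial solution in which all nodes are identified as a single community.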

val result = g.labelPropagation.maxIter(5).run()
display(result.orderBy("label"))
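
The superstep rule described above can be sketched in plain Scala (no Spark). This is a simplified model under stated assumptions, not the library's implementation: messages are assumed to flow in both directions along each edge, ties for the most frequent label are broken by taking the smallest label, and a vertex that receives no messages keeps its label. The names lpaEdges, lpaVertices, superstep, initial, and afterFive are illustrative.

```scala
// Plain-Scala sketch (no Spark) of synchronous label propagation.
val lpaEdges = Seq(
  ("a", "b"), ("b", "c"), ("c", "b"), ("f", "c"),
  ("e", "f"), ("e", "d"), ("d", "a"), ("a", "e")
)
val lpaVertices = Seq("a", "b", "c", "d", "e", "f", "g")

// One superstep: every vertex adopts the most frequent label among its incoming messages.
def superstep(labels: Map[String, String]): Map[String, String] = {
  // Assumption: each edge delivers each endpoint's label to the other endpoint.
  val messages: Seq[(String, String)] = lpaEdges.flatMap { case (s, d) =>
    Seq(d -> labels(s), s -> labels(d))
  }
  val received: Map[String, Seq[String]] =
    messages.groupBy(_._1).map { case (v, ms) => v -> ms.map(_._2) }

  labels.map { case (v, old) =>
    val next = received.get(v).map { ls =>
      val counts = ls.groupBy(identity).map { case (l, xs) => l -> xs.size }
      val best = counts.values.max
      // Assumption: ties are broken by picking the smallest label.
      counts.collect { case (l, n) if n == best => l }.min
    }.getOrElse(old) // No messages: keep the current label.
    v -> next
  }
}

// Every vertex starts in its own community, then five supersteps are applied.
val initial: Map[String, String] = lpaVertices.map(v => v -> v).toMap
val afterFive: Map[String, String] =
  (1 to 5).foldLeft(initial)((ls, _) => superstep(ls))

println(afterFive.toSeq.sorted.mkString(", "))
```

Under these assumptions, vertices b and c keep swapping labels from one superstep to the next on this graph, which is a small instance of the non-convergence mentioned above. The labels that GraphFrames itself returns may differ, because its tie-breaking is not specified here.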

PageRank

Identify important vertices in a graph based on connections.

// Run PageRank until convergence to tolerance "tol".
val results = g.pageRank.resetProbability(0.15).tol(0.01).run()
display(results.vertices)

display(results.edges)

// Run PageRank for a fixed number of iterations.
val results2 = g.pageRank.resetProbability(0.15).maxIter(10).run()
display(results2.vertices)

// Run PageRank personalized for vertex "a"
val results3 = g.pageRank.resetProbability(0.15).maxIter(10).sourceId("a").run()
display(results3.vertices)

Shortest paths

Compute the shortest paths to a given set of landmark vertices, where the landmarks are specified by vertex ID.

val paths = g.shortestPaths.landmarks(Seq("a", "d")).run()
display(paths)
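
To show what "shortest path to a landmark" means on the example graph, here is a plain-Scala sketch (no Spark) that counts hops with a breadth-first search over reversed edges. It assumes that paths follow edge direction from each vertex toward the landmark, which is my understanding of the library's behavior. The names spEdges and distancesTo are hypothetical.

```scala
// Plain-Scala sketch (no Spark): the (src, dst) pairs of the example edges.
val spEdges = Seq(
  ("a", "b"), ("b", "c"), ("c", "b"), ("f", "c"),
  ("e", "f"), ("e", "d"), ("d", "a"), ("a", "e")
)

// Hop count to `landmark` for every vertex that can reach it along directed edges.
def distancesTo(landmark: String): Map[String, Int] = {
  // Reverse adjacency: for each vertex, the vertices that have an edge pointing at it.
  val predecessors: Map[String, Seq[String]] =
    spEdges.groupBy(_._2).map { case (d, es) => d -> es.map(_._1) }

  @scala.annotation.tailrec
  def loop(frontier: Set[String], dist: Map[String, Int], hops: Int): Map[String, Int] =
    if (frontier.isEmpty) dist
    else {
      // Vertices one hop further away that have not been reached yet.
      val next = frontier
        .flatMap(v => predecessors.getOrElse(v, Seq.empty))
        .diff(dist.keySet)
      loop(next, dist ++ next.map(_ -> (hops + 1)), hops + 1)
    }

  loop(Set(landmark), Map(landmark -> 0), 0)
}

println(distancesTo("a"))
println(distancesTo("d"))
```

Vertices that cannot reach a landmark, such as b, c, f, and g for landmark "a", are simply absent from the returned map.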

Triangle count

Compute the number of triangles passing through each vertex.

import org.graphframes.examples
val g: GraphFrame = examples.Graphs.friends  // get example graph

val results = g.triangleCount.run()
results.select("id", "count").show()
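
As a cross-check on the idea, the following plain-Scala sketch (no Spark) counts triangles on the same example edges. It assumes that edge direction is ignored and that duplicate edges such as b->c and c->b count once. The names tcEdges, neighbors, and trianglesThrough are illustrative.

```scala
// Plain-Scala sketch (no Spark): the (src, dst) pairs of the example edges.
val tcEdges = Seq(
  ("a", "b"), ("b", "c"), ("c", "b"), ("f", "c"),
  ("e", "f"), ("e", "d"), ("d", "a"), ("a", "e")
)

// Undirected neighbor sets; using a Set removes duplicates such as b-c and c-b.
val neighbors: Map[String, Set[String]] =
  tcEdges.flatMap { case (s, d) => Seq(s -> d, d -> s) }
    .groupBy(_._1)
    .map { case (v, ps) => v -> (ps.map(_._2).toSet - v) }

// A triangle through v is a pair of distinct neighbors of v that are themselves adjacent.
def trianglesThrough(v: String): Int = {
  val ns = neighbors.getOrElse(v, Set.empty[String]).toSeq
  ns.combinations(2).count { case Seq(x, y) => neighbors(x).contains(y) }
}

Seq("a", "b", "c", "d", "e", "f", "g").foreach { v =>
  println(s"$v -> ${trianglesThrough(v)}")
}
```

The only triangle in this graph is a, d, e, so those three vertices each get a count of 1 and all others get 0.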
