CS224w HomeWork 1

本文旨在记录CS224w Machine Learning With Graphs 2019完成作业中遇到的问题和作业的结果。

我的github仓库 LINK。

Part 1 Network Characteristics

课程中反复强调的一个非常重要的观点就是：想要比较一个网络的属性，我们需要一个criterion或者一个null network。

所以这部分的核心就是null network or criterion的生成。我们需要生成Erdös-Renyi Random Graphs和Small-World Random Network。

Erdös-Renyi Random Graphs

非常的简单，即：undirected graph with n nodes, and m edges picked uniformly at random. 需要的参数就是节点数和边数了。

def genErdosRenyi(N=5242, E=14484):
    """
    :param - N: number of nodes
    :param - E: number of edges

    return type: snap.PUNGraph
    return: Erdos-Renyi graph with N nodes and E edges
    """
    ############################################################################
    # TODO: Your code here!
    Graph = snap.PUNGraph.New()
    for i in range(N):
        Graph.AddNode(i)
    cnt = 0
    while cnt < E:
        src = np.random.randint(0, N)
        dst = np.random.randint(0, N)
        if not Graph.IsEdge(src, dst):
            Graph.AddEdge(src, dst)
            cnt += 1
    ############################################################################
    return Graph

Small-World Random Network

生成Small-World Random Network需要三步。分别为：

Begin with N nodes arranged as a ring, i.e. imagine the nodes form a circle and each node is connected to its two direct neighbors.
Connect each node to the neighbors of its neighbors.
Randomly select some given number pairs of nodes not yet connected and add an edge between them.

实现起来并不困难，代码如下：

def genCircle(N=5242):
    """
    :param - N: number of nodes

    return type: snap.PUNGraph
    return: Circle graph with N nodes and N edges. Imagine the nodes form a
        circle and each node is connected to its two direct neighbors.
    """
    ############################################################################
    # TODO: Your code here!
    Graph = snap.PUNGraph.New()
    for i in range(N):
        Graph.AddNode(i)
    Graph.AddEdge(N - 1, 0)
    for i in range(N - 1):
        Graph.AddEdge(i, i + 1)
    ############################################################################
    return Graph


def connectNbrOfNbr(Graph, N=5242):
    """
    :param - Graph: snap.PUNGraph object representing a circle graph on N nodes
    :param - N: number of nodes

    return type: snap.PUNGraph
    return: Graph object with additional N edges added by connecting each node
        to the neighbors of its neighbors
    """
    ############################################################################
    # TODO: Your code here!
    for i in range(N):
        n1 = (i + 1 + N) % N
        n2 = (i - 1 + N) % N
        Graph.AddEdge(n1, n2)
    ############################################################################
    return Graph


def connectRandomNodes(Graph, M=4000):
    """
    :param - Graph: snap.PUNGraph object representing an undirected graph
    :param - M: number of edges to be added

    return type: snap.PUNGraph
    return: Graph object with additional M edges added by connecting M randomly
        selected pairs of nodes not already connected.
    """
    ############################################################################
    # TODO: Your code here!
    N = Graph.GetNodes()
    cnt = 0
    while cnt < M:
        src = np.random.randint(0, N)
        dst = np.random.randint(0, N)
        if not Graph.IsEdge(src, dst):
            Graph.AddEdge(src, dst)
            cnt += 1
    ############################################################################
    return Graph


def genSmallWorld(N=5242, E=14484):
    """
    :param - N: number of nodes
    :param - E: number of edges

    return type: snap.PUNGraph
    return: Small-World graph with N nodes and E edges
    """
    Graph = genCircle(N)
    Graph = connectNbrOfNbr(Graph, N)
    Graph = connectRandomNodes(Graph, 4000)
    return Graph

Question 1.1 Degree Distribution

计算Degree Distribution。Degree Distribution就是：

这里并不需要计算出分数，而是直接将度为k的节点数以log-log图画出来。

这里画图使用matplotlib，首先要生成x,y坐标上对应的数。代码如下：

def getDataPointsToPlot(Graph):
    """
    :param - Graph: snap.PUNGraph object representing an undirected graph

    return values:
    X: list of degrees
    Y: list of frequencies: Y[i] = fraction of nodes with degree X[i]
    """
    ############################################################################
    # TODO: Your code here!
    X, Y = [], []
    DegToCntV = snap.TIntPrV()
    snap.GetDegCnt(Graph, DegToCntV)
    for item in DegToCntV:
        X.append(item.GetVal1())
        Y.append(item.GetVal2())
    ############################################################################
    return X, Y

画图部分代码已给出，这里不再列出。值得注意的是画log-log图使用plt.loglog即可。

Answers

Question 1.2 Clustering Coefficient

计算Clustering Coefficient。公式如下：

代码如下，可能稍微负责的就是找该节点邻居间边的数量。

def calcClusteringCoefficientSingleNode(Node, Graph):
    """
    :param - Node: node from snap.PUNGraph object. Graph.Nodes() will give an
                   iterable of nodes in a graph
    :param - Graph: snap.PUNGraph object representing an undirected graph

    return type: float
    returns: local clustering coeffient of Node
    """
    ############################################################################
    # TODO: Your code here!
    C = 0.0
    neigbors = []
    deg = Node.GetDeg()
    for i in range(deg):
        neigbors.append(Graph.GetNI(Node.GetNbrNId(i)))
    cnt_nbr = 0
    for i in range(deg):
        for j in range(i):
            cnt_nbr += neigbors[i].IsInNId(neigbors[j].GetId())
    if deg >= 2:
        C = 2 * cnt_nbr / (deg * (deg - 1.0))
    ############################################################################
    return C

def calcClusteringCoefficient(Graph):
    """
    :param - Graph: snap.PUNGraph object representing an undirected graph

    return type: float
    returns: clustering coeffient of Graph
    """
    ############################################################################
    # TODO: Your code here! If you filled out calcClusteringCoefficientSingleNode,
    #       you'll probably want to call it in a loop here
    C = 0.0
    V = Graph.GetNodes()
    for NI in Graph.Nodes():
        Ci = calcClusteringCoefficientSingleNode(NI, Graph)
        C = C + Ci
    C = C / V

    ############################################################################
    return C

Answers

Part 2 Structural Roles: Rolx and ReFex

本部分主要是implement计算每个节点Structural Role的方法（实现Rolx 和 ReFex方法），可以看做一种对于每个节点拓扑信息的提取方法，将每个节点的拓扑信息变为一个向量，即实现一个feature extraction的方法。这里的feature extraction主要包含两步：

Basic Features: First extract basic local features from every node.
Recursive Features: Then aggregate the basic features along graph edges so that global features are also obtained.

Basic Features

首先是计算basic feature，主要指的是节点本身的local features。这里计算的basic feature为一个三维向量。

其三个维度的意义是：

这里简单的记录一下计算2和3的方法。

有了上图，这几个关系就很明确了，将代码完善即可。

代码如下：

def extract_basic_features(node, graph):
    """
    提取basic features
    :param node: 目标节点，SNAP中的node类，而非ID，可以用GetNI()获得
    :param graph: 目标图（无向图）
    :return: 长度为3的array 分别为：该节点deg，egonet内部deg，进出egonet的边数
    """
    degree = node.GetDeg()

    neighbors = []

    # 计算全部邻居的deg的和
    total_deg_nbr = 0
    for i in range(degree):
        # 获取全部的邻居，目标构建egonet
        neighbor = graph.GetNI(node.GetNbrNId(i))
        neighbors.append(neighbor)
        total_deg_nbr += neighbor.GetDeg()

    # 计算邻居之间边的数量
    edge_between_nbr = 0
    for i in range(len(neighbors)):
        for j in range(i):
            edge_between_nbr += neighbors[i].IsNbrNId(neighbors[j].GetId())

    return np.array((degree, edge_between_nbr + degree, total_deg_nbr - 2 * edge_between_nbr - degree))

我的结果为：

Recursive Features

也相对简单，使用下图的公式即可。具体细节，请参照官方作业的pdf。

核心函数：

def recursive_features(v_mat, graph):
    """
    迭代一次
    :param v_mat: np array
    :param graph: 目标图
    :return: 新的v_mat，向量长度变为原来的3倍
    """
    cnt_node, ori_len = v_mat.shape
    new_mat = np.zeros((cnt_node, ori_len))
    mean_mat = np.zeros((cnt_node, ori_len))
    sum_mat = np.zeros((cnt_node, ori_len))
    for node in graph.Nodes():
        nodeId = node.GetId()
        cnt_nbrs = node.GetDeg()
        new_mat[nodeId] = v_mat[nodeId]
        if cnt_nbrs == 0:
            continue
        for i in range(cnt_nbrs):
            nbr_id = node.GetNbrNId(i)
            mean_mat[nodeId] += v_mat[nbr_id]
            sum_mat[nodeId] += v_mat[nbr_id]
        mean_mat[nodeId] = mean_mat[nodeId] / cnt_nbrs
    return np.concatenate((new_mat, mean_mat, sum_mat), axis=1)

结果如下：

Role Discovery

主要工作即为：根据计算出的迭代3次的cosine similarity，绘出其分布直方图；根据直方图判别有几种role（看凸起即可），并随机选一个role中的一个节点，根据其local(basic) feature绘制其2-hop子图。唯一难点可能就是2-hop子图的绘制，这里我使用了networkx来绘制。我使用了python中的set()来滤去重复节点，这种方法时间复杂度显然很高。不过对于小图而言还是能接受的。

代码如下：

def Draw_SubGraph(Graph, Node_id):
    """
    绘图函数 Node 节点2-hop子图
    :param Graph: 目标图
    :param NI_id: 目标节点
    :return: 无
    """
    Nodes = []
    Edges = []
    center_node = Graph.GetNI(Node_id)
    Nodes.append(Node_id)
    colors = []

    # 所有1-hop节点放进去
    for i in range(center_node.GetDeg()):
        nbr_id = center_node.GetNbrNId(i)
        Nodes.append(nbr_id)
    # 所有2-hop节点放进去
    for i in range(len(Nodes)):
        mid_node = Graph.GetNI(Nodes[i])
        for j in range(mid_node.GetDeg()):
            nbr_id = mid_node.GetNbrNId(j)
            Nodes.append(nbr_id)
    Nodes = list(set(Nodes))

    for mid_id in Nodes:
        if mid_id == Node_id:
            colors.append('r')
        else:
            colors.append('b')
    # print(Nodes)
    for i in range(len(Nodes)):
        for j in range(i):
            if Graph.IsEdge(Nodes[i], Nodes[j]):
                Edges.append((Nodes[i], Nodes[j]))
    # print(len(Edges))
    G = nx.Graph()
    G.add_nodes_from(Nodes)
    G.add_edges_from(Edges)
    nx.draw_networkx(G, node_color=colors)
    # plt.show()

运行结果如下：分别为直方图和2-hop子图。

Overall

最大的感觉就是光听一遍和写完一遍代码对于整个流程完全有着不同程度的理解，还是得多练啊。。。

中间也卡了不少次，花费了不少的时间T.T

Part 3 Community detection using the Louvain algorithm

Community is just sets of tightly connected nodes.

如上图所示，这个就是Community Detection后的结果。这类Community detection的算法一般将：原本问题变为一个优化问题。

首先是定义一个优化目标，其应该是A measure of how well a network is partitioned into communities，该值正比于划分的好坏. 即对于每一个可能的partition都给出一个分数，目标就是找到使该分数最大的一个划分。这里使用的指标为：Modularity。定义如下图所示。

公式化之后变为：注：这里的Null model使用的是configuration model。即：有同样的degree distribution但node之间的连接是uniformly random的。

其取值在[-1,1]之间，It is positive if the number of edges within groups exceeds the expected number.

然而直接优化这个问题是一个NP hard问题。所以提出了Louvain algorithm。这是一种greedy algorithm。并且可以提供一种Hierarchical communities. 其整体流程如下：

值得注意的是这种方法假设：community是disjoint的。

Q3-1 Proof for Modularity gain when an isolated node moves into a community

证明如图：

Q3-2 Louvain algorithm on a 16 node network

Answers for Graph H:

1
12 NOTE: 6*2
$4 \cdot Q_c = 4(\frac{12}{2m}-\frac{14^2}{4m^2})$,where $m=12\cdot4+1\cdot4=52$, Q=0.158

Answers for Graph J:

2
26
Q=2(26/104-(28/104)^2)=0.355

Q3-3 Louvain algorithm on a 128 node network

与Q3-2是相同的计算方式，比较简单，不再计算一遍了。

Part 4 Spectral clustering

经典的谱聚类的方法。讲得还是很精彩和清晰的。我在这里简单的总结一下整体的思路框架。

首先是谱聚类的目标，对于图做一个good partition分成不同的部分，即不同的clustering。

所以我们要有一个指标来衡量一个good partition。引出了cut。

这个指标的最大问题是没有将每个cluster内部的连接考虑进去，在优化该指标时，即最小化cut指标时，我们总可以找到如下的“最优指标”，但该指标显然不是我们期望的。原因就是没有考虑每个cluster内部的连接。

为了解决这个问题，引入了指标Conductance

相当于对于cluster的连接做了一个正则，这样优化该指标，我们可以找到一个更好的partition。

下面作者引入了邻接矩阵的谱，拉普拉斯矩阵的谱（拉普拉斯矩阵有很好的性质，半正定矩阵，必定有全1向量为eigenvector，其对应了eigenvalue为0），最终将寻找第二小的特征值和特征向量与一个partition挂钩（直观解释），并且从理论推导出小的特征值为最优Conductance的一个下界。这样就寻找将拉普拉斯矩阵第二小的特征值和特征向量和找到最优partition联系了起来。简单回顾Spectral Clustering后，开始完成这部分的作业。

Q4-1 A Spectral Algorithm for Normalized Cut Minimization: Foundations

一些公式推导的题目，具体题目见官网源文件。

Q4-2 Normalized Cut Minimization: Solving for the Minimizer（Uncertain）

本部分的证明不太确定，最后部分使用的Rayleigh定理不太确定。

Q4-3 Relating Modularity to Cuts and Volumes

结束语

至此homework1的内容全部结束了，整体而言的难度不是很高，在编程上有snap库极大的简化了编程上的难度。整体来讲homework1涉及了很多分析图的方法和非常多的基本却关键的概念。比如：如何衡量一个图的特点（通过一些全局的指标比如：degree distribution，Clustering Coefficient，diameter），这是要和random graph做对比的。包括后面的涉及的如何衡量一个节点的特征（Role），motifs，graphlet的概念，community/group的概念，这些都是分析复杂图的基石。

整体而言，目前这部分课程也让我对于图论产生了很大的兴趣，由于本人之前没有离散数学or图论的基础，对于一些例如induced graph的概念并不是清楚，之后计划会额外自学一些关于图论方面的知识。