Beijing University of Posts and Telecommunications | Overview of Network Security Graph Data Mining

Original title: Graph Mining for Cybersecurity: A Survey
Original authors: Bo Yan, Cheng Yang, Chuan Shi*, Yong Fang, Qi Li, Yanfang Ye, Junping Du
Journal published: ACM Transactions on Knowledge Discovery from Data
Original link: dl.acm.org/doi/pdf/10.1145/3610228
Topic Type: Graph Data Mining
Note author: ShuiChang
Editor-in-Chief: Huang Cheng@Safety Academic Circle

A review paper talks about the application of graph mining in network security. Discussed how to use graph mining technology to detect network threats, etc. The paper introduces the basic concepts and technologies of graph mining, and compares the application of traditional machine learning methods and graph mining technology in network security.

Application security

Internet fraud – including malicious web pages, black hat SEO, phishing websites, etc.
Review Spam – Internet trolls fake comments, which has problems such as complex inter-group dynamics and difficulty in checking.
False news
Fake account – Fake account used to participate in fake social networking, such as trolling
Financial Fraud – graphical modeling of money laundering, cash-out, loan defaults and insurance fraud
Underground market – analysis based on underground forums: key player detection, account association analysis, illegal product identification and CTI (Cyber Threat Intelligence) collection
Safety of tradement
Cognition Security

Network infrastructure security

malicious software
System vulnerabilities
Blockchain Security – Vulnerabilities in the designed blockchain itself (such as smart contract vulnerabilities) and criminal activity in the blockchain (blockchain money laundering)
BotNet
Malicious domain names
Intrusion detection
cyber security
system security

Graph attribute enumeration (based on structural statistical features)

node hierarchy

Discrepancy
Centrality – used to measure the importance of nodes, including degree centrality, between centrality, and closeness centrality
PageRank
Local Clustering Coefficient – measures the proximity of a node to its neighbors: divide the actual number of edges between a node and its neighbors by the maximum number of edges that may exist between that node's neighbors

edge level

shortest path – the shortest path between two nodes

graph hierarchy

Diameter – the maximum length of the shortest path between two nodes in the graph
Degree Distribution – P(k), which is the proportion of nodes with degree k in the network. If there are n nodes in total and nk nodes have k degrees, then p(k)=nk/n

Homogeneous and heterogeneous graphs in network security

When the node type or edge type is greater than 1, the graph is called a heterogeneous graph, otherwise the graph is a homogeneous graph.

isomorphic graph

Isomorphic graphs are mostly used for malware analysis. Such as: API calls, function calls, etc.

For isomorphic graphs, basic structural information can be captured through first-order proximity and second-order proximity. The basic idea of first-order proximity is that highly interconnected nodes should be closely embedded together. For example: two APIs with a calling relationship can implement similar functions. However, in some cases, disconnected nodes may behave similarly. For example: the three-party payment platform used by two malicious accounts. Second-order proximity represents the similarity of two nodes by comparing their neighbor structures. That is, if two nodes have more common neighbors, their second-order proximity is higher.

Heterogeneous graph

Most network security graphs are heterogeneous graphs. For example: Android program and its signature, API used, etc.

The basic form of expression of heterogeneous graph is network schema. That is, a meta-template that reflects node types and edge relationships. Meta-paths can capture very useful content for downstream tasks.

metapath

Meta paths are paths defined on the network schema. It can be represented as a combination of relationships, by specifying a sequence of node types and edge types that describe how to get from one node to another.

For example: in the author-paper-citation network, there is the meta-path “author-write-paper-cited-paper-write-author”.

According to the characteristics used, it can be classified as:

Statistical features – artificial statistical features, which are high-dimensional and sparse.

Structural features – i.e. graph structural features
attribute characteristics – attribute information that depends on the graph

Graph embedding – feature vectors are transformed from sparse to dense while automatically retaining structural and attribute information.

This article divides the graph mining methods used into three types:

Structural-attributed (structural/attributed): based on whether attribute information is used
Surface-deep (shallow/deep): depending on whether deep learning is used
Homogeneous/heterogeneous: depending on whether it is a heterogeneous graph

For different network security issues, focusing on different graph mining methods can achieve different effects.

Statistical Features

Structural characteristics: Mostly used in early security solutions, with poor versatility.For example, dissimilarity can identify botnets

Attributed Feature: Attributed Feature includes node attributes, edge weights, etc.

graph embedding

The purpose of graph embedding is to reduce the dimensionality of non-relational data and obtain the characteristics of the structure and attributes of the graph.

structural – attribute

The purpose of structural embedding is to obtain embedding in terms of graph structure. If the structures of two nodes are similar in some aspect, their structural embeddings will be relatively close. Therefore, many studies use random walks to sample node structural features. Such technologies include: DeepWalk, LINE, Node2vec, Struct2vec, etc.

Attribute embedding can represent both structure and node information. Such technologies include: GraphSAGE, GAT.

superficial – deep

Early research was superficial, and the only trainable parameters were node embeddings. This method is basically matrix factorization and random walk.

Graph Neural Networks (GNNs) brought by deep learning provide a more convenient and scalable way to obtain graph embeddings. This method can be divided into two types:

Based on spectral method: graph convolution network GCN
Based on spatial method: message passing network MPNN

There are also networks such as GAT, GraphSAGE, and STGNN.

Isomorphic – Heterogeneous

There are several categories of methods designed for homogeneous networks: Node2vec, DeepWalk, and LINE for random walks, and GCN, GAT, and GraphSAGE for GNN.

Graph data mining for cyberspace security often follows this assumption: similar nodes will be more likely to be connected to each other and have similar embeddings.

The embedding of heterogeneous graphs often pays more attention to semantics. For example, if two hosts often send packets through the same protocol, they are more likely to have the same embedding. The discovery of such semantic embeddings is mostly carried out with a predefined meta-path guided random walk algorithm.

Some advanced heterogeneous embedding methods rely on obtaining and combining information from heterogeneous neighbor nodes. This suggests several solutions:

Convert different embeddings into the same space through mapping functions;
After aggregating the target node embedding through specified neighbor nodes, use the multi-aggregation method to obtain the final node embedding;
After obtaining the aggregated nodes through the meta path, perform aggregation, as shown in the figure below.

task definition

Determine whether the task can be modeled using graphs
Abstract tasks into graph tasks (such as node prediction, node classification)

data collection

Collect task-related data and determine node sets, attribute sets, relationship sets, etc.

Graph construction

Model the data and build concrete graph instances

Model design

Design and optimize models through deep analysis of task properties and constructed graphs
via graph embedding or via traditional statistical analysis

Model evaluation

Classification models can use ACC, Micro-F1, FPR (for IDS), etc.
Clustering models can use AMI, etc.
Code similarity, recommendation systems, etc. can use Rank-n, MRR, etc.

Safety of tradement

Example: Using hacker forums to analyze underground markets

Collect information – build AHIN – define meta-path – perform random walk and multi-view fusion – obtain embeddings for further downstream tasks

cognitive domain security

Example: Using redirect links to identify web spam

Collect information – establish HIN – extract features, establish redirection templates based on graphs – train model predictions

Example: Review Scam – Troubled Reviews

Gather information – build HIN – extract new graphs based on common comments – design models, apply graph mining methods similar to community discovery algorithms – solve problems

cyber security

Example: Mining malicious domain names

Data collection-building HIN-graph representation and meta-path + neighbor node embedding fusion-detection (node classification problem)

Example: IDS

Data collection – building HIN – applying graph mining technology

system security

Example: File-level malware detection

Data collection – data processing (disassembly to obtain Dalvic bytecode) – building a function call graph – using template-based (difficult to detect 0days) or learning-based (capable of detecting 0days) malware analysis methods – feature extraction – binary classification or Multiple classification tasks

data set

Data set name	Key words
Elliptic	Anti-money laundering, blockchain security
WEBSPAM-UK2007	Internet fraud
Twitter15&Twitter16	Twitter, retweet, counter rumors
LIAR	False news
CTU-13	BotNet
NSL-KDD	Intrusion detection, attack traffic
Amazon Review	Comment, reply
SARD	Vulnerabilities, source code
Drebin	malicious software

data collection tools

Tool name	Key words
Snopes	Check rumors
DNSDB Scout	DNS data
Semantic	Extract AST
VirusTotal	malicious software
Cuckoo Sandbox	Dynamic Analysis
Wireshark	Network traffic

Recruiting teammates in the safety academic circle-ing
If you are interested in joining the academic circle, please contact secdr#qq.com

Beijing University of Posts and Telecommunications | Overview of Network Security Graph Data Mining

Graph attribute enumeration (based on structural statistical features)