Original title: Graph Mining for Cybersecurity: A Survey
Original authors: Bo Yan, Cheng Yang, Chuan Shi*, Yong Fang, Qi Li, Yanfang Ye, Junping Du
Journal published: ACM Transactions on Knowledge Discovery from Data
Original link: dl.acm.org/doi/pdf/10.1145/3610228
Topic Type: Graph Data Mining
Note author: ShuiChang
Editor-in-Chief: Huang Cheng@Safety Academic Circle
A review paper talks about the application of graph mining in network security. Discussed how to use graph mining technology to detect network threats, etc. The paper introduces the basic concepts and technologies of graph mining, and compares the application of traditional machine learning methods and graph mining technology in network security.
-
Application security
-
Internet fraud – including malicious web pages, black hat SEO, phishing websites, etc. -
Review Spam – Internet trolls fake comments, which has problems such as complex inter-group dynamics and difficulty in checking. -
False news -
Fake account – Fake account used to participate in fake social networking, such as trolling -
Financial Fraud – graphical modeling of money laundering, cash-out, loan defaults and insurance fraud -
Underground market – analysis based on underground forums: key player detection, account association analysis, illegal product identification and CTI (Cyber Threat Intelligence) collection -
Safety of tradement -
Cognition Security -
Network infrastructure security
-
malicious software -
System vulnerabilities -
Blockchain Security – Vulnerabilities in the designed blockchain itself (such as smart contract vulnerabilities) and criminal activity in the blockchain (blockchain money laundering) -
BotNet -
Malicious domain names -
Intrusion detection -
cyber security -
system security
Graph attribute enumeration (based on structural statistical features)
-
node hierarchy
-
Discrepancy -
Centrality – used to measure the importance of nodes, including degree centrality, between centrality, and closeness centrality -
PageRank -
Local Clustering Coefficient – measures the proximity of a node to its neighbors: divide the actual number of edges between a node and its neighbors by the maximum number of edges that may exist between that node's neighbors -
edge level
-
shortest path – the shortest path between two nodes -
graph hierarchy
-
Diameter – the maximum length of the shortest path between two nodes in the graph -
Degree Distribution – P(k), which is the proportion of nodes with degree k in the network. If there are n nodes in total and nk nodes have k degrees, then p(k)=nk/n
Homogeneous and heterogeneous graphs in network security
When the node type or edge type is greater than 1, the graph is called a heterogeneous graph, otherwise the graph is a homogeneous graph.
isomorphic graph
Isomorphic graphs are mostly used for malware analysis. Such as: API calls, function calls, etc.
For isomorphic graphs, basic structural information can be captured through first-order proximity and second-order proximity. The basic idea of first-order proximity is that highly interconnected nodes should be closely embedded together. For example: two APIs with a calling relationship can implement similar functions. However, in some cases, disconnected nodes may behave similarly. For example: the three-party payment platform used by two malicious accounts. Second-order proximity represents the similarity of two nodes by comparing their neighbor structures. That is, if two nodes have more common neighbors, their second-order proximity is higher.
Heterogeneous graph
Most network security graphs are heterogeneous graphs. For example: Android program and its signature, API used, etc.
The basic form of expression of heterogeneous graph is network schema. That is, a meta-template that reflects node types and edge relationships. Meta-paths can capture very useful content for downstream tasks.
metapath
Meta paths are paths defined on the network schema. It can be represented as a combination of relationships, by specifying a sequence of node types and edge types that describe how to get from one node to another.
For example: in the author-paper-citation network, there is the meta-path “author-write-paper-cited-paper-write-author”.
According to the characteristics used, it can be classified as:
-
Statistical features – artificial statistical features, which are high-dimensional and sparse.
-
Structural features – i.e. graph structural features -
attribute characteristics – attribute information that depends on the graph -
Graph embedding – feature vectors are transformed from sparse to dense while automatically retaining structural and attribute information.
This article divides the graph mining methods used into three types:
-
Structural-attributed (structural/attributed): based on whether attribute information is used -
Surface-deep (shallow/deep): depending on whether deep learning is used -
Homogeneous/heterogeneous: depending on whether it is a heterogeneous graph
For different network security issues, focusing on different graph mining methods can achieve different effects.
Statistical Features
Structural characteristics: Mostly used in early security solutions, with poor versatility.For example, dissimilarity can identify botnets
Attributed Feature: Attributed Feature includes node attributes, edge weights, etc.
graph embedding
The purpose of graph embedding is to reduce the dimensionality of non-relational data and obtain the characteristics of the structure and attributes of the graph.
structural – attribute
The purpose of structural embedding is to obtain embedding in terms of graph structure. If the structures of two nodes are similar in some aspect, their structural embeddings will be relatively close. Therefore, many studies use random walks to sample node structural features. Such technologies include: DeepWalk, LINE, Node2vec, Struct2vec, etc.
Attribute embedding can represent both structure and node information. Such technologies include: GraphSAGE, GAT.
superficial – deep
Early research was superficial, and the only trainable parameters were node embeddings. This method is basically matrix factorization and random walk.
Graph Neural Networks (GNNs) brought by deep learning provide a more convenient and scalable way to obtain graph embeddings. This method can be divided into two types:
-
Based on spectral method: graph convolution network GCN -
Based on spatial method: message passing network MPNN
There are also networks such as GAT, GraphSAGE, and STGNN.
Isomorphic – Heterogeneous
There are several categories of methods designed for homogeneous networks: Node2vec, DeepWalk, and LINE for random walks, and GCN, GAT, and GraphSAGE for GNN.
Graph data mining for cyberspace security often follows this assumption: similar nodes will be more likely to be connected to each other and have similar embeddings.
The embedding of heterogeneous graphs often pays more attention to semantics. For example, if two hosts often send packets through the same protocol, they are more likely to have the same embedding. The discovery of such semantic embeddings is mostly carried out with a predefined meta-path guided random walk algorithm.
Some advanced heterogeneous embedding methods rely on obtaining and combining information from heterogeneous neighbor nodes. This suggests several solutions:
-
Convert different embeddings into the same space through mapping functions; -
After aggregating the target node embedding through specified neighbor nodes, use the multi-aggregation method to obtain the final node embedding; -
After obtaining the aggregated nodes through the meta path, perform aggregation, as shown in the figure below.
-
task definition -
Determine whether the task can be modeled using graphs -
Abstract tasks into graph tasks (such as node prediction, node classification) -
data collection -
Collect task-related data and determine node sets, attribute sets, relationship sets, etc. -
Graph construction -
Model the data and build concrete graph instances -
Model design -
Design and optimize models through deep analysis of task properties and constructed graphs -
via graph embedding or via traditional statistical analysis -
Model evaluation -
Classification models can use ACC, Micro-F1, FPR (for IDS), etc. -
Clustering models can use AMI, etc. -
Code similarity, recommendation systems, etc. can use Rank-n, MRR, etc.
Safety of tradement
Example: Using hacker forums to analyze underground markets
Collect information – build AHIN – define meta-path – perform random walk and multi-view fusion – obtain embeddings for further downstream tasks
cognitive domain security
Example: Using redirect links to identify web spam
Collect information – establish HIN – extract features, establish redirection templates based on graphs – train model predictions
Example: Review Scam – Troubled Reviews
Gather information – build HIN – extract new graphs based on common comments – design models, apply graph mining methods similar to community discovery algorithms – solve problems
cyber security
Example: Mining malicious domain names
Data collection-building HIN-graph representation and meta-path + neighbor node embedding fusion-detection (node classification problem)
Example: IDS
Data collection – building HIN – applying graph mining technology
system security
Example: File-level malware detection
Data collection – data processing (disassembly to obtain Dalvic bytecode) – building a function call graph – using template-based (difficult to detect 0days) or learning-based (capable of detecting 0days) malware analysis methods – feature extraction – binary classification or Multiple classification tasks
data set
Data set name | Key words |
---|---|
Elliptic | Anti-money laundering, blockchain security |
WEBSPAM-UK2007 | Internet fraud |
Twitter15&Twitter16 | Twitter, retweet, counter rumors |
LIAR | False news |
CTU-13 | BotNet |
NSL-KDD | Intrusion detection, attack traffic |
Amazon Review | Comment, reply |
SARD | Vulnerabilities, source code |
Drebin | malicious software |
data collection tools
Tool name | Key words |
---|---|
Snopes | Check rumors |
DNSDB Scout | DNS data |
Semantic | Extract AST |
VirusTotal | malicious software |
Cuckoo Sandbox | Dynamic Analysis |
Wireshark | Network traffic |
Recruiting teammates in the safety academic circle-ing
If you are interested in joining the academic circle, please contact secdr#qq.com
GIPHY App Key not set. Please check settings