in ,

Beijing University of Posts and Telecommunications | Overview of Network Security Graph Data Mining


Original title: Graph Mining for Cybersecurity: A Survey
Original authors: Bo Yan, Cheng Yang, Chuan Shi*, Yong Fang, Qi Li, Yanfang Ye, Junping Du
Journal published: ACM Transactions on Knowledge Discovery from Data
Original link: dl.acm.org/doi/pdf/10.1145/3610228
Topic Type: Graph Data Mining
Note author: ShuiChang
Editor-in-Chief: Huang Cheng@Safety Academic Circle

A review paper talks about the application of graph mining in network security. Discussed how to use graph mining technology to detect network threats, etc. The paper introduces the basic concepts and technologies of graph mining, and compares the application of traditional machine learning methods and graph mining technology in network security.

  • Application security

    • Internet fraud – including malicious web pages, black hat SEO, phishing websites, etc.
    • Review Spam – Internet trolls fake comments, which has problems such as complex inter-group dynamics and difficulty in checking.
    • False news
    • Fake account – Fake account used to participate in fake social networking, such as trolling
    • Financial Fraud – graphical modeling of money laundering, cash-out, loan defaults and insurance fraud
    • Underground market – analysis based on underground forums: key player detection, account association analysis, illegal product identification and CTI (Cyber ​​Threat Intelligence) collection
    • Safety of tradement
    • Cognition Security
  • Network infrastructure security

    • malicious software
    • System vulnerabilities
    • Blockchain Security – Vulnerabilities in the designed blockchain itself (such as smart contract vulnerabilities) and criminal activity in the blockchain (blockchain money laundering)
    • BotNet
    • Malicious domain names
    • Intrusion detection
    • cyber security
    • system security

Graph attribute enumeration (based on structural statistical features)

  • node hierarchy

    • Discrepancy
    • Centrality – used to measure the importance of nodes, including degree centrality, between centrality, and closeness centrality
    • PageRank
    • Local Clustering Coefficient – measures the proximity of a node to its neighbors: divide the actual number of edges between a node and its neighbors by the maximum number of edges that may exist between that node's neighbors
  • edge level

    • shortest path – the shortest path between two nodes
  • graph hierarchy

    • Diameter – the maximum length of the shortest path between two nodes in the graph
    • Degree Distribution – P(k), which is the proportion of nodes with degree k in the network. If there are n nodes in total and nk nodes have k degrees, then p(k)=nk/n

Homogeneous and heterogeneous graphs in network security

When the node type or edge type is greater than 1, the graph is called a heterogeneous graph, otherwise the graph is a homogeneous graph.

isomorphic graph

Isomorphic graphs are mostly used for malware analysis. Such as: API calls, function calls, etc.

For isomorphic graphs, basic structural information can be captured through first-order proximity and second-order proximity. The basic idea of ​​first-order proximity is that highly interconnected nodes should be closely embedded together. For example: two APIs with a calling relationship can implement similar functions. However, in some cases, disconnected nodes may behave similarly. For example: the three-party payment platform used by two malicious accounts. Second-order proximity represents the similarity of two nodes by comparing their neighbor structures. That is, if two nodes have more common neighbors, their second-order proximity is higher.

Heterogeneous graph

Most network security graphs are heterogeneous graphs. For example: Android program and its signature, API used, etc.

The basic form of expression of heterogeneous graph is network schema. That is, a meta-template that reflects node types and edge relationships. Meta-paths can capture very useful content for downstream tasks.

metapath

Meta paths are paths defined on the network schema. It can be represented as a combination of relationships, by specifying a sequence of node types and edge types that describe how to get from one node to another.

For example: in the author-paper-citation network, there is the meta-path “author-write-paper-cited-paper-write-author”.

According to the characteristics used, it can be classified as:

  • Statistical features – artificial statistical features, which are high-dimensional and sparse.

    • Structural features – i.e. graph structural features
    • attribute characteristics – attribute information that depends on the graph
  • Graph embedding – feature vectors are transformed from sparse to dense while automatically retaining structural and attribute information.

This article divides the graph mining methods used into three types:

  • Structural-attributed (structural/attributed): based on whether attribute information is used
  • Surface-deep (shallow/deep): depending on whether deep learning is used
  • Homogeneous/heterogeneous: depending on whether it is a heterogeneous graph

For different network security issues, focusing on different graph mining methods can achieve different effects.

Statistical Features

Structural characteristics: Mostly used in early security solutions, with poor versatility.For example, dissimilarity can identify botnets

Attributed Feature: Attributed Feature includes node attributes, edge weights, etc.

graph embedding

The purpose of graph embedding is to reduce the dimensionality of non-relational data and obtain the characteristics of the structure and attributes of the graph.

structural – attribute

The purpose of structural embedding is to obtain embedding in terms of graph structure. If the structures of two nodes are similar in some aspect, their structural embeddings will be relatively close. Therefore, many studies use random walks to sample node structural features. Such technologies include: DeepWalk, LINE, Node2vec, Struct2vec, etc.

Attribute embedding can represent both structure and node information. Such technologies include: GraphSAGE, GAT.

superficial – deep

Early research was superficial, and the only trainable parameters were node embeddings. This method is basically matrix factorization and random walk.

Graph Neural Networks (GNNs) brought by deep learning provide a more convenient and scalable way to obtain graph embeddings. This method can be divided into two types:

  • Based on spectral method: graph convolution network GCN
  • Based on spatial method: message passing network MPNN

There are also networks such as GAT, GraphSAGE, and STGNN.

Isomorphic – Heterogeneous

There are several categories of methods designed for homogeneous networks: Node2vec, DeepWalk, and LINE for random walks, and GCN, GAT, and GraphSAGE for GNN.

Graph data mining for cyberspace security often follows this assumption: similar nodes will be more likely to be connected to each other and have similar embeddings.

The embedding of heterogeneous graphs often pays more attention to semantics. For example, if two hosts often send packets through the same protocol, they are more likely to have the same embedding. The discovery of such semantic embeddings is mostly carried out with a predefined meta-path guided random walk algorithm.

Some advanced heterogeneous embedding methods rely on obtaining and combining information from heterogeneous neighbor nodes. This suggests several solutions:

  • Convert different embeddings into the same space through mapping functions;
  • After aggregating the target node embedding through specified neighbor nodes, use the multi-aggregation method to obtain the final node embedding;
  • After obtaining the aggregated nodes through the meta path, perform aggregation, as shown in the figure below.
  • task definition
    • Determine whether the task can be modeled using graphs
    • Abstract tasks into graph tasks (such as node prediction, node classification)
  • data collection
    • Collect task-related data and determine node sets, attribute sets, relationship sets, etc.
  • Graph construction
    • Model the data and build concrete graph instances
  • Model design
    • Design and optimize models through deep analysis of task properties and constructed graphs
    • via graph embedding or via traditional statistical analysis
  • Model evaluation
    • Classification models can use ACC, Micro-F1, FPR (for IDS), etc.
    • Clustering models can use AMI, etc.
    • Code similarity, recommendation systems, etc. can use Rank-n, MRR, etc.

Safety of tradement

Example: Using hacker forums to analyze underground markets

Collect information – build AHIN – define meta-path – perform random walk and multi-view fusion – obtain embeddings for further downstream tasks

cognitive domain security

Example: Using redirect links to identify web spam

Collect information – establish HIN – extract features, establish redirection templates based on graphs – train model predictions

Example: Review Scam – Troubled Reviews

Gather information – build HIN – extract new graphs based on common comments – design models, apply graph mining methods similar to community discovery algorithms – solve problems

cyber security

Example: Mining malicious domain names

Data collection-building HIN-graph representation and meta-path + neighbor node embedding fusion-detection (node ​​classification problem)

Example: IDS

Data collection – building HIN – applying graph mining technology

system security

Example: File-level malware detection

Data collection – data processing (disassembly to obtain Dalvic bytecode) – building a function call graph – using template-based (difficult to detect 0days) or learning-based (capable of detecting 0days) malware analysis methods – feature extraction – binary classification or Multiple classification tasks

data set

Data set name Key words
Elliptic Anti-money laundering, blockchain security
WEBSPAM-UK2007 Internet fraud
Twitter15&Twitter16 Twitter, retweet, counter rumors
LIAR False news
CTU-13 BotNet
NSL-KDD Intrusion detection, attack traffic
Amazon Review Comment, reply
SARD Vulnerabilities, source code
Drebin malicious software

data collection tools

Tool name Key words
Snopes Check rumors
DNSDB Scout DNS data
Semantic Extract AST
VirusTotal malicious software
Cuckoo Sandbox Dynamic Analysis
Wireshark Network traffic
Recruiting teammates in the safety academic circle-ing
If you are interested in joining the academic circle, please contact secdr#qq.com

What do you think?

Leave a Reply

Your email address will not be published. Required fields are marked *

GIPHY App Key not set. Please check settings

Hypervisor From Scratch – Basic concepts, configuration of test environment, and entry into VMX operation

Canadian retail chain Giant Tiger data breach may have impacted millions of customers