With the emergence of service technologies such as REST and microservices, service software has evolved from simple homogeneous systems into a service ecosystem characterized by open environments, cross-domain scenarios, and complex business logic. In a microservice architecture, each functional module of the business system is split into a relatively independent microservice, and these microservices deliver complete business functions by calling one another. As the business and the system grow, the number of service calls becomes large and their structure complex. When one service fails, the failure often propagates to other services along the call chain, causing anomalies in multiple services and making it difficult to determine the root cause of the system anomaly.
To address this problem, we first extract invocation information from the service metrics and use it to dynamically construct a service invocation graph. In the graph, individual services are nodes and inter-service invocations are edges. Metrics of the services themselves, such as resource utilization, serve as node features, and inter-service invocation metrics, such as response time and number of requests, serve as edge features. This transforms the root cause localization problem into a node classification problem on the service invocation graph.
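The graph construction described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the `ServiceGraph` class, the `build_graph` function, the metric names `cpu` and `memory`, and the sample values are all hypothetical, and the call records are assumed to arrive as `(caller, callee, response_ms)` tuples.

```python
from dataclasses import dataclass, field


@dataclass
class ServiceGraph:
    # service name -> node feature vector, e.g. [cpu, memory] (hypothetical metrics)
    node_features: dict = field(default_factory=dict)
    # (caller, callee) -> edge feature vector [mean response time, request count]
    edge_features: dict = field(default_factory=dict)


def build_graph(service_metrics, call_records):
    """Build a service invocation graph from per-service metrics and call records."""
    graph = ServiceGraph()

    # Each service becomes a node carrying its own resource metrics.
    for name, metrics in service_metrics.items():
        graph.node_features[name] = [metrics["cpu"], metrics["memory"]]

    # Accumulate total response time and request count per caller/callee pair.
    totals = {}
    for caller, callee, response_ms in call_records:
        total, count = totals.get((caller, callee), (0.0, 0))
        totals[(caller, callee)] = (total + response_ms, count + 1)

    # Each observed invocation pair becomes a directed edge with its own features.
    for edge, (total, count) in totals.items():
        graph.edge_features[edge] = [total / count, float(count)]
    return graph


# Illustrative data only, not taken from the paper.
metrics = {
    "frontend": {"cpu": 0.35, "memory": 0.50},
    "cart": {"cpu": 0.40, "memory": 0.45},
    "db": {"cpu": 0.90, "memory": 0.80},
}
calls = [("frontend", "cart", 20.0), ("frontend", "cart", 40.0), ("cart", "db", 100.0)]
graph = build_graph(metrics, calls)
```

Because edges are derived from the observed call records, rebuilding the graph for each time window keeps it in step with the current invocation pattern, which is one way to read "dynamically construct".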
Based on this formulation, we propose a root cause localization algorithm based on graph neural networks that classifies nodes according to their features, distinguishing ordinary nodes from root cause nodes. Current graph neural network algorithms tend to randomly sample neighboring nodes and aggregate their features with equal weights when computing node features. In practical scenarios, however, different neighbors often influence the node being classified to different degrees.
Fig. 1. Call graph and deployment diagram of a microservice system. The diagram contains two kinds of nodes: service nodes and host nodes. A service node denotes an instance of a microservice, which can interact with other service nodes. A host node represents a physical machine that runs some of the service nodes in the cloud-native system.
Therefore, to aggregate node features with different weights, we adopt an attention mechanism: the attention between a node and each of its neighbors is computed and used as the aggregation weight. Our analysis also shows that the edge features in the call graph carry important information; response time, for example, is a strong indicator of node health. The traditional attention mechanism, however, derives the attention coefficient from node features alone. We therefore add an edge-feature-based attention term to the traditional attention. Specifically, we extract the edge features between two nodes, combine the different edge features with weights into an influence factor between the nodes, and add this influence factor to the traditional attention coefficient to obtain the new attention coefficient. Node features are then aggregated with weights given by the computed attention coefficients. The result is a graph neural network algorithm based on an edge-feature-enhanced attention mechanism, which we use for fault localization in microservice systems.
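The edge-enhanced attention described above can be illustrated with a small sketch. It rests on several assumptions that the text does not confirm: the node-feature score follows a GAT-style form (a LeakyReLU over a learned vector applied to the concatenated features), the influence factor is added before softmax normalization, and the parameters `attn_vec` and `edge_weights` are hypothetical values standing in for learned ones.

```python
import math


def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def softmax(scores):
    # Subtract the maximum for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def edge_enhanced_attention(h_target, neighbors, attn_vec, edge_weights):
    """Return one attention weight per neighbor.

    neighbors is a list of (neighbor_features, edge_features) pairs.
    """
    scores = []
    for h_nb, edge in neighbors:
        # Node-feature attention score (assumed GAT-style form).
        node_score = leaky_relu(sum(a * x for a, x in zip(attn_vec, h_target + h_nb)))
        # Influence factor: weighted combination of the edge features.
        influence = sum(w * e for w, e in zip(edge_weights, edge))
        # New coefficient = node-feature score + edge influence factor.
        scores.append(node_score + influence)
    return softmax(scores)


def aggregate(neighbors, alphas):
    """Attention-weighted sum of the neighbor feature vectors."""
    dim = len(neighbors[0][0])
    return [
        sum(alpha * h_nb[d] for alpha, (h_nb, _) in zip(alphas, neighbors))
        for d in range(dim)
    ]


# Two neighbors with identical node features but different edge features
# (normalized response time, normalized request count); values are illustrative.
h_target = [0.5, 0.5]
neighbors = [([0.2, 0.1], [0.1, 0.3]), ([0.2, 0.1], [0.9, 0.3])]
alphas = edge_enhanced_attention(
    h_target, neighbors, attn_vec=[0.1, 0.1, 0.1, 0.1], edge_weights=[1.0, 0.5]
)
aggregated = aggregate(neighbors, alphas)
```

In this example the two neighbors would receive equal weight under node-feature attention alone. The higher response time on the second edge raises its weight, which is the effect the edge-feature term is meant to capture.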
Fig. 2. Major components of MicroEGRCL and the root cause localization workflow.
The overall localization process is as follows. First, the metrics of each service in the system are collected to construct the service call graph, and the metrics are normalized. Next, the attention coefficients between service nodes are calculated and used as the weights for node feature aggregation. The node features are then aggregated using these weights. Finally, the aggregated node features are classified, and the probability that each service node is the root cause of the failure is output.
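The first and last steps of this workflow, normalization and classification, can be sketched as below. The text does not state which normalization or classifier is used, so min-max scaling and a single logistic unit are assumptions made for illustration. The function names, weights, bias, and metric values are hypothetical, and the normalized features stand in for the aggregated node features that the attention stage would produce.

```python
import math


def min_max_normalize(features):
    """Scale each metric to [0, 1] across all services (assumed scheme)."""
    names = list(features)
    dim = len(features[names[0]])
    lows = [min(features[n][d] for n in names) for d in range(dim)]
    highs = [max(features[n][d] for n in names) for d in range(dim)]
    normalized = {}
    for n in names:
        normalized[n] = [
            # A constant metric carries no signal, so map it to 0.0.
            (features[n][d] - lows[d]) / (highs[d] - lows[d]) if highs[d] > lows[d] else 0.0
            for d in range(dim)
        ]
    return normalized


def root_cause_probability(vec, weights, bias):
    """Logistic classifier: probability that a node is the root cause."""
    z = sum(w * x for w, x in zip(weights, vec)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def rank_services(features, weights, bias):
    """Normalize, classify, and sort services by root cause probability."""
    normalized = min_max_normalize(features)
    scored = {n: root_cause_probability(v, weights, bias) for n, v in normalized.items()}
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)


# Raw per-service metrics [CPU %, mean response time in ms]; illustrative only.
raw = {"frontend": [35.0, 120.0], "cart": [40.0, 150.0], "db": [95.0, 900.0]}
ranking = rank_services(raw, weights=[2.0, 2.0], bias=-2.0)
```

Sorting by probability yields a ranked list of candidate root causes, so an operator can inspect the most likely service first.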
Authors
Ruibo Chen is a Ph.D. candidate at Beihang University, majoring in software engineering at the State Key Laboratory of Software Development Environment. His research interests are artificial intelligence for IT operations and reinforcement learning.
Jian Ren received the Ph.D. degree from University College London. He is a lecturer at Beihang University and a member of the China Computer Federation Software Engineering Professional Committee. His research interests include search-based software engineering, natural computation, and artificial intelligence.
Lingfeng Wang is a master's student at Beihang University, majoring in software engineering at the State Key Laboratory of Software Development Environment. His research interest is artificial intelligence for IT operations.
Yanjun Pu received a Bachelor's degree from the School of Computer Science and Engineering, Beihang University, in 2016. He is now a Ph.D. candidate at Beihang University, majoring in software engineering at the State Key Laboratory of Software Development Environment. His research interest is educational data mining.
Kaiyuan Yang is a master's student at Beihang University, majoring in computer science and technology at the State Key Laboratory of Software Development Environment. His research interests are microservice scheduling and artificial intelligence.
Wenjun Wu received the Ph.D. degree in computer science from Beihang University, Beijing, China. He is a professor at Beihang University and a member of the China Computer Federation. His research interests include intelligent tutoring systems, crowdsourcing, and cloud computing.