Frontiers of Data and Computing (数据与计算发展前沿), 2022, Vol. 4, Issue 3: 78-89.

CSTR: 32002.14.jfdc.CN10-1649/TP.2022.03.006

doi: 10.11871/jfdc.issn.2096-742X.2022.03.006

• Special Issue: Advanced Intelligent Computing Platforms and Applications (Part II)

A Survey of Root Cause Localization Methods under Microservice Architecture (微服务架构下的根因定位方法综述)

LI Siyi (李思毅), MA Shiyu (马诗雨), CUI Liyue (崔丽月), ZHANG Shenglin (张圣林), SUN Yongqian (孙永谦), ZHANG Yuzhi (张玉志)

  1. College of Software, Nankai University, Tianjin 300350, China
  • Received: 2022-02-24  Published: 2022-06-20  Online: 2022-06-20
  • Corresponding author: ZHANG Shenglin
  • About the authors:
    LI Siyi (李思毅) is a master's student in the College of Software at Nankai University. His research interests include machine learning and AIOps. In this paper, he is mainly responsible for writing the parts on root cause analysis methods and practice. E-mail: lisiyimail@qq.com
    MA Shiyu (马诗雨) is an undergraduate student in the College of Software at Nankai University. Her research interests include machine learning and AIOps. In this paper, she is mainly responsible for writing the parts on relational learning methods and practice. E-mail: nkcs77@163.com
    CUI Liyue (崔丽月) is a master's student in the College of Software at Nankai University. Her research interests include machine learning and AIOps. In this paper, she is mainly responsible for the literature survey and paper revision. E-mail: 2320190026@mail.nankai.edu.cn
    ZHANG Shenglin (张圣林), Ph.D., is an associate professor in the College of Software, Nankai University, Tianjin, China. His research focuses on failure detection, diagnosis, and prediction in data center networks. He has published more than 15 papers indexed by SCI/EI. In this paper, he is responsible for the literature survey and guidance. E-mail: zhangsl@nankai.edu.cn
    SUN Yongqian (孙永谦), Ph.D., is an assistant professor in the College of Software, Nankai University, Tianjin, China. His research interests include anomaly detection, root cause localization, and high-performance switching in data centers. In this paper, he is mainly responsible for designing the structure of the paper and revising it. E-mail: sunyongqian@nankai.edu.cn
    ZHANG Yuzhi (张玉志), Ph.D., is a distinguished professor and the dean of the College of Software, Nankai University. His research interests include deep learning and other areas of artificial intelligence. In this paper, he is responsible for the literature survey. E-mail: zyz@nankai.edu.cn
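  • Funding:
    National Key Research and Development Program of China (2021YFB0300104); National Natural Science Foundation of China, Young Scientists Fund (61902200); China Postdoctoral Science Foundation, General Program (2019M651015)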

Overview of Root Cause Localization Method in Microservice Architecture

LI Siyi, MA Shiyu, CUI Liyue, ZHANG Shenglin, SUN Yongqian, ZHANG Yuzhi

  1. College of Software, Nankai University, Tianjin 300350, China
  • Received: 2022-02-24  Online: 2022-06-20  Published: 2022-06-20
  • Contact: ZHANG Shenglin

Abstract:

[Objective] In large-scale cloud platforms, when the key performance indicators of a microservice system become abnormal, operation and maintenance engineers must promptly untangle the correlations behind alarm storms and large numbers of anomalous metrics, locate the root cause accurately, and restore service quickly. [Methods] This paper describes in detail how to build a fault propagation graph under a microservice architecture and introduces root cause localization techniques based on graph reasoning. Drawing on experience in building operations and high-availability capabilities on cloud platforms, we review, compare, and summarize existing root cause localization methods. [Results] Root cause localization methods based on graph reasoning significantly improve the stability and reliability of cloud systems in large data centers. [Limitations] These methods rely on a stable monitoring infrastructure and on accurate anomaly detection for metrics. [Conclusions] As digital transformation deepens, root cause localization under microservice architectures will play an increasingly important role in ensuring the stability of large-scale cloud platforms.

Key words: cloud native, microservices, AIOps, root cause localization