
HyWarm: Adaptive Hybrid Warmup Method for RTL Emulation of Processors

Zhou Yaoyang, Han Boyang, Lin Jiawei, Wang Kaifan, Zhang Linjuan, Yu Zihao, Tang Dan, Wang Sa, Sun Ninghui, Bao Yungang

Citation: Zhou Yaoyang, Han Boyang, Lin Jiawei, Wang Kaifan, Zhang Linjuan, Yu Zihao, Tang Dan, Wang Sa, Sun Ninghui, Bao Yungang. HyWarm: Adaptive Hybrid Warmup Method for RTL Emulation of Processors[J]. Journal of Computer Research and Development, 2023, 60(6): 1246-1261. DOI: 10.7544/issn1000-1239.202330061. CSTR: 32373.14.issn1000-1239.202330061



    Corresponding author:

    Bao Yungang (baoyg@ict.ac.cn)

  • CLC number: TP391

HyWarm: Adaptive Hybrid Warmup Method for RTL Emulation of Processors

Funds: This work was supported by the Strategic Priority Research Program of Chinese Academy of Sciences (XDC05030200), and the Major Program of the National Natural Science Foundation of China (62090020).
More Information
    Author Bio:

    Zhou Yaoyang: born in 1995. PhD. His main research interests include CPU ILP enhancement, scalable CPU design, workload sampling, and performance evaluation methods

    Han Boyang: born in 1999. Master of Engineering candidate. His main research interests include computer architecture, digital system design, and high-speed serial communication protocols

    Lin Jiawei: born in 1998. Master candidate. His main research interest is high-performance computer architecture

    Wang Kaifan: born in 1997. PhD candidate. His main research interests include agile development of processors and computer architecture

    Zhang Linjuan: born in 1998. Master candidate. Her main research interest is high-performance computer architecture

    Yu Zihao: born in 1991. PhD. His main research interests include computer architecture and operating systems

    Tang Dan: born in 1976. PhD, senior engineer. His main research interests include computer architecture and low-power SoC design

    Wang Sa: born in 1986. PhD, associate professor. His main research interests include cloud computing, operating systems, and system modeling and performance analysis

    Sun Ninghui: born in 1968. PhD, academician of Chinese Academy of Engineering, fellow of CCF. His main research interests include computer architecture and high performance computing

    Bao Yungang: born in 1980. PhD, professor. His main research interests include data-center architecture, agile design methodology of processor chips and ecosystem of open-source processor chips

  • Abstract:

    In high-performance processor development, accurate and fast performance estimation is the basis for design decisions and parameter selection. Prior work accelerates processor RTL emulation through workload sampling and architectural checkpoints for RTL, making it possible to estimate the performance of benchmarks such as SPEC CPU on complex high-performance processors within a few days. However, an iteration cycle of several days is still too long for architecture exploration, and the performance measurement cycle can be shortened further. During RTL emulation of processors, the warm-up phase accounts for a large share of the total time. The HyWarm framework is proposed to accelerate this warm-up phase. HyWarm analyzes the warm-up demand of workloads with a micro-architectural simulator and customizes a warm-up scheme for each workload. For workloads with high cache warm-up demand, HyWarm performs functional warm-up of the RTL caches through the bus protocol. For the detailed RTL emulation, HyWarm uses CPU clustering and LJF scheduling to reduce the maximum completion time. Compared with the best existing sampling-based RTL emulation method, HyWarm reduces the emulation completion time by 53% while achieving accuracy similar to the baseline.
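The LJF scheduling mentioned in the abstract can be illustrated with a small sketch. It reads LJF as greedy longest-job-first assignment (an assumption; the abstract does not expand the acronym): jobs are sorted by estimated duration, and each is placed on the currently least-loaded cluster. The function name `ljf_schedule`, the job names, and the durations are hypothetical and are not taken from the HyWarm implementation.

```python
import heapq


def ljf_schedule(job_hours, num_clusters):
    """Greedy longest-job-first assignment (illustrative sketch).

    job_hours: dict mapping a job name to its estimated duration.
    Returns (assignment, makespan), where assignment maps a cluster
    index to the list of jobs placed on it, in placement order.
    """
    # Min-heap of (current load, cluster index): the least-loaded cluster is on top.
    heap = [(0.0, c) for c in range(num_clusters)]
    assignment = {c: [] for c in range(num_clusters)}

    # Place the longest remaining job on the least-loaded cluster.
    for name, hours in sorted(job_hours.items(), key=lambda kv: kv[1], reverse=True):
        load, cluster = heapq.heappop(heap)
        assignment[cluster].append(name)
        heapq.heappush(heap, (load + hours, cluster))

    # The maximum completion time is the load of the busiest cluster.
    makespan = max(load for load, _ in heap)
    return assignment, makespan


if __name__ == "__main__":
    # Hypothetical job durations in hours, chosen only for illustration.
    jobs = {"a": 7, "b": 5, "c": 4, "d": 3, "e": 1}
    plan, finish = ljf_schedule(jobs, num_clusters=2)
    print(plan, finish)
```

Sorting longest-first keeps the large jobs from landing at the end of the schedule, where they would otherwise dominate the maximum completion time.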

  • Figure 1. Existing sampling-based simulation methods

    Figure 2. Emulation time distribution of 492 checkpoints from SPEC CPU® 2006

    Figure 3. Optimization overview of HyWarm: the existing fixed warm-up duration is divided into three segments

    Figure 4. Mainstream sampling-based simulation methods

    Figure 5. Warm-up demand curve of sjeng

    Figure 6. Warm-up length search process

    Figure 7. Warm-up demand of branch predictors in the GEM5 simulator and the Xiangshan processor

    Figure 8. Impact of enabling multi-threading in Verilator on scheduling policy

    Figure 9. Comparison of maximum completion time under different scheduling policies

    Figure 10. Workflow of HyWarm

    Figure 11. Workflow of Filter mode

    Figure 12. Cache subsystem that receives TileLink requests

    Figure 13. Distribution of warm-up demand (number of instructions) of checkpoints

    Figure 14. Warm-up demand curves of the GEM5 simulator and the Xiangshan processor

    Figure 15. Impact of different warm-up schemes on L1MP

    Figure 16. Impact of different warm-up schemes on branch MPKI

    Figure 17. Impact of different warm-up schemes on CPI

    Figure 18. Distribution of total detailed simulation cycle counts for 53 workloads using adaptive warm-up

    Table 1   Emulation Speed of Verilator with Different Parallelism on AMD EPYC 7H12 Server with 64 Cores

    Emulation speed/IPS | 4 threads, single task | 4 threads, 16 tasks (fully loaded) | Performance loss
    Per task | 2153.13 | 1189.31 |
    Per core | 538.28 | 297.33 | 45%
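The per-core figures and the 45% loss in Table 1 can be reproduced with simple arithmetic. The sketch below assumes that the per-core speed is the per-task speed divided by the 4 threads each task uses, and that the loss is the relative drop in per-core speed under full load. Both readings are assumptions that happen to match the table values, and the helper names are hypothetical.

```python
def per_core_ips(task_ips, threads):
    """Per-core speed, assuming one core per thread."""
    return task_ips / threads


def relative_loss(unloaded_ips, loaded_ips):
    """Fraction of speed lost when going from an idle to a fully loaded server."""
    return 1.0 - loaded_ips / unloaded_ips


if __name__ == "__main__":
    # Per-task speeds taken from Table 1 (4 threads per task).
    idle = per_core_ips(2153.13, 4)    # single task on an otherwise idle server
    loaded = per_core_ips(1189.31, 4)  # 16 concurrent tasks, server fully loaded
    print(f"idle per-core IPS:   {idle:.2f}")
    print(f"loaded per-core IPS: {loaded:.2f}")
    print(f"performance loss:    {relative_loss(idle, loaded):.0%}")
```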

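    Table 2   Comparison of Commonly Used RTL Performance Evaluation Methods

    RTL performance evaluation method | Emulation frequency | Typical price/CNY | Rentable | Typical design capacity
    RTL software simulator | 1 kHz | 50,000–100,000 | | Commercial-grade SoC
    Public-cloud FPGA | ≤100 MHz | 240–3,600 per day | | BOOM processor
    Private FPGA | ≤100 MHz | ≤400,000 | | Xiangshan processor
    Hardware emulation accelerator | ≤1 MHz | >10,000,000 | | Commercial-grade SoC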

    Table 3   Comparison of Multi-threading Scaling Efficiency of Verilator Emulation When Server Load is Low

    Thread count | 1 | 4 | 8 | 16
    Per-core IPS | 190.82 | 538.28 | 450.94 | 321.27

    Table 4   Comparison of Multi-threading Scaling Efficiency of Verilator Emulation When Server is Fully Loaded

    Thread count | 4 | 8 | 16
    Per-core IPS | 297.33 | 389.27 | 335.50

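    Table 5   Microarchitectural Configuration

    Component | Configuration
    Branch predictor | 16KB TAGE-SC + ITTAGE + RAS + 4KB BTB
    L1 data cache | 128KB, 8-way
    L1 instruction cache | 128KB, 8-way
    L2 cache | 1MB, 8-way, non-inclusive
    L3 cache | 6MB, 6-way, non-inclusive
    L1 instruction TLB | 40 entries
    L1 data TLB | 136 entries (128 × 4K pages + 8 × 2M pages)
    L2 TLB | 2K entries
    Fetch width | 8 × 4B instructions per cycle
    Decode/rename width | 6 instructions per cycle
    ROB/LQ/SQ | 256/80/64
    Physical register file | 192 integer; 192 floating-point
    Execution units | Int: 4×ALU, 2×MDU, 1×Misc; Mem: 2×Ld AGU, 2×St AGU; Float: 4×FMA, 2×Misc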

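    Table 6   Warm-up Configurations

    Scheme | Functional warm-up (M instructions) | Detailed warm-up (M instructions) | Performance measurement (M instructions)
    0+100 | 0 | 100 | 5
    0+50 | 0 | 50 | 5
    0+25 | 0 | 25 | 5
    0+10 | 0 | 10 | 5
    0+5 | 0 | 5 | 5
    Ada | 100−DW | Adaptive (DW) | 5
    FixedFW (95+5) | 95 | 5 | 5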

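    Table 7   Comparison of Total Simulation Time for Different Functional Warm-up Schemes (h)

    Sub-item | 0+5 | 0+10 | 0+25 | FixedFW (95+5) | Ada
    GemsFDTD | 0.37 | 0.55 | 1.04 | 0.42 | 0.29
    astar.bi | 0.57 | 0.91 | 1.65 | 0.58 | 0.64
    astar.ri | 0.69 | 0.95 | 1.97 | 0.66 | 0.79
    bwaves | 0.57 | 0.92 | 1.68 | 0.60 | 0.43
    bzip2.chi | 0.30 | 0.43 | 0.81 | 0.30 | 0.22
    bzip2.com | 1.00 | 1.52 | 2.71 | 1.01 | 0.72
    bzip2.htm | 0.30 | 0.43 | 0.92 | 0.34 | 0.31
    bzip2.lib | 0.30 | 0.42 | 0.89 | 0.30 | 0.21
    bzip2.pro | 1.01 | 1.60 | 3.19 | 0.98 | 0.68
    bzip2.sou | 0.95 | 1.49 | 2.92 | 1.08 | 0.96
    cactusADM | 0.41 | 0.60 | 1.35 | 0.47 | 0.32
    calculix | 0.35 | 0.60 | 1.12 | 0.36 | 0.26
    dealII | 0.33 | 0.51 | 1.10 | 0.40 | 1.20
    gamess.cy | 0.33 | 0.49 | 1.00 | 0.36 | 3.46
    gamess.gra | 0.35 | 0.51 | 1.06 | 0.38 | 1.09
    gamess.tri | 0.33 | 0.50 | 0.92 | 0.34 | 1.10
    gcc.166 | 0.42 | 0.61 | 1.33 | 0.48 | 1.34
    gcc.200 | 0.90 | 1.17 | 2.72 | 0.89 | 0.71
    gcc.cpde | 0.54 | 0.86 | 1.63 | 0.62 | 1.75
    gcc.expr2 | 0.58 | 0.86 | 1.76 | 0.63 | 1.03
    gcc.expr | 0.63 | 0.89 | 1.75 | 0.61 | 0.70
    gcc.g23 | 0.55 | 0.76 | 1.54 | 0.66 | 0.43
    gcc.s04 | 0.57 | 0.93 | 1.66 | 0.67 | 0.69
    gcc.scil | 0.90 | 1.10 | 2.34 | 0.94 | 2.48
    gcc.type | 0.92 | 1.44 | 2.62 | 0.91 | 1.57
    gobmk.13x | 0.94 | 1.51 | 3.08 | 0.99 | 1.66
    gobmk.nn | 0.85 | 1.28 | 2.61 | 0.92 | 0.61
    gobmk.sco | 0.97 | 1.34 | 2.70 | 0.98 | 0.66
    gobmk.tr | 0.95 | 1.30 | 2.63 | 0.87 | 0.98
    gobmk.tr | 0.71 | 1.07 | 2.26 | 0.73 | 1.17
    gromacs | 0.72 | 1.00 | 2.25 | 0.72 | 0.48
    h264ref.f | 0.44 | 0.58 | 1.21 | 0.47 | 0.45
    h264ref.s | 0.38 | 0.50 | 1.04 | 0.38 | 2.23
    hmmer.nph | 0.77 | 1.25 | 2.52 | 0.85 | 1.45
    hmmer.re | 0.80 | 1.21 | 2.43 | 0.92 | 0.79
    lbm | 0.67 | 1.02 | 2.08 | 0.74 | 0.57
    leslie3d | 0.51 | 0.78 | 1.43 | 0.51 | 0.35
    libquantum | 0.56 | 0.78 | 1.55 | 0.98 | 0.39
    mcf | 3.14 | 4.18 | 9.35 | 3.34 | 2.32
    milc | 0.42 | 0.59 | 1.26 | 0.46 | 0.34
    namd | 0.52 | 0.77 | 1.38 | 0.48 | 0.31
    omnetpp | 1.08 | 1.66 | 3.19 | 1.27 | 1.06
    perl.che | 0.46 | 0.68 | 1.29 | 0.47 | 0.83
    perl.di | 0.55 | 0.83 | 1.37 | 0.52 | 1.56
    perl.spli | 0.43 | 0.66 | 1.31 | 0.43 | 0.32
    povray | 0.55 | 0.88 | 1.65 | 0.54 | 5.39
    sjeng | 0.72 | 1.05 | 2.00 | 0.67 | 2.14
    soplex.p | 1.15 | 1.59 | 3.57 | 1.36 | 0.87
    soplex.r | 1.11 | 1.70 | 3.05 | 1.14 | 0.71
    sphinx3 | 0.46 | 0.72 | 1.33 | 0.59 | 1.49
    tonto | 0.37 | 0.55 | 1.19 | 0.41 | 0.48
    xalancbmk | 0.89 | 1.42 | 2.56 | 1.17 | 1.03
    zeusmp | 0.51 | 0.75 | 1.53 | 0.58 | 0.39
    Total | 35.8 | 52.7 | 105.5 | 38.5 | 54.4
    Note: mcf (9.35 h) is the longest-running sub-item under 25M-instruction detailed warm-up, and povray (5.39 h) is the longest-running sub-item under the Ada configuration; the original table marks these two values in bold.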

    Table 8   Accuracy Comparison of Different Schemes (%)

    Scheme | CPI | Branch MPKI | L1MP
    Ada | 99.6 | 91.6 | 95.1
    0+50 | 99.8 | 98.9 | 97.5
    0+25 | 99.7 | 94.1 | 91.3
    0+10 | 99.1 | 85.2 | 82.8

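    Table 9   Branch MPKI Prediction Error Caused by WarmProfiler (increase)

    Sub-item | MPKI with perfect prediction | MPKI increase | MPKI increase percentage/%
    gcc_expr2 | 0.443 | 0.177 | 39.9
    gcc_g23 | 0.973 | 0.172 | 17.7
    tonto | 0.506 | 0.117 | 23.1
    gamess_g | 0.430 | 0.112 | 26.1
    gcc_scilab | 7.687 | 0.090 | 1.2
    xalancbmk | 2.003 | 0.079 | 3.9
    gcc_s04 | 0.163 | 0.070 | 42.8
    perl_di | 0.669 | 0.066 | 9.8
    h264ref_f | 0.042 | 0.064 | 151.9
    astar_rivers | 3.422 | 0.053 | 1.6
    Note: The MPKI error is the MPKI obtained with WarmProfiler-guided warm-up minus the MPKI obtained when warming up according to the true warm-up demand of the RTL. The original table marks in bold the sub-items whose MPKI error exceeds 0.1 (the first four rows).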

    Table 10   Impact of Cluster Count on Scheduling Balance

    Scheduling balance | Random scheduling | LJF scheduling
    4 clusters × 16 cores | 0.93 | 0.99
    8 clusters × 8 cores | 0.76 | 0.98
    16 clusters × 4 cores | 0.54 | 0.63

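    Table 11   Comparison of Simulation Time Between LJF Scheduling and Random Scheduling

    Emulation | Random scheduling/h | LJF scheduling/h | Improvement/%
    Ada, 8 cores × 8 clusters | 8.71 | 6.91 | 20.61
    Ada, 8 cores × 16 clusters | 6.25 | 5.38 | 13.89
    25+5, 8 cores × 8 clusters | 15.98 | 13.54 | 15.26
    25+5, 8 cores × 16 clusters | 11.29 | 9.35 | 17.23
    Note: Ada combined with LJF scheduling is the scheme proposed by HyWarm; 25+5 combined with random scheduling is the baseline scheme.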
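The improvement column of Table 11 appears to be the relative reduction in completion time when moving from random to LJF scheduling. The sketch below makes that assumption explicit (the page does not state the formula) and recomputes the column from the rounded hour values. The results land within about 0.1 percentage point of the published figures, and the small gaps are presumably due to rounding of the hours. The function name `improvement_percent` is hypothetical.

```python
def improvement_percent(random_hours, ljf_hours):
    """Relative reduction in completion time, in percent (assumed definition)."""
    return (random_hours - ljf_hours) / random_hours * 100.0


# (random scheduling, LJF scheduling) completion times in hours, from Table 11.
TABLE_11 = {
    "Ada, 8 cores x 8 clusters": (8.71, 6.91),
    "Ada, 8 cores x 16 clusters": (6.25, 5.38),
    "25+5, 8 cores x 8 clusters": (15.98, 13.54),
    "25+5, 8 cores x 16 clusters": (11.29, 9.35),
}

if __name__ == "__main__":
    for config, (random_h, ljf_h) in TABLE_11.items():
        print(f"{config}: {improvement_percent(random_h, ljf_h):.2f}%")
```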

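    Table 12   Maximum Completion Time of LJF Scheduling Guided by Simulator IPC and Real IPC of RTL (h)

    Ada emulation | Simulator-predicted IPC | Real IPC
    8 cores × 4 clusters | 13.77 | 13.67
    8 cores × 8 clusters | 6.91 | 6.92
    8 cores × 16 clusters | 5.38 | 5.38
    Note: With 8 clusters, the simulator-predicted IPC yields a shorter completion time (marked in bold in the original table). Because LJF is a greedy algorithm, an error in the predicted completion time can happen to produce a better schedule.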

Publication history
  • Received: 2023-01-09
  • Revised: 2023-04-14
  • Published online: 2023-05-03
  • Issue date: 2023-05-31
