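HyWarm: Adaptive Hybrid Warmup Method for RTL Emulation of Processors

Zhou Yaoyang; Han Boyang; Lin Jiawei; Wang Kaifan; Zhang Linjuan; Yu Zihao; Tang Dan; Wang Sa; Sun Ninghui; Bao Yungang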

doi:10.7544/issn1000-1239.202330061

Corresponding author:
Bao Yungang (baoyg@ict.ac.cn)

CLC number: TP391
Publication history
- Received: 2023-01-09
- Revised: 2023-04-14
- Published online: 2023-05-03
- Issue date: 2023-05-31

HyWarm: Adaptive Hybrid Warmup Method for RTL Emulation of Processors

Zhou Yaoyang^{1, 2, 3},
Han Boyang^{4},
Lin Jiawei^{1, 2, 3},
Wang Kaifan^{1, 2, 3},
Zhang Linjuan^{1, 2, 3},
Yu Zihao^{1},
Tang Dan^{1, 3},
Wang Sa^{1, 2},
Sun Ninghui^{1, 2, 3},
Bao Yungang^{1, 2}

1. State Key Lab of Processors (Institute of Computing Technology, Chinese Academy of Sciences), Beijing 100190
2. School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing 100049
3. Beijing Institute of Open Source Chip, Beijing 100080
4. Department of Electrical and Electronic Engineering, The University of Hong Kong, Hong Kong 999077

Funds: This work was supported by the Strategic Priority Research Program of Chinese Academy of Sciences (XDC05030200), and the Major Program of the National Natural Science Foundation of China (62090020).

Author Bio:
Zhou Yaoyang: born in 1995. PhD. His main research interests include CPU ILP enhancement, scalable CPU design, workload sampling, and performance evaluation methods.

Han Boyang: born in 1999. Master of Engineering candidate. His main research interests include computer architecture, digital system design, and high-speed serial communication protocols.

Lin Jiawei: born in 1998. Master candidate. His main research interest is high-performance computer architecture.

Wang Kaifan: born in 1997. PhD candidate. His main research interests include agile development of processors and computer architecture.

Zhang Linjuan: born in 1998. Master candidate. Her main research interest is high-performance computer architecture.

Yu Zihao: born in 1991. PhD. His main research interests include computer architecture and operating systems.

Tang Dan: born in 1976. PhD, senior engineer. His main research interests include computer architecture and low-power SoC design.

Wang Sa: born in 1986. PhD, associate professor. His main research interests include cloud computing, operating systems, and system modeling and performance analysis.

Sun Ninghui: born in 1968. PhD, academician of the Chinese Academy of Engineering, fellow of CCF. His main research interests include computer architecture and high-performance computing.

Bao Yungang: born in 1980. PhD, professor. His main research interests include data-center architecture, agile design methodology for processor chips, and the open-source processor chip ecosystem.

Abstract:
In high-performance processor development, accurate and fast performance estimation is the basis for design decisions and parameter selection. Prior work accelerates processor RTL emulation through workload sampling and architectural checkpoints for RTL, making it possible to estimate the performance of benchmarks such as SPEC CPU on complex high-performance processors within a few days. However, an iteration cycle of several days is still too long for architecture exploration, and the performance measurement cycle can be shortened further. During RTL emulation of processors, the warm-up phase accounts for a large share of the total time. The HyWarm framework is proposed to accelerate this warm-up phase. HyWarm analyzes the warm-up demand of each workload with a microarchitectural simulator and customizes a warm-up scheme for each workload. For workloads with high cache warm-up demand, HyWarm functionally warms up the RTL caches through the bus protocol. For detailed RTL emulation, HyWarm uses CPU clustering and LJF scheduling to reduce the maximum completion time. Compared with the best existing sampling-based RTL emulation method, HyWarm reduces the emulation completion time by 53% while achieving accuracy similar to that of the baseline.
- high-performance processor
- chip design
- agile development
- workload sampling
- functional warm-up
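
The per-workload customization described in the abstract can be illustrated with a minimal sketch. It assumes a hypothetical warm-up demand curve, obtained from a microarchitectural simulator, that maps each candidate detailed warm-up length (in millions of instructions) to the relative error against the longest warm-up. The function name, the 2% tolerance, and the sample curve are illustrative assumptions, not HyWarm's actual search procedure.

```python
def pick_warmup_length(demand_curve, tolerance=0.02):
    """Return the shortest warm-up length whose error is within tolerance.

    demand_curve: dict mapping warm-up length (M instructions) to the
    relative error versus the longest warm-up (hypothetical input).
    Falls back to the longest available length if none qualifies.
    """
    lengths = sorted(demand_curve)
    for length in lengths:
        if demand_curve[length] <= tolerance:
            return length
    return lengths[-1]


# Hypothetical curve: the error shrinks as the warm-up grows.
curve = {5: 0.12, 10: 0.06, 25: 0.015, 50: 0.004, 100: 0.0}
print(pick_warmup_length(curve))  # 25
```

A workload whose curve flattens early gets a short detailed warm-up; one that never meets the tolerance falls back to the longest length considered.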

Figure 1. Existing sampling-based simulation methods

Figure 2. Emulation time distribution of 492 checkpoints from SPEC CPU® 2006

Figure 3. Optimization overview of HyWarm: the existing fixed warm-up duration is divided into three segments

Figure 4. Mainstream sampling-based simulation methods

Figure 5. Warm-up demand curve of sjeng

Figure 6. Warm-up length search process

Figure 7. Warm-up demand of branch predictors in the GEM5 simulator and the Xiangshan processor

Figure 8. Impact of enabling multi-threading in Verilator on scheduling policy

Figure 9. Comparison of maximum completion time under different scheduling policies

Figure 10. Workflow of HyWarm

Figure 11. Workflow of Filter mode

Figure 12. Cache subsystem that receives TileLink requests

Figure 13. Distribution of warm-up demand (number of instructions) of checkpoints

Figure 14. Warm-up demand curves of the GEM5 simulator and the Xiangshan processor

Figure 15. Impact of different warm-up schemes on L1MP

Figure 16. Impact of different warm-up schemes on branch MPKI

Figure 17. Impact of different warm-up schemes on CPI

Figure 18. Distribution of total detailed simulation cycle counts for 53 workloads using adaptive warm-up

Table 1 Emulation Speed of Verilator with Different Numbers of Parallel Tasks on a 64-Core AMD EPYC 7H12 Server

Emulation speed/IPS	4 threads, 1 task	4 threads, 16 tasks	Performance loss at full load
Per task	2153.13	1189.31
Per core	538.28	297.33	45%
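
The per-core figures and the 45% loss in Table 1 can be rechecked with a few lines of arithmetic. This sketch assumes that the per-core speed is the per-task speed divided by the 4 threads each task uses, and that the loss is the relative drop in per-core speed; both readings are inferred from the numbers rather than stated in the table.

```python
THREADS_PER_TASK = 4

single_task_ips = 2153.13   # 4 threads, 1 task (Table 1)
loaded_task_ips = 1189.31   # 4 threads, 16 tasks (Table 1)

# Per-core speed: per-task speed spread over the task's threads.
per_core_single = single_task_ips / THREADS_PER_TASK
per_core_loaded = loaded_task_ips / THREADS_PER_TASK

# Relative per-core slowdown when the server is fully loaded.
loss = 1 - per_core_loaded / per_core_single

print(f"per core, single task: {per_core_single:.2f} IPS")
print(f"per core, 16 tasks:    {per_core_loaded:.2f} IPS")
print(f"loss at full load:     {loss:.0%}")
```

Under these assumptions the loss comes out at roughly 45%, matching the table.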

Table 2 Comparison of Commonly Used RTL Performance Evaluation Methods

RTL performance evaluation method	Emulation frequency	Typical price/CNY	Rentable	Typical design it can hold
RTL software simulator	$\leqslant$ 1 kHz	50,000−100,000	Yes	Commercial-grade SoC
Public-cloud FPGA	$\leqslant$ 100 MHz	240−3,600 per day	Yes	BOOM processor
Private FPGA	$\leqslant$ 100 MHz	$\leqslant$ 400,000	No	Xiangshan processor
Hardware emulator	$\leqslant$ 1 MHz	>10,000,000	No	Commercial-grade SoC

Table 3 Comparison of Multi-threading Scaling Efficiency of Verilator Emulation When Server Load Is Low

Number of threads	1	4	8	16
IPS per core	190.82	538.28	450.94	321.27

Table 4 Comparison of Multi-threading Scaling Efficiency of Verilator Emulation When Server Is Fully Loaded

Number of threads	4	8	16
IPS per core	297.33	389.27	335.50

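Table 5 Microarchitectural Configuration

Component	Configuration
Branch predictor	16KB TAGE-SC + ITTAGE + RAS + 4KB BTB
L1 data cache	128KB, 8-way
L1 instruction cache	128KB, 8-way
L2 cache	1MB, 8-way, non-inclusive
L3 cache	6MB, 6-way, non-inclusive
L1 instruction TLB	40 entries
L1 data TLB	136 entries (128 × 4KB pages + 8 × 2MB pages)
L2 TLB	2K entries
Fetch width	8 × 4B instructions per cycle
Decode/rename width	6 instructions per cycle
ROB/LQ/SQ	256/80/64
Physical register file	192 integer; 192 floating-point
Execution units	Int: 4×ALU, 2×MDU, 1×Misc; Mem: 2×Ld AGU, 2×St AGU; Float: 4×FMA, 2×Misc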

Table 6 Warm-up Configurations

Scheme	Functional warm-up (M instructions)	Detailed warm-up (M instructions)	Performance measurement (M instructions)
0+100	0	100	5
0+50	0	50	5
0+25	0	25	5
0+10	0	10	5
0+5	0	5	5
Ada	100−DW	Adaptive (DW)	5
FixedFW (95+5)	95	5	5
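
The schemes in Table 6 can be written down as a small configuration sketch. It assumes that an "F+D" scheme name means F million instructions of functional warm-up followed by D million of detailed warm-up, and that the Ada scheme spends 100 − DW on functional warm-up as its row states. The class and function names are hypothetical and not part of HyWarm.

```python
from dataclasses import dataclass

TOTAL_WARMUP_M = 100  # warm-up window of the Ada scheme (100 - DW functional + DW detailed)
MEASURE_M = 5         # every scheme measures 5M instructions


@dataclass(frozen=True)
class WarmupPlan:
    functional_m: int   # functional warm-up, in M instructions
    detailed_m: int     # detailed (cycle-accurate) warm-up, in M instructions
    measure_m: int = MEASURE_M


def fixed_detailed(detailed_m):
    """Schemes such as 0+25: detailed warm-up only."""
    return WarmupPlan(0, detailed_m)


def adaptive(dw_m):
    """Ada: DW of detailed warm-up, the rest of the window functional."""
    if not 0 <= dw_m <= TOTAL_WARMUP_M:
        raise ValueError("DW must lie within the warm-up window")
    return WarmupPlan(TOTAL_WARMUP_M - dw_m, dw_m)


FIXED_FW = WarmupPlan(95, 5)  # FixedFW (95+5)

print(fixed_detailed(25))
print(adaptive(10))
```

Seen this way, FixedFW is the special case of Ada in which DW is pinned to 5 for every workload.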

Table 7 Comparison of Total Simulation Time for Different Functional Warm-up Schemes (h)

Benchmark	0+5	0+10	0+25	FixedFW (95+5)	Ada
GemsFDTD	0.37	0.55	1.04	0.42	0.29
astar.bi	0.57	0.91	1.65	0.58	0.64
astar.ri	0.69	0.95	1.97	0.66	0.79
bwaves	0.57	0.92	1.68	0.60	0.43
bzip2.chi	0.30	0.43	0.81	0.30	0.22
bzip2.com	1.00	1.52	2.71	1.01	0.72
bzip2.htm	0.30	0.43	0.92	0.34	0.31
bzip2.lib	0.30	0.42	0.89	0.30	0.21
bzip2.pro	1.01	1.60	3.19	0.98	0.68
bzip2.sou	0.95	1.49	2.92	1.08	0.96
cactusADM	0.41	0.60	1.35	0.47	0.32
calculix	0.35	0.60	1.12	0.36	0.26
dealII	0.33	0.51	1.10	0.40	1.20
gamess.cy	0.33	0.49	1.00	0.36	3.46
gamess.gra	0.35	0.51	1.06	0.38	1.09
gamess.tri	0.33	0.50	0.92	0.34	1.10
gcc.166	0.42	0.61	1.33	0.48	1.34
gcc.200	0.90	1.17	2.72	0.89	0.71
gcc.cpde	0.54	0.86	1.63	0.62	1.75
gcc.expr2	0.58	0.86	1.76	0.63	1.03
gcc.expr	0.63	0.89	1.75	0.61	0.70
gcc.g23	0.55	0.76	1.54	0.66	0.43
gcc.s04	0.57	0.93	1.66	0.67	0.69
gcc.scil	0.90	1.10	2.34	0.94	2.48
gcc.type	0.92	1.44	2.62	0.91	1.57
gobmk.13x	0.94	1.51	3.08	0.99	1.66
gobmk.nn	0.85	1.28	2.61	0.92	0.61
gobmk.sco	0.97	1.34	2.70	0.98	0.66
gobmk.tr	0.95	1.30	2.63	0.87	0.98
gobmk.tr	0.71	1.07	2.26	0.73	1.17
gromacs	0.72	1.00	2.25	0.72	0.48
h264ref.f	0.44	0.58	1.21	0.47	0.45
h264ref.s	0.38	0.50	1.04	0.38	2.23
hmmer.nph	0.77	1.25	2.52	0.85	1.45
hmmer.re	0.80	1.21	2.43	0.92	0.79
lbm	0.67	1.02	2.08	0.74	0.57
leslie3d	0.51	0.78	1.43	0.51	0.35
libquantum	0.56	0.78	1.55	0.98	0.39
mcf	3.14	4.18	9.35	3.34	2.32
milc	0.42	0.59	1.26	0.46	0.34
namd	0.52	0.77	1.38	0.48	0.31
omnetpp	1.08	1.66	3.19	1.27	1.06
perl.che	0.46	0.68	1.29	0.47	0.83
perl.di	0.55	0.83	1.37	0.52	1.56
perl.spli	0.43	0.66	1.31	0.43	0.32
povray	0.55	0.88	1.65	0.54	5.39
sjeng	0.72	1.05	2.00	0.67	2.14
soplex.p	1.15	1.59	3.57	1.36	0.87
soplex.r	1.11	1.70	3.05	1.14	0.71
sphinx3	0.46	0.72	1.33	0.59	1.49
tonto	0.37	0.55	1.19	0.41	0.48
xalancbmk	0.89	1.42	2.56	1.17	1.03
zeusmp	0.51	0.75	1.53	0.58	0.39
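Total	35.8	52.7	105.5	38.5	54.4
Note: The numbers set in bold in the original table indicate that mcf is the longest-running benchmark under 25M-instruction detailed warm-up (9.35 h), while povray is the longest-running benchmark under the Ada configuration (5.39 h).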

Table 8 Accuracy Comparison of Different Schemes (%)

Scheme	CPI	Branch MPKI	L1MP
Ada	99.6	91.6	95.1
0+50	99.8	98.9	97.5
0+25	99.7	94.1	91.3
0+10	99.1	85.2	82.8

Table 9 Branch MPKI Prediction Error Caused by WarmProfiler (Increase)

Benchmark	MPKI with perfect prediction	MPKI increase	MPKI increase/%
gcc_expr2	0.443	0.177	39.9
gcc_g23	0.973	0.172	17.7
tonto	0.506	0.117	23.1
gamess_g	0.430	0.112	26.1
gcc_scilab	7.687	0.090	1.2
xalancbmk	2.003	0.079	3.9
gcc_s04	0.163	0.070	42.8
perl_di	0.669	0.066	9.8
h264ref_f	0.042	0.064	151.9
astar_rivers	3.422	0.053	1.6
Note: The MPKI error is computed as the MPKI obtained with WarmProfiler-guided warm-up minus the MPKI obtained by warming up according to the real warm-up demand of the RTL. Numbers set in bold in the original table mark benchmarks whose MPKI error exceeds 0.1.
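
The percentage column of Table 9 can be reproduced, to within rounding, from the other two columns. The sketch below assumes that the percentage is the MPKI increase divided by the perfect-prediction MPKI; the helper name is hypothetical.

```python
def mpki_increase_percent(reference_mpki, increase):
    """Relative MPKI increase, in percent of the reference MPKI."""
    return 100.0 * increase / reference_mpki


# (MPKI with perfect prediction, MPKI increase) taken from Table 9.
rows = {
    "gcc_expr2": (0.443, 0.177),
    "gcc_scilab": (7.687, 0.090),
    "h264ref_f": (0.042, 0.064),
}

for name, (reference, increase) in rows.items():
    print(f"{name}: +{increase:.3f} MPKI, "
          f"{mpki_increase_percent(reference, increase):.1f}%")
```

Recomputing from the rounded table entries gives values slightly different from the printed percentages, which were presumably derived from unrounded data. The example also shows why h264ref_f has the largest relative error despite a small absolute one: its reference MPKI is only 0.042.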

Table 10 Impact of Cluster Count on Scheduling Balance

Scheduling balance	Random scheduling	LJF scheduling
4 clusters × 16 cores	0.93	0.99
8 clusters × 8 cores	0.76	0.98
16 clusters × 4 cores	0.54	0.63

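Table 11 Comparison of Simulation Time Between LJF Scheduling and Random Scheduling

Emulation	Random scheduling/h	LJF scheduling/h	Improvement/%
Ada, 8 cores × 8 clusters	8.71	6.91	20.61
Ada, 8 cores × 16 clusters	6.25	5.38	13.89
25+5, 8 cores × 8 clusters	15.98	13.54	15.26
25+5, 8 cores × 16 clusters	11.29	9.35	17.23
Note: Ada combined with LJF scheduling is the scheme proposed by HyWarm; 25+5 combined with random scheduling is the baseline.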
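
The LJF scheduling compared in Table 11 can be sketched as a greedy assignment, assuming LJF stands for longest-job-first (known in the scheduling literature as longest-processing-time-first): jobs are sorted by predicted duration in descending order and each is placed on the currently least-loaded cluster. The function name is hypothetical, and the job durations are borrowed from the Ada column of Table 7 purely as sample input, not as the jobs the authors actually scheduled.

```python
import heapq


def ljf_schedule(job_hours, num_clusters):
    """Greedy longest-job-first assignment.

    job_hours: dict mapping job name to predicted duration in hours.
    Returns (makespan, assignment), where assignment maps a cluster
    index to the list of jobs placed on it.
    """
    # Min-heap of (accumulated load, cluster index).
    heap = [(0.0, c) for c in range(num_clusters)]
    heapq.heapify(heap)
    assignment = {c: [] for c in range(num_clusters)}

    longest_first = sorted(job_hours.items(), key=lambda kv: kv[1], reverse=True)
    for name, hours in longest_first:
        load, cluster = heapq.heappop(heap)   # least-loaded cluster so far
        assignment[cluster].append(name)
        heapq.heappush(heap, (load + hours, cluster))

    makespan = max(load for load, _ in heap)
    return makespan, assignment


# Sample durations (hours), used here only for illustration.
jobs = {"povray": 5.39, "gamess.cy": 3.46, "gcc.scil": 2.48,
        "mcf": 2.32, "h264ref.s": 2.23, "sjeng": 2.14}
makespan, plan = ljf_schedule(jobs, num_clusters=3)
print(f"makespan: {makespan:.2f} h")
for cluster, names in plan.items():
    print(cluster, names)
```

Because the rule is a greedy heuristic, the resulting schedule is not guaranteed to be optimal, but placing long jobs first keeps them from landing at the end of an already busy cluster.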

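Table 12 Maximum Completion Time of LJF Scheduling Guided by Simulator IPC and Real IPC of RTL (h)

Ada emulation	Simulator-predicted IPC	Real IPC
8 cores × 4 clusters	13.77	13.67
8 cores × 8 clusters	6.91	6.92
8 cores × 16 clusters	5.38	5.38
Note: The number set in bold in the original table marks that, with 8 clusters, the simulator-predicted IPC yields a shorter completion time. Because LJF is a greedy algorithm, errors in the predicted completion time can occasionally lead to a better schedule.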

