Microarchitecture

A4: Microarchitecture-Aware LLC Management for Datacenter Servers with Emerging I/O Devices

This work uncovers two previously unknown sources of Last-Level Cache (LLC) contention in Intel Xeon CPUs caused by high-bandwidth I/O devices and proposes A4, a runtime LLC management framework that mitigates these issues. A4 improves performance for latency-sensitive workloads by 51% without significantly affecting low-priority workloads.

Haneul Park, Jiaqi Lou, Sangjin Lee, Yifan Yuan, KyoungSoo Park, Yongseok Son, Ipoom Jeong, Nam Sung Kim

Intel® In-Memory Analytics Accelerator: Performance Characterization and Guidelines

The rapid advancements in CPU performance have slowed due to the end of Dennard scaling and the exponential growth of data, making it …

Jaeyoung Kang, Qirong Xia, Ipoom Jeong, Yongjoo Park, Nam Sung Kim

Warped-Compaction: Maximizing GPU Register File Bandwidth Utilization via Operand Compaction

The GPU has been successfully used for diverse emerging compute-intensive applications, including imaging, computer vision, and more …

Eunbi Jeong, Ipoom Jeong, Myung Kuk Yoon, Nam Sung Kim

Marching Page Walks: Batching and Concurrent Page Table Walks for Enhancing GPU Throughput

Virtual memory, with the support of address translation hardware, is a key technique in expanding programmability and memory management …

Jiwon Lee, Gun Ko, Myung Kuk Yoon, Ipoom Jeong, Yunho Oh, Won Woo Ro

Triple-A: Early Operand Collector Allocation for Maximizing GPU Register Bank Utilization

Recent GPUs provisioned with large register files cannot fully utilize the bandwidth between the register files and execution …

Ipoom Jeong, Eunbi Jeong, Nam Sung Kim, Myung Kuk Yoon

A Quantitative Analysis and Guidelines of Data Streaming Accelerator in Modern Intel Xeon Scalable Processors

In this work, we set out to introduce the latest features supported by Intel DSA (Data Streaming Accelerator), deep-dive into its versatility, and analyze its throughput benefits through a comprehensive evaluation.

Reese Kuper, Ipoom Jeong, Yifan Yuan, Ren Wang, Narayan Ranganathan, Nikhil Rao, Jiayu Hu, Sanjay Kumar, Philip Lantz, Nam Sung Kim

INTERPRET: Inter-Warp Register Reuse for GPU Tensor Core

Jae Seok Kwak, Myung Kuk Yoon, Ipoom Jeong, Seunghyun Jin, Won Woo Ro

A Quantitative Analysis and Guideline of Data Streaming Accelerator in Intel 4th Gen Xeon Scalable Processors

In this work, we set out to introduce the latest features supported by Intel DSA (Data Streaming Accelerator), deep-dive into its versatility, and analyze its throughput benefits through a comprehensive evaluation.

Reese Kuper, Ipoom Jeong, Yifan Yuan, Jiayu Hu, Ren Wang, Narayan Ranganathan, Nam Sung Kim

CASH-RF: A Compiler-Assisted Hierarchical Register File in GPUs

Spin-transfer torque magnetic random-access memory (STT-MRAM) is an emerging nonvolatile memory technology that has received …

Yunho Oh, Ipoom Jeong, Won Woo Ro, Myung Kuk Yoon

Reconstructing Out-of-Order Issue Queue

In this work, we propose an energy-efficient microarchitecture named Ballerino, carrying out BALanced and cache-miss toLERable dynamic scheduling via cascaded and clustered IN-Order IQs. The proposed microarchitecture is built upon three key principles that drive dynamic scheduling: instruction readiness, memory/register dependences, and oldest-first selection.

Ipoom Jeong, Jiwon Lee, Myung Kuk Yoon, Won Woo Ro