Academic Paper

Acceleration of arithmetic processing with CAM-based massive-parallel SIMD matrix core
Document Type
Conference
Source
2020 IEEE 63rd International Midwest Symposium on Circuits and Systems (MWSCAS), pp. 486-489, Aug. 2020
Subject
Aerospace
Bioengineering
Communication, Networking and Broadcast Technologies
Components, Circuits, Devices and Systems
Computing and Processing
Fields, Waves and Electromagnetics
Photonics and Electrooptics
Power, Energy and Industry Applications
Robotics and Control Systems
Signal Processing and Analysis
Clocks
Image coding
Mobile handsets
Computer architecture
Streaming media
Pipeline processing
Performance evaluation
Language
ISSN
1558-3899
Abstract
Recently, several multimedia applications, such as digital image compression, digital video compression, and digital audio processing, have come to be executed on mobile devices. The processing core in a mobile device therefore requires both high performance and programmability. Multimedia applications generally consist of repeated arithmetic operations and table-lookup coding operations. To improve the processing speed of both kinds of operation on a single processing core, a Content Addressable Memory-based massive-parallel SIMD matrix core (CAMX) is proposed. CAMX serves as an accelerator for a mobile CPU core. It has highly parallel processing capability and is built from two CAM modules, which are used for fast table-lookup operations. This paper shows that CAMX can process repeated arithmetic operations and table-lookup coding operations in parallel. Assuming a 1.4 GHz operating frequency, the AND, OR, XOR, and ADD instructions process 1,024 entries of 128-bit data in parallel at 0.34 GOPS (giga operations per second); the search operation searches 1,024 entries of 128-bit data in parallel at 0.35 GOPS; and multiplication processes 1,024 entries of 4-bit data in parallel at 0.34 GOPS. For multiplication, processing based on table lookup requires about 15% fewer total clock cycles than bit-serial processing.
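The contrast the abstract draws between bit-serial and table-lookup multiplication can be illustrated with a small software sketch. This is not the CAMX design: the abstract does not describe the internal algorithm, so the search-then-write scheme below is an assumption borrowed from generic associative processing, and the names `cam_search`, `multiply_bit_serial`, `multiply_table_lookup`, and `make_operands` are hypothetical. Only the entry count (1,024) and the operand width (4 bits) come from the abstract.

```python
import random

NUM_ENTRIES = 1024  # entry count stated in the abstract
WIDTH = 4           # multiplication operand width stated in the abstract


def cam_search(entries, key):
    """Return one match flag per entry, mimicking a CAM search over all entries."""
    return [entry == key for entry in entries]


def multiply_bit_serial(a_vals, b_vals, width=WIDTH):
    """Shift-and-add: one pass per multiplier bit, applied to every entry."""
    products = [0] * len(a_vals)
    for bit in range(width):
        for i, (a, b) in enumerate(zip(a_vals, b_vals)):
            if (b >> bit) & 1:
                products[i] += a << bit
    return products


def multiply_table_lookup(a_vals, b_vals, width=WIDTH):
    """Assumed scheme: for each (a, b) pattern, search all entries for that
    pattern and write the precomputed product to every matching entry."""
    packed = [(a << width) | b for a, b in zip(a_vals, b_vals)]
    products = [0] * len(packed)
    for a in range(1 << width):
        for b in range(1 << width):
            flags = cam_search(packed, (a << width) | b)
            for i, hit in enumerate(flags):
                if hit:
                    products[i] = a * b
    return products


def make_operands(seed=0):
    """Generate two hypothetical operand columns of 4-bit values."""
    rng = random.Random(seed)
    a_vals = [rng.randrange(1 << WIDTH) for _ in range(NUM_ENTRIES)]
    b_vals = [rng.randrange(1 << WIDTH) for _ in range(NUM_ENTRIES)]
    return a_vals, b_vals


if __name__ == "__main__":
    a_vals, b_vals = make_operands()
    same = multiply_bit_serial(a_vals, b_vals) == multiply_table_lookup(a_vals, b_vals)
    print("results agree:", same)
```

The sketch only checks that the two approaches produce the same products. Its Python loop counts do not correspond to hardware clock cycles, so it does not reproduce the 15% cycle reduction reported in the paper.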