Smart Memories: A Modular Reconfigurable Architecture

While designer productivity has improved over time, and technologies like system-on-a-chip help to manage complexity, each generation of complex machines is more expensive to design than the previous one.

High non-recurring fabrication costs (e.g., mask generation) add to this expense. Thus, these large complex chips are only cost-effective if they can be sold in large volumes. This need for a large market runs counter to the drive for efficient, narrowly focused, custom hardware solutions.

At the highest level, a Smart Memories chip is a modular computer. It contains an array of processor tiles and on-die DRAM memories connected by a packet-based, dynamically routed network (Figure 1). The network also connects to high-speed links on the pins of the chip to allow for the construction of multi-chip systems. Most of the initial hardware design work in the Smart Memories project has been on the processor tile design and evaluation, so this paper focuses on these aspects.

The organization of a processor tile is a compromise between VLSI wire constraints and computational efficiency. Our initial goal was to make each processor tile small enough that the delay of a repeated wire around the semi-perimeter of the tile would be less than a clock cycle.

This leads to a tile edge of around 2.5 mm. A 400 mm2 die would then hold about 64 processor tiles, or a smaller number of processor tiles along with some DRAM tiles.
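To make the wire-delay constraint concrete, here is a back-of-the-envelope version of the sizing argument in Python. It is only a sketch: the wire-delay and clock-period constants are illustrative assumptions chosen to reproduce a roughly 2.5 mm tile edge, not figures from the original analysis.

```python
# Tile sizing sketch: a repeated wire must traverse the tile's
# semi-perimeter (two edges of a square tile) in under one clock cycle.
WIRE_DELAY_PS_PER_MM = 200   # assumed delay of an optimally repeated wire
CLOCK_PERIOD_PS = 1000       # assumed 1 ns clock target

max_semi_perimeter_mm = CLOCK_PERIOD_PS / WIRE_DELAY_PS_PER_MM
max_edge_mm = max_semi_perimeter_mm / 2       # semi-perimeter = 2 * edge

DIE_AREA_MM2 = 400                            # die size from the estimate above
tiles_per_die = DIE_AREA_MM2 // (max_edge_mm ** 2)

print(f"tile edge <= {max_edge_mm:.2f} mm, ~{tiles_per_die:.0f} tiles per die")
```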

Processor tiles are grouped into quads of four, which share a local quad network. Grouping the tiles into quads also makes the global interconnection network more efficient by reducing the number of global network interfaces, and thus the number of hops between processors. Our goal in the tile design is to create a set of components that will span as wide an application set as possible.

In current architectures, computational elements are fairly standardized; today, most processors have multiple segmented functional units to increase efficiency when working on limited-precision numbers. Since much work has already been done on optimizing the mix of functional units for a wide application class, creating the flexibility needed to efficiently support different computational models chiefly requires a flexible memory system, flexible interconnect between the processing node and the memory, and flexible instruction decode.

Communication volume and resource-usage tracking for such operations can easily be delegated to the secondary thread. The two threads are assumed to be independent, and any communication between them must be explicitly synchronized.

4. Mapping Streaming and Speculative Architectures

One of the goals of the Smart Memories architecture is to efficiently support applications with very different computational models. In the early stages of the project it was clear that the memory system was general enough to allow changing the sizes and characteristics of the caches in the system, as well as to implement other memory structures; however, this is really only part of what a mapping requires. To provide some concrete benchmarks, we configured a Smart Memories machine to mimic two existing machines, the Hydra multiprocessor [33] and the Imagine streaming processor [25]. After looking at Imagine, we explore the performance of Hydra, a single-chip four-way multiprocessor. Hydra is very different from Imagine, because the applications it supports have irregular accesses and communication patterns; to improve the performance of these applications, the machine supports speculative thread execution.

Imagine is a co-processor optimized for high performance on applications that can be effectively encapsulated in a stream programming model. This model expresses an application as a sequence of kernels that operate on long vectors of records, referred to as streams. Streams are typically accessed in predictable patterns and are tolerant of fetch latency. However, streaming applications demand high bandwidth to stream data and are compute-intensive.
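As a minimal sketch of this programming model (a hypothetical Python analogue, not Imagine's actual toolchain), an application is a pipeline of kernels applied in order to long streams of records:

```python
# Each kernel processes one record at a time; records flow through in a
# predictable, sequential order, which is what makes stream accesses easy
# to schedule and tolerant of fetch latency.
def scale(record):
    return record * 2.0

def clamp(record):
    return min(record, 100.0)

def run_stream_program(stream, kernels):
    for kernel in kernels:
        stream = [kernel(r) for r in stream]   # one long vector of records
    return stream

print(run_stream_program([1.0, 60.0, 3.5], [scale, clamp]))   # [2.0, 100.0, 7.0]
```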

Imagine provides a storage bandwidth hierarchy and a large number of arithmetic units to meet these requirements. The hierarchy consists of off-chip DRAM, an on-chip stream register file (SRF), and local register files (LRFs) in the datapath. The SRF and LRFs provide increasing bandwidth and allow temporary storage, resulting in reduced bandwidth demands on the levels further away in the hierarchy. The SRF is a multi-banked SRAM accessed via a single wide port; streams are stored in the SRF in the order they will be accessed, yielding high bandwidth via the single port, and the records of a stream are interleaved across the banks. The LRF level consists of many small register files directly feeding the arithmetic units.

The high stream bandwidth achieved through the storage hierarchy enables parallel computation on a large number of arithmetic units. In Imagine, these units are arranged into eight clusters, each associated with a bank of the SRF. The eight clusters exploit data parallelism to perform the same computation on different records of a stream; within each cluster, ILP is exploited to perform parallel computations on the different units. All the clusters execute a single microcode instruction stream in lock-step, resulting in a single-instruction multiple-data (SIMD) system.

In the Smart Memories mapping, most of the mats are allocated to the SRF and are configured in streaming mode. Data structures that cannot be streamed, such as lookup tables, are allocated in mats configured as scratchpad memories, and instructions are stored in mats that serve as the microcode store. The homogeneity of the Smart Memories memory structure allows this allocation to be chosen on a per-application basis.
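A small sketch of this per-application split (the mat count and the particular allocation below are illustrative assumptions, not the actual numbers from the mapping):

```python
# Divide a tile's identical memory mats among the structures a streaming
# application needs; since mats are homogeneous, the split is purely a
# configuration choice made when the application is mapped.
MATS_PER_TILE = 16   # assumed number of mats on a tile

def allocate_mats(srf, scratchpad, microcode):
    assert srf + scratchpad + microcode == MATS_PER_TILE, "must use all mats"
    return {"srf": srf, "scratchpad": scratchpad, "microcode": microcode}

print(allocate_mats(srf=12, scratchpad=2, microcode=2))
```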

The SRF is physically distributed over the four tiles of a quad. The records of a stream are interleaved across the mats allocated to the SRF. Multiple streams may be placed on non-overlapping address ranges of the same mat, at the cost of reduced bandwidth to each stream; this placement allows accesses to a mat to be sequential and accesses to different streams to proceed in parallel.

The arrangement of these resources in Imagine is shown in Figure 8. The LRFs are embedded in the compute clusters and are not shown explicitly.
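The placement rule can be sketched as follows (the mat capacity and the small allocation API are assumptions for illustration):

```python
# Streams share one mat by occupying disjoint address ranges. Accesses
# within a range stay sequential; streams on different mats proceed in
# parallel, while streams on the same mat share its single port.
MAT_WORDS = 2048   # assumed mat capacity in words

class Mat:
    def __init__(self):
        self.data = [0] * MAT_WORDS
        self.next_free = 0
        self.ranges = {}   # stream name -> (base, length)

    def place_stream(self, name, length):
        assert self.next_free + length <= MAT_WORDS, "mat is full"
        self.ranges[name] = (self.next_free, length)
        self.next_free += length

    def write(self, name, offset, value):
        base, length = self.ranges[name]
        assert 0 <= offset < length, "access outside the stream's range"
        self.data[base + offset] = value

mat = Mat()
mat.place_stream("a", 512)
mat.place_stream("b", 512)   # non-overlapping with stream "a"
mat.write("b", 3, 42)
```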

(Table: comparison of peak bandwidth, in words per cycle.)

Figure 8. Imagine architecture

Microcode instructions are used to issue operations to all FP units in parallel. The integer units of the Smart Memories tiles are used primarily to perform support functions such as scratchpad accesses, inter-tile communication, and control flow operations. The eight-cluster Imagine, with its compute resources, is mapped onto a four-tile Smart Memories quad.

Like Imagine, the mapped implementation is intended to operate as a co-processor. In the following sections, we describe the mapping of Imagine onto Smart Memories, the differences between the mapping and Imagine, and the impact on performance.

Much of the data bandwidth required in stream computations is to local tile memory. However, data dependencies across loop iterations require communication between clusters. In the mapping, these communications take place over the quad network. Since we emulate two 32-bit Imagine clusters on a tile, the quad network also carries communication between the clusters of a single tile; this contrasts with the full crossbar that Imagine provides for the same purpose, leading to a relative slowdown for the mapping.

The broadcast control bits in the Smart Memories quad network distribute status information indicating the participation of each cluster in a conditional stream operation. These bits combine with state information from previous communications to form the index into the lookup-table.
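One way such an index could be formed is sketched below (the bit widths and packing order are assumptions; the actual lookup-table contents depend on the communication being performed):

```python
# Combine broadcast participation bits (assumed one per cluster) with
# state saved from previous communications into one lookup-table index.
def lookup_index(participation_bits, prev_state):
    index = prev_state
    for bit in participation_bits:
        index = (index << 1) | bit
    return index

# Eight clusters' bits plus two assumed state bits give a 10-bit index.
idx = lookup_index([1, 0, 1, 1, 0, 0, 1, 0], prev_state=0b01)
print(bin(idx))
```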

Longer latencies

The routed, general interconnects used for data transfers outside of the compute clusters in the Smart Memories architecture typically have longer latencies than the dedicated communication resources of Imagine. Gather and scatter of stream data between the SRF and off-quad DRAM, fetch of microcode into the local store, communication with the host processor, and communication with other quads are performed over the global network. The first or final stage of these transfers also utilizes the quad network, but receives a lower priority than intra-quad communications.

While most kernels are tolerant of stream access latencies, some that perform scratchpad accesses or inter-cluster communications are sensitive to the latency of these operations. However, heavy communication does not necessarily lead to significant slowdowns if the latency can be masked through proper scheduling.

The simulations accounted for all differences between Imagine and the mapping, including the hardware resource differences, the overheads incurred in software emulation of certain hardware functions of Imagine, and the serialization penalties incurred in emulating two Imagine clusters on a tile.

Latencies of 32-bit arithmetic operations were assumed to be the same for both architectures, since their cycle times are comparable in gate delays in their respective target technologies.

According to simulation results, the bandwidth hierarchy of the mapping compares well with that of the original Imagine and provides the necessary bandwidth. However, constraints other than SRF bandwidth lead to performance losses. Figure 9 shows the percentage performance degradation for the four kernels on the mapping relative to Imagine; these losses arise due to the constraints discussed below.

(Figure 9: percentage performance degradation relative to Imagine, broken down by degradation factor, including longer latencies and bandwidth constraints.)

The Smart Memories FP cluster consists of two fewer units (an adder and a multiplier) than an Imagine cluster, which leads to a significant slowdown for some compute-bound kernels (e.g., convolve). Simulations show that simply adding a second multiplier, with no increase in memory or communication bandwidth, substantially reduces the performance degradation relative to Imagine for convolve. We are currently exploring ways to increase the compute power of the Smart Memories tile without significantly increasing the area devoted to arithmetic units.

Overall, these results demonstrate that the configurable substrate of Smart Memories, particularly the memory system, can sustain performance within a small factor of what a specialized architecture achieves.

The Hydra speculative multiprocessor enables code from a sequential program to be run in parallel on speculatively executed threads.

A pre-processing script finds and marks loops in the original code. At run-time, different loop iterations from the marked loops are then speculatively distributed across all processors.
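A simplified sketch of this run-time distribution (round-robin assignment with in-order commit; the marking script and the speculation hardware are reduced to stand-ins here):

```python
# Iterations of a marked loop are handed out cyclically to the four
# processors; each executes speculatively, and results are committed in
# original iteration order to preserve sequential semantics.
NUM_CPUS = 4

def distribute(num_marked_iterations):
    assignment = {cpu: [] for cpu in range(NUM_CPUS)}
    for i in range(num_marked_iterations):
        assignment[i % NUM_CPUS].append(i)
    return assignment

print(distribute(10))   # {0: [0, 4, 8], 1: [1, 5, 9], 2: [2, 6], 3: [3, 7]}
```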

As shown in Figure 10, the Hydra multiprocessor consists of four RISC processors, a shared on-die L2, and speculative buffers, which are interconnected by a 256-bit read bus and a 64-bit write-through bus. The read bus handles L2 accesses and fills from the external memory interface, while the write-through bus is used to implement a simple cache-coherence scheme. The speculative buffers store writes made by a processor during speculative operation to prevent potentially invalid data from corrupting the L2; when a processor commits state, this modified data is written to the L2.

(Figure 10: Hydra architecture, showing the CPUs, the write-through and read buses, the memory controller, and the cache speculation support.)

In the Smart Memories implementation of Hydra, each Hydra processor and its associated L1 caches reside on a tile, while the L2 cache and speculative write buffers are distributed among the four tiles that form a quad. Figure 11 shows the memory mat allocation of a single tile. The dual-ported mats are used to support three types of memory structures: efficient set-associative tags, tags that support snooping, and arbitration-simplifying mats.

The L2 is split by address, so a portion of each way is on each tile.

Rather than dedicating two mats, one for each way, to the L2 tags, a single dual-ported mat is used; placing both ways on the same tile reduces the communication overhead. Single-ported memories may be efficiently used as tag mats for large caches, but they implement tags for small caches inefficiently. For example, the L1 data tags are not completely utilized, because the tags only fill 2 KB. The L1 data tags are dual-ported to facilitate snooping on the write bus under the write-through coherence protocol.

Finally, dual-ported mats are used to simplify arbitration between two requestors. The CAM (not shown) stores indices that point into the speculation buffer mat, which holds data created by a speculative thread. A line may be written by one processor and, at the same time, read by a more speculative processor on an L1 miss; in this case, the dual-ported mat avoids complex buffering and arbitration schemes by allowing both requestors to access the mat simultaneously.

When a speculative processor receives a less-speculative write to a memory address that it has read (a RAW hazard), a handler invalidates the speculatively modified state and notifies all speculative processors that they must update their speculative rank.
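The hazard check itself can be sketched as follows (the read sets, write buffers, and ranks below are simplified stand-ins for Hydra's actual hardware structures):

```python
# A RAW violation: a less-speculative write hits an address that a
# more-speculative thread already read, so that thread used stale data.
class SpecThread:
    def __init__(self, rank):
        self.rank = rank           # lower rank = less speculative
        self.read_set = set()      # addresses read while speculative
        self.write_buffer = {}     # speculative writes, kept out of the L2
        self.violated = False

def speculative_write(threads, writer, addr, value):
    writer.write_buffer[addr] = value
    for t in threads:
        if t.rank > writer.rank and addr in t.read_set:
            t.violated = True      # a handler would invalidate and restart it

threads = [SpecThread(rank) for rank in range(4)]
threads[2].read_set.add(0x100)
speculative_write(threads, threads[0], 0x100, 7)
assert threads[2].violated and not threads[1].violated
```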

Compared to Hydra, the Smart Memories configuration uses caches of lower set-associativity, which affects the L1 access time; and since the L2 is distributed, the L2 merge time is increased.

(Table: memory configuration comparison.)

Similar to the approach taken with Imagine, we conducted cycle-level simulations by adapting the Hydra simulation environment [35] to reflect the Smart Memories tile and quad architecture.

The 2-cycle load delay slot is conservatively modeled in our simulations by inserting nops without rescheduling the code. The increased L2 access time has a greater impact on performance than the L1 access time and causes performance degradations; the increase in the L2 access time is due to the additional nearest-neighbor access on the quad interconnect.

Algorithmic modifications were also necessary, since certain Hydra-specific hardware structures were not available. This section presents two examples and their performance impact.

Conditional gang-invalidation

On a restart, Hydra removes speculatively modified cache lines in parallel through a conditional gang-invalidation: a line is invalidated if the appropriate control bit of the line is set. This mechanism keeps unmodified lines in the cache. Although the conditional gang-invalidation mechanism is found in other speculative architectures, such as the Speculative Versioning Cache [37], it is not commonly used in other architectures, and it introduces additional transistors into the SRAM memory cell. Therefore, in the Smart Memories mapping, algorithmic modifications are made so that the control bits in the L1 tag are not conditionally gang-invalidated.

Figure 12 shows the performance degradations caused by the choice of memory configurations, algorithms, and memory access latency. The memory access latency and the algorithmic changes contribute the greatest amount of performance degradation, whereas the configuration changes are relatively insignificant. Since the Hydra processors pass data through the L2, the increased L2 access time is particularly costly.

(Figure 12: percentage performance degradation for compress, grep, m88ksim, wc, ijpeg, mpeg, alvin, and simplex.)

L2 Merge

In Hydra, the L2 and speculative buffers are centrally located; on an L1 miss, a hardware priority encoder returns a merged line. In Smart Memories, however, the L2 and speculative buffers are distributed.

If a full merge of all less-speculative buffers and the L2 were performed on every miss, a large amount of data would be unnecessarily broadcast across the quad network.
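The merge itself can be sketched as follows (the buffer layout is an illustrative assumption): for each word of the line, the newest value among the buffers less speculative than the requester wins, falling back to the L2 copy.

```python
# Merge an 8-word line for a requester of a given speculative rank.
LINE_WORDS = 8

def merge_line(l2_line, spec_buffers, requester_rank):
    # spec_buffers: list of (rank, {word_index: value}) speculative writes.
    merged = list(l2_line)
    for word in range(LINE_WORDS):
        # Scan most-speculative-first among threads older than the
        # requester; the first buffer holding this word is the newest.
        for rank, writes in sorted(spec_buffers, key=lambda b: -b[0]):
            if rank < requester_rank and word in writes:
                merged[word] = writes[word]
                break
    return merged

line = merge_line([0] * LINE_WORDS, [(0, {1: 11}), (1, {1: 22, 3: 33})], 2)
assert line[1] == 22 and line[3] == 33   # rank 1 overrides rank 0 for word 1
```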

Figure 13 shows the speedup of Hydra and of the Smart Memories mapping, calculated by dividing the execution time of one of the processors in Hydra by the respective execution times of the Hydra and Smart Memories architectures. The scalar benchmarks, m88ksim and wc, have the largest performance degradations and may actually slow down under the Smart Memories configuration. Since Hydra does not achieve significant speedup on these benchmarks, they should not be run on this configuration of Smart Memories; for example, we would achieve higher performance on the wc benchmark if we devoted more tile memory to a larger L1 cache.

(Figure 13: speedup for grep, wc, ijpeg, alvin, mpeg, compress, simplex, and m88ksim.)
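Stated as code, the speedup metric used in Figure 13 is simply the following (the cycle counts are made-up placeholders):

```python
# Speedup relative to one Hydra processor: T(single CPU) / T(machine).
def speedup(single_cpu_cycles, machine_cycles):
    return single_cpu_cycles / machine_cycles

t_single = 1_000_000                  # hypothetical uniprocessor cycle count
print(speedup(t_single, 350_000))     # e.g., Hydra on a parallelizable code
print(speedup(t_single, 420_000))     # e.g., the Smart Memories mapping
```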

5. Conclusion

Continued technology scaling causes a dilemma: while computation gets cheaper, the design of computing devices becomes more expensive, so new computing devices must have large markets to be successful. Smart Memories addresses this issue by extending the notion of a program. In conventional computing systems the memories and the interconnect between the processors and memories are fixed, and what the programmer modifies is the code that runs on the processor. While this model is completely general, for many applications it is not very efficient. In Smart Memories, the user can program the wires and the memory as well as the processors. This allows the user to configure the computing substrate to better match the structure of the application, which greatly increases the efficiency of the resulting solution.

Our initial tile architecture shows the potential of this approach. Using the same resources normally found in a superscalar processor, we were able to arrange those resources into two very different types of compute engines. In one machine organization, the tile provides very high bandwidth and high computational throughput; in the other, the programmability of the memory was used to create the specialized memory structures needed for speculative execution. The overheads of the coarse-grain configuration that Smart Memories uses, although modest, are not negligible; and as the mapping studies show, building a machine optimized for a specific application will always be faster than configuring a general machine for that task. Yet the results are promising, since the overheads and the resulting difference in performance are not large. So if an application or set of applications needs more than one computing or memory model, our reconfigurable architecture can exceed the efficiency and performance of existing solutions. Our next step is to create a more complete simulation environment to look at the overall performance of complete applications.

Acknowledgments

We would like to thank Scott Rixner, Peter Mattson, and the other members of the Imagine team for their help in preparing the Imagine mapping. We would also like to thank Lance Hammond and the rest of the Hydra team for their help with the Hydra mapping. Finally, we would like to thank Vicky Wong and Andrew Chang for their insightful comments.

References

[1] K. Diefendorff and P. Dubey. How Multimedia Workloads Will Change Processor Design. IEEE Computer, 30(9):43-45, Sept. 1997.
[2] G. Slavenburg, et al. The Trimedia TM-1 PCI VLIW Media Processor. In Proceedings of Hot Chips 8, Aug. 1996.
[3] L. Hammond, et al. The Stanford Hydra CMP. In Proceedings of Hot Chips 11, Aug. 1999.
[5] C. Kozyrakis, et al. Scalable Processors in the Billion-Transistor Era: IRAM. IEEE Computer, 30(9):75-78, Sept. 1997.
[7] M. Horowitz, et al. The Future of Wires. SRC White Paper: Interconnect Technology Beyond the Roadmap, 1999.
[8] D. Matzke. Will Physical Scalability Sabotage Performance Gains? IEEE Computer, 30(9):37-39, Sept. 1997.
[10] E. Waingold, et al. Baring It All to Software: Raw Machines. IEEE Computer, 30(9):86-93, Sept. 1997.
[11] S. Rixner, et al. A Bandwidth-Efficient Architecture for Media Processing. In Proceedings of the 31st International Symposium on Microarchitecture, Nov. 1998.
[12] S. Hauck, et al. The Chimaera Reconfigurable Functional Unit. In Proceedings of the IEEE Symposium on FPGAs for Custom Computing Machines, Apr. 1997.
[13] J. Hauser and J. Wawrzynek. Garp: A MIPS Processor with a Reconfigurable Coprocessor. In Proceedings of the IEEE Symposium on FPGAs for Custom Computing Machines, Apr. 1997.
[14] H. Zhang, et al. A 1V Heterogeneous Reconfigurable Processor IC for Baseband Wireless Applications. In Proceedings of the IEEE International Solid-State Circuits Conference, Feb. 2000.


