## MICROPROCESSOR www.MPRonline.com THE INSIDER'S GUIDE TO MICROPROCESSOR HARDWARE

## FUJITSU MAKES SPARC SEE DOUBLE

SPARC64 VI Uses Process Shrink to Double Cores

By Kevin Krewell {11/24/03-01}

Following up on the SPARC64 V presentation at Microprocessor Forum 2002, Takumi Maruyama, manager of the E Processor Development at Fujitsu, presented the SPARC64 VI at Microprocessor Forum 2003. Fujitsu will use its 90nm process to fit two cores and 6MB of L2 cache on one die. The dual superscalar out-of-order cores are expected to clock at significantly greater than 2.4GHz.

Although Fujitsu calls the processor a multithreaded processor, it can also be viewed as a chip-level multiprocessor. Whereas Intel uses multithreading in the Pentium 4 to hide memory-access latency and make better use of the core resources, server processors are using chip-level multicore designs to scale system performance.

To scale server system performance, shared memory data and memory latency are still important problems to be solved. Fujitsu's solution reflects the prevailing thought in the industry: use a multithreaded or multicore processor. Fujitsu chose the multicore approach and combined it with a new system bus based on a mainframe server design.

The SPARC64 VI will be fabricated in Fujitsu's 90nm process, with 10 layers of copper interconnect, a 40nm gate width, and low-k dielectric. The die dimensions are 19.1mm × 20.3mm (388mm²). The dual core and large cache will push the transistor count to 690 million. The die has 360 I/O pins and a core voltage of 1.0V, with the I/O voltage at 1.8V. Fujitsu already has a test version of the chip running at 2GHz. The test chip uses the SPARC64 V bus. The SPARC64 VI is quite a bit larger than the SPARC64 V shipping today, which has 191 million transistors and a die size of 18.1mm × 16.0mm (289.6mm²).
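As a quick check on the die figures quoted above, the following minimal sketch recomputes the die areas and derives an average transistor density for each chip. The dictionary layout and function names are illustrative assumptions; only the dimensions and transistor counts come from the article.

```python
# Die dimensions (mm) and transistor counts (millions) as quoted in the article.
# The structure and names below are illustrative, not Fujitsu's.
CHIPS = {
    "SPARC64 V": {"width_mm": 18.1, "height_mm": 16.0, "transistors_m": 191},
    "SPARC64 VI": {"width_mm": 19.1, "height_mm": 20.3, "transistors_m": 690},
}


def die_area_mm2(chip):
    """Die area in square millimetres from the quoted edge lengths."""
    return chip["width_mm"] * chip["height_mm"]


def density_m_per_mm2(chip):
    """Average transistor density in millions of transistors per mm^2."""
    return chip["transistors_m"] / die_area_mm2(chip)


for name, chip in CHIPS.items():
    print(f"{name}: {die_area_mm2(chip):.1f} mm^2, "
          f"{density_m_per_mm2(chip):.2f} M transistors/mm^2")
```

On these numbers, the SPARC64 VI die is about 1.34 times the area of the SPARC64 V die while holding about 3.6 times as many transistors, which is consistent with a process shrink combined with a much larger cache.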
(For more information on the SPARC64 V, see *MPR 10/21/02-01*, "Fujitsu's SPARC64 V Is Real Deal.") At about 50W at 1.35GHz, the SPARC64 V is a very power-friendly device that could be used in relatively dense designs, but Fujitsu didn't indicate the power for the SPARC64 VI.

Although the SPARC64 VI processor is unlikely to be available in systems until 2006, Fujitsu's roadmap includes clock-speed improvements to the existing SPARC64 V processor, including a 90nm shrink in early 2005. The SPARC64 V shipped earlier this year at 1.35GHz in a 130nm process, and the company expects to ship systems with a processor faster than 1.5GHz by early 2004. When Fujitsu shrinks the processor to its own 90nm process in early 2005, the company projects the processor will ship at more than 2GHz.

To feed the cores, a large shared L2 cache was needed to maintain a high hit rate and offer shorter latency. Compared with the SPARC64 V, which has a 2MB four-way L2 cache, the SPARC64 VI was enhanced with a 6MB 12-way cache with a 256-byte line size and enough bandwidth to support the two cores. The read bus from the cache to the cores is 32 bytes wide, and the write bus is 16 bytes wide. The larger cache line should make data prefetch more effective. Fujitsu also doubled the size of the queues to support the data and snoop traffic from the two cores. At the Forum, Fujitsu showed projections of how the larger cache and improved prefetch logic of the SPARC64 VI perform on large data arrays: performance degrades gracefully for data arrays up to 4MB, whereas the SPARC64 V dropped off dramatically at arrays larger than 1MB.

**Figure 1.** The MOWESI protocol state diagram. This protocol is optimized for multiprocessing designs where data is often shared between processors. The protocol adds another state to eliminate a cache-line copyback transaction.

In the SPARC64 V, the prefetch logic pushes speculative data into the L2 cache and can hold up to 16 L1 cache misses.
The SPARC64 VI improves on this by adding descending data patterns and, when confidence in the prefetch is higher, forwarding the data to the L1 data cache. After multiple successful prefetches, the logic will then prefetch another two to four lines ahead. The prefetch queue consists of 16 entries, and it tracks 16 independent series of prefetch requests. Each entry counts the number of prefetch requests that were successful, meaning that the prefetched line was later used by a demand request. When the count exceeds a certain value, the prefetch is pushed to the L1 cache, and the logic can select the more aggressive prefetch.

**Figure 2.** The top bus transaction example is from the SPARC64 V and requires CPU-X and CPU-Y to perform four transfers or snoops. The improved protocol eliminates two transactions and a state change in CPU-Y.

Each SPARC64 VI core is similar to its predecessor, with the same 11-stage pipeline. Each core can fetch up to eight instructions from the instruction cache into the instruction buffer, and the buffer can then issue up to four instructions to the reservation stations. The new cores have some further refinements. The branch-prediction miss rate is reduced by increasing the number of branches sent to the write-cycle global history table (WGHT) from one to two per cycle. Register window handling, when returning from a trap and transferring the general-purpose register (GPR) to the joint window register (JWR), is reduced from eight cycles to one. The cores also have a faster floating-point multiply and add (FMA) instruction. Fujitsu also added a 16-byte by five-entry write buffer between the store port and the L1 data cache in order to merge writes and reduce data-write traffic. The size of the translation lookaside buffer (TLB) was also doubled.
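The confidence-based prefetch queue described above can be illustrated with a small software model. This is only a sketch under stated assumptions, not Fujitsu's logic: the article gives the queue size (16 entries), the ascending and descending patterns, the L1 promotion, and the "two to four lines ahead" behavior, but the threshold value, the stream-detection rule, the baseline depth, and the replacement policy below are all hypothetical.

```python
from dataclasses import dataclass

LINE_BYTES = 256          # L2 line size quoted in the article
QUEUE_ENTRIES = 16        # prefetch queue entries quoted in the article
CONFIDENCE_THRESHOLD = 2  # assumption: the article says only "a certain value"
BASE_DEPTH = 1            # assumption: lines fetched ahead before confidence builds
AGGRESSIVE_DEPTH = 4      # upper end of the article's "two to four lines ahead"


@dataclass
class Stream:
    next_line: int  # line the stream expects the next demand access to touch
    direction: int  # +1 for ascending, -1 for descending patterns
    hits: int = 0   # prefetched lines later used by a demand request

    def confident(self):
        return self.hits > CONFIDENCE_THRESHOLD


class PrefetchQueue:
    """Toy model: tracks up to 16 independent sequential streams."""

    def __init__(self):
        self.streams = []
        self.last_unmatched_line = None

    def demand_access(self, addr):
        """Return the (line, target_cache) prefetches this access triggers."""
        line = addr // LINE_BYTES
        for s in self.streams:
            if s.next_line == line:
                s.hits += 1  # the earlier prefetch was useful
                depth = AGGRESSIVE_DEPTH if s.confident() else BASE_DEPTH
                target = "L1" if s.confident() else "L2"
                s.next_line = line + s.direction
                return [(line + s.direction * k, target)
                        for k in range(1, depth + 1)]
        # No stream matched. Assumption: two accesses to adjacent lines,
        # in either direction, start a new stream.
        prev, self.last_unmatched_line = self.last_unmatched_line, line
        if prev is not None and abs(line - prev) == 1:
            direction = line - prev
            if len(self.streams) == QUEUE_ENTRIES:
                self.streams.pop(0)  # assumption: replace the oldest entry
            self.streams.append(Stream(line + direction, direction))
            return [(line + direction, "L2")]
        return []


# Walk a descending pattern and watch the prefetches escalate.
queue = PrefetchQueue()
for line in (10, 9, 8, 7, 6):
    print(line, queue.demand_access(line * LINE_BYTES))
```

In this model, a stream starts out prefetching one line into the L2 cache; once its hit count passes the threshold, it switches to fetching four lines ahead and targets the L1 cache. For simplicity the sketch may re-request lines that were already prefetched, which real hardware would filter.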
## For SPARC64, Scalability Is Essential

Using the SPECjbb2000 benchmark, Fujitsu showed that the performance of the present SPARC64 V scaled linearly, by a factor of 10, from an 8-way system to the maximum 128-way configuration. The SPARC64 VI's dual-core design, however, requires improvements in system architecture to maintain scalability. To address this need, the new chip has an improved system bus and a larger L2 cache.

The 6MB shared L2 cache is large enough to maintain a high hit rate. The cache is 12-way set-associative and uses a 256-byte line size with separate valid bits for each of four 64-byte subentries. The bus supports 32 outstanding read requests and 16 outstanding write requests. To provide enough read bandwidth to satisfy both cores, the L2-cache read path, at 32 bytes wide, is twice the width of the L1-to-L2 write bus.

The new mainframe-class bus, which Fujitsu calls the Jupiter Bus, is based on Fujitsu's long experience with mainframe design. The SPARC64 V used a bus protocol known by the acronym MOESI, for modified, owned, exclusive, shared, and invalid. The MOWESI protocol (see Figure 1) adds the W state for a cache line that is modified by another CPU and unmodified by the owner. The bus compensates for the effects of multiple processors modifying the same cache lines. It is likely that a STORE operation to a cache line will be performed after a LOAD operation, and that this line is often modified by other CPUs. With the new unidirectional bus design, no bus arbitration is required to copy back data from CPU-Y. In the example shown in Figure 2, the W state eliminates the owned state in CPU-Y, which, because of the store operation, would otherwise require two transitions of the cache-line state in CPU-Y.

Fujitsu provided some projections of the performance of the new core on SPEC2000 benchmarks and an unspecified standard online transaction processing (OLTP) benchmark, shown in Figure 3.
The SPEC Web site lists the base scores for the existing 1.35GHz SPARC64 V as 776 SPECint\_2000 and 1,096 SPECfp\_2000. The graphs Fujitsu provided indicate expectations of core performance scores of about 1,630 SPECint\_2000 and 2,080 SPECfp\_2000. The performance improvements for these three benchmarks come from a combination of core improvements, core frequency increases, cache size, and bus improvements. The expected OLTP boost is about 40% from the new bus, while the larger caches and improved core with prefetch provide another 30%. The SPECint gains are about 70% tied to the faster clock frequency and about 20% to the larger cache and improved core. The gains in SPECfp performance are about 40% from clock frequency and about 40% from the better bus.

The SPARC64 VI is a logical successor to the SPARC64 V. Recent reports have suggested that Sun Microsystems also finds Fujitsu's designs interesting, and Sun could choose to use these SPARC64 designs in its own product line. Although the SPARC64 VI will ship after Sun's UltraSparc IV, the Fujitsu design offers higher clock speeds and could provide performance similar to that of Sun's UltraSparc V, which is scheduled to ship no sooner than 2006. With clock speeds very likely higher than 2.4GHz, the SPARC64 VI should offer good performance and scalability when it ships, but it will face tough competition in enterprise servers from IBM's Power5 and Intel's Montecito-based Itanium processor.

**Figure 3.** These are Fujitsu's projected improvements for the 2.4GHz+ SPARC64 VI core over the performance of the SPARC64 V at 1.35GHz. Total system performance will be higher, as each SPARC64 VI processor includes two cores. The pie charts show Fujitsu's projected breakdown of where the performance gains come from: core and cache size improvements, higher clock frequency, and improved system bus.
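To put the projected SPEC scores in perspective, the sketch below computes the per-core speedups implied by the quoted figures and apportions each gain using the shares the article reports. It assumes the percentages are fractions of the total score gain, which is one reading of the pie charts, and it assumes a 2.4GHz clock even though Fujitsu expects something higher. Shares the article does not state are simply left out. All names are illustrative.

```python
# Scores quoted in the article: SPEC-listed base results for the 1.35GHz
# SPARC64 V and approximate projections read from Fujitsu's graphs.
BASE = {"SPECint_2000": 776, "SPECfp_2000": 1096}
PROJECTED = {"SPECint_2000": 1630, "SPECfp_2000": 2080}

# Shares of the gain as reported in the article. Treating them as fractions
# of the score difference is an assumption, not something Fujitsu stated.
SHARES = {
    "SPECint_2000": {"clock": 0.70, "cache_and_core": 0.20},
    "SPECfp_2000": {"clock": 0.40, "bus": 0.40},
}

# Assumption: 2.4GHz is used as a floor for the SPARC64 VI clock.
CLOCK_RATIO = 2.4 / 1.35


def speedup(benchmark):
    """Projected per-core score divided by the SPARC64 V base score."""
    return PROJECTED[benchmark] / BASE[benchmark]


def gain_by_source(benchmark):
    """Split the score gain across the sources the article names."""
    gain = PROJECTED[benchmark] - BASE[benchmark]
    return {source: share * gain for source, share in SHARES[benchmark].items()}


print(f"clock ratio: {CLOCK_RATIO:.2f}x")
for bench in BASE:
    parts = ", ".join(f"{src}: {pts:.0f} points"
                      for src, pts in gain_by_source(bench).items())
    print(f"{bench}: {speedup(bench):.2f}x ({parts})")
```

The projected per-core speedups work out to about 2.10x for SPECint\_2000 and about 1.90x for SPECfp\_2000, against a clock ratio of about 1.78x at 2.4GHz.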