Intel Single-Chip Cloud Project

metricv · Posted on 2023-07-02


The Intel Single-Chip Cloud Computer, abbreviated as SCC, is a research project that aimed to build a cloud-like data center on a single chip. It never really took off: its last whitepaper was released in 2010 with a revision number of 0.7, which hints at its experimental status.

The result is a manycore processor with a fairly traditional 2D-mesh interconnect but some interesting takes on communication and cache coherence.

TL;DR: Main Design Features

  • Manycore: a lot of cores (48 of them) on a single chip.
  • Not cache-coherent: cores exchange messages through message passing, much like Ethernet, but on the chip.

Cores and Tiles

Top-Level Architecture (VRC not shown)

The tiles are laid out in a 6x4 mesh, and each tile contains 2 cores, for 48 cores in total. The cores are Pentium P54C, which is an odd choice: the P54C microarchitecture came out in 1994 and was discontinued around 2000, yet the SCC was developed in 2009.

Each core has its own 16KB L1 cache and a 256KB L2 cache. Both cores connect to a mesh interface unit (MIU), which in turn connects to the on-chip network. The MIU handles packing and unpacking data, and hosts a small buffer that queues incoming and outgoing packets.

When both cores want to transmit data, the MIU accepts data from the two cores in a round-robin fashion.
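To make the arbitration concrete, here is a minimal C sketch of a two-way round-robin arbiter of the kind the MIU implements. It is a simplified software model, not Intel's hardware design, and all names are made up.

```c
/* Hypothetical sketch of 2-way round-robin arbitration in the MIU.
 * Simplified model; the real MIU arbitrates hardware queues, not C structs. */
#include <stdbool.h>

typedef struct {
    bool core_has_data[2]; /* request lines from core 0 and core 1 */
    int  last_granted;     /* which core was served most recently (initialize to 0) */
} miu_arbiter_t;

/* Returns the core granted this cycle, or -1 if neither has data. */
int miu_arbitrate(miu_arbiter_t *arb) {
    for (int i = 1; i <= 2; i++) {
        int candidate = (arb->last_granted + i) % 2; /* try the other core first */
        if (arb->core_has_data[candidate]) {
            arb->last_granted = candidate;
            return candidate;
        }
    }
    return -1; /* no pending traffic */
}
```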

On-Chip Network

Tile indexing

Tiles are indexed with their X and Y coordinates in the mesh, shown above. Each tile has sub-indexes for indexing a core, its message buffer, or peripherals connected to it, as we will see later in Address Translation.

The mesh interface unit (MIU) in each tile has a 16KB message passing buffer. Together, the buffers of all 24 tiles provide 384KB of on-chip storage that every core can access directly, though it is more often used through a message-passing interface.

The mesh interface also contains a traffic generator for testing the performance of the mesh; it is not used in normal operation.

The NoC connects to four DRAM memory controllers and a PCIe system interface. Each memory controller can address up to 16GB of memory, and the PCIe interface connects to an FPGA on the development board that handles all I/O.

Power and Thermal Management

Voltage, Frequency Domains, and the VRC.

The chip has a single voltage regulator controller (VRC) attached to tile (x=0, y=0), and it can be accessed by all cores through the NoC. A program can change the voltage and frequency for all members of a voltage domain, which is not necessarily the domain the program itself is running in.

The tiles are grouped into voltage domains of four tiles (2x2 blocks), while each tile has its own frequency domain. There is one additional voltage domain covering the whole tile array, so the chip has 7 voltage domains and 24 frequency domains.
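For illustration, mapping a tile's coordinates to its domains under the grouping described above could look like the following C sketch; the 6x4 grid and 2x2 islands come from the text, but the domain numbering is my own.

```c
/* Sketch: mapping a tile (x in 0..5, y in 0..3) to its domains.
 * Assumes the 2x2 voltage-island grouping described in the text;
 * the numbering scheme here is illustrative, not Intel's. */
typedef struct { int voltage_domain; int frequency_domain; } domains_t;

domains_t tile_domains(int x, int y) {
    domains_t d;
    d.voltage_domain   = (y / 2) * 3 + (x / 2); /* six 2x2 islands: 0..5 */
    d.frequency_domain = y * 6 + x;             /* one per tile: 0..23 */
    return d;
}
```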

Each tile also has a digital temperature sensor whose readings are written into a configuration register and can drive dynamic voltage and frequency scaling.

Memory Hierarchy

Each core has a 32-bit address space covering 4GB; these are called core addresses. The system as a whole uses 46-bit system addresses, which can address every component in the system. Address translation happens in the mesh interface unit (MIU), which contains lookup tables (LUTs), one per core, that translate core addresses to system addresses.

A core address may be translated into one of three types of actions, each with its own queue in the MIU (see the sketch after this list):

  • Memory access: The request is sent to the router, then to a DRAM controller, and finally to external DRAM.
  • Message passing: The request is sent to a Message Passing Buffer (MPB), either local or remote.
  • Local configuration register access: The request is handled by the MIU itself.
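The following C sketch shows how the LUT output could steer an access into one of these three queues; the field names and sub-destination encoding are illustrative guesses, not values from the SCC specification.

```c
/* Sketch of how a LUT entry could steer a core address into one of the
 * three MIU queues. Names and the sub-destination encoding are
 * illustrative, not taken from the SCC specification. */
typedef enum { ACTION_DRAM, ACTION_MPB, ACTION_CONFIG_REG } miu_action_t;

enum { SUBDEST_CORE0, SUBDEST_CORE1, SUBDEST_CRB, SUBDEST_MPB }; /* made-up encoding */

typedef struct {
    int bypass;     /* 1 = local message passing buffer, skip the router */
    int dest_tile;  /* (y, x) tile ID on the mesh */
    int sub_dest;   /* which element of the destination tile */
    int prefix10;   /* upper bits of the 34-bit memory address */
} lut_fields_t;

miu_action_t classify(const lut_fields_t *e) {
    if (e->bypass || e->sub_dest == SUBDEST_MPB)
        return ACTION_MPB;        /* local (bypass) or remote message passing buffer */
    if (e->sub_dest == SUBDEST_CRB)
        return ACTION_CONFIG_REG; /* handled by the MIU's configuration registers */
    return ACTION_DRAM;           /* routed over the mesh to a memory controller */
}
```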

Memory Organization

Programmer's view of memory

The off-chip DRAM is divided into two portions: private memory for each core and memory shared among all cores. The split between private and shared memory is configured through the lookup tables (LUTs). The default setting gives each core as much private memory as possible and uses the remainder as shared memory.

This division also affects routing to the DRAM controllers: private memory accesses go to the specific memory controller assigned to the tile, while shared memory accesses can go through any of the controllers.
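A hedged sketch of what such a LUT configuration could look like: each entry covers one 16MB slice of the core's 4GB address space, and the split point and controller assignment below are illustrative rather than the actual SCC defaults.

```c
/* Sketch: populating a core's 256-entry LUT with a private/shared split.
 * One LUT entry covers 16MB of the 32-bit core address space (2^24 bytes).
 * Controller assignment and split point are illustrative, not the SCC default. */
#define LUT_ENTRIES 256
#define NUM_MEM_CONTROLLERS 4

typedef struct { int is_shared; int mem_controller; } lut_slot_t;

void setup_lut(lut_slot_t lut[LUT_ENTRIES], int my_controller, int private_slots) {
    for (int slot = 0; slot < LUT_ENTRIES; slot++) {
        if (slot < private_slots) {
            /* private memory: always routed to this core's assigned controller */
            lut[slot].is_shared = 0;
            lut[slot].mem_controller = my_controller;
        } else {
            /* shared memory: may sit behind any of the four controllers */
            lut[slot].is_shared = 1;
            lut[slot].mem_controller = slot % NUM_MEM_CONTROLLERS;
        }
    }
}
```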

Address Translation

Address Translation Diagram

The diagram above shows how a core address is translated into a system address. The lower 24 bits are passed through unchanged, and the upper 8 bits index into the lookup table.

The lookup table produces four fields: bypass, destID, subdestID, and a 10-bit prefix. They are used as follows (see the sketch after this list):

  • bypass: If this bit is 1, then this address refers to the local message passing buffer. destID is ignored.
  • destID: Identifies a tile in (y, x) format. For example, tile (y=2, x=5) is encoded as 0010 0101.
  • subdestID: Refers to the specific element in the destID tile. It could be one of Core0, Core1, Configuration Registers (CRB), or Message Passing Buffer (MPB). For some specific tiles connected to memory controllers, PCIe system interface, or VRC, it can also choose one of these components using a direction (East, South, West, or North).
  • The 10-bit prefix is concatenated with the 24-bit passthrough address to form a 34-bit address, which can address 16GB of memory. Since each request is routed to one of the four memory controllers, the total usable memory is 64GB.
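Putting the fields together, a DRAM access translates roughly as in the sketch below. The field widths follow the text (8-bit LUT index, 10-bit prefix, 24-bit passthrough), but the exact packing of destID and subdestID into the 46-bit system address is my guess.

```c
#include <stdint.h>

/* Sketch: core address -> system address for a DRAM access.
 * Field widths follow the text; the packing of destID/subdestID is
 * illustrative and does not exactly reproduce the 46-bit SCC layout. */
typedef struct {
    uint8_t  bypass;     /* 1 = local MPB, destID ignored */
    uint8_t  dest_id;    /* tile (y, x), 4 bits each, e.g. y=2,x=5 -> 0010 0101 */
    uint8_t  sub_dest;   /* core0/core1/CRB/MPB or a direction (N/E/S/W) */
    uint16_t prefix10;   /* 10-bit prefix from the LUT */
} lut_entry_t;

uint64_t translate(uint32_t core_addr, const lut_entry_t lut[256]) {
    uint8_t  index  = core_addr >> 24;        /* upper 8 bits index the LUT */
    uint32_t offset = core_addr & 0x00FFFFFF; /* lower 24 bits pass through */
    const lut_entry_t *e = &lut[index];

    /* 10-bit prefix + 24-bit offset = 34-bit address, i.e. 16GB per controller */
    uint64_t mem_addr = ((uint64_t)e->prefix10 << 24) | offset;

    /* Prepend routing information (illustrative packing) */
    return ((uint64_t)e->dest_id << 37) | ((uint64_t)e->sub_dest << 34) | mem_addr;
}
```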

Caching

Each core's L1 and L2 caches cache private memory as usual, but the programmer is responsible for explicitly managing coherence for shared memory, since the chip provides no hardware cache coherence.

RCCE is a message-passing programming model that Intel provides with the SCC, with primitives similar to MPI. If the programmer uses RCCE for communication, coherence is handled by the library automatically. The hardware provides a special MPBT (Message Passing Buffer Tag) that identifies a cache line as holding shared message-passing data, and an instruction (CL1INVMB) that marks all MPBT lines as invalid.
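To show what programmer-managed coherence means in practice, here is a hedged sketch of passing a flag through shared memory. `flush_line()` and `mpb_invalidate()` are hypothetical stand-ins for a cache-line write-back and the CL1INVMB instruction; they are not real SCC intrinsics.

```c
#include <stdint.h>

/* Hypothetical helpers standing in for real SCC mechanisms:
 *  - mpb_invalidate(): would execute CL1INVMB, invalidating all cache
 *    lines tagged MPBT so the next read fetches fresh data.
 *  - flush_line(p): would force the written line out of the local cache
 *    so the other core can observe it.
 * Neither is a real intrinsic; they mark where explicit action is needed. */
extern void mpb_invalidate(void);
extern void flush_line(volatile void *p);

volatile uint32_t *shared_flag;   /* lives in shared (MPBT-tagged) memory */

void producer(void) {
    *shared_flag = 1;             /* write the flag ... */
    flush_line(shared_flag);      /* ... and push it out of the local cache */
}

void consumer(void) {
    do {
        mpb_invalidate();         /* drop any stale MPBT lines before re-reading */
    } while (*shared_flag != 1);  /* poll until the producer's write is visible */
}
```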

Some memory latency numbers follow; they are obviously not fast (a rough worked example comes after the list):

  • Reading from L2 takes around 18 cycles.
  • Reading from the local MPB takes around 15 cycles if the bypass bit is set.
  • Reading from the local MPB takes 45 core cycles + 8 mesh cycles if the bypass bit is not set.
  • Reading from a remote MPB takes 45 core cycles + 8*n mesh cycles, where n is the number of hops.
  • A DDR memory access takes 40 core cycles + 8*2*n mesh cycles to reach the memory controller, plus about 30 cycles of memory controller latency and 16 cycles of DRAM latency, both in the 400MHz memory clock domain.
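As a rough worked example using these numbers: reading from the MPB of a tile four hops away costs about 45 core cycles plus 8 × 4 = 32 mesh cycles, already several times the cost of an L2 hit, and any DRAM access pays the round-trip mesh traversal plus roughly 30 + 16 = 46 cycles in the slower 400MHz memory clock domain on top of that.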

RCCE

RCCE is pronounced "rocky," and the documentation never spells out what the acronym stands for. It is the programming model provided for the SCC and has a few main functionalities (a usage sketch follows the list):

  • Memory management, such as malloc and free.
  • MPI-like interfaces to send and receive messages, including synchronization barriers.
  • Power and frequency management.
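For flavor, a minimal RCCE-style program might look like the sketch below. The calls follow the publicly documented RCCE API (RCCE_init, RCCE_ue, RCCE_send, RCCE_recv, RCCE_barrier), but treat the exact signatures and the RCCE_APP entry point as approximate rather than authoritative.

```c
#include <stdio.h>
#include "RCCE.h"

/* Minimal RCCE-style ping between core 0 and core 1.
 * Signatures follow the public RCCE documentation approximately;
 * error handling is omitted. */
int RCCE_APP(int argc, char **argv) {
    RCCE_init(&argc, &argv);

    int me  = RCCE_ue();        /* "unit of execution" id, i.e. this core */
    int num = RCCE_num_ues();   /* total number of participating cores */
    char msg[32] = "hello from core 0";

    if (me == 0 && num > 1)
        RCCE_send(msg, sizeof(msg), 1);   /* push through the MPB to core 1 */
    else if (me == 1)
        RCCE_recv(msg, sizeof(msg), 0);   /* blocking receive from core 0 */

    RCCE_barrier(&RCCE_COMM_WORLD);       /* synchronize all cores */
    printf("core %d of %d done\n", me, num);

    RCCE_finalize();
    return 0;
}
```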

System Architecture

The SCC connects to an FPGA through its PCIe system interface; the FPGA connects to some peripherals as well as to an ordinary PC through another PCIe link. This PC is called the Management Console and runs drivers and software kits designed to work with the SCC. The idea is that an operating system is loaded onto the cores, applications are loaded and run on that OS, and I/O system calls are redirected to and handled by the FPGA. All interaction with the system happens through the management console.

This is obviously a lab setup, but given that the project never really left Intel's research lab, it matches how newly developed chips are typically brought up and tested.

Conclusion and Final Thoughts

The Single-Chip Cloud is an interesting take on manycore design. The intention of this project was obvious: consolidate a data center into a single chip. Each core looks like a computer in a rack, and the on-chip network ties them together just like machines are networked in a data center. However, the limited amount of memory and the dated cores cannot support many cloud applications, not to mention the security issues: there was essentially no isolation between workloads running on different cores.

This project came to my attention because cloud and microservices are mentioned more and more these days. In that model, an application is not a single program but a collection of programs in their own processes, exchanging information through networking primitives. They are deployed as containers, and in the ideal case each process is handled by a single core, maximizing the benefit of microarchitectural state such as TLBs, caches, and branch predictors. These processes may run better with a few extra threads, but they do not scale to many threads and have little need for shared memory or cache coherence.

As a result, the SCC architecture from the old days happens to match the needs of these new applications. Recent work on improving the efficiency of microservices has referred to the SCC as a possible yet suboptimal architecture, pointing out that some cache coherence can still help with multi-threaded services.

Wikipedia says this chip is still actively used for research purposes, but I suspect new research projects would rather use more open platforms, such as OpenPiton.

References

[1] J. Balkind et al., “OpenPiton: An Open Source Manycore Research Framework,” in Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Atlanta, GA, USA: ACM, Mar. 2016, pp. 217–232. doi: 10.1145/2872362.2872414.
[2] Y. Feng, D. Xiang, and K. Ma, “Heterogeneous Die-to-Die Interfaces: Enabling More Flexible Chiplet Interconnection Systems,” in 56th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), Toronto, ON, Canada: ACM, Oct. 2023, pp. 930–943. doi: 10.1145/3613424.3614310.