DeathStarBench Microservice Benchmark

metricv Posted on 2023-07-03 43 Views


DeathStarBench is a benchmarking suite developed by the SAIL group at Cornell University. It is completely open-source and can be found at https://github.com/delimitrou/DeathStarBench.

This document is WIP.

Design Philosophy

Datacenters are shifting from complex monolithic services that handle all functionality in a single binary to graphs of tens of hundreds of single-purpose, loosely-coupled microservices. This has broad implications for cloud management, programming frameworks, operating systems, and datacenter hardware design.

This suite aims to include several end-to-end applications with tens of microservices, each to represent a cloud workload with lots of microservices.

Repo Watch

Applications are distributed with Docker Compose, with Helm charts provided.

It makes it easy to evaluate this benchmark on a platform with full Linux kernel and virtualization support but hard to evaluate micro-architectural changes with no taped-out chips or OS support.

Workloads and Properties

Social Network

A broadcast-style social network with uni-directional follow relationships.

Social Network Diagram

All applications downstream php-fpm uses Apache Thrift RPC framework.

Key workloads in this application:

  • The majority of microservices are written in C++.
  • New posts will be broadcasted to all followers with RabbitMQ, which probably means lots of traffic and database keying.
  • Includes some machine learning: recommenders and ads. search uses Xapian.
  • It uses MongoDB, a scalable key-value interface.

Media Service

Media Service Diagram

This application has many similarities with the social media application, with some differences worth mentioning:

  • It contained a MySQL database, which is a relational database and functionally less scalable than MongoDB (if you shard them correctly, that is).
  • It contained an NFS remote file server for storing large media files and chunking.

Hotel Reservation

Less is known about this workload since it is only released in the GitHub repository, not in the original paper. It is written in Golang, and it seems to have a static frontend. Communication between microservices is done with gRPC.

Though this application also contained search and recommend, they seem to be purely based on geographical distance, with no indexing or ranking whatsoever.

Unreleased Applications

The paper claims to have an E-Commerce Service, Banking System, and Drone Swarm Coordination system, but they were not present in the GitHub repository.

Traffic Generation & Tracing

All released applications contained Lua scripts for generating traffic into the application. They seem, however, to be unit tests instead of an emulation of what a user could do. It seems possible to generate a lot of requests to pressure test the service, but there is no way to configure the balance of each kind of request.

Applications using Apache Thrift have distributed tracing, using Thrift's timing interface to trace when requests arrive and depart from each microservice. gRPC did not have that feature built in, and Hotel Reservation using gRPC doesn't seem to integrate that functionality.

Their Evaluation Results and Implications

The authors of DeathStarBench evaluated these microservices on a cluster of well-equipped servers, with more than enough cores to give each container their dedicated core. Here are their observations:

  • A large portion of cycles, often a majority, is spent on the processor front-end (instruction fetch), but to a lesser extent than monolithic cloud services, due to their smaller code footprint.
    • Less portion of branch misprediction stalls.
    • Better I-cache locality and less I-cache misses.
  • ML Applications have extremely low IPC.
  • Interactive services still achieve better latency in servers that optimize for single-thread performance.
    • Microservices are much more sensitive to poor single-thread performance than traditional cloud applications.
    • Using an 48-core In-Order ARM (Cavium ThunderX) achieves similar E2E latency under low loads but saturate earlier in QPS - But this might be an unfair comparison: 20 Xeon servers vs 2 ThunderX boards.
  • Large number of cycles in the kernel: interrupts, TCP processing, waiting for IO, etc.
  • NIC queueing becomes dominant at high load
    • Offloading TCP to an FPGA provided 10-68x speedup in communication. 2.2x tail latency.
    • Network Acceleration provides a major boost in performance
  • Pressure on back-end services will propagate high latency to frontend. Hotspots propagate between tiers.
    • The cluster management platform does not always identify the app to scale out.
    • Using higher-level, less efficient programming languages for microservices cause bottlenecks, which saturates mid-tier services before they propagate to backend databases.
  • A slow server in a cluster is detrimental to overall performance.

References

[1]
“List of References,” Coursera. Available: https://www.coursera.org/learn/rf-mmwave-circuit-design/home/welcome. [Accessed: Jun. 04, 2024]
[1]
D. M. W. Leenaerts, J. van der Tang, and C. S. Vaucher, Circuit design for RF transceivers. New York: Springer, 2011.
[1]
T. H. Lee, The design of CMOS radio-frequency integrated circuits, 2. ed., 7. printing. Cambridge: Cambridge Univ. Press, 2009.
[1]
B. Razavi, RF microelectronics, 2nd ed. Upper Saddle River, NJ: Prentice Hall, 2012.
[1]
Y. Yu, P. G. M. Baltus, and A. H. M. Van Roermund, Integrated 60GHz RF Beamforming in CMOS, vol. 1. in Analog Circuits and Signal Processing, vol. 1. Dordrecht: Springer Netherlands, 2011. doi: 10.1007/978-94-007-0662-0. Available: https://link.springer.com/10.1007/978-94-007-0662-0. [Accessed: Jun. 04, 2024]
[1]
Y. (Yikun) Yu, “Design methods for 60GHz beamformers in CMOS.” Technische Universiteit Eindhoven, 2010. doi: 10.6100/IR691208. Available: https://research.tue.nl/en/publications/design-methods-for-60ghz-beamformers-in-cmos(29a1aea8-7d5a-465b-9577-fb2bbdc99372).html. [Accessed: Jun. 04, 2024]
[1]
“Uniform interface is not flexible enough to handle complex and mixed network traffic.”
[1]
“one uniform die-to-die interface, which severely limits flexibility.”
[1]
“restricts cache coherence of an application or page to a subset of core.”
[1]
“no inter-node ordering requirements.”
[1]
“consuming a high priority packet is never dependent on lower priority traffic.”
[1]
“each consists of two 64-bit uni-directional links, one in each direction.”
[1]
“the L1.5 does not cache instructions–these cache lines are bypassed directly between the L1 instruction cache and the L2.”
[1]
“a write-back layer.”
[1]
“Rather than modifying the existing RTL for the L1s, we introduced an extra cache level (L1.5) to tackle both issues.”
[1]
“OpenPiton uses the OpenSPARC T1 [58] core with minimal modifications.”
[1]
J. Balkind et al., “OpenPiton: An Open Source Manycore Research Framework,” in Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems, Atlanta Georgia USA: ACM, Mar. 2016, pp. 217–232. doi: 10.1145/2872362.2872414. Available: https://dl.acm.org/doi/10.1145/2872362.2872414. [Accessed: May 24, 2024]
[1]
Y. Feng, D. Xiang, and K. Ma, “Heterogeneous Die-to-Die Interfaces: Enabling More Flexible Chiplet Interconnection Systems,” in 56th Annual IEEE/ACM International Symposium on Microarchitecture, Toronto ON Canada: ACM, Oct. 2023, pp. 930–943. doi: 10.1145/3613424.3614310. Available: https://dl.acm.org/doi/10.1145/3613424.3614310. [Accessed: May 21, 2024]
[1]
“exploration sequence NG1.”
[1]
“returns a candidate couple (un, vn) to be checked for the feasibility or a null couple ( ,  ) if there are no more couples to explore.”
[1]
“Ft.”
[1]
“Fs.”
[1]
“feasibility rules Fs and Ft.”
[1]
“if two nodes can be matched in a consistent mapping, they must be in the same class.”
[1]
“node explo- ration sequence NG1.”
[1]
“avoiding also consistent states that surely will not be part of a solution.”
[1]
“exploration of only consistent states, i.e. states satisfying the constraints of the subgraph isomorphism problem.”
[1]
“making the state space a tree.”
[1]
“State Space Representation (SSR).”
[1]
“partial mapping M˜(s).”
[1]
“consistent state.”
[1]
“adding a new pair of nodes (un, vn) to the partial mapping of the current state sc so as to generate a new state sn = sc ∪ (un, vn), that becomes the new current state.”
[1]
“In the case of the subgraph isomorphism, as detailed in [7,9,18], the function M must be injective and structure preserving, i.e. it must preserve both the presence and the absence of the edges between corresponding pairs of nodes.”
[1]
“Introducing VF3: A New Algorithm for Subgraph Isomorphism.”
[1]
“a greedy algorithm called GreatestConstraint- First to find a good sequence of vertices μ.”
[1]
“the order in which vertices of the pattern are matched is crucial to speeding up the pruning process.”
[1]
“reduce the search space.”
[1]
“In this paper we present a novel subgraph isomorphism algorithm, called RI (http://ferrolab.dmi.unict.it/ri.html).”
[1]
“Contribution.”
[1]
“Note that there may be an edge (u’, v’) is Î E’ without any corre- sponding edge in E; when this happens, the subgraph isomorphism is also called a monomorphism.”
[1]
“This paper introduces a new algorithm for the subgraph isomorphism problem and compares it on synthetic and biochemical data with the most efficient and recent algorithms present in literature [3,29,30].”
[1]
“The authors in [3] and related publications show their speedup compared to the algorithm in [1] which is used in [5,6].”
[1]
V. Bonnici, R. Giugno, A. Pulvirenti, D. Shasha, and A. Ferro, “A subgraph isomorphism algorithm and its application to biochemical data,” BMC Bioinformatics, vol. 14, no. 7, p. S13, Apr. 2013, doi: 10.1186/1471-2105-14-S7-S13. Available: https://doi.org/10.1186/1471-2105-14-S7-S13. [Accessed: May 10, 2024]
[1]
“makes the yield and wafer costs of the interposer much better.”
[1]
“average area per transistor and gate, and the defect density.”
[1]
“number of metal layers and cost per additional layer.”
[1]
“choice of process technology.”
[1]
“Rent’s.”
[1]
“Industry analysts have observed a rise in the design cost of a standard SoC by 2.7x between 28nm and 14nm designs, and anticipate a further increase to 9x, over $270 million, from 28nm to 7nm [9].”
[1]
“TSVs will block off active device area, so a given partitioned die will need slightly more area for the X number of TSVs: A3D = Adie + XT SV AT SV .”