Design 3.4

This design extends design 3.3, and removes the requirement of a “one off” pod to generate the curve certificate. This came out of discussion for an issue adding the operator to Kueue after presenting at KubeCon 2023 in Amsterdam.

The pro is that we no longer require the certificate generator pod. The con is that we need a linked library and can’t use the distroless image, so the binary is larger.

Summary

  • A MiniCluster is an indexed job so we can create N copies of the “same” base containers (each with Flux, and the connected workers in our cluster).

    • Multiple containers are supported, where one is required to be the Flux runner, and others provide services to it.

    • Multiple containers can either be sidecars or separate service containers.

  • The Flux config is written to a volume at /etc/flux/config (created via a config map) as a broker.toml file.

  • The curve certificate is generated by the operator using ZeroMQ.

  • The startup script “wait.sh” (customized per container) handles parsing user preferences and starting the MiniCluster.

  • The main broker runs flux start with a primary command, and the worker nodes run flux start to register with it.

  • The munge key shared across the nodes is inherited from the base image.

  • Networking of the pods works via a headless service that includes the pod subdomain.

    • We add fully qualified domain names to the pods so that the hostname command matches the full name, and Flux is given the full names in its broker.toml.
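To make the naming concrete, here is a minimal sketch of how the fully qualified names could be derived for an indexed job behind a headless service. It assumes the standard Kubernetes pattern of pod hostname "job-index" plus subdomain, namespace, and cluster domain. The names flux-sample, flux-service, and default are placeholders, and the hosts layout under [bootstrap] follows my reading of Flux's bootstrap configuration, so treat it as an illustration and not the operator's actual template.

```python
def pod_fqdns(job_name, size, service, namespace, cluster_domain="cluster.local"):
    """Build one fully qualified name per pod index of an indexed job."""
    # Indexed job pods get the hostname "<job>-<index>"; the headless service
    # (the pod subdomain) makes "<hostname>.<service>.<namespace>.svc.<domain>" resolvable.
    return [
        f"{job_name}-{index}.{service}.{namespace}.svc.{cluster_domain}"
        for index in range(size)
    ]


def render_bootstrap_hosts(fqdns):
    """Render a hosts array for broker.toml, one entry per pod (index 0 first)."""
    # Assumed layout: a [bootstrap] table with an array of { host = "..." } entries.
    lines = ["[bootstrap]", "hosts = ["]
    lines += [f'    {{ host = "{name}" }},' for name in fqdns]
    lines.append("]")
    return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder names, for illustration only.
    names = pod_fqdns("flux-sample", 4, "flux-service", "default")
    print(render_bootstrap_hosts(names))
```

The point of the sketch is only that the name the hostname command reports inside each pod and the name listed in broker.toml are the same string.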

Interactions

There are several modes of interaction. Some of these are ephemeral and some persistent. The operator is flexible to different use cases.

  • Single command: A command to run a job is given to flux start. The job runs, completes, and the MiniCluster cleans up.

  • Batch command: A series or listing of commands is provided via the batch settings, and flux submits them as a batch job to run across the cluster.

  • Interactive mode: The broker is started and the cluster waits for further interaction (interactive: true).

    • Within this mode, the user can use a Python or Go SDK to interact directly with the broker.

    • The user can also shell directly into the cluster, akin to traditional HPC, to interact.

  • RESTful API: The broker starts a RESTful API server that can be interacted with (interactive: false and no command).
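The modes above can be read as a small decision over three settings. The following is a rough sketch of that reading only: the function name, the argument names, and the order in which the settings are checked are my assumptions, not the operator's actual logic.

```python
from typing import Optional, Sequence


def interaction_mode(
    command: Optional[str] = None,
    batch: Optional[Sequence[str]] = None,
    interactive: bool = False,
) -> str:
    """Pick an interaction mode from (hypothetical) MiniCluster settings."""
    if interactive:
        # interactive: true -> start the broker and wait for the user.
        return "interactive"
    if batch:
        # A listing of commands submitted as a batch job.
        return "batch"
    if command:
        # One command handed to flux start; the cluster cleans up afterwards.
        return "single-command"
    # interactive: false and no command -> start the RESTful API server.
    return "restful-api"


if __name__ == "__main__":
    print(interaction_mode(command="lmp -in in.reaxc.hns"))  # placeholder command
    print(interaction_mode())
```

Checking interactive first is a design choice for the sketch; the text only fixes the last case (not interactive and no command).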

Kubernetes Objects

  1. A MiniCluster is a CRD that includes containers (and for each, options such as image and command) and size (see the CRD spec!).

  2. Creating a MiniCluster first creates ConfigMaps, Volumes (required customization), and Secrets, and then an indexed job with pods that use them.

  3. Index 0 is “special” in that it creates the main shared assets and launches the main command or server (the others start with flux start and a sleep, and essentially register to the cluster).

  4. Flux is required in the running container image (from the user), however the Flux RESTful API is not (it is installed when the server comes up).

  5. A way to make a workflow more portable is to use a Flux base container, but then pull a Singularity container to run the workflow

Testing Differences

I learned something really interesting here about ZeroMQ hostnames! I first ran timed tests to look at differences in times for small LAMMPS runs with the new design:

# These are times with the operator generating the curve cert, but with a bug
[11.077011585235596, 10.691956281661987, 10.66898226737976, 11.484421968460083, 10.63238525390625, 40.45787453651428, 11.327536582946777, 11.74747085571289, 41.571057081222534, 12.504639625549316, 10.984244585037231, 11.348772525787354, 41.009217262268066, 11.148920059204102, 11.831494569778442, 11.30379605293274, 11.368267297744751, 14.460205554962158, 20.25166130065918, 16.3896803855896]
Mean: 16.612979781627654
Std: 10.770296858792106

I noticed huge variability - and on inspection, that the worker pods weren’t connected. Here is a run with the previous design:

# These are times with a one-off pod doing it
[14.729787349700928, 13.066138982772827, 14.67087459564209, 14.394914865493774, 14.968234062194824, 14.576695919036865, 14.472598314285278, 14.54642391204834, 14.63283109664917, 14.919774770736694, 14.643915176391602, 14.472495317459106, 14.504558563232422, 15.089675903320312, 12.9026780128479, 14.505521535873413, 14.302513122558594, 15.221238851547241, 14.384584665298462, 14.278271198272705]
Mean: 14.464186310768127
Std: 0.5660379429411524

But then I realized the hostname in the certificate was empty, and fixed the bug. I ran the timed tests again, this time with a proper hostname:

# These are times with the operator generating, but the bug fixed (the certificate has a hostname)
All pods are terminated.
[11.522273302078247, 11.835262298583984, 11.50618028640747, 11.119516849517822, 11.024176120758057, 11.401328563690186, 11.369296312332153, 11.040543556213379, 11.348724126815796, 11.640495300292969, 11.585636854171753, 11.250822067260742, 11.169968128204346, 11.857943773269653, 11.732101202011108, 11.886579751968384, 11.454119682312012, 10.956253290176392, 11.347179651260376, 11.463804721832275]
Mean: 11.425610291957856
Std: 0.2796793262007871

So yay! We can see that with the (working) curve certificate generated by the operator, the mean time is about 3 seconds faster than with the one-off pod (about 11.4 vs. 14.5 seconds). That’s pretty great! And the fact that without a hostname the networking doesn’t happen is really interesting.
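As a quick check of the numbers, here is a small sketch that recomputes the summary statistics from the two lists above. It assumes the reported Std is the sample standard deviation, which is what reproduces the printed values. The function and variable names are mine.

```python
import statistics

# Times (seconds) copied from the runs above.
one_off_pod = [
    14.729787349700928, 13.066138982772827, 14.67087459564209, 14.394914865493774,
    14.968234062194824, 14.576695919036865, 14.472598314285278, 14.54642391204834,
    14.63283109664917, 14.919774770736694, 14.643915176391602, 14.472495317459106,
    14.504558563232422, 15.089675903320312, 12.9026780128479, 14.505521535873413,
    14.302513122558594, 15.221238851547241, 14.384584665298462, 14.278271198272705,
]
operator_fixed = [
    11.522273302078247, 11.835262298583984, 11.50618028640747, 11.119516849517822,
    11.024176120758057, 11.401328563690186, 11.369296312332153, 11.040543556213379,
    11.348724126815796, 11.640495300292969, 11.585636854171753, 11.250822067260742,
    11.169968128204346, 11.857943773269653, 11.732101202011108, 11.886579751968384,
    11.454119682312012, 10.956253290176392, 11.347179651260376, 11.463804721832275,
]


def summarize(times):
    """Return (mean, sample standard deviation) for a list of run times."""
    return statistics.mean(times), statistics.stdev(times)


old_mean, old_std = summarize(one_off_pod)
new_mean, new_std = summarize(operator_fixed)
speedup = old_mean - new_mean  # seconds saved per run, on average

if __name__ == "__main__":
    print(f"one-off pod:    mean={old_mean:.3f} std={old_std:.3f}")
    print(f"operator fixed: mean={new_mean:.3f} std={new_std:.3f}")
    print(f"difference in means: {speedup:.2f} seconds")
```

The same summarize function applied to the buggy run shows the much larger spread caused by the handful of runs at roughly 40 seconds.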

the-operator.png


Last update: Apr 04, 2024