🦄 Spark on Kubernetes

1/7/2024

The Problem with Running Spark on Kubernetes


Kubernetes is great at running services. It's not great at running batch workloads — at least not out of the box.

The default Kubernetes scheduler was designed for long-running services, not for Spark jobs that need to spin up a driver and 50 executors simultaneously, process data, and disappear. When you try to run Spark on Kubernetes with the default scheduler, you run into a set of familiar problems:

  • ▶️ No workload queuing — if the cluster is full, Kubernetes rejects your job outright. You implement retry logic. It gets messy.
  • ▶️ No multi-tenancy — all workloads share the same resource pool. Team A's heavy job starves Team B's pipeline.
  • ▶️ No gang scheduling — Spark needs a driver and executors. The default scheduler allocates pods one by one. You end up with a driver running but no executors, burning resources while waiting.
  • ▶️ Inefficient autoscaling — without gang scheduling, the cluster autoscaler doesn't know how many nodes a job actually needs upfront. It scales reactively, in batches, with significant delays.

This is where Apache YuniKorn comes in.



What is Apache YuniKorn?


YuniKorn is a Kubernetes scheduler built specifically for batch and big data workloads. It replaces (or augments) the default Kubernetes scheduler and adds:

  • Hierarchical resource queues — define queues per team, project, or environment with guaranteed and maximum resource limits
  • Gang scheduling — allocate all resources for a job at once, or not at all
  • Application-aware scheduling — YuniKorn understands the concept of an "application" (driver + executors), not just individual pods
  • Fair scheduling and preemption — configurable priorities, preemption policies, and user/group quota enforcement
  • Web UI — visibility into queues, applications, and resource utilization

It's used in production by Apple, Alibaba, Visa, Lyft, Zillow, and Cloudera. AWS themselves recommend YuniKorn with gang scheduling for running EMR on EKS.



Gang Scheduling: The Key Feature

gang scheduling

Gang scheduling is the most important thing YuniKorn brings to Spark on Kubernetes. Here's why it matters.


The Resource Segmentation Problem


Without gang scheduling, when multiple Spark jobs compete for limited resources, you get this:

Job 1: [driver ✅] [executor ⏳] [executor ⏳] [executor ⏳]
Job 2: [driver ✅] [executor ⏳] [executor ⏳] [executor ⏳]
Job 3: [driver ✅] [executor ⏳] [executor ⏳] [executor ⏳]

Every job has a driver running but no executors. All jobs are stuck. The cluster is deadlocked.

With gang scheduling:

Job 1: [driver ✅] [executor ✅] [executor ✅] [executor ✅]  ← running
Job 2: [driver ⏳] [executor ⏳] [executor ⏳] [executor ⏳]  ← waiting
Job 3: [driver ⏳] [executor ⏳] [executor ⏳] [executor ⏳]  ← waiting

Job 1 gets all its resources and runs. Jobs 2 and 3 wait in the queue. Predictable, no deadlocks.


How It Works


You define TaskGroups in your Spark job spec — one for the driver, one for executors:

# Spark application annotations for YuniKorn gang scheduling
yunikorn.apache.org/task-group-name: "spark-driver"
yunikorn.apache.org/task-groups: |
  [{
    "name": "spark-driver",
    "minMember": 1,
    "minResource": {"cpu": "1", "memory": "2Gi"}
  }, {
    "name": "spark-executor",
    "minMember": 10,
    "minResource": {"cpu": "2", "memory": "4Gi"}
  }]

When the job is submitted, YuniKorn:

  1. Creates lightweight placeholder pods for all task groups (1 driver + 10 executors)
  2. Waits until all placeholders are scheduled — this triggers the cluster autoscaler in one shot
  3. Swaps placeholders with real pods once resources are available
  4. Cleans up any over-reserved placeholders

The result: the autoscaler sees the full resource demand immediately and scales the cluster to the right size in one event, instead of scaling reactively in multiple waves.


Soft vs Hard Gang Scheduling


YuniKorn supports both:

  • Hard gang scheduling — all or nothing. If the full resource set can't be allocated, the job waits.
  • Soft gang scheduling — tries gang scheduling first, falls back to normal scheduling if it can't be satisfied. Useful for jobs that can make progress with fewer executors.


Hierarchical Queues


Beyond gang scheduling, YuniKorn's queue system is what makes multi-tenant Spark clusters actually manageable:

queues:
  - name: root
    queues:
      - name: data-engineering
        resources:
          guaranteed: {cpu: "40", memory: "200Gi"}
          max: {cpu: "80", memory: "400Gi"}
        queues:
          - name: team-a
            resources:
              guaranteed: {cpu: "20", memory: "100Gi"}
          - name: team-b
            resources:
              guaranteed: {cpu: "20", memory: "100Gi"}
      - name: ml-training
        resources:
          guaranteed: {cpu: "20", memory: "100Gi"}
          max: {cpu: "60", memory: "300Gi"}

Each team gets guaranteed resources and can burst up to the queue maximum. No team can starve another. Jobs queue within their allocation instead of being rejected.

tenants


YuniKorn vs AWS Glue


So where does AWS Glue fit in?

AspectYuniKorn + Spark on K8sAWS Glue
SetupModerate — deploy YuniKorn, configure queuesZero — fully managed
Ops burdenYou own the clusterNone
Scheduling controlFull — gang scheduling, queues, prioritiesNone
Multi-tenancyBuilt-in via queuesLimited
Spark versionAnyGlue-managed (Glue 4.0 = Spark 3.3)
AutoscalingEfficient with gang schedulingAutomatic
Cost modelEC2/spot pricing + cluster overheadPer DPU-hour (managed premium)
Vendor lock-inNoneAWS-only
DebuggingStandard Spark UI + YuniKorn UICloudWatch + limited Spark UI
Cold startFast (warm cluster)Slower (job startup overhead)
Best forHigh-volume, multi-tenant, cost-sensitiveLow-ops, occasional ETL, AWS-native teams

When Glue Makes Sense


Glue is genuinely good if you're running occasional ETL jobs and don't want to manage infrastructure. You pay a premium for the managed service, but you get zero ops overhead, native AWS integrations (S3, Glue Catalog, IAM), and serverless scaling. For teams that run a handful of pipelines and don't want to think about Kubernetes, it's hard to beat.


When YuniKorn + Spark on K8s Makes Sense


The economics shift when you're running Spark at scale. Glue's per-DPU pricing adds up quickly for high-volume workloads. With Spark on EKS and spot instances, you can run the same workloads at significantly lower cost — often 50–70% cheaper for sustained usage — at the expense of managing the cluster.

Beyond cost, YuniKorn gives you something Glue simply can't: control over scheduling. If you have multiple teams sharing compute, complex job dependencies, or ML training workloads that need gang scheduling, Glue's "submit and wait" model doesn't cut it.

The combination of YuniKorn's gang scheduling + Kubernetes cluster autoscaler + spot instances is particularly powerful: jobs get their full resource allocation in one autoscaling event, spot interruptions are handled gracefully, and you're paying EC2 spot prices instead of Glue DPU rates.



The Verdict


AWS Glue is the right choice if you want zero ops and you're already deep in the AWS ecosystem. The managed experience is genuinely good for straightforward ETL.

YuniKorn + Spark on Kubernetes is the right choice if you're running Spark at scale, need multi-tenancy, or want to optimize costs. The gang scheduling feature alone solves real problems that the default Kubernetes scheduler can't — resource deadlocks, inefficient autoscaling, and unpredictable job runtimes. The fact that AWS themselves recommend it for EMR on EKS says a lot.

If you're already running Kubernetes and considering Spark workloads, YuniKorn should be your first stop before reaching for Glue.



Further Reading