We invite your participation on this feature proposal. Please keep comments substantive. We'd especially love feedback on ways in which this feature may be useful to you and/or where you feel this RFC falls short.
Problem Statement
Customers often use public clouds for workspace provisioning, but these clouds can face resource constraints (especially for GPUs), causing slow builds. Startup scripts that clone large monorepos or install many dependencies also add delays. This poor first-touch experience - sometimes up to 15 minutes - hurts customers’ internal adoption and, subsequently, our sales.
We need a way to pre-provision workspaces so provisioning time is reduced to seconds.
User Stories
As a developer, I want to create workspaces near instantly, in order to start delivering value as soon as possible
As a developer, I want workspace creation to be fast, in order to have short-lived / ephemeral workspaces for quick experiments or code-reviews
As an operator, I want to provision workspaces preemptively so that developers can create workspaces within 60 seconds, to keep them in flow
As an operator, I am willing to trade off increased infrastructure spend to improve developers’ productivity, but I need to control this spend
As an operator, I want to view a template’s prebuilt workspaces for troubleshooting purposes
As an operator, I want my users to have a fast first experience with workspace provisioning, in order to reduce any inertia in their onboarding process
As an operator, I want metrics or other insights, in order to assess how prebuilds are being used
Requirements
Initial Functional Requirements
- MUST accelerate workspace creation for net-new builds
  - prebuilds WILL NOT work for rebuilding existing workspaces, because prebuilding requires creating workspaces from scratch
- MUST provision a workspace synchronously if a prebuild is not available (graceful fallback)
- MUST allow operators to configure how many prebuilt instances to create, to control costs
- MUST NOT restrict any existing functionality of workspaces
- MUST allow for configuring combinations of coder_parameter values to produce different prebuilt workspace "flavors" (see Workspace Presets #16304)
- MUST warn template admins about incompatibilities with prebuilds at template import time
  - see Constraints
- MUST keep prebuilds in a running state when not in use, since the compute resources of a workspace are usually the slowest to provision
- MUST support scaling prebuilds to 0 outside of working hours to control costs
- MUST expose observability to enable introspection of prebuild provisioning and usage
- MUST require a Premium license
Initial Non-functional Requirements
- MUST reduce workspace provisioning time to 60 seconds or less
  - NOTE: provisioning time refers to the time taken to produce a workspace, but not for it to be fully operational (i.e. agent startup scripts have run)
- MUST NOT be slower than current workspace provisioning if there is no prebuild available
- MUST NOT require template admins to refactor their templates significantly
- MUST NOT change workspace behavior or template semantics
Basic Flow
1. template is configured by the template admin to have prebuilds enabled (see UX & Design)
2. n prebuilt workspaces are created ("first pass") using terraform apply
   - all prebuilds are owned by a special user
   - the agent on each prebuilt workspace starts and connects to coderd
   - startup scripts execute conditionally
   - SSH and other non-essential services are disabled
3. user requests a new workspace
4. a prebuild exists to satisfy the request (see Matching Logic)
5. the prebuild is marked locked
6. the prebuild's ownership is transferred to the requesting user ("second pass")
   - the prebuild is now indistinguishable from a regular workspace
7. terraform apply is invoked again with new ownership metadata & the parameters chosen in step 3 ("third pass")
8. the agent is instructed to reconfigure itself with new metadata, including new ownership (see Agent Reinitialization)
9. the workspace is now ready for use!
UX & Design
Integration with Workspace Presets

```hcl
# existing template
resource "coder_workspace_preset" "us-nix" {
  name = "Nix US"
  parameters = {
    (data.coder_parameter.region.name)     = "us-pittsburgh"
    (data.coder_parameter.image_type.name) = "codercom/oss-dogfood-nix:latest"
  }

  # ↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓
  prebuilds = {
    instances = 2

    cache_invalidation = {
      # See the Invalidation section for more
      invalidate_after_secs = 86400
    }

    autoscaling = {
      ... # See the Autoscaling section for examples
    }
  }
}
```
Workspace presets allow operators to specify sets of parameters to simplify workspace builds for users. In order to build a workspace, all required parameters MUST be provided.
If we piggy-back on Workspace Presets, we can use them to define which “flavors” of workspaces operators want to prebuild (i.e. small/medium/large - each with their own combination of parameters). Each preset can also have its own number of prebuilt instances; some presets might be more popular than others.
This has the nice property that presets can be used without prebuilds (i.e. instances=0), and enabling prebuilds is as simple as defining the number of instances.
Persistence
The above coder_workspace_preset resources will be captured during the template import process and inserted into the database. Each template version will have its own associated preset entries.
Prebuilds themselves can be stored in the workspaces table; they are workspaces after all. Prebuilds will be identified only by their ownership. If they are owned by the prebuilds user, then they are by definition a prebuild.
It’s important to note that presets are stored against a template_version.
Matching Logic
When a user requests a new workspace and a preset is chosen, the UUID of the chosen preset is used to compare against any available prebuilds which also use that preset UUID.
A prebuild will ONLY be considered available if its lifecycle_state is ready and its preset UUID matches.
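The matching rule above can be sketched in a few lines. This is an illustrative model, not the actual implementation: the `Prebuild` fields and the `"ready"` state string are assumptions standing in for the real schema.

```python
from dataclasses import dataclass
from typing import Optional
from uuid import UUID, uuid4

# Illustrative model; field names are assumptions, not the real schema.
@dataclass
class Prebuild:
    id: UUID
    preset_id: UUID
    lifecycle_state: str  # e.g. "provisioning", "ready"

def find_available_prebuild(prebuilds: list[Prebuild], preset_id: UUID) -> Optional[Prebuild]:
    """Return a prebuild that is ready AND was built from the requested preset."""
    for pb in prebuilds:
        if pb.lifecycle_state == "ready" and pb.preset_id == preset_id:
            return pb
    return None  # caller falls back to synchronous provisioning
```

If no match is found, the request proceeds through the normal synchronous build path (see Provisioning).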
Invalidation
New workspaces always use the latest template version. Therefore when a new template version is promoted to the active version, all existing prebuilds must be destroyed.
The proposed usage above shows that an invalidate_after_secs attribute can be set. The use-case for this is for workspaces which clone a monorepo: incremental updates (i.e. delta between prebuilt state and current state) will work up to a certain point, but after a certain period of time it might be preferable to just build a new prebuild.
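The invalidate_after_secs check amounts to a simple age comparison, evaluated during reconciliation. A minimal sketch (function name and epoch-based timestamps are assumptions):

```python
import time
from typing import Optional

def is_stale(created_at_epoch: float, invalidate_after_secs: int,
             now: Optional[float] = None) -> bool:
    """A prebuild older than invalidate_after_secs should be destroyed
    and replaced by a fresh one on the next reconciliation pass."""
    if now is None:
        now = time.time()
    return (now - created_at_epoch) > invalidate_after_secs
```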
We could also expose an API to invalidate all prebuilds for a preset if operators need that degree of control; i.e. a new AMI is built.
Provisioning
A nice property of our current design is that if no prebuilds are available, a new workspace will be provisioned synchronously. Failing to build prebuilds will not block users, it’ll just fall back to the existing behavior of imperative provisioning of workspace resources (graceful degradation).
Reconciliation Loop
We will build a reconciliation loop which will reconcile all templates’ prebuilds.
This needs to be triggered under the following scenarios:

- A new active template version is chosen, leading to existing prebuilds being invalidated
- A workspace build completes (which may have used a prebuild)
- A new Autoscaling schedule becomes active (i.e. now is within the crontab expression)
- An Invalidation event occurs
- coderd startup
- Periodically (i.e. every 15s)
The control loop will invoke a reconciliation of template states on an interval, but can also be “nudged” when the above scenarios occur to reduce waiting time.
Once the number of desired vs actual prebuilds for the given template is determined, this mechanism will enqueue a number of provisioner jobs to either create new, or destroy outdated/extraneous, prebuilds to satisfy the desired count.
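The desired-vs-actual calculation can be sketched as follows. This is a hypothetical per-preset reconciliation, not the actual implementation: outdated prebuilds (e.g. built from a superseded template version) are always destroyed, and the remainder is topped up or trimmed toward the desired count.

```python
from dataclasses import dataclass

@dataclass
class ReconcileActions:
    create: int   # new prebuild provisioner jobs to enqueue
    destroy: int  # extraneous/outdated prebuilds to tear down

def reconcile(desired: int, current: int, outdated: int) -> ReconcileActions:
    """Compute the provisioner jobs needed to converge one preset's prebuilds.

    desired:  instance count from the preset (possibly adjusted by autoscaling)
    current:  valid prebuilds that exist or are being built
    outdated: prebuilds invalidated by a template version change etc.
    """
    create = max(0, desired - current)
    destroy = outdated + max(0, current - desired)
    return ReconcileActions(create=create, destroy=destroy)
```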
NOTE: We need to use an advisory lock (per template) when performing this reconciliation, because multiple coderd replicas could attempt to perform it simultaneously.
Ownership
We will create a “Prebuild Owner” user and have it own all prebuilt workspaces.
- This user MUST be excluded from user listing APIs
- This user's workspaces (i.e. prebuilds) MUST be excluded from workspace listing APIs
  - We will need specific APIs for prebuilds
- This user MUST NOT count towards a license seat
We will build a mechanism to “claim” a prebuild. Prebuilds are workspaces, except they are owned by the prebuilds user; in fact, this is all that defines a prebuild. Once a prebuild is matched, it will be atomically assigned to the requestor.
No advisory lock is needed for this action; SELECT ... FOR UPDATE SKIP LOCKED will protect a prebuild from being eligible for assignment to multiple users simultaneously.
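A claim could be implemented as a single query along these lines; the table is the `workspaces` table described under Persistence, but the column names (`preset_id`, `lifecycle_state`) and parameter positions are assumptions, not the real schema:

```sql
-- Atomically claim one ready prebuild for the requesting user.
-- SKIP LOCKED ensures two concurrent claims never select the same row.
UPDATE workspaces
SET    owner_id = $2              -- the requesting user
WHERE  id = (
    SELECT id
    FROM   workspaces
    WHERE  owner_id = $1          -- the special prebuilds user
    AND    preset_id = $3         -- the preset chosen by the requester
    AND    lifecycle_state = 'ready'
    LIMIT  1
    FOR UPDATE SKIP LOCKED
)
RETURNING id;
```

If the query returns no row, no prebuild was available and the request falls through to synchronous provisioning.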
Build Phases
Each workspace will have 3 workspace builds ("phases").

1st phase: provisioning of the prebuild itself. This will require us to stub out identity datasources (see Constraints).
- This phase is entirely asynchronous and is not involved in the workspace creation process.
- The Reconciliation Loop will reconcile the state, and at this point a new prebuild provisioning attempt will be triggered.

2nd phase: workspace build following the workspace creation request. If an available prebuild is matched (see Matching Logic), the ownership (i.e. owner_id field in the workspaces table) will be atomically changed to the initiator of the request.
- This & the subsequent phase MUST occur synchronously in the workspace creation process.

3rd phase: prepare the workspace using the new ownership identity. We will invoke another terraform apply, but now the identity datasources will have legitimate values injected, which may cause some resources to be modified (see Failure Modes). Once the build succeeds, we will need to reinitialise the agent on the prebuilt workspace with a new (updated) manifest. See Agent Reinitialization for more details.
- If this phase fails, the workspace build will need to be manually retried.
- We MAY need an API and/or UI here to allow a workspace to have another start transition initiated, since we don't really want to retry via stop → start, as this would destroy and recreate all workspace resources, obviating the point of prebuilds.
- The agent MUST be instructed to reinitialize whenever a start is initiated on an already running workspace.
Failure Modes
Should the 1st phase fail, the Reconciliation Loop will leave these prebuilds in their failed state. We don’t want to provision potentially many additional resources by retrying, so an operator will either need to manually restart the prebuild (via normal workspace controls) or delete it; the latter case will cause a new prebuild to be provisioned.
The 2nd phase will occur atomically; if it fails for whatever reason, the prebuild will still be available for claiming later.
If the 3rd phase fails, the workspace build will need to be manually retried; at this point it is technically no longer a prebuild, and will not be under the purview of the Reconciliation Loop.
Conditionalized Templates & Startup Scripts
Operators may require a way to conditionalize how a template behaves when it’s provisioning a prebuild vs a regular build.
Currently we use a start_count value on the coder_workspace datasource to discriminate between a start and stop transition. Similarly, we will expose a prebuild_count attribute on the coder_workspace resource (remember, a prebuild is a workspace) which will be set to 1 when building the prebuild in phase 1.
For example, a template admin could choose to only execute a script on the prebuild:
```hcl
data "coder_workspace" "me" {}

resource "coder_script" "script1" {
  # prebuild_count will only be 1 during prebuild provisioning
  count        = data.coder_workspace.me.prebuild_count
  agent_id     = coder_agent.dev1.id
  display_name = "Foobar Script 1"
  script       = "echo foobar 1"
  run_on_start = true
}
```
Startup scripts can also be defined in the coder_agent resource, and these cannot take advantage of the count technique above. To ameliorate this limitation, we will need to support a new prebuild_startup_script field. We don't need to define a prebuild_startup_script_behavior equivalent, because SSH, which that behavior interacts with, will be disabled on prebuilds.
Agent Reinitialization
The agent will need to reinitialize once it has been assigned a new identity (and possibly some of its attributes are updated like env or startup scripts).
Once build phase 3 completes, the agent will need to be notified that its manifest has been updated. The agent API will be notified via pubsub (on a per-workspace channel), and will then push an update to the agent.
Once the agent receives its new manifest, it will use it to reinitialize itself.
Observability
We should expose Prometheus metrics for (with partitioning in brackets):
- counter of prebuilds created (preset_name, template_name) → collected
- gauge of desired prebuilds (preset_name, template_name) → collected
- gauge of actual prebuilds (preset_name, template_name) → collected
- counter of failed prebuilds (preset_name, template_name, reason) → collected
- counter of claimed prebuilds (preset_name, template_name, user_id) → collected
- counter of presets used (preset_name, template_name) → collected
- counter of workspace builds which DID NOT match a prebuild, but could have (preset_name, template_name, user_id)
  - i.e. there was no prebuild available at the time
Autoscaling
Given that prebuilt instances will be consuming (potentially very expensive) cloud resources, operators will need a mechanism to scale them to 0 outside working hours.
For the initial phase, we will expose an autoscaling field under coder_workspace_preset:
```hcl
data "coder_workspace_preset" "us-nix" {
  ...
  prebuilds = {
    instances = 0 # default to 0 instances

    autoscaling = {
      # only a single timezone may be used, for simplicity
      timezone = "UTC"

      # scale to 3 instances during the work week
      schedule {
        cron      = "* 8-18 * * 1-5" # from 8AM-6PM, Mon-Fri, UTC
        instances = 3
      }

      # scale to 1 instance on Saturdays for urgent support queries
      schedule {
        cron      = "* 8-14 * * 6" # from 8AM-2PM, Sat, UTC
        instances = 1
      }
    }
  }
}
```
The solution above is designed to mirror the autostart scheduling.
The crontab format will already be familiar to operators, and will be intuitive to understand. The design above allows operators to specify a default number of instances, and then to scale that number dynamically based on one or more schedules. This can either be used to start from 0 and scale up (as the example above demonstrates), or the inverse; whichever the operator prefers.
This design would also allow for validation at template import time, where we will detect scheduling conflicts (i.e. if multiple schedules overlap and produce different values).
This will require a simple ticker to evaluate when the current time matches the crontab expression of a schedule, and to trigger the appropriate reconciliation in the Reconciliation Loop.
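The ticker's evaluation step can be sketched as below. This is a deliberately simplified matcher handling only the field shapes used in the example above (`*`, a single number, or an `a-b` range in the hour and day-of-week positions); a real implementation would use a full cron parser.

```python
from datetime import datetime, timezone

def _field_matches(field: str, value: int) -> bool:
    """Match one crontab field: '*', a single number, or a range 'a-b'."""
    if field == "*":
        return True
    if "-" in field:
        lo, hi = map(int, field.split("-"))
        return lo <= value <= hi
    return int(field) == value

def instances_for(now: datetime, default: int, schedules: list) -> int:
    """Return the instance count of the first matching (cron, instances)
    schedule, else the preset's default.

    Only the hour and day-of-week fields are evaluated here.
    Cron weekdays: 0=Sunday..6=Saturday; Python: Monday=0..Sunday=6.
    """
    cron_dow = (now.weekday() + 1) % 7
    for cron, instances in schedules:
        _minute, hour, _dom, _month, dow = cron.split()
        if _field_matches(hour, now.hour) and _field_matches(dow, cron_dow):
            return instances
    return default
```

On each tick (and on the "schedule becomes active" trigger), the result would be fed into the Reconciliation Loop as the desired instance count for that preset.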
Constraints

1. An infinite set of template configurations is possible with Terraform
2. Almost all templates use the coder_workspace and coder_workspace_owner ("identity") data-sources, both of which rely on a workspace being owned by a user
3. Templates can be customised in non-deterministic ways through coder_parameters
In order to allow prebuilding of workspaces, we have to side-step constraint 2. Consider the following snippet from this template:
If we were to create a prebuilt workspace, what would we provide to the data.coder_workspace_owner.me.name and data.coder_workspace.me.name values? Changing this name attribute forces a replacement of the resource, and therefore makes the prebuild irrelevant.
To counteract this, we will:

- Inject known "stub" values into the above data-sources before a real identity is associated with the workspace
  - data.coder_workspace_owner.me.name: coder_prebuild_owner_${UUID}
  - data.coder_workspace.me.name: coder_prebuild_${UUID}
  - These values have to be human-readable since the provisioned resources will retain them in their names, visible via the cloud console
- Create/reuse a linter which can detect known-bad values for name, and show a warning to the template author
  - name is not the only attribute which can cause a replacement; each provider and each resource has its own behavior. Consequently, we will need to add provider-specific checks for other resource attributes to further assist template authors
  - we likely just need to cover the major compute resources of the major cloud & orchestration (i.e. k8s/nomad) providers
  - later on we can catch all possible cases: expand the template import process to detect when a resource will be replaced during the second build phase (i.e. once a workspace has been assigned to a user)
  - To achieve this, we could either use tfsec's custom checks, or query the plan file using JMESPath expressions

Workarounds for existing templates: adding an ignore_changes lifecycle block will instruct Terraform to disregard changes to such an attribute, so the ownership transfer no longer forces a replacement.
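As a minimal sketch of the ignore_changes workaround (the resource type and name pattern here are illustrative, not taken from a specific template):

```hcl
resource "docker_container" "workspace" {
  # Without ignore_changes, renaming on ownership transfer would force
  # a replacement and defeat the purpose of the prebuild.
  name = "coder-${data.coder_workspace_owner.me.name}-${data.coder_workspace.me.name}"

  lifecycle {
    ignore_changes = [name]
  }
}
```

The trade-off is that the resource name keeps its stubbed value after the claim; templates that need the real owner's name in the resource itself would have to surface it another way.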
Onboarding
We added Workspace Build Timings, which provided insight into speed issues, but it didn't offer any solutions to this particular problem.
We could use the timings graph to prompt users to try prebuilds.
Infrastructure Cost Concerns
Prebuilds will increase infrastructure spend, and we have to make that trade-off known to customers. Initially we can just highlight this in the documentation, but later we might want to provide a calculator to determine if prebuilds are worth the cost.