We invite your participation on this feature proposal. Please keep comments substantive. We'd especially love feedback on ways in which this feature may be useful to you and/or where you feel this RFC falls short.
Problem Statement
Customers often use public clouds for workspace provisioning, but these clouds can face resource constraints (especially for GPUs), causing slow builds. Startup scripts that clone large monorepos or install many dependencies also add delays. This poor first-touch experience - sometimes up to 15 minutes - hurts customers’ internal adoption and, subsequently, our sales.
We need a way to pre-provision workspaces so provisioning time is reduced to seconds.
User Stories
As a developer, I want to create workspaces near instantly, in order to start delivering value as soon as possible
As a developer, I want workspace creation to be fast, in order to have short-lived / ephemeral workspaces for quick experiments or code-reviews
As an operator, I want to provision workspaces preemptively so that developers can create workspaces within 60 seconds, to keep them in flow
As an operator, I am willing to trade off increased infrastructure spend to improve developers’ productivity, but I need to control this spend
As an operator, I want to view a template’s prebuilt workspaces for troubleshooting purposes
As an operator, I want my users to have a fast first experience with workspace provisioning, in order to reduce any inertia in their onboarding process
As an operator, I want metrics or other insights, in order to assess how prebuilds are being used
Requirements
Initial Functional Requirements
- MUST accelerate workspace creation for net-new builds
  - prebuilds WILL NOT work for rebuilding existing workspaces, because prebuilding requires creating workspaces from scratch
- MUST provision a workspace synchronously if a prebuild is not available (graceful fallback)
- MUST allow operators to configure how many prebuilt instances to create, to control costs
- MUST NOT restrict any existing functionality of workspaces
- MUST allow for configuring combinations of coder_parameter values to produce different prebuilt workspace "flavors" (see Workspace Presets #16304)
- MUST warn template admins about incompatibilities with prebuilds at template import time
  - see Constraints
- MUST keep prebuilds in a running state when not in use, since the compute resources of a workspace are usually the slowest to provision
- MUST support scaling prebuilds to 0 outside of working hours to control costs
- MUST expose observability to enable introspection of prebuild provisioning and usage
- MUST require a Premium license
Initial Non-functional Requirements
- MUST reduce workspace provisioning time to 60 seconds or less
  - NOTE: provisioning time refers to the time taken to produce a workspace, but not for it to be fully operational (i.e. agent startup scripts have run)
- MUST NOT be slower than current workspace provisioning if there is no prebuild available
- MUST NOT require template admins to refactor their templates significantly
- MUST NOT change workspace behavior or template semantics
Basic Flow
1. template is configured by the template admin to have prebuilds enabled (see UX & Design)
2. n prebuilt workspaces are created ("first pass") using terraform apply
   - all prebuilds are owned by a special user
   - the agent on each prebuilt workspace starts and connects to coderd
   - startup scripts execute conditionally
   - SSH and other non-essential services are disabled
3. user requests a new workspace
4. a prebuild exists to satisfy the request (see Matching Logic)
5. the prebuild is marked locked
6. the prebuild's ownership is transferred to the requesting user ("second pass")
   - the prebuild is now indistinguishable from a regular workspace
7. terraform apply is invoked again with new ownership metadata & the parameters chosen in step 3 ("third pass")
8. the agent is instructed to reconfigure itself with new metadata, including new ownership (see Agent Reinitialization)
9. the workspace is now ready for use!
UX & Design
Integration with Workspace Presets

```hcl
# existing template
resource "coder_workspace_preset" "us-nix" {
  name = "Nix US"
  parameters = {
    (data.coder_parameter.region.name)     = "us-pittsburgh"
    (data.coder_parameter.image_type.name) = "codercom/oss-dogfood-nix:latest"
  }

  # ↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓
  prebuilds = {
    instances = 2

    cache_invalidation = {
      # See the Invalidation section for more
      invalidate_after_secs = 86400
    }

    autoscaling = {
      ... # See the Autoscaling section for examples
    }
  }
}
```
Workspace presets allow operators to specify sets of parameters to simplify workspace builds for users. In order to build a workspace, all required parameters MUST be provided.
If we piggy-back on Workspace Presets, we can use them to define which “flavors” of workspaces operators want to prebuild (i.e. small/medium/large - each with their own combination of parameters). Each preset can also have its own number of prebuilt instances; some presets might be more popular than others.
This has the nice property that presets can be used without prebuilds (i.e. instances=0), and enabling prebuilds is as simple as defining the number of instances.
Persistence
The above coder_workspace_preset resources will be captured during the template import process and inserted into the database. Each template version will have its own associated preset entries.
Prebuilds themselves can be stored in the workspaces table; they are workspaces after all. Prebuilds will be identified only by their ownership. If they are owned by the prebuilds user, then they are by definition a prebuild.
It’s important to note that presets are stored against a template_version.
Matching Logic
When a user requests a new workspace and a preset is chosen, the UUID of the chosen preset is used to compare against any available prebuilds which also use that preset UUID.
A prebuild will ONLY be considered available if its lifecycle_state is ready and its preset UUID matches.
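The matching rule above can be sketched in a few lines. This is an illustrative model, not the actual implementation: the `Prebuild` fields and the `"ready"` state string are assumptions standing in for the real schema.

```python
from dataclasses import dataclass
from typing import Optional
from uuid import UUID, uuid4

# Illustrative model; field names are assumptions, not the real schema.
@dataclass
class Prebuild:
    id: UUID
    preset_id: UUID
    lifecycle_state: str  # e.g. "provisioning", "ready"

def find_available_prebuild(prebuilds: list[Prebuild], preset_id: UUID) -> Optional[Prebuild]:
    """Return a prebuild that is ready AND was built from the requested preset."""
    for pb in prebuilds:
        if pb.lifecycle_state == "ready" and pb.preset_id == preset_id:
            return pb
    return None  # caller falls back to synchronous provisioning
```

If no match is found, the request proceeds through the normal synchronous build path (see Provisioning).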
Invalidation
New workspaces always use the latest template version. Therefore when a new template version is promoted to the active version, all existing prebuilds must be destroyed.
The proposed usage above shows that an invalidate_after_secs attribute can be set. The use-case for this is for workspaces which clone a monorepo: incremental updates (i.e. delta between prebuilt state and current state) will work up to a certain point, but after a certain period of time it might be preferable to just build a new prebuild.
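The invalidate_after_secs check amounts to a simple age comparison, evaluated during reconciliation. A minimal sketch (function name and epoch-based timestamps are assumptions):

```python
import time
from typing import Optional

def is_stale(created_at_epoch: float, invalidate_after_secs: int,
             now: Optional[float] = None) -> bool:
    """A prebuild older than invalidate_after_secs should be destroyed
    and replaced by a fresh one on the next reconciliation pass."""
    if now is None:
        now = time.time()
    return (now - created_at_epoch) > invalidate_after_secs
```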
We could also expose an API to invalidate all prebuilds for a preset if operators need that degree of control; i.e. a new AMI is built.
Provisioning
A nice property of our current design is that if no prebuilds are available, a new workspace will be provisioned synchronously. Failing to build prebuilds will not block users, it’ll just fall back to the existing behavior of imperative provisioning of workspace resources (graceful degradation).
Reconciliation Loop
We will build a reconciliation loop which will reconcile all templates’ prebuilds.
This needs to be triggered under the following scenarios:

- A new active template version is chosen, leading to existing prebuilds being invalidated
- A workspace build completes (which may have used a prebuild)
- A new Autoscaling schedule becomes active (i.e. now is within the crontab expression)
- An Invalidation event occurs
- coderd startup
- Periodically (i.e. every 15s)
The control loop will invoke a reconciliation of template states on an interval, but can also be “nudged” when the above scenarios occur to reduce waiting time.
Once the number of desired vs actual prebuilds for the given template is determined, this mechanism will enqueue a number of provisioner jobs to either create new, or destroy outdated/extraneous, prebuilds to satisfy the desired count.
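The desired-vs-actual calculation can be sketched as follows. This is a hypothetical per-preset reconciliation, not the actual implementation: outdated prebuilds (e.g. built from a superseded template version) are always destroyed, and the remainder is topped up or trimmed toward the desired count.

```python
from dataclasses import dataclass

@dataclass
class ReconcileActions:
    create: int   # new prebuild provisioner jobs to enqueue
    destroy: int  # extraneous/outdated prebuilds to tear down

def reconcile(desired: int, current: int, outdated: int) -> ReconcileActions:
    """Compute the provisioner jobs needed to converge one preset's prebuilds.

    desired:  instance count from the preset (possibly adjusted by autoscaling)
    current:  valid prebuilds that exist or are being built
    outdated: prebuilds invalidated by a template version change etc.
    """
    create = max(0, desired - current)
    destroy = outdated + max(0, current - desired)
    return ReconcileActions(create=create, destroy=destroy)
```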
NOTE: We need to use an advisory lock (per template) when performing this reconciliation, because multiple coderd replicas could attempt to perform it simultaneously.
Ownership
We will create a “Prebuild Owner” user and have it own all prebuilt workspaces.
- This user MUST be excluded from user listing APIs
- This user's workspaces (i.e. prebuilds) MUST be excluded from workspace listing APIs
  - We will need specific APIs for prebuilds
- This user MUST NOT count towards a license seat
We will build a mechanism to “claim” a prebuild. Prebuilds are workspaces, except they are owned by the prebuilds user; in fact, this is all that defines a prebuild. Once a prebuild is matched, it will be atomically assigned to the requestor.
No advisory lock is needed for this action; SELECT ... FOR UPDATE SKIP LOCKED will protect a prebuild from being eligible for assignment to multiple users simultaneously.
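A claim could be implemented as a single query along these lines; the table is the `workspaces` table described under Persistence, but the column names (`preset_id`, `lifecycle_state`) and parameter positions are assumptions, not the real schema:

```sql
-- Atomically claim one ready prebuild for the requesting user.
-- SKIP LOCKED ensures two concurrent claims never select the same row.
UPDATE workspaces
SET    owner_id = $2              -- the requesting user
WHERE  id = (
    SELECT id
    FROM   workspaces
    WHERE  owner_id = $1          -- the special prebuilds user
    AND    preset_id = $3         -- the preset chosen by the requester
    AND    lifecycle_state = 'ready'
    LIMIT  1
    FOR UPDATE SKIP LOCKED
)
RETURNING id;
```

If the query returns no row, no prebuild was available and the request falls through to synchronous provisioning.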
Build Phases
Each workspace will have 3 workspace builds ("phases").

1st phase: provisioning of the prebuild itself. This will require us to stub out identity datasources (see Constraints).
- This phase is entirely asynchronous and is not involved in the workspace creation process.
- The Reconciliation Loop will reconcile the state, and at this point a new prebuild provisioning attempt will be triggered.

2nd phase: workspace build following the workspace creation request. If an available prebuild is matched (see Matching Logic), the ownership (i.e. owner_id field in the workspaces table) will be atomically changed to the initiator of the request.
- This & the subsequent phase MUST occur synchronously in the workspace creation process.

3rd phase: prepare the workspace using the new ownership identity. We will invoke another terraform apply, but now the identity datasources will have legitimate values injected, which may cause some resources to be modified (see Failure Modes). Once the build succeeds, we will need to reinitialise the agent on the prebuilt workspace with a new (updated) manifest. See Agent Reinitialization for more details.
- If this phase fails, the workspace build will need to be manually retried.
- We MAY need an API and/or UI here to allow a workspace to have another start transition initiated, since we don't really want to retry via stop → start, as this would destroy and recreate all workspace resources, obviating the point of prebuilds.
- The agent MUST be instructed to reinitialize whenever a start is initiated on an already running workspace.
Failure Modes
Should the 1st phase fail, the Reconciliation Loop will leave these prebuilds in their failed state. We don’t want to provision potentially many additional resources by retrying, so an operator will either need to manually restart the prebuild (via normal workspace controls) or delete it; the latter case will cause a new prebuild to be provisioned.
The 2nd phase will occur atomically; if it fails for whatever reason, the prebuild will still be available for claiming later.
If the 3rd phase fails, the workspace build will need to be manually retried; at this point it is technically no longer a prebuild, and will not be under the purview of the Reconciliation Loop.
Conditionalized Templates & Startup Scripts
Operators may require a way to conditionalize how a template behaves when it’s provisioning a prebuild vs a regular build.
Currently we use a start_count value on the coder_workspace datasource to discriminate between a start and stop transition. Similarly, we will expose a prebuild_count attribute on the coder_workspace resource (remember, a prebuild is a workspace) which will be set to 1 when building the prebuild in phase 1.
For example, a template admin could choose to only execute a script on the prebuild:
```hcl
data "coder_workspace" "me" {}

resource "coder_script" "script1" {
  # prebuild_count will only be 1 during prebuild provisioning
  count        = data.coder_workspace.me.prebuild_count
  agent_id     = coder_agent.dev1.id
  display_name = "Foobar Script 1"
  script       = "echo foobar 1"
  run_on_start = true
}
```
Startup scripts can also be defined in the coder_agent resource, and these cannot take advantage of the count technique above. To ameliorate this limitation, we will need to support a new prebuild_startup_script field. We don't need to define a prebuild_startup_script_behavior equivalent, because SSH, which that behavior interacts with, will be disabled on prebuilds.
Agent Reinitialization
The agent will need to reinitialize once it has been assigned a new identity (and possibly some of its attributes are updated like env or startup scripts).
Once build phase 3 completes, the agent will need to be notified that its manifest has been updated. The agent API will be notified via pubsub (on a per-workspace channel), and will then push an update to the agent.
Once the agent receives its new manifest, it will use it to reinitialize itself.
Observability
We should expose Prometheus metrics for (with partitioning in brackets):
- counter of prebuilds created (preset_name, template_name) → collected
- gauge of desired prebuilds (preset_name, template_name) → collected
- gauge of actual prebuilds (preset_name, template_name) → collected
- counter of failed prebuilds (preset_name, template_name, reason) → collected
- counter of claimed prebuilds (preset_name, template_name, user_id) → collected
- counter of presets used (preset_name, template_name) → collected
- counter of workspace builds which DID NOT match a prebuild, but could have (preset_name, template_name, user_id)
  - i.e. there was no prebuild available at the time
Autoscaling
Given that prebuilt instances will be consuming (potentially very expensive) cloud resources, operators will need a mechanism to scale them to 0 outside working hours.
For the initial phase, we will expose an autoscaling field under coder_workspace_preset:
```hcl
data "coder_workspace_preset" "us-nix" {
  ...
  prebuilds = {
    instances = 0 # default to 0 instances

    autoscaling = {
      # only a single timezone may be used, for simplicity
      timezone = "UTC"

      # scale to 3 instances during the work week
      schedule {
        cron      = "* 8-18 * * 1-5" # from 8AM-6PM, Mon-Fri, UTC
        instances = 3
      }

      # scale to 1 instance on Saturdays for urgent support queries
      schedule {
        cron      = "* 8-14 * * 6" # from 8AM-2PM, Sat, UTC
        instances = 1
      }
    }
  }
}
```
The solution above is designed to mirror the autostart scheduling.
The crontab format will already be familiar to operators, and will be intuitive to understand. The design above allows operators to specify a default number of instances, and then to scale that number dynamically based on one or more schedules. This can either be used to start from 0 and scale up (as the example above demonstrates), or the inverse; whichever the operator prefers.
This design would also allow for validation at template import time, where we will detect scheduling conflicts (i.e. if multiple schedules overlap and produce different values).
This will require a simple ticker to evaluate when the current time matches the crontab expression of a schedule, and to trigger the appropriate reconciliation in the Reconciliation Loop.
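The ticker's evaluation step can be sketched as below. This is a deliberately simplified matcher handling only the field shapes used in the example above (`*`, a single number, or an `a-b` range in the hour and day-of-week positions); a real implementation would use a full cron parser.

```python
from datetime import datetime, timezone

def _field_matches(field: str, value: int) -> bool:
    """Match one crontab field: '*', a single number, or a range 'a-b'."""
    if field == "*":
        return True
    if "-" in field:
        lo, hi = map(int, field.split("-"))
        return lo <= value <= hi
    return int(field) == value

def instances_for(now: datetime, default: int, schedules: list) -> int:
    """Return the instance count of the first matching (cron, instances)
    schedule, else the preset's default.

    Only the hour and day-of-week fields are evaluated here.
    Cron weekdays: 0=Sunday..6=Saturday; Python: Monday=0..Sunday=6.
    """
    cron_dow = (now.weekday() + 1) % 7
    for cron, instances in schedules:
        _minute, hour, _dom, _month, dow = cron.split()
        if _field_matches(hour, now.hour) and _field_matches(dow, cron_dow):
            return instances
    return default
```

On each tick (and on the "schedule becomes active" trigger), the result would be fed into the Reconciliation Loop as the desired instance count for that preset.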
Constraints

1. An infinite set of template configurations is possible with Terraform
2. Almost all templates use the coder_workspace and coder_workspace_owner ("identity") data-sources, both of which rely on a workspace being owned by a user
3. Templates can be customised in non-deterministic ways through coder_parameters
In order to allow prebuilding of workspaces, we have to side-step constraint 2. Consider the following snippet from this template:
If we were to create a prebuilt workspace, what would we provide to the data.coder_workspace_owner.me.name and data.coder_workspace.me.name values? Changing this name attribute forces a replacement of the resource, and therefore makes the prebuild irrelevant.
To counteract this, we will:

- Inject known "stub" values into the above data-sources before a real identity is associated with the workspace
  - data.coder_workspace_owner.me.name: coder_prebuild_owner_${UUID}
  - data.coder_workspace.me.name: coder_prebuild_${UUID}
  - These values have to be human-readable since the provisioned resources will retain them in their names, visible via the cloud console
- Create/reuse a linter which can detect known-bad values for name, and show a warning to the template author
  - name is not the only attribute which can cause a replacement; each provider and each resource has its own behavior. Consequently, we will need to add provider-specific checks for other resource attributes to further assist template authors
  - we likely just need to cover the major compute resources of the major cloud & orchestration (i.e. k8s/nomad) providers
  - later on we can catch all possible cases: expand the template import process to detect when a resource will be replaced during the second build phase (i.e. once a workspace has been assigned to a user)
  - To achieve this, we could either use tfsec's custom checks, or query the plan file using JMESPath expressions

Workarounds for existing templates: adding an ignore_changes lifecycle block will instruct Terraform to disregard changes to such an attribute, so the ownership transfer no longer forces a replacement.
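As a minimal sketch of the ignore_changes workaround (the resource type and name pattern here are illustrative, not taken from a specific template):

```hcl
resource "docker_container" "workspace" {
  # Without ignore_changes, renaming on ownership transfer would force
  # a replacement and defeat the purpose of the prebuild.
  name = "coder-${data.coder_workspace_owner.me.name}-${data.coder_workspace.me.name}"

  lifecycle {
    ignore_changes = [name]
  }
}
```

The trade-off is that the resource name keeps its stubbed value after the claim; templates that need the real owner's name in the resource itself would have to surface it another way.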
Onboarding
We added Workspace Build Timings, which provided insight into speed issues, but it didn't offer any solutions to this particular problem.
We could use the timings graph to prompt users to try prebuilds.
Infrastructure Cost Concerns
Prebuilds will increase infrastructure spend, and we have to make that trade-off known to customers. Initially we can just highlight this in the documentation, but later we might want to provide a calculator to determine if prebuilds are worth the cost.