One idea I’ve had kicking around for a bit is something like a stock exchange but for compute. Buyers (users) submit a computation with resource requirements, a container hash, and a price; sellers (providers) submit their computational capacities and prices; the exchange does some matching and does the money things.
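
To make the matching part a bit more concrete, here is a minimal sketch, not a real design. Everything in it is an assumption for illustration: the `Bid` and `Offer` shapes, pricing per CPU-hour, the greedy "highest bid takes the cheapest offer that fits" rule, and settling at the midpoint between bid and ask.

```python
from dataclasses import dataclass


@dataclass
class Bid:
    """Hypothetical buyer order: a job plus the most they will pay."""
    job_id: str
    container_hash: str
    cpus: int
    mem_gb: int
    max_price: float  # assumed unit: price per CPU-hour


@dataclass
class Offer:
    """Hypothetical seller order: spare capacity plus the least they accept."""
    provider_id: str
    cpus: int
    mem_gb: int
    ask_price: float  # assumed unit: price per CPU-hour


def match(bids, offers):
    """Greedy matching: highest bids go first, each takes the cheapest offer that fits."""
    # Track remaining capacity per provider as [cpus, mem_gb].
    free = {o.provider_id: [o.cpus, o.mem_gb] for o in offers}
    matches = []
    for bid in sorted(bids, key=lambda b: b.max_price, reverse=True):
        candidates = [
            o for o in offers
            if o.ask_price <= bid.max_price
            and free[o.provider_id][0] >= bid.cpus
            and free[o.provider_id][1] >= bid.mem_gb
        ]
        if not candidates:
            continue  # job stays unmatched in this round
        best = min(candidates, key=lambda o: o.ask_price)
        free[best.provider_id][0] -= bid.cpus
        free[best.provider_id][1] -= bid.mem_gb
        # Assumed settlement rule: split the difference between bid and ask.
        price = (bid.max_price + best.ask_price) / 2
        matches.append((bid.job_id, best.provider_id, price))
    return matches
```

A real exchange would presumably batch orders, handle time, and do something smarter than greedy, but this is the basic shape: two order books and a rule for pairing them up.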

This is one possible future for the stuff I’m doing with Program Explorer. I’m slogging through the last 10% until I have a v0 public release for that so maybe writing down some of these ideas will help me get there.

I don’t have a ton of prose to elaborate in a straightforward fashion (as if anything I write here is straightforward) so I’ll start with bullet points and that will either turn into a bigger post someday or stay a list forever.

  • cloud providers each have their own services for running containers, for example AWS ECS or Lambda, or Google Cloud Run
    • they don’t have a uniform API
    • they don’t support all the hardware configurations (like memory size caps) they offer for dedicated instances
  • cloud providers each have spot instances with a bid price
    • your instance can be interrupted (with warning) so you need logic to support being interrupted
    • you can’t place one bid between two providers and take the lowest/first (I think there are 3rd party versions of this)
  • there are only a handful of cloud providers, is this useful? would they adopt it?
    • they wouldn’t initially because they are happy having lock-in
    • I think there could be way more cloud providers than the big few and having a way for buyers to use them without even knowing who they are makes it possible for them to start getting work
  • how can I trust that the provider matched with my job is trustworthy?
    • idk hard problem
    • part of exchange duties would be to vet providers
    • lean on combination of TPM, attestations, SEV, etc.
  • right now the price of computation is set by some finance people at big cloud providers in a static way
    • this seems incredibly hard to get correct, or incredibly easy to over-price since you don’t want to lose
    • maybe they rely on bandwidth costs and stuff; wish I could look at their numbers…
  • the price of computation should be a fluctuating thing, just like a market
    • the price of electricity can change and will continue to change
    • the price of a FLOP/INSN (instruction count if we’re not AI focused on floating point) is changing all the time
    • the geopolitics and business politics of whether/when each next gen of chips ships, and how expensive it will be, put uncertainty on future pricing
  • there is a huge amount of old/older compute on the secondhand market
    • hard to price the value of the equipment currently because
    • with exchange data, could price the equipment based on compute/watt
  • users care about some combination of price and time
  • estimating resource requirements is a hard problem for users and developers for known containers
    • don’t want to overestimate memory or cpu otherwise we pay too much
    • users can’t be expected to know when they create a job how much they need
    • developers don’t always know a good rule of thumb or close overestimate, but sometimes they do
    • one possibility is to have some metadata / standard on how to invoke a container with the input files in “resource estimation mode”
    • another is to have a mode of “elastic” (though would AWS sue you?) execution where providers are expected to be able to migrate you around if you need more memory (up to some limit)
      • doesn’t exactly help with cpus, but you could also imagine an API for “hey please give me 10 more CPU cores” which would be kinda cool
  • some containers are public and providers could easily pull/cache them
    • I think cloud providers like to make you put any containers you want to run in your own registry; this does make sense but is also a pain
    • others are private and would require access keys to a registry (goes back to trust of provider)
  • where does the input to the program come from and where do the outputs go?
    • presigned urls from object stores (hello semi-standardized API for storage, where are you for compute?) are nice, but are per-object for reading or writing, so they don’t really scale well
      • ideally there would be presigned url for reading from a list of objects and/or prefix and writing to a prefix
      • one presigned url per output object is again bad UX b/c the user and/or developer need to know upfront how many outputs we produce
  • where can you view the status of your computation?
    • maybe exchange manages this
    • will need to keep record for money stuff anyways
  • this is all with the idea of batch type computation, can it work for services?
    • much much harder
  • if my data is in S3 but my job gets matched to azure, will I pay a fortune in egress?
    • yes
    • buyers likely need constraints they could specify, but having the exchange do something smart here too would be nice
    • hopefully we could somehow do away with egress costs one day
      • in some ways they are legit b/c it costs X pJ/bit/mile or whatever, but again if it’s a fixed price, then it is not getting priced accurately
  • what about GPUs?
    • yes those are important
    • this is where it seems like there are actually more successful small providers already
  • what about batch jobs that expect to communicate and care about locality?
    • ideally the hard problems of scheduling with constraints could be centralized in the exchange
  • providers could ideally just netboot machines from an exchange-provided OS (or custom if they prefer) and start collecting money
  • all resembles some kind of mega job/cluster scheduler and job/workflow/DAG scheduler
    • maybe exchange would support DAG definitions (see earlier post on container build systems); never really seen one I’ve liked
  • how should a user decide whether to run on aarch64 or amd64?
    • multi arch containers should signify you don’t care and run on whichever is cheaper
    • non obvious tradeoffs in compute time and cost though
  • if a job fails 90% of the way through a 1 hour run, who pays?
    • if it’s a hardware failure, should be the provider
    • if it’s a bug in the program, should be the developer (joke)
    • if it’s a bug by the user, should be the user
    • would be nice to have checkpointing either at the VM/container level or some metadata on how to checkpoint a given container (what signal to send and what file(s) are in your checkpoint)
  • how much batch computation market is there?
    • scientists
    • companies
    • lower than there could be because of how much friction there is to just run a damn thing
    • tool use by AI agents
  • is container enough to specify working environment?
    • something might require a certain kernel version
    • support bare metal? need really secure root of trust to make sure user doesn’t flash your mobo with a rootkit or whatever
    • some things like benchmarking would benefit from specifying specific CPU requirements
  • what about disk space for scratch?
  • can the provider cache files?
    • would be so nice to lean on a content addressable system here
  • what is the exchange’s cut?
    • fixed fee?
    • percentage?
  • what kinds of fairness can you provide or need to provide
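
As a rough sketch of the "who pays?" bullets above: the failure categories and payers come straight from the list, but the names (`Failure`, `PAYER`, `settle_failed_job`) and the idea of billing only for the time consumed before the failure are assumptions for illustration, not a worked-out policy.

```python
from enum import Enum


class Failure(Enum):
    HARDWARE = "hardware"
    PROGRAM_BUG = "program_bug"
    USER_ERROR = "user_error"


# Who eats the cost for each kind of failure, per the list above.
PAYER = {
    Failure.HARDWARE: "provider",
    Failure.PROGRAM_BUG: "developer",  # the list marks this one as a joke
    Failure.USER_ERROR: "user",
}


def settle_failed_job(price_per_hour, hours_elapsed, failure):
    """Return (payer, amount) for a job that died partway through.

    Assumption: only the compute consumed before the failure is billed.
    """
    return PAYER[failure], price_per_hour * hours_elapsed


# A 1 hour job that fails 90% of the way through because of hardware:
payer, amount = settle_failed_job(price_per_hour=2.0, hours_elapsed=0.9,
                                  failure=Failure.HARDWARE)
```

The hard part this skips entirely is deciding which category a failure actually falls into, which is presumably where the exchange's records and any checkpointing metadata would come in.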