# Getting Started With Serverless

> Learn how to get started with Vast.ai Serverless. Understand the prerequisites, setup process, and how to use the serverless engine.

<script
  type="application/ld+json"
  dangerouslySetInnerHTML={{
__html: JSON.stringify({
"@context": "https://schema.org",
"@type": "HowTo",
"name": "How to Get Started with Vast.ai Serverless",
"description": "A comprehensive tutorial on setting up a vLLM + Qwen3-8B Serverless Engine on Vast.ai, from configuring environment variables to load testing your endpoint.",
"step": [
  {
    "@type": "HowToStep",
    "name": "Configure User Environment Variables",
    "text": "Navigate to the user account settings page at cloud.vast.ai/account and drop down the 'Environment Variables' tab. Add 'HF_TOKEN' in the Key field and your HuggingFace read-access token in the Value field. Click the '+' button and then 'Save Edits'."
  },
  {
    "@type": "HowToStep",
    "name": "Prepare a Template for Workers",
    "text": "Navigate to the Templates Page at cloud.vast.ai/templates, select the Serverless filter, and click Edit on the 'vLLM + Qwen/Qwen3-8B (Serverless)' template. In the Environment Variables section, set MODEL_NAME to your desired model. Set the template to Private and click Save & Use. Copy the template hash for CLI usage."
  },
  {
    "@type": "HowToStep",
    "name": "Create The Endpoint",
    "text": "Navigate to the Serverless Page at cloud.vast.ai/serverless and click Create Endpoint. Configure parameters: endpoint_name (the name), cold_mult (multiplier for future load prediction, default 2 for LLMs), min_load (baseline tokens/second, default 100 for LLMs), target_util (percentage of compute resources in-use, default 0.9), max_workers (maximum number of workers), and cold_workers (minimum workers kept ready). Click Create."
  },
  {
    "@type": "HowToStep",
    "name": "Create a Workergroup",
    "text": "From the Serverless page, click '+ Workergroup' under the Endpoint. Your custom vLLM template should be selected. Enter values: Cold Multiplier = 3, Minimum Load = 1, Target Utilization = 0.9, Workergroup Name = 'Workergroup', and Select Endpoint = 'vLLM-Qwen3-8B'. Click Create. The serverless engine will automatically find offers and create instances."
  },
  {
    "@type": "HowToStep",
    "name": "Wait for First Ready Worker",
    "text": "Monitor the workers in the Serverless section of the Vast.ai console. Workers will download the Qwen3-8B model and initialize. When a worker finishes benchmarking (Curr. Performance is non-zero) and status becomes 'Ready', the serverless engine is ready to receive requests."
  },
  {
    "@type": "HowToStep",
    "name": "Test the Serverless Engine with Client",
    "text": "Clone the PyWorker repository from GitHub. Find your Serverless API key using 'vastai show endpoints' command. Optionally install the TLS certificate. Run the client with: 'python -m workers.openai.client --chat-stream --endpoint <ENDPOINT_NAME> --model <MODEL_NAME>' to test completions."
  }
]
})
}}
/>

<Warning>
  For users not familiar with Vast.ai's Serverless engine, we recommend starting with the [Serverless Architecture documentation](/documentation/serverless/architecture). It will be helpful in understanding how the system operates, processes requests, and manages resources.
</Warning>

# Overview & Prerequisites

Vast.ai provides pre-made serverless templates ([vLLM](/documentation/serverless/vllm), [ComfyUI](/documentation/serverless/comfy-ui)) for popular use cases that can be used with minimal setup effort. In this guide, we will set up a serverless engine that handles inference requests to the Qwen3-8B model through vLLM, using the pre-made Vast.ai vLLM serverless template. This prebuilt template bundles vLLM with scaling logic so you don't have to write custom orchestration code. By the end of this guide, you will be able to host the Qwen3-8B model with dynamic scaling to meet your demand.

<Note>
  This guide assumes knowledge of the Vast CLI. An introduction for it can be found [here](/cli/get-started).
</Note>

Before we start, there are a few things you will need:

1. A Vast.ai account with credits
2. A Vast.ai [API Key](/documentation/reference/keys)
3. A HuggingFace account with a [read-access API token](https://huggingface.co/docs/hub/en/security-tokens)

# Setting Up a vLLM + **Qwen3-8B** Serverless Engine

<Steps>
  <Step title="Configure User Environment Variables">
    Navigate to the [user account settings page](https://cloud.vast.ai/account/) and expand the "Environment Variables" tab. In the Key field, enter "HF\_TOKEN", and in the Value field, enter your HuggingFace read-access token. Click the "+" button to the right of the fields, then click "Save Edits".

    <img src="https://mintcdn.com/vastai-80aa3a82-ltxv2-serverless/JhQSAfRh8QywTqSv/images/getting-started-serverless.webp?fit=max&auto=format&n=JhQSAfRh8QywTqSv&q=85&s=9c18afe327dcce2f2e38eb143840c39c" alt="" width="1034" height="1129" data-path="images/getting-started-serverless.webp" />
  </Step>

  <Step title="Prepare a Template for our Workers">
    Templates encapsulate all the information required to run an application on a GPU worker, including machine parameters, docker image, and environment variables.

    Navigate to the [Templates Page](https://cloud.vast.ai/templates/), select the Serverless filter, and click the Edit button on the 'vLLM + Qwen/Qwen3-8B (Serverless)' template.

    In the Environment Variables section, "Qwen/Qwen3-8B" is the default value for `MODEL_NAME`, but it can be changed to any vLLM-compatible model on HuggingFace. Set this template to Private and click Save & Use.

    <img src="https://mintcdn.com/vastai-80aa3a82-ltxv2-serverless/JhQSAfRh8QywTqSv/images/getting-started-serverless-2.webp?fit=max&auto=format&n=JhQSAfRh8QywTqSv&q=85&s=fc7873a880f0aa733e76c8314dbafbbc" alt="" width="1006" height="1212" data-path="images/getting-started-serverless-2.webp" />

    <Check>
      The template will now work without any further edits, but can be customized to suit specific needs. Vast recommends keeping the template private to avoid exposing any private information publicly.
    </Check>

    We should now see the Vast.ai search page with the template selected. If you intend to use the Vast CLI, click More Options on the template and select 'Copy template hash'. We will use this hash in step 4, when creating the Workergroup.

    <img src="https://mintcdn.com/vastai-80aa3a82-ltxv2-serverless/JhQSAfRh8QywTqSv/images/getting-started-serverless-3.webp?fit=max&auto=format&n=JhQSAfRh8QywTqSv&q=85&s=0c5ebf016fa5a7ff8729f7bbd3892151" alt="" width="1280" height="1200" data-path="images/getting-started-serverless-3.webp" />
  </Step>

  <Step title="Create The Endpoint">
    Next, we will create an Endpoint that any user can query for generation. This can be done through the Web UI or the Vast CLI. Here, we'll create an Endpoint named 'vLLM-Qwen3-8B'.

    <Tabs>
      <Tab title="Web UI">
        Navigate to the [Serverless Page](https://cloud.vast.ai/serverless/) and click Create Endpoint. A screen to create a new Endpoint will pop up, with default values already assigned. Our Endpoint will work with these default values, but you can change them to suit your needs.

        <img src="https://mintcdn.com/vastai-80aa3a82-ltxv2-serverless/JhQSAfRh8QywTqSv/images/getting-started-serverless-4.webp?fit=max&auto=format&n=JhQSAfRh8QywTqSv&q=85&s=31ec2cf89fbb7937adafb04966bbbc8f" alt="" width="800" height="1210" data-path="images/getting-started-serverless-4.webp" />

        * `endpoint_name`: The name of the Endpoint.
        * `cold_mult`: The multiple of the current load that is used to predict the future load. For example, if we currently have 10 users, but expect there to be 20 in the near future, we can set `cold_mult = 2`.
          * For LLMs, a good default is 2.
        * `min_load`: The baseline amount of load (tokens / second for LLMs) we want the Endpoint to be able to handle.
          * For LLMs, a good default is 100.0
        * `target_util`: The fraction of the Endpoint compute resources that we want to be in use at any given time. A lower value allows for more slack, which means the Endpoint will be less likely to be overwhelmed if there is a sudden spike in usage.
          * For LLMs, a good default is 0.9
        * `max_workers`: The maximum number of workers the Endpoint can have at any one time.
        * `cold_workers`: The minimum number of workers kept "cold" (meaning stopped but fully loaded with the image) when the Endpoint has no load. Having cold workers available allows the Serverless system to seamlessly spin up more workers as load increases.

        Click Create. You will be taken back to the Serverless page, and after a few moments the Endpoint will show up with the name 'vLLM-Qwen3-8B'.
      </Tab>

      <Tab title="Vast CLI">
        If your machine is properly configured for the Vast CLI, you can run the following command:

        ```cli CLI Command theme={null}
        vastai create endpoint --endpoint_name "vLLM-Qwen3-8B" --cold_mult 1.0 --min_load 100 --target_util 0.9 --max_workers 20 --cold_workers 5
        ```

        * `endpoint_name`: The name you use to identify your Endpoint.
        * `cold_mult`: The multiple of your current load that is used to predict your future load. For example, if you currently have 10 users, but expect there to be 20 in the near future, you can set `cold_mult = 2.0`.
          * For LLMs, a good default is 2.0
        * `min_load`: The baseline amount of load (tokens / second for LLMs) you want your Endpoint to be able to handle.
          * For LLMs, a good default is 100.0
        * `target_util`: The fraction of your Endpoint compute resources that you want to be in use at any given time. A lower value allows for more slack, which means your Endpoint will be less likely to be overwhelmed if there is a sudden spike in usage.
          * For LLMs, a good default is 0.9
        * `max_workers`: The maximum number of workers your Endpoint can have at any one time.
        * `cold_workers`: The minimum number of workers you want to keep "cold" (meaning stopped and fully loaded) when your Endpoint has no load.

        If the Endpoint is created successfully, the terminal output should include `'success': True`.
      </Tab>
    </Tabs>
  </Step>

  <Step title="Create a Workergroup">
    Now that we have our Endpoint, we can create a Workergroup with the template we prepared in step 2.

    <Tabs>
      <Tab title="Web UI">
        From the Serverless page, click '+ Workergroup' under the Endpoint. Our custom vLLM (Serverless) template should already be selected. To confirm, click the Edit button and check that the `MODEL_NAME` environment variable is filled in.

        For our simple setup, we can enter the following values:

        * Workergroup Name = 'Workergroup'
        * Select Endpoint = 'vLLM-Qwen3-8B'

        A complete page should look like the following:

        <img src="https://mintcdn.com/vastai-80aa3a82-ltxv2-serverless/JhQSAfRh8QywTqSv/images/getting-started-serverless-5.webp?fit=max&auto=format&n=JhQSAfRh8QywTqSv&q=85&s=c9007d3fb02e28e2073d46295f63553e" alt="" width="943" height="1143" data-path="images/getting-started-serverless-5.webp" />

        After entering the values, click Create. You will be taken back to the Serverless page, and after a moment the Workergroup will be created under the 'vLLM-Qwen3-8B' Endpoint.
      </Tab>

      <Tab title="Vast CLI">
        Run the following command to create your Workergroup:

        ```sh CLI Command theme={null}
        vastai create workergroup --endpoint_name "vLLM-Qwen3-8B" --template_hash "$TEMPLATE_HASH" --gpu_ram 24
        ```

        * `endpoint_name`: The name of the Endpoint.
        * `template_hash`: The hash code of our custom vLLM (Serverless) template.
        * `gpu_ram`: The amount of memory (in GB) that you expect your template to load onto the GPU (i.e. model weights).

        <Warning>
          You will need to replace "\$TEMPLATE\_HASH" with the template hash copied in step 2.
        </Warning>
      </Tab>
    </Tabs>

    Once the Workergroup is created, the serverless engine will automatically find offers and create instances. Finding appropriate GPU workers may take roughly 10-60 seconds.

    <Tabs>
      <Tab title="Web UI">
        To see the instances the system creates, click the 'View detailed stats' button on the Workergroup. Five workers should start up, showing the 'Loading' status:

        <img src="https://mintcdn.com/vastai-80aa3a82-ltxv2-serverless/JhQSAfRh8QywTqSv/images/getting-started-serverless-6.webp?fit=max&auto=format&n=JhQSAfRh8QywTqSv&q=85&s=d50e6ca2ab151a2cae59841086223744" alt="" width="1280" height="206" data-path="images/getting-started-serverless-6.webp" />
      </Tab>

      <Tab title="Vast CLI">
        To see the instances the autoscaler creates, run the following command:

        ```sh CLI Command theme={null}
        vastai show instances
        ```
      </Tab>
    </Tabs>
  </Step>

  <Step title="Getting The First Ready Worker">
    Now that we have created both the Endpoint and the Workergroup, all that is left to do is wait for the first "Ready" worker. We can see the status of the workers in the Serverless section of the Vast.ai console. The workers will automatically download the Qwen3-8B model defined in the template, but they will take time to fully initialize. A worker is loaded and benchmarked when its `Curr. Performance` value is non-zero.

    When a worker has finished benchmarking, its status in the Workergroup will become Ready. A /route/ call to the Workergroup can now succeed, and we can send it requests!

    <img src="https://mintcdn.com/vastai-80aa3a82-ltxv2-serverless/JhQSAfRh8QywTqSv/images/getting-started-serverless-7.webp?fit=max&auto=format&n=JhQSAfRh8QywTqSv&q=85&s=f8916a93b0f9e462d3c65b26ab95407c" alt="" width="800" height="1107" data-path="images/getting-started-serverless-7.webp" />
  </Step>
</Steps>

We have now successfully created a vLLM + Qwen3-8B Serverless Engine! It is ready to receive user requests and will automatically scale up or down to meet the request demand. In the next section, we will set up a client to test the serverless engine and learn how to use the core serverless endpoints along the way.
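To build intuition for how the Endpoint parameters from step 3 interact, here is a rough sketch in Python. The formula is an assumption made for illustration, not Vast.ai's actual autoscaling algorithm: it predicts load as the larger of `current_load * cold_mult` and `min_load`, then divides by `target_util` to leave headroom. The function names and the `per_worker_perf` argument are hypothetical.

```python
import math


def target_capacity(current_load, cold_mult, min_load, target_util):
    """Illustrative capacity target in load units (tokens/second for LLMs).

    Assumed formula, not the real autoscaler: predict future load from the
    current load, never drop below the baseline, then add slack so that only
    `target_util` of the capacity is expected to be in use.
    """
    if not 0 < target_util <= 1:
        raise ValueError("target_util must be in (0, 1]")
    predicted_load = max(current_load * cold_mult, min_load)
    return predicted_load / target_util


def workers_needed(capacity, per_worker_perf, max_workers):
    """Workers required to reach `capacity`, capped at `max_workers`.

    `per_worker_perf` is a hypothetical per-worker throughput in the same
    load units as `capacity`.
    """
    if per_worker_perf <= 0:
        raise ValueError("per_worker_perf must be positive")
    return min(max_workers, math.ceil(capacity / per_worker_perf))


if __name__ == "__main__":
    capacity = target_capacity(
        current_load=90.0, cold_mult=2.0, min_load=100.0, target_util=0.9
    )
    print(capacity, workers_needed(capacity, per_worker_perf=50.0, max_workers=20))
```

Under this simplified model, lowering `target_util` or raising `cold_mult` increases the capacity target, while `min_load` sets a floor when the Endpoint is idle and `max_workers` caps the result.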

***

# Using the Serverless Engine

To make requests to your endpoint, first install the `vastai_sdk` from pip.

```sh Bash theme={null}
pip install vastai_sdk
```

Make sure you have configured the `VAST_API_KEY` environment variable with your Serverless API key.
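A small check like the following can catch a missing key before any request is made. It is a minimal sketch that uses only the Python standard library; the helper name `require_api_key` is hypothetical and not part of the SDK.

```python
import os


def require_api_key(env=None, name="VAST_API_KEY"):
    """Return the key stored under `name`, or raise if it is missing or blank.

    `env` defaults to the process environment; passing a dict makes the
    helper easy to test.
    """
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export your Serverless API key first.")
    return key


if __name__ == "__main__":
    try:
        require_api_key()
        print("VAST_API_KEY is set.")
    except RuntimeError as err:
        print(err)
```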

## API Keys

When you create a Serverless endpoint group, it is assigned a special API key specifically for Serverless. This key is unique to your account and is used for all calls to the Serverless engine. It is different from a standard Vast.ai API key and only works with Serverless endpoint groups.

### Where to find a Serverless API key:

Use the Vast CLI to find a Serverless API key.

```cli CLI Command theme={null}
vastai show endpoints
```

The `show endpoints` command will return a JSON blob like this:

```javascript Javascript icon="js" theme={null}
{
  "api_key": "952laufhuefiu2he72yhewikhf28732873827uifdhfiuh2ifh72hs80a8s728c699s9",
  "cold_mult": 2.0,
  "cold_workers": 3,
  "created_at": 1755115734.0841732,
  "endpoint_name": "vLLM-Qwen3-8B",
  "endpoint_state": "active",
  "id": 1234,
  "max_workers": 5,
  "min_load": 10.0,
  "target_util": 0.9,
  "user_id": 123456
}
```
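If you save that output as text, the key can be pulled out programmatically. The sketch below assumes the output is valid JSON and is either a single object like the one above or a list of such objects; the helper name `find_api_key` is hypothetical.

```python
import json


def find_api_key(raw_json, endpoint_name):
    """Return the `api_key` of the endpoint named `endpoint_name`, or None.

    Accepts either one endpoint object or a list of endpoint objects
    (the list form is an assumption about multi-endpoint output).
    """
    data = json.loads(raw_json)
    endpoints = data if isinstance(data, list) else [data]
    for endpoint in endpoints:
        if endpoint.get("endpoint_name") == endpoint_name:
            return endpoint.get("api_key")
    return None


if __name__ == "__main__":
    sample = '{"api_key": "example-key", "endpoint_name": "vLLM-Qwen3-8B", "id": 1234}'
    print(find_api_key(sample, "vLLM-Qwen3-8B"))
```

The returned value is what you would export as `VAST_API_KEY`.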

## Usage

Create a Python script to send a request to your endpoint:

```python icon="python" Python theme={null}
from vastai import Serverless
import asyncio

async def main():
    async with Serverless() as client:
        # Use the name of the Endpoint created earlier in this guide
        endpoint = await client.get_endpoint(name="vLLM-Qwen3-8B")

        payload = {
            "input" : {
                "model": "Qwen/Qwen3-8B",
                "prompt" : "Who are you?",
                "max_tokens" : 100,
                "temperature" : 0.7
            }
        }
        
        response = await endpoint.request("/v1/completions", payload)
        print(response["response"]["choices"][0]["text"])

if __name__ == "__main__":
    asyncio.run(main())
```
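The payload construction and response parsing in the script above can be factored into two small helpers, which makes them easy to reuse and test without a live endpoint. This is a sketch that mirrors the payload and response shapes shown above; the helper names are hypothetical and not part of the SDK.

```python
def build_completion_payload(prompt, model="Qwen/Qwen3-8B", max_tokens=100, temperature=0.7):
    """Build a request body in the same shape as the payload used above."""
    return {
        "input": {
            "model": model,
            "prompt": prompt,
            "max_tokens": max_tokens,
            "temperature": temperature,
        }
    }


def extract_text(response):
    """Return the first completion's text, or None if the shape is unexpected."""
    try:
        return response["response"]["choices"][0]["text"]
    except (KeyError, IndexError, TypeError):
        return None


if __name__ == "__main__":
    payload = build_completion_payload("Who are you?")
    print(payload["input"]["prompt"])
    # A mocked response in the shape the script above indexes into
    print(extract_text({"response": {"choices": [{"text": "Hello!"}]}}))
```

Returning `None` instead of raising lets the caller decide how to handle an error response or an empty `choices` list.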

This is everything you need to start a vLLM + Qwen3-8B Serverless engine! There are other pre-made Vast [serverless templates](/documentation/templates/quickstart), like the ComfyUI Image Generation model, that can be set up in a similar fashion.
