Request transformations
Use LLM request transformations to dynamically compute and set fields in LLM requests using Common Expression Language (CEL) expressions. CEL is a simple expression language used throughout agentgateway to enable flexible configuration; CEL expressions can access request context, JWT claims, and other variables to make dynamic decisions. Transformations let you enforce policies such as capping token usage or conditionally modifying request parameters, without changing client code.
To learn more about CEL, see the Common Expression Language documentation.
Before you begin
Configure LLM request transformations
Create an AgentgatewayPolicy resource to apply an LLM request transformation. The following example limits `max_completion_tokens` to no more than 10. If the client requests fewer than 10 tokens, the requested number is applied. If the client requests more than 10 tokens, the maximum of 10 is applied.

```yaml
kubectl apply -f- <<EOF
apiVersion: agentgateway.dev/v1alpha1
kind: AgentgatewayPolicy
metadata:
  name: cap-max-tokens
  namespace: agentgateway-system
  labels:
    app: agentgateway
spec:
  targetRefs:
  - group: gateway.networking.k8s.io
    kind: HTTPRoute
    name: openai
  backend:
    ai:
      transformations:
      - field: max_completion_tokens
        expression: "min(llmRequest.max_completion_tokens, 10)"
EOF
```

| Setting | Description |
| --- | --- |
| `backend.ai.transformations` | A list of LLM request field transformations. |
| `field` | The name of the LLM request field to set. Maximum 256 characters. |
| `expression` | A CEL expression that computes the value for the field. Use the `llmRequest` variable to access the original LLM request body. Maximum 16,384 characters. |

ℹ️ You can specify up to 64 transformations per policy. Transformations take priority over `overrides` for the same field. If an expression fails to evaluate, the field is silently removed from the request.

Thinking budget fields, such as `reasoning_effort` and `thinking_budget_tokens`, can also be set or capped by using transformations. This way, operators can enforce reasoning limits centrally without requiring client changes. For example, use `"field": "reasoning_effort"` with the expression `"medium"` to cap all requests to medium reasoning effort regardless of what the client sends.

Verify that the AgentgatewayPolicy is accepted.
```shell
kubectl get AgentgatewayPolicy cap-max-tokens -n agentgateway-system -o jsonpath='{.status.ancestors[0].conditions[?(@.type=="Accepted")].status}'
```

Send a request with `max_completion_tokens` set to a value greater than 10. The transformation limits it to 10 before the request reaches the LLM provider. Verify that the `completion_tokens` value in the response is 10 or fewer and the `finish_reason` is set to `length`.

ℹ️ Some older OpenAI models use `max_tokens` instead of `max_completion_tokens`. If the transformation does not appear to take effect, check the model’s API documentation for the correct field name and update the transformation’s `field` value accordingly.

```shell
curl "$INGRESS_GW_ADDRESS/v1/chat/completions" \
  -H "content-type: application/json" \
  -d '{
    "model": "gpt-3.5-turbo",
    "max_completion_tokens": 5000,
    "messages": [
      {
        "role": "user",
        "content": "Tell me a short story"
      }
    ]
  }' | jq
```

Or, if you access the gateway on localhost:

```shell
curl "localhost:8080/v1/chat/completions" \
  -H "content-type: application/json" \
  -d '{
    "model": "gpt-3.5-turbo",
    "max_completion_tokens": 5000,
    "messages": [
      {
        "role": "user",
        "content": "Tell me a short story"
      }
    ]
  }' | jq
```

Example output:
```json
{
  "model": "gpt-3.5-turbo-0125",
  "usage": {
    "prompt_tokens": 12,
    "completion_tokens": 10,
    "total_tokens": 22,
    "completion_tokens_details": {
      "reasoning_tokens": 0,
      "audio_tokens": 0,
      "accepted_prediction_tokens": 0,
      "rejected_prediction_tokens": 0
    },
    "prompt_tokens_details": {
      "cached_tokens": 0,
      "audio_tokens": 0
    }
  },
  "choices": [
    {
      "message": {
        "content": "Once upon a time, in a small village nestled",
        "role": "assistant",
        "refusal": null,
        "annotations": []
      },
      "index": 0,
      "logprobs": null,
      "finish_reason": "length"
    }
  ],
  ...
}
```
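The reasoning-limit enforcement mentioned earlier (for fields such as `reasoning_effort`) follows the same policy shape. A minimal sketch, assuming the inner quotes are needed so that the expression evaluates as a CEL string literal rather than a variable reference:

```yaml
spec:
  targetRefs:
  - group: gateway.networking.k8s.io
    kind: HTTPRoute
    name: openai
  backend:
    ai:
      transformations:
      - field: reasoning_effort
        expression: '"medium"' # CEL string literal: every request is pinned to medium effort
```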
Inject LLM model information as response headers
Use CEL expressions to inject LLM model information as response headers. This strategy is useful for detecting silent fallbacks, where a request is redirected to a different model without the client being notified. However, this setup might not be suitable for streaming responses.
Inject model headers from request and response bodies
Parse the `model` field from the incoming request body and the upstream response body using `json()`, then inject them as response headers. This configuration lets you compare which model was requested against which model actually responded.

- `json(request.body).model`: Reads the `model` field from the incoming request body.
- `json(response.body).model`: Reads the `model` field from the upstream response body.
Create an AgentgatewayPolicy resource that targets the OpenAI provider’s HTTPRoute and injects the model fields as response headers.
```yaml
kubectl apply -f- <<EOF
apiVersion: agentgateway.dev/v1alpha1
kind: AgentgatewayPolicy
metadata:
  name: llm-model-headers
  namespace: agentgateway-system
  labels:
    app: agentgateway
spec:
  targetRefs:
  - group: gateway.networking.k8s.io
    kind: HTTPRoute
    name: openai
  traffic:
    transformation:
      response:
        set:
        - name: x-requested-model
          value: 'string(json(request.body).model)'
        - name: x-actual-model
          value: 'string(json(response.body).model)'
EOF
```

Send a chat completion request through the gateway and inspect the response headers.
```shell
curl -vi "http://$INGRESS_GW_ADDRESS/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-4", "messages": [{"role": "user", "content": "Hi"}]}'
```

Or, if you access the gateway on localhost:

```shell
curl -vi "http://localhost:8080/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-4", "messages": [{"role": "user", "content": "Hi"}]}'
```

Example output:
```
< HTTP/1.1 200 OK
< content-type: application/json
< x-requested-model: gpt-4
< x-actual-model: gpt-3.5-turbo-0125
...
```

Actual model values might differ slightly from the requested model, even if the same model is used. Some responses might include a unique identifier as part of the model name. In these circumstances, you might use the `contains()` function to verify.

When a fallback model handles the request, `x-actual-model` differs from `x-requested-model`:

```
< x-requested-model: gpt-4o
< x-actual-model: gpt-4o-mini
```

ℹ️ When sending traffic to the gateway with traffic compression enabled, such as `gzip` or `br`, the CEL expression could fail. If a header is missing from a response, try a different `accept-encoding` header in your request.
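If you prefer a single header that encodes the comparison directly, the `contains()` check can be computed in the transformation itself. A sketch with a hypothetical `x-model-match` header, assuming the standard CEL string `contains()` function and bool-to-string conversion are available:

```yaml
traffic:
  transformation:
    response:
      set:
      - name: x-model-match # hypothetical: "true" when the actual model name contains the requested one
        value: 'string(json(response.body).model.contains(json(request.body).model))'
```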
Cleanup
You can remove the resources that you created in this guide.

```shell
kubectl delete AgentgatewayPolicy cap-max-tokens -n agentgateway-system --ignore-not-found
kubectl delete AgentgatewayPolicy llm-model-headers -n agentgateway-system --ignore-not-found
```