Evaluations

EvaluationsResource

Methods

client.evaluations.create() ->

post/v5/evaluations

Create an evaluation together with its items, optionally running test criteria against them.

Accepts three request shapes: standalone (inline data), from an existing dataset (dataset_id with optional per-item references), or with a new reusable dataset created inline from data. When the evaluation includes tasks that require execution (for example, an LLM judge or a custom function), an async job and a Temporal workflow are started and the evaluation is returned immediately with status running; task results and error_count are populated asynchronously. When it includes only contributor tasks, taxonomy-only input, or no tasks at all, no workflow runs and the evaluation is returned with status completed. Optional tasks, metadata, tags, and taxonomy_params are persisted alongside the evaluation and its items.
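Because an evaluation with executable tasks comes back while it is still running, callers typically poll until it settles. The following is a minimal sketch, not SDK code: the helper name wait_for_evaluation, the interval, and the timeout are illustrative, and retrieve is assumed to be any callable that takes an evaluation ID and returns an object with a status attribute.

```python
import time
from typing import Any, Callable


def wait_for_evaluation(
    retrieve: Callable[[str], Any],
    evaluation_id: str,
    interval_s: float = 5.0,
    timeout_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Poll until the evaluation leaves the "running" status or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        evaluation = retrieve(evaluation_id)
        # The documented statuses are "failed", "completed", and "running".
        if evaluation.status != "running":
            return evaluation
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"evaluation {evaluation_id} still running after {timeout_s}s"
            )
        sleep(interval_s)
```

With the real client this could plausibly be called as wait_for_evaluation(client.evaluations.retrieve, evaluation.id), assuming retrieve accepts the evaluation ID as its first positional argument.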

client.evaluations.list() -> SyncCursorPage[Evaluation]

get/v5/evaluations

List evaluations for the account, with pagination.

Supports filtering by case-insensitive name substring and by tags; archived evaluations are excluded unless include_archived is set. Pass the tasks view to include each evaluation's task configurations in the response. Use this for simple name or tag lookups; to filter on metadata key-value pairs or status, use the filter endpoint instead.
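Cursor pagination can be followed by hand when a full listing is needed. This sketch works on plain dictionaries shaped like the example response (items, has_more). It assumes that the list endpoint accepts the same limit and starting_after parameters documented for the filter endpoint, and that the cursor is the ID of the last item on the page. iter_evaluations and fetch_page are hypothetical names.

```python
from typing import Any, Callable, Dict, Iterator, Optional


def iter_evaluations(
    fetch_page: Callable[..., Dict[str, Any]],
    limit: int = 100,
) -> Iterator[Dict[str, Any]]:
    """Yield every item by following the cursor until has_more is false."""
    starting_after: Optional[str] = None
    while True:
        kwargs: Dict[str, Any] = {"limit": limit}
        if starting_after is not None:
            kwargs["starting_after"] = starting_after
        page = fetch_page(**kwargs)
        items = page["items"]
        yield from items
        # Stop on the last page, or on an empty page to avoid looping forever.
        if not page.get("has_more") or not items:
            return
        # Assumption: the cursor is the ID of the last item on the page.
        starting_after = items[-1]["id"]
```

Here fetch_page stands in for whatever issues the request and returns the decoded page; the SDK's own page object may already offer iteration helpers that make this loop unnecessary.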

client.evaluations.retrieve(, ) ->

Evaluation

get/v5/evaluations/{evaluation_id}

Retrieve a single evaluation by ID.

Returns the evaluation with its datasets, async-job progress, metadata, and task-error count. Archived evaluations are excluded unless include_archived is set. Pass the tasks view to include the evaluation's task configurations in the response.

client.evaluations.archive() ->

Evaluation

delete/v5/evaluations/{evaluation_id}

Archive (soft-delete) an evaluation.

Sets the evaluation's archived timestamp rather than permanently deleting it, and cascades the archive to the evaluation's items and dashboards while removing it from any evaluation groups. The evaluation can later be brought back with a restore request to the update endpoint.

client.evaluations.update(, ) ->

Evaluation

patch/v5/evaluations/{evaluation_id}

Update an evaluation's mutable fields, or restore it from the archive.

The action is selected by the request body: a restore request un-archives the evaluation and cascades the restore to its items and dashboards, while any other body applies a partial update to fields such as name, description, tags, and metadata (metadata is applied as an RFC 7396 merge patch). Updating an archived evaluation is rejected; restore it first. The evaluation row is locked for the duration of the write to avoid concurrent-update races.
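To make the merge-patch behavior for metadata concrete, here is a small local implementation of the RFC 7396 algorithm. It only illustrates the semantics (nested objects merge, null removes a key, anything else replaces); it is not the server's code, and the metadata keys are invented for the example.

```python
from typing import Any


def merge_patch(target: Any, patch: Any) -> Any:
    """Apply a JSON merge patch (RFC 7396) and return the patched value."""
    if not isinstance(patch, dict):
        # A non-object patch replaces the target wholesale.
        return patch
    result = dict(target) if isinstance(target, dict) else {}
    for key, value in patch.items():
        if value is None:
            # null in the patch removes the key from the target.
            result.pop(key, None)
        else:
            result[key] = merge_patch(result.get(key), value)
    return result


current = {"team": "search", "run": {"seed": 1, "split": "dev"}, "owner": "ana"}
patch = {"run": {"split": "test"}, "owner": None, "priority": "high"}
print(merge_patch(current, patch))
# {'team': 'search', 'run': {'seed': 1, 'split': 'test'}, 'priority': 'high'}
```

Note that under these semantics a metadata key cannot be set to null; sending null for a key deletes it.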

client.evaluations.retrieve_schema(, ) ->

EvaluationSchemaResponse

get/v5/evaluations/{evaluation_id}/schema

Describe the data schema of an evaluation's items.

Inspects the item data and task-result fields and returns each discovered field with its flattened key path, JSON type, source, and the number of items containing it, ordered alphabetically by field name. For large evaluations the schema may be inferred from a sample of items, in which case is_sampled is set and sample_size reports how many were analyzed. Set include_archived to include archived items in the analysis.
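The kind of output described above can be approximated locally. This sketch flattens nested item data into key paths, records a JSON type for each, and counts how many items contain each field. The dot separator, the treatment of arrays as leaf values, and the output field names key_path, type, and item_count are all assumptions made for illustration; the real response shape is defined by EvaluationSchemaResponse.

```python
from collections import Counter
from typing import Any, Dict, Iterable, Iterator, List, Tuple


def json_type(value: Any) -> str:
    # bool must be checked before int because bool is a subclass of int in Python.
    if value is None:
        return "null"
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        return "array"
    return "object"


def flatten(data: Dict[str, Any], prefix: str = "") -> Iterator[Tuple[str, str]]:
    """Yield (key_path, json_type) for every leaf; nested objects are expanded."""
    for key, value in data.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            yield from flatten(value, path)
        else:
            yield path, json_type(value)


def infer_schema(items: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    counts: Counter = Counter()
    for item in items:
        # A set so each (path, type) pair counts at most once per item.
        counts.update(set(flatten(item)))
    # Sorting the (path, type) keys orders the fields alphabetically by path.
    return [
        {"key_path": path, "type": type_, "item_count": n}
        for (path, type_), n in sorted(counts.items())
    ]
```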

client.evaluations.filter() -> SyncCursorPage[Evaluation]

post/v5/evaluations/filter

Filter evaluations by metadata, status, and tags.

Accepts up to 10 filters combined with AND logic, each comparing a key against a value with an operator (==, !=, >=, <=, IN, NOT_IN). Filter on metadata keys returned by the metadata-keys endpoint, plus the built-in status and tag keys. Archived evaluations are excluded unless include_archived is set, and the tasks view includes task configurations in each result. Use this for metadata or status filtering; for simple name or tag lookups the list endpoint is sufficient.
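The documented constraints (at most 10 filters, a fixed operator set, string values) can be checked before a request is sent. A minimal sketch follows; build_filters is a hypothetical helper, not part of the SDK, and it deliberately does not guess how multiple values for IN or NOT_IN are encoded in the value string.

```python
from typing import Dict, List, Tuple

ALLOWED_OPERATORS = {"==", "!=", ">=", "<=", "IN", "NOT_IN"}
MAX_FILTERS = 10


def build_filters(conditions: List[Tuple[str, str, str]]) -> List[Dict[str, str]]:
    """Turn (key, operator, value) tuples into filter dicts, validating locally."""
    if len(conditions) > MAX_FILTERS:
        raise ValueError(
            f"at most {MAX_FILTERS} filters are allowed, got {len(conditions)}"
        )
    filters = []
    for key, operator, value in conditions:
        if operator not in ALLOWED_OPERATORS:
            raise ValueError(f"unsupported operator: {operator!r}")
        if not isinstance(value, str):
            raise TypeError(
                f"value for {key!r} must be a string, got {type(value).__name__}"
            )
        filters.append({"key": key, "operator": operator, "value": value})
    return filters
```

The result can be passed as the filters argument, for example filters=build_filters([("status", "==", "completed")]). All filters are combined with AND, so each added condition narrows the result set.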

Parameters

filters: Iterable[Filter]

List of metadata filters to apply (maximum 10)

key: str

The metadata key to filter on

operator: Literal["==", "!=", ">=", "<=", "IN", "NOT_IN"]

The comparison operator to use

value: str

The value to compare against, always passed as a string regardless of the underlying type

object: Optional[Literal["metadata_filter"]]

(default: "metadata_filter")

ending_before: Optional[str]

include_archived: Optional[bool]

limit: Optional[int]

(maximum: 10000, minimum: 1, default: 100)

sort_by: Optional[str]

sort_order: Optional[SortOrder]

starting_after: Optional[str]

views: Optional[List[EvaluationViews]]

(default: [])

"tasks"

Returns

Evaluation

id: str

The unique identifier of the entity.

created_at: datetime

(format: date-time)

The date and time when the entity was created in ISO format.

created_by: Identity

The identity that created the entity.

datasets: List[Dataset]

name: str

status: Literal["failed", "completed", "running"]

tags: List[str]

The tags associated with the entity

archived_at: Optional[datetime]

(format: date-time)

The date and time when the entity was archived in ISO format.

description: Optional[str]

error_count: Optional[int]

Number of task errors across all items in this evaluation.

metadata: Optional[Dict[str, object]]

Metadata key-value pairs for the evaluation

object: Optional[Literal["evaluation"]]

(default: "evaluation")

progress: Optional[EvaluationTasksProgressSchema]

Progress of the evaluation's underlying async job

status_reason: Optional[str]

Reason for evaluation status

tasks: Optional[List[EvaluationTask]]

Tasks executed during the evaluation. Populated only when the tasks view is requested.

Request example

import os
from scale_gp_beta import SGPClient

client = SGPClient(
    api_key=os.environ.get("SGP_API_KEY"),  # This is the default and can be omitted
)
page = client.evaluations.filter(
    filters=[{
        "key": "key",
        "operator": "==",
        "value": "value",
    }],
)
evaluation = page.items[0]  # first evaluation on the returned page
print(evaluation.id)

200 Example

{
  "has_more": true,
  "items": [
    {
      "id": "id",
      "created_at": "2019-12-27T18:11:19.117Z",
      "created_by": {
        "id": "id",
        "type": "user",
        "object": "identity"
      },
      "datasets": [
        {
          "id": "id",
          "created_at": "2019-12-27T18:11:19.117Z",
          "created_by": {
            "id": "id",
            "type": "user",
            "object": "identity"
          },
          "current_version_num": 0,
          "name": "name",
          "tags": [
            "string"
          ],
          "archived_at": "2019-12-27T18:11:19.117Z",
          "description": "description",
          "object": "dataset"
        }
      ],
      "name": "name",
      "status": "failed",
      "tags": [
        "string"
      ],
      "archived_at": "2019-12-27T18:11:19.117Z",
      "description": "description",
      "error_count": 0,
      "metadata": {
        "foo": "bar"
      },
      "object": "evaluation",
      "progress": {
        "items": {
          "failed": 0,
          "pending": 0,
          "successful": 0,
          "total": 0,
          "failed_items": [
            {
              "item_id": "item_id",
              "error": "error",
              "error_type": "error_type"
            }
          ]
        },
        "workflows": {
          "completed": 0,
          "failed": 0,
          "pending": 0,
          "total": 0
        }
      },
      "status_reason": "status_reason",
      "tasks": [
        {
          "configuration": {
            "messages": [
              {
                "foo": "bar"
              }
            ],
            "model": "model",
            "audio": {
              "foo": "bar"
            },
            "frequency_penalty": -2,
            "function_call": {
              "foo": "bar"
            },
            "functions": [
              {
                "foo": "bar"
              }
            ],
            "logit_bias": {
              "foo": 0
            },
            "logprobs": true,
            "max_completion_tokens": 0,
            "max_tokens": 0,
            "metadata": {
              "foo": "string"
            },
            "modalities": [
              "string"
            ],
            "n": 0,
            "parallel_tool_calls": true,
            "prediction": {
              "foo": "bar"
            },
            "presence_penalty": -2,
            "reasoning_effort": "reasoning_effort",
            "response_format": {
              "foo": "bar"
            },
            "seed": 0,
            "stop": "string",
            "store": true,
            "temperature": 0,
            "tool_choice": "string",
            "tools": [
              {
                "foo": "bar"
              }
            ],
            "top_k": 0,
            "top_logprobs": 0,
            "top_p": 0
          },
          "alias": "alias",
          "task_type": "chat_completion"
        }
      ]
    }
  ],
  "total": 0,
  "limit": 0,
  "object": "list"
}
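The progress object in the response above can be reduced to a one-line status summary. This sketch works on a plain dictionary shaped like one entry of items in the example. Treating successful plus failed as the number of processed items is an assumption, and summarize_progress is a hypothetical helper.

```python
from typing import Any, Dict


def summarize_progress(evaluation: Dict[str, Any]) -> str:
    """Format item progress from an evaluation payload shaped like the example."""
    progress = evaluation.get("progress")
    if not progress:
        # progress is optional and may be absent when no async job was started.
        return f"{evaluation['name']}: {evaluation['status']} (no async job progress)"
    items = progress["items"]
    done = items["successful"] + items["failed"]
    total = items["total"]
    percent = 100.0 * done / total if total else 0.0
    return (
        f"{evaluation['name']}: {evaluation['status']}, "
        f"{done}/{total} items processed ({percent:.0f}%), {items['failed']} failed"
    )
```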

client.evaluations.retrieve_taxonomy() ->

EvaluationRetrieveTaxonomyResponse

get/v5/evaluations/{evaluation_id}/taxonomy

Get the taxonomy JSON for an evaluation's contributor question tasks.

Returns the raw taxonomy document stored for the evaluation. Responds with a not-found error if the evaluation has no taxonomy.

Domain types

class AndEvaluationRunCondition: ...

class AutoEvaluationAgentTaskRequestWithItemLocator: ...

class EqEvaluationRunCondition: ...

class Evaluation: ...

Dict[str, object]

class EvaluationSchemaResponse: ...

Schema information for an evaluation's item data structure

EvaluationTask

class EvaluationTasksProgressSchema: ...

Literal["tasks"]

class GtEvaluationRunCondition: ...

class GteEvaluationRunCondition: ...

class InEvaluationRunCondition: ...

class IsNotNullEvaluationRunCondition: ...

class IsNullEvaluationRunCondition: ...

str

class LtEvaluationRunCondition: ...

class LteEvaluationRunCondition: ...

class NeEvaluationRunCondition: ...

class NotEvaluationRunCondition: ...

class NotInEvaluationRunCondition: ...

class OrEvaluationRunCondition: ...

class PaginatedListEvaluation: ...

Evaluations

Tasks

EvaluationsResource.TasksResource

Methods

client.evaluations.tasks.add(, ) ->

Evaluation

post/v5/evaluations/{evaluation_id}/tasks

Add a new test criterion to an existing evaluation.

Restricted to contributor question tasks (contributor_evaluation.question); other task types must be configured when the evaluation is first created and are rejected here. The request is also rejected if the evaluation is archived, if a test criterion with the same alias already exists, or if any contributor annotation task for the evaluation has already been claimed or completed. Because only contributor question tasks are accepted, the added criterion is applied synchronously and contributors answer it against the evaluation's existing items; no async job or Temporal workflow is started.
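Some of these rejection conditions can be detected before sending the request. This sketch checks them against an evaluation payload in dictionary form; precheck_add_task is a hypothetical helper. It assumes the evaluation was fetched with the tasks view so that its tasks list is populated. Whether annotation tasks have been claimed or completed is not visible in the fields documented here, so that condition is left to the server.

```python
from typing import Any, Dict, List


def precheck_add_task(evaluation: Dict[str, Any], task: Dict[str, Any]) -> List[str]:
    """Return the documented rejection reasons that can be detected client-side."""
    problems = []
    if task.get("task_type") != "contributor_evaluation.question":
        problems.append("only contributor_evaluation.question tasks can be added")
    if evaluation.get("archived_at") is not None:
        problems.append("evaluation is archived")
    # tasks is only present when the tasks view was requested; treat missing as empty.
    existing = {t.get("alias") for t in evaluation.get("tasks") or []}
    if task.get("alias") in existing:
        problems.append(f"alias {task.get('alias')!r} already exists")
    return problems
```

An empty list means none of the locally checkable conditions apply; the server may still reject the request for reasons that are not checked here.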

client.evaluations.tasks.update(, ) ->

Evaluation

patch/v5/evaluations/{evaluation_id}/tasks/{alias}

Replace the full configuration of a single test criterion, identified by its alias.

The alias must match an existing test criterion on the evaluation, and the replacement configuration is validated against the evaluation's current items before being applied. The request is rejected if the evaluation is archived, if no test criterion matches the alias, or if any contributor annotation task for the evaluation has already been claimed or completed. At that point labelers are already working, and mutating the task definition would corrupt their work.