DataDesigner Interface

DataDesigner validates configs, generates in-memory previews, creates persisted datasets, lists configured MCP tools, and exposes default model settings.

For runtime settings passed through set_run_config(), see run_config. For persisted creation results returned by create(), see results.

DataDesigner

Bases: DataDesignerInterface[DatasetCreationResults]

Main interface for creating datasets with Data Designer.

This class provides the primary interface for building synthetic datasets using Data Designer configurations. It manages model providers, artifact storage, and orchestrates the dataset creation and profiling processes.

Parameters:

Name Type Description Default
artifact_path Path | str | None

Path where generated artifacts will be stored. If not provided, artifacts are stored in an artifacts directory under the current working directory.

None
model_providers list[ModelProvider] | None

Optional list of model providers for LLM generation. If None, uses default providers.

None
secret_resolver SecretResolver | None

Resolver for handling secrets and credentials. If None, uses the default composite resolver, which checks environment variables and plaintext values.

None
seed_readers list[SeedReader] | None

Optional list of seed readers. If None, uses default readers.

None
managed_assets_path Path | str | None

Path to the managed assets directory, which points to the location of managed datasets and other assets used during dataset generation. If not provided, the DATA_DESIGNER_MANAGED_ASSETS_PATH environment variable is checked; if that is also unset, the default managed assets directory defined in data_designer.config.utils.constants is used.

None
person_reader PersonReader | None

Optional custom reader for person datasets. If provided, this reader will be used instead of the default local reader. This allows clients to customize how managed datasets are accessed (e.g., using custom fsspec clients for S3 or other remote storage).

None
mcp_providers list[MCPProviderT] | None

Optional list of MCP provider configurations to enable tool-calling for LLM generation columns. Supports both MCPProvider (remote SSE or Streamable HTTP) and LocalStdioMCPProvider (local subprocess).

None
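
As a sketch of construction (the import path below is an assumption inferred from the source path shown later, and the argument values are illustrative, not verified defaults):

```python
from pathlib import Path

# Assumed import path, inferred from
# packages/data-designer/src/data_designer/interface/data_designer.py.
from data_designer.interface.data_designer import DataDesigner

# Minimal construction: every argument is optional, so the defaults
# described above apply (default model providers, the composite secret
# resolver, and an ./artifacts directory under the current working dir).
designer = DataDesigner()

# With an explicit artifact path; all other components still use defaults.
designer = DataDesigner(artifact_path=Path("/tmp/dd-artifacts"))
```

Note that when model_providers is supplied, the first provider in the list becomes the default; the YAML's default: key is consulted only when falling back to the YAML providers.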

Methods:

Name Description
create

Create dataset and save results to the local artifact storage.

get_default_model_configs

Get the default model configurations.

get_default_model_providers

Get the default model providers.

get_models

Get a dict of ModelFacade instances for custom column development.

list_mcp_tool_names

Connect to a configured MCP provider and return the names of its available tools.

preview

Generate preview dataset for fast iteration on your Data Designer configuration.

set_run_config

Set the runtime configuration for dataset generation.

validate

Validate the Data Designer configuration as defined by the DataDesignerConfigBuilder.

Attributes:

Name Type Description
info InterfaceInfo

Get information about the Data Designer interface.

model_provider_registry ModelProviderRegistry

Get the resolved model provider registry.

run_config RunConfig

Get the runtime configuration applied to dataset generation.

secret_resolver SecretResolver

Get the secret resolver used by this DataDesigner instance.

Source code in packages/data-designer/src/data_designer/interface/data_designer.py
def __init__(
    self,
    artifact_path: Path | str | None = None,
    *,
    model_providers: list[ModelProvider] | None = None,
    secret_resolver: SecretResolver | None = None,
    seed_readers: list[SeedReader] | None = None,
    managed_assets_path: Path | str | None = None,
    person_reader: PersonReader | None = None,
    mcp_providers: list[MCPProviderT] | None = None,
):
    _initialize_interface_runtime()
    self._secret_resolver = secret_resolver or DEFAULT_SECRET_RESOLVER
    self._artifact_path = Path(artifact_path) if artifact_path is not None else Path.cwd() / "artifacts"
    self._run_config = RunConfig()
    self._managed_assets_path = Path(managed_assets_path or MANAGED_ASSETS_PATH)
    self._person_reader = person_reader
    # Only consult the YAML's `default:` key when we are also falling back to
    # the YAML's `providers:` list. A user-supplied `model_providers` list
    # owns its own default (first wins), so the YAML default must not leak
    # in and either (a) hard-fail validation when the YAML names a provider
    # absent from the supplied list or (b) silently override the
    # documented first-wins ordering. See issue #588.
    if model_providers is None:
        self._model_providers = self._resolve_model_providers(None)
        default_provider_name = get_default_provider_name()
    else:
        self._model_providers = self._resolve_model_providers(model_providers)
        default_provider_name = None
    self._mcp_providers = mcp_providers or []
    self._model_provider_registry = resolve_model_provider_registry(self._model_providers, default_provider_name)
    self._seed_reader_registry = SeedReaderRegistry(readers=seed_readers or DEFAULT_SEED_READERS)

info property

Get information about the Data Designer interface.

Returns:

Type Description
InterfaceInfo

InterfaceInfo object with information about the Data Designer interface.

model_provider_registry property

Get the resolved model provider registry.

Returns:

Type Description
ModelProviderRegistry

The ModelProviderRegistry containing the providers and default resolved at construction time. The default is taken from the first user-supplied provider when model_providers was passed to the constructor; otherwise from the YAML's default: key when set, falling back to the first provider in the YAML list.

run_config property

Get the runtime configuration applied to dataset generation.

Returns:

Type Description
RunConfig

The active RunConfig instance. Note that RunConfig normalizes some fields on construction (e.g., shutdown_error_rate becomes 1.0 when disable_early_shutdown=True), so the returned object may not exactly equal the one originally passed to set_run_config.

secret_resolver property

Get the secret resolver used by this DataDesigner instance.

Returns:

Type Description
SecretResolver

The SecretResolver instance handling credentials and secrets.

create(config_builder, *, num_records=DEFAULT_NUM_RECORDS, dataset_name='dataset')

Create dataset and save results to the local artifact storage.

This method orchestrates the full dataset creation pipeline including building the dataset according to the configuration, profiling the generated data, and storing artifacts.

Parameters:

Name Type Description Default
config_builder DataDesignerConfigBuilder

The DataDesignerConfigBuilder containing the dataset configuration (columns, constraints, seed data, etc.).

required
num_records int

Number of records to generate.

DEFAULT_NUM_RECORDS
dataset_name str

Name of the dataset. This name is used as the dataset folder name under the artifact path. If a non-empty directory with the same name already exists, the dataset is saved to a new directory with a datetime stamp. For example, if the dataset name is "awesome_dataset" and a non-empty directory with that name already exists, the dataset is saved to a new directory named "awesome_dataset_2025-01-01_12-00-00".

'dataset'
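
The collision behavior described above can be sketched as a small standalone helper (hypothetical, for illustration only; not the library's actual implementation):

```python
from datetime import datetime
from pathlib import Path


def resolve_dataset_dir(artifact_path: Path, dataset_name: str, now: datetime) -> Path:
    """Hypothetical helper mirroring the documented collision behavior."""
    target = artifact_path / dataset_name
    # Fall back to a datetime-stamped sibling only when a non-empty
    # directory with the same name already exists.
    if target.is_dir() and any(target.iterdir()):
        target = artifact_path / f"{dataset_name}_{now:%Y-%m-%d_%H-%M-%S}"
    return target
```

With an existing non-empty awesome_dataset directory and a timestamp of 2025-01-01 12:00:00, this yields awesome_dataset_2025-01-01_12-00-00.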

Returns:

Type Description
DatasetCreationResults

DatasetCreationResults object with methods for loading the generated dataset, analysis results, and displaying sample records for inspection.

Raises:

Type Description
DataDesignerGenerationError

If an error occurs during dataset generation.

DataDesignerProfilingError

If an error occurs during dataset profiling.

Source code in packages/data-designer/src/data_designer/interface/data_designer.py
def create(
    self,
    config_builder: DataDesignerConfigBuilder,
    *,
    num_records: int = DEFAULT_NUM_RECORDS,
    dataset_name: str = "dataset",
) -> DatasetCreationResults:
    """Create dataset and save results to the local artifact storage.

    This method orchestrates the full dataset creation pipeline including building
    the dataset according to the configuration, profiling the generated data, and
    storing artifacts.

    Args:
        config_builder: The DataDesignerConfigBuilder containing the dataset
            configuration (columns, constraints, seed data, etc.).
        num_records: Number of records to generate.
        dataset_name: Name of the dataset. This name will be used as the dataset
            folder name in the artifact path directory. If a non-empty directory with the
            same name already exists, dataset will be saved to a new directory with
            a datetime stamp. For example, if the dataset name is "awesome_dataset" and a directory
            with the same name already exists, the dataset will be saved to a new directory
            with the name "awesome_dataset_2025-01-01_12-00-00".

    Returns:
        DatasetCreationResults object with methods for loading the generated dataset,
        analysis results, and displaying sample records for inspection.

    Raises:
        DataDesignerGenerationError: If an error occurs during dataset generation.
        DataDesignerProfilingError: If an error occurs during dataset profiling.
    """
    logger.info("🎨 Creating Data Designer dataset")
    self._log_jinja_rendering_engine_mode()

    resource_provider = self._create_resource_provider(dataset_name, config_builder)

    try:
        builder = self._create_dataset_builder(config_builder.build(), resource_provider)
        builder.build(num_records=num_records)
    except DeprecationWarning:
        raise
    except Exception as e:
        raise DataDesignerGenerationError(f"🛑 Error generating dataset: {e}") from e

    task_traces = builder.task_traces

    try:
        dataset_for_profiler = builder.artifact_storage.load_dataset_with_dropped_columns()
    except Exception as e:
        # Distinguish "early shutdown produced zero records" from generic load failures
        # so callers can react programmatically (e.g. retry on a different alias) instead
        # of parsing a wrapped FileNotFoundError. The scheduler's structured signal lives
        # on the builder for the duration of the run. We also require the run to have
        # produced zero records: a partial-salvage run that fails to load for unrelated
        # reasons (corrupt parquet, dropped-columns mismatch, filesystem hiccup) should
        # surface the original cause, not a misleading "zero records" diagnosis.
        if builder.early_shutdown and builder.actual_num_records == 0:
            raise DataDesignerEarlyShutdownError(
                "🛑 Generation produced zero records — early shutdown was triggered. "
                "The non-retryable error rate exceeded the configured threshold; check the "
                "warnings above (and any 'Provider showing degraded performance' logs) for "
                "the contributing failures."
            ) from e
        # Surface the original task error when the run produced 0 records due to a
        # deterministic non-retryable failure (e.g. bad seed source). Without this,
        # the user sees a generic FileNotFoundError-on-parquet that obscures the cause.
        # ``actual_num_records`` is set only on the async path; sync runs leave it at
        # ``-1`` and ``first_non_retryable_error`` at ``None``, so this branch is
        # async-only by construction.
        root_cause = builder.first_non_retryable_error
        if root_cause is not None and builder.actual_num_records == 0:
            raise DataDesignerGenerationError(f"🛑 {type(root_cause).__name__}: {root_cause}") from root_cause
        raise DataDesignerGenerationError(
            f"🛑 Failed to load generated dataset — all records may have been dropped "
            f"due to generation failures. Check the warnings above for details. Original error: {e}"
        ) from e

    # Defensive: the batch manager skips writing when the buffer is empty, so in
    # practice load_dataset_with_dropped_columns() would raise before returning a
    # zero-row DataFrame. This guard protects against future changes to that contract.
    if len(dataset_for_profiler) == 0:
        # Mirror the load-failure guard above: only raise the typed error when
        # the run actually produced zero records. A partial-salvage run that
        # somehow returns an empty DF for unrelated reasons should surface the
        # generic error.
        if builder.early_shutdown and builder.actual_num_records == 0:
            raise DataDesignerEarlyShutdownError(
                "🛑 Dataset is empty — early shutdown was triggered before any records "
                "could complete. Check the warnings above for the contributing failures."
            )
        root_cause = builder.first_non_retryable_error
        if root_cause is not None and builder.actual_num_records == 0:
            raise DataDesignerGenerationError(f"🛑 {type(root_cause).__name__}: {root_cause}") from root_cause
        raise DataDesignerGenerationError(
            "🛑 Dataset is empty — all records were dropped due to generation failures. "
            "Check the warnings above for details on which columns failed."
        )

    try:
        profiler = self._create_dataset_profiler(config_builder, resource_provider)
        analysis = profiler.profile_dataset(num_records, dataset_for_profiler)
    except Exception as e:
        raise DataDesignerProfilingError(f"🛑 Error profiling dataset: {e}") from e

    dataset_metadata = resource_provider.get_dataset_metadata()

    # Update metadata with column statistics from analysis
    if analysis:
        builder.artifact_storage.update_metadata(
            {"column_statistics": [stat.model_dump(mode="json") for stat in analysis.column_statistics]}
        )

    return DatasetCreationResults(
        artifact_storage=builder.artifact_storage,
        analysis=analysis,
        config_builder=config_builder,
        dataset_metadata=dataset_metadata,
        task_traces=task_traces,
    )

get_default_model_configs()

Get the default model configurations.

Returns:

Type Description
list[ModelConfig]

List of default model configurations.

Source code in packages/data-designer/src/data_designer/interface/data_designer.py
def get_default_model_configs(self) -> list[ModelConfig]:
    """Get the default model configurations.

    Returns:
        List of default model configurations.
    """
    logger.info(f"♻️ Using default model configs from {str(MODEL_CONFIGS_FILE_PATH)!r}")
    return get_default_model_configs()

get_default_model_providers()

Get the default model providers.

Returns:

Type Description
list[ModelProvider]

List of default model providers.

Source code in packages/data-designer/src/data_designer/interface/data_designer.py
def get_default_model_providers(self) -> list[ModelProvider]:
    """Get the default model providers.

    Returns:
        List of default model providers.
    """
    logger.info(f"♻️ Using default model providers from {str(MODEL_PROVIDERS_FILE_PATH)!r}")
    return get_default_providers()

get_models(model_aliases)

Get a dict of ModelFacade instances for custom column development.

Use this to experiment with custom column generator functions outside of the full pipeline. The returned dict matches the models argument passed to 3-arg custom column functions.

Parameters:

Name Type Description Default
model_aliases list[str]

List of model aliases to include in the dict.

required

Returns:

Type Description
dict[str, ModelFacade]

Dict mapping alias to ModelFacade instance.

Source code in packages/data-designer/src/data_designer/interface/data_designer.py
def get_models(self, model_aliases: list[str]) -> dict[str, ModelFacade]:
    """Get a dict of ModelFacade instances for custom column development.

    Use this to experiment with custom column generator functions outside of
    the full pipeline. The returned dict matches the `models` argument passed
    to 3-arg custom column functions.

    Args:
        model_aliases: List of model aliases to include in the dict.

    Returns:
        Dict mapping alias to ModelFacade instance.
    """
    config_builder = DataDesignerConfigBuilder()
    resource_provider = self._create_resource_provider("dev", config_builder)
    return {alias: resource_provider.model_registry.get_model(model_alias=alias) for alias in model_aliases}
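
A usage sketch (the model aliases below are hypothetical; use aliases defined in your model configs):

```python
# Hypothetical aliases for illustration.
models = designer.get_models(["text-model", "judge-model"])

# `models` has the same shape as the `models` argument handed to
# 3-arg custom column functions, so a generator function can be
# exercised directly outside the full pipeline:
# my_custom_column(row, ctx, models)  # hypothetical 3-arg signature
```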

list_mcp_tool_names(mcp_provider_name, *, timeout_sec=10.0)

Connect to a configured MCP provider and return the names of its available tools.

Parameters:

Name Type Description Default
mcp_provider_name str

The name field of an MCP provider passed to the constructor.

required
timeout_sec float

Timeout in seconds for the MCP handshake. Defaults to 10.

10.0

Returns:

Type Description
list[str]

A list of tool name strings exposed by the MCP server.

Raises:

Type Description
ValueError

If no provider with the given name was configured.

Source code in packages/data-designer/src/data_designer/interface/data_designer.py
def list_mcp_tool_names(self, mcp_provider_name: str, *, timeout_sec: float = 10.0) -> list[str]:
    """Connect to a configured MCP provider and return the names of its available tools.

    Args:
        mcp_provider_name: The ``name`` field of an MCP provider passed to the constructor.
        timeout_sec: Timeout in seconds for the MCP handshake. Defaults to 10.

    Returns:
        A list of tool name strings exposed by the MCP server.

    Raises:
        ValueError: If no provider with the given name was configured.
    """
    for provider in self._mcp_providers:
        if provider.name == mcp_provider_name:
            return list_tool_names(provider, timeout_sec=timeout_sec)
    configured = [p.name for p in self._mcp_providers]
    raise ValueError(f"No MCP provider named {mcp_provider_name!r}. Configured providers: {configured}")
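
For example (the provider name "docs-search" is hypothetical; it must match the name field of an MCP provider passed to the constructor):

```python
# Raises ValueError if "docs-search" was not configured, listing the
# names of the providers that were.
tool_names = designer.list_mcp_tool_names("docs-search", timeout_sec=5.0)
for name in tool_names:
    print(name)
```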

preview(config_builder, *, num_records=DEFAULT_NUM_RECORDS)

Generate preview dataset for fast iteration on your Data Designer configuration.

All preview results are stored in memory. Once you are satisfied with the preview, use the create method to generate data at a larger scale and save results to disk.

Parameters:

Name Type Description Default
config_builder DataDesignerConfigBuilder

The DataDesignerConfigBuilder containing the dataset configuration (columns, constraints, seed data, etc.).

required
num_records int

Number of records to generate.

DEFAULT_NUM_RECORDS

Returns:

Type Description
PreviewResults

PreviewResults object with methods for inspecting the results.

Raises:

Type Description
DataDesignerGenerationError

If an error occurs during preview dataset generation.

DataDesignerEarlyShutdownError

If preview terminated via the early-shutdown gate with zero records produced. Subclass of DataDesignerGenerationError.

DataDesignerProfilingError

If an error occurs during preview dataset profiling.

Source code in packages/data-designer/src/data_designer/interface/data_designer.py
def preview(
    self, config_builder: DataDesignerConfigBuilder, *, num_records: int = DEFAULT_NUM_RECORDS
) -> PreviewResults:
    """Generate preview dataset for fast iteration on your Data Designer configuration.

    All preview results are stored in memory. Once you are satisfied with the preview,
    use the `create` method to generate data at a larger scale and save results to disk.

    Args:
        config_builder: The DataDesignerConfigBuilder containing the dataset
            configuration (columns, constraints, seed data, etc.).
        num_records: Number of records to generate.

    Returns:
        PreviewResults object with methods for inspecting the results.

    Raises:
        DataDesignerGenerationError: If an error occurs during preview dataset generation.
        DataDesignerEarlyShutdownError: If preview terminated via the early-shutdown gate
            with zero records produced. Subclass of ``DataDesignerGenerationError``.
        DataDesignerProfilingError: If an error occurs during preview dataset profiling.
    """
    logger.info(f"{RandomEmoji.previewing()} Preview generation in progress")
    self._log_jinja_rendering_engine_mode()

    resource_provider = self._create_resource_provider("preview-dataset", config_builder)
    try:
        builder = self._create_dataset_builder(config_builder.build(), resource_provider)
        raw_dataset = builder.build_preview(num_records=num_records)
        processed_dataset = builder.process_preview(raw_dataset)
    except DeprecationWarning:
        raise
    except Exception as e:
        raise DataDesignerGenerationError(f"🛑 Error generating preview dataset: {e}") from e

    if len(processed_dataset) == 0:
        # Mirror the create() path: distinguish "early shutdown produced zero
        # records" from generic empty-dataset failures so callers can react
        # programmatically.
        if builder.early_shutdown and builder.actual_num_records == 0:
            raise DataDesignerEarlyShutdownError(
                "🛑 Preview is empty — early shutdown was triggered before any records "
                "could complete. Check the warnings above for the contributing failures."
            )
        root_cause = builder.first_non_retryable_error
        if root_cause is not None and builder.actual_num_records == 0:
            raise DataDesignerGenerationError(f"🛑 {type(root_cause).__name__}: {root_cause}") from root_cause
        raise DataDesignerGenerationError(
            "🛑 Dataset is empty — all records were dropped due to generation or processing failures. "
            "Check the warnings above for details on which columns failed."
        )

    dropped_columns = raw_dataset.columns.difference(processed_dataset.columns)
    if len(dropped_columns) > 0:
        dataset_for_profiler = lazy.pd.concat([processed_dataset, raw_dataset[dropped_columns]], axis=1)
    else:
        dataset_for_profiler = processed_dataset

    try:
        profiler = self._create_dataset_profiler(config_builder, resource_provider)
        analysis = profiler.profile_dataset(num_records, dataset_for_profiler)
    except Exception as e:
        raise DataDesignerProfilingError(f"🛑 Error profiling preview dataset: {e}") from e

    processor_artifacts: dict[str, list[dict]] = {}
    for name in builder.artifact_storage.list_processor_names():
        processor_artifacts[name] = builder.artifact_storage.load_processor_dataset(name).to_dict(orient="records")

    if isinstance(analysis, DatasetProfilerResults) and len(analysis.column_statistics) > 0:
        logger.info(f"{RandomEmoji.success()} Preview complete!")

    # Create dataset metadata from the resource provider
    dataset_metadata = resource_provider.get_dataset_metadata()

    return PreviewResults(
        dataset=processed_dataset,
        analysis=analysis,
        processor_artifacts=processor_artifacts,
        config_builder=config_builder,
        dataset_metadata=dataset_metadata,
        task_traces=builder.task_traces or None,
    )
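
A typical iteration loop might look like this sketch (builder import path assumed, as above):

```python
from data_designer.config import DataDesignerConfigBuilder  # assumed path

config_builder = DataDesignerConfigBuilder()
# ... adjust columns, prompts, constraints ...

# Previews are small and held entirely in memory, so this is cheap to
# repeat while tuning the configuration.
preview = designer.preview(config_builder, num_records=10)

# Once satisfied, scale up and persist with create().
results = designer.create(config_builder, num_records=10_000)
```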

set_run_config(run_config)

Set the runtime configuration for dataset generation.

Parameters:

Name Type Description Default
run_config RunConfig

A RunConfig instance containing runtime settings such as early shutdown behavior, batch sizing via buffer_size, and non-inference worker concurrency via non_inference_max_parallel_workers.

required
Notes

When disable_early_shutdown=True, DataDesigner will never terminate generation early due to error-rate thresholds. Errors are still tracked for reporting.

Source code in packages/data-designer/src/data_designer/interface/data_designer.py
def set_run_config(self, run_config: RunConfig) -> None:
    """Set the runtime configuration for dataset generation.

    Args:
        run_config: A RunConfig instance containing runtime settings such as
            early shutdown behavior, batch sizing via `buffer_size`, and non-inference worker
            concurrency via `non_inference_max_parallel_workers`.

    Notes:
        When `disable_early_shutdown=True`, DataDesigner will never terminate generation early
        due to error-rate thresholds. Errors are still tracked for reporting.
    """
    self._run_config = run_config
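
For instance (RunConfig import path assumed):

```python
from data_designer.config import RunConfig  # assumed import path

designer.set_run_config(RunConfig(disable_early_shutdown=True))

# RunConfig normalizes some fields on construction — e.g.
# shutdown_error_rate becomes 1.0 when disable_early_shutdown=True —
# so designer.run_config may not compare equal to the instance above.
```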

validate(config_builder)

Validate the Data Designer configuration as defined by the DataDesignerConfigBuilder with the configured engine components (SecretResolver, SeedReaders, etc.).

Parameters:

Name Type Description Default
config_builder DataDesignerConfigBuilder

The DataDesignerConfigBuilder containing the dataset configuration (columns, constraints, seed data, etc.).

required

Returns:

Type Description
None

None if the configuration is valid.

Raises:

Type Description
InvalidConfigError

If the configuration is invalid.

Source code in packages/data-designer/src/data_designer/interface/data_designer.py
def validate(self, config_builder: DataDesignerConfigBuilder) -> None:
    """Validate the Data Designer configuration as defined by the DataDesignerConfigBuilder
    with the configured engine components (SecretResolver, SeedReaders, etc.).

    Args:
        config_builder: The DataDesignerConfigBuilder containing the dataset
            configuration (columns, constraints, seed data, etc.).

    Returns:
        None if the configuration is valid.

    Raises:
        InvalidConfigError: If the configuration is invalid.
    """
    resource_provider = self._create_resource_provider("validate-configuration", config_builder)
    compile_data_designer_config(config_builder.build(), resource_provider)
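
A validation sketch (the InvalidConfigError import location is an assumption):

```python
from data_designer.interface.data_designer import DataDesigner  # assumed path

designer = DataDesigner()
try:
    designer.validate(config_builder)  # returns None when valid
except InvalidConfigError as e:  # assumed importable from the errors module
    print(f"Configuration invalid: {e}")
```

Because validate() compiles the config against the configured engine components (SecretResolver, SeedReaders, etc.), it catches wiring problems before any records are generated.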