DataDesigner Interface

DataDesigner validates configs, generates in-memory previews, creates persisted datasets, lists configured MCP tools, and exposes default model settings.

For runtime settings passed through set_run_config(), see run_config. For persisted creation results returned by create(), see results.

DataDesigner

Bases: DataDesignerInterface[DatasetCreationResults]

Main interface for creating datasets with Data Designer.

This class provides the primary interface for building synthetic datasets using Data Designer configurations. It manages model providers, artifact storage, and orchestrates the dataset creation and profiling processes.

Parameters:

Name Type Description Default
artifact_path Path | str | None

Path where generated artifacts will be stored. If not provided, artifacts are stored in an artifacts directory under the current working directory.

None
model_providers list[ModelProvider] | None

Optional list of model providers for LLM generation. If None, uses default providers.

None
secret_resolver SecretResolver | None

Resolver for handling secrets and credentials. If None, uses the default composite resolver, which checks environment variables and plaintext values.

None
seed_readers list[SeedReader] | None

Optional list of seed readers. If None, uses default readers.

None
managed_assets_path Path | str | None

Path to the managed assets directory, which points to the location of managed datasets and other assets used during dataset generation. If not provided, the DATA_DESIGNER_MANAGED_ASSETS_PATH environment variable is checked; if that is also unset, the default managed assets directory defined in data_designer.config.utils.constants is used.

None
person_reader PersonReader | None

Optional custom reader for person datasets. If provided, this reader will be used instead of the default local reader. This allows clients to customize how managed datasets are accessed (e.g., using custom fsspec clients for S3 or other remote storage).

None
mcp_providers list[MCPProviderT] | None

Optional list of MCP provider configurations to enable tool-calling for LLM generation columns. Supports both MCPProvider (remote SSE or Streamable HTTP) and LocalStdioMCPProvider (local subprocess).

None
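
As a sketch of construction (the import path below is an assumption inferred from the source path shown later, and the argument values are illustrative, not verified defaults):

```python
from pathlib import Path

# Assumed import path, inferred from
# packages/data-designer/src/data_designer/interface/data_designer.py.
from data_designer.interface.data_designer import DataDesigner

# Minimal construction: every argument is optional, so the defaults
# described above apply (default model providers, the composite secret
# resolver, and an ./artifacts directory under the current working dir).
designer = DataDesigner()

# With an explicit artifact path; all other components still use defaults.
designer = DataDesigner(artifact_path=Path("/tmp/dd-artifacts"))
```

Note that when model_providers is supplied, the first provider in the list becomes the default; the YAML's default: key is consulted only when falling back to the YAML providers.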

Methods:

Name Description
create

Create dataset and save results to the local artifact storage.

get_default_model_configs

Get the default model configurations.

get_default_model_providers

Get the default model providers.

get_models

Get a dict of ModelFacade instances for custom column development.

list_mcp_tool_names

Connect to a configured MCP provider and return the names of its available tools.

preview

Generate preview dataset for fast iteration on your Data Designer configuration.

set_run_config

Set the runtime configuration for dataset generation.

validate

Validate the Data Designer configuration as defined by the DataDesignerConfigBuilder.

Attributes:

Name Type Description
info InterfaceInfo

Get information about the Data Designer interface.

model_provider_registry ModelProviderRegistry

Get the resolved model provider registry.

run_config RunConfig

Get the runtime configuration applied to dataset generation.

secret_resolver SecretResolver

Get the secret resolver used by this DataDesigner instance.

Source code in packages/data-designer/src/data_designer/interface/data_designer.py
def __init__(
    self,
    artifact_path: Path | str | None = None,
    *,
    model_providers: list[ModelProvider] | None = None,
    secret_resolver: SecretResolver | None = None,
    seed_readers: list[SeedReader] | None = None,
    managed_assets_path: Path | str | None = None,
    person_reader: PersonReader | None = None,
    mcp_providers: list[MCPProviderT] | None = None,
):
    _initialize_interface_runtime()
    self._secret_resolver = secret_resolver or DEFAULT_SECRET_RESOLVER
    self._artifact_path = Path(artifact_path) if artifact_path is not None else Path.cwd() / "artifacts"
    self._run_config = RunConfig()
    self._managed_assets_path = Path(managed_assets_path or MANAGED_ASSETS_PATH)
    self._person_reader = person_reader
    # Only consult the YAML's `default:` key when we are also falling back to
    # the YAML's `providers:` list. A user-supplied `model_providers` list
    # owns its own default (first wins), so the YAML default must not leak
    # in and either (a) hard-fail validation when the YAML names a provider
    # absent from the supplied list or (b) silently override the
    # documented first-wins ordering. See issue #588.
    if model_providers is None:
        self._model_providers = self._resolve_model_providers(None)
        default_provider_name = get_default_provider_name()
    else:
        self._model_providers = self._resolve_model_providers(model_providers)
        default_provider_name = None
    self._mcp_providers = mcp_providers or []
    self._model_provider_registry = resolve_model_provider_registry(self._model_providers, default_provider_name)
    self._seed_reader_registry = SeedReaderRegistry(readers=seed_readers or DEFAULT_SEED_READERS)

info property

Get information about the Data Designer interface.

Returns:

Type Description
InterfaceInfo

InterfaceInfo object with information about the Data Designer interface.

model_provider_registry property

Get the resolved model provider registry.

Returns:

Type Description
ModelProviderRegistry

The ModelProviderRegistry containing the providers and default resolved at construction time. The default is taken from the first user-supplied provider when model_providers was passed to the constructor; otherwise from the YAML's default: key when set, falling back to the first provider in the YAML list.

run_config property

Get the runtime configuration applied to dataset generation.

Returns:

Type Description
RunConfig

The active RunConfig instance. Note that RunConfig normalizes some fields on construction (e.g., shutdown_error_rate becomes 1.0 when disable_early_shutdown=True), so the returned object may not exactly equal the one originally passed to set_run_config.

secret_resolver property

Get the secret resolver used by this DataDesigner instance.

Returns:

Type Description
SecretResolver

The SecretResolver instance handling credentials and secrets.

create(config_builder, *, num_records=DEFAULT_NUM_RECORDS, dataset_name='dataset')

Create dataset and save results to the local artifact storage.

This method orchestrates the full dataset creation pipeline including building the dataset according to the configuration, profiling the generated data, and storing artifacts.

Parameters:

Name Type Description Default
config_builder DataDesignerConfigBuilder

The DataDesignerConfigBuilder containing the dataset configuration (columns, constraints, seed data, etc.).

required
num_records int

Number of records to generate.

DEFAULT_NUM_RECORDS
dataset_name str

Name of the dataset. This name is used as the dataset folder name under the artifact path. If a non-empty directory with the same name already exists, the dataset is saved to a new directory with a datetime stamp. For example, if the dataset name is "awesome_dataset" and a non-empty directory with that name already exists, the dataset is saved to a new directory named "awesome_dataset_2025-01-01_12-00-00".

'dataset'
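
The collision behavior described above can be sketched as a small standalone helper (hypothetical, for illustration only; not the library's actual implementation):

```python
from datetime import datetime
from pathlib import Path


def resolve_dataset_dir(artifact_path: Path, dataset_name: str, now: datetime) -> Path:
    """Hypothetical helper mirroring the documented collision behavior."""
    target = artifact_path / dataset_name
    # Fall back to a datetime-stamped sibling only when a non-empty
    # directory with the same name already exists.
    if target.is_dir() and any(target.iterdir()):
        target = artifact_path / f"{dataset_name}_{now:%Y-%m-%d_%H-%M-%S}"
    return target
```

With an existing non-empty awesome_dataset directory and a timestamp of 2025-01-01 12:00:00, this yields awesome_dataset_2025-01-01_12-00-00.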

Returns:

Type Description
DatasetCreationResults

DatasetCreationResults object with methods for loading the generated dataset, analysis results, and displaying sample records for inspection.

Raises:

Type Description
DataDesignerGenerationError

If an error occurs during dataset generation.

DataDesignerProfilingError

If an error occurs during dataset profiling.

Source code in packages/data-designer/src/data_designer/interface/data_designer.py
def create(
    self,
    config_builder: DataDesignerConfigBuilder,
    *,
    num_records: int = DEFAULT_NUM_RECORDS,
    dataset_name: str = "dataset",
) -> DatasetCreationResults:
    """Create dataset and save results to the local artifact storage.

    This method orchestrates the full dataset creation pipeline including building
    the dataset according to the configuration, profiling the generated data, and
    storing artifacts.

    Args:
        config_builder: The DataDesignerConfigBuilder containing the dataset
            configuration (columns, constraints, seed data, etc.).
        num_records: Number of records to generate.
        dataset_name: Name of the dataset. This name will be used as the dataset
            folder name in the artifact path directory. If a non-empty directory with the
            same name already exists, dataset will be saved to a new directory with
            a datetime stamp. For example, if the dataset name is "awesome_dataset" and a directory
            with the same name already exists, the dataset will be saved to a new directory
            with the name "awesome_dataset_2025-01-01_12-00-00".

    Returns:
        DatasetCreationResults object with methods for loading the generated dataset,
        analysis results, and displaying sample records for inspection.

    Raises:
        DataDesignerGenerationError: If an error occurs during dataset generation.
        DataDesignerProfilingError: If an error occurs during dataset profiling.
    """
    logger.info("🎨 Creating Data Designer dataset")
    self._log_jinja_rendering_engine_mode()

    resource_provider = self._create_resource_provider(dataset_name, config_builder)

    try:
        builder = self._create_dataset_builder(config_builder.build(), resource_provider)
        builder.build(num_records=num_records)
    except DeprecationWarning:
        raise
    except Exception as e:
        raise DataDesignerGenerationError(f"🛑 Error generating dataset: {e}") from e

    task_traces = builder.task_traces

    try:
        dataset_for_profiler = builder.artifact_storage.load_dataset_with_dropped_columns()
    except Exception as e:
        # Distinguish "early shutdown produced zero records" from generic load failures
        # so callers can react programmatically (e.g. retry on a different alias) instead
        # of parsing a wrapped FileNotFoundError. The scheduler's structured signal lives
        # on the builder for the duration of the run. We also require the run to have
        # produced zero records: a partial-salvage run that fails to load for unrelated
        # reasons (corrupt parquet, dropped-columns mismatch, filesystem hiccup) should
        # surface the original cause, not a misleading "zero records" diagnosis.
        if builder.early_shutdown and builder.actual_num_records == 0:
            raise DataDesignerEarlyShutdownError(
                "🛑 Generation produced zero records — early shutdown was triggered. "
                "The non-retryable error rate exceeded the configured threshold; check the "
                "warnings above (and any 'Provider showing degraded performance' logs) for "
                "the contributing failures."
            ) from e
        # Surface the original task error when the run produced 0 records due to a
        # deterministic non-retryable failure (e.g. bad seed source). Without this,
        # the user sees a generic FileNotFoundError-on-parquet that obscures the cause.
        # ``actual_num_records`` is set only on the async path; sync runs leave it at
        # ``-1`` and ``first_non_retryable_error`` at ``None``, so this branch is
        # async-only by construction.
        root_cause = builder.first_non_retryable_error
        if root_cause is not None and builder.actual_num_records == 0:
            raise DataDesignerGenerationError(f"🛑 {type(root_cause).__name__}: {root_cause}") from root_cause
        raise DataDesignerGenerationError(
            f"🛑 Failed to load generated dataset — all records may have been dropped "
            f"due to generation failures. Check the warnings above for details. Original error: {e}"
        ) from e

    # Defensive: the batch manager skips writing when the buffer is empty, so in
    # practice load_dataset_with_dropped_columns() would raise before returning a
    # zero-row DataFrame. This guard protects against future changes to that contract.
    if len(dataset_for_profiler) == 0:
        # Mirror the load-failure guard above: only raise the typed error when
        # the run actually produced zero records. A partial-salvage run that
        # somehow returns an empty DF for unrelated reasons should surface the
        # generic error.
        if builder.early_shutdown and builder.actual_num_records == 0:
            raise DataDesignerEarlyShutdownError(
                "🛑 Dataset is empty — early shutdown was triggered before any records "
                "could complete. Check the warnings above for the contributing failures."
            )
        root_cause = builder.first_non_retryable_error
        if root_cause is not None and builder.actual_num_records == 0:
            raise DataDesignerGenerationError(f"🛑 {type(root_cause).__name__}: {root_cause}") from root_cause
        raise DataDesignerGenerationError(
            "🛑 Dataset is empty — all records were dropped due to generation failures. "
            "Check the warnings above for details on which columns failed."
        )

    try:
        profiler = self._create_dataset_profiler(config_builder, resource_provider)
        analysis = profiler.profile_dataset(num_records, dataset_for_profiler)
    except Exception as e:
        raise DataDesignerProfilingError(f"🛑 Error profiling dataset: {e}") from e

    dataset_metadata = resource_provider.get_dataset_metadata()

    # Update metadata with column statistics from analysis
    if analysis:
        builder.artifact_storage.update_metadata(
            {"column_statistics": [stat.model_dump(mode="json") for stat in analysis.column_statistics]}
        )

    return DatasetCreationResults(
        artifact_storage=builder.artifact_storage,
        analysis=analysis,
        config_builder=config_builder,
        dataset_metadata=dataset_metadata,
        task_traces=task_traces,
    )

get_default_model_configs()

Get the default model configurations.

Returns:

Type Description
list[ModelConfig]

List of default model configurations.

Source code in packages/data-designer/src/data_designer/interface/data_designer.py
def get_default_model_configs(self) -> list[ModelConfig]:
    """Get the default model configurations.

    Returns:
        List of default model configurations.
    """
    logger.info(f"♻️ Using default model configs from {str(MODEL_CONFIGS_FILE_PATH)!r}")
    return get_default_model_configs()

get_default_model_providers()

Get the default model providers.

Returns:

Type Description
list[ModelProvider]

List of default model providers.

Source code in packages/data-designer/src/data_designer/interface/data_designer.py
def get_default_model_providers(self) -> list[ModelProvider]:
    """Get the default model providers.

    Returns:
        List of default model providers.
    """
    logger.info(f"♻️ Using default model providers from {str(MODEL_PROVIDERS_FILE_PATH)!r}")
    return get_default_providers()

get_models(model_aliases)

Get a dict of ModelFacade instances for custom column development.

Use this to experiment with custom column generator functions outside of the full pipeline. The returned dict matches the models argument passed to 3-arg custom column functions.

Parameters:

Name Type Description Default
model_aliases list[str]

List of model aliases to include in the dict.

required

Returns:

Type Description
dict[str, ModelFacade]

Dict mapping alias to ModelFacade instance.

Source code in packages/data-designer/src/data_designer/interface/data_designer.py
def get_models(self, model_aliases: list[str]) -> dict[str, ModelFacade]:
    """Get a dict of ModelFacade instances for custom column development.

    Use this to experiment with custom column generator functions outside of
    the full pipeline. The returned dict matches the `models` argument passed
    to 3-arg custom column functions.

    Args:
        model_aliases: List of model aliases to include in the dict.

    Returns:
        Dict mapping alias to ModelFacade instance.
    """
    config_builder = DataDesignerConfigBuilder()
    resource_provider = self._create_resource_provider("dev", config_builder)
    return {alias: resource_provider.model_registry.get_model(model_alias=alias) for alias in model_aliases}
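
A usage sketch (the model aliases below are hypothetical; use aliases defined in your model configs):

```python
# Hypothetical aliases for illustration.
models = designer.get_models(["text-model", "judge-model"])

# `models` has the same shape as the `models` argument handed to
# 3-arg custom column functions, so a generator function can be
# exercised directly outside the full pipeline:
# my_custom_column(row, ctx, models)  # hypothetical 3-arg signature
```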

list_mcp_tool_names(mcp_provider_name, *, timeout_sec=10.0)

Connect to a configured MCP provider and return the names of its available tools.

Parameters:

Name Type Description Default
mcp_provider_name str

The name field of an MCP provider passed to the constructor.

required
timeout_sec float

Timeout in seconds for the MCP handshake. Defaults to 10.

10.0

Returns:

Type Description
list[str]

A list of tool name strings exposed by the MCP server.

Raises:

Type Description
ValueError

If no provider with the given name was configured.

Source code in packages/data-designer/src/data_designer/interface/data_designer.py
def list_mcp_tool_names(self, mcp_provider_name: str, *, timeout_sec: float = 10.0) -> list[str]:
    """Connect to a configured MCP provider and return the names of its available tools.

    Args:
        mcp_provider_name: The ``name`` field of an MCP provider passed to the constructor.
        timeout_sec: Timeout in seconds for the MCP handshake. Defaults to 10.

    Returns:
        A list of tool name strings exposed by the MCP server.

    Raises:
        ValueError: If no provider with the given name was configured.
    """
    for provider in self._mcp_providers:
        if provider.name == mcp_provider_name:
            return list_tool_names(provider, timeout_sec=timeout_sec)
    configured = [p.name for p in self._mcp_providers]
    raise ValueError(f"No MCP provider named {mcp_provider_name!r}. Configured providers: {configured}")
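
For example (the provider name "docs-search" is hypothetical; it must match the name field of an MCP provider passed to the constructor):

```python
# Raises ValueError if "docs-search" was not configured, listing the
# names of the providers that were.
tool_names = designer.list_mcp_tool_names("docs-search", timeout_sec=5.0)
for name in tool_names:
    print(name)
```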

preview(config_builder, *, num_records=DEFAULT_NUM_RECORDS)

Generate preview dataset for fast iteration on your Data Designer configuration.

All preview results are stored in memory. Once you are satisfied with the preview, use the create method to generate data at a larger scale and save results to disk.

Parameters:

Name Type Description Default
config_builder DataDesignerConfigBuilder

The DataDesignerConfigBuilder containing the dataset configuration (columns, constraints, seed data, etc.).

required
num_records int

Number of records to generate.

DEFAULT_NUM_RECORDS

Returns:

Type Description
PreviewResults

PreviewResults object with methods for inspecting the results.

Raises:

Type Description
DataDesignerGenerationError

If an error occurs during preview dataset generation.

DataDesignerEarlyShutdownError

If preview terminated via the early-shutdown gate with zero records produced. Subclass of DataDesignerGenerationError.

DataDesignerProfilingError

If an error occurs during preview dataset profiling.

Source code in packages/data-designer/src/data_designer/interface/data_designer.py
def preview(
    self, config_builder: DataDesignerConfigBuilder, *, num_records: int = DEFAULT_NUM_RECORDS
) -> PreviewResults:
    """Generate preview dataset for fast iteration on your Data Designer configuration.

    All preview results are stored in memory. Once you are satisfied with the preview,
    use the `create` method to generate data at a larger scale and save results to disk.

    Args:
        config_builder: The DataDesignerConfigBuilder containing the dataset
            configuration (columns, constraints, seed data, etc.).
        num_records: Number of records to generate.

    Returns:
        PreviewResults object with methods for inspecting the results.

    Raises:
        DataDesignerGenerationError: If an error occurs during preview dataset generation.
        DataDesignerEarlyShutdownError: If preview terminated via the early-shutdown gate
            with zero records produced. Subclass of ``DataDesignerGenerationError``.
        DataDesignerProfilingError: If an error occurs during preview dataset profiling.
    """
    logger.info(f"{RandomEmoji.previewing()} Preview generation in progress")
    self._log_jinja_rendering_engine_mode()

    resource_provider = self._create_resource_provider("preview-dataset", config_builder)
    try:
        builder = self._create_dataset_builder(config_builder.build(), resource_provider)
        raw_dataset = builder.build_preview(num_records=num_records)
        processed_dataset = builder.process_preview(raw_dataset)
    except DeprecationWarning:
        raise
    except Exception as e:
        raise DataDesignerGenerationError(f"🛑 Error generating preview dataset: {e}") from e

    if len(processed_dataset) == 0:
        # Mirror the create() path: distinguish "early shutdown produced zero
        # records" from generic empty-dataset failures so callers can react
        # programmatically.
        if builder.early_shutdown and builder.actual_num_records == 0:
            raise DataDesignerEarlyShutdownError(
                "🛑 Preview is empty — early shutdown was triggered before any records "
                "could complete. Check the warnings above for the contributing failures."
            )
        root_cause = builder.first_non_retryable_error
        if root_cause is not None and builder.actual_num_records == 0:
            raise DataDesignerGenerationError(f"🛑 {type(root_cause).__name__}: {root_cause}") from root_cause
        raise DataDesignerGenerationError(
            "🛑 Dataset is empty — all records were dropped due to generation or processing failures. "
            "Check the warnings above for details on which columns failed."
        )

    dropped_columns = raw_dataset.columns.difference(processed_dataset.columns)
    if len(dropped_columns) > 0:
        dataset_for_profiler = lazy.pd.concat([processed_dataset, raw_dataset[dropped_columns]], axis=1)
    else:
        dataset_for_profiler = processed_dataset

    try:
        profiler = self._create_dataset_profiler(config_builder, resource_provider)
        analysis = profiler.profile_dataset(num_records, dataset_for_profiler)
    except Exception as e:
        raise DataDesignerProfilingError(f"🛑 Error profiling preview dataset: {e}") from e

    processor_artifacts: dict[str, list[dict]] = {}
    for name in builder.artifact_storage.list_processor_names():
        processor_artifacts[name] = builder.artifact_storage.load_processor_dataset(name).to_dict(orient="records")

    if isinstance(analysis, DatasetProfilerResults) and len(analysis.column_statistics) > 0:
        logger.info(f"{RandomEmoji.success()} Preview complete!")

    # Create dataset metadata from the resource provider
    dataset_metadata = resource_provider.get_dataset_metadata()

    return PreviewResults(
        dataset=processed_dataset,
        analysis=analysis,
        processor_artifacts=processor_artifacts,
        config_builder=config_builder,
        dataset_metadata=dataset_metadata,
        task_traces=builder.task_traces or None,
    )
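
A typical iteration loop might look like this sketch (builder import path assumed, as above):

```python
from data_designer.config import DataDesignerConfigBuilder  # assumed path

config_builder = DataDesignerConfigBuilder()
# ... adjust columns, prompts, constraints ...

# Previews are small and held entirely in memory, so this is cheap to
# repeat while tuning the configuration.
preview = designer.preview(config_builder, num_records=10)

# Once satisfied, scale up and persist with create().
results = designer.create(config_builder, num_records=10_000)
```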

set_run_config(run_config)

Set the runtime configuration for dataset generation.

Parameters:

Name Type Description Default
run_config RunConfig

A RunConfig instance containing runtime settings such as early shutdown behavior, batch sizing via buffer_size, and non-inference worker concurrency via non_inference_max_parallel_workers.

required
Notes

When disable_early_shutdown=True, DataDesigner will never terminate generation early due to error-rate thresholds. Errors are still tracked for reporting.

Source code in packages/data-designer/src/data_designer/interface/data_designer.py
def set_run_config(self, run_config: RunConfig) -> None:
    """Set the runtime configuration for dataset generation.

    Args:
        run_config: A RunConfig instance containing runtime settings such as
            early shutdown behavior, batch sizing via `buffer_size`, and non-inference worker
            concurrency via `non_inference_max_parallel_workers`.

    Notes:
        When `disable_early_shutdown=True`, DataDesigner will never terminate generation early
        due to error-rate thresholds. Errors are still tracked for reporting.
    """
    self._run_config = run_config
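
For instance (RunConfig import path assumed):

```python
from data_designer.config import RunConfig  # assumed import path

designer.set_run_config(RunConfig(disable_early_shutdown=True))

# RunConfig normalizes some fields on construction — e.g.
# shutdown_error_rate becomes 1.0 when disable_early_shutdown=True —
# so designer.run_config may not compare equal to the instance above.
```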

validate(config_builder)

Validate the Data Designer configuration as defined by the DataDesignerConfigBuilder with the configured engine components (SecretResolver, SeedReaders, etc.).

Parameters:

Name Type Description Default
config_builder DataDesignerConfigBuilder

The DataDesignerConfigBuilder containing the dataset configuration (columns, constraints, seed data, etc.).

required

Returns:

Type Description
None

None if the configuration is valid.

Raises:

Type Description
InvalidConfigError

If the configuration is invalid.

Source code in packages/data-designer/src/data_designer/interface/data_designer.py
def validate(self, config_builder: DataDesignerConfigBuilder) -> None:
    """Validate the Data Designer configuration as defined by the DataDesignerConfigBuilder
    with the configured engine components (SecretResolver, SeedReaders, etc.).

    Args:
        config_builder: The DataDesignerConfigBuilder containing the dataset
            configuration (columns, constraints, seed data, etc.).

    Returns:
        None if the configuration is valid.

    Raises:
        InvalidConfigError: If the configuration is invalid.
    """
    resource_provider = self._create_resource_provider("validate-configuration", config_builder)
    compile_data_designer_config(config_builder.build(), resource_provider)
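
A validation sketch (the InvalidConfigError import location is an assumption):

```python
from data_designer.interface.data_designer import DataDesigner  # assumed path

designer = DataDesigner()
try:
    designer.validate(config_builder)  # returns None when valid
except InvalidConfigError as e:  # assumed importable from the errors module
    print(f"Configuration invalid: {e}")
```

Because validate() compiles the config against the configured engine components (SecretResolver, SeedReaders, etc.), it catches wiring problems before any records are generated.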