
Dataset Creation Results

DatasetCreationResults is returned by DataDesigner.create(). It provides access to persisted creation artifacts (the generated dataset, profiling analysis, processor outputs, task traces, and dataset metadata) and supports uploading the dataset to the Hugging Face Hub.

In contrast, preview generation uses the in-memory data_designer.config.preview_results.PreviewResults object returned by DataDesigner.preview(); persisted dataset creation returns DatasetCreationResults.
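As a hedged sketch of the access pattern (the `results` object below is a `SimpleNamespace` stand-in; only the attribute and method names mirror this page, and the data is fabricated for illustration):

```python
from types import SimpleNamespace

import pandas as pd

# Stand-in for the object returned by DataDesigner.create(); in real use this
# would be `results = data_designer.create(config, num_records=...)`.
results = SimpleNamespace(
    load_dataset=lambda: pd.DataFrame({"text": ["a", "b"], "label": [0, 1]}),
    task_traces=[],  # populated by the async scheduler in a real run
)

df = results.load_dataset()
print(f"{len(df)} records, columns: {list(df.columns)}")
```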

DatasetCreationResults

Bases: WithRecordSamplerMixin

Results container for a Data Designer dataset creation run.

This class provides access to the generated dataset, profiling analysis, and visualization utilities. It is returned by the DataDesigner.create() method and implements ResultsProtocol of the DataDesigner interface.

Creates a new instance with results based on a dataset creation run.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `artifact_storage` | `ArtifactStorage` | Storage manager for accessing generated artifacts. | *required* |
| `analysis` | `DatasetProfilerResults` | Profiling results for the generated dataset. | *required* |
| `config_builder` | `DataDesignerConfigBuilder` | Configuration builder used to create the dataset. | *required* |
| `dataset_metadata` | `DatasetMetadata` | Metadata about the generated dataset (e.g., seed column names). | *required* |
| `task_traces` | `list[TaskTrace] \| None` | Optional list of `TaskTrace` objects from the async scheduler. | `None` |

Methods:

| Name | Description |
|------|-------------|
| `get_path_to_processor_artifacts` | Get the path to the artifacts generated by a processor. |
| `load_analysis` | Load the profiling analysis results for the generated dataset. |
| `load_dataset` | Load the generated dataset as a pandas DataFrame. |
| `load_processor_dataset` | Load the dataset generated by a processor. |
| `push_to_hub` | Push dataset to HuggingFace Hub. |

Source code in packages/data-designer/src/data_designer/interface/results.py
```python
def __init__(
    self,
    *,
    artifact_storage: ArtifactStorage,
    analysis: DatasetProfilerResults,
    config_builder: DataDesignerConfigBuilder,
    dataset_metadata: DatasetMetadata,
    task_traces: list[TaskTrace] | None = None,
):
    """Creates a new instance with results based on a dataset creation run.

    Args:
        artifact_storage: Storage manager for accessing generated artifacts.
        analysis: Profiling results for the generated dataset.
        config_builder: Configuration builder used to create the dataset.
        dataset_metadata: Metadata about the generated dataset (e.g., seed column names).
        task_traces: Optional list of TaskTrace objects from the async scheduler.
    """
    self.artifact_storage = artifact_storage
    self._analysis = analysis
    self._config_builder = config_builder
    self.dataset_metadata = dataset_metadata
    self.task_traces: list[TaskTrace] = task_traces or []
```

get_path_to_processor_artifacts(processor_name)

Get the path to the artifacts generated by a processor.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `processor_name` | `str` | The name of the processor to load the artifact from. | *required* |

Returns:

| Type | Description |
|------|-------------|
| `Path` | The path to the artifacts. |

Source code in packages/data-designer/src/data_designer/interface/results.py
```python
def get_path_to_processor_artifacts(self, processor_name: str) -> Path:
    """Get the path to the artifacts generated by a processor.

    Args:
        processor_name: The name of the processor to load the artifact from.

    Returns:
        The path to the artifacts.
    """
    if not self.artifact_storage.processors_outputs_path.exists():
        raise ArtifactStorageError(f"Processor {processor_name} has no artifacts.")
    return self.artifact_storage.processors_outputs_path / processor_name
```
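The returned value is a plain `pathlib.Path`, so standard filesystem operations apply. A minimal sketch of the layout implied by the code above (the directory and processor names are assumptions for illustration, not part of the API):

```python
from pathlib import Path

# Artifacts live under <processors_outputs_path>/<processor_name>; the
# concrete names here are illustrative stand-ins.
processors_outputs_path = Path("my-run/artifacts/processors")
processor_name = "dedupe"

artifact_dir = processors_outputs_path / processor_name
# e.g., list Parquet batch files once the directory exists on disk
files = sorted(artifact_dir.glob("*.parquet")) if artifact_dir.exists() else []
print(artifact_dir)
```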

load_analysis()

Load the profiling analysis results for the generated dataset.

Returns:

| Type | Description |
|------|-------------|
| `DatasetProfilerResults` | `DatasetProfilerResults` containing statistical analysis and quality metrics for configured columns in the generated dataset. |

Source code in packages/data-designer/src/data_designer/interface/results.py
```python
def load_analysis(self) -> DatasetProfilerResults:
    """Load the profiling analysis results for the generated dataset.

    Returns:
        DatasetProfilerResults containing statistical analysis and quality metrics
            for configured columns in the generated dataset.
    """
    return self._analysis
```

load_dataset()

Load the generated dataset as a pandas DataFrame.

Returns:

| Type | Description |
|------|-------------|
| `DataFrame` | A pandas DataFrame containing the full generated dataset. |

Source code in packages/data-designer/src/data_designer/interface/results.py
```python
def load_dataset(self) -> pd.DataFrame:
    """Load the generated dataset as a pandas DataFrame.

    Returns:
        A pandas DataFrame containing the full generated dataset.
    """
    return self.artifact_storage.load_dataset()
```
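Because `load_dataset()` returns a standard pandas `DataFrame`, the usual inspection tools apply directly. A sketch using a fabricated frame as a stand-in for the generated dataset:

```python
import pandas as pd

# Stand-in for `df = results.load_dataset()`; the columns are fabricated.
df = pd.DataFrame(
    {
        "prompt": ["Summarize X.", "Translate Y."],
        "response": ["X is ...", "Y in French is ..."],
    }
)

print(df.shape)  # (2, 2): two records, two columns
summary = df.describe(include="all")  # count/unique/top/freq for text columns
```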

load_processor_dataset(processor_name)

Load the dataset generated by a processor.

This only works for processors that write their artifacts in Parquet format.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `processor_name` | `str` | The name of the processor to load the dataset from. | *required* |

Returns:

| Type | Description |
|------|-------------|
| `DataFrame` | A pandas DataFrame containing the dataset generated by the processor. |

Source code in packages/data-designer/src/data_designer/interface/results.py
```python
def load_processor_dataset(self, processor_name: str) -> pd.DataFrame:
    """Load the dataset generated by a processor.

    This only works for processors that write their artifacts in Parquet format.

    Args:
        processor_name: The name of the processor to load the dataset from.

    Returns:
        A pandas DataFrame containing the dataset generated by the processor.
    """
    return self.artifact_storage.load_processor_dataset(processor_name)
```
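A processor's output is likewise a plain `DataFrame`, so it can be filtered or joined with the main dataset. A sketch with fabricated data (the `"quality_scorer"` name and `score` column are assumptions for illustration, not part of the API):

```python
import pandas as pd

# Stand-in for `scores = results.load_processor_dataset("quality_scorer")`.
scores = pd.DataFrame({"record_id": [0, 1, 2], "score": [0.9, 0.4, 0.7]})

# Keep only records whose score clears a threshold.
kept = scores[scores["score"] >= 0.5].reset_index(drop=True)
print(len(kept))  # 2
```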

push_to_hub(repo_id, description, *, token=None, private=False, tags=None)

Push dataset to HuggingFace Hub.

Uploads all artifacts, including:

- Main parquet batch files (data subset)
- Processor output batch files ({processor_name} subsets)
- Configuration (builder_config.json)
- Metadata (metadata.json)
- Auto-generated dataset card (README.md)

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `repo_id` | `str` | HuggingFace repo ID (e.g., `"username/my-dataset"`). | *required* |
| `description` | `str` | Custom description text for the dataset card. Appears after the title. | *required* |
| `token` | `str \| None` | HuggingFace API token. If `None`, the token is automatically resolved from the `HF_TOKEN` environment variable or cached credentials from `hf auth login`. | `None` |
| `private` | `bool` | If `True`, create the repository as private. | `False` |
| `tags` | `list[str] \| None` | Additional custom tags for the dataset. | `None` |

Returns:

| Type | Description |
|------|-------------|
| `str` | URL to the uploaded dataset. |

Example:

```python
>>> results = data_designer.create(config, num_records=1000)
>>> description = "This dataset contains synthetic conversations for training chatbots."
>>> results.push_to_hub("username/my-synthetic-dataset", description, tags=["chatbot", "conversation"])
'https://huggingface.co/datasets/username/my-synthetic-dataset'
```

Source code in packages/data-designer/src/data_designer/interface/results.py
```python
def push_to_hub(
    self,
    repo_id: str,
    description: str,
    *,
    token: str | None = None,
    private: bool = False,
    tags: list[str] | None = None,
) -> str:
    """Push dataset to HuggingFace Hub.

    Uploads all artifacts including:
    - Main parquet batch files (data subset)
    - Processor output batch files ({processor_name} subsets)
    - Configuration (builder_config.json)
    - Metadata (metadata.json)
    - Auto-generated dataset card (README.md)

    Args:
        repo_id: HuggingFace repo ID (e.g., "username/my-dataset")
        description: Custom description text for the dataset card.
            Appears after the title.
        token: HuggingFace API token. If None, the token is automatically
            resolved from HF_TOKEN environment variable or cached credentials
            from `hf auth login`.
        private: Create private repo
        tags: Additional custom tags for the dataset.

    Returns:
        URL to the uploaded dataset

    Example:
        >>> results = data_designer.create(config, num_records=1000)
        >>> description = "This dataset contains synthetic conversations for training chatbots."
        >>> results.push_to_hub("username/my-synthetic-dataset", description, tags=["chatbot", "conversation"])
        'https://huggingface.co/datasets/username/my-synthetic-dataset'
    """
    client = HuggingFaceHubClient(token=token)
    return client.upload_dataset(
        repo_id=repo_id,
        base_dataset_path=self.artifact_storage.base_dataset_path,
        private=private,
        description=description,
        tags=tags,
    )
```