Column Generators

Column generators produce column values in the Data Designer engine. A generator receives the upstream data it needs, returns the row or batch with the generated values added, and reports which generation strategy the scheduler should use.

Related pages: column_configs, Build Your Own, Using Models in Plugins, and Custom Columns.

Configuration

User-facing column configs inherit from SingleColumnConfig and define a unique column_type discriminator. During compilation, the engine may group related configs into multi-column configs for generators that create sampler or seed columns together.

Generation strategy

Column generator base classes return GenerationStrategy values to tell the engine whether they run per row or over a full batch.

Implementation bases

Generators that operate on a full batch can inherit from ColumnGeneratorFullColumn. Row-oriented non-model generators can inherit from ColumnGeneratorCellByCell. Generators that create initial rows use FromScratchColumnGenerator. Model-backed plugin generators should use ColumnGeneratorWithModelRegistry or ColumnGeneratorWithModel; see Using Models in Plugins for authoring guidance.

ColumnGenerator

Bases: ConfigurableTask[TaskConfigT], ABC

Methods:

Name                Description
agenerate           Async generate; delegates to sync generate() via thread pool.
generate            Sync generate; overridden by most concrete generators.
log_pre_generation  A shared method to log info before the generator's generate method is called.

Attributes:

Name                Type  Description
is_llm_bound        bool  Whether this generator makes model/API calls during generation.
is_order_dependent  bool  Whether this generator's output depends on prior row-group calls.

Source code in packages/data-designer-engine/src/data_designer/engine/configurable_task.py
def __init__(self, config: TaskConfigT, resource_provider: ResourceProvider):
    self._config = self.get_config_type().model_validate(config)
    self._resource_provider = resource_provider
    self._validate()
    self._initialize()

is_llm_bound property

Whether this generator makes model/API calls during generation.

is_order_dependent property

Whether this generator's output depends on prior row-group calls.

Example: SeedDatasetColumnGenerator tracks its position in the seed dataset, so row group N must complete before N+1 starts.
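
A minimal sketch of how a scheduler might honor this flag. The `SeedLikeGenerator` class and the `run_row_groups` helper are hypothetical stand-ins written for this example, not part of the Data Designer engine: order-dependent generators process row groups one at a time, while other generators may have their row groups overlap.

```python
import asyncio


class SeedLikeGenerator:
    """Hypothetical stand-in: tracks a cursor, so call order matters."""

    is_order_dependent = True

    def __init__(self) -> None:
        self._position = 0

    async def agenerate(self, group: list[dict]) -> list[dict]:
        start = self._position
        self._position += len(group)
        await asyncio.sleep(0)  # yield control, as real I/O would
        return [{**row, "seed_index": start + i} for i, row in enumerate(group)]


async def run_row_groups(generator, groups: list[list[dict]]) -> list[list[dict]]:
    """Hypothetical scheduler step, not the engine's actual implementation."""
    if generator.is_order_dependent:
        # Row group N must finish before row group N+1 starts.
        return [await generator.agenerate(group) for group in groups]
    # Independent row groups may run concurrently; gather keeps input order.
    return list(await asyncio.gather(*(generator.agenerate(g) for g in groups)))


groups = [[{"id": 0}, {"id": 1}], [{"id": 2}]]
result = asyncio.run(run_row_groups(SeedLikeGenerator(), groups))
```

With the sequential branch, the second row group continues from where the first one stopped, so its only row receives `seed_index` 2.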

agenerate(data) async

agenerate(data: dict) -> dict
agenerate(data: pd.DataFrame) -> pd.DataFrame

Async generate — delegates to sync generate() via thread pool.

Subclasses with native async support (e.g. ColumnGeneratorWithModelChatCompletion) should override this with a direct async implementation.

Source code in packages/data-designer-engine/src/data_designer/engine/column_generators/generators/base.py
async def agenerate(self, data: DataT) -> DataT:
    """Async generate — delegates to sync ``generate()`` via thread pool.

    Subclasses with native async support (e.g. ColumnGeneratorWithModelChatCompletion)
    should override this with a direct async implementation.
    """
    if not self._is_overridden("generate"):
        raise NotImplementedError(f"{type(self).__name__} must implement either generate() or agenerate()")
    return await asyncio.to_thread(self.generate, data.copy())

generate(data)

generate(data: dict) -> dict
generate(data: pd.DataFrame) -> pd.DataFrame

Sync generate — overridden by most concrete generators.

The default implementation bridges to agenerate() for async-first subclasses that implement only agenerate(). It raises NotImplementedError if neither generate() nor agenerate() is overridden.
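
The two defaults bridge in opposite directions, so a subclass only has to implement one of them. The sketch below illustrates that pattern in isolation. `BridgedGenerator` is a hypothetical stand-in, its `_is_overridden` body is an assumed implementation, and `asyncio.run` stands in for the engine's `_run_coroutine_sync` helper, whose behavior is not shown on this page.

```python
import asyncio


class BridgedGenerator:
    """Hypothetical stand-in for the base class; not the engine's code."""

    def _is_overridden(self, name: str) -> bool:
        # Assumed semantics: the subclass defines its own version of `name`.
        return getattr(type(self), name) is not getattr(BridgedGenerator, name)

    def generate(self, data: dict) -> dict:
        if not self._is_overridden("agenerate"):
            raise NotImplementedError("implement generate() or agenerate()")
        # Simplification: asyncio.run only works when no event loop is
        # already running in the current thread.
        return asyncio.run(self.agenerate(data))

    async def agenerate(self, data: dict) -> dict:
        if not self._is_overridden("generate"):
            raise NotImplementedError("implement generate() or agenerate()")
        return await asyncio.to_thread(self.generate, data.copy())


class SyncOnly(BridgedGenerator):
    def generate(self, data: dict) -> dict:
        return {**data, "source": "sync"}


class AsyncOnly(BridgedGenerator):
    async def agenerate(self, data: dict) -> dict:
        return {**data, "source": "async"}


class Neither(BridgedGenerator):
    pass


sync_via_async = asyncio.run(SyncOnly().agenerate({"id": 1}))
async_via_sync = AsyncOnly().generate({"id": 1})

try:
    Neither().generate({})
    neither_error = None
except NotImplementedError as exc:
    neither_error = str(exc)
```

The override check is what prevents infinite recursion: without it, a subclass that implements neither method would bounce between the two defaults forever.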

Source code in packages/data-designer-engine/src/data_designer/engine/column_generators/generators/base.py
def generate(self, data: DataT) -> DataT:
    """Sync generate — overridden by most concrete generators.

    Default bridges to ``agenerate()`` for async-first subclasses that only
    implement ``agenerate()``. Raises ``NotImplementedError`` if neither
    ``generate()`` nor ``agenerate()`` is overridden.
    """
    if not self._is_overridden("agenerate"):
        raise NotImplementedError(f"{type(self).__name__} must implement either generate() or agenerate()")
    return _run_coroutine_sync(self.agenerate(data))

log_pre_generation()

A shared method to log info before the generator's generate method is called.

Dataset builders are expected to call this method once on every generator before calling generate. Logging up front avoids emitting the same information repeatedly when generators run in parallel.
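
A minimal sketch of that calling convention, assuming a hypothetical `run_builder` loop and a stand-in `LoggingGenerator` class (neither comes from the engine). The log call happens once per generator, outside the parallel section, so the thread pool never repeats it.

```python
from concurrent.futures import ThreadPoolExecutor


class LoggingGenerator:
    """Hypothetical stand-in generator that records its log messages."""

    def __init__(self, name: str, log: list[str]) -> None:
        self.name = name
        self._log = log

    def log_pre_generation(self) -> None:
        self._log.append(f"starting {self.name}")

    def generate(self, data: dict) -> dict:
        # Adds one column holding the number of keys seen so far.
        return {**data, self.name: len(data)}


def run_builder(generators: list[LoggingGenerator], rows: list[dict]) -> list[dict]:
    """Hypothetical builder loop: log once per generator, then fan out."""
    for gen in generators:
        gen.log_pre_generation()  # once, before any parallel work starts
    for gen in generators:
        with ThreadPoolExecutor(max_workers=4) as pool:
            rows = list(pool.map(gen.generate, rows))
    return rows


log: list[str] = []
rows = run_builder(
    [LoggingGenerator("a", log), LoggingGenerator("b", log)],
    [{"id": i} for i in range(3)],
)
```

Three rows pass through each generator, yet the log holds exactly one entry per generator.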

Source code in packages/data-designer-engine/src/data_designer/engine/column_generators/generators/base.py
def log_pre_generation(self) -> None:
    """A shared method to log info before the generator's `generate` method is called.

    The idea is for dataset builders to call this method for all generators before calling their
    `generate` method. This is to avoid logging the same information multiple times when running
    generators in parallel.
    """

ColumnGeneratorFullColumn

Bases: ColumnGenerator[TaskConfigT], ABC

Base class for column generators that transform a full batch at once.

Override generate to return the complete batch DataFrame after adding generated values. Use this base when generation is vectorizable or when an external API accepts batched input more efficiently than per-row calls.

Methods:

Name      Description
generate  Generate an entire batch of row outputs.

generate(data) abstractmethod

Generate an entire batch of row outputs.

Parameters:

Name  Type       Description                                                            Default
data  DataFrame  DataFrame containing the upstream columns this generator depends on.  required

Returns:

Type       Description
DataFrame  DataFrame containing the input columns plus the new column and any side-effect columns. When config.allow_resize is False, the row count must match the input; when it is True, the row count may change.
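
The row-count rule can be expressed as a small post-generation check. This is a sketch under stated assumptions: `check_row_count` is a hypothetical helper, not the engine's validation code, and plain lists of dicts stand in for DataFrames. Because `len()` of a pandas DataFrame is its row count, the same comparison applies to real batches.

```python
from collections.abc import Sized


def check_row_count(before: Sized, after: Sized, allow_resize: bool) -> None:
    """Hypothetical post-generation check; not the engine's actual validation."""
    if not allow_resize and len(after) != len(before):
        raise ValueError(
            f"expected {len(before)} rows, got {len(after)}; "
            "set allow_resize=True if the generator may add or drop rows"
        )


batch = [{"text": "a"}, {"text": "bb"}, {"text": ""}]
# A generator that drops empty rows changes the row count.
filtered = [{**row, "length": len(row["text"])} for row in batch if row["text"]]

check_row_count(batch, filtered, allow_resize=True)  # accepted

try:
    check_row_count(batch, filtered, allow_resize=False)
    resize_error = None
except ValueError as exc:
    resize_error = str(exc)
```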

Source code in packages/data-designer-engine/src/data_designer/engine/column_generators/generators/base.py
@abstractmethod
def generate(self, data: pd.DataFrame) -> pd.DataFrame:
    """Generate an entire batch of row outputs.

    Args:
        data: DataFrame containing the upstream columns this generator depends on.

    Returns:
        DataFrame containing the input columns plus the new column and any side-effect
        columns. When ``config.allow_resize`` is ``False``, the row count must match
        the input; when it is ``True``, the row count may change.
    """

ColumnGeneratorCellByCell

Bases: ColumnGenerator[TaskConfigT], ABC

Base class for column generators invoked once per row.

Override generate to return the complete row mapping after adding the generated value. The engine calls the generator once per row and may run calls concurrently. Use this base when generation is independent per row (e.g. an LLM call per row, a per-row transform).

Methods:

Name      Description
generate  Generate one row's output from a single row's upstream values.

generate(data) abstractmethod

Generate one row's output from a single row's upstream values.

Parameters:

Name  Type  Description                                                                   Default
data  dict  Current row mapping containing the upstream values available to this column.  required

Returns:

Type  Description
dict  Complete row mapping with existing keys preserved and the new column value added. Include declared side-effect columns when the config creates them.
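
A minimal sketch of the return contract for a row-level generator. `WordCountGenerator` is a hypothetical class that takes its column names directly as constructor arguments; a real generator would subclass ColumnGeneratorCellByCell and read such settings from its config.

```python
class WordCountGenerator:
    """Hypothetical row-level generator used only to show the contract."""

    def __init__(self, source_column: str, target_column: str) -> None:
        self.source_column = source_column
        self.target_column = target_column

    def generate(self, data: dict) -> dict:
        # Preserve every existing key and add the new column value.
        count = len(str(data[self.source_column]).split())
        return {**data, self.target_column: count}


gen = WordCountGenerator(source_column="review", target_column="review_words")
row = {"id": 7, "review": "fast shipping and great quality"}
out = gen.generate(row)
```

Building a new dict with `{**data, ...}` leaves the input row untouched, which keeps the generator safe to call concurrently on different rows.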

Source code in packages/data-designer-engine/src/data_designer/engine/column_generators/generators/base.py
@abstractmethod
def generate(self, data: dict) -> dict:
    """Generate one row's output from a single row's upstream values.

    Args:
        data: Current row mapping containing the upstream values available to this column.

    Returns:
        Complete row mapping with existing keys preserved and the new column value added.
        Include declared side-effect columns when the config creates them.
    """

FromScratchColumnGenerator

Bases: ColumnGenerator[TaskConfigT], ABC

Methods:

Name                    Description
agenerate_from_scratch  Async wrapper; wraps sync generate_from_scratch() in a thread.

agenerate_from_scratch(num_records) async

Async wrapper — wraps sync generate_from_scratch() in a thread.
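
A minimal sketch of the thread-wrapping pattern, assuming a hypothetical `CounterFromScratch` class that returns a list of dicts rather than a DataFrame. Because each blocking call runs in a worker thread, the event loop stays free and two calls can overlap.

```python
import asyncio
import time


class CounterFromScratch:
    """Hypothetical stand-in that returns a list of dicts, not a DataFrame."""

    def generate_from_scratch(self, num_records: int) -> list[dict]:
        time.sleep(0.05)  # simulate blocking work
        return [{"row_id": i} for i in range(num_records)]

    async def agenerate_from_scratch(self, num_records: int) -> list[dict]:
        # Same shape as the wrapper shown on this page.
        return await asyncio.to_thread(self.generate_from_scratch, num_records)


async def main() -> list[list[dict]]:
    gen = CounterFromScratch()
    # Both blocking calls run in worker threads, so they can overlap.
    return await asyncio.gather(
        gen.agenerate_from_scratch(2),
        gen.agenerate_from_scratch(3),
    )


first, second = asyncio.run(main())
```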

Source code in packages/data-designer-engine/src/data_designer/engine/column_generators/generators/base.py
async def agenerate_from_scratch(self, num_records: int) -> pd.DataFrame:
    """Async wrapper — wraps sync ``generate_from_scratch()`` in a thread."""
    return await asyncio.to_thread(self.generate_from_scratch, num_records)

ColumnGeneratorWithModelRegistry

Bases: ColumnGenerator[TaskConfigT], ABC

ColumnGeneratorWithModel

Bases: ColumnGeneratorWithModelRegistry[TaskConfigT], ABC
