
Seed Readers

Seed readers are engine-side adapters that turn a configured seed source into tabular seed rows. The engine attaches a SeedSource and secret resolver, asks the reader for column names and dataset size, then streams batches into generation.

Related pages: seeds, Seed Datasets, and Build Your Own.

Core Contracts

SeedReader

Bases: ABC, Generic[SourceT]

Base class for reading a seed dataset.

Seeds are read using DuckDB. Reader implementations define the DuckDB connection setup and how to obtain a URI that DuckDB can query directly, that is, the value placed after FROM in a query such as "SELECT * FROM '<uri>'".

The Data Designer engine automatically supplies the appropriate SeedSource and a SecretResolver to use for any secret fields in the config via attach(...). Subclasses that need per-attachment setup can override on_attach(...) without needing to call super().
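
The attach/on_attach lifecycle described above can be sketched with a small stand-in class. This is a minimal illustration, not the library's implementation: SketchReader, CountingReader, and the _cache attribute are hypothetical names, and the real SeedReader resets its own internal state through a private helper.

```python
from abc import ABC
from typing import Generic, TypeVar

SourceT = TypeVar("SourceT")


class SketchReader(ABC, Generic[SourceT]):
    """Hypothetical stand-in for SeedReader; not the library class."""

    def attach(self, source: SourceT, secret_resolver: object) -> None:
        # Drop anything cached for a previous attachment, then store the new objects.
        self._cache: dict[str, object] = {}
        self.source = source
        self.secret_resolver = secret_resolver
        self.on_attach()

    def on_attach(self) -> None:
        """Hook for subclasses; the base implementation does nothing."""


class CountingReader(SketchReader[str]):
    """Hypothetical subclass that does per-attachment setup."""

    def __init__(self) -> None:
        self.attach_count = 0

    def on_attach(self) -> None:
        # The base hook is a no-op, so calling super().on_attach() is not required.
        self.attach_count += 1
        self._cache["uri"] = f"uri-for:{self.source}"


reader = CountingReader()
reader.attach("first.parquet", secret_resolver=None)
reader.attach("second.parquet", secret_resolver=None)
```

Because attach(...) clears per-attachment state before invoking the hook, the second call leaves only the state derived from the second source.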

Methods:

attach: Attach a source and secret resolver to the instance.
create_filesystem_context: Create a rooted filesystem context for directory-backed seed readers.
get_column_names: Returns the seed dataset's column names.
get_seed_type: Return the seed_type of the source class this reader is generic over.
on_attach: Hook for subclasses that need per-attachment setup.

attach(source, secret_resolver)

Attach a source and secret resolver to the instance.

This is called internally by the engine so that these objects do not need to be provided in the reader's constructor.

Source code in packages/data-designer-engine/src/data_designer/engine/resources/seed_reader.py
def attach(self, source: SourceT, secret_resolver: SecretResolver) -> None:
    """Attach a source and secret resolver to the instance.

    This is called internally by the engine so that these objects do not
    need to be provided in the reader's constructor.
    """
    self._reset_attachment_state()
    self.source = source
    self.secret_resolver = secret_resolver
    self.on_attach()

create_filesystem_context(root_path)

Create a rooted filesystem context for directory-backed seed readers.

Source code in packages/data-designer-engine/src/data_designer/engine/resources/seed_reader.py
def create_filesystem_context(self, root_path: Path | str) -> SeedReaderFileSystemContext:
    """Create a rooted filesystem context for directory-backed seed readers."""
    resolved_root_path = Path(root_path).expanduser().resolve()
    rooted_fs = DirFileSystem(path=str(resolved_root_path), fs=LocalFileSystem())
    return SeedReaderFileSystemContext(fs=rooted_fs, root_path=resolved_root_path)

get_column_names()

Returns the seed dataset's column names.

Source code in packages/data-designer-engine/src/data_designer/engine/resources/seed_reader.py
def get_column_names(self) -> list[str]:
    """Returns the seed dataset's column names"""
    self._ensure_attached()
    conn = self._get_duckdb_connection()
    describe_query = f"DESCRIBE SELECT * FROM '{self.get_dataset_uri()}'"
    column_descriptions = conn.execute(describe_query).fetchall()
    return [col[0] for col in column_descriptions]

get_seed_type()

Return the seed_type of the source class this reader is generic over.

Source code in packages/data-designer-engine/src/data_designer/engine/resources/seed_reader.py
def get_seed_type(self) -> str:
    """Return the seed_type of the source class this reader is generic over."""
    # Get the generic type arguments from the reader class
    # Check __orig_bases__ for the generic base class
    for base in getattr(type(self), "__orig_bases__", []):
        origin = get_origin(base)
        if isinstance(origin, type) and issubclass(origin, SeedReader):
            args = get_args(base)
            if args:
                source_cls = get_origin(args[0]) or args[0]
                # Extract seed_type from the source class
                if hasattr(source_cls, "model_fields") and "seed_type" in source_cls.model_fields:
                    field = source_cls.model_fields["seed_type"]
                    default_value = field.default
                    if isinstance(default_value, str):
                        return default_value

    raise SeedReaderError("Reader does not have a valid generic source type with seed_type")
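
The lookup in get_seed_type() relies on Python recording parameterized bases in __orig_bases__. The sketch below reproduces that mechanism with standard-library typing only. BaseReader, LocalSource, LocalReader, and UnboundReader are hypothetical stand-ins, and a plain class attribute replaces the model_fields default that the real method inspects.

```python
from typing import Generic, TypeVar, get_args, get_origin

SourceT = TypeVar("SourceT")


class BaseReader(Generic[SourceT]):
    """Hypothetical stand-in for SeedReader."""

    def get_seed_type(self) -> str:
        # __orig_bases__ keeps the parameterized form, e.g. BaseReader[LocalSource].
        for base in getattr(type(self), "__orig_bases__", []):
            origin = get_origin(base)
            if isinstance(origin, type) and issubclass(origin, BaseReader):
                args = get_args(base)
                if args:
                    source_cls = get_origin(args[0]) or args[0]
                    # Assumption: seed_type is a plain class attribute here,
                    # not a model field default as in the real source classes.
                    seed_type = getattr(source_cls, "seed_type", None)
                    if isinstance(seed_type, str):
                        return seed_type
        raise LookupError("no generic source type with a string seed_type")


class LocalSource:
    seed_type = "local"


class LocalReader(BaseReader[LocalSource]):
    """Parameterized subclass: the source type is recoverable."""


class UnboundReader(BaseReader):
    """Unparameterized subclass: no source type to recover."""


resolved = LocalReader().get_seed_type()
```

A subclass that inherits from the base without a type argument has nothing to resolve, so the lookup falls through to the error, which matches the final raise in the method above.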

on_attach()

Hook for subclasses that need per-attachment setup.

Source code in packages/data-designer-engine/src/data_designer/engine/resources/seed_reader.py
def on_attach(self) -> None:
    """Hook for subclasses that need per-attachment setup."""

FileSystemSeedReader

Bases: SeedReader[FileSystemSourceT], ABC

Base class for filesystem-derived seed readers.

Plugin authors implement build_manifest(...) to describe the cheap logical rows available under the configured filesystem root. Readers that need expensive enrichment can optionally override hydrate_row(...) to emit one record dict or an iterable of record dicts per manifest row. When emitted records change the manifest schema, output_columns must declare the exact hydrated output schema for each emitted record. The framework owns attachment-scoped filesystem context reuse, manifest sampling, partitioning, randomization, batching, and DuckDB registration details.
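
The split between a cheap manifest pass and an optional hydration pass might look roughly like the following. This is a sketch under stated assumptions: the real build_manifest(...) and hydrate_row(...) are methods whose signatures are not shown on this page, so the free functions, their parameters, and the OUTPUT_COLUMNS name below are all hypothetical.

```python
import tempfile
from pathlib import Path
from typing import Any, Iterable

# Hypothetical declared schema for hydrated records; it differs from the manifest schema.
OUTPUT_COLUMNS = ["relative_path", "line_number", "text"]


def build_manifest(root: Path) -> list[dict[str, Any]]:
    """Cheap pass: one logical row per file, without reading file contents."""
    return [
        {"relative_path": path.relative_to(root).as_posix()}
        for path in sorted(root.rglob("*.txt"))
    ]


def hydrate_row(root: Path, manifest_row: dict[str, Any]) -> Iterable[dict[str, Any]]:
    """Expensive pass: read the file and emit one record per non-empty line."""
    relative_path = manifest_row["relative_path"]
    lines = (root / relative_path).read_text(encoding="utf-8").splitlines()
    for line_number, text in enumerate(lines, start=1):
        if text.strip():
            yield {"relative_path": relative_path, "line_number": line_number, "text": text}


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.txt").write_text("alpha\n\nbeta\n", encoding="utf-8")
    (root / "b.txt").write_text("gamma\n", encoding="utf-8")
    manifest = build_manifest(root)
    records = [record for row in manifest for record in hydrate_row(root, row)]
```

Here two manifest rows fan out into three hydrated records, and every record carries exactly the declared output columns, which is the situation in which output_columns has to be declared.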

SeedReaderFileSystemContext

SeedReaderBatch

Bases: Protocol

SeedReaderBatchReader

Bases: Protocol

PandasSeedReaderBatch

create_seed_reader_output_dataframe

Source code in packages/data-designer-engine/src/data_designer/engine/resources/seed_reader.py
def create_seed_reader_output_dataframe(
    *,
    records: list[dict[str, Any]],
    output_columns: list[str],
) -> pd.DataFrame:
    if not records:
        return lazy.pd.DataFrame(records, columns=output_columns)

    expected_columns = set(output_columns)
    for row_index, record in enumerate(records):
        record_columns = set(record)
        extra_columns = sorted(record_columns - expected_columns)
        missing_columns = [column for column in output_columns if column not in record]
        if not extra_columns and not missing_columns:
            continue

        message_parts: list[str] = [
            f"Hydrated record at index {row_index} does not match output_columns {output_columns!r}."
        ]
        if missing_columns:
            message_parts.append(f"Missing columns: {missing_columns!r}.")
        if extra_columns:
            message_parts.append(f"Undeclared columns: {extra_columns!r}.")
        message_parts.append("Ensure each record emitted by hydrate_row() matches the declared output schema.")
        raise SeedReaderError(" ".join(message_parts))

    return lazy.pd.DataFrame(records, columns=output_columns)
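
The per-record check in the function above reduces to two comparisons against output_columns. The sketch below isolates that logic without pandas. find_schema_mismatch is a hypothetical helper name, not part of the library.

```python
from typing import Any


def find_schema_mismatch(
    record: dict[str, Any], output_columns: list[str]
) -> tuple[list[str], list[str]]:
    """Return (missing, extra) column names for a single record."""
    # Missing columns keep the declared order; extra columns are sorted by name.
    missing = [column for column in output_columns if column not in record]
    extra = sorted(set(record) - set(output_columns))
    return missing, extra


output_columns = ["path", "text"]
ok = find_schema_mismatch({"path": "a.txt", "text": "hi"}, output_columns)
bad = find_schema_mismatch({"path": "a.txt", "size": 2, "mtime": 0}, output_columns)
```

A record passes only when both lists are empty. In the function above, the first record with a non-empty list raises SeedReaderError with the record index and the offending column names.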

Built-In Readers

LocalFileSeedReader

Bases: SeedReader[LocalFileSeedSource]

HuggingFaceSeedReader

Bases: SeedReader[HuggingFaceSeedSource]

DataFrameSeedReader

Bases: SeedReader[DataFrameSeedSource]

DirectorySeedReader

Bases: FileSystemSeedReader[DirectorySeedSource]

FileContentsSeedReader

Bases: FileSystemSeedReader[FileContentsSeedSource]

AgentRolloutSeedReader

Bases: FileSystemSeedReader[AgentRolloutSeedSource]

Registry and Errors

SeedReaderRegistry

Source code in packages/data-designer-engine/src/data_designer/engine/resources/seed_reader.py
def __init__(self, readers: Sequence[SeedReader]):
    self._readers: dict[str, SeedReader] = {}
    for reader in readers:
        self.add_reader(reader)

SeedReaderError

Bases: DataDesignerError