Seeds

Seed configs declare existing data used as input during generation. A SeedConfig combines a seed source with optional row sampling and selection settings. Seed source objects declare where seed data comes from; the engine reads them through seed readers.

Use these objects with DataDesignerConfigBuilder.with_seed_dataset(). Related pages: Seed Datasets and seed readers.

Built-in seed sources include local files, Hugging Face paths, in-memory DataFrames, directories, file contents, and agent rollout traces. Plugin seed sources can extend the same discriminated union through the plugin system.

Seed Config

Classes:

Name        Description
SeedConfig  Configuration for sampling data from a seed dataset.

SeedConfig

Bases: ConfigBase

Configuration for sampling data from a seed dataset.

Attributes:

Name Type Description
source SeedSourceT

A SeedSource defining where the seed data is located.

sampling_strategy SamplingStrategy

Strategy for how to sample rows from the dataset.
- ORDERED: Read rows sequentially in their original order.
- SHUFFLE: Randomly shuffle rows before sampling. When used with selection_strategy, shuffling occurs within the selected range/partition.

selection_strategy IndexRange | PartitionBlock | None

Optional strategy to select a subset of the dataset.
- IndexRange: Select a specific range of indices (e.g., rows 100-200).
- PartitionBlock: Select a partition by splitting the dataset into N equal parts. Partition indices are zero-based (index=0 is the first partition, index=1 is the second, etc.).

Examples:

Read rows sequentially from start to end:

    SeedConfig(
        source=LocalFileSeedSource(path="my_data.parquet"),
        sampling_strategy=SamplingStrategy.ORDERED
    )

Read rows in random order:

    SeedConfig(
        source=LocalFileSeedSource(path="my_data.parquet"),
        sampling_strategy=SamplingStrategy.SHUFFLE
    )

Read a specific index range (rows 100-199):

    SeedConfig(
        source=LocalFileSeedSource(path="my_data.parquet"),
        sampling_strategy=SamplingStrategy.ORDERED,
        selection_strategy=IndexRange(start=100, end=199)
    )

Read random rows from a specific index range (shuffles within rows 100-199):

    SeedConfig(
        source=LocalFileSeedSource(path="my_data.parquet"),
        sampling_strategy=SamplingStrategy.SHUFFLE,
        selection_strategy=IndexRange(start=100, end=199)
    )

Read from partition 2 (the third partition, zero-based) of 5 partitions (20% of the dataset):

    SeedConfig(
        source=LocalFileSeedSource(path="my_data.parquet"),
        sampling_strategy=SamplingStrategy.ORDERED,
        selection_strategy=PartitionBlock(index=2, num_partitions=5)
    )

Read shuffled rows from partition 0 of 10 partitions (shuffles within the partition):

    SeedConfig(
        source=LocalFileSeedSource(path="my_data.parquet"),
        sampling_strategy=SamplingStrategy.SHUFFLE,
        selection_strategy=PartitionBlock(index=0, num_partitions=10)
    )
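To make the interaction between selection and sampling concrete, the following is a minimal standalone sketch, not the library's implementation. It assumes that IndexRange treats end as inclusive (consistent with the "rows 100-199" example, which uses end=199) and that any remainder rows fall into the last partition. The helper names partition_bounds and read_order are hypothetical and exist only for illustration.

```python
import random
from typing import Optional, Tuple


def partition_bounds(index: int, num_partitions: int, dataset_size: int) -> Tuple[int, int]:
    """Return inclusive (start, end) row bounds for a zero-based partition.

    Assumption: remainder rows, if any, are assigned to the last partition.
    """
    if not 0 <= index < num_partitions:
        raise ValueError("index must satisfy 0 <= index < num_partitions")
    size = dataset_size // num_partitions
    start = index * size
    if index == num_partitions - 1:
        end = dataset_size - 1
    else:
        end = start + size - 1
    return start, end


def read_order(
    dataset_size: int,
    bounds: Optional[Tuple[int, int]] = None,
    shuffle: bool = False,
    seed: Optional[int] = None,
) -> list:
    """Return row indices in the order they would be read."""
    start, end = bounds if bounds is not None else (0, dataset_size - 1)
    rows = list(range(start, end + 1))  # end is treated as inclusive
    if shuffle:
        # Shuffling is applied after selection, so it never leaves the range.
        random.Random(seed).shuffle(rows)
    return rows


ordered = read_order(1000, bounds=(100, 199))
shuffled = read_order(1000, bounds=(100, 199), shuffle=True, seed=0)
third_of_five = read_order(1000, bounds=partition_bounds(2, 5, 1000))
```

In this sketch, shuffled contains exactly the same row indices as ordered, only in a different order, which mirrors the documented rule that shuffling occurs within the selected range or partition.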

Built-In Seed Sources

Classes:

Name                  Description
FileSystemSeedSource  Base class for seed sources backed by a directory of files.
SeedSource            Base class for seed dataset configurations.

FileSystemSeedSource

Bases: SeedSource, ABC

Base class for seed sources backed by a directory of files.

Use this base class when a seed reader needs to enumerate files under a directory on disk and turn each file (or group of files) into seed rows. Concrete plugin configs declare a Literal seed_type and pair with a FileSystemSeedReader implementation.

Attributes:

Name Type Description
path str

Directory containing seed artifacts. Relative paths are resolved from the current working directory when the config is loaded, not from the config file location.

file_pattern str

Case-sensitive filename pattern used to match files under the provided directory. Patterns match basenames only, not relative paths. Defaults to '*'.

recursive bool

Whether to search nested subdirectories under the provided directory for matching files. Defaults to True.
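The matching rules described by these three attributes can be approximated with the standard library. The sketch below is an illustration under stated assumptions rather than the behavior of any actual FileSystemSeedReader: it assumes shell-style glob patterns, and the helper name find_seed_files is hypothetical.

```python
import fnmatch
import os
from pathlib import Path


def find_seed_files(path: str, file_pattern: str = "*", recursive: bool = True) -> list:
    """Hypothetical helper: list files under `path` whose basename matches `file_pattern`."""
    # A relative path resolves against the current working directory,
    # not against the location of any config file.
    root = Path(path).resolve()
    matches = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            # fnmatchcase is case-sensitive on every platform and only
            # ever sees the basename, never the relative path.
            if fnmatch.fnmatchcase(name, file_pattern):
                matches.append(Path(dirpath) / name)
        if not recursive:
            dirnames.clear()  # prune so os.walk does not descend further
    return sorted(matches)
```

Because only basenames are compared, a pattern such as "sub/*.json" matches nothing in this sketch; restricting results to one subdirectory would be done by pointing path at that subdirectory instead.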

SeedSource

Bases: BaseModel, ABC

Base class for seed dataset configurations.

All subclasses must define a seed_type field with a Literal value. This field serves as the discriminator for the discriminated union of seed sources.

Attributes:

Name Type Description
seed_type str

Discriminator field that identifies the specific seed source type. Subclasses must override this field with a Literal value.
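The role of seed_type as a discriminator can be illustrated without Pydantic. The real classes are Pydantic models, and Pydantic provides its own discriminated-union support; the sketch below swaps in plain dataclasses and a registry purely to show the dispatch idea. All class names and the seed_type values "local" and "directory" are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, ClassVar, Dict

_REGISTRY: Dict[str, type] = {}


class SeedSourceSketch:
    """Stand-in base class; not the library's SeedSource."""

    seed_type: ClassVar[str]

    def __init_subclass__(cls, **kwargs: Any) -> None:
        super().__init_subclass__(**kwargs)
        # Each subclass must declare its own fixed seed_type value.
        seed_type = cls.__dict__.get("seed_type")
        if not isinstance(seed_type, str):
            raise TypeError(f"{cls.__name__} must define a literal seed_type")
        _REGISTRY[seed_type] = cls


@dataclass
class LocalFileSketch(SeedSourceSketch):
    path: str
    seed_type: ClassVar[str] = "local"  # hypothetical value


@dataclass
class DirectorySketch(SeedSourceSketch):
    path: str
    file_pattern: str = "*"
    seed_type: ClassVar[str] = "directory"  # hypothetical value


def parse_seed_source(data: Dict[str, Any]) -> SeedSourceSketch:
    """Pick the concrete class from the seed_type value, then build it."""
    fields = dict(data)
    seed_type = fields.pop("seed_type")
    try:
        cls = _REGISTRY[seed_type]
    except KeyError:
        raise ValueError(f"unknown seed_type: {seed_type!r}") from None
    return cls(**fields)


source = parse_seed_source(
    {"seed_type": "directory", "path": "data", "file_pattern": "*.json"}
)
```

The same lookup is what lets a plugin add a new source type: registering one more subclass with a unique seed_type value is enough for serialized configs to round-trip to the right class.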

DataFrame Seed Source