Seeds
Seed configs declare existing data used as input during generation. A SeedConfig combines a seed source with optional row sampling and selection settings. Seed source objects declare where seed data comes from; the engine reads them through seed readers.
Use these objects with DataDesignerConfigBuilder.with_seed_dataset(). Related pages: Seed Datasets and seed readers.
Built-in seed sources include local files, Hugging Face paths, in-memory DataFrames, directories, file contents, and agent rollout traces. Plugin seed sources can extend the same discriminated union through the plugin system.
Seed Config
Classes:
| Name | Description |
|---|---|
| SeedConfig | Configuration for sampling data from a seed dataset. |
SeedConfig
Bases: ConfigBase
Configuration for sampling data from a seed dataset.
Attributes:
| Name | Type | Description |
|---|---|---|
| source | SeedSourceT | A SeedSource defining where the seed data comes from. |
| sampling_strategy | SamplingStrategy | Strategy for how to sample rows from the dataset. ORDERED: read rows sequentially in their original order. SHUFFLE: randomly shuffle rows before sampling. When used with selection_strategy, shuffling occurs within the selected range or partition. |
| selection_strategy | IndexRange \| PartitionBlock \| None | Optional strategy to select a subset of the dataset. IndexRange: select a specific range of indices (e.g., rows 100-200). PartitionBlock: select a partition by splitting the dataset into N equal parts. Partition indices are zero-based (index=0 is the first partition, index=1 is the second, etc.). |
Examples:
Read rows sequentially from start to end:

    SeedConfig(
        source=LocalFileSeedSource(path="my_data.parquet"),
        sampling_strategy=SamplingStrategy.ORDERED
    )

Read rows in random order:

    SeedConfig(
        source=LocalFileSeedSource(path="my_data.parquet"),
        sampling_strategy=SamplingStrategy.SHUFFLE
    )

Read a specific index range (rows 100-199):

    SeedConfig(
        source=LocalFileSeedSource(path="my_data.parquet"),
        sampling_strategy=SamplingStrategy.ORDERED,
        selection_strategy=IndexRange(start=100, end=199)
    )

Read random rows from a specific index range (shuffles within rows 100-199):

    SeedConfig(
        source=LocalFileSeedSource(path="my_data.parquet"),
        sampling_strategy=SamplingStrategy.SHUFFLE,
        selection_strategy=IndexRange(start=100, end=199)
    )

Read from partition 2 (3rd partition, zero-based) of 5 partitions (20% of dataset):

    SeedConfig(
        source=LocalFileSeedSource(path="my_data.parquet"),
        sampling_strategy=SamplingStrategy.ORDERED,
        selection_strategy=PartitionBlock(index=2, num_partitions=5)
    )

Read shuffled rows from partition 0 of 10 partitions (shuffles within the partition):

    SeedConfig(
        source=LocalFileSeedSource(path="my_data.parquet"),
        sampling_strategy=SamplingStrategy.SHUFFLE,
        selection_strategy=PartitionBlock(index=0, num_partitions=10)
    )
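The way selection and sampling combine can be illustrated with a small standalone sketch. It does not use the library: IndexRangeSketch, PartitionBlockSketch, and resolve_indices are hypothetical stand-ins written only for this illustration. The inclusive end bound follows the "rows 100-199" examples above, while the floor-division rule for partition boundaries and the clamping of an out-of-range end are assumptions, not documented engine behavior.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexRangeSketch:
    """Stand-in for IndexRange; `end` is treated as inclusive."""
    start: int
    end: int


@dataclass(frozen=True)
class PartitionBlockSketch:
    """Stand-in for PartitionBlock; `index` is zero-based."""
    index: int
    num_partitions: int


def resolve_indices(total_rows, selection=None, shuffle=False, seed=None):
    """Return the row indices to read, in read order."""
    if selection is None:
        first, last = 0, total_rows - 1
    elif isinstance(selection, IndexRangeSketch):
        # Assumption: an end past the dataset is clamped to the last row.
        first, last = selection.start, min(selection.end, total_rows - 1)
    else:
        # Assumption: partition boundaries use floor division.
        n = selection.num_partitions
        first = selection.index * total_rows // n
        last = (selection.index + 1) * total_rows // n - 1

    indices = list(range(first, last + 1))
    if shuffle:
        # Shuffling happens only inside the selected range or partition.
        random.Random(seed).shuffle(indices)
    return indices


# Partition 2 of 5 over 1000 rows covers rows 400-599 under the assumed rule.
partition_rows = resolve_indices(1000, PartitionBlockSketch(index=2, num_partitions=5))
```

With shuffle=True the function returns the same set of indices in a different order, which mirrors the statement that shuffling occurs within the selected range or partition.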
Built-In Seed Sources
Classes:
| Name | Description |
|---|---|
| FileSystemSeedSource | Base class for seed sources backed by a directory of files. |
| SeedSource | Base class for seed dataset configurations. |
FileSystemSeedSource
Bases: SeedSource, ABC
Base class for seed sources backed by a directory of files.
Use this base when a seed reader needs to enumerate files under a directory
on disk and turn each (or groups of them) into seed rows. Concrete plugin
configs declare a Literal seed_type and pair with a
FileSystemSeedReader implementation.
Attributes:
| Name | Type | Description |
|---|---|---|
| path | str | Directory containing seed artifacts. Relative paths are resolved from the current working directory when the config is loaded, not from the config file location. |
| file_pattern | str | Case-sensitive filename pattern used to match files under the provided directory. Patterns match basenames only, not relative paths. Defaults to |
| recursive | bool | Whether to search nested subdirectories under the provided directory for matching files. Defaults to |
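The matching rules for path, file_pattern, and recursive can be sketched with the standard library alone. This is an illustration, not the FileSystemSeedReader implementation: find_seed_files is a hypothetical helper, and the use of glob-style patterns via fnmatch is an assumption about the pattern syntax. Because the defaults are not reproduced here, the helper takes every argument explicitly.

```python
from fnmatch import fnmatchcase
from pathlib import Path


def find_seed_files(path, file_pattern, recursive):
    """List files under `path` whose basename matches `file_pattern`."""
    # A relative path resolves against the current working directory,
    # not against the location of any config file.
    root = Path(path).resolve()

    # Walk nested subdirectories only when `recursive` is set.
    candidates = root.rglob("*") if recursive else root.iterdir()

    # fnmatchcase is case-sensitive on every platform, and only the
    # basename (p.name) is compared, never the relative path.
    return sorted(
        p for p in candidates
        if p.is_file() and fnmatchcase(p.name, file_pattern)
    )
```

Under these rules a pattern such as "nested/*.json" matches nothing, because a basename never contains a path separator.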
SeedSource
Bases: BaseModel, ABC
Base class for seed dataset configurations.
All subclasses must define a seed_type field with a Literal value.
This serves as a discriminated union discriminator.
Attributes:
| Name | Type | Description |
|---|---|---|
| seed_type | str | Discriminator field that identifies the specific seed source type. Subclasses must override this field with a |
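The role of seed_type as a discriminator can be shown with a minimal, library-free sketch. The real classes are pydantic models; the dataclasses below, the "local" and "directory" values, the REGISTRY dict, and parse_seed_source are all hypothetical names chosen for illustration only.

```python
from dataclasses import dataclass
from typing import Literal


@dataclass
class LocalFileSketch:
    """Hypothetical source; the "local" value is illustrative."""
    path: str
    seed_type: Literal["local"] = "local"


@dataclass
class DirectorySketch:
    """Hypothetical source; the "directory" value is illustrative."""
    path: str
    seed_type: Literal["directory"] = "directory"


# Each subclass contributes exactly one discriminator value.
REGISTRY = {cls.seed_type: cls for cls in (LocalFileSketch, DirectorySketch)}


def parse_seed_source(raw):
    """Pick the concrete class from the seed_type value in `raw`."""
    data = dict(raw)
    seed_type = data.pop("seed_type", None)
    try:
        cls = REGISTRY[seed_type]
    except KeyError:
        raise ValueError(f"unknown seed_type: {seed_type!r}") from None
    return cls(**data)


source = parse_seed_source({"seed_type": "directory", "path": "seeds/"})
```

Because every subclass pins seed_type to a single fixed value, a serialized config carries enough information to be loaded back into the right class, which is also how a plugin source could join the same union.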