Stratified Mix
What it does
- Analyzes the current dataset to see which strata (length, domain, difficulty) are underrepresented.
- Generates a guidance prompt (stratifiedNextAsk) that can be injected into Teacher LLM requests to bias generation toward those under-filled bins.
- Stores target percentages per axis and computes gaps dynamically as the dataset evolves.
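The gap computation described above can be illustrated with a minimal sketch. It assumes a gap is simply the target percentage minus the current percentage for each bin (positive means under-filled); the app's actual formula is not documented here, and the name compute_gaps is hypothetical.

```python
from collections import Counter


def compute_gaps(labels, targets):
    """Return target % minus current % per bin (positive = under-filled).

    Assumption: this is one plausible definition of a "gap",
    not necessarily the one the app uses.
    """
    counts = Counter(labels)
    total = len(labels)
    gaps = {}
    for bin_name, target_pct in targets.items():
        # Share of the dataset currently in this bin, as a percentage.
        current_pct = 100.0 * counts.get(bin_name, 0) / total if total else 0.0
        gaps[bin_name] = target_pct - current_pct
    return gaps


# Illustrative data: four records, none of them "hard".
difficulty_labels = ["easy", "easy", "easy", "medium"]
difficulty_targets = {"easy": 40, "medium": 40, "hard": 20}

gaps = compute_gaps(difficulty_labels, difficulty_targets)
print(gaps)  # {'easy': -35.0, 'medium': 15.0, 'hard': 20.0}
```

Here "hard" has the largest positive gap, so it is the bin a guidance prompt would most plausibly steer toward on the difficulty axis.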
Using it
1) Open Stratified Mix (sidebar).
2) Set target percentages for:
- Length: short / medium / long.
- Domain: e.g., security / productivity / general (or whatever metadata domains you use).
- Difficulty: easy / medium / hard (or your custom difficulty labels).
3) Click Normalize on each axis to ensure targets sum to 100%.
4) The tool compares current dataset distribution vs targets and computes gaps.
5) The generated hint (stratifiedNextAsk) summarizes the next ask (domain/difficulty/length) and is written into config.
6) Enable auto-inject (config key: autoInjectStrataPrompt) to include stratifiedNextAsk in Teacher LLM calls automatically.
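As a rough illustration of what the Normalize action in step 3 does, the sketch below rescales one axis proportionally so its targets sum to 100%. Proportional rescaling is an assumption (the app may round or redistribute differently), and normalize_targets is a hypothetical name.

```python
def normalize_targets(targets):
    """Scale the targets of one axis so they sum to 100.

    Assumption: proportional rescaling; the app's Normalize button
    may round or distribute remainders differently.
    """
    total = sum(targets.values())
    if total <= 0:
        raise ValueError("at least one target must be greater than zero")
    return {name: 100.0 * value / total for name, value in targets.items()}


# Illustrative raw weights for the length axis.
length_targets = {"short": 2, "medium": 1, "long": 1}

normalized = normalize_targets(length_targets)
print(normalized)  # {'short': 50.0, 'medium': 25.0, 'long': 25.0}
```

Entering relative weights and normalizing afterwards is often easier than hand-tuning percentages until they add up.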
Prompt injection behavior
- The app maintains stratifiedNextAsk in config; when auto-inject is on, Teacher prompts include a section like: [Stratified guidance] Next ask (...): Target strata ... generate samples that satisfy all three; reject anything outside these bins...
- If you prefer manual control, copy the hint text and paste it into your generation prompt instead of auto-inject.
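The auto-inject behavior above could look roughly like the following sketch. Only the config keys stratifiedNextAsk and autoInjectStrataPrompt come from this page; the function name, the plain-dict config, the way the section is appended, and the hint wording are all assumptions made for illustration.

```python
def build_teacher_prompt(base_prompt, config):
    """Append the stratified hint when auto-inject is enabled.

    Assumption: config is a plain dict and the hint is appended as a
    trailing "[Stratified guidance]" section; the app may place or
    format it differently.
    """
    hint = config.get("stratifiedNextAsk", "")
    if config.get("autoInjectStrataPrompt") and hint:
        return f"{base_prompt}\n\n[Stratified guidance] {hint}"
    return base_prompt


# Hypothetical hint text; the real wording is generated by the app.
config = {
    "stratifiedNextAsk": "Next ask: domain=security, difficulty=hard, length=long.",
    "autoInjectStrataPrompt": True,
}

prompt_on = build_teacher_prompt("Write one Q&A sample.", config)
prompt_off = build_teacher_prompt(
    "Write one Q&A sample.", {**config, "autoInjectStrataPrompt": False}
)
print(prompt_on)
```

With auto-inject off, the base prompt is returned unchanged, which matches the manual workflow of pasting the hint in yourself.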
Tips
- Ensure your records include metadata.domain and metadata.difficulty (or analogous keys) so stratification is meaningful.
- Adjust targets after large imports to rebalance future generation runs.
- Recompute the gaps (or refresh the page) after major dataset changes so they reflect the current data.
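For the first tip, a quick pre-import check can flag records that lack the metadata stratification depends on. This is a minimal sketch that assumes each record is a dict with a nested "metadata" dict; the helper name and the sample records are hypothetical.

```python
def missing_strata_keys(records, keys=("domain", "difficulty")):
    """Return the indices of records missing any required metadata key.

    Assumption: each record is a dict with an optional nested
    "metadata" dict; empty or absent values count as missing.
    """
    missing = []
    for index, record in enumerate(records):
        metadata = record.get("metadata") or {}
        if any(not metadata.get(key) for key in keys):
            missing.append(index)
    return missing


# Illustrative records: only the first is fully labeled.
records = [
    {"text": "a", "metadata": {"domain": "security", "difficulty": "hard"}},
    {"text": "b", "metadata": {"domain": "general"}},
    {"text": "c"},
]

incomplete = missing_strata_keys(records)
print(incomplete)  # [1, 2]
```

If you use analogous keys instead, pass them through the keys argument.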