Stratified Mix

What it does

  • Analyzes the current dataset to see which strata (length, domain, difficulty) are underrepresented.
  • Generates a guidance prompt (stratifiedNextAsk) that can be injected into Teacher LLM requests to bias generation toward those under-filled bins.
  • Stores target percentages per axis and computes gaps dynamically as the dataset evolves.
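The gap computation described above can be sketched roughly as follows. This is a minimal illustration, not the app's actual code: the function names axis_distribution and axis_gaps, the sample labels, and the convention that a gap is the target share minus the current share (in percentage points) are all assumptions.

```python
from collections import Counter

def axis_distribution(labels):
    """Return the percentage share of each label on one axis."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: 100.0 * n / total for label, n in counts.items()}

def axis_gaps(labels, targets):
    """Gap per bin = target % minus current %. Positive means under-filled."""
    current = axis_distribution(labels)
    return {name: target - current.get(name, 0.0)
            for name, target in targets.items()}

# Hypothetical data: difficulty labels of six records and assumed targets.
difficulties = ["easy", "easy", "easy", "medium", "medium", "hard"]
targets = {"easy": 30.0, "medium": 40.0, "hard": 30.0}

gaps = axis_gaps(difficulties, targets)
# The bin with the largest positive gap is the one to ask for next.
most_underfilled = max(gaps, key=gaps.get)
```

Repeating the same calculation per axis (length, domain, difficulty) yields one under-filled bin per axis, which is the kind of information the hint summarizes.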

Using it

1) Open Stratified Mix (sidebar).
2) Set target percentages for:
   - Length: short / medium / long.
   - Domain: e.g., security / productivity / general (or whatever metadata domains you use).
   - Difficulty: easy / medium / hard (or your custom difficulty labels).
3) Click Normalize on each axis to ensure targets sum to 100%.
4) The tool compares current dataset distribution vs targets and computes gaps.
5) The generated hint (stratifiedNextAsk) summarizes the next ask (domain/difficulty/length) and is written into config.
6) Enable auto inject to automatically include stratifiedNextAsk in Teacher LLM calls (config key: autoInjectStrataPrompt).
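What Normalize does in the steps above is presumably a proportional rescale. The sketch below shows one way that could work; normalize_targets is a hypothetical helper, and the even-split fallback for all-zero targets is an assumption, not documented behavior.

```python
def normalize_targets(targets):
    """Scale non-negative target values so they sum to 100."""
    if not targets:
        return {}
    total = sum(targets.values())
    if total <= 0:
        # Assumption: fall back to an even split when nothing is set.
        even = 100.0 / len(targets)
        return {name: even for name in targets}
    return {name: 100.0 * value / total for name, value in targets.items()}

# Hypothetical length targets that sum to 120 before normalization.
length_targets = normalize_targets({"short": 20, "medium": 50, "long": 50})
```

Proportional scaling keeps the relative weights you entered, so a bin you set twice as high as another stays twice as high after normalization.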

Prompt injection behavior

  • The app maintains stratifiedNextAsk in config; when auto-inject is on, Teacher prompts include a section like: [Stratified guidance] Next ask (...): Target strata ... generate samples that satisfy all three; reject anything outside these bins...
  • If you prefer manual control, copy the hint text and paste it into your generation prompt instead of auto-inject.
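To make the auto-inject behavior concrete, here is a minimal sketch of how a prompt builder might consult the two config keys. Only stratifiedNextAsk and autoInjectStrataPrompt come from this document; build_teacher_prompt, the dict-shaped config, the hint wording, and the assumption that the stored hint excludes the "[Stratified guidance]" header are all illustrative.

```python
def build_teacher_prompt(base_prompt, config):
    """Append the stratified hint only when auto-inject is on and a hint exists."""
    hint = (config.get("stratifiedNextAsk") or "").strip()
    if config.get("autoInjectStrataPrompt") and hint:
        return f"{base_prompt}\n\n[Stratified guidance] {hint}"
    return base_prompt

# Hypothetical config and hint text, for illustration only.
config = {
    "autoInjectStrataPrompt": True,
    "stratifiedNextAsk": "Next ask: domain=security, difficulty=hard, length=long.",
}
base = "Generate one training sample."

prompt_on = build_teacher_prompt(base, config)
prompt_off = build_teacher_prompt(base, {**config, "autoInjectStrataPrompt": False})
```

With auto-inject off, the base prompt passes through unchanged, which corresponds to the manual workflow of pasting the hint yourself.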

Tips

  • Ensure your records include metadata.domain and metadata.difficulty (or analogous keys) so stratification is meaningful.
  • Adjust targets after large imports to rebalance future generation runs.
  • After major dataset changes, recompute or refresh the page so the displayed gaps reflect the current data.
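For the first tip, a quick pre-import check can flag records that lack the metadata stratification relies on. The sketch below assumes each record is a dict with a nested metadata object holding domain and difficulty; missing_strata_keys and the sample records are hypothetical.

```python
def missing_strata_keys(record, required=("domain", "difficulty")):
    """Return the required metadata keys that are absent or empty."""
    metadata = record.get("metadata") or {}
    return [key for key in required if not metadata.get(key)]

# Hypothetical records: one complete, one partial, one with no metadata.
records = [
    {"text": "sample A", "metadata": {"domain": "security", "difficulty": "hard"}},
    {"text": "sample B", "metadata": {"domain": "general"}},
    {"text": "sample C"},
]

# Map record index -> missing keys, keeping only incomplete records.
incomplete = {}
for index, record in enumerate(records):
    missing = missing_strata_keys(record)
    if missing:
        incomplete[index] = missing
```

If your dataset uses analogous keys instead, pass them through the required argument.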