App Pages & Features

Dashboard

High-level metrics, quick links into Generate/Judge/Uploads/Vector DB.
Recent activity and status badges for running pipelines.

Generate

Configure provider/model, sampling params (temperature/top-p/top-k/repetition penalty/max tokens), and concurrency.
Honors dataset profiles for schema/labels; outputs flow to Dataset and can be judged/embedded.

Profiles

Select built-in or custom profiles; view their schema fields and evaluation axes.
Template Builder UI for adding/editing profiles (see Data Profiles doc for JSON structure).

Reverse Engineer

Upload single/batch binaries; get static/dynamic analysis and JSONL export.
Per-file progress; optional sandbox report upload; see safety notes in Reverse Engineering doc.

Transcribe

YouTube URL or local audio → in-browser Whisper.js for transcription/translation/summarization.
Supports progress, streaming decode, and optional embedding into vector DB.

Deduplication

Exact hash and near-dup (shingle + Jaccard) detection with adjustable cutoff and neighbor caps.
Cluster view with sample previews; activity log for dedupe runs.
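
As a rough illustration of the two detection modes above, the sketch below hashes each sample for exact matches and compares word shingles with Jaccard similarity for near-duplicates. The function names, the 3-word shingle size, the SHA-256 choice, and the default cutoff are assumptions made for the example, not the app's actual implementation.

```python
import hashlib
import re


def shingles(text, k=3):
    """Return the set of k-word shingles of a text (lowercased)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def find_duplicates(samples, cutoff=0.8, max_neighbors=5, k=3):
    """Return (exact_pairs, near_pairs) over a list of strings.

    cutoff and max_neighbors stand in for the adjustable cutoff and
    neighbor cap; both names are hypothetical.
    """
    first_seen = {}
    exact = []
    unique = []  # indices of first occurrences only
    for i, text in enumerate(samples):
        digest = hashlib.sha256(text.strip().encode("utf-8")).hexdigest()
        if digest in first_seen:
            exact.append((first_seen[digest], i))
        else:
            first_seen[digest] = i
            unique.append(i)

    sets = {i: shingles(samples[i], k) for i in unique}
    near = []
    for pos, i in enumerate(unique):
        found = 0
        for j in unique[pos + 1:]:
            if found >= max_neighbors:
                break  # neighbor cap reached for sample i
            score = jaccard(sets[i], sets[j])
            if score >= cutoff:
                near.append((i, j, round(score, 3)))
                found += 1
    return exact, near
```

The pairwise comparison is quadratic in the number of unique samples, which is fine for a sketch but would need an index such as MinHash/LSH for large datasets.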

Stratified Mix

Set target distributions across length, domain, and difficulty; normalize targets to 100%.
Computes current dataset strata and produces a guidance hint (stratifiedNextAsk) for underrepresented bins.
Can auto-inject the hint into Teacher LLM calls to steer future generations toward gaps.
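
The normalization and gap-finding steps above can be sketched as follows. This is a minimal example for a single axis; the function names, the wording of the hint, and the choice to report only the single largest gap are assumptions, and the real stratifiedNextAsk format may differ.

```python
from collections import Counter


def normalize_targets(targets):
    """Scale raw target weights so they sum to 100 (percent)."""
    total = sum(targets.values())
    if total <= 0:
        raise ValueError("targets must sum to a positive number")
    return {name: 100.0 * weight / total for name, weight in targets.items()}


def current_distribution(labels):
    """Percentage of samples that fall into each bin."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: 100.0 * count / total for name, count in counts.items()}


def stratified_next_ask(targets, labels, axis="domain"):
    """Return a hint for the most underrepresented bin, or None if no gap."""
    target_pct = normalize_targets(targets)
    actual_pct = current_distribution(labels)
    gaps = {name: target_pct[name] - actual_pct.get(name, 0.0)
            for name in target_pct}
    bin_name, gap = max(gaps.items(), key=lambda item: item[1])
    if gap <= 0:
        return None  # every bin already meets or exceeds its target
    return (f"Prefer samples where {axis} = {bin_name} "
            f"(about {gap:.0f} points under target).")
```

The returned string is the kind of text that could be appended to a generation prompt when auto-injection is enabled.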

Uploads

Drag/drop files or folders (PDF/DOCX/TXT/MD/HTML/JSON).
Chunk/overlap controls, embedding model selection, optional Qdrant sync, per-file progress/errors.
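
To make the chunk/overlap controls concrete, here is a minimal character-based sketch. The function name and defaults are hypothetical, and the app may chunk by tokens or sentences rather than characters.

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into chunks of chunk_size characters.

    Consecutive chunks share `overlap` characters so that context
    spanning a boundary appears in both chunks.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks
```

A larger overlap improves recall for passages that straddle chunk boundaries, at the cost of more embeddings to compute and store.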

Embeddings

Inspect embeddings, sample from Qdrant, visualize scatter/heatmap, and build context from selections.
Sync status indicators and pagination for large sets.

Dataset

Browse/edit generated and ingested samples with schema awareness from the active profile.
Inline editing of instruction/input/output/metadata; filters/search by profile.

Grammar Tools

Grammar editor with AST viewer and test input playground.
Toggle grammar usage for generation; validate grammars before use.

Environment

Set API keys/endpoints (OpenAI, Gemini, OpenRouter, vLLM host, Ollama, custom OpenAI-compatible).
Concurrency knobs, chunking mode, embedding model selection, and danger-zone data reset.

LLM as a Judge

Configure judge/teacher models and concurrency.
Run evaluations across accuracy/completeness/clarity/overall (or profile-specific axes); results feed Close the Loop.

Close the Loop

Automates regenerate → evaluate cycles to improve samples based on judge scores.
Monitors runs with status and aggregated metrics.
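
The regenerate → evaluate cycle can be pictured with the sketch below. The `regenerate` and `judge` callables, the score threshold, and the round limit are all hypothetical stand-ins for the configured models and settings; the app's actual stopping rules are not documented here.

```python
def close_the_loop(sample, regenerate, judge, threshold=4.0, max_rounds=3):
    """Regenerate a sample until its judge score reaches the threshold.

    regenerate(sample, score) -> new sample   (hypothetical teacher call)
    judge(sample) -> float score              (hypothetical judge call)
    Returns (best_sample, best_score, score_history).
    """
    best = sample
    best_score = judge(sample)
    history = [best_score]
    for _ in range(max_rounds):
        if best_score >= threshold:
            break  # good enough, stop spending model calls
        candidate = regenerate(best, best_score)
        score = judge(candidate)
        history.append(score)
        if score > best_score:
            # keep a candidate only when the judge rates it higher
            best, best_score = candidate, score
    return best, best_score, history
```

Keeping the best-scoring version rather than the latest one guards against a regeneration that makes the sample worse.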

Vector DB Integration

Configure Qdrant host/API key/collection; ensure collection exists with correct vector size.
Trigger resample/load-more from Qdrant; view sync health.
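
For reference, creating a collection with a fixed vector size through Qdrant's REST API is a PUT to /collections/{name}. The sketch below only builds the request and does not send it; the helper name is hypothetical, and the vector size must match whatever embedding model is selected.

```python
import json
import urllib.request


def build_create_collection_request(host, collection, vector_size,
                                    api_key=None, distance="Cosine"):
    """Build (but do not send) a Qdrant create-collection request."""
    url = f"{host.rstrip('/')}/collections/{collection}"
    body = json.dumps(
        {"vectors": {"size": vector_size, "distance": distance}}
    ).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["api-key"] = api_key  # Qdrant reads the key from this header
    return urllib.request.Request(url, data=body, headers=headers, method="PUT")
```

The request can be sent with `urllib.request.urlopen(req)`. A GET to the same URL can be used first to check whether the collection already exists and to read back its configured vector size.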

Fine-Tune Export

Export curated datasets to training-ready formats (e.g., JSONL) using the active profile schema.
Include metadata fields and evaluation scores if desired.
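
A JSONL export of this kind writes one JSON object per line. The sketch below assumes samples are dictionaries, that the profile schema can be reduced to a list of field names, and that metadata and scores live under `metadata` and `scores` keys; all of these are illustrative assumptions.

```python
import json


def to_jsonl(samples, fields, include_metadata=False, include_scores=False):
    """Serialize samples to a JSONL string, one JSON object per line."""
    lines = []
    for sample in samples:
        # keep only the schema fields, in schema order
        record = {name: sample.get(name, "") for name in fields}
        if include_metadata and "metadata" in sample:
            record["metadata"] = sample["metadata"]
        if include_scores and "scores" in sample:
            record["scores"] = sample["scores"]
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines) + ("\n" if lines else "")
```

For example, `to_jsonl(samples, ["instruction", "input", "output"], include_scores=True)` would keep the three schema fields plus any judge scores attached to each sample.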
