PocketGuide: Offline Travel LLM, Built Evaluation-First
Most language model projects optimize for scale: larger datasets, more parameters, bigger compute. PocketGuide optimizes for constraints: compute limits, reliability, offline operation, and structured outputs. This inversion of priorities produces a different kind of system: one where evaluation precedes training, where data quality matters more than data quantity, and where measurable improvement is the only metric that counts.
The project emerged from a practical problem: existing travel assistants are either cloud-dependent or unreliable. They hallucinate visa requirements, confuse customs rules, or require internet connectivity. A travel assistant needs to work offline, produce trustworthy information, and operate on modest hardware. This is not a problem that scale solves. It is a problem that careful systems design solves.
Starting with evaluation
The standard approach to language model development is straightforward: collect data, train a model, evaluate the result. PocketGuide inverts this. Evaluation comes first. Before any fine-tuning, before any training data is generated, the project established a 72-example benchmark suite covering seven travel categories (visa, customs, health, budget, itinerary, safety, and local culture) across three difficulty levels. This benchmark is fixed. It will not change. Everything else is measured against it.
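To make the shape of the benchmark concrete, here is what a single item might look like. This is a hypothetical sketch: the field names, the ID scheme, and the example question are illustrative, not the project's actual schema.

```python
# Hypothetical benchmark item; field names and content are illustrative only.
BENCHMARK_ITEM = {
    "id": "visa-hard-03",          # stable ID so results stay comparable across runs
    "category": "visa",            # one of the seven travel categories
    "difficulty": "hard",          # one of the three difficulty levels
    "question": (
        "I have a six-hour layover in London on the way to Toronto. "
        "Do I need a transit visa, and does the answer depend on my passport?"
    ),
    "expected_payload": "visa",    # which behavioral contract the response must satisfy
}
```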
Baseline evaluation of a pre-adaptation open-source model established reference metrics. The point was not to start with a high score. The point was to have an objective standard, documented and reproducible, against which all future improvements would be measured. This framing makes progress visible in a way that most projects never achieve.
The evaluation framework itself is rigorous. It does not just measure accuracy. It measures specific contract compliance: does the model respect the schema it is supposed to follow? Does it handle edge cases gracefully? Can we parse its output consistently? These are not nice-to-haves. They are requirements. The system validates every response against behavioral contracts: envelope schemas for response structure, specialized payloads for different travel query types. Anything that does not conform fails objectively.
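A minimal sketch of what such a contract check could look like, using pydantic. The envelope fields here are assumptions for illustration; the project's actual schemas will differ, but the principle is the same: parse or fail, nothing in between.

```python
from typing import Literal
from pydantic import BaseModel, ValidationError

# Hypothetical response envelope; field names are illustrative, not the project's real contract.
class ResponseEnvelope(BaseModel):
    category: Literal["visa", "customs", "health", "budget",
                      "itinerary", "safety", "local_culture"]
    answer: str
    confidence: Literal["low", "medium", "high"]
    sources_needed: bool

def validate_response(raw_json: str) -> tuple[bool, str | None]:
    """Return (passed, error). Anything that fails to parse fails the contract objectively."""
    try:
        ResponseEnvelope.model_validate_json(raw_json)
        return True, None
    except ValidationError as exc:
        return False, str(exc)
```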
Domain adaptation through synthetic data
With evaluation established, the project moved to data. Generating training data for a travel domain from scratch is expensive. Using generic instruction-tuning datasets produces noise. PocketGuide uses a different approach: teacher-driven synthetic data generation.
The system uses OpenRouter to access multiple models as teachers. A prompt template defines the structure for each of seven travel categories. The generator submits these prompts to teachers, collects responses, and builds a training dataset. But this is not a simple API call loop. The implementation is production-grade: cost-controlled fallback chains (free models first, then paid if necessary), exponential backoff and rate limiting (15 requests per minute) to avoid throttling, dry-run mode for testing without cost, and complete provenance tracking.
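The generation loop might look roughly like the sketch below, which assumes OpenRouter's OpenAI-compatible chat completions endpoint. The model IDs, retry counts, and return shape are illustrative; only the 15-requests-per-minute limit comes from the project description.

```python
import os
import time
import requests

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"
API_KEY = os.environ["OPENROUTER_API_KEY"]

# Ordered fallback chain: free-tier teachers first, a paid teacher last (model IDs are illustrative).
TEACHER_CHAIN = [
    "meta-llama/llama-3.1-8b-instruct:free",
    "mistralai/mistral-7b-instruct:free",
    "openai/gpt-4o-mini",
]

REQUESTS_PER_MINUTE = 15
MIN_INTERVAL = 60.0 / REQUESTS_PER_MINUTE   # simple client-side rate limit
_last_call = 0.0

def _throttle() -> None:
    """Sleep just long enough to stay under the per-minute request budget."""
    global _last_call
    wait = MIN_INTERVAL - (time.monotonic() - _last_call)
    if wait > 0:
        time.sleep(wait)
    _last_call = time.monotonic()

def generate(prompt: str, max_retries: int = 3, dry_run: bool = False) -> dict:
    """Try each teacher in order; back off exponentially on transient failures."""
    if dry_run:
        return {"model": None, "text": "[dry-run] no API call made", "fallback_used": False}

    for i, model in enumerate(TEACHER_CHAIN):
        for attempt in range(max_retries):
            _throttle()
            resp = requests.post(
                OPENROUTER_URL,
                headers={"Authorization": f"Bearer {API_KEY}"},
                json={"model": model, "messages": [{"role": "user", "content": prompt}]},
                timeout=60,
            )
            if resp.status_code == 200:
                text = resp.json()["choices"][0]["message"]["content"]
                return {"model": model, "text": text, "fallback_used": i > 0}
            if resp.status_code in (429, 500, 502, 503):
                time.sleep(2 ** attempt)   # exponential backoff before retrying
                continue
            break   # non-retryable error: move to the next teacher in the chain
    raise RuntimeError("All teachers in the fallback chain failed")
```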
Every generated example records its full history: the config snapshot, the exact prompt, which teacher model produced it, token counts, generation timestamp, and whether fallbacks were used. Append-only JSONL format prevents data loss. This infrastructure is not elegant. It is boring. It is also reliable and reproducible, which is the entire point.
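In code, appending one provenanced record can be as simple as the following sketch, which reuses the `result` dict from the generator sketch above. The exact field names are assumptions; the point is that every example carries its own history and that the file is only ever appended to.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

def append_example(path: Path, prompt: str, result: dict, config: dict) -> None:
    """Append one fully provenanced training example to the JSONL file (never rewrite in place)."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "config_snapshot": config,                      # exact generator settings for this run
        "prompt": prompt,
        "prompt_sha256": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        "teacher_model": result["model"],               # which teacher actually answered
        "fallback_used": result["fallback_used"],
        "usage": result.get("usage"),                   # token counts, if the API returned them
        "response": result["text"],
    }
    with path.open("a", encoding="utf-8") as f:         # "a": append-only by construction
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```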
The dataset specification targets 120 examples. Small by contemporary standards, massive by domain-specific standards. The prompt planner ensures consistent distribution across categories, difficulty levels, and response types. Initial generation runs produce pass-rate statistics by category and difficulty, enabling visibility into which areas of the domain are harder to teach.
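A balanced planner does not need to be clever. Cycling over the category-by-difficulty grid until the target count is reached keeps every cell within one example of every other. The category names, difficulty labels, and 120-example target below come from the text; everything else, including the omission of response types, is illustrative.

```python
from itertools import cycle, islice, product

CATEGORIES = ["visa", "customs", "health", "budget", "itinerary", "safety", "local_culture"]
DIFFICULTIES = ["easy", "medium", "hard"]
TARGET_EXAMPLES = 120

def plan_prompts(target: int = TARGET_EXAMPLES) -> list[dict]:
    """Walk the 7x3 category/difficulty grid round-robin so no cell is over- or under-sampled."""
    grid = list(product(CATEGORIES, DIFFICULTIES))   # 21 cells
    return [
        {"category": category, "difficulty": difficulty}
        for category, difficulty in islice(cycle(grid), target)
    ]
```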
Current state and next steps
The project is not a prototype. It is not exploratory code. The foundation is established and stable: a clean repository structure, deterministic workflows via a Makefile, and 169 passing tests covering validation logic, data generation, and evaluation infrastructure. The synthetic data pipeline is operational. Teachers are generating examples. The framework is ready.
Current work focuses on data quality. Not all generated examples are equally useful. Some hallucinate, some produce valid but uninformative responses, some violate schemas. The project is implementing deduplication and balancing logic to filter weak examples, prevent leakage between training and evaluation splits, and ensure the training set is genuinely aligned with what the benchmark measures. This is systems work: the unsexy scaffolding that makes training reliable.
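A first pass at that filtering can be done with nothing more than normalized hashing, as sketched below. This assumes JSONL files with a "prompt" field and catches only exact and near-exact duplicates; semantic near-duplicates would need embedding-based checks on top.

```python
import hashlib
import json
from pathlib import Path

def _fingerprint(text: str) -> str:
    """Normalize whitespace and case before hashing so near-identical prompts collide."""
    return hashlib.sha256(" ".join(text.lower().split()).encode("utf-8")).hexdigest()

def dedupe_and_check_leakage(train_path: Path, benchmark_path: Path) -> list[dict]:
    """Drop duplicate training examples and anything that overlaps the fixed benchmark."""
    benchmark_fps = {
        _fingerprint(json.loads(line)["prompt"])
        for line in benchmark_path.read_text(encoding="utf-8").splitlines()
        if line.strip()
    }
    seen, kept = set(), []
    for line in train_path.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        example = json.loads(line)
        fp = _fingerprint(example["prompt"])
        if fp in seen:
            continue          # duplicate within the training set
        if fp in benchmark_fps:
            continue          # would leak benchmark content into training
        seen.add(fp)
        kept.append(example)
    return kept
```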
Next comes model adaptation. Fine-tuning via LoRA or QLoRA on the cleaned synthetic dataset, with experiment tracking for reproducible ablations. Then rigorous evaluation: baseline vs. adapted model compared objectively on held-out benchmarks, with qualitative failure analysis to guide the next iteration. The project plans multiple cycles of targeted fixes, retraining, and re-evaluation, each one tightening the system based on evidence rather than intuition.
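The adaptation step itself is the smallest part of the pipeline. Below is a sketch of the LoRA setup with Hugging Face peft; the base model, rank, and target modules are placeholders, not decisions the project has published.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

BASE_MODEL = "mistralai/Mistral-7B-Instruct-v0.2"   # placeholder base model

model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)

# Low-rank adapters on the attention projections; only these small matrices are trained.
lora_config = LoraConfig(
    r=16,                                 # adapter rank
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()        # typically well under 1% of the base model's weights
```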
Down the line, deployment realism: model quantization, packaging for offline inference, documentation of resource constraints. Then portfolio finalization. But that is future work. Right now, the project is in the messy middle where systems get built: careful data work, repeated evaluation, iterative improvement.
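When that deployment stage does arrive, one likely route is 4-bit quantization via bitsandbytes so the adapted model fits on modest, offline hardware; exporting to GGUF for llama.cpp would be an alternative. This is an assumption, not a documented project decision, and the model path below is hypothetical.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantization shrinks memory use at a modest quality cost.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    "path/to/pocketguide-adapted-model",   # hypothetical local path to the merged, adapted model
    quantization_config=bnb_config,
    device_map="auto",
)
```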
Design philosophy
This approach reflects a few core commitments. First, evaluation-driven development. Metrics are established early and remain fixed. Improvements are measured objectively. Speculation is replaced with numbers. Second, reproducibility and provenance. Every result is traceable. Every decision is documented. Experiments can be repeated exactly. Third, cost consciousness. Free models are tried first. API budgets are tracked. Fallback chains are explicit. This is not about being cheap. It is about being accountable.
There is also a deeper philosophy here about domain adaptation and scale. Modern LLM discourse assumes that bigger is better: more parameters, more training data, more compute. But large, noisy datasets produce unreliable behavior in narrow domains. PocketGuide takes a different bet: a modest model adapted on a small, high-quality, domain-specific dataset will behave better in that domain than a huge general-purpose model. This is not a novel idea. It is just unfashionable. The project is testing it anyway.
Progress as iteration
PocketGuide will not ship tomorrow. It will ship when the system reliably answers travel questions better than alternatives, when the offline constraint is real, when the evaluation metrics reflect genuine improvement. These are not arbitrary gates. They are the project's way of saying: we built this to work, not to prove we could build it.
The repository is public. The infrastructure is documented. The progress is measurable. This is how serious systems work: iteration, documentation, accountability, and an honest assessment of what has been completed and what remains.