Building a Domain-Adapted LLM for Travel Guidance
Language models are good at many things. They are not, by default, good at producing reliable, structured outputs under constraints. PocketGuide is a system designed to address that gap in a specific domain: travel planning.
The goal was not to build the most capable travel assistant or the most sophisticated model. It was to demonstrate how domain adaptation works when reliability matters more than coverage, when outputs must follow contracts, and when the system needs to run offline on consumer hardware. This is also not another GPT wrapper with formatted outputs—it involves working directly with model weights through fine-tuning and quantization.
Why travel guidance
Travel planning is a useful domain for exploring structured output generation. The space is large enough to be interesting but constrained enough to evaluate rigorously. Queries have clear types—itineraries, checklists, decision trees, procedures—each with distinct structure. Outputs should acknowledge uncertainty, reference verification steps, and avoid overconfidence.
The practical motivation was simple: existing travel assistants are cloud-dependent and unreliable. They hallucinate visa rules, conflate customs regulations across countries, and fail when connectivity drops. Upcoming international travel made this concrete: being in a country with unreliable internet or expensive data roaming, needing quick answers to straightforward questions, and having no trusted offline option. A system that works offline, produces structured outputs, and flags its own uncertainty gaps is more useful than one that optimizes for conversational fluency.
Evaluation before training
Most LLM projects follow a predictable pattern: collect data, train a model, see what happens. PocketGuide inverts this. Evaluation infrastructure was built first.
Before any fine-tuning, I defined output contracts. A JSON envelope schema enforces consistent structure across all responses—summary, assumptions, uncertainty notes, verification steps. Typed payload schemas handle domain-specific outputs for different query types. These contracts are not suggestions. They are validated at inference time with deterministic pass/fail criteria.
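The envelope check can be sketched as a deterministic validator. This is a minimal illustration, not PocketGuide's actual validator: the exact key names (`summary`, `assumptions`, `uncertainty_notes`, `verification_steps`) are guesses based on the fields named above, and the real contracts also validate typed payloads per query type.

```python
import json

# Hypothetical envelope fields, mirroring the four named in the post.
# The real schema likely nests typed payloads under these.
REQUIRED_ENVELOPE_FIELDS = {
    "summary", "assumptions", "uncertainty_notes", "verification_steps"
}

def check_envelope(raw: str) -> dict:
    """Deterministic pass/fail validation of one model response."""
    result = {"parsed": False, "missing_fields": None}
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return result  # parse failure: schema checks are moot
    result["parsed"] = True
    if isinstance(obj, dict):
        result["missing_fields"] = sorted(REQUIRED_ENVELOPE_FIELDS - obj.keys())
    else:
        result["missing_fields"] = sorted(REQUIRED_ENVELOPE_FIELDS)
    return result
```

Because the criteria are pure functions of the output string, every evaluation run scores identically, which is what makes the benchmark numbers comparable across iterations.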
A fixed 20-prompt benchmark suite measures three things: parse success (can the output be parsed as valid JSON), schema compliance (does it match the required structure), and uncertainty marker presence (does it acknowledge gaps in knowledge). This benchmark does not change. All training iterations are measured against it.
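Scoring a benchmark run over the three metrics might look like the sketch below. The uncertainty-marker regex is an illustrative stand-in; PocketGuide's actual marker detection is not described here and may differ.

```python
import json
import re

# Illustrative heuristic for uncertainty markers; an assumption,
# not the project's real detector.
UNCERTAINTY_PATTERN = re.compile(
    r"\b(uncertain|verify|may vary|check with|not guaranteed)\b", re.I
)

def score_run(outputs: list[str], required_fields: set[str]) -> dict:
    """Aggregate parse success, schema compliance, and marker presence."""
    n = len(outputs)
    parsed = compliant = flagged = 0
    for raw in outputs:
        if UNCERTAINTY_PATTERN.search(raw):
            flagged += 1
        try:
            obj = json.loads(raw)
        except json.JSONDecodeError:
            continue
        parsed += 1
        if isinstance(obj, dict) and required_fields <= obj.keys():
            compliant += 1
    return {
        "parse_success": parsed / n,
        "schema_compliance": compliant / n,
        "uncertainty_markers": flagged / n,
    }
```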
The baseline model—an unmodified open-source 7B model—established reference metrics. Parse success was 80%. Uncertainty markers appeared 85% of the time. Envelope field compliance was effectively zero. These numbers were not good, but they were objective. Everything that followed would be measured against them.
Synthetic data and teacher models
Training a domain-adapted model requires domain-specific instruction data. Scraping travel forums or hiring annotators is expensive and noisy. PocketGuide uses synthetic generation with quality gating.
The pipeline uses OpenRouter to access teacher models. A spec defines desired diversity: categories (visa, customs, health, budget, itinerary), regions, difficulty levels. The system generates prompts, submits them to teachers, and collects responses. But this is not a naive API loop. Cost control matters. Free models are tried first. Paid models are fallback options with explicit flags. Rate limiting prevents throttling. Every generated example includes full provenance: config snapshot, prompt hash, teacher model ID, token counts, generation timestamp.
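The two load-bearing ideas here, free-first teacher selection and per-example provenance, can be sketched in a few lines. Field names and the teacher-selection logic are illustrative assumptions; the real pipeline also handles rate limiting and config snapshots per run.

```python
import hashlib
import time
from dataclasses import dataclass

@dataclass
class Provenance:
    """Hypothetical provenance record; mirrors the fields the post lists."""
    config_snapshot: dict
    prompt_sha256: str
    teacher_model: str
    prompt_tokens: int
    completion_tokens: int
    generated_at: float

def pick_teacher(models: list[dict], allow_paid: bool) -> str:
    """Free models first; paid models only behind an explicit flag."""
    for m in models:
        if m["free"] or allow_paid:
            return m["id"]
    raise RuntimeError("no eligible teacher model")

def record_example(prompt: str, teacher_model: str,
                   config: dict, usage: dict) -> Provenance:
    return Provenance(
        config_snapshot=config,
        prompt_sha256=hashlib.sha256(prompt.encode()).hexdigest(),
        teacher_model=teacher_model,
        prompt_tokens=usage.get("prompt_tokens", 0),
        completion_tokens=usage.get("completion_tokens", 0),
        generated_at=time.time(),
    )
```

Hashing the prompt rather than storing it twice keeps the provenance record small while still letting duplicates and config drift be detected later.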
The first dataset produced 120 examples. Small by contemporary standards, but every example was inspected. Quality gating filtered weak samples. Deduplication prevented leakage. The goal was not scale. It was signal density.
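The deduplication step can be as simple as hashing whitespace- and case-normalized prompts. This is a minimal sketch of the leakage-prevention idea only; the project's actual quality gating is richer than exact-match dedup.

```python
import hashlib

def dedupe(examples: list[dict]) -> list[dict]:
    """Drop examples whose normalized prompt text has been seen before."""
    seen, kept = set(), []
    for ex in examples:
        normalized = " ".join(ex["prompt"].lower().split())
        key = hashlib.sha256(normalized.encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(ex)
    return kept
```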
Five iterations of adaptation
LoRA fine-tuning on Llama-2-7B provided a path to parameter-efficient adaptation. Five training iterations followed, each one targeting failure modes identified through evaluation.
Version 1 established a baseline with the initial synthetic dataset. Parse success stayed at 80%. Uncertainty markers reached 100%, showing the model learned to acknowledge gaps. But envelope field compliance remained low.
Version 2 adjusted dataset quality—harder prompts, stricter acceptance thresholds—and lowered the learning rate. Parse success climbed to 95%. Envelope fields began appearing, but inconsistently.
Version 3 increased training epochs and sequence length. Parse success hit 100% and stayed there. The model now consistently produces valid JSON. Envelope field compliance reached 15%.
Version 4 normalized prompts and payload formatting in the training data. Envelope compliance peaked at 20%, then regressed slightly in version 5 despite extended training.
The pattern was clear: parse success and partial compliance respond to data quality and training duration. Full schema compliance requires architectural changes—constrained decoding, loss weighting, or different objectives. That work is not urgent. The system reached a coherent stopping point.
What the system does now
The adapted model runs in two modes. Hugging Face + PEFT loads LoRA adapters for evaluation and experimentation. llama.cpp handles quantized inference (Q4_K_M) for offline deployment. A registry maps logical model names to GGUF artifacts.
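A registry of logical names to artifacts can be a thin lookup layer. The names, paths, and shape below are hypothetical; the real registry may also carry quantization type, adapter version, and checksums.

```python
from pathlib import Path

# Hypothetical mapping of logical model names to GGUF artifacts.
REGISTRY = {
    "pocketguide-latest": "models/pocketguide.Q4_K_M.gguf",
}

def resolve(name: str) -> Path:
    """Turn a logical model name into a concrete artifact path."""
    try:
        return Path(REGISTRY[name])
    except KeyError:
        raise KeyError(f"unknown model {name!r}; known: {sorted(REGISTRY)}")
```

The indirection matters for reproducibility: evaluation reports can reference a stable logical name while the underlying quantized artifact is rebuilt or swapped.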
Output contracts are enforced at runtime. Strict parsing validates full schema compliance. Lenient parsing accepts partial structures. Both modes surface what succeeded and what failed, making debugging tractable.
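The strict/lenient split can be sketched as follows. The lenient recovery heuristic here, salvaging the outermost brace-delimited span from surrounding prose, is an assumption about how such a mode might work, not the project's exact implementation.

```python
import json

def parse_output(raw: str, strict: bool) -> dict:
    """Parse a model response, reporting what succeeded and what failed."""
    report = {"ok": False, "payload": None, "notes": []}
    text = raw
    if not strict:
        # Lenient mode: try to salvage the outermost {...} span.
        start, end = raw.find("{"), raw.rfind("}")
        if start != -1 and end > start:
            text = raw[start : end + 1]
            if text != raw:
                report["notes"].append("stripped non-JSON wrapper text")
    try:
        report["payload"] = json.loads(text)
        report["ok"] = True
    except json.JSONDecodeError as e:
        report["notes"].append(f"parse error: {e.msg}")
    return report
```

Surfacing the recovery notes alongside the payload is what makes failures debuggable: a lenient success that required stripping wrapper text is itself a signal about the model's formatting habits.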
The evaluation framework produces timestamped reports with metrics breakdowns, failure analysis, and example outputs. Every run is reproducible. Every result is traceable.
Why this is a good stopping point
PocketGuide is not abandoned. It reached a milestone where the system is coherent, measurable, and functional within its scope.
Parse success is 100%. The model consistently produces valid JSON. Uncertainty markers appear reliably. Envelope field compliance is partial but documented. The evaluation harness is stable. The synthetic data pipeline is operational. The quantized model runs offline.
The next phase—full schema compliance—requires different techniques. Constrained decoding could enforce structure during generation. Loss weighting could prioritize envelope fields during training. Larger base models might handle structure more naturally. These are extensions, not missing foundations.
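To make the loss-weighting idea concrete, here is a pure-Python sketch of the objective change: tokens belonging to envelope field names get a higher weight in the averaged loss, so dropping them costs more. A real version would apply these weights to per-token cross-entropy inside the training loop; the weight value and token matching here are illustrative.

```python
def weighted_token_loss(token_losses: list[float], tokens: list[str],
                        field_tokens: set[str], weight: float = 3.0) -> float:
    """Weighted mean of per-token losses, upweighting envelope-field tokens."""
    total = denom = 0.0
    for loss, tok in zip(token_losses, tokens):
        w = weight if tok in field_tokens else 1.0
        total += w * loss
        denom += w
    return total / denom
```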
Stopping here is intentional. Not every project needs to exhaust every avenue. Demonstrating disciplined iteration, measurable improvement, and honest assessment of what works and what remains challenging is more valuable than incremental optimization.
What comes next
The natural extensions are clear. Constrained decoding or grammar-based generation could enforce schema compliance mechanically. Loss weighting or curriculum learning could teach the model to prioritize required fields. A FastAPI service could wrap the quantized model for persistent inference. Multi-domain benchmarks could test whether the methodology generalizes beyond travel.
These are tractable next steps. They are not urgent. The system demonstrates what it was designed to demonstrate: domain adaptation through evaluation-first design, synthetic instruction tuning with quality control, and iterative improvement based on evidence rather than intuition.
Reflection
PocketGuide is one piece of a broader approach to building ML systems. It prioritizes evaluation over scale, reproducibility over speed, and measured iteration over intuition. The repository is public. The infrastructure is documented. The results are honest.
This project shows how domain adaptation works when constraints shape design, how structured outputs require explicit contracts, and how iterative improvement requires fixed benchmarks. The technical details live in GitHub. This post captures the reasoning behind the system and what was learned by building it.