Teaching Sommo to Reason: Adding a Knowledge Graph to a Wine Recommender

Sat, 23 May 2026 00:00:00 +0000

A few months ago I shipped Sommo v1, a fine-tuned 7B model that knows wine. Sommo v2 is the private model that powers the iOS app. For its size, it is genuinely strong in the wine niche, but only because a lot of disciplined LLM work sits behind it: a custom evaluation harness with domain-specific test sets, regression tests on previous failures, careful prompt and training-data curation, and calibration against expert wine references. With v2, the LLM-only ceiling is more or less reached.

But every language model, no matter how well-tuned, will occasionally invent a winery, misattribute a grape variety, or confidently recommend something that does not exist. Evals catch the gross failures. The subtle ones (an almost-real producer name, a vintage just out of range) are exactly what slips through judgment-based testing. That is the gap I wanted to close.

I ran this work at the University of Oxford as a water-test: a small, controlled experiment to see whether grounding language models in structured knowledge was worth pursuing at a larger scale. Sommo was the natural test subject. I already had the model and the use case.

The hypothesis: a knowledge graph could fix the residual hallucinations. Not in a hand-wavy “structured data is good” way, but in a concrete, measured way. So I built one, tested it on the same prediction tasks two different ways (logic rules and ML embeddings), and ran the LLM with and without the KG as context.

This post walks through what I did, what I measured, and what I would do differently. The full technical write-up is available as a paper (PDF) if you want the formal version with all the tables, and the source code is on GitHub.

Why even bother

The honest answer: language models are pattern completers, not databases. Sommo v1 hallucinates. The blog post for v1 said this explicitly. Telling users “the model might invent things” is fine for a hobby project. For an app people pay for, the right answer is to make the model invent fewer things.

There are two standard moves:

More training data. Diminishing returns, expensive, and you cannot retrain on every new producer.
Retrieval-augmented generation. Plug the model into a verified data source so it can ground its answers.

A knowledge graph is the second move done properly. Instead of stuffing raw text snippets into the prompt, you give the model structured facts: this winery exists, it is in this region, this region is in this country, this wine uses this variety, these wines are similar to that one for these reasons.

Knowledge graphs also let you do something language models cannot: deductive reasoning. If a region is in a province, and a province is in a country, then the region is in the country. A small Datalog program captures this in three lines. A 7B model needs millions of examples to approximate it, and even then it will get edge cases wrong.

I wanted to test whether this theoretical advantage actually shows up in measurements. Spoiler: it does, in some places more than others.

The dataset

I started with the WineEnthusiast 130k reviews collection on Kaggle. Flat CSV. One row per review, with columns for country, province, region, winery, variety, points, price, taster, and the review text itself.

After filtering to a tractable slice (France, Italy and Spain; points >= 85; non-null price/region/winery/variety) I had:

34,189 wines
6,181 distinct wineries (after fuzzy-canonicalising names like Château Latour vs Chateau Latour)
363 grape varieties
806 named regions
29 provinces
3 countries

Big enough that the geographic hierarchy is interesting, small enough that everything trains in minutes. Critically, the source is not a knowledge graph. It is a CSV. I had to build the graph myself, which was the point. Reusing Wikidata or DBpedia would have proven nothing.

Constructing the graph

The pipeline is four idempotent stages, all driven from one set of pandas-emitted parquet tables:

Filter and normalise. Strip accents, lower-case, remove the boilerplate prefixes (Château, Domaine, Tenuta, Bodegas, Cantina, Maison, Casa), then fuzzy-cluster winery strings with rapidfuzz token-set ratio >= 92. The 6,958 raw winery strings collapse to 6,181 canonical entities. Vintages come from a regex on the title; there is no vintage column in the source data.
Bucket continuous attributes. Price into bands (<15, 15-30, 30-60, 60-120, >120); points into quality tiers (Good, VeryGood, Outstanding, Classic).
Emit entities and edges. One parquet per node type, one per edge type, surrogate IDs derived from sha1 hashes for stability.
Dual-load. Push the same data into Neo4j (property graph, used for ad-hoc Cypher) and into RDF (N-triples, used by the logic engine). 41,623 nodes and 232,263 relationships in Neo4j; 418,187 triples in RDF, all consistent.
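The deterministic parts of the first three stages can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not the repository's code: the function names (normalise_winery, price_band, surrogate_id) are hypothetical, the handling of band boundaries such as exactly 30 or exactly 120 is a guess, and the rapidfuzz clustering step is left out entirely.

```python
import hashlib
import unicodedata

# Prefixes listed in the post; a leading one is dropped before comparison.
PREFIXES = ("chateau", "domaine", "tenuta", "bodegas", "cantina", "maison", "casa")


def normalise_winery(name: str) -> str:
    """Strip accents, lower-case, and drop a leading boilerplate prefix."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    tokens = no_accents.lower().split()
    if len(tokens) > 1 and tokens[0] in PREFIXES:
        tokens = tokens[1:]
    return " ".join(tokens)


def price_band(price: float) -> str:
    """Map a price to one of the five bands (boundary handling is an assumption)."""
    if price < 15:
        return "<15"
    if price < 30:
        return "15-30"
    if price < 60:
        return "30-60"
    if price <= 120:
        return "60-120"
    return ">120"


def surrogate_id(node_type: str, key: str) -> str:
    """Derive a stable ID: the same (type, key) pair always hashes the same way."""
    digest = hashlib.sha1(f"{node_type}:{key}".encode("utf-8")).hexdigest()
    return f"{node_type}_{digest[:12]}"  # 12-character truncation is an assumption


latour_a = normalise_winery("Château Latour")
latour_b = normalise_winery("Chateau Latour")
```

Because the ID depends only on the node type and the normalised key, re-running a stage yields the same IDs, which is one way to get the idempotence the pipeline relies on.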

The schema is small but expressive. Nine relation types: producedBy, madeFrom, fromRegion, inProvince, inCountry, reviewedBy, hasVintage, hasPriceBand, hasQualityTier. Plus a derived recursion target, locatedIn, declared as owl:TransitiveProperty.

That last property is the whole point of using RDF here: a recursive rule fills in the geographic closure automatically.

The two solvers

To make the comparison meaningful I defined two tasks both solvers had to attack:

Task A, link prediction: given a query wine, return the top 10 wines it is most similar to. Gold positives = same variety, same province, points within +/-2, price within 25%.
Task B, KG completion: mask 5% of madeFrom edges; predict the held-out variety from the rest of the wine’s structure.
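Both task definitions are small enough to write down directly. The sketch below is one plausible reading, not the harness itself: the Wine record and the function names are hypothetical, "price within 25%" is interpreted relative to the query wine's price, and the mask size is taken as the floor of 5% of the edge count.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Wine:
    wine_id: str
    variety: str
    province: str
    points: int
    price: float


def is_gold_positive(query: Wine, candidate: Wine) -> bool:
    """Task A gold label: same variety and province, points within 2, price within 25%."""
    if query.wine_id == candidate.wine_id:
        return False
    return (
        query.variety == candidate.variety
        and query.province == candidate.province
        and abs(query.points - candidate.points) <= 2
        # Assumption: the 25% tolerance is measured against the query price.
        and abs(query.price - candidate.price) <= 0.25 * query.price
    )


def mask_edges(edges, fraction=0.05, seed=42):
    """Task B split: hold out a seeded random fraction of edges as the test set."""
    rng = random.Random(seed)
    n_test = int(len(edges) * fraction)
    test_idx = set(rng.sample(range(len(edges)), n_test))
    train = [e for i, e in enumerate(edges) if i not in test_idx]
    test = [e for i, e in enumerate(edges) if i in test_idx]
    return train, test
```

If every one of the 34,189 wines carries a single madeFrom edge, the floor of 5% comes to 1,709 held-out edges.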

A shared evaluation harness drives any solver implementing the same Solver Protocol. Both solvers see identical splits. Same metrics (Hits@1, Hits@3, Hits@10, MRR). No moving the goalposts.
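A minimal sketch of what such a harness could look like follows. The single rank method on the Protocol, the evaluate function, and the FixedSolver stand-in are all assumptions made for illustration. Hits@k is taken here to mean "at least one gold item in the top k", and the reciprocal rank uses the first gold item found; the real harness may define either differently.

```python
from typing import Dict, Protocol, Sequence, Set


class Solver(Protocol):
    def rank(self, query_id: str) -> Sequence[str]:
        """Return candidate IDs for the query, best first."""
        ...


def hits_at_k(ranked: Sequence[str], gold: Set[str], k: int) -> float:
    """1.0 if any gold item appears in the top k, else 0.0."""
    return 1.0 if any(c in gold for c in ranked[:k]) else 0.0


def reciprocal_rank(ranked: Sequence[str], gold: Set[str]) -> float:
    """1 / position of the first gold item, or 0.0 if none is ranked."""
    for position, candidate in enumerate(ranked, start=1):
        if candidate in gold:
            return 1.0 / position
    return 0.0


def evaluate(solver: Solver, gold_by_query: Dict[str, Set[str]], ks=(1, 3, 10)):
    """Average Hits@k and MRR over all queries."""
    totals = {f"hits@{k}": 0.0 for k in ks}
    totals["mrr"] = 0.0
    for query_id, gold in gold_by_query.items():
        ranked = solver.rank(query_id)
        for k in ks:
            totals[f"hits@{k}"] += hits_at_k(ranked, gold, k)
        totals["mrr"] += reciprocal_rank(ranked, gold)
    n = len(gold_by_query)
    return {name: total / n for name, total in totals.items()}


class FixedSolver:
    """Toy solver that replays canned rankings; satisfies the Protocol structurally."""

    def __init__(self, rankings):
        self.rankings = rankings

    def rank(self, query_id: str) -> Sequence[str]:
        return self.rankings[query_id]


toy = FixedSolver({"q1": ["a", "b", "c"], "q2": ["x", "y", "z"]})
scores = evaluate(toy, {"q1": {"b"}, "q2": {"not_ranked"}})
```

Because the Protocol is structural, the logic and ML solvers need no common base class; anything with a matching rank method can be passed to evaluate.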

Logic solver: Datalog rules

Five rules in a .dl file, evaluated by a small fixpoint engine I wrote on top of pandas. The interesting ones:

    % R1 (recursive): transitive geographic closure.
    locatedIn(R, P) :- inProvince(R, P).
    locatedIn(P, C) :- inCountry(P, C).
    locatedIn(X, Z) :- locatedIn(X, Y), locatedIn(Y, Z).

    % R2 (creates new edges): recommend a similar wine.
    recommend(W1, W2) :- sharesVariety(W1, W2), sameProvince(W1, W2), sharesPriceBand(W1, W2), sharesQualityTier(W1, W2), W1 != W2.

    % R5: a winery is "premium" if it has at least 3 wines
    % in the Outstanding or Classic quality tier.
    premiumWinery(Y) :- producedBy(W, Y), hasQualityTier(W, Q), Q in {qt_outstanding, qt_classic}, count(W) >= 3.

The locatedIn closure converges in two iterations from 835 base edges to 1,641 derived pairs. R2 materialises 2.6 million similarTo-style edges across the whole graph. R5 derives 247 premium wineries (Zind-Humbrecht, Louis Jadot, Leflaive, Latour, Louis Roederer at the top). All of this in 8.2 seconds end-to-end.

These are not facts I had to put in. They are facts that follow from facts I put in.
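To show how a fixpoint evaluation of R1 arrives at such facts, here is a naive sketch in plain Python. It is not the pandas engine from the repository; the function name and the toy place names are chosen for illustration, and counting the final, empty pass as an iteration is one possible reading of "converges in two iterations".

```python
from collections import defaultdict


def located_in_closure(in_province, in_country):
    """Naive fixpoint for R1: repeat the transitive rule until nothing new appears."""
    located = set(in_province) | set(in_country)  # the two base rules
    passes = 0
    while True:
        passes += 1
        # Index current facts by their first argument to join on the shared Y.
        successors = defaultdict(set)
        for x, y in located:
            successors[x].add(y)
        derived = {
            (x, z)
            for x, y in located
            for z in successors.get(y, ())
        } - located
        if not derived:
            return located, passes  # fixpoint reached: the last pass added nothing
        located |= derived


# Toy data: three regions, two provinces, two countries.
in_province = {("Pauillac", "Bordeaux"), ("Margaux", "Bordeaux"), ("Barolo", "Piedmont")}
in_country = {("Bordeaux", "France"), ("Piedmont", "Italy")}
closure, passes = located_in_closure(in_province, in_country)
```

With a three-level hierarchy, the first pass derives every region-to-country pair and the second confirms there is nothing left to add. The same arithmetic is consistent with the figures above: 806 regions plus 29 provinces give 835 base edges, and adding one region-to-country pair per region gives 1,641.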

ML solver: ComplEx embeddings

For the machine-learning side I trained a ComplEx knowledge-graph embedding with PyKEEN. Embedding dim 128 (256 real components), 100 epochs of LCWA training with negative sampling, Adam at 1e-3, batch size 512, seed pinned. Final training loss 0.006.

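Why ComplEx? Complex-valued embeddings handle asymmetric relations natively, and producedBy, madeFrom and fromRegion are all asymmetric. A symmetric scorer such as DistMult could not tell a triple from its reverse, and TransE would struggle with the many-to-one shape of these relations (many wines share each variety, winery and region).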
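The asymmetry comes straight from the scoring function, which is the real part of the sum of head times relation times the complex conjugate of the tail. The sketch below evaluates that formula on hand-picked two-dimensional vectors using Python's built-in complex type. The vectors are illustrative values, not trained embeddings, and this is not PyKEEN's implementation.

```python
def complex_score(head, relation, tail):
    """ComplEx score: Re( sum_i  h_i * r_i * conj(t_i) )."""
    return sum(h * r * t.conjugate() for h, r, t in zip(head, relation, tail)).real


# Hand-picked illustrative vectors (dimension 2), not learned values.
head = [1 + 2j, 0.5 - 1j]
tail = [2 - 1j, -1 + 0.5j]
relation = [0.3 + 0.8j, -0.6 + 0.2j]  # non-zero imaginary parts

forward = complex_score(head, relation, tail)   # score of (head, relation, tail)
backward = complex_score(tail, relation, head)  # score of the reversed triple

# With a purely real relation vector the score becomes symmetric.
real_relation = [0.3 + 0j, -0.6 + 0j]
sym_forward = complex_score(head, real_relation, tail)
sym_backward = complex_score(tail, real_relation, head)
```

Swapping head and tail conjugates the product of the two entity vectors, so the imaginary part of the relation contributes with the opposite sign. That is what lets the model score (wine, producedBy, winery) differently from its reverse.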

The model exposes nine relation types over 41,620 entities, totalling 230,554 training triples. The 1,709 madeFrom edges that constitute Task B’s test set were masked from training to keep the evaluation honest.

For Task A, the solver scores candidate wines by cosine similarity in the learned embedding space. For Task B, it uses PyKEEN’s score_t to rank all varieties for a given wine.

What the numbers said

Solver	Task	Hits@1	Hits@3	Hits@10	MRR
random	A	0.0001	0.0003	0.0005	0.0002
random	B	0.0018	0.0047	0.0222	0.0055
logic	A	0.0308	0.0769	0.1831	0.0673
logic	B	0.6975	0.9427	0.9994	0.8213
ml	A	0.0001	0.0005	0.0016	0.0004
ml	B	0.0919	0.2264	0.5073	0.1983

A few things jump out.

On Task B (variety prediction), logic wins overwhelmingly. Hits@1 of 0.70, against a random baseline of 1-in-363. The reason is structural: variety is essentially a deterministic function of (winery, province) for the majority of wines in this slice. A five-line Datalog program captures this exactly. ML reaches Hits@10 = 0.51, which is impressive given it has to learn the same pattern from triples without being told the rule, but still clearly second-best.

On Task A (similarity), logic also wins, but the gap is more interesting. Logic gets Hits@10 = 0.18, a 367x improvement over random. The cap is set by the harness (k = 10), not by the solver’s recall. The recommend-set has thousands of candidates per anchor wine, and presenting only ten of them is naturally lossy. ML, surprisingly, barely beats random on this one (Hits@10 of 0.0016 against 0.0005). The reason became clear when I dug in: ComplEx is trained to score triple plausibility, so cosine similarity between its entity vectors is not proximity in any sense aligned with the gold definition. A learned siamese head trained directly on similarTo edges would likely close the gap.

To make the contrast concrete, I sampled 200 Task B wines and bucketed them by which solver got the gold variety:

Bucket	Count (n=200)
Both correct	15
Logic only	129
ML only	3
Both wrong	53

129 vs 3. Logic dominates the disagreement set by 43x. The “both wrong” bucket is where the interesting failures live, typically rare varieties (a Sangiovese clone called Prugnolo Gentile shows up here, almost no training signal).

Where ML earns its keep

The numbers above make ML look like a worse logic solver. But there are two things logic cannot do that the embeddings handle naturally.

First, soft generalisation. The Prugnolo Gentile case is informative: ComplEx places the wine close in vector space to other Tuscan reds. The top-1 prediction is wrong (it picks Aglianico, a southern Italian grape) but it is wrong in an interesting way. It has learned that this is some kind of rich Italian red. The five-line Datalog program has no such fallback; it either knows the variety from the winery-province pattern or it does not.

Second, the embeddings are reusable. Once trained, the same vectors can be queried for any nearest-neighbour task: similar wines, similar wineries, similar regions, anomaly detection. The Datalog rules are bespoke to each predicate.

The honest read: logic and ML are complementary, not competitive. ML can teach logic where its thresholds are wrong (R4’s “characteristic variety” cutoff of 50 is too coarse for low-volume varieties, which is exactly where ML’s confident wrong answers cluster). Logic can repair ML by rejecting predictions that violate hard constraints (a Spanish-only variety predicted for a Bordeaux wine is a Datalog query, not a learned property).
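The "logic repairs ML" direction can be sketched as a filter over the embedding model's ranked output. The function names are hypothetical, and the constraint used here (a variety must have been observed in the wine's country) is one simple stand-in for whatever hard constraints the Datalog side would actually supply. The observed pairs should come from training edges only, otherwise the filter leaks test labels.

```python
from collections import defaultdict


def varieties_by_country(observed_pairs):
    """Build the hard constraint: which varieties have been seen in which country."""
    allowed = defaultdict(set)
    for variety, country in observed_pairs:
        allowed[country].add(variety)
    return dict(allowed)


def repair_ranking(ranked_varieties, country, allowed):
    """Drop predictions that violate the constraint, preserving the model's order.

    Falls back to the unfiltered ranking when the filter would remove everything,
    for example for a country with no observed varieties.
    """
    permitted = allowed.get(country, set())
    kept = [v for v in ranked_varieties if v in permitted]
    return kept if kept else list(ranked_varieties)


# Toy example with training-time (variety, country) observations.
allowed = varieties_by_country([
    ("Tempranillo", "Spain"),
    ("Merlot", "France"),
    ("Cabernet Sauvignon", "France"),
    ("Sangiovese", "Italy"),
])
ml_ranking = ["Tempranillo", "Merlot", "Sangiovese", "Cabernet Sauvignon"]
repaired = repair_ranking(ml_ranking, "France", allowed)
```

The model's relative ordering is left alone; only candidates that the graph can prove impossible are removed, so a correct answer ranked second behind an impossible one moves up to first.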

The part that actually matters: LLM grounding

This is the experiment I cared about most.

I picked 20 query wines and asked Gemini Flash to recommend three similar wines for each, in two conditions:

Unaided. Just the query wine, no context.
Grounded. The same prompt, plus a list of eight candidate wines pulled from the KG (the logic solver’s top recommendations).

Then I checked every winery the model named against the canonical winery list from the KG, normalising both sides for accent stripping and prefix removal.

Condition	Wineries named	Hallucinated	Rate
Unaided	60	16	26.7%
Grounded	52	4	7.7%

A 3.5x reduction in hallucinated wineries with no change to the model. Same prompt template, same temperature, same model version. The only difference was eight lines of KG-supplied candidate wines.
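The check and the arithmetic behind the table can be sketched as follows. The function name is hypothetical, and the default normaliser is a simple case-fold; the accent-stripping and prefix-removing normaliser described above would be passed in instead. The winery "Domaine Fictif" is an invented name used only to trigger a miss.

```python
from typing import Callable, Iterable, List, Tuple


def hallucination_rate(
    named: Iterable[str],
    canonical: Iterable[str],
    normalise: Callable[[str], str] = str.casefold,
) -> Tuple[List[str], float]:
    """Return the named wineries missing from the canonical list, and their share."""
    known = {normalise(name) for name in canonical}
    named_list = list(named)
    unknown = [name for name in named_list if normalise(name) not in known]
    rate = len(unknown) / len(named_list) if named_list else 0.0
    return unknown, rate


# Toy check: one of three named wineries is not in the canonical list.
unknown, rate = hallucination_rate(
    ["Louis Jadot", "Domaine Fictif", "LEFLAIVE"],
    ["Louis Jadot", "Leflaive"],
)

# Arithmetic behind the table, using the counts reported in the post.
unaided_rate = 16 / 60      # about 26.7%
grounded_rate = 4 / 52      # about 7.7%
reduction = unaided_rate / grounded_rate
```

Note that the check only establishes that a name is absent from this graph. A real winery outside the France, Italy and Spain slice would also count as a miss.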

This is the standard RAG result, demonstrated against this KG and this candidate set. It is also the answer to “should Sommo v3 use a knowledge graph?” The answer is yes. The question now is engineering: real-time vector search over the candidate set, latency budget on the iOS round-trip, what to do when the KG returns nothing useful.

The reverse direction: LLM enriches KG

The graph has zero edges for tasting notes. The CSV has descriptions like “This bright, juicy red shows ripe black cherry, leather and a hint of cinnamon, with firm but ripe tannins.” All of that vocabulary is wasted on a structured pipeline.

So I asked Gemini to extract tasting descriptors from 30 wine descriptions. It returned 180 descriptor tuples drawn from 138 unique terms. Top descriptors: leather, spice, mineral, wood, toast, cinnamon, acidity, pineapple, juicy. Every one of them is a candidate new edge: Wine -hasTastingNote-> Descriptor.
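Turning extracted descriptors into candidate edges is mostly bookkeeping. The sketch below assumes the LLM output has already been parsed into a mapping from wine ID to a list of descriptor strings; that shape, the function names, and the toy data are all assumptions.

```python
from collections import Counter


def tasting_note_edges(extracted):
    """Turn {wine_id: [descriptor, ...]} into (wine, hasTastingNote, descriptor) triples."""
    edges = set()
    for wine_id, descriptors in extracted.items():
        for descriptor in descriptors:
            term = descriptor.strip().lower()  # light normalisation so duplicates merge
            if term:
                edges.add((wine_id, "hasTastingNote", term))
    return edges


def top_descriptors(edges, n=3):
    """Most frequent descriptor terms, ties broken alphabetically for stable output."""
    counts = Counter(term for _, _, term in edges)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:n]


# Toy input standing in for parsed LLM output.
extracted = {
    "w1": ["Leather", "cinnamon", "black cherry"],
    "w2": ["leather ", "spice"],
    "w3": ["leather", "spice", ""],
}
edges = tasting_note_edges(extracted)
top = top_descriptors(edges, n=2)
```

Before loading such edges, it would be worth validating the terms against a controlled vocabulary, since near-synonyms ("wood", "oak", "toast") otherwise fragment into separate nodes.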

This is the cleanest cooperation pattern I have seen between LLMs and KGs. The LLM does what the structured pipeline cannot (read free text). The KG does what the LLM cannot (verify that named entities actually exist). The two enrich each other in directions that play to their strengths.

What I would do differently

A few things I would change for v2 of this experiment:

Train ML on the actual prediction task. I used ComplEx similarity for Task A because it was easy. A learned siamese head trained directly on the similarTo edges materialised by R2 would almost certainly outperform logic on Task A, by exploiting the soft generalisation that pure cosine similarity throws away.

Push more relations into the graph. The current schema is geographic plus quality plus price. Adding terroir (soil, climate, altitude), winemaking technique (oak ageing, fermentation vessel), and tasting profile (the LLM-extracted descriptors above) would give both solvers more to work with. It would also test whether logic’s dominance on Task B holds up in a less structurally-loaded setting.

Move from offline batches to live queries. Right now the logic engine pre-materialises 2.6 million recommend edges. For a real recommender this is wasteful. Most queries touch a tiny fraction. A query-time evaluator over a smaller derived index would scale much further.
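One way to picture the query-time alternative: since R2 only fires when two wines agree on variety, province, price band and quality tier, it is enough to index wines by that four-part key and look up the bucket on demand. The sketch below is a hypothetical illustration of that idea, with made-up wine IDs; it is not how the current engine works.

```python
from collections import defaultdict


def build_index(wines):
    """Group wine IDs by the (variety, province, price band, quality tier) key of R2."""
    index = defaultdict(list)
    for wine_id, key in wines.items():
        index[key].append(wine_id)
    return dict(index)


def recommend(wine_id, wines, index, limit=10):
    """Answer R2 at query time: every other wine that shares the query's key."""
    bucket = index.get(wines[wine_id], [])
    return [other for other in bucket if other != wine_id][:limit]


def materialised_pair_count(index):
    """Number of ordered recommend(W1, W2) pairs that full materialisation would store."""
    return sum(len(bucket) * (len(bucket) - 1) for bucket in index.values())


# Toy data: wine ID -> (variety, province, price band, quality tier).
wines = {
    "w1": ("Sangiovese", "Tuscany", "30-60", "Outstanding"),
    "w2": ("Sangiovese", "Tuscany", "30-60", "Outstanding"),
    "w3": ("Sangiovese", "Tuscany", "30-60", "Outstanding"),
    "w4": ("Sangiovese", "Tuscany", "60-120", "Outstanding"),
}
index = build_index(wines)
similar = recommend("w1", wines, index)
```

The index holds one entry per wine, while materialisation grows quadratically with bucket size, which is where millions of edges come from when popular variety and province combinations hold hundreds of wines.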

Build the actual retrieval layer. The grounding experiment used eight pre-computed candidates. The production version needs ANN search over wine embeddings, with filters for budget, region, and varietal preferences. Standard infrastructure, but the KG decides what gets indexed.

Try it yourself

Everything is open. The full pipeline reproduces from a clean machine via Docker:

    git clone https://github.com/gokhanarkan/wine-knowledge-graph
    cd wine-knowledge-graph

    # Download winemag-data-130k-v2.csv from
    # https://www.kaggle.com/datasets/zynicide/wine-reviews
    # and place it at data/raw/

    make up             # Neo4j + Python container
    make smoke          # verify deps + dataset
    make prep-data      # filter, normalise, emit entity tables
    make build-kg       # load both stores; render figure
    make build-splits
    make logic-derive
    make eval-logic-A && make eval-logic-B
    make ml-install
    make ml-triples && make ml-train
    make eval-ml-A && make eval-ml-B
    make compare        # comparative analysis tables

All deterministic seeds are pinned to 42. The whole thing fits on a laptop.

What this means for Sommo

Sommo v1 was a fine-tuned model. Sommo v2 added proprietary training data, MCP connections, and the eval discipline that pushed accuracy as far as LLM-only techniques will take it. Sommo v3, when it ships, will pair the model with a knowledge graph. Not because graphs are fashionable, but because the numbers say so.

The 26.7% to 7.7% hallucination drop is the headline. The variety-prediction Hits@1 of 0.70 is the proof that even a five-line logic program can outclass a trained embedding model on the right kind of structured task. The lesson is not “abandon LLMs”. The lesson is “give them something solid to stand on”.

Wine is just the test domain. The same pattern (LLM for natural language, KG for structured truth, each compensating for the other’s failure modes) applies anywhere you care whether the names your model speaks are real.


Knowledge-Graphs - Gökhan Arkan