seekon.me logo
Blog

When your search router picks an empty shelf

Hierarchy-aware category routing for product search

Berke Sokhan

Berke Sokhan

CoFounder, CEO/CTO·
catalogsearch

Many product search systems use a two-stage pipeline: first route the user's query to a product category, then retrieve products inside that category. This post describes a failure mode that emerges when a hierarchical taxonomy is stored as a flat list, the design alternatives we considered, and the solution we shipped.

The problem

Consider a catalog that classifies products with a hierarchical taxonomy such as the Google Product Taxonomy, where every node has a numeric id and a full path:

141 - Cameras & Optics
142 - Cameras & Optics > Cameras
152 - Cameras & Optics > Cameras > Digital Cameras

The search pipeline works in two stages:

  • Category routing. Embed the user query, run a nearest-neighbor search over category embeddings, and optionally let an LLM disambiguate among the top candidates.
  • Product retrieval. Run hybrid retrieval (lexical BM25 plus vector similarity, fused with Reciprocal Rank Fusion) scoped to the matched category with a strict equality filter:
WHERE category_id = :category_id

Now a user searches for "Nikon DSLR camera". The router matches category 142 ("Cameras & Optics > Cameras") -- a perfectly reasonable answer. But all camera products in the catalog are tagged at the leaf, category 152 ("Digital Cameras"). The equality filter finds nothing. The user sees zero results even though the catalog is full of cameras.

The root cause: the taxonomy is hierarchical, but the data model is flat. Every taxonomy line becomes an independent row. The full path survives only as a display string, there is no parent reference, and nothing connects node 142 to node 152. Parent and child categories compete as unrelated siblings in the embedding index, and whichever one wins becomes a hard gate for retrieval.

Design alternatives

We evaluated five approaches, roughly ordered from least to most invasive.

Option 1: Fallback cascade

Keep the flat model. After routing, check whether the matched category has products; if not, find descendants by prefix-matching the stored path string ("Cameras & Optics > Cameras > %") and re-rank them against the query.

  • Pros: pure code change, no migration, ships fast.
  • Cons: string matching on display text is fragile; adds an extra query per search; descendant id lists must be threaded through every retriever.

Option 2: Materialize the hierarchy

Make the tree first-class on the category table:

ALTER TABLE categories
    ADD COLUMN parent_id INTEGER REFERENCES categories(id),
    ADD COLUMN path_ids  INTEGER[];  -- ancestors, root-first, including self

path_ids for "Digital Cameras" becomes [141, 142, 152]. Subtree expansion is then deterministic SQL:

SELECT id FROM categories WHERE :matched_id = ANY(path_ids);

An alternative implementation of the same idea is the PostgreSQL ltree extension; for taxonomies of a few thousand nodes, a plain integer array with a GIN index is sufficient and simpler.

  • Pros: correct by construction; enables breadcrumbs, faceted browse, and rollup counts later.
  • Cons: migration plus backfill; retrieval still needs an IN (descendant_ids) expansion at query time.

Option 3: Denormalize the lineage onto the search index

Build on Option 2, but instead of expanding descendant ids per query, stamp each product's full category lineage onto the denormalized search documents that the retrievers already read:

ALTER TABLE search_documents
    ADD COLUMN category_path_ids INTEGER[];  -- e.g. [141, 142, 152]

CREATE INDEX ON search_documents USING gin (category_path_ids);

Every retriever's category filter becomes a single index-friendly predicate:

WHERE :category_id = ANY(category_path_ids)

A match on parent 142 now transparently returns products tagged at leaf 152, with no per-query subtree lookup and no IN lists.

  • Pros: zero query-time expansion cost; the matched node can be any level of the tree; identical predicate across lexical and vector retrievers.
  • Cons: requires Option 2 anyway; one-time reindex of search documents.

Option 4: Make the router subtree-aware instead of the retrieval

Leave retrieval exact-match, but only ever route to populated categories: restrict the candidate set to categories with products, or annotate LLM candidates with product counts and instruct the model to avoid empty nodes.

  • Pros: confined to the routing layer; no schema changes.
  • Cons: probabilistic -- if the right leaf is not among the top vector candidates, there is nothing to select; counts go stale unless computed live.

Option 5: Soften the category gate

Search across all category candidates above a similarity threshold (category_id IN (...)) and let rank fusion plus a downstream relevance filter discriminate.

  • Pros: eliminates the "one wrong category equals zero results" failure class entirely.
  • Cons: noisier candidate pools; structured filters built around a single category's property schema need reworking; higher LLM cost per search.

The solution we shipped

We combined Options 2 and 3 as the core design, with Option 4's product-count annotation as a cheap addition to the router.

sequenceDiagram
    participant Q as User query
    participant V as Category vector search
    participant S as LLM category selector
    participant R as Hybrid retrieval

    Q->>V: embed query, top-K categories
    Note over V: candidates annotated with<br/>subtree product counts
    V->>S: ambiguous? LLM picks a node (any level)
    S->>R: category_id = 142 (a parent is fine now)
    R->>R: WHERE :id = ANY(category_path_ids)
    Note over R: matches products tagged<br/>at 152, 153, and deeper

The pieces:

  • Hierarchy columns. parent_id and path_ids on the category table, derived once from the taxonomy path strings and maintained by the taxonomy import going forward. Custom categories without a taxonomy path degrade gracefully to single-node trees (path_ids = [id]).
  • Lineage on the search projection. A category_path_ids array, GIN-indexed, on the denormalized documents and property rows that the BM25 and vector retrievers read. It is stamped whenever a product's search projection refreshes.
  • Subtree-aware predicates. Every retriever filter changed from equality to:
   WHERE (:category_id IS NULL
          OR category_id = :category_id
          OR :category_id = ANY(category_path_ids))

The equality clause doubles as a fallback for rows whose lineage has not been backfilled yet, which makes the rollout order-independent: the code can deploy before the data backfill without breaking anything.

  • Subtree-aware structured filters. Query-to-filter extraction (price ceilings, numeric properties, and similar hard constraints) now reads observed property values from the whole subtree, so a parent-level match still sees the property schema of products tagged at the leaves.
  • A smarter router. Category candidates passed to the LLM selector carry subtree product counts. The selector is told that choosing a parent is acceptable (retrieval covers the subtree) but to avoid nodes whose entire subtree is empty when an equally accurate populated candidate exists. The high-confidence shortcut that skips the LLM only fires when the top match's subtree actually contains products.

Rollout

The migration is a single idempotent script: add columns and GIN indexes, derive parent_id by matching each path against its parent prefix, compute path_ids with a recursive CTE, and stamp the lineage onto the search projection tables. No re-embedding is required because the change is purely about filtering, not about what gets embedded.

WITH RECURSIVE tree AS (
    SELECT id, ARRAY[id] AS path_ids
    FROM categories WHERE parent_id IS NULL
    UNION ALL
    SELECT c.id, t.path_ids || c.id
    FROM categories c JOIN tree t ON c.parent_id = t.id
)
UPDATE categories SET path_ids = tree.path_ids
FROM tree WHERE categories.id = tree.id;

Lessons

  • A hard gate amplifies upstream errors. When stage one's output is a strict filter for stage two, any stage-one ambiguity becomes a total failure. Either make the gate tolerant (subtrees, candidate sets) or make stage one structurally aware of what stage two can actually serve.
  • Do not store hierarchy as prose. A path string in a description field is a rendering, not a data model. If the taxonomy is a tree, model the tree; materialized ancestor arrays are cheap and make subtree queries trivial.
  • Denormalize toward the hot path. Expanding descendants per query works but pushes complexity into every retriever. Stamping the lineage onto the search index turns a join-or-expand problem into a single array-containment predicate that one GIN index serves.
  • Soft instructions are not structure. Telling an LLM to "prefer the leaf category" treats a data-model gap as a prompting problem. Give the model facts (product counts, hierarchy) and make the system correct even when the model picks a different level than you expected.