
When Search Breaks in Prod: How I Debugged & Fixed Broken Search


tl;dr

I tracked down a production issue where alphanumeric code searches in ArangoSearch failed silently. After deep investigation, I found a misconfigured analyzer. I revamped it with a pipeline setup, fixed the streamType, and reindexed the views—restoring accurate search results for users and improving the app’s reliability.

ArangoDB’s ArangoSearch uses analyzers to break up indexed values. By default, the ngram analyzer slices input by byte (streamType "binary"), so it treats each byte as a character. This detail turned out to be crucial: we found that queries for alphanumeric codes (e.g. "C1A20") in a view were returning no results, whereas numeric-only codes did match. For example, given a document like:

{
  "_key": "item1",
  "search_terms": ["C1A20", "B12345", 67890, 123]
}

the query

FOR doc IN myView
  SEARCH doc.search_terms == "C1A20"
  RETURN doc

returned no documents. Yet the query

FOR doc IN myView
  SEARCH doc.search_terms == 67890
  RETURN doc

did match that document. (Numeric codes matched because they were indexed as numbers, not text.)

The discrepancy hinted at an analyzer issue: our view was indexing the search_terms field with an ngram analyzer that defaulted to binary mode. In binary mode, multibyte or non-numeric characters can be mishandled; in our case, the letters in "C1A20" were not being broken into searchable tokens.

This bug surfaced after upgrading to ArangoDB 3.12, but the root cause was in the analyzer configuration rather than the upgrade itself.

Root Cause: Analyzer Misconfiguration

We discovered that the analyzer for search_terms was misconfigured. By default, ArangoDB’s ngram analyzer uses streamType: "binary", which treats each byte as a character. For alphanumeric strings like "C1A20", this means the analyzer was slicing at the byte level and failing to generate the expected n-grams. The official docs warn that "binary" mode only handles single-byte characters properly, whereas "utf8" mode treats full Unicode code points as characters.
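
To see this in isolation, you can compare the two stream types directly in arangosh. This is a quick diagnostic sketch; the analyzer names ngram_bin and ngram_utf8 are throwaway names for the comparison:

var analyzers = require("@arangodb/analyzers");
// Two n-gram analyzers that differ only in streamType
analyzers.save("ngram_bin", "ngram",
  { min: 2, max: 3, preserveOriginal: false, streamType: "binary" },
  ["frequency", "norm", "position"]);
analyzers.save("ngram_utf8", "ngram",
  { min: 2, max: 3, preserveOriginal: false, streamType: "utf8" },
  ["frequency", "norm", "position"]);
// Compare what each one actually emits for a problem code
db._query('RETURN TOKENS("C1A20", "ngram_bin")').toArray();
db._query('RETURN TOKENS("C1A20", "ngram_utf8")').toArray();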

In our case, the fix was to switch to UTF-8 mode. As one ArangoDB engineer put it: “combine an n-gram Analyzer with a View to enable index-accelerated substring matching... Just make sure you set the streamType to utf8.” In other words, we needed an n-gram analyzer with streamType: "utf8" so that letters and digits are handled correctly.

Another issue was that our original analyzer did not preserve the full string. Without preserveOriginal: true, the ngram analyzer would only produce substrings, not the original token itself. We needed the full code (“C1A20”) indexed as a token too, so exact queries would match. Finally, I realized that after changing analyzers, the view needed to be recreated (or re-indexed) to pick up the new analyzer. In ArangoSearch, analyzers are bound in the view definition, and you cannot simply change them on the fly without rebuilding the index.
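
A quick way to verify what a view is actually using is to inspect its properties in arangosh (myView and myCollection here are the names from the examples in this post):

// Show which analyzers the view currently binds per field
var props = db._view("myView").properties();
props.links.myCollection.fields;
// e.g. { "search_terms": { "analyzers": [ "identity" ] } } if the link was never updated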

Troubleshooting Steps

Before arriving at the fix, I tried several approaches:

  • Boosting scores manually. I briefly experimented with AQL scoring functions (e.g. BM25/TFIDF) to force partial matches, but this was a hack and didn’t address the missing tokens issue.
  • Testing analyzer output. I ran TOKENS("C1A20", "<analyzer>") in arangosh to see how the analyzer was tokenizing our codes. In binary mode, it was splitting oddly or not at all.
  • Double-checking the view definition. I confirmed that our ArangoSearch View had the correct field linked. If the analyzer name in the view didn’t match our saved analyzer, searches would silently use the default identity analyzer instead.
  • Recreating the view. I tried “refreshing” the view by dropping and recreating it after saving a new analyzer (see the snippet after this list), since ArangoDB doesn’t auto-update existing indexes with changed analyzers.
  • Verifying feature compatibility. I checked that our ArangoDB version (3.12) supported pipeline analyzers and the ngram features we needed. The upgrade didn’t break anything by itself, but it highlighted the need to set streamType: "utf8". (ArangoDB 3.12 also introduced a new wildcard analyzer, but I solved this without using it.)
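
For reference, the drop-and-recreate step from the list above looks like this in arangosh; the full view definition appears in the Solution section below:

// Links don’t pick up a replaced analyzer on their own,
// so the only reliable “refresh” is to drop and recreate the view
db._dropView("myView");
// ...then re-run the db._createView(...) call shown below.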

None of these attempts alone worked until I focused on the analyzer configuration.

Solution: A UTF-8 Ngram Pipeline Analyzer

The final fix was to define a custom analyzer pipeline that uses identity (to keep the token intact) followed by ngram. This ensures that the full code and its substrings are indexed. I set:

  • streamType: "utf8" so that letters are treated as characters, not raw bytes.
  • preserveOriginal: true so the original code string is also indexed.
  • A suitable n-gram min/max length (e.g. 2–5 for our code lengths).
  • The "frequency", "norm", and "position" features (required for n-gram matching).

Here’s an example arangosh snippet defining the pipeline analyzer:

var analyzers = require("@arangodb/analyzers");
analyzers.save("code_pipeline", "pipeline", {
  pipeline: [
    { type: "identity", properties: {} },
    { type: "ngram", properties: { min: 2, max: 5, preserveOriginal: true, streamType: "utf8" } }
  ]
}, ["frequency", "norm", "position"]);

This creates an analyzer named code_pipeline that first leaves the token unchanged (identity), then generates n-grams on the UTF-8 string. A similar pattern is documented in ArangoDB’s docs: for example, one can build a pipeline chaining a case-normalizer with an n-gram analyzer.
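
For instance, a hypothetical case-insensitive variant of the same idea (chaining a norm stage with ngram, following that documented pattern; the name code_pipeline_ci is illustrative) might look like:

var analyzers = require("@arangodb/analyzers");
// Lowercase first, then n-gram — useful if codes can arrive in mixed case
analyzers.save("code_pipeline_ci", "pipeline", {
  pipeline: [
    { type: "norm",  properties: { locale: "en", case: "lower" } },
    { type: "ngram", properties: { min: 2, max: 5, preserveOriginal: true, streamType: "utf8" } }
  ]
}, ["frequency", "norm", "position"]);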

Next, I recreated the ArangoSearch View, linking the search_terms field to this analyzer. For instance:

db._createView("myView", "arangosearch", {
  links: {
    myCollection: {
      fields: {
        search_terms: { analyzers: ["code_pipeline"] }
      }
    }
  }
});

As the docs explain, the links section specifies which analyzers apply to each field. I ran this, and ArangoDB reindexed the collection with the new analyzer (this can take some time on large collections).

With this fix in place, the example document was now tokenized like this:

  • Original code "C1A20" (preserved).
  • N-grams of length 2–5, e.g. "C1", "1A", "A2", "20", "C1A", "1A2", etc. (depending on the min/max).

Because preserveOriginal: true is set, the full token "C1A20" is in the index, so an exact-match query hits it. Partial matches work too: a search for "C1A" will match one of the indexed n-grams. In practice I query like this:

FOR doc IN myView
  SEARCH ANALYZER(doc.search_terms == @input, "code_pipeline")
  RETURN doc

Here @input might be "C1A20" or even a shorter string. The same query plan handles both exact and partial lookups thanks to the n-gram tokens.
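
From arangosh, the same parameterized query can be exercised with bind variables:

var query = `
  FOR doc IN myView
    SEARCH ANALYZER(doc.search_terms == @input, "code_pipeline")
    RETURN doc
`;
// Exact code and substring both go through the same query plan
db._query(query, { input: "C1A20" }).toArray();
db._query(query, { input: "C1A" }).toArray();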

In my testing, a query such as

FOR doc IN myView
  SEARCH ANALYZER(doc.search_terms == "C1A20", "code_pipeline")
  RETURN doc

now correctly returns the expected document (the original token "C1A20" was indexed). Similarly, searching for "C1A" or "A20" (with length ≥2) also matches because those substrings were indexed as n-grams. This single SEARCH query thus covers both exact and substring cases.

Partial vs Exact Matching

Because the pipeline includes preserveOriginal, the original code is always in the index. For example, indexing "C1A20" with min=2, max=5 and preserveOriginal=true yields tokens like:

  • N-grams: "C1", "1A", "A2", "20", "C1A", "1A2", "A20", "C1A2", "1A20", and
  • Original: "C1A20"

A search for "C1A20" matches the original token. A search for "C1A" (length 3) matches the n-gram "C1A". Even a search like "20" finds the "20" token. I accomplish all this in one analyzer and one query, without needing separate full-text and substring indexes.
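
A one-line sanity check confirms this token set against the live analyzer:

// Should list the n-grams above plus the preserved original "C1A20"
db._query('RETURN TOKENS("C1A20", "code_pipeline")').toArray();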

Lessons Learned

  • Stream type matters. The default streamType: "binary" can silently break multi-character tokens. Always use "utf8" when indexing text containing letters or multi-byte characters.
  • Recreate or reindex after analyzer changes. ArangoSearch links bind to a specific analyzer at index time. Changing an analyzer’s definition (or replacing it) means you must recreate the view or inverted index. As the docs note, “Collection indexes cannot be changed once created. Therefore, you need to create a new inverted index to index a field differently”. In practice I dropped and recreated the view so that it re-indexed with the new analyzer; a sketch of the inverted-index alternative follows this list.
  • Use preserveOriginal for exact matches. If you only use pure n-grams (without preserving the original), short queries might miss the full token. Enabling preserveOriginal: true ensures that exact lookups still hit.
  • Bind analyzers to view fields. Remember to specify the analyzer in the view’s link definition. If you forget, ArangoDB will use its default (identity), leading to unexpected results.
  • Test tokens with TOKENS(). The TOKENS() AQL function was invaluable for verifying what the analyzer was outputting. For example, RETURN TOKENS("C1A20", "code_pipeline") showed me all indexed n-grams and the original token.
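
For completeness, the inverted-index route mentioned above would look roughly like this — a sketch, assuming ArangoDB 3.10+ (which introduced inverted indexes); the index name codes_inverted is illustrative:

// Alternative to a View link: an inverted index bound to the same analyzer
db.myCollection.ensureIndex({
  type: "inverted",
  name: "codes_inverted",
  // [*] expands the array so each code is indexed individually
  fields: [ { name: "search_terms[*]", analyzer: "code_pipeline" } ]
});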

In summary, the fix was to define a UTF-8 n-gram pipeline (identity → ngram) with preserveOriginal, bind it in the view, and reindex. This solved the alphanumeric search issue: one robust analyzer/query now handles both substring and exact code searches.