I spent years building search systems that serve hundreds of millions of users. Systems where results need to appear in single-digit milliseconds. Where privacy constraints mean you can't just ship everything to a server and figure it out there. Where a bad result doesn't just frustrate a user — it erodes trust in the entire product.
When people hear "search," they think it's a solved problem. Google exists. Elasticsearch exists. How hard can it be?
Incredibly hard. And the lessons I learned building search at that scale have shaped how I approach every AI system I build today — whether it's a recommendation engine, a document retrieval system, or a conversational AI product for a startup.
Here are the mental models that transferred.
Relevance is not a ranking problem. It's a comprehension problem.
The biggest misconception about search is that it's about ranking results. It's not. Ranking is the easy part. The hard part is understanding what the user actually wants — which is almost never exactly what they typed.
When someone types "coffee near me" into a maps application, are they looking for a café to sit and work, a drive-through for a quick pickup, or a grocery store that sells coffee beans? The three words they typed could mean any of these things. The system's job isn't to rank coffee shops — it's to figure out which kind of coffee experience this particular user wants at this particular moment.
In the search world, this is called "query understanding," and in my experience it consumes more engineering effort than ranking ever does. Spelling correction, intent classification, entity recognition, query expansion, context modeling — all of these happen before the system even thinks about which results to show.
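Those stages chain into a pipeline that runs before ranking. Here is a toy sketch; the `Interpretation` schema, the correction table, and the category list are all invented for illustration, and a production system would use trained models plus user context at every step:

```python
from dataclasses import dataclass, field

@dataclass
class Interpretation:
    """Structured output of query understanding (hypothetical schema)."""
    corrected: str
    intent: str
    entities: dict = field(default_factory=dict)

def correct_spelling(query: str) -> str:
    # Toy lookup table; real systems use noisy-channel or neural correctors.
    fixes = {"cofee": "coffee", "nearme": "near me"}
    return " ".join(fixes.get(tok, tok) for tok in query.lower().split())

def classify_intent(query: str) -> str:
    # Toy rules; production intent classifiers are trained and contextual.
    if "near me" in query:
        return "local_business"
    if query.endswith("?"):
        return "question"
    return "navigational"

def extract_entities(query: str) -> dict:
    # Toy entity recognition: pull out a known category term, if any.
    categories = {"coffee", "pizza", "pharmacy"}
    found = [tok for tok in query.split() if tok in categories]
    return {"category": found[0]} if found else {}

def understand(query: str) -> Interpretation:
    # Correction runs first so downstream stages see the cleaned query.
    corrected = correct_spelling(query)
    return Interpretation(
        corrected=corrected,
        intent=classify_intent(corrected),
        entities=extract_entities(corrected),
    )
```

Even this toy version makes the point: by the time anything is ranked, `understand("cofee nearme")` has already decided the user wants a local coffee business, and ranking inherits that interpretation.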
The transfer to every AI product: If your AI system takes user input, you have a query understanding problem. Whether it's a chatbot interpreting a question, a recommendation engine interpreting browsing behavior, or a search bar on your e-commerce site — the hardest engineering is in understanding what the user actually means, not in retrieving or generating the response.
Most startups skip query understanding and jump straight to the response. That's why their AI feels "dumb" even when the underlying models are good.
The best result is the one you don't have to show
Some of the biggest wins I've seen in search aren't about improving result quality — they're about eliminating the need for the user to see results at all.
If someone searches for "Starbucks on Main Street" and there's exactly one Starbucks on Main Street in their city, the best user experience isn't showing them a list of results ranked by relevance. It's taking them directly to that Starbucks. One intent, one result, zero friction.
This requires the query understanding layer to produce not just an interpretation but a confidence score for that interpretation. When confidence is high enough, the system can bypass the results page and act directly.
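One way to wire that decision up, sketched with an invented threshold and margin (both would be tuned against real traffic, not picked by hand):

```python
NAVIGATE_THRESHOLD = 0.9  # assumed value; tune against real user outcomes

def resolve(candidates):
    """Decide between acting directly and showing a results page.
    `candidates` is a list of (place, confidence) pairs produced by a
    query understanding layer; names and numbers here are illustrative."""
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    top_place, top_conf = ranked[0]
    runner_up_conf = ranked[1][1] if len(ranked) > 1 else 0.0
    # Bypass the results page only when the top interpretation is both
    # confident in absolute terms and clearly separated from the runner-up.
    if top_conf >= NAVIGATE_THRESHOLD and top_conf - runner_up_conf >= 0.3:
        return ("navigate", top_place)
    return ("show_results", [place for place, _ in ranked])
```

The separation check matters as much as the absolute threshold: two candidates at 0.92 and 0.90 mean the query is ambiguous, and ambiguity is exactly when the user should see options.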
The transfer: In any AI product, ask yourself: "What would it look like if the system was so confident in its understanding that it could act directly, without showing the user intermediate options?" That question reshapes how you think about your confidence thresholds, your fallback behaviors, and your UX. The goal isn't always "better results." Sometimes it's "fewer results, higher confidence."
Latency is a feature, not a constraint
Building search under strict latency constraints fundamentally shaped how I think about ML architecture. Instant search — results that update as you type each character — needs to respond in single-digit milliseconds. That number isn't arbitrary: at higher latencies, the user perceives lag and the experience feels broken.
This constraint eliminates certain architectural choices. You can't use large transformer models for real-time ranking. You can't make network calls for every keystroke. You have to build systems that are fast enough to feel instant, which means on-device models, aggressive pre-computation, and very careful choices about what to compute in real time versus what to compute ahead of time.
The transfer: Every AI product has a latency budget, even if you haven't defined one. A recommendation engine that takes 3 seconds to load means users see an empty page for 3 seconds. A chatbot that takes 8 seconds to respond feels broken compared to one that starts streaming in 1 second.
Define your latency budget explicitly. Then design your ML architecture around it, not the other way around. I've seen startups build impressive models that are useless in production because nobody asked "how fast does this need to respond?" until it was too late to change the architecture.
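A latency budget is most useful when it's enforced in code rather than stated in a doc. A minimal sketch, with an illustrative budget value and an in-memory violation list standing in for a real metrics sink:

```python
import time
from contextlib import contextmanager

LATENCY_BUDGET_MS = 50.0   # illustrative per-keystroke budget, not a standard
violations = []            # overruns land here; production would emit a metric

@contextmanager
def latency_budget(label, budget_ms=LATENCY_BUDGET_MS):
    """Time the wrapped block and record a violation when it overruns."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000
        if elapsed_ms > budget_ms:
            violations.append((label, elapsed_ms))

# Usage: wrap the hot path so budget regressions surface in CI, not in prod.
with latency_budget("rank_suggestions"):
    time.sleep(0.001)  # stand-in for the real per-keystroke ranking call
```

Running this in CI turns "how fast does this need to respond?" into a failing test instead of a production incident.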
Failure modes matter more than success modes
When search works, nobody notices. When search fails, everyone notices. The asymmetry means that optimizing for the failure case is often more impactful than optimizing for the success case.
In production search systems, enormous effort goes into what happens when the system doesn't have a good answer. Spelling correction for misspelled queries. Graceful degradation when the network is unavailable. Fallback strategies when the ML model returns low-confidence results. The "no results" page — which seems like a simple edge case but is actually the moment where you've most completely failed the user.
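That fallback logic can be made explicit as an ordered chain. A sketch under assumptions: the function names, the confidence cutoff, and the state labels are all invented, and the retrievers are injected as callables so each layer can be swapped out:

```python
MIN_CONFIDENCE = 0.5  # assumed cutoff for trusting the ML ranker

def search_with_fallbacks(query, ml_ranker, keyword_index, spell_corrector):
    """Try the best strategy first and degrade gracefully, so that
    'no results' is an explicit, designed state rather than an accident."""
    results, confidence = ml_ranker(query)
    if results and confidence >= MIN_CONFIDENCE:
        return {"state": "ok", "results": results}

    # Fallback 1: plain keyword retrieval when the model is unsure.
    results = keyword_index(query)
    if results:
        return {"state": "degraded", "results": results}

    # Fallback 2: retry retrieval with a spell-corrected query.
    corrected = spell_corrector(query)
    if corrected != query:
        results = keyword_index(corrected)
        if results:
            return {"state": "did_you_mean",
                    "query": corrected,
                    "results": results}

    # Terminal state: design this screen deliberately, don't leave it blank.
    return {"state": "no_results", "suggestions": ["broaden the query"]}
```

The payoff is that every degraded state has a name the UI can render deliberately, instead of one generic empty page covering all of them.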
The transfer: Your AI product's failure mode is your actual product. What happens when the model is wrong? What happens when it's uncertain? What happens when the input is garbage? If you haven't designed those experiences with the same care as the success path, your users will encounter them and judge your entire product by them.
I always tell the startups I work with: show me your error states, your fallback behaviors, your "I don't know" responses. That's where the real engineering quality shows.
Evaluation is harder than building
Building a search system is hard. Knowing whether it's good is harder. "Relevance" is subjective, context-dependent, and changes over time. A result that's perfect at 8 AM (coffee shop for the morning commute) is wrong at 8 PM (user probably wants a restaurant now, not coffee).
High-quality search teams build extensive evaluation frameworks — offline metrics, online A/B testing, human evaluation pipelines, regression detection systems. In my experience, the best teams spend as much engineering effort on measuring quality as they do on improving it.
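Even a small offline harness beats no measurement. As one example among many possible offline metrics, mean reciprocal rank over a labeled query set (the data in the usage below is invented):

```python
def mean_reciprocal_rank(ranked_results, relevant):
    """MRR: average over queries of 1/rank of the first relevant result.
    `ranked_results` maps query -> ordered list of result ids;
    `relevant` maps query -> set of relevant result ids."""
    total = 0.0
    for query, results in ranked_results.items():
        reciprocal_rank = 0.0
        for rank, doc_id in enumerate(results, start=1):
            if doc_id in relevant.get(query, set()):
                reciprocal_rank = 1.0 / rank
                break  # only the first relevant hit counts
        total += reciprocal_rank
    return total / len(ranked_results)

# Usage: one query finds its answer at rank 2, the other at rank 1.
runs = {"coffee near me": ["d3", "d1", "d2"], "starbucks main st": ["d9"]}
labels = {"coffee near me": {"d1"}, "starbucks main st": {"d9"}}
score = mean_reciprocal_rank(runs, labels)  # (1/2 + 1/1) / 2 = 0.75
```

MRR is a stand-in here, not a recommendation: the point of the section stands, which is that the hard work is choosing the metric that actually tracks user satisfaction for your product, then wiring it into regression detection.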
The transfer: If you can't measure whether your AI is good, you can't improve it. And "accuracy" is almost never the right metric. What's the metric that actually correlates with user satisfaction for your specific product? Defining that metric is one of the highest-leverage things you can do, and it's the thing most startups skip because it's not as exciting as building the model.
The compound effect
None of these lessons is revolutionary in isolation. But together, they compound into a fundamentally different approach to building AI products:
Start with understanding the user's intent, not the model's output. Design for the latency budget your product demands. Invest disproportionately in failure modes. And measure relentlessly.
Every AI system I build today — for startups, for enterprises, for my own products — starts with these principles. They came from years of building search at massive scale, but they apply to any system where a user asks for something and your AI tries to deliver.
Want help with your AI stack?
If this post matches problems you're seeing, we can map the fastest path from architecture decisions to production outcomes.
Talk to Manmeet
Manmeet Singh
Founder & CEO, AIshar Labs · Ex-Apple, Ex-Instacart · 15 AI Patents
Built ML systems at Apple (Search: Maps, Safari, Spotlight) and Instacart (Search, Recommendations, Ranking). Writes about production AI tradeoffs and system design.