Team project, NUST DSA coursework · 2024 – 2025

Academic Search Engine: Trie-indexed, BM25-ranked

A from-scratch search engine over 5,000 research papers. Custom Trie for autocomplete, custom inverted index for retrieval, BM25 with title-boost for ranking. Sub-100 ms query latency.

C++17
CMake 3.10+
httplib
nlohmann/json
Python (PyMuPDF)
React 19 + Vite
Azure VM

The problem

Off-the-shelf search engines hide the DSA that powers them behind black-box APIs. The goal here was to go the other way and build every layer ourselves. Tokenizer, indexer, ranker, server. Every decision in a production retrieval system comes from choices we made, not from an SDK we imported. The corpus is research papers pulled live from OpenAlex; the product is a search UI that returns results in under 100 ms.

What I built

A three-tier architecture with deliberately strict boundaries. Python handles PDF parsing, data ingestion, and tokenization. A C++17 HTTP server handles search and ranking only. A React 19 frontend built with Vite does UI. Each tier has one job, documented in the repo's ARCHITECTURE.md.

The heart of the backend is three custom structures. A Trie that powers autocomplete in under 50 ms. An inverted index keyed by term to posting list for fast retrieval. A JSON-based lexicon mapping terms to index positions. A forward index alongside stores document content for snippet extraction. Ranking uses BM25 with a title boost so matches in titles outrank matches in body text.

The ingestion pipeline runs four sequential stages. Async PDF download from the OpenAlex API. Document renumbering to contiguous IDs. Parsing and tokenization. Metadata extraction for publication dates, citations, and keywords. The result is a ~500 MB on-disk index over 5,000 papers that the C++ server memory-maps and serves alongside the built React SPA on a single port.

[ architecture diagram · ase-arch ]

Pipeline: corpus, tokenizer, forward + inverted indexes, Trie autocomplete, all served by a C++ HTTP server.

Key decisions

Custom Trie over a prefix-hash map
alternative · std::unordered_map<string, vector<string>> of prefixes
A real Trie gives autocomplete suggestions in sorted order for free via in-order traversal, and scales memory better as term count grows past a few thousand. The unordered_map approach balloons once prefixes repeat.
Three structures, not one: forward, inverted, lexicon
alternative · Inverted index only
Inverted alone answers 'which docs contain this term'. The forward index lets us pull document content for snippets without a second round-trip. The lexicon maps terms to positions without scanning sequentially. Every query touches all three, and the overhead is worth the sub-100 ms latency.
BM25 with title boost as the ranker
alternative · Classic TF-IDF scoring
BM25's length normalization and saturation term dominate TF-IDF on academic corpora where abstract length varies wildly. Title-boost captures the human intuition that a matching title is a stronger signal than a matching body paragraph.
Python for ingestion, C++ for serving
alternative · Pure C++ end-to-end
Python's PDF and HTTP libraries (PyMuPDF, requests) made ingestion a two-day job. Rewriting them in C++ would have taken two weeks. The cost is one language boundary, acceptable for a once-per-corpus ingest.
Single-port unified deploy
alternative · Separate API and static hosts
The build copies the React bundle to backend/static/ and the C++ server serves both the API and the SPA on port 8080. One process, one hostname, zero CORS config. Deployment to the Azure VM becomes a single binary and a static directory.

Outcomes

Documents indexed
5,000
Research papers fetched live from OpenAlex
Query latency
< 100 ms
On most queries, measured on the Azure VM
Autocomplete
< 50 ms
Trie walk on warm memory
Index footprint
~500 MB
Forward + inverted + lexicon, ~200 MB runtime memory

The problem

What I built

Key decisions

Outcomes

Links