Skip to content
Back to work

Team project, NUST DSA coursework · 2024 – 2025

Academic Search Engine: Trie-indexed, BM25-ranked

A from-scratch search engine over 5,000 research papers. Custom Trie for autocomplete, custom inverted index for retrieval, BM25 with title-boost for ranking. Sub-100 ms query latency.

  • C++17
  • CMake 3.10+
  • httplib
  • nlohmann/json
  • Python (PyMuPDF)
  • React 19 + Vite
  • Azure VM

The problem

Off-the-shelf search engines hide the DSA that powers them behind black-box APIs. The goal here was to go the other way and build every layer ourselves. Tokenizer, indexer, ranker, server. Every decision in a production retrieval system comes from choices we made, not from an SDK we imported. The corpus is research papers pulled live from OpenAlex; the product is a search UI that returns results in under 100 ms.

What I built

A three-tier architecture with deliberately strict boundaries. Python handles PDF parsing, data ingestion, and tokenization. A C++17 HTTP server handles search and ranking only. A React 19 frontend built with Vite does UI. Each tier has one job, documented in the repo's ARCHITECTURE.md.

The heart of the backend is three custom structures. A Trie that powers autocomplete in under 50 ms. An inverted index keyed by term to posting list for fast retrieval. A JSON-based lexicon mapping terms to index positions. A forward index alongside stores document content for snippet extraction. Ranking uses BM25 with a title boost so matches in titles outrank matches in body text.

The ingestion pipeline runs four sequential stages. Async PDF download from the OpenAlex API. Document renumbering to contiguous IDs. Parsing and tokenization. Metadata extraction for publication dates, citations, and keywords. The result is a ~500 MB on-disk index over 5,000 papers that the C++ server memory-maps and serves alongside the built React SPA on a single port.

Pipeline: corpus, tokenizer, forward + inverted indexes, Trie autocomplete, all served by a C++ HTTP server.

Key decisions

  1. Custom Trie over a prefix-hash map

    alternative · std::unordered_map<string, vector<string>> of prefixes

    A real Trie gives autocomplete suggestions in sorted order for free via in-order traversal, and scales memory better as term count grows past a few thousand. The unordered_map approach balloons once prefixes repeat.

  2. Three structures, not one: forward, inverted, lexicon

    alternative · Inverted index only

    Inverted alone answers 'which docs contain this term'. The forward index lets us pull document content for snippets without a second round-trip. The lexicon maps terms to positions without scanning sequentially. Every query touches all three, and the overhead is worth the sub-100 ms latency.

  3. BM25 with title boost as the ranker

    alternative · Classic TF-IDF scoring

    BM25's length normalization and saturation term dominate TF-IDF on academic corpora where abstract length varies wildly. Title-boost captures the human intuition that a matching title is a stronger signal than a matching body paragraph.

  4. Python for ingestion, C++ for serving

    alternative · Pure C++ end-to-end

    Python's PDF and HTTP libraries (PyMuPDF, requests) made ingestion a two-day job. Rewriting them in C++ would have taken two weeks. The cost is one language boundary, acceptable for a once-per-corpus ingest.

  5. Single-port unified deploy

    alternative · Separate API and static hosts

    The build copies the React bundle to backend/static/ and the C++ server serves both the API and the SPA on port 8080. One process, one hostname, zero CORS config. Deployment to the Azure VM becomes a single binary and a static directory.

Outcomes

  • Documents indexed

    5,000

    Research papers fetched live from OpenAlex

  • Query latency

    < 100 ms

    On most queries, measured on the Azure VM

  • Autocomplete

    < 50 ms

    Trie walk on warm memory

  • Index footprint

    ~500 MB

    Forward + inverted + lexicon, ~200 MB runtime memory

Links