Seekr - A basic search engine in Rust
Building a search engine from scratch to understand how search engines work under the hood
Introduction
One day I was wondering how a search engine works. After looking around the web for a bit, I decided to build my own from scratch. In this blog post, we'll go through the process of figuring out how search engines work on a basic level and implementing one from scratch in Rust.
To get started, I chose the Rust language documentation as the corpus. You can use any other set of documents; the corpus is simply the collection of documents on which the search engine operates.
Parsing the Documents
Before we start searching through our documents, we need to preprocess them. This involves three main parts:
- Extracting all the relevant text from the documents
- Breaking the text into searchable chunks (tokenization)
- Reducing words to their root form (stemming)
Our final goal in the pre-processing stage is to prepare the documents so that we can create a frequency map of all the words present in each document. This frequency map will be instrumental when we implement document ranking based on relevance using TF-IDF (Term Frequency-Inverse Document Frequency), which we'll discuss in detail later in this blog post.
Extraction
To parse the documents, I used a lightweight Rust library called nanohtml2text. This is a simple crate with one primary function: it takes the contents of an HTML file as input and returns the text content of the page, with all HTML tags removed.
```rust
use nanohtml2text::html2text;
use std::fs::read_to_string;
use std::io::Result; // any Result alias that can carry an io::Error works here
use std::path::Path;

// Read an HTML file from disk and return its text content with all tags stripped.
fn read_html_file<P: AsRef<Path>>(file_path: P) -> Result<String> {
    let file_content = read_to_string(file_path)?;
    Ok(html2text(&file_content))
}
```
- Inside this function, we first read the entire contents of the file into a String using read_to_string.
- We then pass this content to the html2text function, which strips away all HTML tags and returns the plain text.
To demonstrate the functionality of html2text, let's consider a simple HTML document:
```html
<html>
  <body>
    <h1>Welcome to My Blog</h1>
    <p>This is a <strong>sample</strong> paragraph with some <em>formatting</em>.</p>
    <ul>
      <li>Item 1</li>
      <li>Item 2</li>
      <li>Item 3</li>
    </ul>
  </body>
</html>
```
When we pass this HTML through the html2text function, it strips away all the HTML tags, resulting in clean, plain text:
```text
Welcome to My Blog
This is a sample paragraph with some formatting.
* Item 1
* Item 2
* Item 3
```
By using this library, we can easily convert our HTML documents into plain text, which is much easier to process and search through. This step is crucial as it allows us to focus on the actual content of the documents without worrying about HTML structure or formatting.
Tokenization
Now that we have extracted the text content from our documents, the next step is to break this text into searchable chunks, or tokens. This process is called lexing, and we'll implement a lexer to accomplish this task.
Our lexer will be a struct that takes the text on creation and implements the Iterator trait. Each iteration will return a [char] slice, where each slice represents a token.
Our lexer follows these rules (a sketch of such a lexer follows the list):
- Consecutive alphabetic characters form a single token
- Consecutive numeric characters form a single token
- Special characters are treated as individual tokens
- Whitespace is ignored
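Following these rules, a minimal lexer might look like the sketch below. The names (Lexer, chop) and the exact boundary handling are illustrative; the real implementation may differ slightly.

```rust
/// A minimal lexer sketch. It borrows a slice of chars and hands out
/// sub-slices, one token per call to `next`.
struct Lexer<'a> {
    content: &'a [char],
}

impl<'a> Lexer<'a> {
    fn new(content: &'a [char]) -> Self {
        Self { content }
    }

    /// Split off the first `n` characters as a token.
    fn chop(&mut self, n: usize) -> &'a [char] {
        let token = &self.content[..n];
        self.content = &self.content[n..];
        token
    }
}

impl<'a> Iterator for Lexer<'a> {
    type Item = &'a [char];

    fn next(&mut self) -> Option<Self::Item> {
        // Whitespace is ignored.
        while let Some(c) = self.content.first() {
            if c.is_whitespace() {
                self.content = &self.content[1..];
            } else {
                break;
            }
        }
        let first = *self.content.first()?;

        // Consecutive alphabetic characters form a single token.
        if first.is_alphabetic() {
            let n = self.content.iter().take_while(|c| c.is_alphabetic()).count();
            return Some(self.chop(n));
        }
        // Consecutive numeric characters form a single token.
        if first.is_numeric() {
            let n = self.content.iter().take_while(|c| c.is_numeric()).count();
            return Some(self.chop(n));
        }
        // Every other (special) character is a token on its own.
        Some(self.chop(1))
    }
}
```

A caller would collect the document text into a Vec<char> and then simply iterate over Lexer::new(&chars) to get the tokens one by one.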
Let’s see how our lexer works with an example:
```text
AI-powered 3D printing in 2024: 100% eco-friendly!
```
When passed through our lexer, the text gets converted into:
```text
AI
-
powered
3D
printing
in
2024
100
%
eco
-
friendly
!
```
By implementing this lexer, we’ve created a tool that breaks our text into tokens, making it much easier to index and search through our documents. This step is crucial in building our search engine, as it determines how our text will be processed and matched against search queries.
However, we’ve encountered a new challenge: what about words like “forest” and “forests”? They would be considered different tokens, but they mean essentially the same thing. How can we tackle this situation? To solve this problem, we’ll use a technique called stemming.
Stemming
Stemming is the process of reducing words to their root or base form. This step generalizes our search capabilities and can significantly improve search results. For our implementation, we’ll use the Snowball stemming algorithm, which is available through the rust_stemmers crate.
- Related words are stemmed to the same word root:

```text
'cat'     -> 'cat'
'cats'    -> 'cat'
'play'    -> 'play'
'playing' -> 'play'
'played'  -> 'play'
```
- Similarly, unrelated words are stemmed to different roots:

```text
'fire'      -> 'fire'
'fireplace' -> 'fireplac'
```
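Producing these stems with the rust_stemmers crate takes only a few lines. Here is a small, self-contained sketch (the integration with the lexer is left out):

```rust
use rust_stemmers::{Algorithm, Stemmer};

fn main() {
    // Create an English Snowball stemmer once and reuse it for every token.
    let stemmer = Stemmer::create(Algorithm::English);

    for word in ["cats", "playing", "played", "fireplace"] {
        // `stem` takes a lowercased word and returns its root form.
        println!("{} -> {}", word, stemmer.stem(word));
    }
}
```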
Here’s why stemming is important for our search engine:
- It improves search generalization: Words like “play”, “playing”, and “played” all reduce to the same root “play”, allowing our search engine to match related words more effectively.
- It reduces the number of unique tokens: By converting words to their root form, we decrease the total number of individual tokens we need to index, which can improve both storage efficiency and search speed.
- It maintains meaning: Stemming typically preserves the core meaning of words, ensuring that our search results remain relevant.
Stemming doesn't account for the context of a word. For example, "saw" (as in to see) and "saw" (as in the tool) would stem to the same root, though this isn't a significant issue for our use case.
Indexing
When you perform a search, the engine doesn’t sift through every document in real-time; instead, it consults an index—a specialized data structure that acts like a highly optimized map, pointing directly to where the relevant information is stored. We will generate an index of the data in this section of the blog.
Creating the Term Frequency Map
Now that we have the text of our documents tokenized and stemmed, we can move on to the next crucial part of the process: building a frequency map of all the words in each document. This frequency map will form the basis of our search index.
We'll start by defining a custom type called TermFrequency, which is essentially a HashMap<String, usize>. This will store the frequency of each term in a document.
```rust
type TermFrequency = HashMap<String, usize>;
```
To build this frequency map, we iterate through the tokens in the document and increment the frequency by one if the token is already present, or insert the token into the hashmap if it’s new.
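In code, that loop might look like this sketch, which assumes the tokens have already been lexed and stemmed into Strings:

```rust
use std::collections::HashMap;

type TermFrequency = HashMap<String, usize>;

/// Count how often each (already stemmed) token occurs in a document.
fn term_frequency(tokens: impl IntoIterator<Item = String>) -> TermFrequency {
    let mut tf = TermFrequency::new();
    for token in tokens {
        // Insert the token with a count of 0 if it's new, then bump the count.
        *tf.entry(token).or_insert(0) += 1;
    }
    tf
}
```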
Building the Index
To handle multiple documents, we’ll define a new type called Index. This is a hashmap that maps file paths to their corresponding TermFrequency maps:
```rust
type Index = HashMap<PathBuf, TermFrequency>;
```
With this index, we now have a comprehensive map of token frequencies across all our documents. This structure will be instrumental in implementing our search functionality and ranking algorithm.
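Populating the index is then a matter of walking the corpus folder and running the pipeline from the previous sections on every file. Here's a rough sketch, with the per-document work passed in as a closure so the example stays self-contained:

```rust
use std::collections::HashMap;
use std::fs::read_dir;
use std::io;
use std::path::PathBuf;

type TermFrequency = HashMap<String, usize>;
type Index = HashMap<PathBuf, TermFrequency>;

/// Index every file in `folder`. `index_document` stands in for the
/// extract -> tokenize -> stem -> count pipeline described earlier.
fn index_folder(
    folder: &str,
    index_document: impl Fn(&PathBuf) -> io::Result<TermFrequency>,
) -> io::Result<Index> {
    let mut index = Index::new();
    for entry in read_dir(folder)? {
        let path = entry?.path();
        // Only index regular files; subdirectories could be recursed into here.
        if path.is_file() {
            index.insert(path.clone(), index_document(&path)?);
        }
    }
    Ok(index)
}
```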
Serializing and Deserializing
As we’ve discovered, the indexing process can be time-consuming, especially for large document collections. It’s inefficient to re-index all files every time we want to perform a search. A better approach is to store our index on disk for later use. This is where serialization comes in handy.
Serializing the Index
We'll use the serde_json crate to serialize our index map into a JSON file and save it on disk.
```rust
use std::fs::File;
use std::path::PathBuf;

fn serialize_index(index: &Index, index_path: &PathBuf) {
    println!("Serializing index to {:?}", index_path);
    let index_file = File::create(index_path).expect("Could not create index file");
    serde_json::to_writer(index_file, index).expect("Could not serialize index file");
}
```
This function converts our Index into a serializable format and writes it to a JSON file.
Deserializing the Index
When we need to use the index, we can deserialize it from the JSON file:
```rust
fn deserialize_index(index_path: &PathBuf) -> Index {
    println!("Deserializing index from {:?}", index_path);
    let index_file = File::open(index_path).expect("Could not open index file");
    serde_json::from_reader(index_file).expect("Could not deserialize index file")
}
```
This function reads the JSON file and deserializes it back into our Index type.
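Putting the two together, the program's startup logic might look something like the following sketch. The helper name load_or_build_index and the build_index function are hypothetical stand-ins for the indexing pipeline described earlier:

```rust
use std::path::PathBuf;

// Hypothetical startup logic: reuse a saved index if one exists, otherwise
// build it from scratch and persist it for next time.
fn load_or_build_index(index_path: &PathBuf, corpus_folder: &str) -> Index {
    if index_path.exists() {
        deserialize_index(index_path)
    } else {
        let index = build_index(corpus_folder); // stand-in for the indexing pipeline
        serialize_index(&index, index_path);
        index
    }
}
```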
Enhancing the Index
To further optimize our search engine, we can enhance the structure of our index. Instead of just storing the term frequency maps, we’ll also include additional metadata for each document:
```rust
// Type aliases for the two maps that make up the enhanced index
// (they match the descriptions below).
type TermFrequencyPerDoc = HashMap<PathBuf, Doc>;
type DocFrequency = HashMap<String, usize>;

struct Doc {
    term_frequency: TermFrequency,
    count: usize,
    last_modified: SystemTime,
}

pub struct Index {
    tfd: TermFrequencyPerDoc,
    df: DocFrequency,
}
```
Here’s what each part of the index structure represents:
- TermFrequencyPerDoc (tfd): a hashmap that maps file paths to the Doc struct, which contains the term frequency map, the total number of terms in the document, and the last modified timestamp.
- DocFrequency (df): a hashmap that stores the document frequency, i.e. the number of documents a term appears in. This information will be useful for our ranking algorithm later (see the sketch after this list).
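While indexing, the df map gets bumped once per unique term in each newly indexed document, not once per occurrence. A small sketch of that step, reusing the type aliases from above:

```rust
// After computing a document's term frequency map, record that this document
// contains each of those terms exactly once in the document-frequency map.
fn update_doc_frequency(df: &mut DocFrequency, term_frequency: &TermFrequency) {
    for term in term_frequency.keys() {
        *df.entry(term.clone()).or_insert(0) += 1;
    }
}
```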
With this enhanced index structure, the serialized index would look something like this:
```json
{
  "tfd": {
    "/path/to/document1.html": {
      "term_frequency": {
        "TERM1": 2,
        "TERM2": 1,
        "..."
      },
      "count": 3,
      "last_modified": "2023-08-21T12:34:56Z"
    },
    "..."
  },
  "df": {
    "TERM1": 1,
    "..."
  }
}
```
By including additional metadata in our index, we can optimize several aspects of our search engine.
Optimizing Document Counts
One key benefit of storing the document count in the Doc struct is that it eliminates the need to calculate the count by iterating over the entire term frequency map. This pre-computed value can be accessed directly, improving the performance of various operations in our search engine.
Incremental Indexing
The last_modified timestamp allows for efficient incremental indexing. When updating the index, we can compare each file's last modified time against the last indexing date. This way, we only need to re-index documents that have been changed or added since the last update.
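A sketch of that check, using the file's metadata on disk (the helper name is illustrative):

```rust
use std::path::Path;
use std::time::SystemTime;

/// Returns true if the file on disk is newer than the timestamp stored in the
/// index, meaning the document has to be re-indexed.
fn needs_reindexing(path: &Path, indexed_last_modified: SystemTime) -> bool {
    match path.metadata().and_then(|m| m.modified()) {
        Ok(modified_on_disk) => modified_on_disk > indexed_last_modified,
        // If the metadata can't be read, play it safe and re-index.
        Err(_) => true,
    }
}
```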
With this enhanced index structure, we’re ready to implement the search functionality and ranking algorithm.
Ranking Pages
Now that we have our comprehensive index structure in place, it’s time to put it to use and implement the search functionality. The key to an effective search engine is not just finding the relevant documents, but ranking them in a way that presents the most useful results to the user.
Understanding TF-IDF
For our search engine, we’ll be using the well-known TF-IDF (Term Frequency-Inverse Document Frequency) algorithm to score and rank the documents. TF-IDF is a widely-used approach that determines the relevance of a document to a given search query. It considers both the frequency of the query terms within the document and how unique those terms are across the entire document collection.
This approach is based on the principle that the relevance of a document to a search query is directly proportional to how frequently the query terms appear in that document and inversely proportional to how common those terms are across the entire document collection.
If a term is frequent in all documents, it is considered noise and is effectively filtered out, as it does not contribute meaningfully to distinguishing one document from another.
Term Frequency (TF): This represents the frequency of a term within a document, normalized by the total number of terms in that document.
The term frequency (TF) of a term $t$ in a document $d$ is defined as:
\[\text{TF}(t, d) = \frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total number of terms in document } d}\]
The more often a term appears in a document, the higher its term frequency.
Inverse Document Frequency (IDF): This represents the inverse of the number of documents containing a given term. Terms that appear in many documents are less informative for ranking purposes than terms that appear in only a few documents.
The inverse document frequency (IDF) of a term $t$ is defined as:
\[\text{IDF}(t, D) = \log \left(\frac{N}{|\{d \in D : t \in d\}|}\right)\]
where:
- $N$ is the total number of documents in the corpus $D$.
- $|\{d \in D : t \in d\}|$ is the number of documents in which the term $t$ appears (i.e., $\text{DF}(t)$).
The TF-IDF score for a term $ t $ in a document $ d $ is calculated as:
\[\text{TF-IDF}(t, d, D) = \text{TF}(t, d) \times \text{IDF}(t, D)\]
This combines the local relevance of the term within the document (TF) with its global importance across the corpus (IDF).
By combining the term frequency and inverse document frequency, the TF-IDF score provides a way to identify documents that are both relevant to the search query and unique or distinctive within the collection.
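Here's a sketch of how these formulas might translate into code against the enhanced index from earlier. The types are trimmed down to the fields needed here, and the function names are illustrative:

```rust
use std::collections::HashMap;
use std::path::PathBuf;

type TermFrequency = HashMap<String, usize>;
type DocFrequency = HashMap<String, usize>;

struct Doc {
    term_frequency: TermFrequency,
    count: usize,
}

struct Index {
    tfd: HashMap<PathBuf, Doc>,
    df: DocFrequency,
}

/// TF: occurrences of `term` in the document, normalized by document length.
fn tf(term: &str, doc: &Doc) -> f32 {
    let occurrences = *doc.term_frequency.get(term).unwrap_or(&0) as f32;
    occurrences / doc.count.max(1) as f32
}

/// IDF: log of (total documents / documents containing `term`).
fn idf(term: &str, index: &Index) -> f32 {
    let total_docs = index.tfd.len().max(1) as f32;
    let docs_with_term = *index.df.get(term).unwrap_or(&1) as f32;
    (total_docs / docs_with_term).log10()
}

/// Score a document against a tokenized, stemmed query by summing TF-IDF.
fn score(query_terms: &[String], doc: &Doc, index: &Index) -> f32 {
    query_terms.iter().map(|t| tf(t, doc) * idf(t, index)).sum()
}
```

The base of the logarithm only scales every score by a constant factor, so it doesn't affect the relative ranking of documents.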
Multi-threading
Knowing that the indexing process can often be a performance bottleneck, we'll leverage Rust's built-in multi-threading capabilities to optimize our search engine's architecture.
Instead of running everything on the main thread, we spawn a separate thread dedicated to the indexing process. This allows the indexing to run in the background, while the main thread is free to serve the web-based user interface.
Here’s how we implement this multi-threaded approach:
```rust
let index = Arc::new(Mutex::new(Index::new()));

// Spawn a new thread to handle the indexing process
{
    let index = Arc::clone(&index);
    thread::spawn(move || {
        let index = Arc::clone(&index);
        process_folder(read_dir(folder_name).unwrap(), &index)
            .map_err(|e| {
                eprintln!("ERROR: {}", e);
                std::process::exit(1);
            })
            .unwrap();
        let index = index.lock().unwrap();
        serialize_index(&index, &index_path);
    });
}

// Start the server on the main thread
start_server(Arc::clone(&index), "127.0.0.1:8000".to_string());
```
In this implementation, we create an Arc<Mutex<Index>> to share the index data structure between the main thread and the indexing thread. The main thread spawns a new thread to handle the indexing process, which runs in the background.
While the indexing is happening, the main thread is free to start the web server and serve the user interface. As the indexing progresses, the index is updated and the search results are automatically refreshed.
To ensure safe concurrent access to the index, we use the Mutex to protect the index data structure. This allows multiple threads to access and update the index without race conditions or other synchronization issues.
Multithreading lets us significantly improve the overall performance and responsiveness of our search engine. The indexing process no longer blocks the main thread, so users can start searching and interacting with the system almost immediately, even as the index continues to be built in the background.
Now that the backend of our search engine is complete, it’s time to focus on the user-facing web interface. Our goal is to create a simple, yet intuitive and responsive web application that allows users to easily search through the document corpus and view the results.
UI
To provide a user-friendly interface for our search engine, we’ll be using the tinyhttp crate to serve a simple front-end website. This will give our users a clean and intuitive way to interact with our search capabilities.
The front-end will consist of a simple HTML, CSS, and JavaScript-based webpage. The core functionality will be a search input box and a submit button. When the user submits a query, the JavaScript code will make a POST request to the backend, passing the search terms.
On the backend, we'll use tinyhttp to listen for these POST requests. When a request is received, we process the search query, calculate the TF-IDF scores, and return the top 10 most relevant results in JSON format.
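Stripped of the HTTP plumbing, the ranking step on the backend boils down to sorting the per-document scores and keeping the best ten. A hypothetical helper:

```rust
use std::path::PathBuf;

/// Keep the 10 most relevant results from the per-document TF-IDF scores.
fn top_results(mut scored: Vec<(PathBuf, f32)>) -> Vec<(PathBuf, f32)> {
    // Sort by descending score; treat incomparable (NaN) scores as equal.
    scored.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap_or(std::cmp::Ordering::Equal));
    scored.truncate(10);
    scored
}
```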
The front-end will then take the JSON response and render the search results on the page, displaying the file paths and their corresponding relevance scores.
Conclusion
In this blog post, we’ve gone through the process of building a search engine from scratch using the Rust programming language. We started by understanding the basic components of a search engine and then implemented each of them step-by-step:
- Document Preprocessing: We extracted the text content from HTML documents, tokenized the text, and performed stemming to normalize the words.
- Index Construction: We built an efficient index structure that maps document paths to their term frequencies and document frequencies.
- Ranking with TF-IDF: We implemented the TF-IDF algorithm to calculate relevance scores and rank the search results.
- Concurrent Architecture: We leveraged Rust’s multi-threading capabilities to run the indexing process in the background while serving the web interface.
- Web Interface: We built a simple yet responsive web page to allow users to search and view the results.
Working on this project has been an exciting and educational experience. Along the way, we've learned about some of Rust's strengths and deepened our understanding of search engine fundamentals.
I hope that this blog post has inspired you to explore the world of search engines and consider using Rust for your own projects. Happy coding!