Published April 2nd, 2025
So, I'm building a journal app called Jadebook and I ran into a bit of an issue. It has search functionality because you obviously want to be able to find things within your entries. But the problem was encryption. The whole point of encryption is to make things harder to find; it directly counters the idea of search.
So, after a bunch of research, I wrote down a couple of solutions to the problem. First, I wanted to figure out how the industry handles this. Apps like Notion were the focus since Jadebook works in a similar way.
Notion is inherently difficult to "hack" since all the data is held in blocks. Thus, when you pull data, it's more like a collection of IDs rather than plain text content. However, Jadebook doesn't have a billion dollars, and Notion's way of storing data is extremely difficult to scale.
Overlooking the infrastructure differences, from the knowledge I gathered, Notion encrypts all the data using specific encryption keys. When you send a search request, Notion encrypts the search query in the same way the content is encrypted. This is fairly common and is often called searchable encryption. There are a couple of ways to pull this off, but it's not as secure.
It should be noted that while encryption is great, a key component of security is making sure the data is never accessed in an unauthorised way to begin with. Using Jadebook as an example, everything is encrypted and the encryption/decryption process takes place on the server, meaning the keys, the data and the process are never public. Even to trigger the actions, an auth check, a rate limit check and a data check using Zod occur to make sure there is no malicious intent.
The reason I mention that is because there are instances where it's okay to have a lower level of encryption. There is something called SOC 2 compliance, and while Jadebook doesn't have it (because you need an agency to audit you), it checks the whole infrastructure rather than just the data for a reason.
I had a couple of ways to approach the idea. The first was keywords and summaries: rather than searching the content, what if we extract some common keywords and then run the search on those instead? Plus, Jadebook uses AI in a couple of places, and one of the features is that it generates AI summaries for entries. So, we have two things we can search, but if you write 800–1000 words, the summary and keywords just aren't enough for a good experience.
The second idea I had was to decrease the encryption. Jadebook uses AES-256 encryption and, to make things more difficult, every entry is encrypted in a unique way. So, what if we just decreased the encryption? Well, I came across something called homomorphic encryption, which allows computations on encrypted data without decrypting it. However, it seemed extremely complex and appeared to have serious performance implications.
There was the idea of client-side encryption, but that would be absolutely terrible for performance, so it got scrapped real quick.
Imagine you search for "Apple" and the website pulled every single journal entry, decrypted it, and then searched through it for the word "Apple". That would scale horribly and the performance would suck.
What if we don't search the content at all? This is where semantic search comes in. Traditionally, you'd be searching for a keyword, but semantic search looks for the intent behind a search. It'll make more sense when we look at how semantic search works.
Semantic search focuses on understanding the meaning and intent behind a query rather than just matching keywords. It uses techniques like natural language processing (NLP) and machine learning to analyze context and relationships between words, delivering more relevant results. For example, searching for "affordable smartphones with good cameras" would prioritize results that align with the intent, even if exact keywords are absent.
This is way better than traditional search because when users search their content, they are usually trying to describe something, since it's rare that you remember the exact words you wrote.
Vectors are mathematical representations of data in a multi-dimensional space. In semantic search, text (like words or sentences) is converted into vectors, where similar meanings are placed closer together in this space. For instance, "dog" and "canine" would have vectors near each other because they share similar meanings.
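To make "closer together" concrete, here's a minimal sketch (my own illustration, not Jadebook's actual code) of cosine similarity, the measure typically used to compare embedding vectors:

```typescript
// Cosine similarity: 1 means same direction, 0 means unrelated,
// -1 means opposite. Embedding models place similar meanings at
// similar angles, so a higher score means a closer match.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Toy 3-dimensional "embeddings" (real models output hundreds of dimensions).
const dog = [0.9, 0.1, 0.0];
const canine = [0.85, 0.15, 0.05];
const banana = [0.0, 0.2, 0.95];

// "dog" scores much closer to "canine" than to "banana".
console.log(cosineSimilarity(dog, canine) > cosineSimilarity(dog, banana)); // true
```

A semantic search is then just "embed the query, rank stored vectors by this score".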
This works great because my database, Supabase (based on PostgreSQL), handles vectors natively. You create a column specifically for storing vectors, and the database will optimise the retrieval and storage for that column.
So, we know semantic search uses intent, and we know that the intent is stored as a vector, but how do we turn text content into vectors? Well, we bring in AI again. There are a couple of AI models specifically made to generate embeddings (which are vector representations of the text).
There are options from Google, OpenAI and a couple more on Hugging Face too. A thing to note about embeddings is how many dimensions they have. The more dimensions, the more accurate the search results are, but it comes with decreased performance and increased storage.
Embeddings aren't too expensive; OpenAI has a model that costs $0.02 per million tokens, which is pretty cheap for AI.
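As a rough back-of-the-envelope check (the ~1.3 tokens-per-word ratio is a common rule of thumb for English, not an official figure):

```typescript
// Estimate embedding cost at $0.02 per million tokens.
// tokensPerWord is an approximation; real token counts vary by text.
const pricePerMillionTokens = 0.02; // USD
const tokensPerWord = 1.3;

function embeddingCost(words: number): number {
  const tokens = words * tokensPerWord;
  return (tokens / 1_000_000) * pricePerMillionTokens;
}

// A long 1000-word entry costs roughly $0.000026 to embed,
// so even tens of thousands of entries stay under a dollar.
console.log(embeddingCost(1000));
```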
Supabase has been on the AI and vector game. For starters, they have a local model you can use called gte-small, which generates 384-dimensional embeddings (for comparison, OpenAI's small model generates 1536 dimensions). Then you can simply create an edge function like the following:
// Setup type definitions for built-in Supabase Runtime APIs
import "jsr:@supabase/functions-js/edge-runtime.d.ts";

const session = new Supabase.ai.Session("gte-small");

Deno.serve(async (req) => {
  try {
    // Extract input string from JSON body
    const json = await req.json();

    // Generate the embedding from the user input
    const embedding = await session.run(json.input, {
      mean_pool: true,
      normalize: true,
    });

    // Return the embedding
    return new Response(
      JSON.stringify({ embedding }),
      { headers: { "Content-Type": "application/json" } },
    );
  } catch (error) {
    console.error("Error generating embedding:", error);

    return new Response(
      JSON.stringify({ error: "Failed to generate embedding" }),
      { headers: { "Content-Type": "application/json" }, status: 500 },
    );
  }
});
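Two options in that call are worth unpacking: mean_pool averages the model's per-token vectors into one vector for the whole text, and normalize scales it to unit length so a plain dot product behaves like cosine similarity. A rough sketch of what those steps do (my own illustration, not Supabase's internals):

```typescript
// mean_pool: average a list of per-token vectors into a single vector.
function meanPool(tokenVectors: number[][]): number[] {
  const dims = tokenVectors[0].length;
  const pooled = new Array(dims).fill(0);
  for (const vec of tokenVectors) {
    for (let i = 0; i < dims; i++) pooled[i] += vec[i];
  }
  return pooled.map((v) => v / tokenVectors.length);
}

// normalize: scale a vector to length 1, so dot products can be
// used directly as a similarity score when ranking search results.
function normalize(vec: number[]): number[] {
  const length = Math.sqrt(vec.reduce((sum, v) => sum + v * v, 0));
  return vec.map((v) => v / length);
}

// Two toy 2-dimensional token vectors pooled into one entry vector.
const pooled = meanPool([[1, 2], [3, 4]]);
const unit = normalize(pooled);
console.log(pooled); // [2, 3]
```

Because every stored embedding is already unit length, comparing a query against entries is as cheap as a dot product.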
Now, 384 dimensions is fairly small, but remember that most journal entries are going to be relatively short. We also want to reduce cost as much as possible, and storing vectors is actually not recommended from a security standpoint: with embeddings, it's possible to extract some of the content, so the less information you store, the better it is for performance, storage, costs, and security.
Anyway, just thought I'd put this out there. It's actually pretty cool, and dealing with vectors and embeddings lowkey made me feel pretty smart. Plus, it's extremely cheap the way I implemented it using Supabase, since I only pay for edge function invocations: 2 million are already included on the pro tier, and additional invocations cost $2 per million. For a couple of bucks (and that's only if I manage to call the edge functions over 2 million times), I added semantic search and solved the searching-encrypted-content problem.