I created a spam filter that runs in the browser for my personal email.
When Google Workspace started charging money, I switched to another email provider: Tutanota. They are great in many ways, but the spam filter is not very good. I was getting dozens of spam messages each day, often the same message multiple times. These showed up in my inbox even after I had reported the same email as spam. I got tired of this and decided to do something about it: I made my own client-side spam filter.
The filtering logic doesn’t have to be very complex. I can often determine whether a message is spam just by skimming the subject and the sender, and I figure an automated filter could do the same.
Reverse engineering the API
There’s a global tutao variable that offers an entry into the Tutanota internals.
Many web applications expose such a global variable, but not all of them do: it is entirely possible to implement an application without using global variables.
The code to retrieve the inbox from the current mailbox is quite straightforward:
tutao.locator.mailModel.getMailboxDetails().then( details => inbox = details.folders.getSystemFolderByType("1") );
However, retrieving the mails in this folder is less straightforward. The inbox does have a mails property, but it is an identifier, not a list of emails. Instead, we have to pass this identifier to the EntityClient to load a list of mails. Emails, contacts, and calendar items are all forms of entities, and these are all loaded in the same way.
EntityClient has functionality to load emails (or other entities) starting from a certain email, again indicated with an identifier. This is a common pattern used for pagination. It makes it easy to retrieve and learn from only the new emails: we remember the identifier of the last email we saw, and request only newer emails.
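The cursor pattern described above can be sketched roughly as follows. This is not the actual Tutanota call: `loadRange` is a hypothetical stand-in for the EntityClient method, assumed to resolve to at most `count` mails that come after `startId`, in order, each with an `id` property.

```javascript
// Sketch: fetch only the mails newer than the last one we have seen.
// `loadRange(listId, startId, count)` is a hypothetical loader (an assumption,
// not the real EntityClient signature) that returns a Promise of mails.
async function loadNewMails(loadRange, listId, lastSeenId, pageSize = 50) {
  const newMails = [];
  let cursor = lastSeenId;
  while (true) {
    const page = await loadRange(listId, cursor, pageSize);
    newMails.push(...page);
    if (page.length < pageSize) break; // a short page means we reached the end
    cursor = page[page.length - 1].id; // continue after the last mail of this page
  }
  return newMails;
}
```

After a run, the identifier of the last returned mail would become the new `lastSeenId` for the next run.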
Many functions within Tutanota and the classifier are asynchronous. Asynchronous functions return a Promise, and I handled that explicitly by calling then() on the promise and creating new promises to return. I suspect that making better use of the async and await keywords could improve things, but I don’t yet fully understand how they work.
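For comparison, here is the inbox lookup written both ways. It is only a sketch: `mailModel` is passed in as a parameter so the example does not depend on the real tutao global, and the method names simply mirror the snippet earlier in this post.

```javascript
// Explicit promise chaining with then():
function getInboxWithThen(mailModel) {
  return mailModel.getMailboxDetails()
    .then(details => details.folders.getSystemFolderByType("1"));
}

// The same logic with async/await. An async function always returns a
// Promise, and await pauses the function until the awaited promise settles.
async function getInboxWithAwait(mailModel) {
  const details = await mailModel.getMailboxDetails();
  return details.folders.getSystemFolderByType("1");
}
```

Both functions return a Promise of the inbox folder, so callers can use either one interchangeably.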
getSystemFolderByType("1") and not
getSystemFolderByType(MailFolderType.INBOX): the values from
Running script on the page
const script = document.createElement('script');
script.setAttribute('src', chrome.runtime.getURL('page.js'));
document.body.appendChild(script);
The plugin adds a button to the UI, which selects all spam. This button needs to be added just after the Tutanota app has finished loading. I tried to hook into the Tutanota app to get a notification when this is the case, but in the end I decided to keep polling for a <div> with the correct name. Less elegant, but it works.
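A polling helper along these lines could look like the sketch below. The function name, the interval, and the attempt limit are illustrative choices of mine; in the extension, `find` would be something like `() => document.querySelector(selector)`, with whatever selector identifies the target element.

```javascript
// Sketch: call `find()` repeatedly until it returns something truthy.
// Resolves with the found value, or rejects after `maxAttempts` tries.
function pollFor(find, intervalMs = 250, maxAttempts = 120) {
  return new Promise((resolve, reject) => {
    let attempts = 0;
    const check = () => {
      const found = find();
      if (found) return resolve(found);
      attempts += 1;
      if (attempts >= maxAttempts) return reject(new Error("gave up waiting"));
      setTimeout(check, intervalMs); // try again after a short delay
    };
    check();
  });
}
```

The attempt limit keeps the poller from running forever if the element never appears, for example after a change in the app's markup.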
Naive Bayesian classifier
Thomas Bayes came up with a rule of probability that today helps us filter spam. Specifically, when a word occurs more often in spam than in normal messages, the presence of that word gives us some information on whether a message is spam. A Bayesian classifier can automatically learn which words fit which categories, and then assign a probability that a message belongs to a certain category. It is naive in the sense that it treats the words as independent of each other: it calculates the probability for “vicodin” and “pills” separately, even though these may in practice only occur together.
classifier.learn('Buy Vicodin Pills Today', 'spam');
classifier.learn('Schedule week 38', 'ham');
classifier.categorize('Cheap Vicodin Pills'); // => 'spam'
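To make the arithmetic behind learn and categorize more concrete, here is a minimal naive Bayes sketch with add-one (Laplace) smoothing. It illustrates the idea only and is not the classifier used in the plugin; the class name and its internals are made up for this example.

```javascript
// Minimal naive Bayes sketch, for illustration only.
class TinyBayes {
  constructor() {
    this.docCount = {};   // category -> number of learned messages
    this.wordCount = {};  // category -> { word -> occurrences }
    this.totalWords = {}; // category -> total number of words seen
    this.vocabulary = new Set();
  }

  tokenize(text) {
    return text.toLowerCase().split(/\W+/).filter(Boolean);
  }

  learn(text, category) {
    this.docCount[category] = (this.docCount[category] || 0) + 1;
    this.wordCount[category] = this.wordCount[category] || {};
    this.totalWords[category] = this.totalWords[category] || 0;
    for (const word of this.tokenize(text)) {
      this.wordCount[category][word] = (this.wordCount[category][word] || 0) + 1;
      this.totalWords[category] += 1;
      this.vocabulary.add(word);
    }
  }

  // log P(category) + sum over words of log P(word | category).
  // Adding the logs treats every word as independent: the "naive" part.
  logScore(text, category) {
    const totalDocs = Object.values(this.docCount).reduce((a, b) => a + b, 0);
    let score = Math.log(this.docCount[category] / totalDocs);
    for (const word of this.tokenize(text)) {
      const count = this.wordCount[category][word] || 0;
      // Add-one smoothing so an unseen word does not zero out the score.
      score += Math.log((count + 1) / (this.totalWords[category] + this.vocabulary.size));
    }
    return score;
  }

  categorize(text) {
    return Object.keys(this.docCount).reduce((best, c) =>
      this.logScore(text, c) > this.logScore(text, best) ? c : best);
  }
}
```

Working in log space avoids multiplying many small probabilities together, which would quickly underflow for longer messages.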
The categorize function calculates the probability of each category, and returns the most likely category. This may give unexpected results if it is not very sure about the category. If it is 1% certain it’s ham and 2% certain it’s spam, it will be marked as spam, even though it basically doesn’t know. This gave unexpected results while developing, but it doesn’t really cause problems in practice.
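One way to guard against such low-confidence decisions would be to pick a category only when it clearly dominates. The sketch below is not part of the plugin: the function name and the `minShare` threshold are hypothetical, and `scores` is assumed to map category names to unnormalized probabilities.

```javascript
// Sketch: return the best category only if it holds at least `minShare`
// of the total score; otherwise report "unsure".
function categorizeWithThreshold(scores, minShare = 0.9) {
  const entries = Object.entries(scores);
  const total = entries.reduce((sum, [, p]) => sum + p, 0);
  if (total === 0) return "unsure"; // no evidence at all
  const [best, bestP] = entries.reduce((a, b) => (b[1] > a[1] ? b : a));
  return bestP / total >= minShare ? best : "unsure";
}
```

With the numbers from the example above, spam holds only two thirds of the total score, so the message would be left alone instead of being marked as spam.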
In the example above, we pass a sentence to the API, but the classifier learns from words. The input is split into words by a tokenizer. I began by passing the subject to the default tokenizer, but later on I changed this in a couple of ways. First, a special token is added when the subject is empty. An empty subject of course does not match any words and would be hard to classify, but the fact that it’s empty is a good sign that it’s spam. To inform the classifier of this, I add an __emptySubject token when the subject is empty. Second, I don’t split the recipient address into separate words. That something is sent to email@example.com is useful for the classifier, and shouldn’t be confused with a subject that contains “Sjoerd” and “GitHub”. Third, I made the tokenizer case-insensitive.
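The three tokenizer changes could be sketched like this. The `subject` and `recipient` field names, and the exact splitting rule, are assumptions for the example and not the plugin's actual code.

```javascript
// Sketch of a tokenizer with an empty-subject marker, an unsplit
// recipient address, and case-insensitive tokens.
function tokenizeMail({ subject = "", recipient = "" }) {
  const tokens = subject
    .toLowerCase()              // "GitHub" and "github" become the same token
    .split(/[^\p{L}\p{N}]+/u)   // split on anything that is not a letter or digit
    .filter(Boolean);
  if (tokens.length === 0) tokens.push("__emptySubject"); // marker for an empty subject
  if (recipient) tokens.push(recipient.toLowerCase());    // whole address as one token
  return tokens;
}
```

Because the splitter drops underscores and lowercases everything, a real subject can never produce the literal `__emptySubject` token, so the marker cannot collide with ordinary words.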
After changing the tokenizer, the classifier no longer works correctly and needs to be retrained. Before I made the tokenizer case-insensitive, it learned about “GitHub” and “VICODIN”, and now it only sees mails about “github” and “vicodin”. I forgot this step a couple of times, and then it looked like the change to the tokenizer had made the classifier a lot worse.
I store the classifier in the browser’s localStorage, so that it persists and I can use and update it every time I open the mail app.
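Persisting the classifier can be as simple as serializing its state to JSON. In this sketch the storage key and the shape of `state` are assumptions; `storage` defaults to the browser's localStorage but can be swapped for any object with getItem and setItem.

```javascript
// Sketch: save and restore classifier state as a JSON string.
const STORAGE_KEY = "spamClassifier"; // hypothetical key name

function saveClassifier(state, storage = globalThis.localStorage) {
  storage.setItem(STORAGE_KEY, JSON.stringify(state));
}

function loadClassifier(storage = globalThis.localStorage) {
  const json = storage.getItem(STORAGE_KEY);
  return json === null ? null : JSON.parse(json); // null when nothing was saved yet
}
```

A caller would load the state on startup, fall back to a fresh classifier when the result is null, and save again after each round of learning.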
The spam filter works well. The Bayesian filtering was easier than I expected. Integrating with the Tutanota app was the biggest challenge. Because the Tutanota app does not have a well-defined API, I feel like my spam filter could stop working when they push a new update of their app. But until that time, I have an easy way to handle my spam problem.