On Search: Stopwords

Here’s a Google search for a famous phrase, to be or not to be; give it a try and see what happens. When you look at word frequencies, it turns out that a few words appear unreasonably often and carry unreasonably little information. They are called “stopwords,” and this (brief) eighth bead in the On Search necklace considers them.

What many search engines have done with stopwords is, well, nothing. That is to say, they don’t get indexed, and you can’t search for them. The theory is that they’re expensive to index because they occur so often, and they carry little useful information.

The Numbers · Let’s start with some numbers. Just before I started writing this, a collection of ad-hoc Perl and sed and sort alleged that ongoing as of then contained 169,908 words of text, comprising 15,370 unique words. Here are the most common twenty-seven in decreasing order of frequency. The column labels are a bit short to keep the table from spreading, so to amplify: “Count” means the number of times this word appears, “Running” means the total occurrences of this word and all those above it, and “%” is “Running” as a percentage of all 169,908 words in ongoing.

Word	Count	Running	%
the	8886	8886	5.2
and	5499	14385	8.5
a	4576	18961	11.2
to	4466	23427	13.8
of	4406	27833	16.4
in	2821	30654	18.0
i	2500	33154	19.5
is	2423	35577	20.9
that	2354	37931	22.3
it	1943	39874	23.5
on	1577	41451	24.4
you	1505	42956	25.3
this	1499	44455	26.2
for	1469	45924	27.0
but	1126	47050	27.7
with	1111	48161	28.3
are	1077	49238	29.0
have	921	50159	29.5
be	909	51068	30.1
at	836	51904	30.5
or	833	52737	31.0
as	793	53530	31.5
was	789	54319	32.0
so	763	55082	32.4
if	699	55781	32.8
out	686	56467	33.2
not	679	57146	33.6
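
For anyone who wants to build the same kind of table for their own text, here is a minimal Python sketch. It is not the Perl, sed, and sort pipeline mentioned above; the tokenizer (lowercase, runs of letters and apostrophes) and the function name `frequency_table` are assumptions made for illustration, so counts on real text will depend on how you decide to split words.

```python
import re
from collections import Counter

def frequency_table(text, top_n):
    """Return (word, count, running, percent) rows for the top_n most common words."""
    # Assumed tokenizer: lowercase, then take runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words)
    rows = []
    running = 0
    for word, count in Counter(words).most_common(top_n):
        running += count  # this word plus all those above it
        rows.append((word, count, running, round(100.0 * running / total, 1)))
    return rows

sample = "to be or not to be that is the question"
for row in frequency_table(sample, 3):
    print(*row, sep="\t")
```

The “Running” and “%” columns fall out of a single accumulator, which is why the table can be produced in one pass over the sorted counts.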

Why Stop? · The numbers tell the story. By leaving out the 27 most common words in the table, we eliminate a third of all word occurrences. (If you haven’t read the write-up on how search indices work, you might want to take a side-trip there now.) Each word occurrence requires that you create, store, and search one posting. Most of the space-cost of search is in postings, and most of the compute time is reading and merging postings lists. Consider that occurrences of “the” comprise just over 5% of the total. If you’re running something like Google, with billions of documents and hundreds of billions of words, you’re looking at many billions of postings you can get rid of by discarding stopwords. Consider the task of doing a set intersection on the billions of matches to “to,” “be,” “or,” and so on. It’s no surprise, really, that you get that polite little note from Google about all of Hamlet’s words except “not” being too common to be useful.
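
To make the savings concrete, here is a toy sketch that builds an index with one posting per word occurrence, once with everything and once with a stopword list applied, and counts the postings in each. The two sample documents, the six-word `STOPWORDS` set, and the names `build_index` and `posting_count` are all made up for illustration; a real engine’s postings carry more than a document number and a position.

```python
from collections import defaultdict

def build_index(docs, stopwords=frozenset()):
    """Map each word to a list of (doc_id, position) postings, skipping stopwords."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for position, word in enumerate(text.lower().split()):
            if word not in stopwords:
                index[word].append((doc_id, position))
    return index

def posting_count(index):
    """Total number of postings across all words in the index."""
    return sum(len(postings) for postings in index.values())

docs = ["to be or not to be", "the limited is a retail chain"]
STOPWORDS = frozenset({"the", "a", "to", "be", "or", "is"})  # illustrative list only

full = build_index(docs)
pruned = build_index(docs, STOPWORDS)
print(posting_count(full), posting_count(pruned))
```

On these two tiny documents the stopword list throws away most of the postings, which exaggerates the effect; on ongoing, going by the table, dropping the listed words would remove 57,146 of 169,908 postings.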

Why Not Stop? · Of course, skipping the stopwords comes at a cost; for example, “to be or not to be.” Another amusing example is the well-known retail chain “The Limited,” which is going to be pretty hard to find in a database that doesn’t index “the.”

And as we’ve come to expect from Google, they’re not stupid. They do in fact index the stopwords, and you can search for "to be or not to be" just fine. See the quotes around the string? This is a phrase search. I won’t go into the details, but combining huge lists of common-word postings is immensely cheaper for a phrase search than for a simple AND or OR. You can find The Limited just fine, too.
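
One way to see why a phrase search can be cheap, sketched in Python under my own assumptions (this is not a description of what Google does): if postings record word positions, you can walk only the postings of the rarest word in the phrase and merely probe the big lists for its neighbors, instead of merging those lists in full. The names `build_positional_index` and `phrase_search` and the sample documents are hypothetical.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map word -> {doc_id: set of positions}, with every word indexed."""
    index = defaultdict(lambda: defaultdict(set))
    for doc_id, text in enumerate(docs):
        for position, word in enumerate(text.lower().split()):
            index[word][doc_id].add(position)
    return index

def phrase_search(index, phrase):
    """Return sorted doc ids in which the words of phrase appear consecutively."""
    words = phrase.lower().split()
    if not words or any(w not in index for w in words):
        return []
    # Anchor on the word with the fewest postings, so the long lists for
    # common words are only probed, never scanned in full.
    anchor = min(range(len(words)),
                 key=lambda i: sum(len(p) for p in index[words[i]].values()))
    hits = []
    for doc_id, positions in index[words[anchor]].items():
        for pos in positions:
            start = pos - anchor  # where the phrase would have to begin
            if all(start + i in index[w].get(doc_id, ())
                   for i, w in enumerate(words)):
                hits.append(doc_id)
                break
    return sorted(hits)

docs = [
    "to be or not to be that is the question",
    "not to be confused with the limited",
    "be not afraid to be",
]
index = build_positional_index(docs)
print(phrase_search(index, "to be or not to be"))
print(phrase_search(index, "the limited"))
```

The work done is proportional to the number of occurrences of the anchor word times the phrase length, which for “to be or not to be” means following “or” rather than “to” or “be.”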

In fact, by putting + characters in front of each of the words, you can make Google claim to do a search for each of the words separately, although when I do the arithmetic in my head, I find it hard to believe they’re actually processing that many billions of postings.

Interestingly, I note that in the first Google search in the article, to be or not to be with no quotes or pluses, the one word it’s willing to search on is “not,” the least common of the most common (in ongoing anyhow).

The bottom line: refusing, by default, to search for common words is good usability practice; when I search for “Lord of the Rings,” nobody misses the two words in the middle. But simply leaving words out of your index because they’re common is a bug, not a feature.
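
Here is a rough sketch of that bottom line as query handling rather than index pruning: the index keeps every word, and stopwords are dropped from the query only when the user has not asked for them with quotes or a leading +. The `STOPWORDS` set, the function name `plan_query`, and the shape of its return value are assumptions for illustration, not any real engine’s query syntax or API.

```python
import re

STOPWORDS = frozenset({"the", "of", "to", "be", "or", "a", "and", "in"})  # illustrative

def plan_query(query):
    """Split a query into phrases and terms, dropping stopwords only by default."""
    # Quoted strings are phrase searches and are kept whole, stopwords included.
    phrases = re.findall(r'"([^"]+)"', query)
    rest = re.sub(r'"[^"]*"', " ", query)
    terms = []
    for token in rest.lower().split():
        if token.startswith("+") and len(token) > 1:
            terms.append(token[1:])   # explicitly requested, so keep it
        elif token not in STOPWORDS:
            terms.append(token)       # ordinary word, keep it
    return {"phrases": [p.lower() for p in phrases], "terms": terms}

print(plan_query("Lord of the Rings"))
print(plan_query('"to be or not to be"'))
print(plan_query("+the limited"))
```

Because nothing is thrown away at index time, the same index serves the lazy default query and the deliberate phrase or + query.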

July 11, 2003

By Tim Bray.
