<!DOCTYPE html PUBLIC '-//W3C//DTD XHTML 1.1//EN' 'http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd'>
<html xmlns:og='https://ogp.me/ns#' lang='en'>
<head>
<title>ongoing by Tim Bray &#xb7; On Search: Metadata</title>
<meta name='viewport' content='width=device-width, initial-scale=1.0, shrink-to-fit=no'/>
<meta property='og:site_name' content='ongoing by Tim Bray'/>
<meta property='og:title' content='On Search: Metadata'/>
<meta property='og:image' content='/ongoing/misc/podcast-default.jpg'/>
<meta property='og:type' content='website'/>
<meta http-equiv='Content-Type' content='text/html; charset=UTF-8'/>
<link rel='stylesheet' type='text/css' media='screen' title='serif' href='/ongoing/serif.css' />
<script type='text/javascript' src='//use.typekit.net/ugm7uwx.js'></script>
<script type='text/javascript'>try{Typekit.load();}catch(e){}</script>
<script type='text/javascript' src='/ongoing/ongoing.js'></script>
<link rel='alternate' type='application/atom+xml' title='Atom (full content)' href='/ongoing/ongoing.atom' />
<!-- Generated from XML source code using Perl, Expat, Emacs, Mysql, Ruby, Java, and ImageMagick.  Industrial-strength technology, baby. -->
</head><body itemscope='' itemtype='http://schema.org/Blog'>
<div id='payload'>
<div id='banner'><h1 itemprop='name'>On Search: Metadata</h1><div id='search'><form action="https://www.google.com/search" target="_parent">Search <input size="20" name="as_q" /><input type="hidden" name="hl" value="en" /><input type="hidden" name="ie" value="UTF-8" /><input type="hidden" name="btnG" value="Google+Search" /><input type="hidden" name="as_qdr" value="all" /><input type="hidden" name="as_occt" value="any" /><input type="hidden" name="as_dt" value="i" /><input type="hidden" name="as_sitesearch" value="tbray.org" /></form></div></div>
<div id='center-and-right'><div id='centercontent'>
<p itemprop='description'>In the Web’s early years, the overwhelming favorite among search engines
was Yahoo.
Today it’s Google.
Neither has actually had better text search technology than the
competition.
They won because they used metadata effectively to make
their services more useful.
In this ninth <cite>On Search</cite> episode, a survey of what metadata is,
where it comes from, and how to use it.</p>

<p>Metadata is technically “information about information” and you
can start a fistfight in the bar at any XML or Content-Management conference
about what’s data and what’s metadata.
In the context of search, metadata is anything that you know about the
documents you’re searching beyond the words they contain.
With 
<a href='/ongoing/When/200x/2003/04/09/SemanticMarkup'>descriptive markup</a>,
it’s easy enough to store a document’s metadata right 
inside it (consider HTML’s <code>&lt;META></code> tag).</p>

<p class='p1'><span class='h2'>Yahoo</span> &#xb7; 
Back when everyone searched at Yahoo, the usual result list looked quite a
bit different.  
If I typed in “donkey,” before the pointers to Web pages there would
be a few pointers to categories in the Yahoo taxonomy that contained the word
“Donkey.”</p>

<p>This worked really well, because if the Yahoo editor had classified
<cite>Diseases of the Horse Family</cite> or <cite>The Asses of the British
Isles</cite> under a donkey-related category, I’d find them even though
“donkey” wasn’t in the title.</p>

<p>In effect, Yahoo maintained one useful piece of metadata about each page
in the engine: <i>What is this about?</i>
This is a real value-add for the searcher.</p>

<p class='p1'><span class='h2'>Google</span> &#xb7; 
Google, like Yahoo, maintains one key metadata field about each item it
indexes: the well-known PageRank, essentially a measure of how many other
pages point to it.
They use it very simply: to order the result list, with high
PageRanks at the top.</p>
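
<p>A minimal sketch of ordering a result list by a popularity field. It is
deliberately cruder than real PageRank, which also weighs each link by the
rank of the page it comes from; here a page simply scores one point per
distinct page linking to it. The function names and the sample link data are
made up for illustration.</p>

```python
from collections import Counter

def inbound_link_counts(links):
    """Count how many distinct pages point at each target page."""
    counts = Counter()
    for source, target in set(links):  # ignore duplicate links
        if source != target:           # ignore self-links
            counts[target] += 1
    return counts

def order_by_popularity(matches, counts):
    """Return matching pages, most-linked first; ties keep their input order."""
    return sorted(matches, key=lambda page: -counts[page])

# Hypothetical (source, target) link pairs.
links = [
    ("a", "c"), ("b", "c"), ("d", "c"),
    ("c", "d"), ("a", "d"),
    ("b", "c"),  # duplicate, counted once
    ("d", "d"),  # self-link, ignored
]
counts = inbound_link_counts(links)
print(order_by_popularity(["d", "a", "c"], counts))  # ['c', 'd', 'a']
```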

<p class='p1'><span class='h2'>Conclusions?</span> &#xb7; 
Google seized search leadership from Yahoo;
can we conclude that it’s more
important to know how popular something is than to know what it’s about?
If you’d told me that ten years ago I would have had a hard time
believing it, but the evidence seems pretty compelling.
Note that Google actually does have some subject metadata via their
integration with the <a href='http://dmoz.org'>Open Directory Project</a>,
but they don’t push it that hard, and the volunteer-staffed,
highly-political, AOL-semi-orphan ODP is a fairly weak reed to lean on 
anyhow.</p>

<p>On the other hand, Google has always been way more focused on search than
Yahoo has, and isn’t always trying to get in front of you with stock
prices and news and weather and so on.
More important, even if it turns out that popularity is the key thing for
Internet search, the Internet is a very special place, and it’s quite
unlikely that popularity is the killer metadatum for the whole universe of
search applications.</p>

<p>I believe, though, in the other obvious conclusion: that the number-one
way to make search work better is to bring some metadata to bear on the
problem. 
This really shouldn’t be surprising: 
as I’ve <a href='/ongoing/When/200x/2003/06/24/IntelligentSearch'>discussed
before</a>, it’s really hard to 
make search engines act much smarter than they do today.
So instead, let’s reinforce them with externally-supplied metadata.</p>

<p class='p1'><span class='h2'>Where Does Metadata Come From?</span> &#xb7; 
Those Yahoo and Google metadata offerings, while really quite
different, have one important thing in common: both are expensive.
Yahoo has for years employed a team of editors to sort websites into their
subject hierarchy by hand.
And Google’s immense rooms full of machines humming away computing
PageRanks twenty-four hours a day are a legend in our industry.</p>

<p>In my experience, this is typical.  Put another way: <em>There is no cheap
metadata</em>.
Of course, if we could use computers to compute the metadata like Google does,
that would be immensely cheaper than having employees do it.
And a lot of smart people have invested a lot of effort and money into the
problem of deriving metadata from data, but it’s a hard one.
(Still, we should be on the lookout for opportunities; more later).</p>

<p>Many people in the content-management and knowledge-management trades have
noticed this, and concluded that the trick is to gather metadata upstream.
Remember how Microsoft Word, out of the box, used to pop up a dialog every
time you created a new document and encourage you to provide a little
metadata?
Most people immediately said “Make this go away!” and I don’t think
Word has done this (by default) for years.</p>

<p>Historically, the difficulty of collecting metadata at source has been
generally large enough to outweigh the (potentially huge) benefits from
collecting it.
But I for one am not ready to give up on this approach.
There are, after all, domains where metadata is at the core of the business
proposition, and the process works there.
For example, the editorial staff who produce the <cite>Wall Street
Journal</cite> add metadata as they go along, identifying people, companies,
stock ticker symbols, and so on.</p>

<p class='p1'><span class='h2'>If You Collect Metadata By Hand</span> &#xb7; 
The most important lesson I’ve learned is: <em>Don’t try to
collect too much.</em>  You might, just might, get people, when they’re
interacting with your intranet, to label their information by project and
title; but more than a couple of fields and people will just bypass the
process.</p>

<p>This is harder than it looks. 
When you decide in principle that metadata should be collected, it will
develop that many stakeholders have short-lists of the fields they need
to make this worthwhile.
You can easily end up with a “short” list of a dozen
or more fields that constitute the “absolute minimum” that people
think you must have.
And if you adopt it, you’re dead, because except in special circumstances
(e.g. the <cite>WSJ</cite>),
people just will not take the time to do this.</p>

<p class='p1'><span class='h2'>Automatic Metadata</span> &#xb7; 
Obviously, there are some metadata items the computer will give you for
free: a filename, created/modified dates, who created it, what kind of file
(HTML, Excel, PowerPoint), how big it is.
These can be handy for search applications and since they’re free, you
should collect them and make them available.</p>
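
<p>A short sketch of harvesting those free items with nothing but the Python
standard library. The dictionary keys are arbitrary names chosen for this
example. It reports only the modification time, because a true creation time
is not available on every filesystem, and it guesses the file type from the
extension rather than inspecting the contents.</p>

```python
import mimetypes
import os
import tempfile
from datetime import datetime, timezone

def free_metadata(path):
    """Collect the metadata the filesystem already knows about one file."""
    info = os.stat(path)
    kind, _ = mimetypes.guess_type(path)  # guessed from the extension only
    modified = datetime.fromtimestamp(info.st_mtime, tz=timezone.utc)
    return {
        "filename": os.path.basename(path),
        "modified": modified.isoformat(),
        "size_bytes": info.st_size,
        "kind": kind or "unknown",
        "owner_uid": info.st_uid,  # numeric owner id; always 0 on Windows
    }

# Demonstrate on a throwaway file so the example is self-contained.
with tempfile.NamedTemporaryFile(suffix=".html") as demo:
    demo.write(b"hello")
    demo.flush()
    print(free_metadata(demo.name))
```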

<p>The second category of machine-generated metadata is what
“Autocategorization” software does.  These are the companies like
(in alphabetical order) Autonomy, Gavagai, Semio, Stratify, and Vivisimo; they
all promise to take your raw data and either generate or fill in a subject
taxonomy telling you what it’s about.</p>

<p>Sometimes they work, sometimes they don’t, and sometimes it can be
puzzling figuring out whether they’re going to work or not.
But they are <em>not</em> an exception to the no-cheap-metadata rule; this is
software that’s generally expensive to buy and expensive to deploy.</p>

<p class='p1'><span class='h3'>Don’t Neglect Your Logfiles</span> &#xb7; 
There’s one kind of automatic metadata that I think doesn’t get the
respect it deserves: the contents of your logfiles.
Here’s the most obvious example: unless you’ve been throwing away
your internal Web server log files, you already know which are the most
popular items on the Intranet.
It wouldn’t be that hard to boil them down (occasionally, on a batch basis;
this doesn’t need to be real-time) and develop your own internal
“PopRank” based on what gets downloaded the most.
It might not be as sexy as PageRank, but if I search the Intranet for
material on expense policies, you can bet I’m going to find a lot, and
if two or three stand out because they’re the ones everyone ends up
reading, you might save a lot of people a lot of time.</p>
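
<p>One way the batch job might look, as a sketch only. It assumes the server
writes Common Log Format lines and counts successful GET requests per path;
the regular expression, the function name, and the sample log lines are all
invented for this example, and a real job would likely also filter out robots
and image requests.</p>

```python
import re
from collections import Counter

# Matches the request and status fields of a Common Log Format line.
REQUEST = re.compile(r'"GET (\S+) [^"]*" (\d{3}) ')

def pop_rank(log_lines):
    """Count successful (status 200) GET requests per path."""
    hits = Counter()
    for line in log_lines:
        found = REQUEST.search(line)
        if found and found.group(2) == "200":
            hits[found.group(1)] += 1
    return hits

# Hypothetical intranet log entries.
sample = [
    '10.0.0.1 - - [29/Jul/2003:09:00:00 -0700] "GET /hr/expenses.html HTTP/1.0" 200 5120',
    '10.0.0.2 - - [29/Jul/2003:09:05:00 -0700] "GET /hr/expenses.html HTTP/1.0" 200 5120',
    '10.0.0.3 - - [29/Jul/2003:09:06:00 -0700] "GET /hr/old-policy.html HTTP/1.0" 404 210',
    '10.0.0.1 - - [29/Jul/2003:09:07:00 -0700] "GET /hr/travel.html HTTP/1.0" 200 2048',
]
print(pop_rank(sample).most_common())
# [('/hr/expenses.html', 2), ('/hr/travel.html', 1)]
```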

<p class='p1'><span class='h2'>Care, Feeding, and Using</span> &#xb7; 
Once you’ve got some metadata, since it’s expensive, you should
take good care of it.
This almost always means putting it in a relational database. 
As I mentioned above, debates over the meta-ness of data can get religious,
but in practice, I’ve observed that while data itself (for example XML
or video) often resists being forced into rows and columns, metadata usually
lines up happily.
Even <span class="o">ongoing</span> has a little MySQL database sitting off to the side of all the
XML-encoded entries, tracking a bunch of useful facts about them, including
some (e.g. the title) that are replicated inside the data.</p>
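
<p>To illustrate how readily metadata fits rows and columns, here is a small
sketch using SQLite from the Python standard library, standing in for MySQL
so that it runs without a server. The table, its columns, and the sample rows
are assumptions made up for this example; they are not the actual schema
behind this site.</p>

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    """CREATE TABLE entries (
           uri      TEXT PRIMARY KEY,
           title    TEXT NOT NULL,
           author   TEXT NOT NULL,
           updated  TEXT NOT NULL   -- ISO 8601 date
       )"""
)
db.executemany(
    "INSERT INTO entries VALUES (?, ?, ?, ?)",
    [
        ("/docs/expenses", "Expense Policy", "Finance", "2003-07-01"),
        ("/docs/travel", "Travel Guidelines", "Finance", "2003-06-15"),
        ("/docs/vacation", "Vacation Policy", "HR", "2003-05-20"),
    ],
)

def by_author(author):
    """Return (uri, title) pairs for one author, newest first."""
    rows = db.execute(
        "SELECT uri, title FROM entries WHERE author = ? ORDER BY updated DESC",
        (author,),
    )
    return rows.fetchall()

print(by_author("Finance"))
# [('/docs/expenses', 'Expense Policy'), ('/docs/travel', 'Travel Guidelines')]
```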

<p>And of course you’ll want to put this goodness to work.
One obvious way is to have a query screen, so that people can search for
resources by author, date, title, and so on, not just brute-force full-text.
But what you’d really like is to learn from Yahoo and Google, and have
the metadata just there, silently helping.
For example, to use in ranking your results.</p>
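
<p>A sketch of metadata quietly helping with ranking: a full-text relevance
score is nudged by a damped popularity count. The blending formula, the
default weight of 0.3, and the sample scores are all assumptions chosen for
illustration, not a recommendation; any real deployment would need tuning.</p>

```python
import math

def blended_score(text_score, downloads, weight=0.3):
    """Mix a full-text relevance score with a damped popularity signal."""
    # log1p keeps a hugely popular document from swamping relevance.
    return text_score + weight * math.log1p(downloads)

def rank(hits, popularity):
    """Order documents by blended score, best first.

    hits maps document to text score; popularity maps document to
    download count (missing documents count as zero).
    """
    return sorted(
        hits,
        key=lambda doc: blended_score(hits[doc], popularity.get(doc, 0)),
        reverse=True,
    )

# Hypothetical text scores and download counts.
hits = {"old-memo": 1.2, "expense-policy": 1.0, "draft": 0.9}
popularity = {"expense-policy": 400, "old-memo": 3}
print(rank(hits, popularity))  # ['expense-policy', 'old-memo', 'draft']
```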

<p>Another thing you could do <code>&lt;commercial-plug></code>is call up 
<a href='http://www.antarctica.net'>Antarctica</a>; our Visual Net product
takes metadata and gives search a Graphical User Interface just like your
personal computer has.<code>&lt;/commercial-plug></code></p>

<p class='p1'><span class='h2'>In the API</span> &#xb7; 
This means that if you’re going to design an API for a search engine
(something I plan to do eventually in this series) you’re going to need
to include entry-points not just for searching and adding words to the
full-text index, but also for adding, maintaining, and using the metadata
that drives the search.</p>
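
<p>To make the shape of such an API concrete, here is a toy in-memory sketch
with separate entry points for the full-text index and for the metadata. The
class and method names are hypothetical and invented for this example; this
is not the API design promised for later in the series.</p>

```python
from collections import defaultdict

class TinySearchEngine:
    """Toy engine with separate full-text and metadata entry points."""

    def __init__(self):
        self.postings = defaultdict(set)   # word -> set of document ids
        self.metadata = defaultdict(dict)  # document id -> field -> value

    def add_text(self, doc_id, text):
        """Add a document's words to the full-text index."""
        for word in text.lower().split():
            self.postings[word].add(doc_id)

    def set_metadata(self, doc_id, **fields):
        """Add or update metadata fields for a document."""
        self.metadata[doc_id].update(fields)

    def remove_metadata(self, doc_id, field):
        """Drop one metadata field, if present."""
        self.metadata[doc_id].pop(field, None)

    def search(self, word, **required):
        """Find documents containing word whose metadata matches every required field."""
        matches = self.postings.get(word.lower(), set())
        return sorted(
            doc for doc in matches
            if all(self.metadata[doc].get(k) == v for k, v in required.items())
        )

engine = TinySearchEngine()
engine.add_text("d1", "Expense policy for travel")
engine.add_text("d2", "Travel tips")
engine.set_metadata("d1", author="Finance")
engine.set_metadata("d2", author="HR")
print(engine.search("travel"))               # ['d1', 'd2']
print(engine.search("travel", author="HR"))  # ['d2']
```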

<p class='p1'><span class='h2'>The Web and the Semantic Web</span> &#xb7; 
One of the Web’s distinguishing features is that there’s a big
gaping hole where the metadata ought to be.
The Web has resources, identified by URI, and you can ask for
“representations,” which come with some metadata, but the metadata is
about the representation, not the resource.
This is probably a bit abstract for those who don’t 
<a href='/ongoing/What/Technology/Web/TAG/'>wrestle professionally</a> with
Web Architecture, so an example’s in order:
Suppose you read an online news story from your desktop computer at 9AM.
You get a Web page with some metadata telling you that it’s in HTML and
is in English and ISO-8859-1-encoded and can’t be cached and so on.
Suppose, at noon, on the road, you hit the same story from the
minibrowser in your cellphone.
The server cleverly notices this is a small-screen device and sends the same
information in WAP or simplified HTML or some such thing, with metadata
saying what it is (which is completely different from the metadata you got
with the PC Browser version).</p>

<p>So, given a URI, the Web has no built-in way to ask questions about it,
for example “What is this about?” or “When does it expire?”
or “Is this suitable for children?” or “Is this good?”</p>

<p>The Semantic Web project is trying to make the whole Web smarter and more
machine-readable, and obviously this is never going to happen without
metadata.
So a lot of really smart people are working hard to develop good ways to
encode, organize, and interchange metadata keyed by URIs.
Of course, these people’s dreams aren’t about mere search; they’re
about managing your schedule and your medical treatments and your shopping
and your supply chain.
All of which is fine; but if the Semantic Web ever takes off, there is going
to be a whole lot more metadata available about a whole lot of stuff.</p>
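
<p>What “metadata keyed by URIs” could look like in the smallest possible
form: a toy store of statements about resources, independent of whichever
representation a server happens to send. The class, the property names, and
the example URI are invented for illustration and are not any real Semantic
Web vocabulary or syntax.</p>

```python
from collections import defaultdict

class ResourceMetadata:
    """Toy store of statements about resources, keyed by URI."""

    def __init__(self):
        self.statements = defaultdict(dict)  # uri -> property -> value

    def assert_fact(self, uri, prop, value):
        """Record one statement about a resource."""
        self.statements[uri][prop] = value

    def ask(self, uri, prop):
        """Return the recorded value, or None when nothing has been said."""
        return self.statements.get(uri, {}).get(prop)

store = ResourceMetadata()
story = "http://example.com/news/story-17"  # hypothetical URI
store.assert_fact(story, "subject", "donkeys")
store.assert_fact(story, "expires", "2003-08-29")
print(store.ask(story, "subject"))                # donkeys
print(store.ask(story, "suitable-for-children"))  # None
```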

<p>As a side-effect, I expect that all the search services of the world will
become a lot richer, a lot smarter, and a lot more fun to use.
But we’re not there yet.</p>

<p class='p1'><span class='h3'>A Word On Our Sponsor</span> &#xb7; 
This is a sponsored essay.  It is brought to you by 
<a href='http://www.bchydro.com/'>the local power
company</a>, who arranged a complete power failure in Antarctica’s offices
this afternoon, so I took advantage of battery power to type this in.
Power’s back, it’s back to work we go.</p>

<hr />
<div id='commentHere'></div>
<div id='footer'><p class='footer'><b>Updated: 2003/07/29</b></p>
</div>
</div>

<div id='rightcontent'><div class='oo'><a id='to-home' href='https://www.tbray.org/ongoing/'><span id='home'>ongoing</span></a></div>
<div>
<div class='principles'>
<a href='/ongoing/WhatItIs'>What this is</a> &#xb7;
<a href='/ongoing/ongoing.atom'><img title="Subscribe to ongoing" alt="Subscribe to ongoing" src="/ongoing/Feed.png"/></a><br/>
<a href='/ongoing/Truth'>Truth</a> &#xb7;
<a href='/ongoing/Biz'>Biz</a> &#xb7;
<a href='/ongoing/Tech'>Tech</a></div>
<a href='/ongoing/misc/Tim'>author</a> &#xb7;
<a href='http://www.textuality.com/BillBray/'>Dad</a><br/>
<a href='/ongoing/misc/Colophon'>colophon</a> &#xb7;
<a href='/ongoing/misc/Copyright'>rights</a>
</div>
<div id='potd'><a id='tnA' href='/ongoing/goto-potd/'><img id='tnI' src='/ongoing/potd.png' alt='picture of the day' /></a></div>
<div id='cats'>
<a href='/ongoing/When/200x/2003/07/'>July</a> <a href='/ongoing/When/200x/2003/07/29/'>29</a>, <a href='/ongoing/When/200x/2003/'>2003</a><br/> &#xb7; <a href='/ongoing/What/Technology'>Technology</a><span class='more'> (90 fragments)</span>
<br/> &#xb7; &#xb7; <a href='/ongoing/What/Technology/Search'>Search</a><span class='more'> (66 more)</span>
</div>

<div class="employ">
<p>By <a rel="author" href="/ongoing/misc/Tim">Tim Bray</a>.</p>
<p>The opinions expressed here <br/>
are my own, and no other party<br/>
necessarily agrees with them.</p>
<p>A full disclosure of my<br/>
professional interests is<br/> 
on the <a href='/ongoing/misc/Tim'>author</a> page.</p>
<p>I’m on <a rel="me" href="https://cosocial.ca/@timbray">Mastodon</a>!</p>
</div>



</div>
</div>
</div>

</body>
</html>