What · Technology · Search

Power Web Site · I propose a new definition. A site which is designed as the primary Web property for a person, place, or thing is a power site if the person, place, or thing has a Wikipedia entry but, in popular search engines, the site ranks above that Wikipedia entry. There aren’t very many. But they follow simple patterns ...
[14 comments]  
The End of the Golden Age? · For a few years now, the Internet has been just insanely useful. Everything is there and you can find it when you need it. But Google is working less and less well, and I’ve spotted another potential crack in its foundations. Will we look back on this as the time when it all started to fall apart? ...
[15 comments]  
Mahalo Funnies · Someone named “C.K. Sample III” emailed me with an invite to try something called Mahalo, which I gather is a hot subject among the A-listers. I was on a boring telecon, so I fired up Firefox and gave it a try. Hey, it said “Mahalo's search results only include great links.” The results are hilarious ...
[7 comments]  
NASDAQ: JAVA · Wow, they switched the ticker. It will be little surprise to hear that the internal conversation has been sustained and loud. While there have been negatives along the lines of “OMG WTF PHB!?!?”, most of the internal talk has echoed what they’re saying out in the blogosphere. I’d like to add a couple of points I haven’t seen elsewhere, one each on the pro and con side ...
[7 comments]  
Finding Things · That’s the title of my chapter in Beautiful Code, which seems now to be out, not that I’ve actually seen a copy. What’s amusing me today is that Finding Things is the chapter they’ve picked to post as a free PDF download. So, in the event that you’re interested in the subject but don’t care about what Kernighan and Bentley and Petzold and Stein and Dongarra and Cantrill and Matsumoto and all the others have to say, you can avoid buying the book and doing Amnesty International a favor. I have to say that the Table of Contents looks pretty impressive.
[2 comments]  
search.technorati.com · Now, this is what I’ve always wanted. I’m feeling kind of unhappy with myself; time after time, Dave Sifry has shown me some new frippery they’re rolling out at Technorati and I’ve said “Yeah, that’s kind of cool, maybe you could twiddle X” even though it didn’t turn my personal crank that much. My problem has been that I was assuming that the way I want to use Technorati is unusual. I use it for vanity feeds of course, but when I go to the site, I only ever want to ask two questions: “What are they saying about ?” and “Where was that article I saw recently about ?” The new, very Google-flavored search.technorati.com does those things, and that’s all it does. Plus, it seems a whole lot faster ...
[4 comments]  
Safe Tea · Now, that’s weird. As I’ve reported before, one of the top sources of ongoing traffic is Google image searches for “tea”; the pretty picture from Tea is the #1 match. Only, sometimes not. It turns out that if you turn SafeSearch off, as in “Yeah Google, it’s OK to show me the most appalling filth if you find it”, well, my little red teapot vanishes from the results page. Turn SafeSearch back to “Moderate” and there it is. Is my composition regarded as so chaste that it must be kept away from pages where Sex Might Happen? Granted that there is no naked flesh, but the pot and the cup are both kind of curvy. Life is full of mysteries.
[6 comments]  
Tag Scheme? · In Atom, categories have schemes. What scheme should we use for tags? ...
[26 comments]  
What I Search · Most people use computers mostly for information storage. Which means that most people do a lot of searching. My most common search is the Web as a whole, via either Google or Yahoo!; I try to switch back & forth from time to time. Next most common would be email via GMail, my own slow mailgrep, and maybe some year Spotlight. After that would be the Web Event Stream via Technorati (for me, text all the time, tags almost never). Bringing up the rear would be a certain amount of filesystem searching (Spotlight sort of works, except for my email) and grepping source code. I wonder if I’m typical?
[9 comments]  
Stimulating Pictures · I don’t know why this tickles me so much, but it does: ongoing is getting a couple of thousand visits a week from people searching for “tea” or “cup of coffee” on Google Images. The piece entitled, well, Tea, is number one! And for “cup of coffee”, A Damn Fine Cup of Coffee is #5. Interestingly, neither of them shows up anywhere near the top in the image-search functions of either Yahoo! or Microsoft’s Live Search. In fact, typing those words in and seeing what the three engines produce is kind of interesting (Live Search notably includes actress Téa Leoni).
[2 comments]  
Intelligent Search (Again) · I see where Powerset is going to provide the next generation of Internet Search, and has raised some money from some apparently-smart people. Me, I’ve seen this movie before. Good luck to ’em; they’ll need it.
[2 comments]  
Grass, Tea, Church, Search · I gather there are people out there—lots of people—whose livelihood more or less depends on their Google search rank. Herewith some thoughts on why this is scary ...
 
Microformats Search · The Microformats kids all have really great hair, and the coolest acronyms; still, up till now, it’s all only occasionally seemed plausible to me. But this new Technorati microformat search thing, I look at it and for the first time really think “This could be big”. For example, look at kitchen.technorati.com/event/search/vancouver (even the URI is interesting). It looks like some “sort-by” buttons and authority and keyword filters would improve things; and if this catches on some Monday the spammers will be there by Wednesday. But still, we could be seeing it happen: small pieces combining to produce something really, really big. [Disclosure: I have a conflict of interest with respect to Technorati.]
 
Bloglines · So, Bloglines has launched their blog search thing and, of all the blog search engines I have tried, this is one of them. Paul Querna says that it’s better because it doesn’t do tags. Uh, OK. Anyhow, congrats; now that this triumph has been recorded, maybe that will free up some Bloglines cycles for fixing the actual core offering that makes them interesting, that millions of people use to cruise the blogosphere, that I used to recommend to everyone, and that Sam Ruby just broke again? I would really like to be a friend of Bloglines.
 
The Future Search Market · Recently, I learned that search providers pay for traffic, which makes all sorts of sense in a world where they’re offering approximately equal levels of service. So, where to from here? I can see the opportunity to build a near-perfect market. (Please note for the record that in this piece, I agree with Nicholas Carr) ...
 
Search For Sale · In response to yesterday’s Buying Search Traffic, Russell Beattie (who works for Yahoo) writes: “Search is already determined by who pays the most for it! Everywhere you see a search box with a Google logo, be sure that there's a competitor out there that will pay for the same spot—because search advertising is so monetizable. Google is everywhere because they're paying for it.” Wow, I had no idea. Now, this is just one person’s voice, but I’m running it because I think Russell is probably in a position to know, and is honest in my experience. Anyone else want to confirm or refute? [Ah, Om Malik was on the story back in September.]
 
Buying Search Traffic · On impulse, I just twiddled the ongoing software on my staging server so that when you do a search in the little box up at the top, it goes to Yahoo not Google. I ran a bunch of searches, and in terms of result quality, there was nothing to choose between them. Yahoo seemed a little fresher; on this Sunday it had Friday’s entries pretty well indexed, while Google was only half there; they’re both OK for Thursday. So, at this moment in time, my search box, and a zillion others like it, are pointing to Google just because that’s the way we set it up, and it’s actual real work to go changing production systems, and the competition so far isn’t significantly better. I have no idea what the proportion of search coming through this kind of thing is, as compared to the volume going through the search-engine home pages. I bet that if you count the toolbars on the browsers, it’s getting up there. Via Google’s AdSense For Search, you can already get paid for sending searches to Google. I won’t use it, though, because if I read the terms and conditions correctly, you have to include a Google logo. Screw that; I like my minimalist little search box, and nobody but me and my employer gets any branding here. I’m sure Yahoo has a competitive offering, but I haven’t tracked it down. I’ll tell you one thing for sure though; if the search engines retain their quality-of-service parity, pretty soon the traffic will be dealt out totally based on who’s willing to pay the most for it. Where can I buy shares in Firefox?
 
Web Tracking Snapshot · There are many services that claim to be “blog search”, but that’s the wrong way to think about it. There are a (very) few occasions when I want to go and search for “what’s new on X”, and there are lots of ways to do that (the new Sphere is looking good in that space). But what I want to do 24/7, as long as the computer is turned on, is what I call Web Tracking: being told right away when there’s something new on the Web that I care about. I subscribe to a lot of Web Tracking services; herewith a snapshot of my impressions ...
 
Scoble ♥ RDF · Check out Scoble’s speculation on The Perfect Search: he’d like to find a hotel in New York with free WiFi, a good view, and good food, in a particular price-range. Rob, meet Tim Berners-Lee; Tim, meet Rob. Rob wants the Semantic Web. In particular, today’s freshest SemWeb flavor is something called SPARQL; see Kendall Clark’s human-readable intro. SPARQL is an answer to the question “What if I want to do SQL-like querying when I know perfectly well that everybody will be using their own incompatible database schema?” I’ve been a SemWeb skeptic, but I look at SPARQL and I think: Suppose you could assemble a ton of property-value pairs about web sites, and suppose on the front end you could build a nice responsive query page that allowed you to compose queries like Scoble’s hotel search; well then, SPARQL would be more or less exactly what you need to bridge the gap. Hey, isn’t Guha’s Alpiri project more or less that back-end? And isn’t Guha working at Google now? Hmmmmmm...
 
Buggy Google Blog Feeds · So Google has blog search. Summary: It’s fast, it’s reasonably complete, it’s stripped-down in the typical Google style, the result ranking needs work, the time window is way too deep. They’re also providing feeds, which is good, but the feeds are horribly buggy [Quick response; One big bug’s already fixed!] ...
 
How Big? Who Cares? · I’m really happy that Danny Sullivan comprehensively blew off this latest round of competitive chest-beating about search engine index size. As Danny says, “It’s absurd. It’s annoying. It’s a friggin’ waste of time.” And it’s pitiful because right now, Yahoo and MSN are showing signs of giving Google the first serious run for its money in recent memory. So those guys should stay away from these juvenile distractions.
 
Puzzling Search Study · I glanced at Tristan Louis’ Search Engine Comparison, thinking it was interesting but not very useful. I was surprised to see a few other bloggers discussing it as though it meant something. The number of pages that the various engines claim to have indexed, and the number they claim to return for any search, really don’t mean much. First of all, nobody’s got the time to look at more than a few dozen results—studies show that most people will never look past the first page. Secondly, even if you wanted to look at all the results, the engines probably couldn’t show them to you anyhow. Third, what matters is whether you get what you’re looking for. Almost all the modern engines do a pretty damn good job of getting you something appropriate and useful in the first handful of results. Who cares about the next million?
 
Technorati, Tags, Semantics · Hey, the Technorati beta is up. Looks much nicer, though I wish they’d lose the dude with the megaphone; goatees are so 1993. (Hey look, Technorati and Newsweek, sitting in a tree.) Among other things, the technorationals are making a concerted effort to prove that my doubts about tagging are misplaced—so are Shirky et al at You’re It!. It’s become obvious that tags are useful enough as a place to park search words for pictures & music & other stuff that doesn’t have words to search. Furthermore, I’ve heard a dozen compelling stories from people who are using tags to organize their own information and track trends; so it’s looking like the answers are: Yes, tagging is useful; No, it’s not a replacement for full-text search, even partially. On the subject of search, Sun’s Search Guy Steve Green is trying to push over the boundary between search and semantics.
 
Search Engine Rankings · Recently, someone from a Google competitor told me that they were catching up, within a few percentage points. I didn’t believe that at all, but I decided that intuition is boring and hard data is interesting. So I went and ran search engine rankings for ongoing weekly through 2005. The numbers are surprising, to say the least. [Update: Thought-provoking feedback, and some conclusions] [And more feedback from Search Engine Watch.] ...
 
Yahoo! Search FUSE · John Battelle reports on a conversation with Y!’s Jeff Weiner; it sounds like John heard more or less the same things I did. At the time I didn’t say too much about Jeff’s remarks, but I think that John’s piece, while good, bypassed a real interesting part. Y!’s rallying cry is FUSE: Find, Use, Share, and Expand. So do you think they can beat Google at finding or using? Well maybe, but I wouldn’t want to bet a business on it. But how about Share and Expand? Y! has relationships with a lot of people out there: email relationships, finance relationships, chatter relationships, you name it. Suppose they can make it real easy and attractive for the people in all those relationships to put some back; to Share stuff and Expand the Web. Being first in line to help everyone Find and Use that good new stuff? Sounds like a plausible line of attack to me.
 
Talking to Yahoo · I had a good talk yesterday with Jeff Weiner, Senior VP of Search and Marketplace over at Yahoo! I shouldn’t pass on what Jeff said; anyhow if he wants to talk to the world, he has a blog. But I can talk about what I said: first, Y! should be watching the Atom protocol work like a hawk, because they have two choices: either they try to beat everyone else out there and build the world’s greatest authoring tool, or they get behind a standardized protocol and let the cellphone guys and PDA people and everyone else compete to do it. Second, we were talking about improving search in general; near as I can tell, there isn’t a huge quality gap between Y!, Google, and MSN, and it’s hard to believe that any of them can sustainably get much ahead of the rest. On the other hand, I think Y! has a good chance to take on Google in the advertising space, both AdSense and AdWords, and maybe win. They know a whole lot of stuff about a whole lot of people; for example, they know my stock-market portfolio and what weather forecasts and maps I look up; they probably have more information about more individuals than anyone else in the business. On behalf of all those advertising sellers and buyers: it would sure be nice to see some competition. Maybe even some transparency.
 
Still Wondering About Tags · This whole related-tags thing has been around for a month, but Dave Sifry says it’s official. I went and tried a half-dozen and the results were all over the map. I think I spot a pattern where things that are more or less steady-state are lame (Vancouver, prostitution), while it works well on current events (Firefox, DeLay, Gomery). Which is intuitively plausible. But my question from last month still stands: Are tags useful? Are there any questions you want to ask, or jobs you want to do, where tags are part of the solution, and clearly work better than old-fashioned search? I really want to believe that tagging is big, a game-changer, but the longer I go on asking this question and not getting an answer, the more nervous I get.
 
A Cherry-Tomato Winner · The challenge has been met, and the crimson-vegetable award goes to... Google! It can now find images correctly based on metadata; for example Saskatchewan snow with plants, Tanya King, sweet pea shadow, Hogoromo dinner, and so on. Neither MSN nor Yahoo search can do this.
 
Crosstalk · Dear readers, honesty (and the story I’m about to tell) require that I spill the beans: there is lots of stuff here on ongoing that you can’t see. Since I have a pretty good writing environment, I compose lots of little pieces of one kind or another and then “semi-publish” them; they’re out there on the Web at an address that looks a lot like that of the fragment you’re now reading, but there are no pointers to them; security by obscurity, but good enough for my purposes. Last night I semi-published something and emailed a few people asking them to look at it. A couple hours later, I wondered if they had, and checked the server logs. I saw myself (twice, I’d corrected it), and... the Googlebot. What the hell? Did I accidentally press “publish”? No. Bafflement. Is my browser telling Google where I’m going?!? Unlikely... ah! Even semi-published pieces have the ads, which are a Javascript callback to somewhere in the Googleplex. So, that’s what’s happening... when AdSense displays on a page, at least sometimes, it tips off the robot army. So anyone who’s running AdSense gets indexed first & fastest. I can’t prove it, but it’s the simplest explanation, and it makes all sorts of sense. [Update: If you look real closely at that robot (for geeks, at the “User-Agent” field), it’s not quite the same as the normal googlebot; apparently this beast is just reading the text of the page to figure out what contextually appropriate ads to display. Thanks to all the people who wrote to point this out.]
 
Do Tags Work? · I was sitting up and got pinged by Dave Sifry about Technorati’s new related-tags feature; Technorati thinks that Baseball is related to Sports, MLB, Football, Basketball, Natural Philosophy (gotta love that), and tickets. Some don’t work that well, but the idea is compelling. I’ve been thinking about this stuff a lot, and I have a question: Do tags work? It shouldn’t be too hard to find out ...
 
Picture Search and Gravel Hauling · If you’re in Florida near Inglis-Yankeetown and you want to haul dirt, rent a truck from Tim Bray. No relation, but I gotta say that’s a nice-looking truck. I found it using the new Ask.com picture search, which has a much nicer presentation than Google. However, it (like all other search engines) sadly fails the Cherry-Tomato Challenge.
 
Real Information Retrieval · Summary: find a Real Librarian. The narrative includes demographic trends and Bo Diddley ...
 
Who’s Searching · I see that Forrester’s excellent Charlene Li is expecting MSN search to gain on Google. Her argument sounds plausible, so I went and checked my logfiles. Since Sunday, I’ve had 1,222 people arrive at ongoing via Google, 166 via search.yahoo.com, and 49 via MSN. If it gets a little closer, I’ll start having to run a regular Search Market Share graph along with my Browser Market Share offering.
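For the curious, here is the arithmetic behind those three numbers as a minimal sketch. It assumes the three counts above are the whole universe of search referrals (they aren't; other engines are simply left out), and the dictionary and variable names are made up for illustration.

```python
# Referral counts quoted in the entry above (arrivals since Sunday).
visits = {"Google": 1222, "Yahoo": 166, "MSN": 49}

# Assumption: these three sources are treated as 100% of search traffic.
total = sum(visits.values())

# Percentage share per engine.
shares = {engine: 100 * count / total for engine, count in visits.items()}

for engine, share in shares.items():
    print(f"{engine}: {share:.1f}%")
```

On these figures Google has roughly 85% of the three-way total, which is why the gap would have to narrow a good deal before a regular graph is worth the trouble.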
 
Green On Search · Care about search in general? Then you probably should start reading Steve Green; he’s in Sun Labs and knows more about search technology than just about anybody, way more than me. Plus, he’s amusing.
 
The Cherry-Tomato Challenge · I have recently adjusted the ongoing software so that each and every image has descriptive text in both its alt and title attributes. This is good accessibility practice and should also make it possible for search engines to find my pictures. But they can’t. The image here, which is in a file whose name, unhelpfully, is IMGP0990.png, is correctly labeled in both title and alt attributes as “Sunlit cherry tomatoes on white-painted wood.” I just now visited John Battelle’s helpful list of search engines and lots of them offered “image search” capabilities, but not one turned that picture up when I searched for “sunlit cherry tomatoes”. (Lots of them turn the page up when you do an ordinary text search.) How hard can it be? I hereby promise that when I find a credible general-purpose Web Image Search tool that leads me to that picture via “sunlit cherry tomatoes”, I will publish a rave review here and do my best to spread the word.
 
Bot Droppings · I was idly watching my server logfiles today, pretty quiet on Sunday afternoon so it was mostly just the crawlers, and observed some puzzling behavior from the Googlebot. So I ran a few reports ...
 
On Search: Sorting Result Lists · I was talking to someone building a search engine and he was moaning about sorting result lists in real time, only you don’t have to. Anyone who’s built a big search engine eventually works this out, but posting it here might save a few minutes for some future developer. The idea is, you should never have to do an O(N·log(N)) sort on a result list. [Update: Experimental verification.] ...
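The teaser doesn't say which trick the full essay uses, so what follows is only one common way to dodge the full sort, not necessarily the one described there. The assumption is that a user only ever looks at the first k results, so a bounded heap can pick the k best out of N hits in O(N·log(k)) time. The function name, the `(score, doc_id)` tuple shape, and the sample data are all hypothetical.

```python
import heapq
from operator import itemgetter

def top_k(scored_hits, k):
    """Return the k highest-scoring (score, doc_id) pairs, best first.

    heapq.nlargest keeps a heap of at most k items while scanning the
    input once, so the whole list is never sorted.
    """
    return heapq.nlargest(k, scored_hits, key=itemgetter(0))

# Hypothetical result list: (relevance score, document id).
hits = [(0.42, "doc-a"), (0.91, "doc-b"), (0.13, "doc-c"), (0.77, "doc-d")]
first_page = top_k(hits, 2)
print(first_page)
```

When k is a screenful and N is in the millions, log(k) is a small constant, which is the whole point.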
 
Pix MisGoogled · Google indexes pictures all wrong. Here at ongoing, I used to store all my own pictures with nice names I invented on the spur of the moment. Sometime last year, I realized my cameras were thoughtfully giving each shot a nice guaranteed-unique name, so I just started using that; for example, that slug is in a file named IMG_2663.jpg. But, I’m careful to always supply an appropriate alt text, like so: Vancouver slug. It turns out that Google pretty much ignores the alt text, which is irritating, so you’ll find my roses and prairiescapes and Foo Campers in Google only with a lot of effort. What’s really weird is that Google does put a lot of weight on the actual file-name. The reason I noticed this is that in any given week, the most popular image on ongoing is the picture of Diablo found here, apparently because it’s in a file called diablo.jpg. But you know, there are lots of pictures of Diablo out there, and not that many of the chapels at Brussels Airport. So, Google could do a lot better here. [Note: I’m talking about Google image search here, not regular search. It’s still broken.]
 
Report From the Intel Community · This has nothing to do with a California chip maker. Rather, it’s about a trip I recently took to a conference called Intelink, where the people gather who run one of the world’s biggest and most interesting intranets; the one that serves the community of U.S. Intelligence professionals ...
 
Another “Intelligent Search” Skyrocket · In the On Search series, I wrote a piece called Intelligence that explained why intelligent search is hard, but that it is so eagerly desired that there are predictable flurries of excitement every so often over the next, uh, pretender. This time, Cringely has been sucked in. Well, not entirely, he loads up with caveats too, but it’s a little sad to see one of the really big-name writers point to such tattered hype. Earth to Bob: the problem with AI isn’t that the “A” part isn’t fast enough, it’s that we don’t understand the “I” part. I wonder what it takes for some obscure little company peddling a dream that has been around the track so many times to get airtime with this guy? Cringely needs to pull up his game a bit: in the last couple of weeks, he was the only person on the planet to conclude that the Sun-Microsoft deal was somehow bad for the Java Desktop System; not that he actually advanced any arguments on the subject, just proclaimed it. The people in Redmond are smarter than Bob and I’m pretty sure that the deal isn’t making them worry less about this.
 
What People Care About · Herewith the top couple of dozen search strings that brought people to ongoing, sampled over the last few days. Let this be a lesson to you on what you can write about without developing a “certain reputation” ...
 
Googlestorm · Sometimes you glance at your server logfile and say “Huh?” Click for the picture; impressive on one hand, irritating on the other. [Update: Eureka! Figured it out; fixed the problem.] ...
 
Cleanup Plus Search · Another batch of ongoing housekeeping. I added a search field up and to your right, which just outsources the problem to Google. Eventually there’ll be something with an ongoing look coming out of this. Also I fixed a long-standing bug in the date display, which convinced me the whole date-hierarchy subsystem was basically broken, so I re-did it; check it out. Quite likely I broke something; if so let me know, my email address is on the front page of a Google search for my name. Also, IE6 was refusing to render &apos; properly for reasons I couldn’t figure out, so I skated around that.
 
Self-Limiting · I was talking today to this really smart guy named Jonathan Leblang who works for A9, and he said “You know, Google’s success may conceal a death warrant.” I said “Huh?” He said “Well, the most useful Web pages used to be the ones that aggregated a bunch of useful links, and so people would point to those and Google would find them. Nowadays, why would anyone go to the work to put a page like that together if you can just rely on Google to find stuff?” Hadn’t thought about it that way.
 
Search Variables · Scoble thinks he’d like to have access to all the variables behind search engine result rankings, and Battelle agrees. Hmm... these are smart guys, but I think they’re both wrong on this one. Experience shows that most users won’t even open up an “advanced search” facility, they just want to type their 1.3 words into the search window and let the search tech do its stuff. And I’m one of them. I bet that most times, I’m going to get good results with less fuss & bother by carefully selecting the search terms I type in than by fiddling with knobs on the side of the engine. Because human language offers a much subtler and more sophisticated set of controls and variables than any software I’ve seen.
 
On Search, the Series · This series of essays on the construction, deployment and use of search technology (by which I mean primarily “full-text” search) was written between June and December of 2003. It has fifteen instalments not including this table of contents ...
 
Turn On Search · This is the last in my series of On Search essays. I’ve written these pieces because I care about search and because the lessons of experience are worth writing down; but also because I’d like to change this part of the world. In short, I’d like to arrange for basically every serious computer in the world to come with fast, efficient, easy-to-manage search software that Just Works. This essay is about what that software should look like. Early next year I’ll write something on how it might get built ...
 
John Battelle on Search · It turns out that John Battelle, who’s made his mark at a bunch of different pretty-good industry publications, whom I met at the Foo Camp, and who interviewed me about it for Business 2.0, is one of the few people in the world who obsess about search technology as much as I do. I just stumbled across his excellent Searchblog; anyone who’s taken the trouble to plow through my essays on the subject will probably find a subscription worthwhile.
 
On Search: Robots · Robots—also known as spiders—are programs that retrieve Web resources, using the embedded hyperlinks to automate the process. Robots retrieve data for all sorts of purposes, but they were invented mostly to drive search engines. Herewith a tour through Robot Village ...
 
On Search: XML · Searching is all about text, and the proportion of all the world’s text that is XML keeps getting higher and higher. So if you’re going to do search, at some point you’re going to have to think about searching XML. Herewith a survey of some of the issues and problems (which, like other essays as we approach the end of On Search, contains opinions among the reportage) ...
 
On Search: Interfaces · Herewith an investigation of how search software ought to interact with the outside world. I’ll start with a look at the current state of the art, and propose another (I think better) approach. This is, I think, the third-last On Search piece, so a few words at the meta level about that ...
 
On Search: Result Ranking · If you’re searching a big database, unless you’re lucky you’re usually going to get a lot more matches to any given query than you want to look through. So it really matters what order that result list is in. Google got to be famous in large part because they do a good job on this; the stuff near the top of their list is usually about what you want, and if you don’t see what you need near the top of the list, it may not be out there. Herewith some remarks on how to go about sorting result lists. In general, the news is not very good; however, there are some promising techniques that are under-explored ...
 
I Ferment Not · The other day I looked at ongoing’s top referers and not for the first time, saw that a Google search for “fermentation” was right up there. What happened was, I wrote a piece called Language Fermentation back in May that was severe programming-language-theory geekery, and, well, a lot of programming-language geeks linked to it, and now any luckless student of zymurgy or budding opimian is going to find ongoing at #2 in their result-list. While this is evidence that Google’s PageRank sometimes goes off the rails, the first entry in that result list is (quite properly) the Journal of Fermentation and Bioengineering.
 
On Search: I18n · The “On Search” series resumes with this look at the issues that arise in search when (as you must) you deal with words from around the world written in the characters that the people around the world use. I18n stands for “internationalization.” ...
 
What is Nasty? · Questions Google answered by sending people to ongoing: What is a colophon? What is a namespace? What is a web tag? What is a wife-beater? What is binary search? What is character string? What is "DC" day? What is fermentation? What is java languages essay? What is measured numbers? What is nasty? What is natural language query? What is ongoing? What is peaw? What is precision and recall? What is Price Server? What is product management? What is refactoring software? What is rocket science? What is semantics? What is sharecropping? What is stax-volt? What is text, really? What is the antonym of kiosk? What is the best way for writing binary tree search? What is the dewey decimal call number for Islam? What is the extension of a rddl document? What is the fastest string manipulation language? What is the use of a book thought alice without pictures or conversation? What is throughput? What is tribalism? What is unicode? What is wi-fi? What is XML API technology? What is xml rpc REST?
 
On Search: Metadata · In the Web’s early years, the overwhelming favorite among search engines was Yahoo. Today it’s Google. Neither has actually had better text search technology than the competition. They won because they used metadata effectively to make their services more useful. In this ninth On Search episode, a survey of what metadata is, where it comes from, and how to use it ...
 
On Search: Stopwords · Here’s a Google search for a famous phrase, to be or not to be; give it a try and see what happens. When you look at word frequencies, it appears that there are a few words that appear unreasonably often and carry unreasonably little information. They are called “stopwords,” and this (brief) eighth bead in the On Search necklace considers them ...
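To see why that famous phrase is such a nice test case, here is a naive sketch of stopword filtering. The stopword set is a tiny illustrative list I made up, not any engine's real one, and the teaser doesn't say how real engines treat these words; dropping them outright is an assumption for the sake of the example.

```python
# Illustrative stopword list only; real lists are longer and vary by engine.
STOPWORDS = {"a", "an", "and", "be", "not", "of", "or", "the", "to"}

def content_terms(query):
    """Lower-case the query, split on whitespace, and drop stopwords."""
    return [word for word in query.lower().split() if word not in STOPWORDS]

hamlet = content_terms("to be or not to be")
other = content_terms("the history of fermentation")
print(hamlet)
print(other)
```

With this list the Hamlet query filters down to nothing at all, while the second query keeps its two informative words.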
 
On Search: UI Archeology · This chapter of the On Search saga is a side-trip; a look at an unusual search user interface I built a dozen years ago. One of the reasons it didn’t catch on back then was that there wasn’t enough XML in the world. Now that there is, maybe this bit of legacy code will provoke an idea or two. Just maybe, it contains some ideas that will be useful to the folk who are wondering how to make the power of XPath and XQuery useful to ordinary people ...
 
On Search: Squirmy Words · In this, the sixth instalment of my search saga, a survey of the fuzzy edges of words and their meanings and the (surprisingly moderate) consequences for search systems ...
 
On Search: Intelligence · Here’s the problem: searching for words isn’t really what you want to do. You’d like to search for ideas, for concepts, for solutions, for answers. Instead, your typical search engine moronically sorts through its postings, and tries to solve your problems by looking at which words appear where, and how often, and so on. What we’d really like is an intelligent search engine. This essay is mostly about why we’re not likely to get one any time soon ...
 
On Search: Precision and Recall · Searching is a branch of computer programming, which is supposed to be a quantitative discipline and a member of the engineering family. That means we should have metrics: measures of how good our search techniques are. Otherwise, how can we ever measure improvements in one system or the differences between two systems? “Precision” and “recall” are the most common measures of search performance. But they’re not as helpful as we’d like ...
 
On Search: Basic Basics · In this ongoing safari through the Search hinterland, I had thought next to talk about popular features of search engines and their costs and benefits and so on. But I think that everything else I want to cover will be easier if there’s a shared view of the machinery making it all go. So here’s a tour through the basics of search-engine engineering ...
 
On Search: The Users · Herewith Chapter Two of the search travelogue. Between late 1994 and early 1996 I was occupied full-time and then some building and running one of the first Web search engines, the long-departed Open Text Index. There weren’t many million-hits-a-day sites back then. When you’re running that kind of thing, you spend a lot of time watching your logs to figure out what your users are doing and what makes them happy. There are two lessons that loom larger than all the others put together ...
 
On Search: Backgrounder · This is the first of a series on search, by which I mean full-text search. Anyone who uses computers now uses search pretty well every day, so this is an important chunk of our technology spectrum. This piece covers the business and history angles; future instalments will explain how search engines work and the interfaces to them. I plan to conclude with a description of the next search engine, which doesn’t exist yet but someone ought to start building.
(Updated: Microsoft Indexing found. Slashdot search explained.) ...
 
The Death of Scholarship? · Some maze of twisty little blogpassages led me to this study of Student Searching Behavior. It's really long and wordy, but the soundbite is that when students are asked to look up something relevant to their academic work, 45% of them go to Google, 10% of them go to the local library catalog, and the rest scatter among other search engines. I like Google as much as the next person, but I still find this really disturbing, especially that 10% figure ...
 
The Natural Language Query Fallacy · This is provoked by an article at Mary Jo Foley's Microsoft Watch, often quite a useful place. She suggests that one of the wonders of Longhorn, the mighty Windows-to-be, is that users “will be able to type in commands (such as, ‘Find all the spreadsheets I generated last year that included sales data from Bob Jones’), and Longhorn will auto-magically return the results.” I'm wondering why this is supposed to be a good idea ...
 
The Google vs. Blogs Controversy · I see that Wired has picked up on Andrew Orlowski's overwrought attempt to create a news story about the effect of the admittedly-incestuous blog network on Google results. I'm not sure how representative ongoing is, but a look at my data doesn't really suggest there's much here to be concerned about ...
 
I am an employee of Amazon.com, but the opinions expressed here are my own, and no other party necessarily agrees with them.

A full disclosure of my professional interests is on the author page.