Here’s how to get into a lot of trouble:
Suppose you (like me) love the intellectual wealth found in free-form text on the Internet.
And (like me) are a reasonably competent programmer.
And (like me) have derived value and pleasure searching Twitter.
And (like me) you look at this nifty new Fediverse thing and see that it has nice Web APIs so you could build an app to vacuum up all the stories and laments and cheers and dunks and love letters and index ’em and let everyone search ’em and find wonderful things! So you lurch into the Mastodon conversation, all excited, and blurt out “Hey folks, I’m gonna index all this stuff and let the world in!”
That’s when you get your face torn off.

Contents · This (too long, sorry) essay does the following:

  1. Surveys the current opposition to Fediverse search.
    Tl;dr: Privacy!

  2. Describes the push-back from experienced Web-heads.
    Tl;dr: Huh?

  3. Outlines Mastodon’s current search capabilities.
    Tl;dr: Not terrible.

  4. Describes my position.
    Tl;dr: It’s unethical to ignore privacy concerns.

  5. Criticizes Mastodon’s current privacy capabilities.
    Tl;dr: Pretty terrible.

  6. Argues that this is a social/legal problem, not a technology problem.

  7. Offers specific policy and legal recommendations to improve the Fediverse privacy posture.

  8. Paints a picture of what success looks like.
    Tl;dr: Good privacy, useful search.

Anti-search · It’s like this: When you post to your blog or your public Twitter account, your words and pictures instantly join your eternal public record, available to everyone who loves or hates you or doesn’t care. Who can build search engines, not to mention ML models and adTech systems and really anything else, to help the world track and follow and analyze and sell things to you.

And, if you’re vulnerable, attack you, shame you, doxx you, SWAT you, try to kill you.

The people who built Mastodon, and the ones operating large parts of it, do not want that to happen again. Full-text search (with limited exceptions) has, as a matter of choice, been left out of the software. Why? Let me give the stage to:

So, should you sally forth as related in the first paragraph above, people will say nasty things to you and tell you to please stop working on your project. Should you proceed anyhow, they will take strong measures to block you and put any instance you seem to represent at risk of de-federation. They’re serious.

I’m not speaking hypothetically. In the dying days of 2022 I watched in real-time as this eager young fellow bounced onto the stage and said he had this new full-text thing he was about to launch, it would index all the instances your instance was federated with and it was carefully built to penetrate various Mastodon blockages. And anyone who didn’t want to be scraped and indexed had to opt-out. (He also claimed it was going to be available only to “genuine admins”.)

It did not go over well. The hostility and anger among the admins was palpable, and the next day there were people following up on the thread talking about de-federating the dude’s whole instance if that was the kind of person there.

This Open Letter from the Mastodon Community is another example of eager information-harvesters running into rage.

So, don’t say you weren’t warned.

Pro-search · Perhaps you find this attitude surprising? I did, initially, and many Web veterans’ reactions range from disdainful to hostile.

Here is Alex Stamos: “I find the arguments against officially supported Fediverse search pretty tedious, as you have to be really naive to believe that a bunch of bad-faith actors aren’t already quietly archiving everything…”

Here is Ben Adida: “I probably should then rephrase to: great search is going to happen or Fediverse might well remain a niche app.”

And here is a project called Mastinator slamming the door on its way out: “The Fediverse has some big problem coming.”

There are lots more of these reactions, and they all say more or less the same thing: “Search is good, and you can’t stop it, and people are crawling your data anyhow.”

I’m a bit puzzled by that “But people are already doing it” argument. Yes, Mastodon traffic either is already or soon will be captured and filed permanently as in forever in certain government offices with addresses near Washington DC and Beijing, and quite likely one or two sketchy Peter-Thiel-financed “data aggregation” companies. That’s extremely hard to prevent but isn’t really the problem: The problem would be a public search engine that Gamergaters and Kiwifarmers use to hunt down vulnerable targets.

What Mastodon does now · Just to be clear, Mastodon offers a perfectly decent search capability. You can search hashtags, and what’s even cooler, you can follow them like you do another person. I like this but it does tend to leave too many posts #bulging #with #ugly #hashtags like a crazed corporate SEO vampire.

You can search your own posts and a few other useful things. So it’s not as though there’s blanket condemnation of the idea of search, just a whole lot of concern about what’s allowed and how it’s used.

Where I stand · I think privacy is good and ignoring the issue is unethical. People should be able to converse without their every word landing on a permanent global un-erasable indexed public record. Call me crazy.

Disclosure: I’ve personally been unashamedly exuberantly public on social media since the first time I stumbled onto, um, MySpace? Orkut? Can’t remember.

I like a high-intensity stream full of well-connected voices, and I like being able to get a lot of people’s attention when I have something to say that I think is important.

But my vibe shouldn’t be the only vibe on the menu. Some people just want to talk about stuff with a few people, they don’t want to be influencers or to mainline the zeitgeist.

Some people are from groups endangered by online hate and violence, or experience precarity such that they just can’t afford to have every word on the permanent record. Some people are just shy.

I am a hyperoverentitled thick-skinned white boy who can laugh publicly at online assholes without much concern for consequences. It’s crazy to think that social media should be exclusively optimized for people like me.

There are problems · To start with, notwithstanding all the above, I’d like more search too. That’s not a big problem because I think there’s a path forward that’s useful and still preserves the current privacy-centric Mastodon values.

Then there’s the big problem…

Mastodon’s privacy story is terrible! · Seriously. Unless you take special specific measures, every little snippet you post on Mastodon has a URL and anyone can fetch it with a Web browser or computer program and then… well, do whatever the hell they want with it. Mastodon as it stands today is not built to protect privacy.
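To make that concrete, here is a minimal Python sketch of what "anyone can fetch it with a computer program" looks like. It assumes the standard Mastodon REST endpoint `GET /api/v1/statuses/:id`; the instance name and post ID are placeholders, and an instance that has turned on stricter access settings may refuse the request.

```python
import json
import urllib.request


def status_url(instance: str, status_id: str) -> str:
    """Build the REST URL for one post (instance and ID are placeholders)."""
    return f"https://{instance}/api/v1/statuses/{status_id}"


def fetch_public_status(instance: str, status_id: str) -> dict:
    """Fetch a public post anonymously: no token, no login, no click-through."""
    with urllib.request.urlopen(status_url(instance, status_id), timeout=10) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Hypothetical values; substitute a real instance and a public post ID
    # before calling fetch_public_status().
    print(status_url("example.social", "123456"))
```

Nothing in the request identifies the caller or records agreement to any terms, which is the gap the rest of this piece is about.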

You can get a sort of weak partial privacy if you:

  1. Post in “Followers only” mode (which can be done per-post or as a default).

  2. Protect your account so you get to approve or deny anyone who wants to follow you.

  3. Get lucky, as in none of your followers republish your posts to the world or gateway them to the alt-right.

This will probably keep you out of some rando’s public-search-engine experiment.

But it doesn’t matter, because the vast majority of people on Mastodon don’t understand the difference between its sharing modes and probably don’t protect their account, because why should anyone have to do that?

And anyhow, we’re all…

Missing the point · The point is, we’re not trying to solve a technical problem here, we’re trying to solve a social problem. We don’t want people to do certain irritating and dangerous things with data scraped off the Fediverse. So, when there are things that people can do but shouldn’t, what tools do we usually apply? Hmm… I guess when I said “social” I meant “legal”.

So here’s a question: When I publish something, who is licensed to fetch it or, having fetched it, store it and process it, or having stored and processed it, share the results with the world, or with an employer or customer?

Mastodon doesn’t help here. When you retrieve a post, you don’t have to log in to Mastodon first, so any terms and conditions you might have agreed to don’t apply. You also don’t have to click through a terms-of-service pop-up. When you follow somebody, at no point (that I’ve seen anyhow) do you get notified of how they’d like their posts to be treated.

So why shouldn’t you feel free to go ahead and share what you’ve received to the world or, if you’re a Search weenie, write a program to follow people and index their posts?

Suggestions · Stated in the most general possible way: The Fediverse needs to get its content-licensing shit together.

I have ideas about how this might be done, which I’m about to offer, but I Am Not A Lawyer and I am especially not a copyright or intellectual-property specialist; so take these as lightweight amateur suggestions designed only to start conversation.

Disclaimers in place, I propose the following. (Note that some of these proposals are not fully compatible with each other.)

  1. A server should deliver posts only to people logged into the instance, or to other instances it is federated with.

  2. Servers should deliver posts only after a click-through acknowledging the license covering those posts.

  3. The Fediverse needs to work with IP lawyers, and maybe Creative Commons, to build a menu of licenses that people can choose to apply to their posts.

  4. When you follow someone, you should be forced to acknowledge their default content license, and re-acknowledge if and when they change the default.

  5. The choice of default content license for an instance is very important and needs to be communicated clearly in human language not legal jargon, at the time of registration.

    Many members of the current admin community would like it if the default license were always highly restrictive such that you’d have to explicitly opt in to making your posts eligible for mass harvesting. I can see their point, but if I’m building an instance for people who get paid to be public, for example journalists or DevRel people, I’d probably pick the opposite default.

  6. The content-license menu should have a lot of options. Some line up pretty well with Mastodon’s current categories: “public”, “unlisted”, and “followers only”. But I can imagine finer-grained exclusions, such as allowing full-text indexing but only for accounts on the same instance, or allowing use for search but no other applications. (No ML model building!)

  7. I’m also pretty sure that content licensing should have a temporal component. That is to say “Yes, harvest this and use it, but only for two weeks beginning now.” Mastodon already has optional built-in scheduled post deletion and this would have to be consistent with that.
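As a way to see how items 6 and 7 might fit together, here is a small Python sketch. Everything in it is hypothetical: the `ContentLicense` class, the `Use` names, and the `permits` check are illustration only, not anything Mastodon or ActivityPub defines, and a real license would need the lawyers mentioned in item 3.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from enum import Enum, auto
from typing import Optional


class Use(Enum):
    """Hypothetical things a harvester might want to do with a post."""
    SAME_INSTANCE_SEARCH = auto()
    PUBLIC_SEARCH = auto()
    ML_TRAINING = auto()


@dataclass(frozen=True)
class ContentLicense:
    """A made-up license record attached to a post."""
    allowed: frozenset                     # Use values the author permits
    granted_at: datetime                   # when the permission window opens
    valid_for: Optional[timedelta] = None  # None means no expiry

    def permits(self, use: Use, when: datetime) -> bool:
        """True if `use` is allowed and `when` falls inside the window."""
        if use not in self.allowed or when < self.granted_at:
            return False
        if self.valid_for is None:
            return True
        return when <= self.granted_at + self.valid_for


# "Yes, harvest this for search, but only for two weeks beginning now."
posted = datetime(2022, 12, 30, tzinfo=timezone.utc)
two_weeks = ContentLicense(frozenset({Use.PUBLIC_SEARCH}), posted, timedelta(weeks=2))

print(two_weeks.permits(Use.PUBLIC_SEARCH, posted + timedelta(days=3)))   # True
print(two_weeks.permits(Use.PUBLIC_SEARCH, posted + timedelta(days=30)))  # False
print(two_weeks.permits(Use.ML_TRAINING, posted + timedelta(days=3)))     # False
```

The code is the easy part. A well-behaved indexer could run a check like this before storing anything, but only a license with legal force gives anyone a reason to.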

I’m pretty sure I’m missing important dimensions. And I’m totally sure that creating the dialogues necessary to support this constitutes a UX designer’s nightmare.

Most important, I’m convinced that this is a conversation that the Fediverse leaders need to start having, and start having now.

What success looks like · I’d like it if nobody were ever deterred from conversing with people they know for fear that people they don’t know will use their words to attack them. I’d like it to be legally difficult to put everyone’s everyday conversations to work in service to the advertising industry. I’d like to reduce the discomfort people in marginalized groups feel venturing forth into public conversation.

But… I’d also like to search the world’s conversation to find out what’s happening right now. How are things going around Bakhmut? How are people feeling about the latest shows in the Sierra Ferrell tour? What’s being posted about the World Cup semifinal? Are the British Tories about to knife another idiot leader?

And especially this: How are they doing on fixing the winter power outage in Saskatchewan, where my elderly mother lives? Not hypothetical; it happened the evening of December 27th, and being able to track the status with Twitter search meant I didn’t have to organize an emergency intervention from two time-zones away.

I’d also be interested in a certain amount of historic search: What exactly were world leaders saying around last February 24th? How did they describe that new AWS feature at re:Invent 2017? And so on; but only if I’m confident the people who posted what I’m searching are comfortable with their posts being used this way.

I think we should be able to get there. But it’s not a technology problem.



Contributions

From: Andrew Reilly (Jan 02 2023, at 13:18)

Historically, the role of capturing ephemeral conversations or pronouncements and publishing them with explanatory context has been the job of journalism.

You could argue that journalism's time is past, and that its business model is broken, and that it doesn't scale to the number of interesting events and concerns in the world, and that search just does all of that work "better" in some sense. Perhaps it does, but I'm not sure.

What the journalism model does do though is provide an established legal framework for this archival, reporting and republishing process. You know who to sue if you don't like the way you've been quoted, or the context in which you've been put, or whatever, and most nations have somewhat appropriate laws to address the harm issues that you raise, when "traditional" publishing is involved.

From: Mark Gardner (Jan 02 2023, at 14:51)

Very good, very thoughtful, and very early on equating the fediverse with the 8,000-pound elephant in the room.

So in addition to the deep dive into Mastodon technology and culture(s), you also have to map the same concerns across every other type of system that can either federate directly via ActivityPub or via a bridge from other protocols and formats. These inevitably have their own technical solutions in various stages of half-bakery to what you convincingly show is a social problem.

I’m not suggesting we need to boil the ocean to make some meaningful progress, only that a content licensing framework that both honors privacy wishes while making the world’s conversations searchable has to include everywhere the world is having them.

From: Corbulo (Jan 02 2023, at 15:33)

It's hilariously naive to reject a search function. Whatever vision (there are many) they have for the Fediverse, it won't be tenable without a search function. The principle behind it is 'decentralization', but decentralization is heavily dependent on indexing. That's why Google was made and became so popular. A good search engine is less critical to a centralized net.

I'm increasingly wondering if there may be other motivations behind mastodon. No search function+decentralization=hive of scum and villainy.

From: Julie Goldberg (Jan 02 2023, at 22:17)

Thanks for the thoughtful review of this issue.

I disagree with those who argue that hashtags are a substitute for search because this does not comport with how people create and search hashtags.

Traditional indexing (catalogs, databases) employs controlled vocabulary because people don’t naturally converge on the same names for the same idea, even within one language.

An organization or brand may share the exact hashtags people can use to follow or discuss an event (i.e. #NJASL2023, #AvatarMovie, #WorldCupQatar), but left to our own devices, we will rarely settle on one useful term to describe a multifaceted event.

To say that hashtags are easily searchable (true!) assumes that people will guess the hashtags that others are using (unlikely without a full-text search to help users figure out what hashtags others are using to label the desired content!)

When Old Order Orthodox Mastodonians have scolded me by saying that we don’t need search because we have hashtags, it feels like gaslighting. You don’t need to have taken a graduate class in human information behavior to know that people simply don’t work that way.

From: PHenry (Jan 03 2023, at 04:30)

I think part of the problem is that too many people don't understand that Mastodon, as initially envisioned, wasn't meant to replace Twitter or even necessarily compete with Twitter. Twitter is, as made fairly clear here, meant for global reach and to talk about anything and everything and can serve as a consumption-only feed. You don't have to post to Twitter to enjoy Twitter or find it useful. Mastodon wasn't meant to be like that. It was meant for you to participate and engage in discussion. It's increasingly becoming clear, however, that the more people that jump on board, the more they want it to be like Twitter. And they're not 'wrong'. It's just that some instances like the feeling of a group of friends that can discuss things openly without worrying about others who may only serve to add toxicity to the discussion. And it's not that they want to exclude the world, but they also don't want just anybody to join either. So, to question the motivations behind Mastodon as being shady, as one other commenter did, is missing the point of Mastodon altogether. It's viewing Mastodon through Twitter-colored glasses. And Mastodon *is* going to have to support the growing desire to be more "Twitter"-like, but it's going to have to do so carefully and I do feel many of the suggestions made could serve as useful concepts to build off of to do so.

From: Tom Boutell (Jan 03 2023, at 04:43)

As one of the enterprising crazy kids (*) who had the very same impulse, F'd Around by polling his vastly smaller number of followers, and also Found Out that people do not want this, I greatly appreciate your detailed analysis.

One thing that stands out to me: people hashtag stuff because they want it to be found. Would a search engine that only indexed hashtags, but did so across many more instances, be regarded as welcome? I guess it's not that different if you do that hashtag search on mastodon.social, but paradoxically smaller instances (which are nice to avoid centralization) have impoverished search capabilities (because there is not enough of a followed-by-people-who-are-here effect) and would maybe benefit from a centralized search engine.

Buuut it's an edge case and the subtleties probably wouldn't be appreciated. If I'm even right, that is.

(*) "Kid" only in the loosest possible sense

From: Nimish G (Jan 03 2023, at 06:08)

I'm new to Mastodon but I really love the 'protect people from real-world threats' thinking they have going in. My mind and body are still processing it.

I've learned to be afraid of expressing certain social problems on social media because of the (at best) trolling backlash and (at worst) genuine credible threats that come from it... threats that disproportionately affect certain social groups, sadly.

Most attempts to protect such groups, traditionally, have fallen on to callous, detached, pseudo-academic ears that say "well, that's nice but we have a right to harvest data. Our curiosity outweighs your right to express yourself without threat."

The usual tech response is usually to treat an attempt to protect people as the same as information censorship and fight it with the same tenacity. I've gotten used to this too, and have just accepted that I'm not allowed to express any real, personal social problems online because I run the threat of the alt-right and the detached pseudo-academic techies that will defend their right to harass in the name of 'research' or 'free speech' or something.

I honestly thought that's where you were going with this piece, but somewhere in there you did something amazing: you recognized the concerns as legitimate.

I cannot state how wonderful of an act this is!

As a tech-head myself, I have a lot of thoughts about technical details, whether it be licensing or encryption or clearer UI, but rather than going in to that just now, I want to thank you for seeing these concerns as legitimate (and not antagonistic) and trying to think of ways to address those concerns as opposed to treating them as invalid or looking for a fun tech way to negate them.

That's true progress in my mind, and if more people have that, then I think there will be an awesome solution to be had :)

From: PeterL (Jan 04 2023, at 10:25)

Google can find things I wrote on Usenet in the late 1980s ... some is pretty stupid in retrospect, but it's just part of technical discussions about compilers and programming languages, so I'm unlikely to get into trouble for that (also, my email address has changed a dozen times since then even if my name hasn't).

So, I don't see a problem with indexing certain kinds of discussions, even if their content might be mildly embarrassing. The problem comes with socially-sensitive discussions. And people in such discussions aren't likely to understand the implications of clicking on a button when they post that says "Allow search engines to find what you just wrote?"

(FWIW, I find Twitter search to be mostly useless; and when I was using Usenet there were no search engines. So, perhaps, there isn't much value in searching Mastodon, and the status quo is fine?)

From: Mark Nottingham (Jan 04 2023, at 18:44)

I don't think copyright is going to give you what you want here. Offering a search service is typically viewed as 'transformative' and therefore qualifies as fair use -- e.g., 17 USCS § 107 in the US. See eg Authors Guild v. Google, Inc.

From: Yesplease (Jan 12 2023, at 01:15)

I would love to see licensing integrated to the metadata on every interaction. Including potentially the ability to limit what licenses are allowed on replies in-thread.

Flickr did licensing for photos right (unlike Instagram), it's time for small distributed text to join...

Legal counsel required, good luck!

Licenses can never stop the trolls/theil-elons & snoops, but they lay a framework for legitimate use.

From: hamish (Jan 17 2023, at 01:04)

This is a (quite) question of #openweb vs #closedweb

The Fediverse worked/s because it is #4opens at a protocol and #UX implementation, it’s a #openweb project, this is a good thing. Yes, there is a history of well worded “white lies” about privacy and safety that have helped to change social behaviours. A kinda not bad thing.

One of the #4opens is #opendata, every post has a URL, there is no encryption at all, in this all “privacy” is based on trust, a good thing.

I agree with the need for #openlicence for content, it’s one of the #4opens, a good thing.

Where I worry, a little, is some people seeing this (often unthinkingly) as an opportunity to push #closedweb thinking into a #openweb project. This is a small problem that is good to keep an eye on :)

My thought is to set a permissive CC content licence as default (the power of default) and let users change this to completely open or a more restricted licence, then “trust” coders/people to respect this.

From: Simon Gray (Jan 29 2023, at 23:52)

I think your summary of the issues is good, your analysis of the need for solutions is pertinent, but sadly I think your proposed solutions are…

…‘analogous to the state of the European-accessible web post-PECD/GDPR’ :)

Srsly, web users already have far too many buttons to click to accept Terms and Conditions in order to access content as it is, we don’t need to be adding *more* barriers!

How about instead we adopt simpler solutions:

* Add an additional toggle on the content sharing settings so users can select logged-in-only or not, and

* We accept that conversations on the Fediverse are ephemeral, and by default purge the content database of any instance after a ‘suitably ephemeral period’ of something like three months; if people want their words archived permanently, that’s what blogs are for

And then full text searches on the public database can be permitted to meet the useful case of being able to find out what’s going on in a disaster zone whilst it’s happening.

December 30, 2022
The opinions expressed here
are my own, and no other party
necessarily agrees with them.

A full disclosure of my
professional interests is
on the author page.
