Should libraries run search engines? It seems that the original point of a library was to organize human knowledge and culture for the public benefit. The benefit has been great, but libraries aren't the main tool for finding information now. Search engines are. The notion that a library needs to be for books only is an arbitrary limitation. This oversight allowed corporations to move into that traditionally non-profit role, and I'm not convinced they've done a particularly good job.

There used to be more overlap between information search and information access/management.

So I don't think you are having a controversial idea here.

@onepict This post was inspired mostly by a take I've seen on the internet several times that goes something like "Libraries would be considered crazy if proposed today, so what good stuff are we missing out on because we're not already doing it?" And to me this felt like an example of even a thing that libraries specifically could be doing that we're missing out on. So yeah, the non-controversialness is sort-of a feature!

I do wonder what's happening though, as at uni much research in compsci for information search came from the libraries. Like the original algorithms for search and for information management, and for stuff like working out how to do OCR for handwriting. Which wasn't that successful, as projects like Transcribe Bentham rely on crowd work. Just how did the disciplines get so separated?

My graduate project was trying to do OCR on existing documents that had just been scanned in. In the early 2000s. That was fun, like there were no real Java implementations of it and the libraries for OCR were proprietary and cost a lot of money. But I still have some of the papers on the early handwriting OCR research somewhere.

@onepict @distractedmosfet I have a friend who studied to become a librarian, and you may be interested to know how much of that is compsci.

TL;DR, this is largely already happening, it's just targeting things other than websites as primary sources.

I'm not sure websites are that good a target, either. They're often pretty rubbish.

Interesting dilemma, actually. What value is there in making rubbish but accessible secondary/tertiary sources discoverable?

@onepict @distractedmosfet So, related,

I recall that Back In The Day™️ whenever you ran into HTML it would be explained in terms of SGML (fair), and that invariably led to mention of Dublin Core. Also e.g. LaTeX at the time seemed to mention DC often.

At the latest since the ill-fated XHTML attempt, DC dropped off the radar.

HOWEVER, my librarian friend was utterly unsurprised by it, and more surprised that I as a pure compsci person knew what it was about.

@onepict @distractedmosfet It probably helps that one of my other friends, Dan Brickley, runs ... there's no direct connection to DC, but indirectly both draw on RDF historically, and RDF is sort of where the web world and DC formalized that it's about data and not HTML so much. There was a lot of parallel and cross-fertilising stuff going on there since the 90s.

I've basically been in constant contact with people concerned with formalizing how to describe resources.


@onepict @distractedmosfet All of which is to say that *for this kind of formalized description*, there is a lot of software support already, and this is extensively used in libraries as well.

Where websites are mostly different is that they tend to be ad-hoc, informal sources of information that are much harder to even describe formally because it's not necessarily clear what these things *are*. Is a blog post by a doctor a medical resource or an opinion? Is it both?

Web search engines are...


@onepict @distractedmosfet ... basically not so interested in this kind of "is-a" kind of categorization, but libraries tend to take that seriously. So what's there in tech is fundamentally different in some ways, even if there is search tech involved.

(Deleted and redrafted to add to this.)

@onepict @distractedmosfet I'm also very interested in this kind of thing from my point of view. It's abundantly clear that computers do better with categories provided by schemata, but the web and search engines also demonstrate clearly that most people don't care. is interesting to me because it's specifically aimed at bridging that gap: it provides schema keywords with which you can e.g. decorate your website content such that it looks more structured to crawlers and...

@onepict @distractedmosfet ... therefore becomes more of a well-defined thing for search engines. But most of that is going to happen outside of the user's view who is just writing a blog post or some such.

(It's no surprise that Dan runs the project while being a Google employee; Google benefits from websites looking more structured to their crawler, of course.)
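To make "decorating your content" a bit more concrete, here is a minimal sketch of what that can look like. It uses Dublin Core element names, mentioned earlier in the thread, as the vocabulary; that choice is only an assumption for illustration, and the function name `dc_meta_tags` and the sample values are made up. The author of a blog post would normally never see this, a publishing tool would emit it for them.

```python
from html import escape


def dc_meta_tags(fields):
    """Render (element, value) pairs as Dublin Core <meta> tags for an HTML head.

    `fields` is a list of pairs so that an element such as "subject"
    can be repeated. This helper is hypothetical, not part of any library.
    """
    # Declare which vocabulary the "DC." prefix refers to.
    lines = ['<link rel="schema.DC" href="http://purl.org/dc/elements/1.1/">']
    for element, value in fields:
        name = escape(element, quote=True)
        content = escape(value, quote=True)
        lines.append(f'<meta name="DC.{name}" content="{content}">')
    return "\n".join(lines)


# Illustrative values only.
post = [
    ("title", "Should libraries run search engines?"),
    ("type", "Text"),
    ("subject", "libraries"),
    ("subject", "web search"),
]

print(dc_meta_tags(post))
```

The tags only record what the author claims about the page; nothing here checks whether the claim is accurate.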

@onepict @distractedmosfet So this is a little bit of a rambling comment thread; the main point, I think, is that there is already a bunch of tech in libraries that would provide for search engines.

A secondary point is that the main difference is how libraries and search engines look at different resources and why, and a third is that it's somewhat possible to bridge this.

As to whether it'd be a good idea for libraries to run search engines, well, I don't know. Yes and no?

I do wonder how much Google disrupted information management and categorisation with its search engine development?

Like not just changing the market, but separating the disciplines. As well as the use of AI/neural nets for search in some research. I remember some of my lecturers and project assessors really not liking the use of neural nets for search, as they felt you couldn't debug how a search result was arrived at compared to traditional methods.

@onepict Yeah...

I don't know. I mean, what I sort of wanted to get at before is that it's probably a good and a bad thing what's happened here. Should Google be in control of it, no, but that's a different thread (or the main one, and we're on the side track).

Because what I *also* distinctly remember is how terrible search was before. It was mostly luck that got you anywhere. I was also the kind of kid still looking in the library index for key words I might be...


@onepict ... interested in and largely remember disappointment.

Honestly, I think a mixture of formal/traditional categorisation and more statistical ones (let's face it, AI these days is mostly statistics) is probably not bad. The specific thing we're seeing nowadays, well, could likely see some improvements. But having a mixture is not something I'd want to change too much.


@onepict I suppose one can subdivide categorization methods also into who does the bulk of the work.

With both e.g. hashtags and markup, it's the author of content that makes a claim that the content is relevant to a particular topic or has a particular form. They could be lying.

Machine learning might bring some kind of neutrality back into it that more traditional approaches might also have, but more efficiently. Then again, we've all heard of AI bias now.


@onepict Someone mentioned federated search in another comment; it's largely been my view in recent years that what this should be is sharing an index. How each instance comes by the index may be up to them, and could allow for more or less formal approaches.

On the client side, such a federated search should permit relatively fluid adjustment of the weight given to each index.

I think that might be an interesting thing to work on. I'm not sure if it already exists.
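A rough sketch of what I mean by client-side weighting, assuming each instance shares its index as a plain mapping from query term to scored URLs. The index layout, the names `federated_search`, `library_index` and `crawler_index`, and the scores are all made up for illustration; a real shared index would need a proper exchange format.

```python
from collections import defaultdict


def federated_search(query, indexes, weights):
    """Merge results from several shared indexes into one ranking.

    indexes: instance name -> {term -> {url -> score}}
    weights: instance name -> weight chosen by the client
    An index without a weight is ignored (weight 0.0).
    """
    combined = defaultdict(float)
    for name, index in indexes.items():
        weight = weights.get(name, 0.0)
        for url, score in index.get(query, {}).items():
            combined[url] += weight * score
    # Highest combined score first; URL as tie-breaker for stable output.
    return sorted(combined.items(), key=lambda item: (-item[1], item[0]))


# One formally catalogued index and one built by crawling (hypothetical data).
library_index = {"ocr": {"https://example.org/catalogue/ocr-handbook": 0.9}}
crawler_index = {
    "ocr": {
        "https://example.org/blog/ocr-notes": 0.8,
        "https://example.org/catalogue/ocr-handbook": 0.2,
    }
}
indexes = {"library": library_index, "crawler": crawler_index}

# The client can shift trust between indexes without touching the indexes.
prefer_library = federated_search("ocr", indexes, {"library": 1.0, "crawler": 0.5})
prefer_crawler = federated_search("ocr", indexes, {"library": 0.1, "crawler": 1.0})
print(prefer_library[0][0])
print(prefer_crawler[0][0])
```

How each instance arrives at its scores stays its own business; the client only decides how much to trust each one.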

