Should libraries run search engines? It seems that the original point of a library was to organize human knowledge and culture for the public benefit. The benefit has been great, but libraries aren't the main tool for finding information now. Search engines are. The notion that a library needs to be for books only is an arbitrary limitation. This oversight allowed corporations to move into that traditionally non-profit role, and I'm not convinced they've done a particularly good job.
There used to be more overlap between information search and information access/management.
So I don't think your idea is a controversial one.
@onepict This post was inspired mostly by a take I've seen on the internet several times that goes something like "Libraries would be considered crazy if proposed today, so what good stuff are we missing out on because we're not already doing it?" And to me this felt like an example of even a thing that libraries specifically could be doing that we're missing out on. So yeah, the non-controversialness is sort-of a feature!
I do wonder what's happening though. At uni, much of the compsci research on information search came from the libraries: the original search algorithms for information management, and things like working out how to do OCR for handwriting. That wasn't very successful, as projects like Transcribe Bentham rely on crowdsourcing. Just how did the disciplines get so separated?
My graduate project was trying to do OCR on existing documents that had just been scanned in, in the early 2000s. That was fun: there were no real Java implementations of it, and the OCR libraries were proprietary and cost a lot of money. But I still have some of the papers on early handwriting OCR research somewhere.
TL;DR, this is largely already happening, it's just targeting things other than websites as primary sources.
I'm not sure websites are that good a target, either. They're often pretty rubbish.
Interesting dilemma, actually. What value is there in making rubbish but accessible secondary/tertiary sources discoverable?
I recall that Back In The Day™️ whenever you ran into HTML it would be explained in terms of SGML (fair), and that invariably led to mention of Dublin Core. Also e.g. LaTeX at the time seemed to mention DC often.
Since the ill-fated XHTML attempt at the latest, DC has dropped off the radar.
HOWEVER, my librarian friend was utterly unsurprised by it, and more surprised that I as a pure compsci person knew what it was about.
@onepict @distractedmosfet It probably helps that one of my other friends, Dan Brickley, runs https://schema.org ... there's no direct connection to DC, but indirectly both draw on RDF historically, and RDF is sort of where the web world and DC formalized that it's about data and not HTML so much. There was a lot of parallel and cross-fertilising stuff going on there since the 90s.
I've basically been in constant contact with people concerned with formalizing how to describe resources.
Where websites are mostly different is that they tend to be ad-hoc, informal sources of information that are much harder to even describe formally because it's not necessarily clear what these things *are*. Is a blog post by a doctor a medical resource or an opinion? Is it both?
Web search engines are...
I do wonder how much Google disrupted information management and categorisation with its search engine development?
Like not just changing the market, but separating the disciplines. As well as the use of AI/neural nets for search in some research. I remember some of my lecturers and project assessors really not liking the use of neural nets for search, as they felt you couldn't debug how a search result was arrived at compared to traditional methods.
@onepict ... interested in and largely remember disappointment.
Honestly, I think a mixture of formal/traditional categorisation and more statistical ones (let's face it, AI these days is mostly statistics) is probably not bad. The specific thing we're seeing nowadays, well, could likely see some improvements. But having a mixture is not something I'd want to change too much.
@onepict I suppose one can also subdivide categorization methods by who does the bulk of the work.
With both e.g. hashtags and schema.org-like markup, it's the author of content that makes a claim that the content is relevant to a particular topic or has a particular form. They could be lying.
Machine learning might bring some kind of neutrality back into it that more traditional approaches might also have, but more efficiently. Then again, we've all heard of AI bias now.
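To make the "author makes a claim" point concrete, here's a rough sketch of schema.org-style markup as JSON-LD, built in Python. The post, the author name and the topic values are all made up for illustration; only the schema.org type and property names (`BlogPosting`, `headline`, `author`, `about`, `keywords`) are real. Nothing in the format checks that the claims are true.

```python
import json

# Hypothetical blog post; every value below is invented for illustration.
post = {
    "@context": "https://schema.org",
    "@type": "BlogPosting",
    "headline": "What I tell patients about sleep",
    "author": {"@type": "Person", "name": "Dr. Example"},
    # The next two fields are pure author claims: a search engine reading
    # this has no built-in way to verify them.
    "about": {"@type": "Thing", "name": "Sleep medicine"},
    "keywords": ["medicine", "opinion"],
}

# Serialised form, as it might be embedded in a page.
json_ld = json.dumps(post, indent=2)
print(json_ld)
```

Whether that post is a medical resource or an opinion is whatever the author says it is, which is exactly the weakness.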
@onepict Someone mentioned federated search in another comment; it's largely been my view in recent years that what this should be is sharing an index. How each instance comes by its index may be up to them, and could allow for more or less formal approaches.
On the client side, such a federated search should permit relatively fluid adjustment of the weight each index is given.
I think that might be an interesting thing to work on. I'm not sure if it already exists.
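A minimal sketch of the client-side weighting idea, assuming each instance has already answered a query with a URL-to-score mapping. The index names, URLs, scores and the `merge_results` helper are all hypothetical; this is just one simple way (a weighted sum) to combine indexes, not a description of any existing system.

```python
from collections import defaultdict

# Hypothetical per-instance results for one query: URL -> relevance score.
# How each instance arrives at its scores (formal cataloguing, statistics,
# something else) is up to that instance.
library_index = {
    "https://example.org/catalogue/42": 0.9,
    "https://example.org/blog/ocr": 0.4,
}
crawler_index = {
    "https://example.org/blog/ocr": 0.8,
    "https://example.org/forum/7": 0.6,
}


def merge_results(indexes, weights):
    """Combine per-index scores into one ranking using client-chosen weights."""
    combined = defaultdict(float)
    for name, results in indexes.items():
        weight = weights.get(name, 0.0)  # indexes without a weight count for nothing
        for url, score in results.items():
            combined[url] += weight * score
    # Highest combined score first; ties broken by URL for a stable order.
    return sorted(combined.items(), key=lambda item: (-item[1], item[0]))


indexes = {"library": library_index, "crawler": crawler_index}

# The same indexes, ranked under two different client preferences.
trust_library = merge_results(indexes, {"library": 1.0, "crawler": 0.2})
trust_crawler = merge_results(indexes, {"library": 0.2, "crawler": 1.0})
```

With the weights above, the catalogue entry ranks first when the library index is trusted most, and the blog post ranks first when the crawler index is. The client can shift between the two without any instance having to rebuild its index.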