Should libraries run search engines? It seems that the original point of a library was to organize human knowledge and culture for the public benefit. The benefit has been great, but libraries aren't the main tool for finding information now. Search engines are. The notion that a library needs to be for books only is an arbitrary limitation. This oversight allowed corporations to move into that traditionally non-profit role, and I'm not convinced they've done a particularly good job.
There used to be more overlap between information search and information access/management.
So I don't think you have a controversial idea here.
@onepict This post was inspired mostly by a take I've seen on the internet several times that goes something like "Libraries would be considered crazy if proposed today, so what good stuff are we missing out on because we're not already doing it?" And to me this felt like an example of even a thing that libraries specifically could be doing that we're missing out on. So yeah, the non-controversialness is sort-of a feature!
I do wonder what's happening though. At uni, much of the compsci research into information search came from the libraries: the original algorithms for search and information management, and stuff like working out how to do OCR but for handwriting. That wasn't hugely successful, since projects like Transcribe Bentham rely on crowd working. Just how did the disciplines get so separated?
My graduate project, in the early 2000s, was trying to do OCR on existing documents that had just been scanned in. That was fun: there were no real Java implementations, and the OCR libraries were proprietary and cost a lot of money. But I still have some of the papers on the early handwriting-OCR research somewhere.
TL;DR, this is largely already happening, it's just targeting things other than websites as primary sources.
I'm not sure websites are that good a target, either. They're often pretty rubbish.
Interesting dilemma, actually. What value is there in making rubbish but accessible secondary/tertiary sources discoverable?
I recall that Back In The Day™️ whenever you ran into HTML it would be explained in terms of SGML (fair), and that invariably led to mention of Dublin Core. Also e.g. LaTeX at the time seemed to mention DC often.
By the time of the ill-fated XHTML attempt at the latest, DC had dropped off the radar.
HOWEVER, my librarian friend was utterly unsurprised by it, and more surprised that I as a pure compsci person knew what it was about.
Where websites are mostly different is that they tend to be ad-hoc, informal sources of information that are much harder to even describe formally because it's not necessarily clear what these things *are*. Is a blog post by a doctor a medical resource or an opinion? Is it both?
Web search engines are...
@onepict @distractedmosfet I'm also very interested in this kind of thing from my #interpeer point of view. It's abundantly clear that computers do better with categories provided by schemata, but the web and search engines also demonstrate clearly that most people don't care.
Schema.org is interesting to me because it's specifically aimed at bridging that gap: it provides schema keywords with which you can e.g. decorate your website content such that it looks more structured to crawlers and...
@onepict @distractedmosfet ... therefore becomes more of a well-defined thing for search engines. But most of that is going to happen outside of the user's view who is just writing a blog post or some such.
(It's no surprise that Dan runs the project while being a Google employee; Google benefits from websites looking more structured to their crawler, of course.)
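To make the "decorating content for crawlers" idea concrete, here's a minimal sketch of what a schema.org annotation for a blog post looks like as JSON-LD (the usual embedding format). All the values below are made-up placeholders; the `@type` and property names come from the schema.org vocabulary.

```python
import json

# A minimal schema.org description of a blog post, expressed as
# JSON-LD. The headline, author, and date are invented examples.
post = {
    "@context": "https://schema.org",
    "@type": "BlogPosting",
    "headline": "Should libraries run search engines?",
    "author": {"@type": "Person", "name": "Jane Doe"},
    "datePublished": "2023-01-15",
}

# Embedded in a page as <script type="application/ld+json">, this
# gives a crawler a machine-readable claim about what the page *is*,
# without the author having to think in schemata while writing.
print(json.dumps(post, indent=2))
```

The point is exactly the gap-bridging described above: the prose stays informal, while a small structured layer alongside it tells the crawler "this thing is a BlogPosting with this author".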
@onepict @distractedmosfet So this is a little bit of a rambling comment thread; the main point, I think, is that there is already a bunch of tech in libraries that could provide for search engines.
A secondary point is that the main difference lies in which resources libraries and search engines look at, and why; a third is that it's somewhat possible to bridge this.
As to whether it'd be a good idea for libraries to run search engines, well, I don't know. Yes and no?
I do wonder how much Google disrupted information management and categorisation with its search engine development.
Like not just changing the market, but separating the disciplines, as well as the use of AI/neural nets for search in some research. I remember some of my lecturers and project assessors really not liking the use of neural nets for search, as they felt you couldn't debug how a search result was arrived at, compared to traditional methods.
I don't know. I mean, what I sort of wanted to get at before is that what's happened here is probably both a good and a bad thing. Should Google be in control of it? No, but that's a different thread (or the main one, and we're on the side track).
Because what I *also* distinctly remember is how terrible search was before. It was mostly luck that got you anywhere. I was also the kind of kid still looking in the library index for key words I might be...
@onepict ... interested in and largely remember disappointment.
Honestly, I think a mixture of formal/traditional categorisation and more statistical ones (let's face it, AI these days is mostly statistics) is probably not bad. The specific thing we're seeing nowadays, well, could likely see some improvements. But having a mixture is not something I'd want to change too much.
@onepict I suppose one can also subdivide categorization methods by who does the bulk of the work.
With both e.g. hashtags and schema.org-like markup, it's the author of content that makes a claim that the content is relevant to a particular topic or has a particular form. They could be lying.
Machine learning might bring back some of the neutrality that more traditional approaches also have, but more efficiently. Then again, we've all heard of AI bias now.
@onepict Someone mentioned federated search in another comment; it's largely been my view in recent years that what this should be is sharing an index. How each instance comes by the index may be up to them, and could permit more or less formal approaches.
On the client side, such a federated search should permit relatively fluid adjustment of the weight given to each index.
I think that might be an interesting thing to work on. I'm not sure if it already exists.
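A minimal sketch of what that client-side blending could look like, assuming each instance shares its index as per-document relevance scores. The index names, documents, and scores here are all invented for illustration:

```python
def blend(results_by_index, weights):
    """Merge {index: {doc: score}} mappings into one ranking,
    scaling each index's scores by the user's weight for it."""
    combined = {}
    for index, results in results_by_index.items():
        w = weights.get(index, 1.0)
        for doc, score in results.items():
            combined[doc] = combined.get(doc, 0.0) + w * score
    # Highest combined score first.
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)

results = {
    "library-a": {"doc1": 0.9, "doc2": 0.4},
    "blog-index": {"doc2": 0.8, "doc3": 0.7},
}
# The user trusts the curated library index twice as much.
ranking = blend(results, {"library-a": 2.0, "blog-index": 1.0})
print(ranking)  # doc1 ranks first under these weights
```

Sliding the weights around is the "fluid adjustment" part: the same shared indexes yield different rankings depending on whom the user chooses to trust.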