Testing the Summon Research Assistant
Early this spring, Ex Libris released the Summon “Research Assistant.” It is a retrieval-augmented generation (RAG) tool that uses an LLM (OpenAI’s GPT-4o mini at the time of writing) to search and summarize metadata in the Summon/Primo Central Discovery Index (CDI).
We did a library-wide test mid-semester and decided that it’s not appropriate to turn it on now. We may do so when some bugs are worked out. Even then, it is not a tool we’d leave linked in the header, promote as-is, or teach without significant caveats (see Reflection).
Brief Overview of the Tool
This overview covers the Summon version, though I believe the Primo version is quite similar and shares some of the same limitations.
From the documentation (a rough code sketch of this pipeline follows the list):
- Query Conversion – The user’s question is sent to the LLM, where it is converted to a Boolean query that contains a number of variations of the query, connected with an OR. If the query is non-English, some of the variations will be in the query language, and the other variations will be in English.
- Results Retrieval – The Boolean query is sent to CDI to retrieve the results.
- Re-ranking – The top results (up to 30) are re-ranked using embeddings to identify five sources that best address the user’s query.
- Overview Creation – The top five results are sent to the LLM with the instructions to create the overview with inline references, based on the abstracts.
- Response Delivery – The overview and sources are returned to the user in the response.
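Put in code, the documented flow is a fairly standard retrieve-then-summarize pipeline. Here is a minimal sketch, assuming hypothetical `llm`, `cdi_search`, and `embed` helpers standing in for the vendor’s internal services; none of these names, prompts, or record fields are actual Ex Libris or OpenAI APIs.

```python
from math import sqrt

# Hypothetical helpers (not real APIs): llm(prompt) -> str,
# cdi_search(query, limit) -> list[dict], embed(text) -> list[float].

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b)))

def research_assistant(question: str) -> dict:
    # 1. Query conversion: the LLM rewrites the question as OR'd Boolean variations
    #    (adding English variants if the question is in another language).
    boolean_query = llm(f"Rewrite as a Boolean query of OR'd variations: {question}")

    # 2. Results retrieval: the Boolean query goes to the Central Discovery Index.
    results = cdi_search(boolean_query, limit=30)

    # 3. Re-ranking: embeddings pick the five abstracts closest to the question.
    q_vec = embed(question)
    top_five = sorted(
        results,
        key=lambda r: cosine_similarity(q_vec, embed(r["abstract"])),
        reverse=True,
    )[:5]

    # 4. Overview creation: the LLM writes a cited overview from those five abstracts.
    overview = llm(
        "Write an overview with inline references based only on these abstracts:\n"
        + "\n".join(r["abstract"] for r in top_five)
    )

    # 5. Response delivery: the overview plus the five sources go back to the user.
    return {"overview": overview, "sources": top_five}
```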
There is one major caveat to the above, also in the documentation, which is the content scope. Once you get through all the exceptions,[1] only a slice of the CDI could make it into the top 5 results. Most notably, records from any of the following content providers are not included:
- APA,
- DataCite,
- Elsevier,
- JSTOR, and
- Condé Nast.
These would be in the results you get when clicking through to “View related results,” but they could not make it into the “Top 5.”
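To make the effect of the content scope concrete: in terms of the pipeline sketched above, the exclusions amount to a filter on the candidate pool before re-ranking. This is illustrative only; whether Ex Libris applies the filter at retrieval or at re-ranking isn’t something the documentation spells out, and the field names are invented.

```python
# Illustrative only: how the content scope plays out relative to the sketch above.
# Excluded providers (and records without abstracts) never reach the Top 5,
# but they can still appear in the full list behind "View related results."

EXCLUDED_PROVIDERS = {"APA", "DataCite", "Elsevier", "JSTOR", "Condé Nast"}

def top_five_candidates(results: list[dict]) -> list[dict]:
    """Keep only records eligible for re-ranking into the Top 5."""
    return [
        r for r in results
        if r.get("provider") not in EXCLUDED_PROVIDERS and r.get("abstract")
    ]
```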
Positive Findings
I would summarize the overall findings as: extremely mixed. As I said up front, we had enough concerns that we didn’t want to simply turn on the tool and encourage our wider user base to try it out.
Sometimes, people got really interesting or useful results. When it worked well, the query generation could come up with search strings that we wouldn’t have thought of ourselves but that returned good results. I found some electronic resources about quilts that I didn’t know we had – which is saying something!
Some of the ways the tool rephrased research questions as suggested “related research questions” were also useful. A few people suggested that this could be used to help students think about the different ways one can design and phrase a search.
The summaries generally seemed faithful to the record abstracts. I appreciated that they were cited in a way that let me identify which item was the source of which assertion.[2]
We also had many concerns.
Massive Content Gaps (and Additions)
The content gaps are a dealbreaker all on their own. No JSTOR? No Elsevier? No APA? Whole disciplines are missing. While those providers do show up in “View related results,” the first 5 results matter a lot in a user’s experience and shape expectations of what a further search would contain. If someone is in a field for which those are important databases, it would be irresponsible to send them to this tool.
The need for abstracts significantly limits which kinds of results get included. Many of our MARC records do not have abstracts. For others, one could infer the contents of the book from a table of contents note, but that requires a level of abstraction and inference that a human can perform and this tool cannot.
Then there’s the flip side of coverage. This is based on the Ex Libris CDI (minus the massive content gaps), which includes everything that we could potentially activate. At the time of writing, it still doesn’t seem possible to scope the tool to just our holdings (and include our own MARC). This means results include not only the good stuff we’d be happy to get for a patron via ILL but also whatever cruft has made its way into the billion-plus-item index. And that’s not a hypothetical problem. In one search we did during the session, so much potential content was in excluded JSTOR collections that a top 5 result on the RAG page was an apparently LLM-generated Arabic bookseller’s site.[3]
LLM Parsing / Phrasing
The next issue we encountered was that sometimes the LLM handled queries in unexpected[4] ways.
Unexpected Questions
First, the Research Assistant is built to only answer a specific type of question. While all search tools can be described that way, anyone who’s worked for more than 30 seconds with actual humans knows that they don’t always use things in the way we intend. That’s why we build things like “best bet” style canned responses to handle searches for library hours or materials with complicated access (like the Wall Street Journal).
- It was not programmed to do anything with single-word searches. A search for “genetics,” for example, got the “We couldn’t generate an answer for your question” response. There wasn’t any kind of error handling on the Ex Libris side to turn it into some kind of “I would like to know about [keyword],” even as a suggestion provided in the error message. For all my critiques of LLMs themselves, sometimes it’s just poor edge-case handling (a sketch of what that kind of fallback might look like follows this list).
- Then there were the meta questions. Colleagues who staff our Ask-a-Librarian service brought in a few they’ve gotten: “Do you have The Atlantic?” or “What is on course reserve for XXXXX?” In both cases, the tool could not detect that this was not the kind of question it was programmed to answer; it returned a few random materials and generated stochastic responses that were, of course, completely inaccurate.
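Neither kind of guardrail is complicated to describe. Here is a purely illustrative sketch of a pre-processing step, with invented patterns and canned responses (not anything Ex Libris exposes or has committed to), covering both the bare-keyword fallback and the “best bet” style responses mentioned above.

```python
# Illustrative "best bet" / edge-case handling in front of the assistant.
# The patterns and responses are invented examples, not a real configuration.

BEST_BETS = {
    "library hours": "Our hours are listed on the library hours page.",
    "course reserve": "Course reserves are listed in the catalog, not in this tool.",
    "do you have": "For holdings questions, search the catalog or ask a librarian.",
}

def route_question(raw: str) -> dict:
    """Return either a canned response or a question the assistant can handle."""
    cleaned = raw.strip()
    lowered = cleaned.lower()

    # Meta/service questions get a canned answer instead of a stochastic summary.
    for pattern, response in BEST_BETS.items():
        if pattern in lowered:
            return {"type": "best_bet", "text": response}

    # A bare keyword like "genetics" becomes a question instead of a dead end.
    if len(cleaned.split()) == 1:
        return {"type": "question", "text": f"I would like to know about {cleaned}."}

    return {"type": "question", "text": cleaned}
```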
LLM-Induced Biases
Then there were issues introduced by the nature of LLMs – how they tokenize and what kind of data they’re trained on:
- A liaison librarian reported asking about notable authors from Mauritius and being given results for notable authors from Mauritania. I would guess this is a combination of stemming and a lack of results for Mauritius. But they are two very distinct countries, in completely different regions of a continent (or off the continent).
- Another bias-triggering question related to Islamic law and abortion. The output used language specific to 20th/21st-century evangelical Christianity. Because the LLM’s output varies from run to run, we could not replicate it; instead we got a variety of different phrasings of varying quality. This is a (not unexpected) bias introduced by the data the LLM was trained on. Notably, it was not coming from the language of the abstracts.
Balancing Safety and Inquiry
Note: While I was finishing this blog post, ACRLog published a post going into more detail about topics blocked by the “safeguards.” I brought this to our library-wide discussion, but I’m going to refer readers to that post for the details. Basically, if you ask about some topics, you won’t get a response, even though some of those topics are exactly the kind of thing we expect our students to be researching.[5]
When the Summon listserv was discussing this issue in the spring, I went and found the Azure OpenAI documentation for content filtering. It covers a set of different areas that customers can configure:
- Hate and fairness
- Sexual
- Violence
- Self-Harm
- Protected material:
  - Copyrighted text (actual text of articles, etc.) which can be output without proper citation
  - Code harvested from repositories and returned without citation
- User prompt attacks
- Indirect attacks
- Groundedness (how closely the output sticks to the supplied source material versus drifting into merely statistically probable text)
Filtering levels can be set at low, medium, or high for each of the severity-based categories. I shared the link and the list of areas on the listserv and asked which settings the Research Assistant uses, but did not get an answer from Ex Libris.
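To make the unanswered question concrete, the decision I was asking about boils down to something like the following. This is an illustrative summary only, not the actual Azure configuration schema, and the values shown are placeholders rather than anything Ex Libris has confirmed.

```python
# Illustrative only: the shape of an Azure OpenAI content-filter decision.
# The keys mirror the categories listed above; the values are placeholders,
# not Ex Libris's actual settings (which they have not disclosed).

content_filter_settings = {
    # Severity-threshold filters, each configurable at low / medium / high.
    "hate_and_fairness": "medium",
    "sexual": "medium",
    "violence": "medium",
    "self_harm": "medium",
    # Detections layered on top of the severity filters.
    "protected_material_text": True,   # copyrighted text output without citation
    "protected_material_code": True,   # code harvested from repositories
    "user_prompt_attacks": True,       # direct jailbreak attempts
    "indirect_attacks": True,          # attacks embedded in retrieved content
    "groundedness": True,              # output drifting from the supplied sources
}
```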
Steps to Delivery
This next part relates both to the idea of the Research Assistant itself and to Ex Libris’s implementation of it.
Very, very few of our patrons need just a summary of materials (and, again, a summary only of materials that happen to have an abstract, and only of the abstract, not the actual contents). Plenty of our patrons don’t need that at all. Unless they’re going to copy-paste the summary into their paper and call it a day, they actually need to get and use the materials.
So once they’ve found something interesting, what are their next steps?
Well, first you click the item.
Then you click Get It.
Then THAT opens a Quick Look view.
Then you click the title link on the item in the Quick Look View.
And, oh look, this one was in the CDI but not in our holdings, so it sent me to an ILL page (this was not planned, just how it happened).
Maybe Ex Libris missed the memo, but we’ve actually been working pretty hard to streamline pathways for our patrons. The fewer clicks the better. This is a massive step backward.
Reflection
I doubt this would be of any utility for grad students or faculty except as another way of constructing query strings. I do think it’s possible to teach with this tool, as with many other equally but differently broken tools. I would not recommend it at a survey course level. Is it better than other tools they’re probably already using? Perhaps, but the bar is in hell.
Optimal use requires:
- Students to be in a discipline where there’s decent coverage.
- Students to know that topical and coverage limitations exist.
- Students to understand that the summaries are the equivalent of reading 5 abstracts at once, and that there may be very important material in the full text that the abstract doesn’t capture.
- Students to actually click through to the full list of results.
- Ex Libris to let us search only our own index (due to the cruft issue).
- Ex Libris to redesign the interface with a shorter path to materials.
Its greatest strength as a tool is probably the LLM-to-query translation and the recommendations for related searches. When it works. But with all those caveats?
I am not optimistic.
1. FWIW, I totally understand and support not including News content in this thing. First, our researchers are generally looking for scholarly resources of some kind. Second, bias city. ↩︎
2. These citations are to the abstract vs. the actual contents. This could cause issues if people try to shortcut by just copy-pasting, since we’re reliant on the abstract to reliably represent the contents (though there’s also no page # citation). ↩︎
3. A colleague who is fluent in Arabic hypothesized that it was not a real bookstore because many small things about the language and the site (when we clicked through) were wrong. ↩︎
4. Ben Zhao’s closing keynote for OpenRepositories goes into how these kinds of issues could be expected. So I’ll say “unexpected” from the user’s POV, but also I cannot recommend his talk highly enough. Watch it. ↩︎
5. Whether ChatGPT can appropriately summarize 5 abstracts from materials related to the Tulsa Race Riots or the Nakba is a whole separate question. ↩︎