#ERcamp13 at George Washington University

“The law of two feet” by Deb Schultz

This is going to be long and not my usual style of conference notetaking. Because this was an unconference, there really wasn’t much in the way of prepared presentations, except for the lightning talks in the morning. What follows below the jump is what I captured from the conversations, often simply questions posed that were left open for anyone to answer, or at least consider.

One of the good aspects of the unconference style was the free-form nature of the discussions. We generally stayed on topic, but even when we didn’t, it was a relevant or important thing that led to the tangents, so there were still plenty of things to take away. However, this format also requires someone present who is prepared to seed the conversation if it lulls or dies and no one steps in to start a new topic.

Also, if a session is designed to be a conversation around a topic, it will fall flat if it becomes all about one person or the quirks of their own institution. I had to work pretty hard on that one during the session I led, particularly when it seemed that the problem I was hoping to discuss wasn’t an issue for several of the folks present because of how they handle the workflow.

Some of the best conversations I had were during the gathering/breakfast time as well as lunch, lending even more to the unconference ethos of learning from each other as peers.

Anyway, here are my notes.

NASIG 2012: A Model for Electronic Resources Assessment

Presenter: Sarah Sutton, Texas A&M University-Corpus Christi

The model begins with a trigger event — a resource comes up for renewal. From there, she looks at what information is needed to make the renewal decision.

For A&I databases, the primary data pieces are the searches and sessions from the COUNTER release 3 reports. For full-text resources, the primary data pieces are the full-text downloads, also from the COUNTER reports. In addition to COUNTER and other publisher-supplied usage data, she looks at local data points. Link-outs from the a-to-z list of databases tell her what resources her users are consciously choosing to use, not necessarily what they arrive at via a discovery service or Google. She’s able to pull this from the content management system they use.

Once the data has been collected, it can be compared to the baseline. She created a spreadsheet listing all of the resources, with a column each for searches, sessions, downloads, and link-outs. The baseline set of core resources was based on a combination of high link-outs and high usage. These were grouped by similar numbers/type of resource. Next, she calculated the cost/use for each of the four use types, as well as the percentage of change in use over time.
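
To make the arithmetic concrete, here is a minimal sketch of the cost-per-use and percent-change calculations in Python. The resource name and all of the figures are invented for illustration; they are not from Sutton’s presentation or spreadsheet.

```python
# Hypothetical figures; a real spreadsheet would have one row per resource.
resources = {
    "Database A": {
        "cost": 5000.00,
        "current": {"searches": 1200, "sessions": 800, "downloads": 450, "linkouts": 300},
        "prior":   {"searches": 1000, "sessions": 750, "downloads": 500, "linkouts": 280},
    },
}

for name, data in resources.items():
    for use_type, count in data["current"].items():
        cost_per_use = data["cost"] / count                # cost/use for this use type
        prior = data["prior"][use_type]
        pct_change = (count - prior) / prior * 100         # % change in use over time
        print(f"{name} / {use_type}: ${cost_per_use:.2f} per use, {pct_change:+.1f}% change")
```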

After the baseline is established, she compares the renewing resource to that baseline. This isn’t always a yes or no answer, but more of a yes or maybe answer. Often more analysis is needed if it is tending towards no. More data may include overlap analysis (how much is unique to your library collection), citation lists (compare the unique titles with a list of highly-cited journals at your institution, with faculty requests, or with a core title list), journal-level usage of the unique titles, and impact factors of the unique titles.

Audience question: What about qualitative data? Talk to your users. She doesn’t have a suggestion for how to incorporate that into the model without lengthening the review process.

Audience question: How much staff time does this take? Most of the work is in setting up the baseline. The rest depends on how much additional investigation is needed.

[I had several conversations with folks after this session who expressed concern with the method used for determining the baseline. Namely, that it excludes A&I resources and assumes that usage data is accurate. I would caution anyone against adopting this wholesale as the only method of determining renewals. Without conversation and relationships with faculty/departments, we may not truly understand what the numbers are telling us.]

ER&L 2012: Knockdown/Dragout Webscale Discovery Service vs. Niche Databases — Data-Driven Evaluation Methods

“tug-of-war” photo by TheGiantVermin

Speaker: Anne Prestamo

You will not hear the magic rationale that will allow you to cancel all your A&I databases. The last three years of analysis at her institution have resulted in only two cancellations.

Background: she was a science librarian before becoming an administrator, and has a great appreciation for A&I searching.

Scenario: a subject-specific database with low use had been accessed on a per-search basis, but going forward it would be sole-sourced and subscription based. Given that, their cost per search was going to increase significantly. They wanted to know if Summon would provide a significant enough overlap to replace the database.

Arguments for keeping it: it’s key to the discipline, it has specialized search functionality, unique indexing, etc… but there’s no data to support how these unique features are being used. Subject searches in the catalog were only 5% of what was being done, and most of them came from staff computers. So, are our users actually using the controlled vocabularies of these specialized databases? Finally, librarians think they just need to promote these databases more, but sadly, that ship has already sailed.

Beyond usage data, you can also look at overlap with your discovery service, and also identify unique titles. For those, you’ll need to consider local holdings, ILL data, impact factors, language, format, and publication history.
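
As a rough sketch of the overlap step, assuming you can export the niche database’s title list and the discovery service’s indexed-title list as sets of ISSNs (the ISSNs below are placeholders, not data from the talk):

```python
# Placeholder ISSN sets standing in for the exported title lists.
database_titles = {"0000-0001", "0000-0002", "0000-0003", "0000-0004"}
discovery_titles = {"0000-0001", "0000-0002", "0000-0003"}

overlap = database_titles & discovery_titles
unique = database_titles - discovery_titles  # follow up with holdings, ILL data, impact factors, etc.

print(f"Indexed in discovery service: {len(overlap) / len(database_titles):.0%}")
print("Unique titles needing further review:", sorted(unique))
```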

Once they did all of that, they found that 92% of the titles were indexed in their discovery service. The depth of the backfile may be an issue, depending on the subject area. Also, you may need to look at the level of indexing (cover to cover vs. selective). Of the 8% of titles not included, they owned most of them in print, and they were rather old. Fifteen percent of that 8% had impact factors, which may or may not be relevant, but it is something to consider, and most of those titles were non-English. They also found that there were no ILL requests for the non-owned unique titles, and less than half were scholarly and currently being published.

NASIG 2010: Linked Data and Libraries

Presenter: Eric Miller, Zepheira, LLC

Nowadays, we understand what the web is and the impact it has had on information sharing, but before it was developed, it was in a “vague but exciting” stage and few understood it. When we got started with the web, we really didn’t know what we were doing, but more importantly, the web was being developed so that it was flexible enough for smarter and more creative people to do amazing things.

“What did your website look like when you were in the fourth grade?” Kids are growing up with the web and it’s hard for them to comprehend life without it. [Dang, I’m old.]

This talk will be about linked data, its legacy, and how libraries can lead with linked data. We have a huge opportunity to weave libraries into the fabric of the web, and vice versa.

About five years ago, the BBC started making their content available in a service that allowed others to use and remix the delivery of the content in new ways. Rather than developing alternative platforms and creating new spaces, they focus on generating good content and letting someone else frame it. Other sources like NPR, the World Bank, and Data.gov are doing the same sorts of things. Within the library community, these things are happening, as well. OCLC’s APIs are getting easier to use, and several national libraries are putting their OPACs on the web with APIs.

Obama’s open government initiative is another one of those “vague but exciting” things, and it charged agencies to come up with their own methods of making their content available via the web. Agencies are now struggling with the same issues and desires that libraries have been tackling for years. We need to recognize our potential role in moving this forward.

Linked data is a set of best practices for sharing and connecting data on the semantic web. Rather than leaving the data in their current formats, let’s put them together in ways they can be used on the wider web. It’s not the databases that make the web possible, it’s the web that makes the databases usable.

Human computation can be put to use in ways that assist computers to make information more usable. Captcha systems are great for blocking automated programs when needed, and by using human computation to decipher scanned text that is undecipherable by computers, ReCaptcha has been able to turn unusable data into a fantastic digital repository of old documents.

LEGOs have been around for decades, and their simple design ensures that new blocks work with old blocks. Most kids end up dumping all of their sets into one bucket, so no matter where the individual building blocks come from, they can be put together and rebuilt in any way you can imagine. We could do this with our blocks of data, if they are designed well enough to fit together universally.

Our current applications, for the most part, are not designed to allow for the portability of data. We need to rethink application design so that the data becomes more portable. Web applications have, by necessity, had to have some amount of portability. Users are becoming more empowered to use the data provided to them in their own way, and if they don’t get that from your service/product, then they go elsewhere.

Digital preservation repositories are discussing ways to open up their data so that users can remix and mashup data to meet their needs. This requires new ways of archiving, cataloging, and supplying the content. Allow users to select the facets of the data that they are interested in. Provide options for visualizing the raw data in a systematic way.

Linked data platforms create identifiers for every aspect of the data they contain, and these are the primary keys that join data together. Other content that is created can be combined to enhance the data generated by agencies and libraries, but we don’t share the identifiers well enough to allow others to properly link their content.

Web architecture starts with web identifiers. We can use URLs to identify things other than just documents, but we need to be consistent and we can’t change the URL structures if we want it to be persistent. A lack of trust in identifiers is slowing down linked data. Libraries have the opportunity to leverage our trust and data to provide control points and best practices for identifier curation.
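
To make the identifier point concrete, here is a tiny sketch using the rdflib Python library. The URIs are invented for illustration; the value comes from everyone reusing the same persistent identifiers so that triples from different sources can be merged.

```python
from rdflib import Graph, URIRef, Literal
from rdflib.namespace import DCTERMS, FOAF

g = Graph()

# Hypothetical persistent URLs that identify the things themselves,
# not just web pages about them.
book = URIRef("http://example.org/id/work/moby-dick")
author = URIRef("http://example.org/id/person/herman-melville")

g.add((book, DCTERMS.title, Literal("Moby-Dick")))
g.add((book, DCTERMS.creator, author))
g.add((author, FOAF.name, Literal("Herman Melville")))

# Anyone else who uses the same URIs can publish additional triples
# that merge cleanly with these.
print(g.serialize(format="turtle"))
```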

A lot of work is happening in W3C. Libraries should be more involved in the conversation.

Enable human computation by providing the necessary identifiers back to data. Empower your users to use your data, and build a community around it. Don’t worry about creating the best system — wrap and expose your data using the web as a platform.

ER&L 2010: Comparison Complexities – the challenges of automating cost-per-use data management

Speakers: Jesse Koennecke & Bill Kara

We have the use reports, but it’s harder to pull in the acquisitions information because of the systems it lives in and the different subscription/purchase models. Cornell had a cut in staffing and an immediate need to assess their resources, so they began to triage requests for cost-per-use statistics. They are not doing systematic or comprehensive reviews of all usage and cost-per-use data.

In the past, they have tried doing manual preparation of reports (merging files, adding data), but that’s time-consuming. They’ve had to set up processes to consistently record data from year to year. Some vendor solutions have been partially successful, and they are looking to emerging options as well. Non-publisher data such as link resolver use data and proxy logs might be sufficient for some resources, or for adding a layer to the COUNTER information to possibly explain some use. All of this has required certain skill sets (databases, spreadsheets, etc.)

Currently, they are working on managing expectations. They need to define the product that their users (selectors, administrators) can expect on a regular basis, what they can handle on request, and what might need a cost/benefit decision. In order to get accurate time estimates for the work, they looked at 17 of their larger publisher-based accounts (not aggregated collections) to get an idea of patterns and unique issues. As an unfortunate side effect, every time they look at something, they get an idea of even more projects they will need to do.

The matrix they use includes: paid titles v. total titles, differences among publishers/accounts, license period, cancellations/swaps allowed, frontfile/backfile, payment data location (package, title, membership), and use data location and standard. Some of the challenges with usage data include non-COUNTER compliance or no data at all, multiple platforms for the same title, combined subscriptions and/or title changes, titles transferred between publishers, and subscribed content v. purchased content. Cost data depends on the nature of the account and the nature of the package.

For packages, you can divide the single line item by the total use, but that doesn’t help the selectors assess the individual subset of titles relevant to their areas/budgets. This gets more complicated when you have packages and individual titles from a single publisher.
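
A quick sketch of why that division falls short, with invented figures: allocating a package’s single invoice line in proportion to use gives every title the same cost per use, which tells selectors nothing about the subset of titles they actually care about.

```python
# Invented package cost and per-title downloads.
package_cost = 25000.00
title_uses = {"Journal A": 4000, "Journal B": 900, "Journal C": 100}

total_use = sum(title_uses.values())
print(f"Package-level cost per use: ${package_cost / total_use:.2f}")

# Proportional allocation yields an identical cost/use for every title,
# so it can't distinguish high-value titles from low-value ones.
for title, uses in title_uses.items():
    allocated_cost = package_cost * uses / total_use
    print(f"{title}: {uses} uses, allocated ${allocated_cost:.2f}, ${allocated_cost / uses:.2f} per use")
```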

Future possibilities: better automated matching of cost and use data, with some useful data elements such as multiple cost or price points, and formulas for various subscription models. They would also like to consolidate accounts within a single publisher to reduce confusion. Also, they need more documentation so that it’s not just in the minds of long-term staff. 

ER&L 2010: We’ve Got Data – Now What Do We Do With It? Applying Standards to Assess Information Resources

Speakers: Mary Feeney, Ping Situ, and Jim Martin

They had a budget cut (surprise surprise), so they had to assess what to cut using the data they had. Complicating this was a change in organizational structure. In addition, they adopted the BYU project management model. Also, they had to sort out a common approach to assessment across all of the disciplines/resources.

They used their ILLs to gather stats about print resource use. They hired Scholarly Stats to gather their online resource stats, and for publishers/vendors not in Scholarly Stats, they gathered data directly from the vendors/publishers. Their process involved creating spreadsheets of resources by type, and then divided up the work of filling in the info. Potential cancellations were then provided to interested parties for feedback.

Quality standards:

  • 60% of monographs need to show at least one use in the last four years – this was used to apply cuts to the firm orders book budget, which reduces the flexibility for making one-time purchases with remaining funds; the book money was shifted to serial/subscription lines
  • 95% of individual journal titles need to show use in the last three years (both in-house and full-text downloads) – LJUR data was used to add to the data collected about print titles
  • dual format subscriptions required a hybrid approach, and they compared the costs with the online-only model – one might think that switching to online only would be a no-brainer, but licensing issues complicate the matter
  • cost per use of ejournal packages will not exceed twice the cost of ILL articles (a quick check of this standard is sketched below)
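
As an illustration of how that last standard might be checked (the dollar figures here are invented, not the university’s actual numbers):

```python
# Invented figures for illustration.
ill_cost_per_article = 17.50           # average cost of one ILL article request
threshold = 2 * ill_cost_per_article   # standard: package cost/use must not exceed 2x ILL cost

packages = {"Package X": (48000.00, 3200), "Package Y": (12000.00, 150)}  # (cost, full-text downloads)

for name, (cost, downloads) in packages.items():
    cost_per_use = cost / downloads
    verdict = "meets standard" if cost_per_use <= threshold else "flag for review"
    print(f"{name}: ${cost_per_use:.2f}/use vs. ${threshold:.2f} threshold -> {verdict}")
```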

One problem with their approach was with the existing procedures that resulted in not capturing data about all print journals. They also need to include local document delivery requests in future analysis. They need to better integrate the assessment of the use of materials in aggregator databases, particularly since users are inherently lazy and will go the easiest route to the content.

Aggregator databases are difficult to compare, and often the ISSN lists are incomplete. It’s also difficult to compare based on title-by-title holdings coverage; that’s useful for long-term use comparison, but not for this immediate project. Other problems with aggregator databases include duplication, embargoes, and completeness of coverage of a title. They used SerSol’s overlap analysis tool to get an idea of duplication. It’s a time-consuming project, so they don’t plan to continue with it for all of their resources.

What if you don’t have any data or the data you have doesn’t have a quality standard? They relied on subject specialists and other members of the campus to assess the value of those resources.

CiL 2008: What’s New With Federated Search

Speakers: Frank Cervone & Jeff Wisniewski

Cervone gave a brief overview of federated searching, with Wisniewski giving a demonstration of how it works in the real world (aka the University of Pittsburgh library) using WebFeat. The UofP library has a basic search front and center on their home page, and then a more advanced searching option under Find Articles. They don’t have a Database A-Z list because users either don’t know what database means in this context or can’t pick from the hundreds available.

Cervone showed the trends in metasearch use, which go up and down but overall are trending upward. The cyclical pattern due to quarter terms was fascinating to see — more dramatic than what one might find with semester terms. Searches go up towards mid-terms and finals, then drop back down afterwards.

According to a College & Research Libraries article from November 2007, federated search results were not much different from native database searches. It also found that faculty rated the results of federated searching much higher than librarians did, which raises the question, “Who are we trying to satisfy — faculty/students or librarians?”

Part of why librarians are still unconvinced is that vendors are shooting themselves in the foot in the way they try to sell their products. Yes, federated search tools cannot search all possible databases, but our users are only concerned that they search the relevant databases they need. De-duplication is virtually impossible and depends on the quality of the source data. Vendors make other claims that can be refuted, but the presenters didn’t spend much time on them.

The relationships between products and vendors are incestuous, and the options for federated searching are decreasing. There are a few open source options, though: LibraryFind, dbWiz, Masterkey, and Open Translators (provides connectors to databases, but you have to create the interface). Part of why open source options are being developed is that commercial vendors aren’t responding quickly to library needs.

LibraryFind has a two-click find workflow, making it quicker to get to the full-text. It also can index local collections, which would be handy for libraries who are going local.

dbWiz is a part of a larger ERM tool. It has an older, clunkier interface than LibraryFind. It doesn’t merge the results.

Masterkey can search 100 databases at a time, processing and returning hits at the rate of 2000 records per second, de-duped (as much as it can) and ranked by relevance. It can also do faceted browsing by library-defined elements. The interface can be as simple or complicated as you want it to be.

Federated searching as a stand-alone product is becoming passé as new products for interfacing with the OPAC are being developed, which can incorporate other library databases. VuFind, WorldCat Local, Encore, Primo, and AquaBrowser are just a few of the tools available. Next-gen library interfaces aim to bring all library content together. However, they don’t integrate article-level information with the items in your catalog and local collections very well.

Side note: Microsoft Enterprise Search is doing a bit more than Google in integrating a wide range of information sources.

Trends: the choice of vendors is rapidly shrinking. There has been some progress in standards implementation. Visual search (like Grokker) is increasingly being used. There is some movement toward more holistic content discovery. Commercial products are becoming more affordable, making them available to institutions with budgets of all sizes.

Federated Search Blog for vendor-neutral info, if you’re interested.
