The Bigger Picture: Creating a Statistics Dashboard That Ties Collection Building to Research
Speaker: Shannon Tharp, University of Wyoming
How can the library tie its collection building efforts to the university’s research output? They need to articulate value to stakeholders and advocate for budget increases.
She used Tableau to develop the dashboard and visualizations, starting with a broad overview of collections and expanding from there. The visualizations include a narrative and an intuitive interface for accessing more information.
The dashboard also includes qualitative interviews of faculty and research staff. They are tentatively calling this “faculty talk” and plan to have it up soon, with rotating interviews displaying. They are thinking about including graduate and undergraduate student interviews as well.
(e)Book Snapshot: Print and eBook Use in an Academic Library Consortium
Speaker: Joanna Voss, OhioLINK
What can we do to continue to meet the needs of students and faculty through the print to electronic book transition? Are there any patterns or trends in their use that will help? Anecdotally we hear about users preferring print to electronic. How do we find data to support this and to help them?
They cleaned up the data using Excel and OpenRefine, and then used Tableau for the analysis and visualization. OpenRefine is good for really messy data.
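OpenRefine's strength with messy data comes largely from its clustering feature, which groups near-duplicate strings (variant journal titles, say) under one key. A minimal sketch of the same idea in plain Python, using the fingerprint method with made-up title strings:

```python
import re
from collections import defaultdict

def fingerprint(title):
    """OpenRefine-style fingerprint: lowercase, strip punctuation,
    sort the unique tokens so word order and case don't matter."""
    tokens = re.sub(r"[^\w\s]", " ", title.lower()).split()
    return " ".join(sorted(set(tokens)))

def cluster_titles(titles):
    """Group near-duplicate title strings by their fingerprint."""
    clusters = defaultdict(list)
    for t in titles:
        clusters[fingerprint(t)].append(t)
    return clusters

# three variants of the same title collapse into a single cluster
raw = ["Journal of Chemistry", "journal of chemistry.", "Chemistry, Journal of"]
clusters = cluster_titles(raw)
```

This is only a sketch of the concept; OpenRefine offers several keying and nearest-neighbor methods beyond this one.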
A Brief History of PaperStats
Speaker: Whitney Bates-Gomez, Memorial Sloan Kettering Cancer Center
Web-based tool for generating cost-per-use reports. It’s currently in beta and only working with JR1 reports. It works most of the time for COUNTER and SUSHI reports, but not always. The costs function requires you to upload the costs in a CSV format, and they were able to get that data from their subscription agent.
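The core cost-per-use join that a tool like this performs (JR1-style download totals matched against an uploaded costs CSV) can be sketched as follows; the titles, counts, and prices here are invented for illustration:

```python
import csv
import io

# hypothetical JR1-style full-text download totals per title
jr1_totals = {"Journal of Examples": 412, "Annals of Samples": 8}

# hypothetical costs CSV, as it might come from a subscription agent
costs_csv = io.StringIO(
    "title,cost\n"
    "Journal of Examples,2100.00\n"
    "Annals of Samples,950.00\n"
)
costs = {row["title"]: float(row["cost"]) for row in csv.DictReader(costs_csv)}

# cost per use = cost / downloads, for titles present in both data sets
cost_per_use = {
    title: round(costs[title] / uses, 2)
    for title, uses in jr1_totals.items()
    if title in costs
}
```

The low-use title jumps out immediately: 8 downloads against a $950 invoice is a very different story from 412 downloads against $2,100.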
But, too bad for you, it’s going away at the end of the spring, though there might be a revised version out there some day. It comes from PubGet, and the Copyright Clearance Center decided not to renew its support.
Speakers: Michael Levine-Clark (University of Denver) & Kari Paulson (ProQuest eBrary/EBL)
ProQuest is looking at usage data across the eBrary and EBL platforms as they are working to merge them together. To help interpret the data, they asked Levine-Clark to look at it as well. This is more of a proof-of-concept than a final conclusion.
They looked at 750,000 ebooks initially, narrowing it down for some aspects. He asked several questions, from the importance of quality to disciplinary preferences to best practices for measuring use, and various tangential questions related to these.
They looked at eBrary data from 2010-2013Q3 and EBL data from 2011-2013Q3. They used only the titles with an LC call number, and separate analysis of those titles that come from university presses specifically.
Usage was defined in three ways: sessions, views (count of page views), and downloads (entire book). Due to the variations in the data sets (number of years, number of customers, platforms), they could not easily compare the usage information between eBrary and EBL.
Do higher quality ebooks get used more? He used university press books as a measure of quality, though he recognizes this is not the best measure. For titles with at least one session, he found that the rate of use was fairly comparable, but slightly higher for university press books. The session counts and page views in eBrary were significantly higher for UP books, but not as much with EBL. In fact, use was consistently higher for UP books across the categories, but this may be because more UP books are selected by libraries, thus increasing their availability.
What does usage look like across broad disciplines? Humanities, Social Sciences, and STEM were broken out and grouped by their call number ranges. He left A & Z (general) as well as G (too interdisciplinary) out of the equation. The social sciences were the highest in sessions and views on eBrary, but humanities win the downloads. For EBL, the social sciences win all categories. When he looked at actions per session, STEM had higher views, but all disciplines downloaded at about the same rate on both platforms.
How do you measure predicted use? He used the percentage of books in an LC class relative to the total books available. If that class’s share of a use metric is lower than its share of the collection, it is not meeting expected use, and vice versa. H, L, G, N, and D all did better than expected. Q, F, P, K, and U did worse than expected.
How about breadth versus depth? This gets complicated. Better to find the slides and look at the graphs. The results map well to the predicted use outcomes.
Can we determine the level of immersion in a book? If more pages are viewed per session in a subject area, does that mean the users spend more time reading or just look at more pages? Medicine (R), History of the Americas (F), and Technology (T) appear to be used at a much higher rate within a session than other areas, despite performing poorly in breadth versus depth assessment. In other words, they may not be used much per title, but each session is longer and involves more actions than others.
How do we use these observations to build better collections and better serve our users?
Books with call numbers tend to be used more than those without. Is it because a call number is indicative of better metadata? Is it because higher quality publishers provide better metadata? It’s hard to tell at this point, but it’s something he wants to look into.
A white paper is coming soon and will include a combined data set. It will also include the EBL data about how long someone was in a book in a session. Going forward, he will also look into LC subclasses.
I signed up for a Coursera class on statistics for social science researchers because I wanted to learn how to better make use of library data and also how to use the open source program for statistical computing, R. The course information indicated I’d need to plan for 4-6 hours per week, which seemed doable, until I got into it.
The course consists of several lecture videos, most of which include a short “did you get the main concepts” multiple-choice quiz at the end. Each week there is an assignment and graded quiz, and of course a midterm and final.
It didn’t help that I started off behind, getting through only a lecture or two before the end of the first week, and missing the deadline for having the first assignment and quiz graded. I scrambled to catch up the second week, but once again couldn’t make it through the lectures in time.
That’s when I realized that it was going to take much longer than projected to keep up with this course. A 20-30 min lecture would take me 45-60 min to get through because I was constantly having to pause and write notes before the lecturer went on to the next concept. And since I was using Microsoft OneNote to keep and organize my notes, anything that involved a formula took longer to copy down.
By the end of the third week, I was still a few lectures away from finishing the second week, and I could see that it would take more time than I had to keep going, but I decided to go another week and do what I could.
That was this week, and I haven’t had time to make any more progress than where I was last week. With no prospect of catching up before the midterm deadline, I decided to withdraw from the course.
This makes me disappointed both in myself and in the structure of the course. I hate quitting, and I really want to learn the stuff. But as I fell further and further behind, it became easier to put it off and focus on other overdue items on my task list, which compounded the problem.
The instructor for the course was easy to follow, and I liked his lecture style, but when it came time to do the graded quiz and assignment, I realized I clearly had not understood everything, or he expected me to have more background in the field than a novice does. It also seemed like the content was geared toward a 12-week course, and with this one being only 8 weeks, rather than reduce the content accordingly, he crammed it all into those 8 weeks.
Having deadlines was a great motivation to keep up with the course, which I haven’t had when I’ve tried to learn on my own. It was the volume of content to absorb between those deadlines that tripped me up. I need to find a happy medium between self-paced instruction and structured instruction.
My day began with organizing and prioritizing the action items that arrived yesterday when I was swamped with web-scale discovery service presentations. I didn’t get very far when it was time to leave for a meeting about rolling out VuFind locally. Before that meeting, I dropped in to update my boss (and interim University Librarian) on some things that came out of the presentations and subsequent hallway discussions.
At the VuFind meeting, we discussed some tweaks and modifications, and most everyone took on some assignments to revise menu labels, record displays, and search options. I managed to evade an assignment only because these things are more for reference, cataloging, and web services. The serials records look fine and appear accurately in the basic search (from the handful of tests I ran), so I’m not concerned about tweaking anything specifically.
Back at my desk, I started to work on the action items again, but the ongoing conversations about the discovery service presentations distracted me until one of the reference librarians provided me with a clue about the odd COUNTER use stats we’ve received from ProQuest for 2011.
I had given her stats on a resource that was on the CSA platform, but for the 2011 stats I provided what ProQuest gave me, which were dubious in their sudden increase (from 15 in 2010 to 4756 in 2011). She made a comment about how the low stats didn’t surprise her because she hates teaching the Illumina platform. I said it should be on the ProQuest platform now because that’s where the stats came from. She said she’d just checked the links on our website, and they’re still going to Illumina.
This puzzled me, so I pulled the CSA stats from 2011, and indeed, we had only 17 searches for the year for this index. I checked the website and LibGuides links, and we’re still sending users to the Illumina platform, not ProQuest. So I’m not sure where those 4756 searches were coming from, but their source might explain why our total ProQuest stats tripled in 2011. This led me to check our federated search stats, and while they show quite a few searches of ProQuest databases (although not this index, as we hadn’t included it), our DB1 report shows zero federated searches and sessions.
I compiled all of this and sent it off to ProQuest customer support. I’m eager to see what their response will be.
This brought me up to my lunch break, which I spent at the gym where one of the trainers forced my compatriots and me to accomplish challenging and strenuous activities for 45 min. After my shower, I returned to the library to eat lunch at my desk and respond to some crowd-sourced questions from colleagues at other institutions.
I managed to whack down a few email action items before my ER&L co-presenter called to discuss the things we need to do to make sure we’re prepared for the panel session. We’re pulling together seasoned librarians and product representatives from five different electronic resource management systems (four commercial, one open-source) to talk about their experiences working with the products. We hashed out a few things that needed hashing out, and ended the call with more action items on our respective lists.
At that point, I had about 20 min until my next meeting, so I tracked down the head of research and instruction to hash out some details regarding the discovery service presentations that I wanted to make sure she was aware of. I’m glad I did, because she filled in some gaps I had missed, and later she relayed a positive response from one of the librarians that concerned both of us.
The meeting ended early, so I took the opportunity of suddenly unscheduled time in my calendar to start writing down this whole thing. I’d been so busy I hadn’t had time to journal this throughout the day like I’d previously done.
Heard back from ProQuest, and although they haven’t addressed the missing federated search stats from their DB1 report, they explain away the high number of searches in this index as having come from a subject area search or the default search across all databases. There was (and may still be) a problem with defaulting to all databases if the user did not log out before starting a new session, regardless of which database they intended to use. PQ tech support suggested looking at their non-COUNTER report that includes full-text, citation, and abstract views for a more accurate picture of what was used.
For the last stretch of the day, I popped on my headphones, cranked up the progressive house, and tried to power through the rest of the email action items. I didn’t get very far, as the first one required tracking down use stats and generating a report for an upcoming renewal. Eventually, I called it a day and posted this. Yay!
The Library Journal periodicals price survey was developed in partnership with EBSCO after the ALA pulled the old column to publish it in American Libraries. There is a similar price survey done by the AALL for law publications.
There is a difference between a price survey and a price index. A price survey is a broad look, and a price index attempts to control the categories/titles included.
[The next bit was all about the methodology behind making the LJ survey. Not why I am interested, so not really taking notes on it.]
Because of the challenge of getting pricing for ejournals, the survey is based mainly on print prices. That being said, the trends in pricing for print are similar to those for electronic.
Knowing the trends for pricing in your specific set of journals can help you predict what you need to budget for. While there are averages across the industry, they may not be accurate depending on the mix in your collection. [I am thinking that this means that the surveys and indexes are useful for broad picture looks at the industry, but maybe not for local budget planning?]
It is important to understand what goes into a pricing tool and how it resembles or departs from local conditions in order to pick the right one to use.
Budgets for libraries and higher education are not in “recovery.” While inflation calmed down last year, prices are on the rise this year, with an estimated increase of 7-8%. The impact may be larger than at the peak of the serials pricing crisis in the 1990s. Libraries will have less buying power, users will have fewer resources, and publishers will have fewer customers.
Why is the inflation rate for serials so much higher than the consumer price index inflation rate? There has been an expansion of higher education, which adds to the amount of stuff being published. The rates of return for publishers are pretty much normal for their industry. There isn’t any one reason why.
Library data is meaningless in and of itself – you need to interpret it to give it meaning. Piotr Adamczyk did much of the work for the presentation, but was not able to attend today due to a schedule conflict.
They created the visual dashboard for many reasons, including a desire to expose the large quantities of data they have collected and stored, but in a way that is interesting and explanatory. It’s also a handy PR tool for promoting the library to benefactors, and to administrators who are often not aware of the details of where and how the library is being effective and the trends in the library. Finally, the data can be targeted to the general public in ways that catch their attention.
The dashboard should also address assessment goals within the library. Data visualization allows us to identify and act upon anomalies. Some visualizations are complex, and you should be sensitive to how you present it.
The ILS is a great source of circulation/collections data. Other statistics can come from the data collected by various library departments, often in spreadsheet format. Google Analytics can capture search terms in catalog searches as well as site traffic data. Download/search statistics from eresources vendors can be massaged and turned into data visualizations.
The free tools they used included IMA Dashboard (local software, Drupal Profile) and IBM Many Eyes and Google Charts (cloud software). The IMA Dashboard takes snapshots of data and publishes it. It’s more of a PR tool.
Many Eyes is a hosted collection of data sets with visualization options. One thing I like was that they used Google Analytics to gather the search terms used on the website and presented that as a word cloud. You could probably do the same with the titles of the pages in a page hit report.
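Turning exported search terms into the word frequencies behind a word cloud is a short exercise; here is a hedged sketch with invented queries (a real export from Google Analytics would be a much longer list):

```python
from collections import Counter

# hypothetical catalog search terms exported from Google Analytics
searches = ["climate change", "climate policy", "nursing", "climate change"]

# count individual words across all queries; these counts are what a
# word cloud translates into font sizes
terms = Counter(word for query in searches for word in query.split())
top = terms.most_common(2)
```

The same approach works on the page titles in a page-hit report, as suggested above: swap the list of queries for a list of titles.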
Google Chart Tools are visualizations created by Google and others, and use Google Spreadsheets to store and retrieve the data. The motion charts are great for showing data moving over time.
Lessons learned… Get administrative support. Identify your target audience(s). Identify the stories you want to tell. Be prepared for spending a lot of time manipulating the data (make sure it’s worth the time). Use a shared repository for the data documents. Pull from data your colleagues are already harvesting. Try, try, and try again.
Three options: do it yourself, gather and format to upload to a vendor’s collection database, or have the vendor gather the data and send a report (Harrassowitz e-Stats). Surprisingly, the second solution was actually more time-consuming than the first because the library’s data didn’t always match the vendor’s data. The third is the easiest because it’s coming from their subscription agent.
Evaluation: review cost data; set cut-off point ($50, $75, $100, ILL/DocDel costs, whatever); generate list of all resources that fall beyond that point; use that list to determine cancellations. For citation databases, they want to see upward trends in use, not necessarily cyclical spikes that average out year-to-year.
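The cut-off step above is simple to automate once cost-per-use figures exist; a minimal sketch, with invented resource names and numbers:

```python
# hypothetical cost-per-use figures for a handful of resources
cost_per_use = {"Database A": 12.40, "Journal B": 87.50, "Index C": 103.25}

def over_threshold(cpu, cutoff):
    """Return resources whose cost per use exceeds the cut-off, worst first.
    These become the candidate list for cancellation review."""
    flagged = [(title, cost) for title, cost in cpu.items() if cost > cutoff]
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)

# with a $75 cut-off, Index C and Journal B land on the review list
candidates = over_threshold(cost_per_use, 75)
```

The cut-off itself ($50, $75, ILL cost, whatever) stays a local policy decision; the code only generates the list.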
Future: Need more turnaway reports from publishers, specifically journal publishers. COUNTER JR5 will give more detail about article requests by year of publication. COUNTER JR1 & BR1 combined report – don’t care about format, just want download data. Need to have download information for full-text subscriptions, not just searches/sessions.
Speaker: Benjamin Heet, librarian
He is speaking about University of Notre Dame’s statistics philosophy. They collect JR1 full text downloads – they’re not into database statistics, mostly because fed search messes them up. Impact factor and Eigen factors are hard to evaluate. He asks, “can you make questionable numbers meaningful by adding even more questionable numbers?”
At first, he was downloading the spreadsheets monthly and making them available on the library website. He started looking for a better way, whether that was to pay someone else to build a tool or do it himself. He went with the DIY route because he wanted to make the numbers more meaningful.
Avoid junk in, junk out: HTML vs. PDF download counts depend on the platform setup. Pay attention to outliers to watch for spikes that might indicate unusual use by an individual. The reports often contain bad data or duplicate data on the same report.
CORAL Usage Statistics – local program gives them a central location to store user names & passwords. He downloads reports quarterly now, and the public interface allows other librarians to view the stats in readable reports.
Speaker: Justin Clarke, vendor
Harvesting reports takes a lot of time and requires some administrative costs. SUSHI is a vehicle for automating the transfer of statistics from one source to another. However, you still need to look at the data. Your subscription agent has a lot more data about the resources than just use, and can combine the two together to create a broader picture of the resource use.
Harrassowitz starts with acquisitions data and matches the use statistics to that. They also capture things like publisher changes and title changes. Cost per use is not as easy as simple division – packages confuse the matter.
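One common workaround for the package problem (an assumption on my part, not necessarily Harrassowitz’s method) is to allocate a package’s single invoice across its titles in proportion to each title’s use. A sketch with invented numbers shows why this still confuses the comparison:

```python
# one invoice covers the whole package; per-title prices are unknown
package_cost = 10000.0
downloads = {"Title A": 600, "Title B": 300, "Title C": 100}

total_downloads = sum(downloads.values())

# allocate the invoice to each title by its share of total use
allocated = {t: package_cost * n / total_downloads for t, n in downloads.items()}

# per-title cost per use under proportional allocation
cpu = {t: allocated[t] / downloads[t] for t in downloads}
```

Under proportional allocation every title in the package comes out with an identical cost per use (the package total divided by total downloads), so the metric can no longer distinguish strong titles from weak ones within the package. That is exactly the sense in which packages confuse the matter.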
High use could be the result of class assignments or hackers/hoarders. Low use might be for political purchases or new department support. You need a reference point of cost. Pricing from publishers seems to have no rhyme or reason, and your price is not necessarily the list price. Multi-year analysis and subject-based analysis look at local trends.
Rather than usage statistics, we need useful statistics.
We have the tools and the data, now we need to use them to the best advantage. Statistics, along with other data, can create a picture of how our online resources are being used.
Traditionally, we have gathered stats by counting when re-shelving, ILL, gate counts, circulation, etc. Do these things really tell us anything? Stats from eresources can tell us much more, in conjunction with information about the paths we create to them.
Even with standards, we can run into issues with collecting data. Data can be “unclean” or incorrectly reported (or late). And, not all publishers are using the standards (i.e. COUNTER).
After looking at existing performance indicators and applying them to electronic resources, we can look at trends in our electronic resources. This can help us determine the return on investment in these resources.
Keep a master list of stats in order to plan out how and when to gather them. Keep the data in a shared location. Be prepared to supply data in a timely fashion for collection development decision-making.
When you are comparing resources, it’s up to individual institutions to determine what is considered low or high use. Look at how the resources stack up within the over-all collection.
When assessing the value of a resource, Beals and her colleagues are looking at 2-3 years of use data, 10% cost inflation, and the cost of ILL. In addition, they make use of overlap analysis tools to determine where they have multiple formats or sources that could be eliminated based on which platforms are being used.
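The keep-versus-ILL comparison described above can be sketched as a small projection; all the numbers here are invented, and only the 10% inflation assumption comes from the talk:

```python
# hypothetical subscription under review
annual_cost = 1200.0
inflation = 0.10            # the 10% cost-inflation assumption
ill_cost_per_request = 17.50
expected_requests = 30      # estimated requests per year if cancelled
years = 3                   # the 2-3 year window mentioned above

# projected subscription spend over the window, compounding inflation
projected_sub = sum(annual_cost * (1 + inflation) ** y for y in range(years))

# projected ILL/document-delivery spend over the same window
projected_ill = ill_cost_per_request * expected_requests * years

cheaper_to_cancel = projected_ill < projected_sub
```

In this toy case ILL comes out well under the subscription, but the conclusion flips quickly as expected requests rise, which is why the use-data window matters.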
Providing readily accessible data in a user-friendly format empowers selectors to do analysis and make decisions.
For the past couple of weeks, the majority of my work day has been spent on tracking down and massaging usage statistics reports from the publishers of the online products we purchase. I am nearly half-way through the list, and I have a few observations based on this experience:
1. There are more publishers not following the COUNTER code of practice than those who are. Publishers in traditionally library-dominated (and in particular, academic library-dominated) markets are more likely to provide COUNTER-compliant statistics, but that is not a guarantee.
2. Some publishers provide usage statistics, and even COUNTER-compliant usage statistics, but only for the past twelve months or some other short period of time. This would be acceptable only if a library had been saving the reports locally. Otherwise, a twelve month period is not long enough to use the data to make informed decisions.
3. We are not trying to use these statistics to find out which resources to cancel. On the contrary, if I can find data that shows an increase in use over time, then my boss can use it to justify our annual budget request and maybe even ask for more money.
Update: It seems that the conversation regarding my observations is happening over on FriendFeed. Please feel free to join in there or leave your thoughts here.
The following is an email conversation between myself and the representative of a society publisher who is hosting their journals on their own website.
Can I access the usage information for my institution? We subscribe to both the print and online [Journal Name].
Dear Ms. Creech,
At the most recent meeting of the [Society] Board of Directors, the topic of usage statistics was discussed at length. As I am sure you are aware, usage statistics are a very coarse measure of the use of a web resource. As just one example, there is no particular relationship between the number of downloads of an article and the number of times it is read or the number of times it is cited. An article download could represent anything from glancing at the abstract, to careful reading. Once downloaded, articles can be saved locally, re-read and redistributed to others. Given the lack of any evidence that downloads of professional articles have any relationship to their effective audience size or their value to readers, the Board decided that [Society] will not provide potentially misleading usage statistics. We do periodically publish the overall usage of the [Society] website, about 10 million hits per year.
[Society] Web Editor
Dear Mr. [Name Removed],
Your Board of Directors are certainly a group of mavericks in this case. Whether they think the data is valuable or not, libraries around the world use it to aid in collection development decisions. Without usage data, we have no idea if an online resource is being used by our faculty and students, which makes it an easy target for cancellation in budget crunch times. I suggest they re-think this decision, for their own sakes.
We all know that use statistics do not fully represent the way an online journal is used by researchers, but that does not mean they are without value. No librarian would ever make decisions based on usage data alone, but it does contribute valuable information to the collection development process.
Hits on a website mean even less than article downloads. Our library website gets millions of hits just from being the home page for all of the browsers in the building. I would never use website hits to make any sort of a decision about an online resource.
Provide the statistics using the COUNTER standard and let the professionals (i.e. librarians) decide if they are misleading.
UPDATE: The conversation continues….
Dear Ms. Creech,
Curiously, the providers of usage statistics are primarily commercial publishing houses. Few science societies that publish research journals are providing download statistics. In part, this is a matter of resources that the publisher can dedicate to providing statistics-on-demand: commercial publishing houses have the advantage of an economy of scale. They are also happy to provide COUNTER-compliant statistics in part because they are relatively immune to journal cancellation, as a result of mandatory journal bundling.
In any event, after careful consideration and lengthy discussion with a librarian-consultant, the Board concluded that usage statistics are easy to acquire and tempting to use, but are in effect “bad data”. I certainly respect your desire to make the most of a tight library budget, but also respectfully disagree that download statistics are an appropriate tool to make critical judgements about journals. Other methods to learn about the use of a particular journal are available, for example, asking faculty and students to rate the importance of journals to their work, or using impact factors. I am sure you take these into account as well.
I will copy this reply to the [Society] Board so that they are aware of your response. No doubt the Board will revisit the topic of usage statistics in future meetings.
Dear Mr. [Name Removed],
I never meant to imply that we use statistics exclusively for collection development decisions. We also talk with faculty and students about their needs. However, the numbers are often a good place to begin the discussions. As in, “I see that no one has downloaded any articles from this journal in the past year. Are you still finding it relevant to your research?” Even prior to online subscriptions, librarians looked at reshelving counts and the layer of dust on the tops of materials as indicators that a conversation was warranted.
I suggest your Board take a look at the American Chemical Society. They provide COUNTER statistics and are doing quite well despite the “bad data.”