ER&L 2016: COUNTER Point: Making the Most of Imperfect Data Through Statistical Modeling

[Image: “score card” by AIBakker]

Speakers: Jeannie Castro and Lindsay Cronk, University of Houston

Baseball statistics are a good place to start: there are over 100 years of data to work with. Cronk wished she could calculate something like WAR (wins above replacement) for e-resources. What makes a good/strong resource? What indicators besides usage performance should we evaluate? Can statistical analysis tell us anything?

Castro suggested looking at the data as a time series. Cronk is not a statistician, so she relied on colleagues who could do that kind of analysis.

Statistical modeling is the application of a set of assumptions to data, typically paired data, and there are several techniques that can be used. COUNTER reports are imperfect time series data sets: they don’t give us individual data points (day/time), and usage is lumped together by month. Aside from that, though, they work well for time series analysis, because the data points are equally spaced and consistently measured.
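As an illustration (not from the talk), here is a minimal R sketch of treating monthly COUNTER usage as a time series; the counts and the usage_ts name are invented for the example:

# Hypothetical monthly full-text downloads from a COUNTER JR1 report,
# January 2012 through December 2015 (48 made-up values).
usage <- c(1200, 1350, 1100, 980, 1020, 640, 580, 700, 1500, 1600, 1450, 1300,
           1250, 1400, 1180, 1010, 1060, 660, 610, 720, 1580, 1650, 1500, 1360,
           1310, 1450, 1220, 1050, 1090, 690, 640, 750, 1620, 1700, 1550, 1400,
           1340, 1480, 1260, 1080, 1120, 710, 660, 780, 1680, 1750, 1600, 1450)

# ts() marks the data as equally spaced monthly observations,
# which is what makes decomposition and smoothing possible later.
usage_ts <- ts(usage, start = c(2012, 1), frequency = 12)
plot(usage_ts, ylab = "Full-text downloads", main = "Monthly COUNTER usage")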

Decomposition provides a framework for breaking a time series into components (trend, seasonal, and irregular). Older data can be checked against newer data (e.g., 2010–2013 compared to 2014) without having to predict the future; statistical testing is important here. Exponential smoothing reduces noise and outliers, which is very useful when your COUNTER data has anomalies due to access issues or unusual spikes.
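A rough R sketch of both ideas, continuing the made-up usage_ts object above; this is illustrative, not the speakers’ actual code:

# Classical decomposition splits the series into trend, seasonal, and random parts.
parts <- decompose(usage_ts)
plot(parts)

# Hold back the most recent year: fit on 2012-2014, then check against 2015.
train <- window(usage_ts, end = c(2014, 12))
test  <- window(usage_ts, start = c(2015, 1))

# Holt-Winters exponential smoothing dampens noise and one-off spikes.
fit  <- HoltWinters(train)
pred <- predict(fit, n.ahead = 12)

# Compare the smoothed prediction with what actually happened in the held-out year.
plot(fit, pred)
lines(test, col = "blue")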

Cronk really wanted to look at something other than cost per use, which was part of the motivation for this project. Usage by collection portion size is another method, touted by Michael Levine-Clark. She needed four or more years of usage history for reverse predictive analysis. Larger numbers make analysis easier, so she went with large aggregator databases for the DB reports and some large journal packages for the JR reports.

She used Excel for data collection and clean-up, R (RStudio) for data analysis, and Tableau (Public) for data visualization. RStudio is much more user-friendly than base R on the desktop, and there are canned analysis packages that will do the heavy lifting. (There was a recommendation for Ryan Womack’s video series for learning how to use R.) Tableau helped with visualization of the data, including some predictive indicators. We cannot always see trends ourselves, so these visualizations can help us make decisions. She found that usage can be predicted based on the past.
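As a hedged illustration of the “canned packages do the heavy lifting” point, a package such as forecast can fit and project a model in a few lines (again using the invented usage_ts from above; the talk did not specify which packages were used):

library(forecast)   # install.packages("forecast") if needed

# auto.arima() picks a reasonable model automatically; forecast() projects it forward.
fit <- auto.arima(usage_ts)
fc  <- forecast(fit, h = 12)   # predict the next 12 months from past usage
plot(fc)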

They found that usage over time follows a consistent pattern across vendor platforms (for journal usage), even though some platforms were used more than others.

The next level she looked at was the search-to-session ratio for databases. What is the average? Is it meaningful? When we look at usage, what baseline would help us determine whether one database is more useful than another? Downward trends might be indicators of outside factors.
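A quick sketch of computing that ratio from COUNTER DB1-style counts; the searches and sessions vectors here are invented for illustration:

# Hypothetical monthly Regular Searches and Sessions for one database (12 months).
searches <- c(5200, 5600, 4900, 4300, 4500, 2100, 1900, 2300, 6100, 6400, 5800, 5100)
sessions <- c(1800, 1900, 1700, 1500, 1600,  800,  700,  900, 2100, 2200, 2000, 1800)

ratio <- searches / sessions          # searches per session, month by month
mean(ratio)                           # the average ratio, a possible baseline
ratio_ts <- ts(ratio, start = c(2015, 1), frequency = 12)
plot(ratio_ts, ylab = "Searches per session")   # a downward trend may flag outside factors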
