NASIG 2013: Knowledge and Dignity in the Era of Big Data

CC BY 2.0 2013-06-10
“Big Data” by JD Hancock

Speaker: Siva Vaidhyanathan

Don’t try to write a book about fast-moving subjects.

He was trying to capture the nature of our relationship to Google. It provides us with services that are easy to use, fairly dependable, and well designed. However, that level of success can breed hubris. He was interested in how this drives the company to its audacious goals.

It strikes him that what Google claims to be doing is what librarians have been doing for hundreds of years already. He found himself turning to the core practices of librarians as a guideline for assessing Google.

Why is Google interested in so much stuff? What is the payoff to organizing the world’s information and making it accessible?

Big data is not a phrase that they use much, but the notion is there. More and faster equals better. Google is in the prediction/advertising business. The Google Books project is their attempt to reverse engineer the sentence. Knowing how sentences work, they can simulate how to interpret and create sentences, which would be a simulation of artificial intelligence.

The NSA’s deals that give it a backdoor to our data services create data insecurity, because if the NSA can get in, so can the bad guys. Google keeps data about us (and has to turn it over when asked) because it benefits their business model, unlike libraries, which don’t keep patron records so that they can protect patrons’ privacy.

Big data means more than a lot of data. It means that we have so many instruments to gather data: cheap, ubiquitous cameras and microphones, GPS devices that we carry with us, credit card records, and more. All of these ways of creating data feed into huge servers that can store it, with powerful algorithms that can analyze it. Despite all of this, there is no policy surrounding it, nor are there conversations about the best ways to manage it in light of the impact on personal privacy. There is no incentive to curb big data activities.

Scientists are generally trained to understand that correlation is not causation. We seem to be happy enough to draw pictures with correlation and move on to the next one. With big data, it is far too easy to stop at correlation. This is a potentially dangerous way of understanding human phenomena. We are autonomous people.

The panopticon was supposed to keep prisoners from misbehaving because they assumed they were always being watched. Foucault described the modern state in the 1970s as the panopticon. However, at this point, it doesn’t quite match. We have a cryptopticon, because we aren’t allowed to know when we are being watched. It wants us to be on our worst behavior. How can we inject transparency and objectivism into this cryptopticon?

Those who can manipulate the system will, but those who don’t know how or that it is happening will be negatively impacted. If bad credit can get you on the no-fly list, what else may be happening to people who make poor choices in one aspect of their lives that they don’t know will impact other aspects? There is no longer anonymity in our stupidity. Everything we do, or nearly so, is online. Mistakes of teenagers will have an impact on their adult lives in ways we’ve never experienced before. Our inability to forget renders us incapable of looking at things in context.

Mo Data, Mo Problems

NASIG 2012: Why the Internet is More Attractive Than the Library

Speaker: Dr. Lynn Silipigni Connaway, OCLC

Students, particularly undergraduates, find Google search results to make more sense than library database search results. In the past, these kinds of users had to work around our services, but now we need to make our resources fit their workflow.

Connaway has tried to compare 12 different user behavior studies in the UK and the US to draw some broad conclusions, and this has informed her talk today.

Convenience is number one, and it changes. Context and situation are very important, and we need to remember that when asking questions about our users. Sometimes they just want the answer, not instruction on how to do the research.

Most people power browse these days: scan small chunks of information, view the first few pages, no real reading. They combine this with squirreling: short, basic searches and saving the content for later use.

Students prefer keyword searches. This is supported by looking at the kinds of terms used in the search. Experts use broad terms to cover all possible indexing; novices use specific terms. So why do we keep trying to get them to use the “advanced” search in our resources?

Students are confident with information discovery tools. They mainly use their common sense for determining the credibility of a site. If a site appears to have put some time into the presentation, then they are more likely to believe it.

Students are frustrated with navigating library websites and with the inconvenience of communicating with librarians face to face, and they tend to associate libraries only with books, not with other information. They don’t recognize that the library is what provides them with access to online content like JSTOR and the things they find in Google Scholar.

Students and faculty often don’t realize they can ask a question of a librarian in person because we look “busy” staring at our screens at the desk.

Researchers don’t understand copyright, or what they have signed away. They tend to be self-taught in discovery, picking up the same patterns as their graduate professors. Sometimes they rely on the students to tell them about newer ways of finding information.

Researchers get frustrated with the lack of access to electronic backfiles of journals, the difficulty of discovering non-English content, and unavailable content in search results (dead links, access limitations). Humanities researchers feel there is a lack of good, specialized search engines for them (the ones that exist are mostly for science). They get frustrated when they go to the library because of poor usability (i.e., signs) and a lack of integration between resources.

Access is more important than discovery. They want a seamless transition from discovery to access, without a bunch of authentication barriers.

We should be improving our OPACs. Take a look at Trove and Westerville Public Library. We need to think more like startups.

tl;dr – everything you’ve heard or read about what our users really do and really need, but we still haven’t addressed in the tools and services we offer to them

google print

My thoughts on Google Print, such as they are.

Benjamin asked for my opinion on Google Print. I started to reply in the comments, but it quickly grew from a small reply to something entry-sized:

I haven’t blogged on Google Print because I haven’t decided what I think about it. It’s gotten coverage on a variety of librarian blogs, as well as some public radio programs that I’ve heard.

The way I see it, it’s often difficult to find material in books because they aren’t always indexed very well. Unlike many journals, they aren’t available full-text so that you can search the entire book. Some companies are providing books in full-text formats, and there are several models for it, but their emphasis is on new books. I think what Google Print has to offer is full-text searching of old and out-of-print books. Often these have useful information for modern scholars.

My concern about Google Print is twofold:
1. Copyright: They need to be careful not to step over the line of copyright, or else the whole project may be tainted.
2. Searching: If I’m doing scholarly research, I don’t want to get 10,000 hits on a keyword search. I’m not sure how Google’s relevance rankings will work for books, but I hope that the search results will be as precise and accurate as a good reference database’s.

I’m keeping an open mind, waiting to see how it all turns out. I don’t want to trash Google Print just because it may step on the toes of libraries. I do hope that libraries will keep the books that are scanned into Google Print, because I doubt our users who want to read the whole book rather than gleaning information from parts of it will be willing to read it on a computer screen or print out the entire thing. On the other hand, our emerging users are more comfortable with screen text than even my generation, so I could be wrong.