March | 2014 | Notions

It sounds painful and it kind of was, but progress has been made in the quest to open our digital repository data and that progress involves Solr. To wit, there are a few things I have learned recently:

I can actually figure out a lot more with Solr than I thought
Trying to implement a Solr plugin when you are somewhat challenged to make Solr run and index records in the first place is a clear indication of one’s willfulness
The following movies are really good for taking breaks: Fierce Creatures, The American, the new Muppets movie, and Monuments Men (breaks are important and George Clooney seems to help too!)

The latest chapter of my story begins in Bloomington and ends in Amsterdam, with a few trips around the sun in between. Let’s begin by talking about Solr because we all know what happened when I contemplated opening up the data directly from Fedora.

If you want to grab a version of Solr and give it a whirl, nothing is easier than hitting the Solr download page, unzipping/un-tarring the latest version, and running the magic “java -jar start.jar” command in the magic “example” directory. On a Mac or Linux box, of course (who knows what happens on Windows). You can even index some sample records with a different java command, and the admin interface gives you search results and it just works. Magic! Everything’s great. Then, you want to do something more. Like try this Solr install with a giant index of actual data and then see if you can add a plugin for OAI feeds. What could go wrong?

Problem: Giant index of data is for a different Solr version than the one you installed
Problem: OAI plugin requires a different Solr version (3.4.0) than the one you installed (4.7.0) and the one where you got the giant index of data (1.4.1)
Problem: Even if you could get that giant index of data working on a Solr version that would also work for the OAI plugin, you can’t figure out where to put the OAI plugin’s binary (.jar file) because the magic install you used works with Jetty, a self-contained Java server, that mixes everything up between the Java code and the Solr code

We next embark on installing Solr to run on Tomcat, because that’s how I give up on things.

“Brand New Tomcats [CLR]” by Grigor Hristov

I had a giant index of data from Solr 1.4.1 and the oai4solr plugin needed Solr 3.4.0 at least, so 3.4.0 is what I chose and I backed off from using the giant index of data. Instead, I found the pieces we use to index certain types of MODS records from our Fedora digital repository into Solr 1.4.1 and, after copying the important pieces from the Solr 1.4.1 config (I hope), I’ve managed to recreate what could be our Solr index from Fedora using 3.4.0 – with 4 records of data.

Essentially, for those who like pictures, I was dealing with one version of Solr that had an index, like a piece of bread with a pat of butter on it.

Pictures of food always help a blog post, I say.

“Bread & Butter” by Mark H. Anbinder

Then I had another version of Solr that also had an index, like a broom and a dustpan.

“sikth y su amado recogedor” by Iris Shyroii

What I was initially trying to do was butter my broom, or apply an index from one version of Solr to a different version of Solr. This is not a thing, apparently.

OK, a broom made out of butter is apparently a thing, but buttering your broom is not, trust me.

“Butter broom and Monster Book of Monsters” by Sarah Cady

Once Solr was installed on Tomcat, all .jar files had a single place to go and I knew where to put the oai4solr.jar file. But OAI feeds don’t just happen. They are a specialized XML wrapper around records in more commonly recognizable metadata formats (Dublin Core, for example), and they are called up using a specific URL with specific parameters. So stuff has to be programmed (.jar file) but stuff also has to be configured, and I wasn’t configuring anything well in the code that accompanied the .jar file.
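For anyone wondering what "a specific URL with specific parameters" looks like, here is a minimal sketch in Python. The verb and parameter names (verb, metadataPrefix, set) come from the OAI-PMH 2.0 specification, but the base URL and the set name are made up for illustration; where the feed actually lives depends on how the plugin is configured.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: the real path depends on how the OAI plugin's
# request handler is registered in your Solr configuration.
BASE_URL = "http://localhost:8080/solr/oai"


def oai_request_url(verb, **params):
    """Build an OAI-PMH request URL from a verb and optional arguments."""
    query = {"verb": verb}
    # Skip any argument that was passed as None
    query.update({key: value for key, value in params.items() if value is not None})
    return BASE_URL + "?" + urlencode(query)


# Ask which sets (collections) the feed offers
list_sets_url = oai_request_url("ListSets")

# Ask for Dublin Core records from one set ("cushman" is an invented set name)
list_records_url = oai_request_url("ListRecords", metadataPrefix="oai_dc", set="cushman")

print(list_sets_url)
print(list_records_url)
```

Swapping metadataPrefix to a MODS prefix would request the same records in MODS, assuming the feed is configured to offer that format.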

Spending your Sunday afternoon with Java error messages on your localhost server sucks. In the end, clearer minds prevailed (Cliffster) and I was convinced that it was reasonable to email the guy who wrote the oai4solr plugin and put it on GitHub in the first place. In Amsterdam. On a Sunday night.

He answered. From Amsterdam. On a Sunday night. Lucien van Wouw from the International Institute of Social History, the author of oai4solr, helped me fix up my plugin configuration and I had a working oai4solr plugin on Monday morning. I now have OAI feeds from my Solr index returning sets and record lists in DC and MODS. (I’m not exactly sure how to say his last name, but I’m going with Wow! – including exclamation point – because that was awesome.)
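To give a feel for what "returning sets" means on the receiving end, here is a rough Python sketch that reads a ListSets response. The XML namespace and element names (ListSets, set, setSpec, setName) are from the OAI-PMH 2.0 specification; the sample response itself, including the date, the URL, and both set names, is invented for illustration and is not output from our server.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the OAI-PMH 2.0 specification
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Invented sample response: every value below is made up for illustration
SAMPLE_LIST_SETS = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2014-03-24T12:00:00Z</responseDate>
  <request verb="ListSets">http://localhost:8080/solr/oai</request>
  <ListSets>
    <set><setSpec>cushman</setSpec><setName>Example photograph collection</setName></set>
    <set><setSpec>archives</setSpec><setName>Example finding aids</setName></set>
  </ListSets>
</OAI-PMH>"""


def parse_sets(xml_text):
    """Return a list of (setSpec, setName) pairs from a ListSets response."""
    root = ET.fromstring(xml_text)
    pairs = []
    for set_element in root.findall("oai:ListSets/oai:set", OAI_NS):
        spec = set_element.findtext("oai:setSpec", namespaces=OAI_NS)
        name = set_element.findtext("oai:setName", namespaces=OAI_NS)
        pairs.append((spec, name))
    return pairs


sets = parse_sets(SAMPLE_LIST_SETS)
for spec, name in sets:
    print(spec, "=>", name)
```

A harvester would do much the same with a ListRecords response, pulling the DC or MODS record out of each record's metadata element.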

“WoW!” by Laurie Chipps

So now I have a different Solr version with a super-small number of records, possibly configured the same as the original Solr index, definitely with a couple new things that need to be included when indexing Fedora records into Solr, and absolutely with other collections in Fedora that still need to receive this mapping treatment so they can have descriptive information and be included in our Solr index. But most importantly, there is a way forward to open our data with OAI feeds. I banged my head against the Sun and kinda made something happen!

So, the goal: expose our digital objects in our digital repository so they can be grabbed as whole sets of data. This makes our data open in an Open Access way, specifically “permitting any user to read, download, copy, distribute, print, search or link to the [digital objects], crawl [their metadata] for indexing, pass them as data to software or use them for any lawful purpose.” Our goal in digitizing the things we digitize is to make them more available for research and use. So this includes being able to not only go to a web site and conduct a search that we provide to find an item or set of items, but also to directly take the information about those items and make use of it beyond the user interface points that we provide. Opening our digital repository objects also allows all of our digital objects to be combined with digital objects opened up at other places. Digital Public Library of America offers a way to combine multiple sets like ours to create larger sets and collections. We need to move beyond only providing access the way we know (web sites with search and browse). We need to set our digital objects out there with enough hooks to make it possible for anyone swooshing by on the Intertubes to grab that data and use it the way they want to use it.

So, the question: where to start? No seriously, that was my first question. Our digital repository had previous attempts at OAI harvestable feeds. These exposed specific sets (collections) in a specific way (DC records and sometimes MODS records via XML). It turns out that all of these attempts involved not going directly from the digital repository but extracting data from a certain point in time into an OAI service and then exposing that data. These feeds are now old and stale (and somehow I’m picturing dried up biscuits and now I am sad – few things are better than a fresh biscuit, with butter and maybe some jam).

@shutterbean via Flickr

The next step was digging into the digital repository and figuring out what was possible there to provide data feeds. Turns out, not much. We use Fedora and while it does a terrific job of taking things in, it doesn’t do so well at offering things up for discovery and access. I feel like my discoveries (which I’m sure in Fedora-land are only discoveries to me) might warrant a separate blog post, but I also know the Fedora Commons Mailing Lists exist and seem to be the place to go to ask questions and discuss Fedora-specific angst, so in trying to keep things organized for myself here on the blog, I will hold off for now.

In a serendipitous sort of fashion, our digital repository has undergone Solr-ization in the form of a Blacklight search interface – Digital Collections Search (DCS). This had more to do with our desire to provide better discovery options than with opening our data. Instead of needing to go to a specific site to search or browse a specific collection (Cushman photographs) or a specific format (Archives Online – and good luck figuring out which finding aids have digital objects attached to them), you can use DCS to take in a better picture of all of our digital objects, or the repository as a whole. You still can’t do anything beyond accessing those web sites we provide for you, so the access part is still restrictive, but the discovery part is improved. From what I understand, anything in our digital repository containing a MODS record is automatically indexed into Solr and offered up through DCS.

So we can’t go directly from our digital repository to expose data, the OAI feeds that were previously implemented are not easily maintained or refreshed (obviously, because stale biscuits), and we appear to have a Solr index that is regularly populated from our digital repository. I also appear to have shifted my goal from exposing our digital repository data to exposing our Solr-indexed data. This seems like the more trendy way to approach the problem (the path involving the most mustaches and skinny jeans, if you will) but how does it survive over the long term?

We are now investigating OAI feeds from Solr (imagine my lack of surprise at the existence of a GitHub project for just this sort of thing) and what it would mean to adjust the response we provide from Solr so that, instead of only going through Blacklight, we can provide a response in a variety of formats (JSON, XML, CSV). Maybe this is just what we need: any future changes to available indexing can be managed more gracefully, and the real work lies in making sure our digital repository data is ready for whatever comes along. That is my hopeful ending to an otherwise weird tale of exposing data that should really already be out there.
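As a rough idea of what asking Solr for different response formats could look like, here is a small Python sketch. Solr picks its response writer from the wt request parameter, and the json, xml, and csv values are available in Solr releases that ship those writers. The host, port, and handler path below are assumptions for a local install, not our actual setup.

```python
from urllib.parse import urlencode

# Hypothetical local Solr select handler; adjust host, port, and path to your install
SOLR_SELECT = "http://localhost:8080/solr/select"


def solr_query_url(query, response_format, rows=10):
    """Build a Solr select URL that asks for a given response format via wt."""
    params = {"q": query, "wt": response_format, "rows": rows}
    return SOLR_SELECT + "?" + urlencode(params)


# The same match-all query, requested in three formats
urls = {fmt: solr_query_url("*:*", fmt) for fmt in ("json", "xml", "csv")}

for fmt, url in urls.items():
    print(fmt, url)
```

The query stays the same in each case; only the wt value changes, which is what makes this an appealing way to serve the same index to different kinds of consumers.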

Banging My Head Against the Sun

Technicalities of Exposing Digital Repository Data