Digital Archive Sabbatical

This blog is for anyone interested in or experienced with digital archives and institutional repositories, especially in science and technology libraries.

Saturday, October 30, 2004

MATRIX

One task of the CIS librarians is to review potential collections for the system. I attended a meeting with a medieval historian with a database entitled MATRIX Bibliographia, visible at http://monasticmatrix.usc.edu/bibliographia/ . Discussion is at the preliminary stages, with the Bibliographia scholars hoping to get a grant to keep it going. Discussions center around the cost of imaging, copyright, pushing copies to the public as needed, growth in resources, ownership of the images, adding new scholarship (like an open access journal), and so forth. This is an especially interesting example, as it is run by faculty around the country, and depends more on an individual faculty member than on a local department or research center. CIS has to consider the cost of getting the collection into the system, the value of the collection to end-users, the ownership of the images, and the future of the collection should the present faculty member want to pass the responsibility on. CIS has developed a form for potential participants to fill out. I'll get a copy....


CIS and DIM

I am learning the acronyms. CIS is USC's Collection Information System. The group I am working with is creating the system using Documentum and Oracle, and migrating away from a legacy system built on BRS and its non-relational database.

DIM is Digital Information Management. They are the people that digitize the information and create the metadata for the Collection Information System. I had meetings with both groups this week. The CIS people are very much into working through their agenda of finishing the migration, adding to collections, improving thesaurus control, and improving the Contributor software for adding material to the system. This they are building themselves.

The DIM area has fabulous scanning facilities - an entire studio devoted to it, with cameras large and small, scanners large and small, special lighting, editing software, and temporary storage devices to hold the information until it is approved and moves on to have derivatives created (thumbnails, jpegs and the like) via MrSID and gets metadata treatment.

The metadata group uses qualified Dublin Core and XML to serve descriptive information to the public. I got a good description of the positives and negatives involved in adapting Documentum to the tasks at hand. Documentum was designed for corporate revision control and audit trails. It is requiring customization for application in the digital archive arena.
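To make "qualified Dublin Core and XML" a little more concrete, here is a minimal sketch in Python of what such a descriptive record might look like. The two namespace URIs are the standard Dublin Core ones; the wrapper element name `record`, the choice of fields, and the sample values are my own assumptions for illustration, not the structure USC actually uses.

```python
import xml.etree.ElementTree as ET

# Standard Dublin Core namespaces: the 15 simple elements, and the
# "terms" namespace that carries the qualifiers (refinements).
DC = "http://purl.org/dc/elements/1.1/"
DCTERMS = "http://purl.org/dc/terms/"

ET.register_namespace("dc", DC)
ET.register_namespace("dcterms", DCTERMS)


def build_record(title, creator, created, spatial):
    """Build one descriptive record (hypothetical layout, for illustration)."""
    record = ET.Element("record")  # wrapper element name is an assumption
    ET.SubElement(record, f"{{{DC}}}title").text = title
    ET.SubElement(record, f"{{{DC}}}creator").text = creator
    # Qualified elements: 'created' refines date, 'spatial' refines coverage.
    ET.SubElement(record, f"{{{DCTERMS}}}created").text = created
    ET.SubElement(record, f"{{{DCTERMS}}}spatial").text = spatial
    return record


# Sample values are invented for the example.
record = build_record(
    title="Street scene, Los Angeles",
    creator="Unknown photographer",
    created="1925",
    spatial="Los Angeles (Calif.)",
)
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

The qualifiers are what make the record "qualified": a plain date or coverage element is narrowed to say which kind of date or which kind of coverage is meant.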

The whole system runs on 3 servers, one for ingest (getting info into the system), one for renditioning, and one for thesaurus management using Documentum's Content Intelligence System (also CIS, not to be confused with USC's Collection Information System!). More than anything I am amazed at the staff resources being applied to all of these steps, in particular to the customization of Documentum.


Home Scene

Just a little about the home scene.
I am staying in a Victorian house 10 blocks north of USC. I have my own room and bathroom, an internet connection, and a view into the back garden where the sun shines in all day. The garden has a furnished glassed-in room for parties and doing laundry. I can practice the violin there too and not bother anyone. The garden's fountains are always bubbling. The foliage is so thick I can scarcely see the Victorian B&B that sits across the garden. It is called the Pink Lady, for obvious reasons.

The entire garden is surrounded by a fence, lending a feeling of peace and security. There is also an off-street parking area, which inspired my decision to drive to Los Angeles. The owner of the B&B does not even lock her car! She also owns Cafe 29 on Hoover Street, which I have yet to visit.

There are six rooms in my house, each rented on a short or long-term basis by people generally related to USC, as are the alumni owners. Currently they range from the young Italian woman getting a doctorate in Slavic Literature to the Indian computer science research associate downstairs. For a few days, the parents of a student are visiting - the father from my hometown of Minneapolis but now a diplomat in Croatia. A Chinese couple just moved in while they seek a permanent apartment. We all share a living room with TV, and a huge kitchen with the biggest range I have seen - 8 gas burners, a griddle in the middle and 3 ovens! We each have spaces in cupboards and in the fridge for our food supply. It seems a comfortable arrangement.

Research in Visual Arts

At Doheny (USC's main library) in the Intellectual Commons (a room set aside for symposia and the like) there was a presentation yesterday (Friday) for visual arts scholars on the resources and services available to scholars in the LA area. Presenters were from the various institutions in LA that house visual materials. An impressive array. I have summarized the information elsewhere, but the institutions represented are listed below. They have literally millions of photos among them, much relating to CA history and also the movie industry.

The Huntington: Jennifer Watts
USC Regional History Collection: Dace Taube
UCLA Dept of Special Collections: Online Archive www.oac.cdlib.org
Getty Research Institute
Academy of Motion Picture Arts and Sciences: the OSCAR people
Margaret Herrick Library
USC Cinema Library
USC AFA (Architecture and Fine Arts)
Art history slide library
USC Digital Archive

Wednesday, October 27, 2004

USC beginning

Yesterday I met Barbara Shepard at USC and began the process of orientation. In addition to learning a bit about what is going on in her area, I had to get phone, email, and USC ID card. I also checked in with ISD's Human Resources.

Following a lunch treat at the University Club with Barbara and Deborah Holmes-Wong, who is working on metasearching using ZPortal, I settled in to reviewing the CIS documents on the intranet, mainly a charter spelling out the goals and project plan for 2004-2005. The plans are ambitious, including enabling OAI/PMH, evaluating user interfaces, expanding content, implementing a contributor module, not to mention management reporting and thesaurus management.
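For readers wondering what "enabling OAI/PMH" means in practice: a harvester asks a repository for its metadata with an ordinary web request. Below is a minimal Python sketch that only builds such a request URL. The `verb` and `metadataPrefix` arguments and the `ListRecords` and `oai_dc` values come from the protocol itself; the base URL is a made-up placeholder, not a real USC address, and nothing is actually sent over the network.

```python
from urllib.parse import urlencode


def oai_request_url(base_url, verb, **arguments):
    """Return the URL for one OAI-PMH request (the request is not sent)."""
    query = {"verb": verb}
    query.update(arguments)
    return f"{base_url}?{urlencode(query)}"


# Hypothetical repository address, used only for illustration.
BASE_URL = "http://repository.example.edu/oai"

# Ask for all records, described in simple Dublin Core.
url = oai_request_url(BASE_URL, "ListRecords", metadataPrefix="oai_dc")
print(url)
# http://repository.example.edu/oai?verb=ListRecords&metadataPrefix=oai_dc
```

The repository answers with an XML document of records, which the harvester can then index alongside records from other archives.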

I walked to work yesterday and took the free USC Tram home, arriving just before the rains. It rained all night. The talk of the town.

Today I had occasion to meet Mr. Lynn Sipe, head of collections here. I continued to review Digital Information Management documents. I was interested to learn that the people who do scanning and digitizing are completely separate from this unit, which is just building the infrastructure for future digital projects.

Monday, October 25, 2004

The LA adventure

I returned from Michigan to Cincinnati, repacked the car, and set out for LA on October 22. After three hard days of driving (36 hours) including an overnight stay with a college friend in Santa Fe, I arrived in Los Angeles.

Today I checked into my lodging, a room in a Victorian house about a half mile from the University of Southern California (USC). I spent the afternoon unpacking, setting up my laptop, and getting groceries. I will be sharing a kitchen and garden area with other tenants, one of whom is an Italian student getting a PhD in Slavic Literature.

Tomorrow I will meet with Barbara Shepard at USC and my work there will begin. I have some questions from Anita at OhioLINK about the software they are using at USC. Hopefully I can get some useful answers.

Monday, October 18, 2004

Metadata

The Getty Museum has a thorough description of metadata, some of it written by Anne Gilliland-Swetland. Is that the Anne Gilliland that used to work in Records Management at UC? That is my focus today - wading through the metadata world. It's at http://www.getty.edu/research/conducting_research/standards/intrometadata/index.html if you are interested.

Yesterday was a Sunday, and the library was closed. I ended up performing in the Benzie Area Symphony Orchestra in a concert featuring two high-school award winners, a flutist from the Leelanau School and a cellist from Interlochen Arts Academy. I am amazed that they even have an orchestra here. Some ringers were pulled in, from the music department at the local public school and from Interlochen.

Today is my last day before heading back to Cincinnati from Frankfort (had you guessed where I am?) There will be a hiatus now as I re-pack and get myself to LA, a long solo drive.

Saturday, October 16, 2004

CODA

I couldn't seem to find much on the web showing behind-the-scenes workings of the USC Digital Archive. So I went to Caltech in Pasadena to see what they were up to with their CODA product (Collection of Open Digital Archives). Their material is of interest, as they too are a sci/tech institution. I am just exploring the interface at this point.

I took a little time out last night to attend Collage at Interlochen Arts Academy. It was 90 delightful minutes of short performances by over 200 performers - in groups ranging from jazz band to classical ballet, interspersed with solos, small ensembles, and an amusing rhythmic group performing a la Stomp. It was a great way to spend a cold and rainy evening. Though Interlochen is 30 miles away, the people sitting in the next seats were cottage neighbors! It's a small world up here, with many hardy people who have retired and stay year round despite the cold winters.


Thursday, October 14, 2004

DSpace Demo

This morning I dispensed with some papers from my office. Trying to tie up loose ends. When it was time to work on my sabbatical, I returned to DSpace! Margret Branschofsky at MIT sent me the login and password for the DSpace Demo, so I played with it and then used it to submit a poem to the Demonstration Community. You can view it at
https://dspace-demo.mit.edu/bitstream/1721.2/3469/1/A+Walk+in+the+Woods.doc .

I wrote the poem last week after walking in the woods near the cottage. Most of the women here say they won't walk alone any more because of the cougar sightings. That's for real. Cougars have been seen in the very area where I was walking. I didn't think of it until part way through my walk, as you can see from the poem. I was wondering if the fear that gripped me was like what soldiers experienced in Vietnam, and decided theirs was worse - they were dealing with intentional dangers in a dense jungle, not just "innocent" cougars in a friendly woods....




Wednesday, October 13, 2004

The User Side of DSpace

Today I returned to MIT’s DSpace and looked at the public side and its user interface, driven by Jakarta Lucene [do you know her?]. I also rummaged through some of the material posted by the various Communities. Stumbled across a biography of a possible relative, Edward Furber Miller (Furber being my maiden name). Born in 1866, he was a Professor of Steam Engineering and headed the Mechanical Engineering Department at the Institute from 1911 to 1933! See https://dspace.mit.edu/html/1721.1/5558/miller.html.

I gleaned some good ideas for what could be done at UC in the way of sci/tech repository material. I also added more acronyms and terms to my posting.

Lunch break was exciting. I sat at a picnic table between the library and the lakeshore, enjoying the brisk air and blue sky. I ran to the car to get my water bottle, and returned to find my sandwich dumped on the table and a seagull flying away with the plastic wrap! Luckily the plastic wrap unfolded and left the sandwich for me. I chased after the seagull and recovered the plastic wrap, lest the bird choke on it.

This afternoon it was time to switch gears and take a look at the USC Digital Archive. That is where I will be working just 12 days from now. I thought it a good idea to see what their content and interface look like. You can see it at http://digarc.usc.edu:8089/cispubsearch/ . They focus on the history of Los Angeles for the most part. I will be working under Barbara Shepard, who went to USC from the Getty Museum. USC is using Documentum, the software that OhioLINK is using for its institutional repository. An interesting aside is that CCM Dean Lowry’s wife used to work at the Getty in an office next to Barbara. I am to take greetings when I go.

During breaks I succumb to the fiction that surrounds me in this little library. I discovered a book – The Life of Pi – that was recommended by two sources, Sudhindra Rao, grad assistant at my library, and friend Elizabeth Hill, member of the local book club. They just finished reading it. I am not able to check books out, as my township does not support this library, so I sneak peeks at the book when I tire of the computer, or when the transmission slows down. When done, I practice my rusty skills in shelving fiction, aided by the gap on the shelf and the interesting neighboring book, Harvard Yard.

Tuesday, October 12, 2004

DSpace continued

Today I spent more time looking at background technology info for DSpace. I used lots of terms to beef up my posting of acronyms and terminology. [Be sure to check that posting as well as my links posting from time to time. I keep adding things.] Not done yet though. I sure am thankful for all the explanatory information MIT's DSpace folks put on the web.

I took a break about 2 pm. You can only watch bicyclists and roller-bladers go by so often before having to join them. The bike trail that goes by this little library starts at Lake Michigan and goes all the way to the town of Beulah, along an old railroad right of way. It is a beautiful 10-mile ride one way, with red and golden trees lining the paved path. Sometimes you can see the Betsie River, and sometimes vistas over the wetlands. The last 3 miles or so are pea gravel on a clay base - worked for my street tires. The trail parallels Crystal Lake for the last couple of miles. Can't beat the scenery! The whole trip took 2 hours, including a conversation with the woman who runs the conservancy office in Beulah.

I will continue looking at the tech aspects of DSpace, as well as begin looking at it from the user perspective. I want to see what kinds of material they have collected from their various communities. Might give me some ideas!


Monday, October 11, 2004

Acronyms and Vocabulary

API = application programming interface; for example, an API allows indexing and searching of DSpace content, using a Java freeware search engine called Lucene
http://jakarta.apache.org/lucene/docs/index.html .
BSD = Berkeley Software Distribution, see www.opensource.org .
CDWA = Categories for the Description of Works of Art, developed by CIMI (Computer Interchange of Museum Information ) and AITF (Art Information Task Force), including elements for orientation, dimensions, condition, etc.
Celestial = software that harvests metadata from OAI-compliant repositories and re-exposes that metadata to other services - in effect an OAI cache.
CIMI = prior to December 15, 2003, Consortium for the Computer Interchange of Museum Information, developed CDWA.
CNRI = Corporation for National Research Initiatives www.cnri.reston.va.us .
Creative Commons = a non-profit offering alternatives to full copyright http://creativecommons.org/ .
Crosswalks = visual maps showing relationships among metadata in different databases to enable searching federated repositories, or more generally, mapped relationships between schemas.
CSDGM = Content Standard for Digital Geospatial Metadata, or FGDC-STD-001-1998, for describing geospatial datasets (topographic data, demographic data, GIS, computer-aided cartography files).
DA = digital archives.
DC = Dublin Core, a set of 15 metadata elements that can be used by any community to describe and search across a wide variety of information resources on the World Wide Web
DCMI = Dublin Core Metadata Initiative http://dublincore.org/
DIP = dissemination information package
DOI = digital object identifier, a name assigned to objects of intellectual property http://www.doi.org/ .
DRM = digital rights management.
Dublin Core LAP = Libraries Working Group Application Profile (descriptive attributes of a document such as author, title, etc)
EAD = Encoded Archival Description, an SGML (Standard Generalized Markup Language) DTD (Document Type Definition) for marking up the data in finding aids for online searching and display. Developed at UC-Berkeley, it is now maintained and supported as a standard by the Library of Congress and sponsored by the Society of American Archivists. The EAD can be used to represent complete archival structures, including hierarchies and associations. The kinds of functionality that EAD affords can also be implemented using Dublin Core, and it is also possible to migrate records from Dublin Core into the EAD format if necessary. More information on EAD is available at http://www.loc.gov/ead .
FGDC = Federal Geographic Data Committee, developed CSDGM.
Federated searching = searching across heterogeneous databases that follow different metadata standards rather than trying to convert all databases to one standard. Should be better than Z39.50 searching, which is keyword only, resulting in high recall, low precision.
GIS = Geographic Information Systems.
GNU = GNU's not Unix. Project started by Richard Stallman that has turned into the Free Software Foundation (FSF) to develop and promote alternatives to proprietary UNIX implementations, to build Unix(R)(TM)-compatible utilities and programs exclusively based on free program source code.
Granularity = level of detail, as in describing digital objects with metadata. E.g. for a video, identify the film as a unit, or each frame? Describe a website as an entity, or each page within? It depends on the access need.
Handle = persistent identifier or name for digital objects and other resources on the Internet. Can be used as Uniform Resource Names (URNs). URLs (locations) are not persistent.
Harmony Project = http://www.metadata.net/harmony/
IETF = CNRI's Internet Engineering Task Force .
indecs = a framework to allow various schemes for transactions related to different genres such as music, journal articles, and books.
IR = institutional repositories
Jakarta Lucene = search engine, used by MIT for searching DSpace
Metadata standards: three important for web are 1) keyword and description metatags implemented by search engines, 2) Dublin Core Metadata Initiative, and 3) Resource Description Framework.
LOM = Learning Object Metadata, an IEEE standard to enable the use and re-use of technology-supported learning resources such as computer-based training.
METS = Metadata Encoding and Transmission Standard, an XML schema for describing complex digital library objects, allowing management of digital library objects within repositories and exchange among repositories.
MIME = Multipurpose Internet Mail Extensions (so mail can recognize file types like .doc .jpg etc.).
MPEG-7 = standard for metadata elements, structure and relationships used to describe audiovisual objects (still pics, graphics, 3D models, music, audio, speech, video, or multimedia collections).
MPEG-21 = standard to provide framework for interoperability of digital multimedia objects.
MODS = Metadata Object Description Schema, a derivative of MARC 21, for rich description of electronic resources.
MrSID = Multi-resolution Seamless Image Database, a powerful wavelet-based image compressor, viewer and file format for massive raster images that enables instantaneous viewing and manipulation of images locally and over networks while maintaining maximum image quality
NDLP = National Digital Library Program.
OAI = Open Archives Initiative.
OAICat = OCLC's open source framework http://www.oclc.org/research/software/oai/cat.htm
OAI/PMH = OAI protocol for metadata harvesting, enabling ability to harvest other digital archives and to be data providers for other digital archives so they may harvest metadata
OAIS = Open Archival Information System, a conceptual framework for an archival system dedicated to preserving and maintaining access to digital information over the long term.
OCW = Open CourseWare at MIT
OKI = Open Knowledge Initiative at MIT.
ONIX = Online Information Exchange, an XML-based metadata scheme developed by publishers for online book sales, which includes elements for evaluative and promotional information on books (reviews, blurbs, book jackets, etc.).
PURL = persistent URL
RDF data = Resource Description Framework - a metadata standard. A formal data model from the World Wide Web Consortium (W3C) for machine understandable metadata used to provide standard descriptions of web resources. It uses eXtensible Markup Language (XML). It is similar in intent to the Dublin Core, although perhaps broader in its scope and purpose. See the W3C RDF Page for further information.
RSS = RDF Site Summary. A lightweight multipurpose extensible metadata description and syndication format. RSS is an XML application, conforms to the W3C's RDF specification and is extensible via XML-namespace and/or RDF based modularization.
Schema = sets of metadata elements designed for a specific purpose, such as describing a particular type of information resource. There are schemas for describing web resources, electronic texts, digital objects, finding aids, learning objects, visual objects, multimedia, datasets, and so forth. Examples are Dublin Core, TEI, METS, MODS, EAD, indecs, ONIX, CDWA, VRA, MPEG-7, CSDGM, and DDI.
SFX = an OpenURL link server from Ex Libris http://www.exlibrisgroup.com/sfx.htm
TEI = Text Encoding Initiative, developed guidelines for marking up electronic texts.
tuples = qualifiers and modifiers in a metadata structure
Unicode = provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language. See http://www.unicode.org/
URI = uniform resource identifier
URL = uniform resource locator, a type of URI
URN = uniform resource name, a type of URI.
VRA Core Categories = metadata schema whose elements describe a work of art as well as visual representations thereof.
WebChoir = thesaurus management tool. See www.webchoir.com
Wrappers = environment where a search is done, and then links to other content are automatically available within the original interface, unlike federated searching. See http://www.njit.edu/publicinfo/press_releases/release_544.php
XML = Extensible Markup Language for describing data. Works with HTML to display data. XHTML may be the successor to HTML and the future language of the Web. See http://www.w3schools.com/xml/xml_whatis.asp .
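
A note on the Crosswalks entry above: in its simplest form a crosswalk is just a lookup table from one schema's field names to another's. The Python sketch below is only an illustration under stated assumptions. The local field names (`photographer`, `caption`, and so on) and the sample record are invented; the target names are real simple Dublin Core elements.

```python
# Hypothetical local field names mapped to simple Dublin Core elements.
CROSSWALK = {
    "photographer": "creator",
    "caption": "title",
    "date_taken": "date",
    "neighborhood": "coverage",
}


def apply_crosswalk(record, crosswalk):
    """Translate a record's fields; return (mapped, unmapped) dictionaries."""
    mapped = {}
    unmapped = {}
    for field, value in record.items():
        target = crosswalk.get(field)
        if target is None:
            # No equivalent in the target schema: keep it aside for review.
            unmapped[field] = value
        else:
            # Dublin Core elements are repeatable, so collect values in lists.
            mapped.setdefault(target, []).append(value)
    return mapped, unmapped


# Invented sample record in the hypothetical local schema.
local_record = {
    "photographer": "Unknown",
    "caption": "Orange grove near Pasadena",
    "date_taken": "1910",
    "negative_number": "A-1234",
}

mapped, unmapped = apply_crosswalk(local_record, CROSSWALK)
print(mapped)
print(unmapped)
```

Fields with no counterpart in the target schema are the interesting part of any real crosswalk, which is why the sketch returns them separately instead of dropping them.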

DSpace in Detail

Today I focused entirely on DSpace at MIT. Did I ever explain what DSpace is? They tell it better at www.dspace.org . But it's an institutional repository (IR) that stores and makes available the intellectual output of a faculty in digital format. I read great articles telling all about it, and also documentation at http://dspace.org/technology/system-docs/, which helps to understand all those nasty little acronyms. I think I'll create an entry just for acronyms so I can keep track. Some of the articles and documentation were written by friend Margret Branschofsky (formerly Lippert).

I found a good treatise on Metadata - referred from the MIT site - produced at the Getty Museum. Interestingly, my "supervisor" at USC will be Barbara Shepard, who went to USC from the Getty. If you want to know about metadata, visit
http://www.getty.edu/research/conducting_research/standards/intrometadata/index.html !

It's time to leave this nice library with the view of Betsie Lake and sound of gulls outside the open windows. I've been here 5 hours, counting a lunch break on the bench outside.

Weekend break

There are many distractions here in northern lower Michigan. Try an art tour, taking you to painting, weaving, tile, pottery and zany sculpture (from junk) studios in town and in the wetlands. Try a bike ride to an historic lighthouse. Try dinner at neighbors with a fabulous view of Lake Michigan. Not to mention running in early morning mist or walking in the woods, watching for cougars....

Friday, October 08, 2004

Project ideas

Engineering standards with Linda Musser
Institutional repository for UC Engineering College, a la DSpace
Joseph B. Strauss (a Cincinnati native and UC graduate) and Golden Gate Bridge archive with Professor Baseheart
Photo gallery of scientists and engineers w/ Mary Schlembach
Build subject collection (in engineering) together with other institutions, all open-archive (OAI) based

Fruitful links

Cliff Lynch on institutional repositories http://www.arl.org/newsltr/226/ir.html
Cornell tutorial on digital imaging http://www.library.cornell.edu/preservation/tutorial/contents.html
Creative Commons and federated repositories http://creativecommons.org
DSpace at MIT www.dspace.org
DSpace System Documentation http://dspace.org/technology/system-docs/index.html
The DSpace technology http://dspace.org/technology/system-docs/
Dublin Core http://dublincore.org/
Dublin Core in XML implementation http://dublincore.org/documents/2002/12/02/dc-xml-guidelines/
Harvard policy for Digital Repository Service http://hul.harvard.edu/ois/systems/drs/policyguide.html
Intro to Metadata http://www.getty.edu/research/conducting_research/standards/intrometadata/index.html
Metadata in a nutshell http://www.ukoln.ac.uk/metadata/publications/nutshell/
Metadata Glossary http://www.ukoln.ac.uk/metadata/glossary/
http://www.getty.edu/research/conducting_research/standards/intrometadata/4_glossary/index.html
NINCH http://www.nyu.edu/its/humanities//ninchguide/
NISO guidance for digital collections http://www.niso.org/framework/forumframework.html
OAIster at U. of Michigan http://www.oaister.org/o/oaister/
Open Archives metadata harvesting protocol http://www.openarchives.org/
Open Courseware Project at MIT www.ocw.mit.edu
USC digital archive primer http://isd.usc.edu/isd/services/dim/dap/index.html
USC imaging recommended practices http://isd.usc.edu/isd/services/dim/imaging/documents/DIM%20Imaging%20Stds.pdf

How not to start!

Well, progress has been slow.
Emails have been taking 48 hours to get from Michigan to Cincinnati.
My car broke down three days ago (water pump, timing belt). I won't have it for another week. Wouldn't you know it just went out of warranty last month.
I could use my bike but it's raining today. Finally got a loaner car.

After all the interruption, today I made it back to the little public library with free wireless internet access. I am looking at background articles, on both digital archives (DAs) and institutional repositories (IRs).

I've read some on MIT's DSpace from their site at http://www.dspace.org .
Clifford Lynch wrote a great report on IRs available at http://www.arl.org/newsltr/226/ir.html .
If I can figure out how, I'll create a separate folder with good links in it.
I might also create one with ideas of projects I might do myself.






Tuesday, October 05, 2004

The beginning

Yesterday my sabbatical officially began. I drove to a cottage in Michigan to remove myself from the normal routine and focus on my sabbatical work. It's cold here! The public library will definitely be the place to work.

The public library has a wireless access point. That enables me to create this blog, as well as dig into my sabbatical topic. I plan to lay out the many facets of digital archives and learn as much as I can before my sojourn to Los Angeles and USC.

Facets like:
what types of archives are other academic institutions putting up
what platforms are they using
what are their interfaces like
how do they handle copyrighted material
how do they organize the material
do they use federated searching of some kind across various archives
what sorts of metadata do they use
what exactly has MIT got in DSpace thus far
what are other technical institutions doing?
how is USC using Documentum, a product being used by OhioLINK
who has the best models for my needs?

I am sure I can glean a lot by viewing web sites for many schools. It's delightful to sit here by the library window with a view of Lake Michigan and search the web in pursuit of answers to these many questions.