A timeline of recent posts from UKOLN blogs.
Created by briankelly on Jul 9, 2008
Last updated: 07/11/10 at 10:03 PM
I’ve been at the excellent JISC CNI Meeting in Edinburgh these last two days. Lots of interesting work was described, and I met some great new people. Some people have asked me to post my slides, so here they are:
JISC CNI Meeting, Edinburgh 2010
View more presentations from Paul Walk.
http://blog.paulwalk.net/2010/07/02/jisc-cni-meeting-edinburgh/
In case you missed it, the OR10 Developers Challenge is now live!
Andy McGregor has explained why he thinks you should enter the challenge and, I’m pleased to say, there have been some expressions of intent already. If you do decide to enter, please register your intention on the OR10 Crowdvine forum.
A reminder of the challenge:
Create a functioning repository user-interface, presenting a single metadata record which includes as many automatically created, useful links to related external content as possible.
We had one comment suggesting that the challenge was limited to dealing with Linked Data – this is certainly not the case – we are interested in linking in its broader sense.
http://blog.paulwalk.net/2010/05/25/or10-challenge/
Please note that what follows is a draft.
A few weeks ago I posted some thoughts about a Developer Challenge for OR10, with a plea for ideas for specific challenges. I’m pleased to say that this post got a really good response, with plenty of useful ideas and comments. Thank you to all who responded. I think it fair to say that all of the comments influenced our thinking, but the interest in linking content (most fully expressed by Andy Powell) stood out from several comments, so we have concentrated on trying to create a challenge around this. While linked data was mentioned often (naturally enough), we wanted to stick to our principle of involving non-developers (or users) as much as possible: this can be difficult when dealing with the more esoteric aspects of linked data. So, after some discussion within the DevCSI team, we have worked up the following challenge:
Create a functioning repository user-interface, presenting a single metadata record which includes as many automatically created, useful links to related external content as possible.
Definitions:
“functioning” in this sense means that mockups/screenshots are not sufficient – however a working prototype is OK
“related” in this sense means that the external content is related to this particular metadata record in some way.
“as many useful links” means that marks will be awarded for useful links, so an interface with fifty meaningless links does not beat one with three genuinely useful links!
links must be related to content, not just a system. So, for example, a link to the page at http://www.wikipedia.org is not legitimate, but a link to a specific page in Wikipedia could be. Only one link of each ‘type’ counts: i.e. having four links to URLs which reference ‘topics’ in a given system is fine but will count as one link for the challenge.
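The ‘one link per type’ rule above can be read as a small counting exercise. The following Python sketch is only an illustration, not part of the official rules: the Link class, its link_type and useful fields, and the crude test for a link to a system rather than to content (a URL with no path and no query) are all assumptions, and usefulness would in practice be judged by people rather than computed.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class Link:
    url: str
    link_type: str  # hypothetical label such as "topic" or "author"
    useful: bool    # assumed to come from a human judgement


def points_at_content(url):
    """Crude proxy: a bare site address has no path and no query."""
    parts = urlparse(url)
    return parts.path not in ("", "/") or bool(parts.query)


def countable_links(links):
    """Keep the first useful, content-level link of each type."""
    first_of_type = {}
    for link in links:
        if link.useful and points_at_content(link.url):
            first_of_type.setdefault(link.link_type, link)
    return list(first_of_type.values())


links = [
    Link("http://www.wikipedia.org", "encyclopedia", True),
    Link("http://en.wikipedia.org/wiki/Metadata", "topic", True),
    Link("http://en.wikipedia.org/wiki/Dublin_Core", "topic", True),
    Link("http://example.org/unrelated", "misc", False),
]
print(len(countable_links(links)))
```

In this made-up example only one link counts: the bare Wikipedia address points at a system, the two ‘topic’ links share a type, and the last link was not judged useful.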
Rules:
Entries must come from a team of at least one developer and one person representing users. The entries must be presented, in person, at OR10. If a team is responsible for the entry then not all of the team members need be present at OR10, but at least one team-member must be.
Judging:
The entries will be presented/demonstrated at OR10 in a show and tell session in a room dedicated for this. The show and tell will be open to OR10 delegates to come along and see the presentations as they are being made. These presentations/demonstrations will be video-recorded. There will be an opportunity for those delegates present (the ‘audience’) to ask questions and/or comment on the presentations. There will be a panel of judges who will observe and make notes. The judges will take note of the responses from the audience. Following the show and tell, the judges will privately discuss the entries and draw up a shortlist. The videos of the shortlisted entries will be presented at the conference dinner for the assembled delegates to vote a winner and a runner-up.
The judges will particularly take into account the following:
functionality – the links must work and must have been created automatically as part of the repository system
usefulness – the usefulness of the links to an end-user of the developed interface must be demonstrated
number of links – the number and variety of links will be considered
audience reaction – favourable and unfavourable reactions from the audience will be taken into account
General points:
The Challenge will be issued well in advance of the conference, giving people plenty of time to develop an entry. We will make facilities available at OR10, such as a Developers’ Lounge area, for further work to be done at the conference itself.
We are very interested in any comments people may have about this – we intend to publish the final version of this, and open up the Developer Challenge, at the end of this week.
http://blog.paulwalk.net/2010/04/26/draft-or10-challenge-idea/
Update: I have closed comments on this post now. Thank you very much to all who commented and suggested ideas for a challenge. I have now posted a draft Challenge here and would welcome comments on that post. Thanks again!
Through the JISC-funded DevCSI project, UKOLN has been asked to arrange a ‘Developer Challenge’ for the Fifth International Conference on Open Repositories, (OR10) to be held in Madrid in July of this year.
This will be the third consecutive year that the Developer Challenge has been a feature of this conference. Previous challenges have been both competitive and creative.
Photo by Graham Triggs
This year we have been considering doing something slightly different. Previously, a general challenge has been issued, inviting developers to submit prototypes for anything which they feel is relevant and useful to the repository community. But now that the community has a better appreciation of the sort of creativity which developers can bring to these events, we wonder if we might try something a little different.
A general challenge?
We have been thinking about the possibility of the repository community issuing a particular challenge to the developers planning to attend OR10. This could be decided on by the community well in advance of the conference. If we managed to ‘crowd source’ a few ideas, we could organise a simple vote. Something we are trying to do more with the DevCSI project is to get developers together with non-developers from the same ‘domain’ (repositories in this case) – so we are quite interested in pursuing this approach with OR10.
The OR10 organisers have helpfully couched the conference itself in terms of some challenges:
In a world of increasingly dispersed and modularized digital services and content, it remains a grand challenge for the future to cross the borders between diverse poles:
the web and the repository
knowledge and technology
wild and curated content
linked and isolated data
disciplinary and institutional systems
scholars and service providers
ad-hoc and long-term access
ubiquitous and personalized environments
the cloud and the desktop.
Perhaps one or more of these could serve as the inspiration for a more concrete developers challenge?
What this boils down to is finding a challenge in the general area of repositories, recognised as important by the community generally, which could only be met by getting developers to work with non-developers at the conference. For it to be fair, the challenge would need to be non-specific with regard to any particular repository software.
I would welcome some feedback:
is this general approach a good idea?
do you have any ideas for a challenge?
please feel free to comment here if you have any ideas, or alternatively drop me an email at p.walk@ukoln.ac.uk
Thanks!
http://blog.paulwalk.net/2010/03/23/ideas-for-the-or10-developer-challenge/
There is a lot of momentum these days behind making data accessible, available, and re-usable. Increasingly people want open data; Science Commons have been recommending using CC0 to make the fully open status of data clear. More recently the Panton Principles start: “Science is based on building on, reusing and openly criticising the published body of scientific knowledge. For science to effectively function, and for society to reap the full benefits from scientific endeavours, it is crucial that science data be made open.”
We’ve been big fans of Open Access at the DCC since its early days. We use a Creative Commons licence for our content by default. This blog was one of the earliest to be specific about a Creative Commons licence not only for the core text that we write, but also for the comments that you might add here. So we strongly support the Open Data approach… where possible.
For of course in some areas of science and research, there are data that cannot be open. Usually this is because the data are sensitive. They could be personal data, protected under Data Protection laws. Sensitive personal data (such as medical record data) has extra requirements under those laws. They could be financial microdata, commercially sensitive. Or perhaps data with strong commercial exploitation potential. They could be anthropological data, sensitive through cultural requirements. Research needs to go anywhere, whatever the issues; we can’t be constrained to only research where the data can be open.
So perhaps it’s as simple as that: some science should have open data, and some should have closed data? Well, maybe not. Because the underlying issue of the Panton Principles must still apply. Research should be verifiable, whether through repeatable experiments or through re-analysable data. Unverifiable research is, well, unreliable, perhaps indistinguishable from fraud. Some access is needed; perhaps we should think of even sensitive data as Less Open Data rather than closed data.
So how do you go about dealing with sensitive data? Keep it secure, transfer securely, provide access under strict licences and controls in data enclaves, aggregate, de-identify, anonymise: there are plenty of tricks in the book. That’s the topic of the 4th Research Data Management Forum starting tomorrow in Manchester. I hope to have more to write about what we learn later.
http://digitalcuration.blogspot.com/2010/03/when-data-shouldnt-be-open.html
As the DCC begins its third phase today, I am delighted to announce the appointment of our new Director, Kevin Ashley, who will succeed me upon my retirement in April 2010.
Kevin Ashley has been Head of Digital Archives at the University of London Computer Centre (ULCC) since 1997, during which time his multi-disciplinary group has provided services related to the preservation and reusability of digital resources on behalf of other organisations, as well as conducting research, development and training. The group has operated the National Digital Archive of Datasets for The National Archives of the UK for over twelve years, delivering customised digital repository services to a range of organisations. As a member of the JISC's Infrastructure and Resources Committee, the Advisory Council for ERPANET, plus several advisory boards for data and archives projects and services, Kevin has contributed widely to the research information community. As a firm and trusted proponent of the DCC we look forward to his energetic leadership in this new phase of our evolution.
So far so press release. But I'd go further. I can't tell you how pleased I am with this appointment. As some readers will know, I have personally lobbied all and any potential candidates for this post since before I officially announced I was leaving. I understand we had some excellent candidates (I wasn't directly involved), more than one of whom might have made an excellent Director. But I'm particularly pleased at Kevin's appointment for several reasons: he is well engaged in the community (including good connections with JISC, our major funder), he's tough enough to keep this tricky collaboration thing going, he has an excellent technical understanding, and he has great experience of actually managing this stuff in all its crusty awfulness. I particularly remember his discussion (on a visit to the Edinburgh Informatics Database Group) about issues like how best to deal with an archived dataset where they came across the characters "five" in a field defined as numeric! You can make it work or make it a record but not both...
So congratulations Kevin, and good luck!
http://digitalcuration.blogspot.com/2010/03/dcc-new-phase-new-perspective-new.html
I wrote about RDF-encoding contact information a little earlier and had some very helpful comments. On reflection, and after exploring the “View Source” options for a couple of institutional contact pages, I’ve had some further thoughts.
- Contacts pages are rarely authored; they are nearly always created on the fly from an underlying database. This makes them natural for expressing in RDF (or microformats). It’s just a question of tweaking the way the HTML wrapper is assembled. Bath University’s Person Finder pages do encode their data in microformats.
- I wondered why more universities don’t encode their data in microformats or (even better) in RDF for Linked Data. One possible answer is that the contact pages were probably one of the earliest examples of constructing web pages from databases. It works, it ain’t broke, so they haven’t needed to fix it! If so, a reasonable case would need to be made for any change, but once made it would be comparatively cheap to carry out.
- A second problem is that it is not at all clear to me what the best encoding and vocabulary for institutional (or organisational unit) contact pages might be. So maybe it’s even less surprising that things have not changed.
To say I'm confused is putting it mildly! So what follows lists some of the options after further (but perhaps not complete) investigation...
One approach is the hCard microformat, based on the widely used vCard specification, RFC2426 (this is what Bath uses). That’s fine as far as it goes, but microformats don’t seem to fit directly in the Linked Data world. I’m no expert (clearly!), but in particular, microformats don’t use URIs for the names of things, and don’t use RDF. They appear useful for extracting information from a web page, but not much beyond that (I guess I stand to be corrected here!).
Looking at RDF-based encodings, there are options based on vCard, there are FOAF and SIOC (both really coming from a social networking view point), and there’s the Portable Contacts specification. Given that vCard is a standard for contact information, it would seem sensible to look for a vCard encoding in RDF. It turns out that there are two RDF encodings of vCard, one supposedly deprecated, and the other apparently unchanged since 2006. I now discover an activity to formalise a W3C approach in this area, with a draft submission to W3C edited by Renato Iannella and dating only from last December (2009), but I would need a W3C username and password to see the latest version, so I can't tell how it's going.
Someone asked me a while ago who sets the standards for Linked Data vocabularies. My response at the time was that the users did, by choosing which specification to adopt. At the time, FOAF seemed to have most mentions in this general area, and I rather assumed (see the previous post) that it would have the appropriate elements. However, the “Friend of a Friend” angle really does seem to dominate; this vocabulary does seem to be more about relationships, and to be lacking in some of the elements needed for a contacts page. I suspect this might have stemmed from a desire to stop people compromising their privacy in a spam-laden world, but those of us in public service posts often need to expose our contact details. FOAF does, however, have email as foaf:mbox, which apparently includes phone and fax as well, as you can see from the sample FOAF extract in my earlier post.
In a tweet Dan Brickley suggested: “We'll probably round out FOAF’s address book coverage to align with Portable Contacts spec”, so I had a look at the latter. The main web site didn’t answer, but Google’s cache provided me with a draft spec, which does appear to have the elements I need.
What elements do I need for a contact page? Roughly I would want some or all of:
- Name
- Job title/role in DCC (my virtual organisation)
- (Optional job title/role in home organisation)
- Organisational unit/Organisation
- Address/location
- Phone/fax numbers
- Email address
So what could I do if this information were expressed in RDF in the contact pages for a partner institution (say UKOLN at Bath)? Well, presumably the DCC contact pages would be based on a database showing the staff who work on the DCC, with the contact information directly extracted from the remote pages (either linked in real time or perhaps cached in some way). And if Bath changed their telephone numbers again, our contact details would remain up to date.
But more. Given that there are some staff members who have roles in several projects, it would be easy to see what the linkages were between the DCC and the other projects (eg RSP in the past, or I2S2 now). Part of the point of Linked Data (rather than microformats) is that one can reason with it; follow the edges of the great global graph…
And perhaps I would be able to find a simple app that extracts a vCard from the contact page to import into my Mac’s Address Book, which is where I started this search from! You wouldn’t think it would be hard, would you? I mean, this isn’t rocket science, surely?
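To make the idea of a database-driven contact page emitting RDF a little more concrete, here is a minimal Python sketch that turns one contact record into N-Triples using only the standard library. It is an illustration under stated assumptions, not a recommendation of a vocabulary: the mix of FOAF terms (foaf:Person, foaf:name, foaf:mbox and the separate foaf:phone property) with vcard:role from the 2006 vCard namespace is just one possible choice, and the person, the example.org URIs and the function names are all made up.

```python
FOAF = "http://xmlns.com/foaf/0.1/"
VCARD = "http://www.w3.org/2006/vcard/ns#"  # assumed choice of vCard namespace
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def uri(value):
    """Wrap a URI in angle brackets, as N-Triples requires."""
    return f"<{value}>"


def literal(text):
    """Quote a plain string literal, escaping backslashes and quotes."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def contact_to_ntriples(person_uri, contact):
    """Serialise one contact record (a dict) as N-Triples lines."""
    subject = uri(person_uri)
    triples = [
        (subject, uri(RDF_TYPE), uri(FOAF + "Person")),
        (subject, uri(FOAF + "name"), literal(contact["name"])),
        (subject, uri(VCARD + "role"), literal(contact["role"])),
        (subject, uri(FOAF + "mbox"), uri("mailto:" + contact["email"])),
        (subject, uri(FOAF + "phone"), uri("tel:" + contact["phone"])),
    ]
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


# A hypothetical record, as it might come out of a staff database.
record = {
    "name": "A. N. Example",
    "role": "Community Development Officer",
    "email": "a.n.example@example.org",
    "phone": "+44-1234-567890",
}
print(contact_to_ntriples("http://example.org/staff/an-example#me", record))
```

Because the person has a URI of their own, a second organisation's contact page could refer to the same URI rather than copying the phone number, which is the point about staying up to date made above.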
http://digitalcuration.blogspot.com/2010/02/more-on-contact-pages-and-linked-data.html
I had a very enjoyable day yesterday helping EDINA celebrate 10 years of the Digimap service. What began as an eLib project and experiment with 6 Universities in 1996 has grown to a mature service with over 100,000 users, 45,000 of them active, in pretty much every UK University, and soon in UK schools as well.
In 1996 I was Programme Director of the eLib Programme, and my earliest email about Digimap was from the JISC money man, Dave Cook, on 30 January 1996 to Peter Burnhill of the Edinburgh Data Library (as it then was). Dave told Peter we were interested in his idea (for an Images project!) but had a few concerns (that the Ordnance Survey might not agree to let us use their mapping data; it’s hard to remember now how difficult some of those 1990s persuasions were!). Three days later, Dave was offering real money, although it had to be spent by 20 March that year. Done!
By late 1997 the Digimap project (*) had a trial service; I remember experimenting with it and having some problems (this was with Netscape 3 on a PowerMac Duo or something like that; woefully under-powered in retrospect). By the end of 1999, they were moving to a new GIS system, and we were beginning to discuss turning Digimap into a service, and that went live in January 2000. They had to get 37 subscribing Universities by a particular deadline, and I think managed 39 somewhat earlier.
Since then the service has grown in scope, quality, usage and value. In my personal opinion (full disclosure, I’m not neutral here, having been associated with it through advisory groups of various kinds throughout its life), Digimap is the best service funded by JISC. Best in quality, best in professionalism, best in innovation, best in support. A lot of people deserve credit for that, and everyone at EDINA should be extremely proud of what they have created. By the way, the OS have managed some major shifts in attitude over the years, from suspicious tolerance through to strong support, and the success is partly down to them, and to the efforts of the negotiators in what is now JISC Collections.
As well as various forms of OS mapping for GB (whose trademark names always escape me... and it is GB rather than UK, for weird historical reasons), Digimap now offers 4 “epochs” of historic maps from Landmark, plus Geology maps from BGS and Marine maps from SeaZone. Due to licence restrictions it is only available to registered staff and students at subscribing UK institutions, but I hope that those of you unlucky enough not to fall in that category can soon read more about it on the pages to be put up related to the celebration.
Digimap has been a bit clunky at times compared with the innovations introduced by some others, but with the new underlying GIS, the interfaces are being upgraded; they now have “slippy maps” (called Digimap Roam) on the base service, and it looks really smart and much more functional. It's tough for a small group to keep up with the likes of Google, Yahoo and MS! Soon this slippy map interface will be extended to the Historic service (“Ancient Roam”?), Geology (“Rock’n Roam”?) and Marine (a rather dull “H2Roam”!)… I think those might be internal names, but if you can complete the set with an even punnier marine name, who knows, they might keep them!
The day was good fun, and we heard quite a bit about what Digimap is and how it is being used (far more widely than geography departments). The most exciting was a student project using Digimap and a GPS for a light aircraft CFIT-avoidance system (CFIT is Controlled Flight Into Terrain, referred to as “having a bad day”!). We heard from the data suppliers, with a bit more about what’s coming. It was interesting to hear the OS man talking about moves towards Linked Data; I wasn’t sure how that would square with the closed access, but I think I muddled my question (confused Linked Data with OGC web services, I suspect). The service providers didn’t appear to be talking to each other about Linked Data, which might be a good start.
A highlight was the closing keynote from Vanessa Lawrence, CEO of OS, clearly extremely supportive of OS. Choosing her words very carefully (she is not allowed to influence anyone) she outlined the government’s open data initiative and the consultation on its implications for the OS; this consultation closes late March 2010, but she urged us to make any responses, whether collectively or as private citizens, well before then. The consultation isn’t simply “should we open up access to OS data?”, it’s much more “how can we open up access to OS data and still sustain the quality of the data into the future”.
The celebration ended with a reception and dinner, with an amusing after-dinner talk by Michael Parker, author of Map Addict. All in all, a very enjoyable and worthwhile day to celebrate a significant anniversary.
PS the twitter tag is #digimap10; I’m not going to tag the post with it, as I’ve got far too many one-time tags that are a pain to manage…
PPS (*) Unfortunately the original Digimap project pages seem to have vanished, and the earliest Wayback Machine gathers appear to be faulty; the first successful gather I can find is http://web.archive.org/web/20011021051021/edina.ac.uk/digimap/... which seems to refer to the service, not the project.
http://digitalcuration.blogspot.com/2010/01/digimap-is-10.html
I'm quite interested in the idea of Scholarly HTML, as espoused in Pete Sefton's blog, and I've commented on some of Peter Murray Rust's hamburger PDF comments previously (although I do think a lot of people confuse wild PDF with well-made, should one say Scholarly PDF). I've always been slightly worried by one thing though.
A well-known advantage of PDF is that it pretty much assures I can save a document, share it, move it around etc and it will still be intact and readable. That's one of the reasons it's so popular.
Mostly we don't do that with HTML. Mostly we just point to it. But if I see an article these days, I want it on my computer if I'm allowed; this lets me study it at leisure, drop it in my Mendeley system, etc. As pointed out, that works a treat with PDF, and pretty well with Word or OpenOffice documents as well. This applies even where the document is quite heavily compound, with many embedded images, tables etc.
But if I try saving an HTML document to my hard disk, nothing very standard happens. OK, if I use Safari on my Mac, I get a .webarchive file, which is quite nice as I can do all the things with it that I could do with a PDF and Word etc, and when I open it later it will be as it was before, with all the images in place. But neither IE nor Firefox seems capable of opening a .webarchive file.
If I try saving the same article from Firefox, I get a .html file with the main article in it, and a directory with associated files in it (eg images). Safari does seem capable of opening this combination, but it's pretty ugly, and hard to move around. I haven't tried IE as I don't have easy access to it.
Is there in existence or development a standard approach to packaging the HTML and associated files that would be as convenient as the .webarchive, but usable across all browsers? If so, Scholarly HTML would be that little bit closer!
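One partial workaround, which is not the cross-browser packaging standard asked about above, is to fold the associated files into the HTML itself as data: URIs, so that a saved article travels as a single file. The Python sketch below assumes the page and its directory of images have already been saved locally, in the way Firefox does; the function name and the simple regular expression are illustrative assumptions, and it only handles double-quoted src attributes.

```python
import base64
import mimetypes
import re
from pathlib import Path

# Matches double-quoted src attributes only; a real tool would parse the HTML.
SRC_ATTR = re.compile(r'src="([^"]+)"')


def inline_images(html, base_dir):
    """Replace local src references with base64 data: URIs."""
    base_dir = Path(base_dir)

    def to_data_uri(match):
        ref = match.group(1)
        # Leave remote or already-inlined resources untouched.
        if "://" in ref or ref.startswith("data:"):
            return match.group(0)
        target = base_dir / ref
        # Leave references to files we do not have untouched as well.
        if not target.is_file():
            return match.group(0)
        mime = mimetypes.guess_type(target.name)[0] or "application/octet-stream"
        payload = base64.b64encode(target.read_bytes()).decode("ascii")
        return f'src="data:{mime};base64,{payload}"'

    return SRC_ATTR.sub(to_data_uri, html)
```

The result is one larger .html file that needs no accompanying directory, at the cost of roughly a third more bytes for each embedded file; stylesheets and other linked resources are not covered by this sketch.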
http://digitalcuration.blogspot.com/2010/01/scholarly-html-would-be-nice-but.html
I had a chat before Christmas with Henry Thompson, who works both in Edinburgh Informatics and also on the W3C TAG. Insofar as the Internet is important in sustaining long term access to information in digital form, there is a sustainability problem that we rather seem to have ignored. Everything on the Internet (literally) depends on domain names, and these are only ever rented. There is no mechanism for permanently reserving a domain name. Domain names can be lost by mistake (overlooking a bill, perhaps having moved in the interim and not informed the relevant domain name registrar), but they can also be lost on business failure. Although domain names can be a business asset, I understand that the registrars have some discretion on transfers, and in particular one cannot make a "domain name will" seeking transfer of the domain name to some benevolent organisation. Note that the mechanism for renting domain names has its own sustainability advantage: it funds important services that underpin the DNS.
There are two kinds of problem, one on a massive scale and one more fine-grained. The massive problem is that the entire infrastructure of the Internet depends on URIs, most of which are http URIs that in turn depend on the domain name system. So there are a number of organisations whose domain names are embedded in that infrastructure in a way and to an extent that is very difficult to change. W3C is clearly such an organisation. Many of these organisations seem rather fragile (not a comment on W3C, by the way, although its sustainability model is opaque to me). Should they fail and the domain names disappear, the relevant URIs will cease to work and various pieces of Internet machinery will fall apart.
(By the way, this does seem to be one case where a persistent ID that is independent of the original domain, such as a DOI, has advantages over an HTTP URI plus a redirect table. If the domain name no longer exists, you can't get to a redirect, whereas someone can still relink the DOI to a new location.)
On the more fine-grained scale, many documents (particularly in HTML) are not easily separable from their location, depending on other local files and documents. In addition of course, documents in some sense exist through their citations or bookmarks, that begin to exist separately from the document. Moving a document to a new domain can make it "fail" or disappear. So sustainability is linked to the domain as well as the other preservation factors.
This seems to me to be not at all a technical problem, but it seems to have legal/regulatory, governance, social, business and economic aspects.
Among the solutions might be creating a new top level domain designed for persistence, with different rules of succession, etc. Another (either instead of or in conjunction with the first) might be creating an organisation designed for persistence, to hold endowed domain names. Somehow the ongoing revenue stream for those underpinning services must be retained indefinitely into the future.
We don't think we have the answers, but we do think there is a problem here; I'm not yet sure if we have articulated it accurately at all. I would appreciate any comments.
Thanks,
http://digitalcuration.blogspot.com/2010/01/persistence-of-domain-names.html
I have been asked to provide a position paper for next week’s Future of Interoperability Standards meeting hosted by CETIS. This blog post is one I have been meaning to write for ages so I’m offering it as a position paper of sorts.
UKOLN has been charged by JISC with the task of supporting the development of Dublin Core Application Profiles (DCAPs) in a number of areas. While I have not (so far) had much direct involvement in this work I have developed, over the last year or so, a real interest in the process of developing these.
The development of DCAPs is governed through the application of the Singapore Framework for Dublin Core Application Profiles. In this document, the concept of the application profile is explained thus:
The term profile is widely used to refer to a document that describes how standards or specifications are deployed to support the requirements of a particular application, function, community, or context. In the metadata community, the term application profile has been applied to describe the tailoring of standards for specific applications.
The requirements for an application profile to be legitimately termed a Dublin Core Application Profile are defined within the Singapore Framework. In brief, a DCAP is a “packet of documentation” which includes the following elements:
Functional requirements (mandatory)
Domain model (mandatory)
Description Set Profile (DSP) (mandatory)
Usage guidelines (optional)
Encoding syntax guidelines (optional)
This seems mostly sensible although I have not been party to much of the discussion around the Singapore Framework and so I have never entirely appreciated the purpose of or need for the Description Set Profile (DSP). In passing I will note that it seems to me that the DSP could be optional rather than mandatory, and that the Usage Guidelines should be mandatory rather than optional.
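The “packet of documentation” described above, with its mandatory and optional parts, can be pictured as a simple completeness check. The Python sketch below is only an illustration: the component keys, the dictionary representation and the file names are assumptions, not anything defined by the Singapore Framework. Passing a different tuple as mandatory shows what the alternative suggested above (Usage Guidelines mandatory, DSP optional) would mean for the same draft.

```python
# Component keys are illustrative names for the five parts listed above.
MANDATORY = ("functional_requirements", "domain_model", "description_set_profile")
OPTIONAL = ("usage_guidelines", "encoding_syntax_guidelines")


def missing_components(packet, mandatory=MANDATORY):
    """Return the mandatory components that are absent or empty."""
    return [name for name in mandatory if not packet.get(name)]


# A hypothetical draft profile with no Description Set Profile yet.
draft = {
    "functional_requirements": "requirements.txt",
    "domain_model": "model.png",
    "usage_guidelines": "guidelines.txt",
}

print(missing_components(draft))
print(missing_components(
    draft,
    mandatory=("functional_requirements", "domain_model", "usage_guidelines"),
))
```

Under the framework's rules the draft is incomplete because it lacks a DSP; under the alternative rule it would pass.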
According to the Singapore Framework web page, “there are no stable, published examples of full-blown application profiles that conform to these guidelines”. With one exception, the Scholarly Works Application Profile (SWAP), it is difficult to find any examples of DCAPs which are close to being realised. SWAP was developed for the most part at UKOLN so I have an interest in seeing it adopted; however, to date we have seen no actual usage of this DCAP.
I come from a background of software and service development, rather than standards development. For this reason, the development of application profiles is more appealing to me than is standards development per se, as I expect to be able to apply my experience and skills more readily to work which is aimed at supporting “specific applications”. It is natural for me to measure success in terms of usage. This means that I take usability seriously, and tend to focus on users and their responses.
Early in 2009 I began to notice a few things about how DCAPs such as SWAP were expected to be developed. It seemed to me that usability was not a stated priority. As, I think, a consequence of this, there is little attention given to testing the usability of DCAPs within a context involving users and applications. It does seem that DCAPs are expected to be tested for conformance to the standard, for internal cohesion and logic in terms of the underlying information model, and even for theoretical satisfaction of functional requirements, but if the DCAP has not been tested for usability before it gets to this point then it is at high risk of failure. It was also apparent to me that users, even experts in the domain for which the DCAP was intended, might struggle to be able to appreciate, test or criticise the DCAP documented according to the Singapore Framework, unless they had relatively rare information management knowledge and understanding.
At UKOLN, I got together with some colleagues and proposed that we consider a more Agile approach to the development process. I use the term Agile in the sense in which it has been applied to software development in recent years. A key feature of Agile development is that it allows the development of not only the solution, but also the requirements, in a highly iterative process. Agile development tends to favour working solutions over future capabilities and encourages near-continuous engagement with users during the development process, responding to changes in functional requirements as both the developer and the user increase their understanding of the problem space. I wondered if we couldn’t devise some tools and techniques which would allow the early development stages of DCAP development to be done iteratively, with close engagement from prospective end-users. The following is a description of what we have developed so far.
An Agile approach
In order to focus on usability in the development of DCAPs, we realised that we would need to introduce a methodology which would allow us to frequently test what had been developed so far against user-requirements and understanding. Borrowing again from software development, we decided to adopt a rapid prototyping method, where we would give prospective end-users the means to quickly assemble information models which made sense to them in the context of their requirements. Some of our early experiments were in the domain of scholarly works because we have a particular interest there. Our method therefore relies on being able to assemble small groups of prospective users to participate actively in the development process.
We have observed an issue with users’ engagement with application profiles. Application profiles are, essentially, intangible – users cannot interact with them directly. For many users, this presents a very real barrier to engagement. Even if formal documentation such as Description Set Profiles (DSP) is developed during the development iterations, it takes a certain kind of user with a particular interest to engage with these. Many users need to see the sort of system interface which they will ultimately be using in order to contribute feedback on the development of an AP. We have developed two approaches to making DCAPs tangible: paper-prototyping and a flexible user interface tool for information modelling.
In early stages of requirements gathering, a paper-prototyping approach has shown real promise as an accessible method for eliciting requirements from groups of users. This has the advantage of being potentially very free-form, such that the developer’s unconscious influence on users’ contributions is reduced. Users are encouraged, collectively, to develop their own understanding and to model it. You can read about this in more detail in Emma Tonkin’s paper: Multilayered Paper Prototyping for User Concept Modelling: Supporting the Development of Application Profiles.
One limitation of paper prototyping comes from this very free-formedness: it is difficult to correlate the outcomes of a free-form modelling exercise with the outcomes of other similar exercises. For this reason, we have developed a second stage development tool which uses software to structure and, crucially, record, users’ engagement with the developing application profile.
Our software for allowing users to experiment with modelling their domain is MrVobi. Below you can view a short video of it being demonstrated on an interactive whiteboard:
Users are encouraged to use this tool to create and restructure entities and attributes through a user-friendly and intuitive interface. The user interface is connected to a web service which records every decision, and which can hold and serve up pre-recorded models so that users can start from an advanced position in a given session.
As we move users from the free-formed to the more structured interfaces, we can start to gain an important benefit. By recording the decisions that individuals make about the information model, we can aggregate these so that, theoretically, we can start to assign a level of confidence to the decisions which are eventually made about the application profile. For example, we can say something like “this attribute belongs with this other in this entity, and 71% of our test users from this domain agree with this”.
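The aggregation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not MrVobi's actual code: the record layout (`user`, `attribute`, `entity`), the function name `placement_confidence` and the sample data are all hypothetical, and it assumes that a user's most recent placement of an attribute is the one that counts.

```python
from collections import Counter


def placement_confidence(decisions, attribute):
    """Return (entity, share) for the entity most users placed `attribute` in.

    `decisions` is a chronological list of hypothetical records such as
    {"user": "u1", "attribute": "title", "entity": "Work"}.
    Only each user's latest decision for the attribute is counted.
    """
    latest = {}
    for decision in decisions:
        if decision["attribute"] == attribute:
            # A later decision by the same user overwrites an earlier one.
            latest[decision["user"]] = decision["entity"]
    if not latest:
        return None, 0.0
    votes = Counter(latest.values())
    entity, count = votes.most_common(1)[0]
    return entity, count / len(latest)


# Hypothetical session log: seven users place the "title" attribute.
decisions = [
    {"user": "u1", "attribute": "title", "entity": "Work"},  # later revised
    {"user": "u1", "attribute": "title", "entity": "Expression"},
    {"user": "u2", "attribute": "title", "entity": "Expression"},
    {"user": "u3", "attribute": "title", "entity": "Expression"},
    {"user": "u4", "attribute": "title", "entity": "Expression"},
    {"user": "u5", "attribute": "title", "entity": "Expression"},
    {"user": "u6", "attribute": "title", "entity": "Work"},
    {"user": "u7", "attribute": "title", "entity": "Work"},
]

entity, share = placement_confidence(decisions, "title")
print(f"'title' belongs in {entity}: {share:.0%} of test users agree")
```

With this sample data five of the seven users settle on the same entity, which gives the kind of figure quoted above. A real analysis would presumably also want to weight users by domain and to track how agreement changes between sessions.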
As an application profile becomes more developed, it can be presented to users for testing through this same interface. This means, importantly, that an application profile can be treated as something more dynamic. As a domain changes over time, with shifting aspirations, challenges and issues, so the application profile can be re-assessed in terms of its usability in a changing context.
A concern which we identified early in the development of these processes and tools was the fact that the tools influence the testing process: when a user gives feedback, they are to some extent commenting on the artificial interface as well as the application profile. The paper-prototyping partly mitigates this, as does the simple fact that we don’t rely on a single interface. Within the very real constraints of users’ patience and available time, the general approach is to introduce as many types of interface as the user can bear so that biases based on the idiosyncrasies of specific tools are gradually cancelled out.
To bring this back to the Singapore Framework: we believe that we are evolving an effective process to develop several parts of the ‘package’ – the functional requirements, the domain model, and the usage guidelines. We believe that if these are developed with frequent recourse to user-testing, then the resulting DCAP will be more robust, and more likely to be adopted. We think that we can build into the process an aspect of evidence gathering to allow us to make assertions about the resulting DCAP which are based on a certain degree of confidence.
This is very much a work in progress. We have experimented with the paper-prototyping approach with a number of different groups, and in more than one domain, with some very interesting results. We ran an interactive workshop at last year’s International Conference on Dublin Core and Metadata Applications using the MrVobi software which was very well received (this was informed by a presentation which is also a useful overview). We have received strong encouragement from the Dublin Core Metadata Initiative to continue to develop this approach and are now considering how we might take this work forward in 2010. Any comments are welcome.
Note
This work has been the result of collaboration within UKOLN. Special mention should be made of Emma Tonkin’s efforts which have been crucial in a number of aspects of this work. Others at UKOLN who have contributed are Andy Hewson, Talat Chaudhri, Mark Dewey, Stephanie Taylor, Julian Cheal and Tom Richards.
http://blog.paulwalk.net/2010/01/06/an-agile-approach-to-the-development-of-dublin-core-application-profiles/
My colleague Angus Whyte has provided the following brief summary of two surveys carried out in Phases 1 and 2 of the Digital Curation Centre, in 2006 and 2009 respectively, as part of our evaluations. In retrospect, we might have done better revising the questions for the second survey rather more than we did; nevertheless I thought it worth while sharing this with you.
Angus writes:
In 2009 DCC users were surveyed, repeating a similar survey carried out in 2006. In the highlights below we draw conclusions both from the more recent results and also changes over the 3 year period. Both surveys were publicised on the DCC website and via several mailing lists, principally the DCC-Associates and (in 2009) the JISC sponsored Research-Dataman list.
Our conclusions take into account that the online questionnaire was self-completed by a self-selected group of respondents (75 in 2009 and 125 in 2006). DCC Associates (640 approx.) provided the bulk of the responses[1]. The results indicated broad patterns, relatively wide differences and consistent responses over the two surveys, even though these are not taken to be statistically representative.
Highlights
In both surveys around 90% of respondents are familiar with the term ‘digital curation’ and regard it as a critical issue within their project or unit.
The DCC is consistently given as the main source of information on curation issues by around 70% of respondents, with “on the job challenges/ research” second at around 60%.
Between the two surveys there is a large jump (from 13% to 32%) in the number of respondents indicating that DCC has been “very effective” in raising awareness about digital curation, and those believing it to be “slightly effective” has correspondingly fallen from 53% to 31%.
Of a list of DCC resources, five are identified as “most helpful” by at least 1 in 5 of the 2009 survey respondents, these being (in descending order) the DCC website, Briefing Papers (of various sorts), the DCC Curation Lifecycle Model, Case Studies, and the Digital Curation Manual.
Respondents universally associate digital curation with “ensuring the long-term accessibility and re-usability of digital information”, and large majorities (around 90%) also relate it to “performing archiving activities on digital information such as selection, appraisal and retention” and “ensuring the authenticity, integrity and provenance of digital information are maintained over time”. Rather lower but still significant numbers (around 60%) associate digital curation with “managing digital information from its point of creation” and “managing risks to digital information” – although many more highlight the latter in 2009 (up to 84% from 61%).
Curation or preservation addresses risks to the respondents’ organisations with “loss of organisational memory” consistently topping their list (identified by around 75% of respondents) and “business risks” second, identified by just under half, again across both surveys.
More than two thirds indicate that their main reasons for curating and preserving digital information are its educational/research or historical value; in both years a minority cites other reasons.
Similarly, the main obstacles are indicated as financial or staff resources, with around half also indicating lack of awareness or appropriate policies.
For around 40% of respondents, management and preservation of digital information has an indefinite timescale. For a further 15% or so it is “beyond the life of the project/organisation”, and similar numbers indicate these are tasks “for the life of the project/organisation”.
The 2009 survey respondents are no strangers to the ‘data deluge’, most dealing with at least 100Gb and some (7%) more than 100Tb. Overall 79% expect this to increase in the next two years, surprisingly 3% do not, while 7% do not know. Most need to manage a mixture of open and proprietary formats, and report a wide variety of formats in use, predominantly common office applications, PDF documents and multimedia formats. Curation and preservation challenges are most frequently identified with obsolete proprietary formats. Image, video, and geospatial data are also often identified as challenges, as are web sites combining these. Respondents were also asked in 2009 about re-use, and around a third indicate that research data is re-used internally, with similar numbers offering data generated by their project/unit for re-use by others, or re-using external data.
Access issues facing research projects/units are identified in both surveys and along similar lines; intellectual property rights (e.g. copyright) is the most frequently cited issue, followed by “privacy or ethical issues”, however “embargo on research findings” is least prevalent, identified by only a fifth of respondents.
Asked about funding for curation and preservation, responses show no clear picture. Around half of 2009 respondents indicate funding is “accounted for in project or institutional budget”. A large minority have no explicit funding for curation and preservation, and where resources are available these are pooled from other funded areas (e.g. IT budget for project or organisation) or research grants. Spending on curation/preservation is less than £50,000 (for around half of those respondents who were aware of this). Around half are unsure whether spending will increase or decrease, with the remainder being evenly split.
Detailed questions and response data are available on request.
Angus Whyte, Digital Curation Centre
[1] The DCC Associates membership list includes UK data organisations, leading data curators, overseas and supranational standards agencies, and industrial/business communities. Currently research data creators are under-represented (information from registration details).
http://digitalcuration.blogspot.com/2010/01/digital-curation-centre-user-survey.html
I took advantage of an offer to upgrade my iPhone 3G to the 3Gs model just before Christmas. I spent some time considering the alternatives, and speculating about what might become available during the next eighteen months of my new contract, but I’ve been more than happy with the 3G so my decision was quite an easy one. The 3Gs offered three main improvements over the 3G:
a faster processor
a better camera
a ‘compass’
At first glance, these improvements seem quite modest. But, as we shall see, they add up to something quite significant.
The feature which attracted me most was the better camera. People talked about the paltry 2MP camera on the iPhone 3G but to be honest it wasn’t the resolution that was the problem – 2MP is actually adequate for the sorts of pictures I want to take with a pocket camera. The problem with the camera on the 3G was that it was just a rotten camera. I had a better camera in a Sony Clie PDA some five years ago. The camera on the 3Gs is, indeed, better than that on the previous model. It’s not great, but it is just about usable.
The surprise for me is the impact of the other two features. The faster processor was firmly in my ‘nice to have’ category – a welcome improvement but not especially important to me. Once I tried the new model however, I quickly realised what a difference this has actually made. With the previous model, I had attributed a lack of performance in certain applications to network latency. Essentially, I believed that a few apps were simply a little too advanced for the prevailing networks to serve them well. A good example of this was Evernote, an app which seemed promising but was just too sluggish on the 3G to be very useful to me. On the newest iPhone however, Evernote really flies, and network latency does not often impinge on its usability. Having a snappier user interface is always nice – but the 3Gs is so much more responsive as a result of its faster processor (and presumably its increased memory).
I had assumed the compass was, effectively, a gimmick. I could see how it would be occasionally useful to orient myself when using the GoogleMap application for example. But over Christmas I started to play with some of the many astronomy apps available for the iPhone. Several of these take advantage of the iPhone’s built-in GPS receiver and compass, allowing the screen to show the night-sky exactly as it appears to the user based on their location and the direction they are facing. This allowed me, for instance, to identify and point out Jupiter to my actually-quite-impressed-for-once family. Direction counts!
What the iPhone 3Gs offers to its applications is a sense of location and direction. Combined, these properties can afford a powerful new functionality.
During 2009 there was a little buzz about augmented reality, with apps such as Wikitude appearing for Android and iPhone, superimposing text and images over real-time views of the physical environment. While I try to avoid predictions for the new year, I’m confident that augmented reality apps will continue to develop, and will become more interesting, during 2010. All of the hardware ingredients – a fast processor, a decent camera, GPS + compass, are present in the iPhone 3Gs. I’m looking forward to what develops as a result.
Coincidentally, my good friend Peter just alerted me to an application called Star Walk. It doesn’t do anything that several other apps don’t also do, but it does it so beautifully. Like all Apple products, aesthetics count for much with the iPhone. When I fired up Star Walk I had a sudden thought – that the reality had just caught up with the aesthetics of mainstream science fiction. If you have an iPhone 3Gs, I recommend you spend the £1.19 for this application, if only to admire the way it looks.
Happy new year!
http://blog.paulwalk.net/2010/01/03/direction-counts/
If you saw tweets from @cardcc today, you might realise I’ve been very interested in a couple of recent developments in semantic publishing. I wrote earlier about linking data to journal articles, including David Shotton’s adventures in semantic publishing. David’s work was one of those included in the review article in the Biochemical Journal by Attwood, Kell, McDermott et al (2009). The article ranged over the place of ontologies and databases, science blogs, and various experiments. These included RSC and Project Prospect, The ChemSpider Journal of Chemistry, The FEBS Letters experiment, PubMed Central and BioLit, Shotton’s PLoS experiment, Elsevier Grand Challenge, Liquid Publications, and The semantic Biochemical Journal experiment. The latter was the real focus of the article, which is available in PDF but which, if read through a special reader called Utopia Documents, displayed some active capabilities. These included the ability to visualise and rotate 3-d images of proteins, to see tables represented as graphs (or vice versa) and to link to entries in nucleic acid databases. The capabilities were perhaps a bit awkward to spot and to manipulate, but still interesting. This article is (gold) open access. Other articles in the issue have also been instrumented in this way. It’s clearly early days for Utopia, and I wasn’t wholly impressed with it as a PDF reader, but I was certainly very excited at some of what I read and saw. I also read today a very different article (I think not available on open access), by Ruthensteiner and Hess (2008). They describe the processes in making 3-d models of biological specimens, and presenting them in PDF, readable by a standard Acrobat Reader. The 3-d capability was at least as good as if not better than the Utopia results. Because it’s getting late, I’ll end with my last tweet: “My head is spinning with semantic article possibilities. 
I hope some get picked up in new #jiscmrd proposals, see http://www.jisc.ac.uk/fundingopportunities/funding_calls/2009/12/1409researchdata.aspx”
Attwood, T. K., Kell, D. B., McDermott, P., Marsh, J., Pettifer, S. R., Thorne, D., et al. (2009). Calling International Rescue: knowledge lost in literature and data landslide! The Biochemical Journal, 424(3), 317-33. doi: 10.1042/BJ20091474.
Ruthensteiner, B., & Hess, M. (2008). Embedding 3D models of biological specimens in PDF publications. Microscopy Research and Technique, 71(11), 778-86. doi: 10.1002/jemt.20618. Pubmed abstract.
http://digitalcuration.blogspot.com/2009/12/more-activity-on-semantic-publishing.html
Timo Hannay presented a talk entitled “From Web 2.0 to the Global Database”, providing a publishing perspective on the need for cultural change in scientific communication.
Hannay took a step back to take a bigger picture view. He began by giving an overview of his work at Nature, noting that the majority of their business is through the web – although not everyone reads the work electronically, they do access the content through the web. He then explained how journals are becoming more structured, with links providing supplementary information. He admitted that this information is not yet structured enough, but it is there – making journals more like databases.
Hannay moved on to explain that Nature is getting involved in database publishing. They help to curate and peer-review database content and commission additional articles to give context to the data. This is a very different way of being a science publisher – so the change is not just for those doing the science!
After taking us through Jim Gray's four scientific paradigms, Hannay asked us to think back to a talk by Clay Shirky in 2001, which led to the idea that the defining characteristic of the computer age is not the devices, but the connections. If a device is not connected to the network, it hardly seems like a computer at all. This led Tim O'Reilly to develop the idea of the Internet Operating System, which morphed into the name “Web 2.0”. O'Reilly looked at the companies that survived and thrived after the dot com bubble and created a list of features which defined Web 2.0 companies, including the Long Tail, software as a service, peer-to-peer technologies, trust systems and emergent data, tagging and folksonomies, and “Data as the new 'Intel Inside'”: the idea that you can derive business benefit from powering data behind the scenes.
Whilst we have seen Web 2.0 affect science, science blogging hasn't really taken off as much as it could have done – particularly in the natural sciences – and is still not a mainstream activity. However, Hannay did note some of the long term changes we are seeing as a result of the web and the tools it brings: increasing specialisation, more information sharing, smaller 'minimum publishable unit', better attribution, merging of journals and databases – with journals providing more structure to databases – and new roles for librarians, publishers and others. Hannay asserted that these changes are leading, gradually, to a speeding up of discovery.
Hannay took us through some of the resources that are available on the web, from Wikipedia to PubChem and ChemSpider, where the data is structured and annotated through crowd sourcing to make the databases searchable and useable. He asserted that we are moving away from the cottage-industry model of science, with one person doing all the work in the process from designing the experiment to writing the paper. We are now seeing whole teams with specialisms collaborating across time and space in a more industrial-scale science. Different areas of science are at different stages with this.
Hannay referred to Chris Anderson's claim in Wired magazine that we no longer need theory. He rejected this, but did agree that more is different, so we will be seeing changes. He gave the example of Google, which didn't develop earlier in the history of the web simply because it was not necessary until the web reached a certain degree of scale for it to be useful.
Hannay believes that publishers have a role to play in helping to capture, structure and preserve data.
Journals are there to make information more readable for human beings, but they need to think about how they present information to help both humans and computers to search and access information, as both are now just as important.
All human knowledge is interconnected and the associations between facts are just as important as the facts themselves. As we reach that point when a computer not connected to the network is not really a computer, Hannay hopes we will reach a point where a fact not connected to other facts in a meaningful way will hardly be considered a fact. One link, one tag, one ID at a time, we are building a global database. This may be vast and messy and confusing, but it will be hugely valuable – like the internet itself as the global computer operating system.
http://digitalcuration.blogspot.com/2009/12/idcc-09-keynote-timo-hannay.html
Last week I outlined an idea, that of the service anti-pattern, as part of a presentation I gave to the Resource Discovery Taskforce (organised by JISC in partnership with RLUK). The idea seemed to really catch the interest of and resonate with several of those members of the taskforce who were present at the meeting. My presentation was in a style which does not translate well to being viewed in a standalone context (e.g. on Slideshare) so I have decided to write it up here. I would very much welcome comments on this. (The presentation will be published on the Resource Discovery Taskforce pages and I will ask for this post to be linked to from there when it does appear).
The following diagram is meant to represent a design ‘pattern’ which I have seen often proposed, and sometimes implemented, in the JISC Information Environment (IE) as well as in the wider higher education (HE) sector in general:
It is my belief that readers who have been involved with the IE for some time will recognise this, at least in a general sense, if not in specific cases. In this arrangement, an aggregation of data is presented to the end user, through the development of a user-facing application or service. The user-facing service will in almost all cases be a web-interface, somewhat similar to the ‘portal’ concept of old but in a centralised, single, global deployment. Because it is generally accepted to be desirable to make such data available to other services (in keeping with the larger goal of interoperability through open standards), one or more machine interfaces or so-called APIs, giving access to the ‘backend’ of the system, will be offered. What this design pattern aspires to is a service implemented to be both user-facing service and machine-facing infrastructure component.
However, I contend that this is, in fact, what software engineers might call an anti-pattern. An anti-pattern is a design approach which seems plausible and attractive but which has been shown, in practice, to be non-optimal or even counter-productive. It’s a pattern because it keeps coming up, which means it’s worth recording and documenting as such. It’s anti, because, in practice, it’s best avoided….
There is much which is implicit in this pattern, so I will attempt to surface what I believe are some hidden assumptions in a new version of this diagram: this is what this design pattern, once implemented, reveals:
In this second diagram, the orange colouring indicates the parts which actually get built and are supported; the yellow indicates the parts which might get built, but which won’t really be supported as a service – in a sense, this is stuff which is believed to work but actually doesn’t; in the case of the users, the yellow colouring indicates that their demand for this service is believed to exist; those components in the diagram which are neither orange, nor yellow, are the product of little more than speculation. In the end, the investment in creating a user-facing application based on an expectation of future demand which doesn’t materialise is wasted while, at the same time, the investment in providing unused machine interfaces is also wasted.
I believe that this design pattern rests on several assumptions which are actually fallacies, and is, therefore, an anti-pattern.
Fallacy 1: “Build it and they will come”:
While infrastructure services can, indeed should, be developed with future opportunity in mind, it is helpful to have an existing and real demand to satisfy, which the new development addresses. If the service is demonstrably useful to users, and is developed effectively with future opportunity in mind, then there is more chance of the service actually working, and of it being attractive to developers working on future opportunities.
Fallacy 2: Interoperability through additional machine interfaces:
Machine interfaces need as much specification, development, testing and maintenance as user-interfaces. Simply making a machine interface available through the adoption of a platform which has a built-in facility offering some standard interface is not enough. A system which proposes to offer three or four APIs is quite likely not going to support any of them adequately. I have argued before that ‘interoperability is not enough’: in fact, this arrangement does not often lead to interoperability, let alone actual exploitation of the capability to interoperate.
Fallacy 3: People/organisations who can make good infrastructure are also going to be good at building end-user-facing services (and vice versa):
Effective infrastructure supports services which in turn support end-users. The skills and knowledge required to support service-providers are generally quite different from those needed to deliver good user-facing services.
I call this the infrastructure service anti-pattern because the result comes from conflated requirements to deliver both infrastructure (machine-to-machine interfaces) and compelling user-facing services and applications. The result can be something which satisfies neither requirement. The users, requirements and priorities are often completely different between these two problem spaces. I suggest that the following are some possible reasons for this anti-pattern appearing:
funding (naturally) tends to follow services, happy users and, importantly, new features.
funders like to see their investment showcased
infrastructure is mostly invisible making it hard to ascertain impact from users
Proposals for alternative design patterns
Here is a suggested alternative design-pattern:
In this design pattern, the API is developed before any user-facing application, or at least in parallel. An application is developed to exploit this API based on real users’ requirements. No service is developed until such requirements can be identified. This means that an API will be developed, and it will be in use in at least one case. Opportunities for third party integration for usage of the service are, ideally, identified beforehand. The API is properly supported from the start, or else the service fails completely. The value proposition being offered for further, opportunistic third-party developments, whether real or imagined, is now real and, crucially, supported.
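To make the shape of this alternative concrete, here is a minimal Python sketch. Every name in it (`RecordAPI`, `get_record`, `render_record_page`, the sample record) is hypothetical and chosen purely for illustration; the point is only that the user-facing view has no access to the data except through the same machine interface that third parties would use.

```python
import json


class RecordAPI:
    """Hypothetical machine interface: the only route to the underlying data."""

    def __init__(self, records):
        self._records = records

    def get_record(self, record_id):
        """Return a JSON document for the record, or a JSON error object."""
        record = self._records.get(record_id)
        if record is None:
            return json.dumps({"error": "not found"})
        return json.dumps({"id": record_id, **record})


def render_record_page(api, record_id):
    """User-facing view, built purely on the API's JSON output."""
    data = json.loads(api.get_record(record_id))
    if "error" in data:
        return "Record not found"
    return f"{data['title']} ({data['creator']})"


# Hypothetical data store with a single record.
api = RecordAPI({"r1": {"title": "An example report", "creator": "A. Author"}})
print(render_record_page(api, "r1"))
```

Because the application depends on `get_record` for every page it renders, any breakage in the API shows up immediately in the user-facing service, which is one way of reading the claim that the API is properly supported from the start or the service fails completely.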
An interesting alternative to this is the approach of combining the user-facing web pages and the machine-actionable API into one interface, through embedded RDFa for example:
It remains to be seen how this approach is going to work out over time, but we have seen hints of simpler approaches to combining user and machine interfaces in the past, such as RSS being styled to give a decent human-readable interface, or earlier attempts to do interesting things with XHTML.
I wonder if readers agree that the first diagrams represent an anti-pattern which they recognise. And would the proposed alternatives fare any better?
http://blog.paulwalk.net/2009/12/07/an-infrastructure-service-anti-pattern/
The recordings of the main sessions from Day 2 of IDCC 09 are now available to view on Vimeo.
Please select from the links below to watch any sessions of interest, or any that you may have missed from the live stream of the event...
Best Peer-Reviewed Paper: “Multi-Scale Data Sharing in the Life Sciences - Some Lessons for Policy Makers” – Graham Pryor, University of Edinburgh
Keynote Address by Professor Ed Seidel, Associate Director, Directorate of Mathematical and Physical Sciences, National Science Foundation
Closing Keynote Address: Timo Hannay – Publishing Director, Nature.com, Nature Publishing Group
Closing Remarks: Chris Rusbridge - Director of the DCC
You can also find the complete list of session recordings, together with links, at the IDCC 09 NetVibes page where we are still gathering #idcc09 tweets and other feeds about the event. If you are blogging about the event, please remember to use the #idcc09 tag!
http://digitalcuration.blogspot.com/2009/12/catch-up-on-idcc-09-day-2.html
In summing up day one, Cliff Lynch observed how the focus of the discussion at these conferences has shifted over the last five years, harking back to the first meetings when there was more discussion about preservation rather than curation. Lynch noted that we are now beginning to understand that preservation has to be a supporting structure to curation, which is a more complex process – more deeply involved in the research process. One of the other trends he observed emerging is that of “re-use” of data. We are no longer just interested in preserving, but evaluating the prospects of re-use for data and improving those prospects, where possible, to derive greater value from our data. Lynch noted that there is a deepening linkage between the tools and workflows that researchers use, so data curation needs to be increasingly integrated, as this will help solve the problems of metadata, provenance and so on to make curation more effective. Lynch was very happy to hear mention of the notion that we need to get the scientific equipment developers and vendors involved. This could help feed curation into the workflow more effectively – he gave the example of cultural heritage researchers who found their cameras “knew” a lot of the metadata that they had to laboriously enter to fulfil their curation needs, and so could use the equipment to aid in the curation of the data it produced.
Lynch took a lot of heart from the focus on education to give us a generation of data preservers and data curators. He was also heartened by comments that funding agencies were taking data curation seriously as part of the grant proposal and review processes. He also suggested it would be great if we could actually track the progress of this type of cultural shift.
In concluding, Lynch looked at the more speculative elements of the day's discussion, including the Citizen Science debate – referring to Liz Lyon's paper on the topic.
However, he wants us to recall that there is a whole range of computational and observational citizen science tasks, not just the survey-based BBC Lab UK model. He also reminded us that this is not just applicable to science... we are seeing the emergence of Citizen Humanities and amateur study in other areas which has been revitalised by the web. What we need need to look to is building data support for citizen scholarship as a whole.Finally, Lynch made a speculation involving the measure “scientific papers per minute” which underscores how badly out of control scientific communication is and creates a huge problem when propagating and curating knowledge. It seems to Lynch that one of the things we need to recognise is that many of these papers don't need to be papers, but database submissions. This would be a better way to do things if we are going to manage the data – without the emphasis on the traditional individual-voice analysis paper. So we need to have is a hard conversation about traditional forms of scientific communication and data curation to determine how data curation fits into scholarly communication and how scholarly communication may need to change to help us manage the sheer volume of the output.I caught up with Cliff just after his summary of day one to ask what he is looking forward to most from day two of IDCC 09...“I am looking forward to hearing from Ed Seidel. Most of us in the States know that there are three more DataNet awards in their final stages, so we would love to know who has got the inside track on those... although I suspect he will say that he can't comment on that! Following on from my summary, I would like to know what people think about how we can track the uptake on data curation in funding bids.Having been involved in the paper review process, I know that the best peer-review paper is very good, and there are some other great papers being presented tomorrow, so I am very much looking forward to it!”
http://digitalcuration.blogspot.com/2009/12/idcc-09-cliff-lynch-sums-up-day-one.html
To contrast and conclude the morning plenary sessions, the four speakers formed a panel to accept questions from the audience.
Q: Anne Trefethen was asked to explain more about Blog3.
A: Anne's colleagues have been using it and it will be launched next week at the All Hands meeting.
Q: How is user-centric design being used in other areas? (aside from Neuro Hub)
A: Carole Palmer explained that their work does involve requirements based work, whilst William Michener explained that DataONE engaged users from the beginning from different research centers, each of which also does its own work in their centers to establish the needs of their users. Mark Birkin explained how they have identified three different types of users – emphasising how diverse the groups of users can be – and highlighted the use of social networking tools to harvest user views directly.
Q: What perspectives do the panel have as to whether data curation is still a pioneering activity and what level of maturity is there among researchers?
A: Trefethen noted that whilst some researchers are mature in their understanding of data management, there are groups who are surprised by the requirement for curation commitments in the funding bids. She explained that an understanding needs to be nurtured across disciplines, not just within individual disciplines. Palmer explained that in terms of preservation, they see people lining up at the door, whilst the data sharing side is not so practiced (although there are people very keen philosophically). Bad experiences have fed into this. Michener noted that there has been a disservice done by failing to educate young scientists with good data practice as part of doing science, so there is a lot of re-educating to be done. Birkin has a different perspective, as he is doing secondary analysis of well preserved primary data sources. However, there is not the same level of practice about the secondary analysis of data in his field, which can lead to researchers having to reinvent methods.
Q: Are there plans to be able to cite the data that's being used?
A: Michener is looking at a data citation model that will rely on digital object identifiers to give scientists as much credit as possible for not just their publications, but also their data. It is key to cite the data as a specific object, as the data can lead to multiple publications. The other three panellists agreed that this is part of their projects, at different levels of priority.
http://digitalcuration.blogspot.com/2009/12/idcc-09-panel-discussion-ukus.html
There is still time to register for this year’s International Digital Curation Conference in London, although you will need to be quick – I’m told that registration closes on the 25th November.
This year’s conference (the fifth), organised in partnership with the Coalition for Networked Information, has the theme Moving to Multi-Scale Science: Managing Complexity and Diversity. It promises to be an interesting event – see the full programme for more details.
You can keep up with developments in advance of the event itself by reading the Digital Curation Blog (see this particular post for example) and/or following on Twitter etc. using the tag ‘#idcc09’.
http://blog.paulwalk.net/2009/11/20/idcc09/
During an interesting session called the ‘Great Global Graph’ at the CETIS conference this week I formed the opinion that, in the recent rush of enthusiasm for ‘linked data’, three ‘memes’ were being conflated. These next three bullets outline my understanding of how these terms have been used in recent discussions, including the CETIS session:
Open data: I see this as something expressed as a philosophy or, in more concrete terms, as a policy, such as that espoused by the UK Government. There are aspects of public ownership in this, but also a philosophical approach based on ‘openness’ and a rejection of the economic idea of value in scarcity of information. I think that specific technology does not come into this really: for example one concrete realisation of this policy in the UK is the Freedom of Information Act under which it is perfectly permissible for a data owner to supply data in any reasonable format and medium. Essentially, I generally take ‘open’ to mean accessible to all, notwithstanding conditions of use.
Linked data: This one is trickier, as the term is used in quite a precise way by some proponents, based on the principles of linked data from the W3C. There are others who prefer a looser definition. There have been some well-rehearsed arguments about this, which generally come down to whether or not RDF is a pre-requisite of linked data. I’ve become inclined to use the term in its more precisely defined sense, in recognition of the efforts going on in this space.
Semantic Web: This term introduces ‘semantics’ into the mix, by layering on ontologies allowing inferences to be made from the data itself.
It seems that these terms are often used together in the same discussions, and I suspect I could benefit from some separation of concerns in some of these discussions. It seems to me that the following are true:
1. data can be open, while not being linked
2. data can be linked, while not being open
3. data which is both open and linked is increasingly viable
4. the Semantic Web can only function with data which is both open and linked
Option 1 satisfies, in part at least, the drive to make available to the public data which has been paid for by the public and which might be useful to it. There are those (and I count myself among them) who generally believe that at present, for example, it would be better to quickly make the data open in some useable form than to delay this unduly while it is processed into RDF. However, there is a reasonable case to be made for not polluting information spaces with poorly prepared datasets.
Option 2 is an approach for organisations which want to take a more resource-oriented approach to managing and exploiting internal information assets. In the CETIS session an interesting idea was floated around how such an approach might go a long way to helping organisations address data-quality issues.
Option 3 seems increasingly viable. There is value in the ‘linked’ aspect, regardless of whether or not semantic layers are introduced. This is how the Web works after all, and much of the impetus behind Web 2.0 seems, to me, to have come from a healthy mixture of addressable and accessible information and human-mediated convention (e.g. ‘hackable’ URLs). Perhaps this is the ‘Great Global Graph’ and it’s just a matter of scale?
I’m very open to comment and argument on any of this. Perhaps I’m worrying unduly about these things being mixed up, but I do sense that this space could benefit from some clarity to match the excitement and endeavour.
http://blog.paulwalk.net/2009/11/11/linked-open-semantic/
The latest issue (volume 4, issue 2) of the International Journal of Digital Curation is now available. It's a bumper issue, with two letters to the editor (a whiff of controversy there!), 8 peer-reviewed papers (originating from last year's International Digital Curation Conference), and 6 general articles (two of which came from last year's iPres08 conference). I'm really pleased with this issue, which as always is extremely interesting.
This is the last issue to be produced by Richard Waller as Managing Editor, and I'd like to pay tribute to his dedication in making IJDC what it is today. He has sourced most of the general articles himself, and those who have worked with him as authors will know the courteous detail with which he has edited their work. They may not know the sheer blood, sweat and tears that have been involved, nor the extraordinarily long hours that Richard has put in to make IJDC what it is, alongside his "day job" of editing Ariadne. Thank you so much, Richard.
We will have a new Production Editor for the next issue, whom I will introduce when that comes out (we hope at about the same time as this year's International Digital Curation Conference in London... have you registered yet?). We have some interesting plans to develop IJDC in volume 5, next year.
Update: I thought I should have said a bit more about the contents, so the following is abridged from the Editorial. Two papers are linked by their association with data on the environment. Baker and Yarmey develop their viewpoint with environmental data as background, but their emphasis is more on arrangements for data stewardship. Jacobs and Worley report on experiences in NCAR in managing its “small” Research Data Archive (only around 250 TB!). Halbert also looks at elements of sustainability, in distributed approaches that are cooperatively maintained by small cultural memory organizations.
Naumann, Keitel and Lang report on work developing and establishing a well-thought out preservation repository dedicated to a state archive. Sefton, Barnes, Ward and Downing address metadata, plus embedded semantics; their viewpoint is that of document author. Gerber and Hunter similarly address metadata and semantics, this time from the viewpoint of compound document objects. Finally, we have two papers loosely linked through standards, though from different points on the spectrum of the general to the particular, as it were. At the particular end, Todd describes XAM, a standard API for storing fixed content; while from the more general end, Higgins provides an overview of continuing efforts to develop standards frameworks. Moving on to general articles, in this case I would like to mention first my colleagues Pryor and Donnelly, who present a white (or possibly green?) paper on developing curation skills in the community. Next, I would highlight two very interesting articles that originated from iPres 2008. Dappert and Farquhar look at how explicitly modelling organisational goals can help define the preservation agenda. Woods and Brown describe how they have created a prototype virtual collection of 100 or so of the thousands of CD-ROMs published from many sources, including the US Government Printing Office. Shah presents the second part of his interesting independently-submitted work on preserving ephemeral digital videos. Finally, Knight reports from a Planets workshop on its preservation approach, while Guy, Ball and Day report from a UK web archiving workshop.
http://digitalcuration.blogspot.com/2009/10/new-issue-of-ijdc.html
Last week I posted a remark on Twitter:
Can’t help thinking that the idea that Google Wave will replace email rather misses the point….
The first response to this echoed my view on this suggesting that the real nature of Wave is rather harder to explain or understand, and implying that people fall back on a frame of reference with which they are comfortable. It certainly looks as though Google have anticipated this and offered some easily digested marketing messages. However, I also saw responses which suggested that some people still seem to be missing the point. One response insisted that Wave would only be successful if it was ‘integrated’ with email. I must confess that I still don’t understand this – I can’t really imagine what impact an integration between Wave and email would really have.
It seems to me that Wave is an ambitious attempt to exploit the idea that one future for the Web lies in social networked activity clustered around shared artefacts. Such artefacts, often what we still call ‘documents’, have been given the useful label social objects. At the centre of a Wave is a social object, with a series of applied and recorded operational transforms. Wave would therefore seem to be primarily about collaboration, as opposed to email or IM which are primarily concerned with messaging. Another way of looking at this would be to suggest that Wave is ‘object-centric’, as opposed to email which is message oriented with a facility to attach auxiliary objects.
The idea that Wave would replace email seems to be suggesting that we won’t need apples anymore because now we have oranges. This is not to say that Wave might not better fit some use-cases currently served by email – such as the problematic mode of collaborative editing of documents by sharing copies sent as email attachments. But even as we adopt better software for collaboration, there’s not much sign that we’re giving up using email. I don’t know about you, but my email inbox isn’t getting any smaller just because I use Google Docs, IM, Twitter…. Email has been tested quite thoroughly now over a few years, and appears to work quite well for asynchronous messaging!
Wave uses XMPP as its underlying protocol, which is both interesting and important, but it is also slightly misleading as it implies an important connection with ‘instant messaging’, which I think is illusory and unhelpful.
Wave is possible because the barrier of network latency is gradually being reduced. Real-time collaboration across the global network is now viable for many. Of course Wave is not the only game in town – other interesting approaches (mostly also using a variation on the pubsub paradigm) to the real-time Web, such as pubsubhubbub are being actively developed and experimented with. But Google Wave is important – because it’s Google who are doing it. It will gain a lot of publicity, and will likely play its part in driving a culture change allowing real-time collaboration across the global network to ‘go mainstream’. It should be remembered that Google’s Gmail, the poster-child for Web-based email, is still significantly smaller in terms of users than Yahoo and Hotmail.
Because Wave offers APIs to developers and users out of the box, I think it is going to be difficult to say what shape this new offering from Google will take once a significant number of people are using it. The ability to federate Wave services could be significant in this respect.
http://blog.paulwalk.net/2009/10/07/not-ready-to-wave-goodbye-to-email/
Scholars in the Netherlands can deposit or search information in a repository system called DANS EASY, containing about 500,000 files, with a wide diversity of formats. How do I deal with a file called cars.DBF, now an obsolete format? Their system can read such formats and convert them to the XML-based MIXED format, which identifies the data type and contains information on structure and content. So this was a smart conversion from the binary, obsolete dbase file to a reusable XML file. In the future it can be converted from this format to a current format of choice. This process (allegedly) does not require multiple migrations… They have a SDFP community model for spreadsheet and tabular data. Have created some code for DBF and DataPerfect formats that they had to reverse engineer, in SourceForge; this is a very labour-intensive activity, and really should be a community effort. Question: does reverse engineering expose to risk? Don’t know…
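To make the idea of a format-neutral XML intermediate a little more concrete, here is a minimal Python sketch. It assumes the obsolete file has already been parsed into column definitions and rows (the hard, reverse-engineered part is not shown). The function name and the element names (table, structure, column, content, row, cell) are invented for illustration and are not the MIXED schema.

```python
import xml.etree.ElementTree as ET


def table_to_xml(name, columns, rows):
    """Serialise a parsed table to a simple, self-describing XML string.

    columns: list of (column_name, type_name) pairs (hypothetical type names)
    rows:    list of tuples, one value per column
    """
    table = ET.Element("table", {"name": name})

    # Structure section: records the column names and data types.
    structure = ET.SubElement(table, "structure")
    for col_name, col_type in columns:
        ET.SubElement(structure, "column", {"name": col_name, "type": col_type})

    # Content section: one <row> per record, one <cell> per value.
    content = ET.SubElement(table, "content")
    for row in rows:
        row_el = ET.SubElement(content, "row")
        for (col_name, _), value in zip(columns, row):
            cell = ET.SubElement(row_el, "cell", {"column": col_name})
            cell.text = str(value)

    return ET.tostring(table, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical contents for a file like cars.DBF.
    print(table_to_xml("cars",
                       [("make", "string"), ("year", "integer")],
                       [("Volvo", 1983)]))
```

Because structure and content are kept apart, a later migration to whatever format is current only has to read this one intermediate form, which is the point made above about avoiding multiple migrations.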
http://digitalcuration.blogspot.com/2009/10/ipres-2009-van-horik-on-mixed-framework.html
Early home computers often used audio cassettes as data media. Quite a bit of such data still exists on audio tapes in various archives, getting in worse and worse condition. Can they migrate the data without the original system in the future? The system they used is the Philips Videopac+ G7400, basically a video game system released in 1983… and another one (!). Data are encoded in bitstreams, which in turn are encoded in analogue waveforms (via a microphone/headphone socket pair and an audio cassette system!). They worked out how the waveforms responded to changes in the data (basically reverse-engineering the data encodings; would not have been so easy without a working computer). As a result, they were able to write a migration tool from the audio streams to non-obsolete formats. It turned out there was already a solution that worked where there was a good signal from the tape, but these were often very old tapes in poor condition, so they implemented a different approach, which worked better. Using old tapes, the other system found no files. The actual system recovered 6 out of 23. Their new implementation recovered 22 out of 23 files, in some cases with errors. They checked by re-encoding the recovered files (on new tapes) and reloading to the actual system; most had minor errors that could be fixed if you knew what you were doing. They think their findings are valid for all systems that use audio encodings, although there will be wide variations in encodings and file types, but it’s not extensible to other media types.
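For readers unfamiliar with how bits end up as sound, the following Python sketch shows one toy scheme and the re-encode-and-compare check mentioned above. It is not the Videopac encoding, which the talk notes do not describe. It assumes each bit is one square-wave cycle, short for 1 and long for 0, and recovers bits by timing the gaps between rising edges. All names and sample counts are assumptions.

```python
def encode_bits(bits, short=8, long=16):
    """Toy encoder: each bit becomes one square-wave cycle.

    A 1 is a short cycle and a 0 is a long cycle (lengths in samples).
    """
    samples = []
    for bit in bits:
        half = (short if bit else long) // 2
        samples.extend([1.0] * half + [-1.0] * half)
    return samples


def decode_bits(samples, short=8, long=16):
    """Recover bits by measuring the distance between rising edges."""
    threshold = (short + long) / 2
    edges = []
    previous = -1.0  # treat the start of the recording as coming from "low"
    for index, sample in enumerate(samples):
        if previous <= 0.0 < sample:
            edges.append(index)
        previous = sample
    edges.append(len(samples))  # the end of the signal closes the last cycle
    return [1 if (end - start) < threshold else 0
            for start, end in zip(edges, edges[1:])]


if __name__ == "__main__":
    original = [1, 0, 1, 1, 0, 0, 1, 0]
    quiet = [0.2 * s for s in encode_bits(original)]  # a weak but clean signal
    print(decode_bits(quiet) == original)
```

Timing edges rather than measuring amplitude is why a quiet signal still decodes here. Real degraded tapes also suffer dropouts, speed drift and noise, which this sketch does not attempt to handle.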
http://digitalcuration.blogspot.com/2009/10/ipres-2009-guttenbrunner-on-digital.html
Blogs are a new medium but an old genre, witness Samuel Pepys’ diaries for instance (now also a blog!). But since they are web based, aren’t they already archived through web archiving? However, simple web archiving treats blogs simply as web pages; pages that change but in a sense stay the same. Web archiving also can’t easily respond to triggers, like RSS feeds relating to new postings. Web archiving approaches are fine, but don’t treat the blogs as first class objects. New possibilities can help build new corpora for aggregating blogs to create a preserved set for institutional records and other purposes. ArchivePress is a JISC Rapid Innovation (JISCRI) project, which once completed will be released as open source. The project started with a small 10-question survey, for which the key question was: which parts of blogs should archiving capture? In descending order the answers were posts, comments, tag & category names, embedded objects, and the blog name & URLs. These findings were broadly in agreement with an earlier survey (see paper for reference). Set out to find the significant properties of blogs. Significant properties, they see as in the eye of the stakeholder. First round this includes content (posts, comments, embedded objects), context (including authors & profiles), structure, rendering and behaviour. To achieve this, they build on the Feed plugin for WordPress, which gathers the content as long as an RSS or Atom feed is available. WordPress is arguably the most widely used blog platform; it’s open source, it’s GPL and it has publicly available schemas. Maureen showed the AP1 demonstrator based on the DCC blogs [disclosure: I’m from the DCC!], including blog posts written today that had already been archived. The AP2 demonstrator (the UKOLN collection) will harvest comments, and resolve some rendering and configuration issues from AP1; and will allow administrators to add new categories (tags?).
It seems to work; there turned out to be more variations in feed content than expected. Configuration is tricky, so must make it easier.
http://digitalcuration.blogspot.com/2009/10/ipres-2009-pennock-on-archivepress.html
[CR: missed the start of this posting the last one!] Using a cost model for digital curation, based on the functional breakdown from OAIS. Repeatedly break down activities until you get to costable components; looks rather frightening. Have use case for digital migration. Cost factors include format interpretation, software provision (development of reader, writer & translator). Interesting data in person weeks for development of migration, eg TIFF to PDF/A as 34.7 person weeks (!!) Reporting results of some earlier stuff; A-archives dating 1968-1998; very heterogeneous; B & C archives more recent and more homogeneous. Shows results from model predictions and actual costs, differences mostly because the A archives were so hard. Also, for the better archives, the model did well overall but under-estimated some parts and over-estimated other parts. Second test case was migration of 6 TB of data in 2000 files (very big ones: 300 MByte each). They bought software; the model over-estimated the “development” time on this basis, but under-estimated the processing, perhaps because of the very big files; throughput was very low. Overall, they found that detailed cost factors make the model not an accurate predictor (but still useful). Precision an issue; models are inaccurate per se, but sometimes give impression of accuracy. Searching for studies on format life expectancy and migration frequency [longer and less in my view]. Question: how about software re-use? They cost on a first mover basis. Migration tools also become obsolete. Question: why did you think migrating from PDF was necessary? Hardly a format at risk. Turns out to be a move from proprietary to non-proprietary. Question on scaling: thousands to hundreds of millions of objects; will these apply? Answer was that they will. [CR: doubt this; biggest flaw in LIFE so far has been devastating scaling problems.]
http://digitalcuration.blogspot.com/2009/10/ipres-2009-kejser-on-danish-cost-model.html
Paul reviews the two earlier phases of LIFE; LIFE3 is UCL, BL & HATII [disclosure: DCC partner; disclosure, I’m also an “expert” on LIFE panels etc] at Glasgow University. Defined a lifecycle approach to costing, creating a generic model of digital preservation lifecycle. LIFE3 is now trying to create a costing tool based on costing models based on stages of the digital lifecycle. Will use previous LIFE data, also Keeping Research Data Safe project. Tool inputs: content profile, organisational profile and context. Lifecycle stages are creation/purchase, acquisition, ingest, bit-stream preservation, content preservation, and access. Where possible, exploit existing work, eg PLANETS work building on DROID, also FITS tool (?), also looking at DRAMBORA & DAF, plus PLANETS tool PLATO. A template approach lowers the barrier for non-digital preservation people. Context: still very much a hybrid world; analogue as well as digital. Non-digital not dying, but usage increasing. Also greater variety of digital content, eg video etc. Resources are currently 20:1 on preservation of non-digital to digital, but will need to move more towards 1:1. Need to think about the risk elements as well as cost elements. LIFE is also expected in the BL to support preservation planning, eg in purchase/acquire/digitise, and in selecting appropriate preservation strategies. Finally, need to budget for resulting costing [CR: the feedback from prediction versus actual could be very interesting!] Challenges and request for help: had a simple categorisation of content type & complexity. This has been criticised but without a better example. Help, please. Also need more costing data. Finally, will be trialling models, and we’d like to hear from anyone who might want to participate in this.
http://digitalcuration.blogspot.com/2009/10/ipres-2009-wheatley-on-life3.html
Motivation is to distribute data not just geographically but also across different technologies. Also preserving through software changes; forward migrate to later versions, or replacements. Also to have a succession plan for the case where the repository fails. TIPR is defining a common exchange format. Involves FCLA using DAITSS, Cornell using ADORE but migrating to FEDORA, NYU using DSpace. FCLA have one AIP per intellectual entity, and they retain the first and the latest representation. Cornell hold one AIP for each representation. NYU also has one AIP (didn’t catch how it works). Format is called the Repository Exchange Package (RXP) based on METS and PREMIS. Need to work with multiple sources, but contain sufficient data for the receiving repository to do what it needs. Minimal structure is 4 files in a directory. A METS document about the source repository, plus provenance and optional rights, plus the actual representations in the package. The second file contains information about provenance. Then two more PREMIS files (?); finally a files manifest (cf BAGIT). [I’m not sure I’m capturing this well, best look at the PPTs later. But why are the slides blue and yellow mixed up???] Transfer tests: a broadcast transfer, and a ring transfer. In the latter case, each RXP is ingested, then disseminated and sent on to the next, until it gets back to the first. They have built a lot of stuff, and implemented the broadcast transfer test. Next steps: the ring test, and try different (wacky!) RXPs. Question: why use METS/PREMIS but not RDF & ORE? Familiarity! Question: will this work with Bagit? Yes; they use Bagit right now…
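The files manifest is the part of the package that lets a receiving repository check that nothing was lost or altered in transfer. As a rough illustration only, the Python sketch below builds a BagIt-style manifest (one "checksum, two spaces, relative path" line per file) for a package directory. The function name, the choice of SHA-256 and the example file names are assumptions, not details of the RXP specification.

```python
import hashlib
from pathlib import Path


def build_manifest(package_dir):
    """Return manifest text for every file under package_dir.

    Each line has the form "<sha256 hex digest>  <relative/posix/path>",
    in the style of a BagIt payload manifest.
    """
    root = Path(package_dir)
    lines = []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        lines.append(f"{digest}  {path.relative_to(root).as_posix()}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    import tempfile

    # Build a throwaway package with hypothetical file names.
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "data").mkdir()
        (Path(tmp) / "data" / "representation-1.bin").write_bytes(b"payload")
        (Path(tmp) / "mets.xml").write_bytes(b"<mets/>")
        print(build_manifest(tmp), end="")
```

In a ring transfer, each repository could regenerate the manifest after ingest and compare it with the one it received before disseminating the package onward.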
http://digitalcuration.blogspot.com/2009/10/ipres-2009-pawletko-on-tiprs-progress.html
I have just spent an interesting and inspiring 24 hours at the JISC Rapid Innovation Programme meeting, which was organised by UKOLN (disclaimer: I work for UKOLN), and funded through the JISC-funded IE Demonstrator project. The venue chosen for the event was certainly an unusual one – the City of Manchester Stadium, home of Manchester City Football Club. I thought the venue worked very well for this event and would recommend it. The event was primarily aimed at developers from the JISC Rapid Innovation projects, but with a significant number of other delegates drawn from JISC programme management as well as the wider developer community.
With this event, we decided to address an issue that has become apparent as JISC has started to engage more directly with developers in the HE sector: developers are often untrained, and sometimes not naturally disposed to explaining their projects to others, especially when those others are not themselves developers. So we hit on the (admittedly somewhat artificial) exercise of requiring a representative of each project to deliver a 45 second ‘pitch’ to the assembled audience, which was recorded to video. The project reps were then given a 15 minute consultation with one of a set of panels of three ‘experts’, led by people from media and communications backgrounds, where they reviewed the video of their pitch and discussed ways of improving upon it. These were held in the Stadium’s executive boxes overlooking the pitch, which was pretty cool! The reps were warned that they would be required to deliver a new, improved 20 second pitch the following day….
This exercise was something of a gamble to be honest. We were confident that a significant number of the project reps would appreciate why it is important that they be able to clearly explain what their project is about in a few sentences, in terms that a wide variety of people might be able to understand. We hoped that the majority would be able to benefit from these exercises to the point where they could deliver a compelling, clear pitch for their project. The results, I’m really happy to say, were outstanding! The improvement over 24 hours was remarkable, and JISC now has a portfolio of clear explanations of the 40+ ‘rapid innovation’ projects, not to mention a group of developers better equipped to explain what it is they are working on.
As well as a training exercise, this event delivered a series of ‘lightning talks’, panel sessions and ‘show & tell’ opportunities – a set of features which has become a staple of developer-centric events. Twitter was actively used as a back-channel to the event so you can get a small sense of what was going on from that stream.
I also used this event to ‘officially’ launch the DevCSI project – but managed to cock up my presentation – by ‘losing’ my presenter’s display with all my notes. As I’d decided to go for a one word per slide approach for much of the presentation, this was a bit of a disaster for me. Oh well, I gathered some real interest in this nonetheless and some opportunities for events and other engagements. If you’re interested, you can read more about this on the project blog.
I’d like to extend a big thank-you to everyone who came, many of whom stepped some way out of their ‘comfort zone’ to engage with the ‘pitching’ exercise. I’d especially like to thank Mahendra Mahey (UKOLN) who did most of the organising together with David Flanders (JISC), as well as the ever-professional UKOLN Events Team (Natasha Bishop and Michelle Smith) who seemed to work non-stop for 24 hours. Our army of professional bloggers was fantastic, offering expert advice on the pitches as well as conducting interviews, all of which have already been transcribed to the IE Demonstrator Blog, with a large number of the projects. David Tarrant (Southampton University) and Julian Cheal (UKOLN) provided excellent technical support, maintaining a networked news service which was displayed all over the venue.
This event was a pleasure to be involved in – there was a great spirit of cooperation throughout which bodes really well for future events with the developer community.
http://blog.paulwalk.net/2009/09/04/jisc-rapid-innovation-event/
I did a presentation on Trust and Digital Archives at the PASIG Malta meeting; not a very good presentation, I felt. But somewhat at the last minute I added a scarcely relevant extra slide on my favourite bête noire: Preservation’s dirty little secret, namely backup, or rather the lack of it. Curation and preservation are about doing better research, and reducing risk to the research enterprise. But all is for nothing if those elementary prior precautions of making proper backups are not observed. You can’t curate what isn’t there! Anyway, that part went down very well. Of course the obvious reaction at this point might be to say, tsk tsk, proper backups, of course, everyone should do that. But I’m willing to bet that the majority of people have quite inadequate backup for their home computer systems; the systems on which, these days, they keep their photos, their emails, their important documents. Or worse, they think that having uploaded their photos to Flickr means they are backed up and even preserved. There’s a subtext here. Researchers are bright, they question authority, they are even idiosyncratic. They often travel a lot, away from the office for long periods. They have their own ways of doing things. Yes, some labs will be well organised, with good systems in place. But others will leave many things to the “good sense” of the researchers. In my experience, this means a wide variety of equipment, both desktop and laptop, with several flavours of operating system in the one research group. One I saw recently had pretty much the entire gamut: more than one version of MS Windows, more than one version of MacOS/X, and several versions of Linux, all these on laptops. Desktop machines in that group tended to be better protected, with a corporate Desktop, networked drives and organised backup systems. But the laptops, often the researchers’ primary machines, were very exposed. 
My own project has Windows and Mac systems (at least), and is complicated by being spread across several institutions. The "good sense" of researchers apparently leaves a lot to be desired, according to a few surveys we've seen over the past couple of years, including the StORe project (mentioned in an earlier blog post). At a recent meeting, we even heard of examples of IT departments discouraging researchers from keeping their data on the backed-up part of their corporate systems, presumably for reasons of volume, expense, etc. To take my own example, I have a self-managed Mac laptop with 110 GB of disk or so. My corporate disk quota has been pushed up to a quite generous 10 GB. The simplest backup strategy for me is to rsync key directories onto a corporate disk drive, and let the corporate systems take the backup from there. Someone wrote a tiny script for me that I run in the underlying UNIX system; typically it takes scarcely a couple of minutes to rsync the Desktop, my Documents and Library folders (including email, about 9 GB in all), when in the office. But I keep downloading reports and other documents, and soon I’ll be bumping up against that quota limit again. For a while I supplemented this with backup DVDs, and there’s quite a pile of them on my desk. But that already doesn’t work as my needs increase. Remember, disk is cheap! No-one buys a computer with less than 100 GB these days. My current personal backup solution is to supplement the partial rsync backup with a separate backup (using the excellent Carbon Copy Cloner) to a 500 GB disk kept at home, on a USB 2 port. I back up a bit more (Pictures folders etc), but it’s MUCH slower, taking maybe 12 minutes, most of which seems to be a very laborious trek through the filesystem (rsync clearly does the same task much faster). By the way, that simple, self-powered disk cost less than £100, and a colleague says I should have paid less.
I know this doesn't translate easily into budget for corporate systems, but it certainly should. But this one-off solution still leaves me unable to answer a simple question: are my project’s data adequately backed up? My solution works for someone with a Mac; the software doesn’t work on Windows. It seems to be everyone for themselves. As far as I can see, there is no good, simple, low cost, standardised way to organise backup! I looked into Cloud solutions briefly, but was rather put off by the clauses in Amazon’s agreements, such as 7.2 ("you acknowledge that you bear sole responsibility for adequate security, protection and backup of Your Content"), or 11.5, disclaiming any warranty "THAT THE DATA YOU STORE WITHIN THE SERVICE OFFERINGS WILL BE SECURE OR NOT OTHERWISE LOST OR DAMAGED" (more on this another time). That certainly makes the appealing idea of Cloud-based backups rather less attractive (although you could perhaps negotiate or design around it). I think what I need is a service I can subscribe to on behalf of everyone in my project. I want agreed backup policies that allow for the partly disconnected existence, experienced by so many these days. I want the server side to be negotiable separately from the client side, and I want clients for all the major OS variants (for us, that’s Windows, Mac OS/X and various flavours of Linux). I think this means an API interface, leaving room for individuals to specify whether it should be a bootable or partial backup, which parts of the system are included or excluded, and for management to specify the overall backup regime (full and incremental backups, cycles, and any “archiving” of deleted files, etc). I want the system to take account of intermittent and variable quality connectivity; I don’t want a full backup to start when I make a connection over a dodgy external wifi network. I don’t want it to work only in the office. I don’t want it to require a server on-site. 
On the one hand this sounds a lot; on the other, none of this is really difficult (I believe). Does such a system exist? If not, is defining a system like this a job for SNIA? JISC? Anyone? Is there a demand from anyone other than me?
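To make the wish-list above a little more concrete, here is a minimal sketch in Python of what the client side of such a policy might look like. Everything in it is an assumption made for illustration: the `BackupPolicy` fields, the bandwidth thresholds, the seven-run cycle and the paths are all hypothetical, and the only real tool it refers to is rsync (with its standard `-a` and `--exclude` options). It builds an rsync command but does not run it.

```python
from dataclasses import dataclass, field


@dataclass
class BackupPolicy:
    """Hypothetical per-user policy; every field name here is an assumption."""
    include: list = field(default_factory=lambda: ["Desktop", "Documents", "Library"])
    exclude: list = field(default_factory=list)
    full_every_n_runs: int = 7            # management-set cycle: a full backup is due after n runs
    min_mbps_for_full: float = 10.0       # never start a full backup on a slower link
    min_mbps_for_incremental: float = 1.0


def plan_backup(policy, runs_since_full, link_mbps):
    """Return 'full', 'incremental' or 'defer' for the current connection."""
    if link_mbps < policy.min_mbps_for_incremental:
        return "defer"                    # dodgy wifi: do nothing at all
    full_due = runs_since_full >= policy.full_every_n_runs
    if full_due and link_mbps >= policy.min_mbps_for_full:
        return "full"
    return "incremental"                  # a full backup may be due, but the link is too slow


def rsync_command(policy, home, destination):
    """Build (but do not run) an rsync invocation for the included folders."""
    cmd = ["rsync", "-a"]
    cmd += [f"--exclude={pattern}" for pattern in policy.exclude]
    cmd += [f"{home.rstrip('/')}/{folder}" for folder in policy.include]
    cmd.append(destination)
    return cmd


policy = BackupPolicy(exclude=["*.tmp"])
print(plan_backup(policy, runs_since_full=8, link_mbps=2.0))
print(rsync_command(policy, "/Users/me/", "/Volumes/corp/backup/"))
```

The point of separating `plan_backup` from `rsync_command` is the same as the point of separating policy from client in the text: the decision about when and what to back up should not be tied to one tool or one operating system.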
http://digitalcuration.blogspot.com/2009/07/my-backup-rant.html
I was very pleased to hear David Rosenthal reprise his CNI keynote on digital preservation for the Sun-PASIG meeting in Malta, a few weeks ago now. David is a very original thinker and careful speaker. I’ve fallen into the trap before of mis-remembering him, and then arguing from my faulty version. I even noted two tweets made contemporaneously with his talk, that misquoted him and changed the meaning subtly (see below). Luckily, David has made his CNI presentation available in an annotated version on his blog, so I hope I don’t make the same mistake again. If you were not able to hear this talk, please go read that blog post. David has some important things to say, pretty much all of which I agree strongly with. No real surprise there, as part of the talk at least echoes concerns I expressed in the “Excuse Me…” Ariadne article (Rusbridge, 2006; see also Rothenberg, 1995). The important digital preservation problems for society are not media degradation or media obsolescence or format obsolescence, because important stuff is online (and more or less independent of media), and widely used formats no longer go obsolescent the way they used to when Jeff Rothenberg wrote his article. The important issue is money, as collecting all we need will be ruinously expensive. Every dollar we spend on non-problems (like protecting against format obsolescence) doesn’t go towards real problems. And if you are so imbued with conventional preservation wisdom as to think that summary is nonsense, but you haven’t read the blog post, go read it before making up your mind! David concludes:
"Practical Next Steps
Everyone - just go collect the bits: Not hard or costly to do a good enough job, Please use Creative Commons licenses
Preserve Open Source repositories: Easy & vital: no legal, technical or scale barriers
Support Open Source renderers & emulators
Support research into preservation tech: How to preserve bits adequately & affordably? How to preserve this decade's dynamic web of services? Not just last decade's static web of pages"
So what are the limitations of this analysis? My quick summary from a research data viewpoint:
Lots of important/valuable stuff is not online
Quite a lot of this stuff is not readable with common, open-source-compatible software packages
We need to keep contextual metadata as well as the bits for a lot of this stuff… and yes, we do need to learn how to do this in a scalable way.
David clearly concentrates on the online world: “Now, if it is worth keeping, it is on-line. Off-line backups are temporary.” However, it’s worth remembering Raymond Clarke’s point in my earlier post from PASIG Malta about the cost advantages of offline. Particularly in the research data world, there is a substantial set of content that exists off-line, or perhaps near-line. Some of the Rothenberg risks still apply to such content. Let’s leave aside for the moment that parallels to the scenario that Rothenberg envisages continue to exist: scholars’ works encoded in obsolete digital media are starting to be ingested in archives. But more pressingly, some research projects report that their university IT departments discourage them from using enterprise backup systems for research data, for reasons of capacity limitations. So these data often exist in a ragbag collection of scarcely documented offline media (or may even be not backed up at all). In Big Science, data may be better protected, being sometimes held in large hierarchical storage management systems. A concern I have heard from the managers of such large systems is that the time needed to migrate their substantial data holdings from one generation of storage to the next can approximate the life of the system, ie several years. And clearly such systems are more exposed to risk. Secondly, David’s comments about format obsolescence apply specifically to common formats.
He says “Gratuitous incompatibility is now self-defeating”, and “Open Source renderers [exist] for all major formats” with “Open Source isn't backwards incompatible”. But unfortunately there are examples where there are valuable resources that remain at risk. There are areas with valuable content not accessible with Open Source renderers (eg engineering and architectural design). There are many cases in research where critical analysis codes are written by non-experts, with poor version control, poorly documented. And even in the mainstream world, format obsolescence can still occur in minority formats, for all sorts of reasons, including bankruptcy, but also including sheer bad design of early versions. Finally, I’m sure David didn’t really mean “just keep the bits”. Particularly in research, but in many other areas as well, important contextual data and metadata are needed to understand the preserved data, and to demonstrate its authenticity. The task of capturing and preserving these can be the hardest part of curating and preserving the data, precisely because those directly involved need less of the context. Oh, that double mis-quote? Talking of the difficulty of engaging with costly lawyers, David said “1 hour of 1 lawyer ~ 5TB of disk [-] 10 hours of 1 lawyer could store the academic literature”. One tweet reported this as “Lawyer effects; cost of 10 lawyer hours could save entire academic literature!” and the other as “10 hours of a lawyer's time could preserve the entire academic literature”. See what I mean? Neither save nor preserve mean the same as store! Overall, David does a great job, in his presentation, blog post and other writings, in reminding us not to blindly accept but to challenge preservation orthodoxy. Put simply, we have to think for ourselves. Rothenberg, J. (1995). Ensuring the longevity of digital documents. Scientific American, 272(1), 42. 
http://search.ebscohost.com/login.aspx?direct=true&db=buh&AN=9501173513&site=ehost-live (yes, that URL IS the "permanent URL" according to Ebsco!)
Rusbridge, C. (2006). Excuse Me... Some Digital Preservation Fallacies? Ariadne, from http://www.ariadne.ac.uk/issue46/rusbridge/.
http://digitalcuration.blogspot.com/2009/07/rosenthal-at-sun-pasig-in-malta.html
In the beginning was the World Wide Web. Then we were going to have the Semantic Web. (Then we had Web 2.0, but that’s another story.) But maybe the Semantic Web wasn’t semantic enough for some, so they changed the name to Linked Data, and it began to take off a little more. Now there’s an argument on whether all linked data are Linked Data! The debate started with Andy Powell asking on Twitter what name we should use when all the conditions for Linked Data are met except for one, which was the requirement that data be expressed in standards, specifically RDF (see Andy's summary). Tim Berners-Lee had suggested there were 4 principles for Linked Data:
Use URIs as names for things.
Use HTTP URIs so that people can look up those names.
When someone looks up a URI, provide useful information, using the standards (RDF, SPARQL).
Include links to other URIs, so that they can discover more things.
There were quite strong divisions; one group says roughly: “Linked Data is a brand and a definition; live with it”, while the other group says something like “Linked Data can afford to be inclusive, and will benefit from that” (both of these are extreme simplifications). I’ve read all the remarks and they’re pretty convincing; I mostly agree with them (not much help to you, gentle reader!). Paul Walk's summary is quite balanced. However, I particularly liked a comment made on someone else’s blog post by Dan Brickley, who should know about RDF (quoted by Andy in the post mentioned above): “I have no problem whatsoever with non-RDF forms of data in “the data Web”. This is natural, normal and healthy. Statistical information, geographic information, data-annotated SVG images, audio samples, JSON feeds, Atom, whatever. We don’t need all this to be in RDF. Often it’ll be nice to have extracts and summaries in RDF, and we can get that via GRDDL or other methods.
And we’ll also have metadata about that data, again in RDF; using SKOS for indicating subject areas, FOAF++ for provenance, etc. The non-RDF bits of the data Web are – roughly – going to be the leaves on the tree. The bit that links it all together will be, as you say, the typed links, loose structuring and so on that come with RDF. This is also roughly analogous to the HTML Web: you find JPEGs, WAVs, flash files and so on linked in from the HTML Web, but the thing that hangs it all together isn’t flash or audio files, it’s the linky extensible format: HTML. For data, we’ll see more RDF than HTML (or RDFa bridging the two). But we needn’t panic if people put non-RDF data up online…. it’s still better than nothing. And as the LOD scene has shown, it can often easily be processed and republished by others. People worry too much! :)” I think this makes lots of sense for research data. I’ve been wondering for some time how RDF fits into the world of research data. I asked the NERC Data Managers at their meeting earlier this year, and the general consensus appeared to be that RDF was good for the metadata, but not the actual research data. This seems reasonable and is consistent with Dan’s view above. But it does rather raise the question about exactly what kinds of data RDF IS suitable for. It begins to look as if it is good for isolated facts, simple relationships and descriptive data. While RDF probably can encode most things you would put in databases or scientific datasets, generally it would be very difficult to express what those databases and datasets can express, and there would be a massive explosion of triples if one tried. To answer Andy’s original question (what name…), although I was taken with the idea of linked data, it’s clearly too easy to confuse with Linked Data. So I think I’d go with Paul Walk’s suggestion of Web of Data, or interchangeably Dan Brickley's data Web. If we can weave research data into a Web of Data, we’ll be doing well!
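The "explosion of triples" worry can be illustrated with a rough sketch. The Python below is not a real RDF serialisation and uses no RDF library; it simply assumes the most naive encoding, one (subject, predicate, object) triple per table cell. The dataset, the column names and the example.org URIs are all made up for the illustration.

```python
def table_to_triples(base_uri, columns, rows):
    """Turn each cell of a table into one (subject, predicate, object) triple."""
    triples = []
    for row_number, row in enumerate(rows, start=1):
        subject = f"{base_uri}/row/{row_number}"          # hypothetical URI scheme
        for column, value in zip(columns, row):
            predicate = f"{base_uri}/column/{column}"     # hypothetical URI scheme
            triples.append((subject, predicate, value))
    return triples


# A made-up three-row dataset, purely for illustration.
columns = ["station", "timestamp", "temperature_c"]
rows = [
    ["A1", "2009-07-01T00:00", 14.2],
    ["A1", "2009-07-01T01:00", 13.9],
    ["B7", "2009-07-01T00:00", 15.1],
]

triples = table_to_triples("http://example.org/dataset/42", columns, rows)
print(len(triples))  # one triple per cell: 3 rows x 3 columns = 9
```

Under this naive encoding the triple count is rows times columns, so a dataset with a million rows and twenty columns would need twenty million triples before any typing or provenance is added. That is consistent with the NERC view: describe the dataset in RDF, and leave the dataset itself as a leaf.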
http://digitalcuration.blogspot.com/2009/07/semantic-web-of-linked-data-for.html
That's volume 4, issue 1 of the International Journal of Digital Curation... and I didn't report it here. My apologies for that. It's our biggest issue yet, with 10 peer-reviewed papers and 4 general articles, plus 2 editorials (a guest editorial from Malcolm Atkinson, and a normal one from me). There's some really interesting stuff, mostly from the Digital Curation Conference in Edinburgh last year.
There are still a few papers from last year's conference to come, plus a selection from iPres 2008 at the BL in London as well. We are also hoping that some papers will emerge from iPres 09, which has just opened registration, and will shortly be feeding back the results of their selection process to authors. Still time to submit to this year's Digital Curation Conference, guys (submissions close August 7, 2009).
We have done a couple of interesting analyses on the IJDC. One was a "readership analysis" based on web stats, for the period January-June 2009. Eight out of the ten most down-loaded papers in that time were from IJDC 3(2) (the ninth was from 3(1), and the tenth was from IJDC 1). These 10 papers were down-loaded just under 440 times each during that period (395 to 485 times).
The second was to use Google Scholar to assess citations for the issues up to and including 3(2). Issue 4(1) is too recent. I checked the peer-reviewed papers, which GS suggested had been cited 92 times (maximum 11 times for that most-down-loaded Beagrie article from Issue 1), for an average of 3.3 citations per paper. I also checked the articles, although I ignored simple reports, editorials and reviews. Counting peer-reviewed papers and checked general articles, there were 142 citations, for an average of 2.7 citations per item.
Only one out of those eight most down-loaded papers in issue 3(2) had translated those downloads into significant citations: the Cheung paper has 6. But we should give them time, I think; citations per checked item per issue are noticeably lower for more recent items, as you might expect.
4.2 in IJDC 1
3.3 in IJDC 2(1)
4.2 in IJDC 2(2)
2.1 in IJDC 3(1)
1.4 in IJDC 3(2)
By the way, we are particularly proud of one citation of an IJDC paper from a paper in Nature's Big Data Issue (Howe et al, The future of biocuration). The citation was of the Palmer et al paper in IJDC 2(2)... but Google Scholar failed to notice it. So these figures come with a few caveats!
http://digitalcuration.blogspot.com/2009/07/ijdc-volume-41-was-published.html
Over the years I’ve found the ‘Semantic Web‘ to be an interesting though, at times, faintly worrying concept. It has never much impacted on my work directly, despite my having been embroiled in Web development since, well pretty much, Web development began. Of late I’ve tried to follow the earnest discussions about how the Semantic Web went all wrong because it was hijacked by the AI enthusiasts, and how it is going to be alright now because a more pragmatic paradigm has gained the upper-hand, that of Linked Data.
This post is my tuppence worth provoked by an interesting debate on Twitter recently which was kicked off by Andy Powell who has just blogged about it. It’s worth reading Andy’s post to get the details of this, but in essence, Andy asked if there was a term we could use for Linked Data where the RDF part is not required. This provoked a distributed argument between those who believe that the RDF model is integral to Linked Data, those who believe it shouldn’t be, and those who Don’t Really Care To Be Honest.
I found myself generally in agreement with Paul Miller who made the point:
Despite this undoubted progress, the green shoots of a Linked Data ecology remain delicate. By moving from a message that stresses the value of unambiguous and web-addressable naming (HTTP URIs), providing ‘useful information,’ and enabling people to ‘discover more things’ by linking toward a message that elevates one of the best mechanisms (RDF) for achieving this to become the only permissible approach, we do the broader aims great harm.
It seems to me that there has been progress over the years which a zealous insistence on RDF could jeopardise. I had thought about joining in and blogging about this, and then came across this comment from Dan Brickley via Rob Styles, which pretty much said it all I thought. He finishes with:
But we needn’t panic if people put non-RDF data up online…. it’s still better than nothing. And as the LOD scene has shown, it can often easily be processed and republished by others. People worry too much!
Quite.
But then I read Andy’s post, in which he links to various people including Ian Davis in the Linked Data Brand. Right up front, Ian states:
This is not a technical issue and its not one of zealots or pragmatists: its a marketing and branding issue.
The term Linked Data was coined to brand a specific class of practices: namely assigning HTTP URIs to abitrary things and making those URIs respond with RDF relating the things to other things.
Here very few of the ‘things’ are documents, instead they are people, places, objects and concepts.
That deliberately excludes many other practices of publishing data on the web such as atom feeds, spreadsheets, APIs and even many existing RDF use cases.
Ah – so, it’s the label which is important, because it denotes an important movement, led by Tim Berners-Lee himself. Interestingly, it’s concerned with a very small part of the general concern of making data available on the Web – actually it’s not even about data per se – it’s about linking concepts.
Ian goes on to say:
The Semantic Web community has been notorious for its poor marketing over the past decade. Now just when it seems the community has found the right balance between technology and mass appeal it feels like people are trying to rip away that success for their own purposes. That is deliberately emotive language because brands are all about emotion.
I have spent much of my career linking data on the Web, linking eLearning systems to Library OPACs for example. I have occasionally used RDF in the past and am working with it again now. I have used many other technologies. In the last few years I have seen the dawning of an understanding on the part of the mainstream of Web developers and users that this kind of thing might be useful and worth investing some time and effort in. I would argue that the most significant advance in linking data in recent years has been in the wide-spread adoption of cottage-industry XML formats in Web 2.0 mashups. I don’t think people are trying to appropriate the brand, so much as resisting the idea that a term as generic sounding as ‘Linked Data’ could be owned by what is, in the scheme of things, a small group.
So if I decided to use ‘Linked Data’ to describe linking data in general – it certainly wouldn’t be because I was jumping on a band-wagon – I think that the wheels came off that particular band-wagon years ago.
So that leaves us back at Andy’s question. I’m happy to avoid winding up the Linked Data people by ‘appropriating’ their term but, then, what do I call it when I link data on the Web and I don’t check Sir Tim’s design issues first? Personally, I like ‘Web of Data’. I’ve blogged about this before, but I still believe that this slide from Tom Coates’s Native to a Web of Data presentation (which I suggested to Andy as part of the answer to his original question) sums it up best – I’ve had a print-out of that particular slide stuck up on my office wall for about three years.
http://blog.paulwalk.net/2009/07/21/no-data-here-just-linked-concepts/
5th International Digital Curation Conference (IDCC09) Moving to Multi-Scale Science: Managing Complexity and Diversity. 2 – 4 December 2009, Millennium Gloucester Hotel, London, UK.
We are pleased to announce that the Paper Submission date for IDCC09 has been extended by 2 weeks to Friday 7 August 2009: http://www.dcc.ac.uk/events/dcc-2009/call-for-papers/
Remember that submissions should be in the form of a full or short paper, or a one page abstract for a poster, workshop or demonstration.
Presenting at the conference offers you the chance to:
Share good practice, skills and knowledge transfer
Influence and inform future digital curation policy & practice
Test out curation resources and toolkits
Explore collaborative possibilities and partnerships
Engage educators and trainers with regard to developing digital curation skills for the future
Speakers at the conference will include:
Timo Hannay – Publishing Director, Nature.com
Professor Douglas Kell – Chief Executive of the Biotechnology & Biological Sciences Research Council (BBSRC)
Dr. Ed Seidel, Director of the National Science Foundation’s Office of Cyberinfrastructure
All papers accepted for the conference will be published in the International Journal of Digital Curation.
Sent on behalf of the Programme Committee – co-chaired by Chris Rusbridge, Director of the Digital Curation Centre, Liz Lyon, Director of UKOLN and Clifford Lynch, Executive Director of the Coalition for Networked Information.
http://digitalcuration.blogspot.com/2009/07/digital-curation-conference-deadline.html
I was at the Sun Preservation and Archiving SIG meeting in Malta a couple of weeks ago, a very interesting meeting indeed. The agenda and presentations are being mounted here. I’ll try to pick out some points that are worth briefly blogging about, if I can. Raymond Clarke of Sun and SNIA spoke quite early on about storage (presentation not up yet). You may know that Sun now holds the Internet Archive, apparently in a mobile data centre (basically a standard container in a car park!). The point I was interested in was Raymond’s comments about tape, which he said was the fastest growing market segment. He said the cost ratio of disk:tape storage was 23:1. But, noting that storage represents around 40% of the power consumption of data centres, and given our current environmental concerns, it’s notable that the energy ratio for disk:tape is 200:1! On risk, a separate presentation by Moreira (pdf) also spoke about the green and cost advantages of tape (although he only identified a 3:1 advantage), but I note two tweets from the time:
dkeats: Tapes degrade, data loss happens, there is risk, need to measure quality of tapes in real time, and manage risk. #pasig
cardcc: #pasig Moreira shows green & cost advantages of tape, but concerns on fragility & lifetime 5-10-30 years, but maybe only days if mistreated
Now there was an element of standard vendor pitch towards hierarchical storage management systems. But if you have very large volumes, sufficient value but relatively low re-use rate, then despite some of the significant disadvantages of tape, those numbers have to drive you towards a tape solution for preservation and archiving! It certainly will significantly increase the up-front investment cost, the data centre management requirements, and possibly some risk factors, but given sufficient volumes you should make savings overall. And this could apply to quite a bit of research data...
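The "given sufficient volumes" argument is really a break-even calculation, and a small sketch in Python may make it easier to play with. Only the 23:1 cost ratio, the 200:1 energy ratio and the 40% storage share come from the talk as reported above; the disk price per terabyte and the tape up-front cost in the example are invented numbers, and the model (a fixed up-front cost plus a per-terabyte cost, with all storage moved to tape) is a deliberate oversimplification.

```python
DISK_TO_TAPE_COST_RATIO = 23.0      # disk:tape cost, as reported from the talk
DISK_TO_TAPE_ENERGY_RATIO = 200.0   # disk:tape energy, as reported from the talk
STORAGE_SHARE_OF_POWER = 0.40       # storage as a share of data-centre power


def break_even_tb(disk_cost_per_tb, tape_upfront_cost, ratio=DISK_TO_TAPE_COST_RATIO):
    """Volume (TB) above which tape is cheaper, under a simple linear cost model."""
    tape_cost_per_tb = disk_cost_per_tb / ratio
    saving_per_tb = disk_cost_per_tb - tape_cost_per_tb
    return tape_upfront_cost / saving_per_tb


def power_share_after_move(storage_share=STORAGE_SHARE_OF_POWER,
                           ratio=DISK_TO_TAPE_ENERGY_RATIO):
    """Total power, as a fraction of today's, if ALL disk storage moved to tape."""
    return (1.0 - storage_share) + storage_share / ratio


# Hypothetical prices: 230 per TB on disk, 22,000 of extra up-front tape infrastructure.
print(break_even_tb(disk_cost_per_tb=230.0, tape_upfront_cost=22000.0))
print(power_share_after_move())
```

With those invented prices the break-even point comes out at around 100 TB, and the idealised power figure at roughly 60% of today's consumption. The real numbers will differ, but the shape of the argument is the same: the larger the volume, the less the up-front cost matters.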
http://digitalcuration.blogspot.com/2009/07/online-and-offline-storage-cost-and.html
Facebook rolled out its ‘usernames‘ function today. This is a new feature at Facebook which allows a user to claim their little bit of the Facebook namespace, along the lines of:
http://www.facebook.com/[preferred_name]/
The process started at 05:00 am UK local time – on a Saturday morning – yet several people in my social and professional networks got up early to claim their personalised Facebook URL. Not all were successful despite this determination, and some ended up having to settle for some variation on their preferred username.
As for me, I enjoyed a rare lie-in.
So, why do people think this is important – and worth getting up at 05:00 for? And why am I not ‘bovvered’? From the various commentaries I’ve seen so far (blog posts and Twitter discussion, primarily), here are some aspects & motives I’ve identified, and some of the issues I have with them.
Fear of someone else registering your preferred username
This seems to be the main reason for the 05:00 land-grab. The motivation for registering a username appears to be, primarily, a defensive one. I guess there’s a sense that this might become important. The majority of people, from my very limited straw-poll, seem to fall into this category. While I don’t personally feel the need, I understand this reasoning.
Wanting to be able to offer a neat & personalised Facebook URL for you or your organisation
This is covered by Brian Kelly – he describes the decision to register a Facebook URL for an organisational Facebook page as a ‘no-brainer’, and lists a few higher-education institutions (HEIs) which have rushed to register a URL.
In his post, Brian asks:
So tell me, what is the logic in having a personal or institutional Facebook account and keeping the long form for its address? Or are the tweets I’ve been seeing simply a minority view from the ideological purists….?
For some people, the personalised URL is immediately important as they intend to use it as a personal ‘identifier’. The motivations here are convenience – such a URL can be much more memorable, and ‘vanity’ – a personalised URL is undoubtedly more satisfying and attractive. (Note, I use the term ‘vanity’ here as it has been used by others in this context and I don’t intend any pejorative sense that this term might convey).
So, why was I lounging in bed rather than rushing to claim my Facebook ID, and why would I hesitate (’ideological purity’ aside!) before registering and publicising a URL for my HEI?
I have a personal namespace, having registered the domain ‘paulwalk.net’. This is also my OpenID, through the use of delegation (I have already changed OpenID identity provider twice without changing my OpenID). I realise that maintaining a personal domain is not yet a mainstream activity – yet I’m frequently surprised by the fact that many of those generally very tech-savvy people in my professional/social networks do not bother to do this, instead investing a major part of their online identities with companies such as wordpress.com or Facebook.
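For anyone wondering what delegation looks like in practice: the page at your own URL carries a couple of `<link>` elements pointing at whichever provider you currently use, and switching provider means editing those links while the OpenID itself stays put. The Python sketch below reads those links out of a page using only the standard library. The `rel` values are the ones used for HTML-based discovery in OpenID 1.1 and 2.0; the provider URLs in the example are hypothetical, and this is an illustration, not a full OpenID discovery implementation.

```python
from html.parser import HTMLParser

# rel values used for HTML-based discovery in OpenID 1.1 and 2.0
DELEGATION_RELS = {"openid.server", "openid.delegate",
                   "openid2.provider", "openid2.local_id"}


class DelegationParser(HTMLParser):
    """Collect OpenID delegation <link> elements from a page."""

    def __init__(self):
        super().__init__()
        self.links = {}

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        # rel may hold several space-separated values
        for rel in (attributes.get("rel") or "").split():
            if rel in DELEGATION_RELS and href:
                self.links[rel] = href


def find_delegation(html):
    parser = DelegationParser()
    parser.feed(html)
    return parser.links


# Hypothetical home page: the provider URLs below are made up.
page = """
<html><head>
  <link rel="openid.server" href="https://provider.example/server" />
  <link rel="openid.delegate" href="https://provider.example/user/paul" />
</head><body>My home page</body></html>
"""
print(find_delegation(page))
```

The claimed identifier is the URL of the page itself; the links only say who vouches for it today.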
Do you trust Facebook? How much? Because, by registering a Facebook URL and publicising it, you just tied a potentially major part of your online identity with the fortunes and behaviour of this company. As an individual, this risk might be worth the convenience perhaps. But as an HEI – why would you want to introduce this risk when you already own and manage your own namespace?
As an HEI, you will have, no doubt, invested considerably in establishing a strong URL-based online brand, being careful with search engine optimisation and the like. Why then would you introduce a competing URL which will tend to dilute your primary Web address’s prominence? It may be that some HEIs have, after careful deliberation, decided to base their online identity and the marketing of their organisation on the Facebook platform – but I’d be amazed if this were true. So what exactly is the point in establishing a public Facebook URL for your organisation?
An expectation that Facebook will become an OpenID identity provider in the future
More tech-savvy users recognise that the Facebook URL they claim could soon become an OpenID. If they are a regular user of Facebook, this could offer a measure of convenience in the sense that their identity provider will be also a service provider which they use frequently. But as the usability issues with OpenID (and there are several) are gradually ironed out, we can expect to see OpenID’s importance as an ‘identifying system’ rather than an authenticating mechanism come to the fore. Using Facebook (or any equivalent service provider) as an identity provider will make less and less sense.
Time will tell
It may be that I am wrong about these issues. However, I have challenged the HEI sector’s desire to jump on the Facebook bandwagon in the past, and I have not seen much evidence to convince me that Facebook is a significant platform for engagement with students. As part of a marketing strategy, it probably makes sense to maintain some sort of presence in Facebook – just as it might make sense to establish a presence in various other systems. But on the public Web, an HEI’s identity must surely be kept independent of any private commercial concern. The mechanisms for ensuring this are well established. And, increasingly, we can begin to apply these mechanisms to our individual identities.
http://blog.paulwalk.net/2009/06/14/heis-get-facebook-fever-again/
Thanks to David Giaretta for the following information on the state of the revision to OAIS (I have commented earlier on this process):
OAIS version for public examination
Many comments and ideas for clarifications and improvements for OAIS were received as part of its 5 year review process.
These suggestions were reviewed and the proposed dispositions sent to their originators for further comment. This draft version of OAIS contains these and many other improvements and is the candidate for submission to ISO for review. At this stage we are seeking primarily to identify errors rather than further ideas.
The PDF file is available at http://cwe.ccsds.org/moims/docs/MOIMS-DAI/Draft%20Documents/OAIS-candidate-V2-markup.pdf
Please send corrections to oais-support@oais.info by 15 June 2009 (NB there are some cross-reference errors which will be corrected in the final version).
Shortly after this date the corrected OAIS update will be sent to ISO and in due course this will be released for international review at which point further comments may be submitted.
John Garrett (chair)
David Giaretta (deputy-chair)
DAI-WG CCSDS
http://digitalcuration.blogspot.com/2009/05/oais-version-for-public-examination.html
In my data recovery sideline, I thought I might tackle some of my own ancient media. I have a number of Mac backups on ancient Jaz 1 GB disks, written in the period 1997-1999. Andrew Treloar of ANDS has recovered the contents of those files onto CD-ROM, so I'm not looking to read the media any more. There are 2 backups on each of 2 Jaz disks, and the CD has 4 files from each Jaz disk. The earliest two have the following names:
FIGIT 2.971113A #001
FIGIT 2.971113A.FULL
I'm guessing these were written on 13 November, 1997. Definitely from a Mac. Does anyone know what backup software wrote these? I would like to recover the contents if possible!
The .FULL is about 250 KB, while the #001 appears to be around 300 MB and is presumably the actual backup!
Any ideas on what the backup software might be? I thought it might be some standard software on the Iomega install. The install disks I still have don't read properly, but I managed to get sight of a directory for a French DOS install disk, and couldn't see any file names that looked like backup software!
http://digitalcuration.blogspot.com/2009/04/lazyweb-mac-backup-software-associated.html
Last Friday I was at the National e-Science Centre (NeSC) for the second day of a workshop on e-Science; unfortunately I wasn’t able to get to the first day. I tweeted most of it with the Twitter hashtag #eresearch, so you might learn all you need to just by searching for that hashtag (although it is used for other things, especially by some of the Australians involved in ANDS). One thing I did like was the use of e-Science quite generally, without a specific focus on any one technology (ie it was not GRID-dominated!). However, that does make it a little harder to define the e-Science bit.
Leif Laaksonen from Helsinki, Chair of the e-Infrastructure Reflection Group, gave the keynote (not sure if, or when, the slides will be made available). The e-IRG prepares white papers and other documents aiming to realise the “vision for the future […] an open e-Infrastructure enabling flexible cooperation and optimal use of all electronically available resources”. He mentioned the European Strategy Forum on Research Infrastructures (ESFRI); this group must be doing something right, as apparently they have 10 billion Euros to play with to build and maintain this infrastructure! He spoke quite a lot about sustainability, but it appeared that this means not paying for infrastructure through a succession of European projects, but rather through sustained funding through national governments. Hmmm. I wondered what happens when a country has a bad budget year and cuts its infrastructure project; how much of the global infrastructure can be damaged by that?
Laaksonen had a number of interesting diagrams in his presentation, which can be seen from the web site. One that alarmed me (slide 17?) had an un-differentiated layer of “data” just above the network layer. I worry that is a dangerously simplistic summary; the data layer is far more fractured than that, with different disciplinary, sub-disciplinary, even sub-group approaches to curation.
Laaksonen did show a great slide, dating from 2003, showing routes of innovation from academic research to industrial acceptance. Not a monotonic progress!
This was followed by 3 quick presentations. The first was on the National Grid Service (not the one that supplies electricity and gas, but the one that supplies compute and data storage resources, using, yes, GRID technologies). The second was on OMII-UK, aiming to sustain community software developments, but itself, like the DCC, perhaps facing its own sustainability crisis? Finally, I gave a short presentation on the DCC, based on the one I gave to the environmental data managers a month or so ago. Then Bruce Beckles from Cambridge gave a wonderfully enthusiastic talk on being a one-man support service for e-Science within the Cambridge IT infrastructure.
After an hour-long panel session (no notes as I was on the panel; there was quite a focus on data, and on education & training), and lunch, there was an interesting demo session in the afternoon. This was organised so that each demo was given twice, and there were 3 sessions of 20 minutes each, so you could see quite a lot. I took in a Taverna demo (workflows, which I've wanted to understand better for some time), found I was the sole person in the second demo on NaCTeM (which meant I could ask all the text mining questions I wanted), and then saw the e-Science Central demo. Disappointing that the latter invented their own workflow system, although they claim there were good reasons, and they are hoping to backstitch Taverna in later (@lescarr tweeted back to me that their site doesn't use UTF-encodings, so if you hover over their rather nice cartoon images, the captions come up all wonky!).
Finally the wrap-up session, chaired by David de Roure (not as advertised). 
We were meant to find “just 3 things” that needed to be done to move e-Science more firmly into the national infrastructure. But of course our enthusiasm got the better of us, and we couldn’t stop. I think David and Malcolm Atkinson between them have the job of winnowing it down to the 3 top priorities. Altogether, a very interesting day; it’s good to see data becoming a real priority in e-Science!
http://digitalcuration.blogspot.com/2009/04/engaging-e-science-with-infrastructure.html
The Call for Papers for the 5th International Digital Curation Conference has just been published. With the title "Moving to Multi-Scale Science: Managing Complexity and Diversity", the conference will be held in London from 2-4 December, 2009. I believe this is THE conference for papers on advances in digital and data curation! The text of the call follows:
We invite submission of full papers, posters, workshops and demos and welcome contributions and participation from individuals, organisations and institutions across all disciplines and domains that are engaged in the creation, use and management of digital data, especially those involved in the challenge of curating data for e-science and e-research.
Proposals will be considered for short (up to 6 pages) or long (up to 12 pages) papers and also for demonstrations, workshops and posters. The full text of papers will be peer-reviewed; abstracts for all posters, workshops and demos will be reviewed by the co-chairs. Final copy of accepted contributions will be made available to conference delegates, and papers will be published in our International Journal of Digital Curation. Accordingly, we recommend that you download our template and read the advice on its use.
Papers should be original and innovative, probably analytical in approach, and should present or reference significant evidence (whether experimental, observational or textual) to support their conclusions.
Subject matter could be policy, strategic, operational, experimental, infrastructural, tool-based, and so on, in nature, but the key elements are originality and evidence. Layout and structure should be appropriate for the disciplinary area. Papers should not have been published in their current or a very similar form before, other than as a pre-print in a repository.
We seek papers that respond to the main themes of the conference: multi-scale, multi-discipline, multi-skill and multi-sector, and that relate to the creation, curation, management and re-use of research data. Research data should be interpreted broadly to include the digital subjects of all types of research and scholarship (including Arts and Humanities, and all the Sciences). Papers may cover:
Curation practice and data management at the extremes of scale (e.g. interactions between small science and big science, or extremes of object size, numbers of objects, rates of deposit and use)
Challenging content (e.g. addressing issues of data complexity, diversity and granularity)
Curation and e-research, including contextual, provenance, authenticity and other metadata for curation (e.g. automated systems for acquiring such metadata)
Research data infrastructures, including data repositories and services
Disciplinary and inter-disciplinary curation challenges and data management approaches, standards and norms
Promoting, enabling, demonstrating and characterizing the re-use of data
Semantically rich documents (e.g. the “well-supported article”)
The human infrastructure for curation (e.g. skills, careers, training, curriculum and organisational support structures)
Curation across academia, government, commerce and industry
Legal and policy issues; Creative Commons, special licences, the public domain and other approaches for re-use, and questions of privacy, consent, and embargo
Sustainability and economics: understanding business and financial models; balancing costs, benefits and value of digital curation
Important Dates
Submission of papers for peer-review: 24 July 2009
Submission of abstracts posters/demos/workshops: 24 July 2009
Notification of authors of papers: 18 September 2009
Notification of authors of posters/demos/workshops: 2 October 2009
Final papers deadline: 13 November 2009
Final posters deadline: 13 November 2009
http://digitalcuration.blogspot.com/2009/04/5th-international-digital-curation.html
I've had 3 responses by email to yesterday's post on the ICTHES journals (some responding to an associated email from me on the same issue). I'll summarise the two where quote permission was not explicit, and quote the third at length.
Adam Farquhar of the BL told me he had discussed it with their serials processing team under the voluntary scheme for legal deposit of digital material, and they will download the material into the BL's digital archive, where it will become accessible in the reading rooms (in due course, I guess). Wider access to such open access material should be available later under their digital library programme.
Tony Kidd, of the University of Glasgow and UKSG, suggested that an OpenLOCKSS type approach might be feasible. This is consistent with the email from Vicky Reich of LOCKSS; she told me I could post her response. So here it is:
"UK-LOCKSS can, and should, preserve the four ICTHES journals.
First step: Contact the publisher and ask them to leave the content online long enough for it to be ingested.
Second step: Ask the publisher to put online a LOCKSS permission statement.
Third step: Someone on the LOCKSS team does a small amount of technical work to get content ingested.
With these minimal actions, the content would be available to those institutions who are preserving it in their LOCKSS box.
If librarians want to rehost this O/A content for others, there are two additional requirements:
a) the content has to be licensed to allow re-publication by someone other than the original copyright holder. This is best done via a Creative Commons license.
b) institutions who hold the content have to be willing to bear the cost of hosting the journals on behalf of the world.
Librarians, even those who advocate open access, have not taken coordinated steps to ensure the OA literature remains viable over the long term. Librarians are motivated to ensure perpetual access to very expensive subscription literature, but ensuring the safety of the OA literature is not a priority because... it's available, and it's free. [...]
When the majority of librarians who think open access is a "good idea" step up and preserve this content (and I don't mean shoving individual articles into institutional repositories), then we will be well on our way to building needed infrastructure"
See also the comment from Gavin Baker to yesterday's post, which I think backs up Vicky's last point:
"I've thought for a while that archiving OA journals should be a goal of the library and OA community, maybe via a consortium which would harvest new issues of journals listed in the DOAJ. (We can treat as separate, for these purposes, the question of short-term archiving in case a journal goes under from the question of long-term preservation.) Is there a reason why this approach isn't undertaken? Do people assume that any OA journal worth archiving is already being archived by somebody somewhere?"
Let's be quite clear: contrary to my simplistic assumptions, the Internet Archive is NOT undertaking this task!
http://digitalcuration.blogspot.com/2009/03/more-on-icthes-journals.html
This should probably be titled “are research datasets comprised of facts and does it matter?”. It certainly does appear to matter whether datasets are comprised of facts, as in some legal jurisdictions facts are not copyrightable. If this is so, then without other protection such as the EU Database Right, or perhaps contract law, there is no basis for licences (you can’t control someone else’s use unless you have a right to exercise that control). This is part of the argument that led Science Commons to abandon attempts to find variants of the Creative Commons licences for datasets and databases, in favour of its proposals for putting datasets into the public domain.
This is an appealing solution in some research contexts, but worrying for other kinds of research. This is not for reasons of profit, but of ethics. Many medical, social science, anthropological, financial and other datasets contain data that are private, perhaps personally, culturally or corporately; these data were usually gathered with some kind of informed consent on use, and have to be protected. They cannot be placed into the public domain. If they are to be made available at all for re-use, there must be terms and conditions attached, ie some kind of licence.
I’m not attempting to argue the legal angle here. But I am interested in the “factness” of the data, that might inform the legal angle.
One might assume the height of Mount Everest is a fact. But check out the Wikipedia article on the subject to see a range of results. One might assume that the physical properties of chemical substances are facts, but check out Chemspider’s approach of assembling different measurements with their provenance (see http://www.chemspider.com/blog/there-are-no-facts-in-science-only-measurement-embedded-within-assumptions.html which links to Jean-Claude Bradley’s earlier UsefulChem article). Or think of a geospatial database: some elements must be pretty much “skill and judgment” rather than facts, such as the point where a river debouches into the sea. Finally, one might assume that the names of the winners of horse races are facts, and so they are, but only after a race committee has adjudicated on the photo-finish, or whether interference took place.
In practice, what goes into datasets is rarely what is directly measured; it is almost always highly derived through various computations, adjustments and combinations. Environmental sciences can be quite explicit on this; see for example the British Atmospheric Data Centre’s description of the UARS (Upper Atmosphere Research Satellite) data levels. Here level 0 is the raw output data streaming from telemetry and instrumentation, effectively at the level of voltage changes; it is devoid of context. Level 1 data has been converted to the physical properties being measured, but will still be in formats tied to the instrument. Level 2 is post-calibration, and would refer to entities such as calculated geophysical profiles. Level 3 would be gridded and interpolated, and at this level there might be no clear correspondence with any observations (but there should be a clear computational lineage or provenance path linking these steps).
So we seem to be in a situation where datasets contain highly derived data, at some creative distance from direct observations, and what we think of as facts are (or ought to be) contestable consensus based on potentially conflicting evidence.
In fact (hah!) after a while it becomes hard to think of any good example of real science/research data that are facts. The question is, does this matter enough to make any difference?
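To make the point about derived data levels concrete, here is a minimal sketch in Python of a level 0 to level 3 pipeline that carries its lineage along with it. It is an illustration only, not the BADC or UARS processing chain: the class name Dataset, the function names, and the gain, offset and bias parameters are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Dataset:
    """A toy dataset: a processing level, some values, and a provenance trail."""
    level: int
    values: List[float]
    lineage: List[str] = field(default_factory=list)


def derive(src: Dataset, level: int, values: List[float], step: str) -> Dataset:
    """Build a new dataset and append one provenance entry to the source lineage."""
    return Dataset(level, values, src.lineage + [step])


def to_physical(raw: Dataset, gain: float, offset: float) -> Dataset:
    """Level 0 -> 1: convert raw instrument counts to a physical quantity (assumed linear)."""
    vals = [gain * v + offset for v in raw.values]
    return derive(raw, 1, vals, f"L0->L1 gain={gain} offset={offset}")


def calibrate(l1: Dataset, bias: float) -> Dataset:
    """Level 1 -> 2: apply a (hypothetical) calibration by removing a constant bias."""
    vals = [v - bias for v in l1.values]
    return derive(l1, 2, vals, f"L1->L2 bias={bias}")


def grid(l2: Dataset, n_cells: int) -> Dataset:
    """Level 2 -> 3: linearly interpolate onto a regular grid of n_cells (>= 2) points."""
    src = l2.values
    out = []
    for i in range(n_cells):
        pos = i * (len(src) - 1) / (n_cells - 1)
        lo = int(pos)
        hi = min(lo + 1, len(src) - 1)
        frac = pos - lo
        out.append(src[lo] * (1 - frac) + src[hi] * frac)
    return derive(l2, 3, out, f"L2->L3 grid n_cells={n_cells}")


raw = Dataset(0, [1.0, 2.0, 3.0])
level3 = grid(calibrate(to_physical(raw, gain=2.0, offset=0.0), bias=1.0), n_cells=5)
```

In this toy run, some of the level 3 values (the interpolated grid points) correspond to no individual observation at all, yet the lineage list still records every step that produced them.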
http://digitalcuration.blogspot.com/2009/03/are-research-data-facts-and-does-it.html
I've been travelling and waiting pretty much all day, but it's still 24 March, so I have an hour or so to make a post for Ada Lovelace Day: a day to celebrate women excelling in technology. I didn't sign the pledge, partly as I knew I'd be out of touch, but I wanted to do a post anyway, if possible.
I haven't asked my heroines for permission to mention them, so I'll just use first names (possibly fictional). I hope they know who they are, if they read this.
Marita taught me APL (she was writing an APL interpreter), or at least enough of it to know I didn't want to know it. I've had an aversion to highly compressed code ever since! Marita also taught me a huge amount about project and people management. Mostly though, she taught me there were excellent ways to balance and integrate social and work life; ways I wished later I could emulate!
Christine was a friend; I never worked with her closely. But in between being a sociable, funny party animal mother, she took part in a management buyout, ran her company with a colleague for a decade or more, and sold her HR software across the world (still running when I moved to a new University a hemisphere away at the turn of the millennium). All apparently effortlessly!
Lynne turned me down for a job once but not twice (but then again). She showed you didn't have to be a techie specialist to understand the implications of technology for universities and libraries, and what could be achieved with the right application of forcefulness to the levers of power. She might not always agree with you, or you with her, but you always got a fair listen. And hey did things get done!
I never worked with Carole either, nor do I know her well. But you absolutely have to admire someone who can run a large, highly successful Computer Science group, making practical and highly successful contributions to data curation, and give the most successful keynote at a conference, linking curation to the VivaLaDiva shoe shopping site!
I know several Nancys of influence and technical bent in libraries. The Nancy I want to celebrate here is a negotiator, a maker of mind meetings, one of those people who gets things to happen, things involving many other people, smoothly, productively. In, round and about technology, but not fixated on it; not technology for a purpose, but a purpose supported by technology. And a good, warm-hearted friend.
Some ten years after I started in computing, I remember being quite shocked at reading about the gender gap in computing. It had never struck me before, and indeed I don't think the male/female ratio then was any different from any walk of professional life (in fact probably more balanced). This was before the PC, games and the geek factor. Even since then, I don't know what the stats say, but I've worked with so many excellent women in technology, that the saddest thing about my list is the many great candidates I have to leave out if it's to be published today.
Secondary school exam results in Britain over the last few years (girls far outstripping boys, especially in maths) suggest we should have many more Ada Lovelace candidates in the future. I certainly hope so!
http://digitalcuration.blogspot.com/2009/03/ada-lovelace-day.html
Some time ago, Suw Charman-Anderson introduced the idea of an Ada Lovelace Day, to celebrate the achievements of women in technology. As part of this effort, Suw also created a ‘pledge’ on MySociety’s excellent and innovative PledgeBank service, which stated:
I will publish a blog post on Tuesday 24th March about a woman in technology whom I admire but only if 1,000 other people will do the same.
In her blog post, Suw says:
Women’s contributions often go unacknowledged, their innovations seldom mentioned, their faces rarely recognised. [...] Recent research by psychologist Penelope Lockwood discovered that women need to see female role models more than men need to see male ones. That’s a relatively simple problem to begin to address
Now, I will admit that in addition to wanting to support this excellent initiative for its own sake, I had another, related motive. When I read the original pledge I will confess that I couldn’t immediately think of a candidate to write about – which rather reinforces the point that Suw is making. In my day job I get to collaborate in one way or another with some talented women. One or two of them I really admire – but I won’t embarrass them here – some readers of this blog will know some of these women. There are some women I encounter in my professional life who will clearly carve out a place for themselves in the lore of their particular fields.
For some reason I set myself the following criteria for selecting a candidate:
they must have actually inspired me in some way, in a broadly technological context
they must have achieved a reasonable level of public exposure – ‘made an impression’ in other words
I found myself fretting about the fact that I couldn’t think of a woman who met these conditions (I should say that I can’t think of an awful lot of people, of either sex, who I would say have inspired me in this context). But a week ago someone mentioned ‘software agents’ in conversation, I started talking about the interest I had in this area in the 90s and then I remembered Pattie Maes.
In around 1996-97 I was writing my dissertation for my MSc on the area of software agents. I had prototyped some software which would follow the activity of users browsing the Web and harvest and collate pages of interest. My ‘Big Idea’ was to gradually establish an automated recommender system using agents acting on behalf of users across a University department. It never really worked because I was a naive developer at the time and made a bunch of poor tech choices but, hey, not such a bad idea. In the course of my research I discovered the Software Agents Group of MIT’s Media Lab, led at the time by Pattie Maes.
The aspect of my interest in software agents which really fascinated me was not so much the ‘software’, as the ‘agency’. Pattie’s work seemed to be focussed in this area, and she and her team were producing some very interesting results.
At some point in the writing of my dissertation, I remember I got very bogged down in trying to get something to work well enough to demonstrate some of the ideas I was developing. It got to the point where I was considering a rather drastic change of direction, abandoning this line of research entirely. Then I came across an interview with Pattie which I found utterly inspiring. I have just spent an hour tracking this down – I did not have a reference saved anywhere unfortunately, but I believe it was On Software Agents: Humanizing the Global Computer , even if this is not the source I used at the time.
Some highlights from this include:
Now that we have a network, it’s as though we already have our intelligent machine. It’s a huge distributed system in which, like an ant society, none of the components are critical.
and
The whole metaphor of direct manipulation, of viewing software as a tool that the user manipulates, was invented about 25 years ago when the personal computer was first emerging and when the situation for the user was completely different. Back then, the computer was being used for a very small number of tasks. It was being used by one person, who knew exactly where all the information was on the computer because he or she put it there. Nothing would happen unless that person made it happen. This was a very controlled, static, structured kind of environment.
The situation that a computer user faces today is completely different. Suddenly the computer is a window into a world of information, people, software. . . . And this world is vast, unstructured, and completely dynamic. It’s no longer the case that a person can be in control of this world and master it. So there is actually a mismatch between the way in which we interact with computers, or the metaphor that we use for human-computer interaction, and what the computer environment really is like today. I think we need a new metaphor.
and, in answer to the question: “Do you think the Microsoft 97 Office Advisor is an agent?”:
It’s a simple example of an agent, but it definitely is one. It’s just providing better help functionality, but it monitors your actions, and based upon the pattern of actions that you demonstrate, it recommends specific help topics to you. So it tries to recognize what your goal is and gives you help that is relevant to the current situation. It’s not personalized yet, it’s assisting me in the same way it’s assisting you, but it’s a first step. Hopefully people will like this first attempt, and Microsoft will take it further.
Hmmm – Well you can’t be right all the time. I guess some of us don’t miss that PaperClip…!
Reading this again, I’m struck by how much my subsequent work and thinking was influenced by this. I no longer maintain a close interest in software agents, but a particular paragraph leapt out at me as being highly relevant to my thinking and work today:
Definitely my priority is to build things that demonstrate the usefulness of this technology, so that it isn’t simply the next fad that everybody has forgotten about a year from now. I want to make sure that there is something substantial there. I’m less interested in coming up with the standards before we even know whether users want this stuff. [...] Yes, definitely. And I think it’s still too early to standardize agents and the languages they use. We need more experimentation first, more wild ideas that people try out, and different applications. Whenever you come up with standards you stop research and development right there, or at least slow things down a lot.
In the mid-late nineties, some of the best work of Pattie and her team was commercialised – notably with the launch of Firefly.com (now defunct). Around this time I had started reading Wired – only available as an import from the US at the time I think – and stumbled across a feature about Pattie, the work she was doing at MediaLab, and the business she had founded. It’s still an interesting read, and shows someone being successful both in the rarefied academic world of computer science, and in mainstream commercial entrepreneurship.
Pattie has moved on to other things at Media Lab, and now runs the Fluid Interfaces Group (great home page!). If you want to see the kind of cool stuff they do there, check out Pattie’s TED talk where she demos some very cool new technology developed by one of her students, Pranav Mistry.
Photo used under Creative Commons license. Copyright Wa-J.
http://blog.paulwalk.net/2009/03/24/celebrating-ada-lovelace-day-pattie-maes/
Amsterdam in Spring, who could turn down the offer? Perhaps it would be irresistible a little later in Spring than early March (brrr), but when the sun did come out, and the workshop was done, it was lovely. I was in Amsterdam for a curious International workshop on repository infrastructure, funded by JISC and SURF, with the DRIVER project. It turns out I had no idea what repository infrastructure meant before I went, and I guess I know only a little more now.
I was asked to take part in this workshop late last year, and was supposed to stimulate discussion on a use case on preservation. There were 4 use cases:
Preservation
Access
Deposit Workflows and Environments
Online reputation and reporting
Maybe some of us didn’t do enough work on this, as somewhere along the line this morphed into 4 proposed action plans, each with a breakout group:
Identifier infrastructure
Citation Services
Repository handshake
Organisation Structures
The first 3 of those probably fit with the last 2 use cases, but I don’t know what happened to the others, or where the last one came from.
If you use Twitter, you can read (in reverse order) some of the backchat on this through the hashtags #repinf09 (for the whole workshop), #rihs09 (for the handshake group) and #reporg09 (for the organization group). If you do have a look, bear in mind that what you read is devoid of context, and represents only a small part of what was going on, probably from a vocal, cantankerous, ornery, rumbustious and maybe just plain rude subset! (As a new Twit, I haven’t yet worked out how much ruder tweeting during a meeting is, compared with passing notes in class, say.) Oh, and not all the Twits were actually there.
So what happened? The introductory presentation was given by Norbert Lossau of DRIVER. Lots of interesting stuff, but I was a bit taken aback by claims that the defining vision of repositories was the Berlin Open Access Declaration, and that data were out of scope. 
Personally I think Open Access is most often a Good Thing, but sometimes it’s inappropriate: repositories of sensitive data, for instance. And I’m not at all sure that you can cleanly distinguish document from data, or that it makes sense to do so, especially when supplementary materials and extended documents come into the frame.
I was in the organisation group, so I don’t know in detail what happened in the other groups, but we did come together right at the end for a plenary. I was particularly impressed by what Les Carr said about the Citation Group, including the idea of creating a test corpus, and also a competition for the best text mining algorithms to find citations and references (and even their surrounding context: “we illustrate the errors in Rusbridge (2006)” is clearly different from “following the excellent suggestions given in Carr (2009)”). I don’t think much was said about the citation microformat (perhaps it really is dead?), nor about whether text mining might be aided by embedding RDF etc in documents. Nevertheless, a sensible plan of work was laid out.
On identifiers, it also sounded as if some sensible progress was made. In particular, it seemed as if identifiers for authors (as disambiguation tools for citations etc) went from being ruled out to being a strong part of the work plan. They had the ultimately cool mindmap.
From the repository handshake tweets, it looked like they had a bumpy ride getting their 8 use cases agreed, but did get through their “stormin’, normin’ and formin’” stages into at least the first part of “performin’”. I did like “beg” as a repository verb!
So to the repository organisation group. Here too, there were plenty of storms, and maybe the odd teacup. It was quite hard to work out what sort of organization we were interested in; what it was intended to do. There was a strong feeling of unspoken sub-text; whatever it was we were talking about was proposed to launch on October 7. 
One of the best quotes for me came from Sandy Payette; she said the DSpace/FEDORA organizations could provide some of the things that were being talked about, but they would have to “feel the hunger”. It was difficult to feel the hunger for whatever this thing was. At one point, I’m afraid I compared it unfavourably with jelly; whenever I thought I had hold of it, something bulged out somewhere else (yucky image, sorry).
On the start of day 2 our organisers had a new approach, which worked better. They broke the breakout into yet smaller breakouts, and gave each of us a role to play and 6 questions to answer. My table was “funders”. That’s OK, I’ve been one, I could handle that (even funded repository work back in 1998 or thereabouts, think Harnad’s CogPrints and Krichel’s WoPEc/RePEc, which made it particularly annoying to be patronized at one point as someone who obviously didn’t know much about repositories). I think the two main “expectations” for our group were “clarity of aims” and “benefits to justify investment”. Anyway, the result of all this was that, by the skin of our teeth, we did have something to say at the final plenary. Not that I yet know quite what this organization would be for!
Was it entirely coincidental in this context that the DRIVER project was ending in a few months? Well, we had heard that DRIVER did repository infrastructure, that DRIVER did published papers, not data, that DRIVER Guidelines enabled interoperability, that DRIVER Guidelines were in demand internationally and had been translated into (2? 3?) other languages, and that the continuation of the DRIVER brand was seen as important. Some Twit asked “If the DRIVER Guidelines are the answer, what is the question?” Join the dots…
The workshop was ended with some closing remarks from Cliff Lynch, as perspicacious as ever, but slightly more hoarse than usual (not from shouting at us; too much inter-planetary travel, I suspect). And after the workshop there was a “funders’ meeting”. 
Real funders, the kind with money. What went on there, I have no idea!
http://digitalcuration.blogspot.com/2009/03/international-repositories.html
I’ve been trying to write this post for ages, but the draft never seemed right, so this is starting all over again, blank sheet.
Many people have a bunch of stuff on media for which it’s hard to find a working drive, whether disk or tape, or punched card or paper tape, or… A feature of the response to my “12 files for Christmas” post was that those who responded have their interesting stuff locked away on such media. There may be other challenges to reading it, but the first is getting the content off the media.
We tend to go all gloom and doom about this. I’ve got stuff on Iomega Jaz cassettes from an earlier Mac, so without a Jaz drive I can’t read it. Might as well chuck it in a bin? Folks have stuff on early Mac 3.5” drives, or Amiga drives, neither of which can be easily read on current systems. And 5.25” or 8” drives are even scarcer in working condition.
Is the only answer to find a working drive on a working computer of the day, the “technology preservation” approach? I remember an ancient engineer scolding me years ago for failure of the imagination, on the subject of disappearing 7-track magnetic tape drives. “Young man” he gruffed, “if you really care about this stuff, lay the tape on your kitchen table, cover it with paper, scatter iron filings across the top, give it a tap, and read the bits off with a magnifying glass!” He didn’t mention that I had to know the relationship between domains and bits, the parity and other features of character encodings, the block structure of the tapes etc; in those days we knew that stuff!
Now I’ve done the maths, and this is NOT feasible with a 3.5”, 2 MByte floppy disk (the iron filings are too big)! It might just be feasible with some of the 8” floppy formats. Different technological approaches (not iron filings, but some other means of making magnetic domains visible, if such means exist) might be feasible for higher densities. In any such case, you would end up with an image of the bits on the disk, in concentric tracks. From here, you have a computational task, or a series of such tasks: identify the tracks, separate into sectors, decode into bytes or characters, decode into directories and file structures, process into files, and now you have something to operate on! Yes, it’s tough to do all that, but you would be able to combine lots of contributions together to do it.
Now, I’m NOT saying that’s the best approach. I AM saying, 8” drives were advanced technology when introduced more than 20 years ago. The requirements for a production 8” drive included high read/write performance (for the time). The requirements now have changed. Performance isn’t the issue; scraping every last reliable bit off that disk is!
Today, more than 20 years later, storage engineers in their clean-room high tech environments can build amazingly high performance production drives with previously inconceivable capacities and speeds. But what could you do today with a Masters-level Electronic Engineering lab, some bright students, a few hundred dollars, these ancient media formats, and a much-reduced performance requirement? I don’t care if it takes 10 minutes to read my disk, as long as I can do it!
Is this important? These disks have been stashed away for 20 years, who cares what’s on them? Well, in many cases no-one does. But just think who was using those early drives, and what for. They certainly include authors, poets, scientists, scholars, politicians, philosophers… and many of those people, if not in the first flush of youth then, are moving towards retirement now. Some of these will be candidates for approaches from libraries and archives interested in their “papers”. Previously this meant boxes of paper, photos, diaries etc. Now it includes old media, dropped in the box years ago. Who knows what treasures they may contain? (See the Digital Lives project for examples.)
So, I think there is or will be an emerging interest in these obsolete media and their contents. And at the same time, I think (hope) it would represent an interesting challenge to set students. Perhaps not quite in the same class as building a car to drive across the country on solar power, or robots to play football, but interesting in its own different way.
One of those combined Computing Science and Electronic Engineering schools would be perfect. Would a prize help? Maybe this could factor in something like the Digital Preservation Awards one year? A new kind of Digital Preservation Challenge?
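To give a flavour of just one of the computational steps mentioned earlier in this post (decoding the recovered bits into bytes), here is a minimal Python sketch. It rests on assumptions that are mine, not part of the original argument: that the disk used FM (single-density) recording, in which every data bit is preceded by a clock pulse, and that bits arrive most-significant first. The names `fm_decode` and `fm_encode` are invented for this illustration, and real formats also involve address marks, sector headers and checksums, none of which is handled here.

```python
def fm_decode(raw_bits):
    """Decode an FM-style bit stream (clock, data, clock, data, ...) into bytes.

    raw_bits is a sequence of 0/1 values as they might be read off one track.
    Assumption: every clock cell is 1 and data bits are most-significant first.
    """
    if len(raw_bits) % 16 != 0:
        raise ValueError("expected 16 cells (8 clock + 8 data) per byte")
    out = bytearray()
    for start in range(0, len(raw_bits), 16):
        byte = 0
        for i in range(0, 16, 2):
            clock = raw_bits[start + i]
            data = raw_bits[start + i + 1]
            if clock != 1:
                # A missing clock pulse means noise or a special mark.
                raise ValueError(f"missing clock pulse at cell {start + i}")
            byte = (byte << 1) | data
        out.append(byte)
    return bytes(out)


def fm_encode(payload):
    """Inverse of fm_decode; used here only to build sample input."""
    bits = []
    for byte in payload:
        for shift in range(7, -1, -1):
            bits.extend((1, (byte >> shift) & 1))
    return bits


if __name__ == "__main__":
    cells = fm_encode(b"12 files")
    print(fm_decode(cells))
```

Raising an error on a missing clock pulse, rather than guessing, fits the goal stated above: speed does not matter, but knowing which bits are reliable does.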
http://digitalcuration.blogspot.com/2009/03/obsolete-drives-sideways-thinking.html
Brian Kelly posted “Twitter Can Pimp Up Your Stuff – But Should It?” a while ago. This post has caused me to think about courtesy and good practice. The aspect I want to talk about is Brian’s reporting of a conversation which took place on Twitter. I’m writing this to make a general point, not as a personal criticism of Brian, who has well-established credentials as an experimenter with these technologies and who I know, from talking to him directly, is interested in these issues.
The point is this: I tend to think that the quoting of Twitter exchanges in blog posts is something to be done sparingly, if at all, and has all kinds of potential for misunderstanding. I think there is some evidence of this occurring in Brian’s post.
Twitter has a very basic model for threaded discussions and this is not honoured by many clients (many users I follow clearly use more than one client, as do I). Importantly, as a user, you do not necessarily know who else is involved in the exchange – other users may be responding to remarks made by still more users about whom you are unaware.
When a Twitter dialogue is presented out of the context of Twitter, this is potentially misleading. The conversation which the person posting the dialogue reports is not the same as the dialogue which each individual contributor has participated in. And, importantly, they may be responding to a point which has been made but which the blogger, and the readers, never see. We are reporting our version of a conversation conducted in a crowded pub, involving people we half know and people we don’t know at all.
In the comments, Paul Boag says:
I think the problem is that because twitter is new, we all see it as playing a different role. You could argue twitter is a place for sharing personal experience. You could argue it is the place where you post ‘what you are doing’ (the original intention). Or you could argue it is a place to have a dialogue with your followers. All are valid as are many other uses. Ultimately it comes down to personal choice.
Quite so. Paul goes on to say:
People should use twitter as they want and others should stop criticising them for it. If they don’t like it they should stop following.
This is in the context of a response to a tweet by someone else, reported in the main body of his post. I don’t imagine the person who made this short comment to their network on Twitter did so in the expectation that their words might be used in this way. Now that people in my particular peer-network realise that anything they say in that wonderful, peculiar space that is Twitter might be lifted and repeated, very likely out of context, on a blog, I wonder if this will gradually stifle the free-flowing, relaxed conversations which spring up there. It is almost inevitable that people would be misrepresented in this way. Going back to that pub, would we, for example, speak as freely if someone was standing there with an audio recorder, waiting for something juicy to copy and paste into their podcast? Going back to Paul’s comment, I think he misunderstood the reported remark – but the lifting out of context has probably been partly responsible for this, and has left no obvious channel for a response.
‘Exchanges’ of Tweets can start and finish in a matter of minutes. I tend to take time over a blog post, marshalling arguments, checking references, re-reading for tone to avoid offending people unnecessarily etc. I take much less care with tweets, because I expect them to be taken much less seriously. Now I do understand that tweets are, for the most part, on public record, on the Web. I think this is mostly beside the point. It’s how we use the thing which counts, and how we expect it to be used. I think a Twitter which consisted only of quotable statements of verifiable fact or carefully thought through arguments would become, in Brian’s own words:
a sterile environment [which] could well lead to a killing of the golden goose
People may not expect their Tweets to be taken as seriously as something they might write in a blog. I certainly don’t. Of course, we know that most of our Twitter output is public – that’s part of the point of it. Many conversations happen in near-real-time: this gives Twitter a dynamic ‘edge’, where people can respond to topics with an off-the-cuff response. Of course not every Twitter exchange is like this – the point is that how seriously one’s Tweets will be taken is difficult to anticipate, and the expectation should, I suggest, default to ‘not necessarily’. Clearly, we have different levels of discourse. We probably wouldn’t want to quote ‘tweets’ in academic papers… would we? And yet I wouldn’t hesitate to quote a blog post in a paper.
There are some who do use Twitter as a micro-blogging platform. For example, Paul Boag is a highly successful broadcaster, with many followers in several media-spaces (Twitter, blog, podcast). Gaining and expanding an audience is important to him, as it is for many people. (Paul has even written a guide to broadcasting and responding to followers in Twitter). However, I suggest, tentatively, that a predilection for gaining followers obscures the fact that others don’t really think this way, and value Twitter for very different reasons. Where one person welcomes any exposure on any platform, another might be disconcerted by suddenly finding their throw-away 140 characters appearing on someone’s blog.
I note today that Brian has used a set of Twitter exchanges on another post. I would suggest that at least one of the tweets featured there was not something the author would necessarily have wanted to be broadcast more widely.
As a matter of courtesy I believe people should consider carefully before quoting tweets in their blog. I hope it doesn’t become common practice for bloggers to treat Twitter as a cheap and easy source of (sometimes provocative) material.
Again, I want to make it clear that this should not be viewed as a personal criticism of Brian, or his blog. It is only by doing these things that such issues can be revealed and discussed. However, we have to be able to realise what doesn’t work, and to recognise the possible consequences of the practices we necessarily are evolving through trial and error.
http://blog.paulwalk.net/2009/03/06/anything-you-quote-from-twitter-is-always-out-of-context/
John Coggins, VP for Life Sciences from Glasgow on researcher’s perspective. Wants to prompt us to think about “culture change” and resources.
Volume & complexity of digital is rapidly increasing. In many fields data are rarely used outside lab/dept. Data management skills under-developed, research data often unstructured & poorly documented & inaccessible. Represents huge un-tapped resource. Never even fully analysed by researchers who generate it. Could be much added value from data mining & combining datasets.
(CR from break conversation with James Currall: need lots of additional tacit context knowledge to understand properly.) Most data stored locally and often lost. Most researchers say they are willing to share data, usually through informal peer exchange networks (CR: important distinction!). Some of those who deposit data regularly do so because journals won’t publish unless they do (CR: Bermuda agreement in bio-sciences; Alma Swan says compliance lower than expected).
Funders want to protect & enhance their investment by making data widely available. Must be agreement among researchers on quality and format. Slow to achieve, cf story about Protein Data Bank. Story about surplus computers from particle physicists in the 1970s, playing with 7 protein structures; lots of subsequent arguments about the best way to store this stuff in those primitive computers; now 36 years on, 44,000 structures available, huge investment since then (eg early stuff re-examined in light of later advances). Essential to invest in data management, curation & storage as well as for easy access. Needs international collaboration & substantial funding for permanent delivery, eg EBI.
Researchers want confidence their data will be permanently stored & accessible; that costs will be met; be able to access other people’s data, preferably on world-wide basis; and training.
Complications: commercially sensitive data, personal data, eg patient data, etc.
All Universities should have access to a UK-wide service. FCs and RCs likely to provide funding (goodwill but not yet much in £ notes). What about private sector? Charities? Other government departments like DEFRA (no-one from DEFRA in the room). These departments not celebrated for willingness to commit resources to research sector.
Way forward: research community would welcome UK-wide approach. Need more consistency, will take time. Researchers need to improve their skills for managing & using data. Significant building blocks in place. International dimension important. Culture change and mindset critical and must be gradually changed, needs investment in training & infrastructure, & spreading best practice. This is the way forward but won’t be easy.
David Lynn from Wellcome Trust (independent charity interested in biomedical research) for a funder’s perspective. (CR: is bio-science TOO good an example; poster child from which wrong inferences can be drawn?) Talking about size & rate of data, about integrating large datasets to gain insights into complex systems (CR: via Google Earth!!! Reminds me of NERC meeting last week again!) Now on immense potential to link research papers and data (CR: yeah!). Also growth in traffic of researchers accessing data from others (via web traffic on Sanger Institute site).
Meeting challenges: infrastructure: key data resources need coordinated and long-term sustainable funding. ELIXIR programme aiming to build sustainable infrastructure across Europe. Technical and cultural challenges: coordination & advocacy from key communities (funders, institutions, publishers etc). Data security challenge, important for some parts of bio-medical community. Recent high profile incidents in UK. Recent reports from Academy of Medical Sciences, Council of Sci & tech, Thomas/Walport review, US Inst of Medicine. Calls for safe havens & codes of practice. So management and governance of such data will be a key concern to retain public confidence. Mentions Bermuda Principles (1996) & Fort Lauderdale principles (2003); need to involve researchers in all such discussions.
Agrees with coordinated approach to preserving key research data & ensure long-term value; must meet needs of researchers & funders. Devil is in the detail; must develop in way that truly adds value; must link effectively with existing activity & expertise; will need buy-in of ALL major funders & stakeholders; will need to accommodate differences in approach between funders & disciplines; & must appropriately resource initiative.
Would like clarification of detail on Pathfinder study, but like to see it go forward; full specification, but with careful assessment of funding implications.
Q&A session. Malcolm Atkinson again. PDB took place initially at national lab at Brookhaven; perhaps the researchers had a bit of time. Those who put together the data for re-use don’t get an immediate reward. Those who get the immediate reward don’t pay attention to the long term but very focused on immediate need. PhD skills means PhD courses take longer. Where does the extra resource come from to invest in this change in researchers, and how to invest in incentive (promotion, RAE etc) to make sure this happens? John Coggins: it’s going to be hard. Was working at Brookhaven at the time of PDB, did manage to publish wee paper, but true, none of those involved are now household names. Focus on VFM from government doesn’t help as too short term. David Lynn: 3 ways to recognise work: institutions themselves think about their own promotion criteria, FCs think about REF and consider metrics for such activities other than direct research; research funders now requiring data management plans.
Michael Jubb from RIN: Tension between disciplinary difference versus one size fits all. Coggins: consistency means common attitude that it’s worth sharing data and accessible; fierce arguments on data structures etc; seems to be blaming arts & humanities for being poorly funded and behind the digital times (CR: largely not fair!). Lynn: engage with researchers not in the abstract.
Bryan Lawrence again: no new money just old money re-purposed. Always tension between research, teaching & data, and only 1 pot for all of it, so where decision point lands will differ.
Peter Burnhill, EDINA: statement makers versus artisans? Observing what’s happening, can assume the internet is always on. Need to think about what it means to publish data which includes archival issues. Not just take away, but software access. Variety out there, will be drivers, researchers are driven. What constitutes well-published data including some notion of designated community to which you are publishing. Lynn: funders are increasing expectation on what researchers do in this respect, but can only drive this so far. Wellcome want researchers to tell us in DM plans what their approach will be to curating the data, making it available (including when); this DM plan is then peer-reviewed (larger grants, population studies etc) as part of the selection process. Coggins: well run labs with careful research systems means data better documented. Burnhill: provenance an issue, yes. Need to understand the drivers for someone wanting to re-analyse someone else’s data.
Kevin Schurer again: reflection on Qs so far. Making fruits of research available not a new conundrum, RS celebrating 350 years. The way this is done is very segmented: researcher, publisher, copy-editor; very segmented process. Problem with data (varies between disciplines): PI often the data generator, publisher & often distributor as well. UKRDS provides possibility to separate those things out again. Agreement from panel.
http://digitalcuration.blogspot.com/2009/02/ukrds-conference-2.html

