SCAPE: Large Scale Image Migration

Authors: Martin Schaller and Sven Schlarb

In the SCAPE Project, the Austrian National Library is currently working on practical application scenarios for the tools and solutions developed within the project. One of these application scenarios is the migration of a large image collection from one format to another. There are many reasons why such a scenario may be relevant in a digital library. On the one hand, conversion from an uncompressed to a compressed file format can significantly reduce storage costs. On the other hand, particularly from a long-term perspective, file formats may be in danger of becoming obsolete, which means that institutions must be able to undo the conversion and return to the original file format. In this case a quality-assured process is essential: it must allow the original file instances to be reconstructed and, above all, it must determine when the original uncompressed files can safely be deleted, since only then is the storage saving actually realised.

Based on these assumptions we have developed the following use case: uncompressed TIFF image files are converted into compressed JPEG2000 files, and the quality of the conversion is assured by a pixel-for-pixel comparison between the original and the converted image. For this, a sequential Taverna concept workflow was developed first and then remodelled as a scalable procedure using different tools developed in the SCAPE Project.

The Taverna Concept Workflow

The workflow input is a text file containing the paths of the TIFF files to be converted. This text file is transformed into a list so that the files are converted one after another, thereby simulating a non-scalable process.

Before the actual migration commences, the validity of each TIFF file is checked. This step uses FITS, a wrapper that runs several tools to extract identification information about a file. Since the output of FITS is an XML-based validation report, an XPath service extracts and checks the validity information.
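
A minimal sketch of this step in Python, assuming FITS is installed and callable as fits.sh and that the validity flag sits in a filestatus/valid element of the FITS XML report (both assumptions; the concept workflow uses Taverna's XPath service for this):

    import subprocess
    import xml.etree.ElementTree as ET

    # Namespace of the FITS output schema (an assumption; check the actual report).
    FITS_NS = {"fits": "http://hul.harvard.edu/ois/xml/ns/fits/fits_output"}

    def tiff_is_valid(tiff_path):
        """Run FITS on a TIFF file and extract the validity flag via XPath."""
        # 'fits.sh -i <file>' prints the FITS XML report to stdout (assumed invocation).
        report = subprocess.run(["fits.sh", "-i", tiff_path],
                                capture_output=True, text=True, check=True).stdout
        root = ET.fromstring(report)
        valid = root.find(".//fits:filestatus/fits:valid", FITS_NS)
        return valid is not None and valid.text.strip().lower() == "true"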

If the file is valid, the migration from TIFF to JPEG2000 can begin. The tool used in this step is OpenJPEG 2.0. In order to verify the output, Jpylyzer – a validator and feature extractor for JPEG2000 images created within the SCAPE Project – is employed. Again, an XPath service is used to extract the validity information. This step concludes the file format conversion itself, but in order to ensure that the migrated file is indeed a valid surrogate, it is converted back into a TIFF file, again using OpenJPEG 2.0. Finally, the reconverted and the original TIFF files are compared pixel for pixel using ImageMagick. Only if this final step succeeds can we be sure that the migrated file is valid and that a complete reconversion is possible.
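
The remaining per-file steps can be sketched as a chain of command-line calls driven from Python. The opj_compress, opj_decompress, jpylyzer and ImageMagick compare invocations below are plausible assumptions rather than the exact Taverna components used in the workflow, and the name of Jpylyzer's validity element differs between versions:

    import subprocess
    import xml.etree.ElementTree as ET

    def migrate_and_verify(tiff_path, jp2_path, roundtrip_tiff_path):
        """TIFF -> JPEG2000 -> TIFF with Jpylyzer validation and pixel-wise comparison."""
        # 1. Convert TIFF to JPEG2000 with OpenJPEG 2.0 (flags are assumed).
        subprocess.run(["opj_compress", "-i", tiff_path, "-o", jp2_path], check=True)

        # 2. Validate the JPEG2000 file with Jpylyzer, which prints an XML report;
        #    older versions use an <isValidJP2> element, newer ones <isValid>.
        report = subprocess.run(["jpylyzer", jp2_path],
                                capture_output=True, text=True, check=True).stdout
        root = ET.fromstring(report)
        if not any(el.text and el.text.strip() == "True"
                   for el in root.iter()
                   if el.tag.endswith("isValidJP2") or el.tag.endswith("isValid")):
            return False

        # 3. Reconvert the JPEG2000 file back to TIFF (flags again assumed).
        subprocess.run(["opj_decompress", "-i", jp2_path, "-o", roundtrip_tiff_path],
                       check=True)

        # 4. Pixel-wise comparison with ImageMagick: the AE metric counts differing
        #    pixels, so a result of "0" means the images are identical.
        cmp = subprocess.run(["compare", "-metric", "AE",
                              tiff_path, roundtrip_tiff_path, "null:"],
                             capture_output=True, text=True)
        return cmp.stderr.strip() == "0"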

Figure 1 (above): Taverna concept workflow

In order to identify how much time was consumed by each element of this workflow, we ran a test consisting of the migration of 1,000 files. Executing the described workflow on the 1,000 image files took about 13 hours and 5 minutes. Rather unsurprisingly, conversion and reconversion of the files took the longest: the conversion to JPEG2000 took 313 minutes and the reconversion 322 minutes. FITS validation needed 70 minutes and the pixel-wise comparison finished in 62 minutes. The SCAPE-developed tool Jpylyzer required only 18 minutes and was thus much faster than the steps mentioned above.

Figure 2 (above): execution times of each step of the concept workflow

Making the Workflow Scale

The foundation for the scalability of the described use case is a Hadoop cluster with one name node and five data nodes (see the specification below).

Besides its economic advantages – Hadoop runs on commodity hardware – the framework is also designed with hardware failure in mind, which reduces the problems associated with node crashes.

The distribution of work across the cluster is implemented via MapReduce jobs. In the map phase the input is split: if a large text file is to be processed, for example, it is divided into several parts and each part is processed on a different node. The reduce phase then aggregates the outputs of the individual nodes into a single result.
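
The following Python sketch illustrates this division of labour in the style of Hadoop Streaming, where a mapper and a reducer each read from standard input. It is a simplification for illustration only; the SCAPE workflow uses Pig and ToMaR rather than hand-written streaming jobs, and the OK/FAIL status lines are made up for the example:

    import sys

    def mapper():
        """Each map task receives a slice of the input file list on stdin and
        emits one tab-separated '<path> OK|FAIL' line per image."""
        for line in sys.stdin:
            path = line.strip()
            if path:
                status = "OK"  # placeholder for the per-file migration and checks
                print(f"{path}\t{status}")

    def reducer():
        """The reduce task receives the sorted mapper output and aggregates it
        into a single count per status."""
        counts = {}
        for line in sys.stdin:
            _, status = line.rstrip("\n").split("\t", 1)
            counts[status] = counts.get(status, 0) + 1
        for status, count in sorted(counts.items()):
            print(f"{status}\t{count}")

    if __name__ == "__main__":
        mapper() if sys.argv[1:] == ["map"] else reducer()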

But writing MapReduce jobs is a complex matter. For this reason, Apache Pig is used. Pig was built for Hadoop and translates scripts written in a language called “Pig Latin” into MapReduce jobs, thus making the handling of MapReduce much easier or, as Professor Jimmy Lin described the tool during the ‘Hadoop-driven digital preservation Hackathon’ in Vienna, easy enough “… for lazy pigs aiming for hassle-free MapReduce.”

HDFS, Hadoop MapReduce and Apache Pig form the scalable foundation on which the SCAPE tools ToMaR and the XPath Service are built.

ToMaR wraps command-line tasks as Hadoop MapReduce jobs so that they can run in parallel; in our case these tasks are the invocations of FITS, OpenJPEG 2.0, Jpylyzer and ImageMagick. As a result, the tools can be executed simultaneously on several nodes.
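
ToMaR itself is driven by tool specification documents, but the underlying idea can be illustrated with a simple streaming-style mapper that shells out to a command-line tool once per input record, so that Hadoop can run many such mappers in parallel. The opj_compress call and the .jp2 naming scheme below are assumptions made for the sake of the example:

    import subprocess
    import sys

    def migration_mapper():
        """Wrap a command-line tool in a map task: each input line is a TIFF
        path, and the tool is invoked once per record."""
        for line in sys.stdin:
            tiff_path = line.strip()
            if not tiff_path:
                continue
            jp2_path = tiff_path.rsplit(".", 1)[0] + ".jp2"  # assumed naming scheme
            result = subprocess.run(["opj_compress", "-i", tiff_path, "-o", jp2_path],
                                    capture_output=True, text=True)
            status = "OK" if result.returncode == 0 else "FAIL"
            print(f"{tiff_path}\t{status}")

    if __name__ == "__main__":
        migration_mapper()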

This has a great impact on execution times as Figure 3 (below) shows. The blue line represents the non-scalable Taverna workflow. It is clearly observable how the time needed for file migration increases in proportion to the number of files that are converted. The scalable workflow, represented by the red line, shows a much smaller increase in time needed, thus suggesting that scalability has been achieved. This means that, by choosing the appropriate size for the cluster, it is possible to migrate a certain number of image files within a given time frame.

Figure 3 (above): Wall-clock times of the concept workflow and the scalable workflow

Below is the specification of the Hadoop cluster. The master node runs the jobtracker and the namenode/secondary namenode daemons; each worker node runs a tasktracker and a datanode daemon.

Master node: Dell Poweredge R510

  • CPU: 2 x quad-core Xeon E5620 @ 2.40 GHz (16 hyper-threading cores)
  • RAM: 24 GB
  • NIC: 2 x GBit Ethernet (1 used)
  • Disk: 3 x 1 TB disks, configured as RAID5 (redundancy); 2 TB effective disk space

Worker nodes: Dell Poweredge R310

  • CPU: 1 x quad-core Xeon X3440 @ 2.53 GHz (8 hyper-threading cores)
  • RAM: 16 GB
  • NIC: 2 x GBit Ethernet (1 used)
  • Disk: 2 x 1 TB disks, configured as RAID0 (performance); 2 TB effective disk space

However, the throughput we can reach with this cluster and Pig/Hadoop job configuration is limited. As Figure 4 shows, the throughput (measured in gigabytes per hour, GB/h) grows rapidly as the number of files increases and then levels off at slightly more than 90 GB/h once more than 750 image files are processed.

Figure 4 (above): Throughput of the distributed execution, measured in gigabytes per hour (GB/h), against the number of files processed

As our use case shows, by using a variety of tools developed in the SCAPE Project together with the Hadoop framework, it is possible to distribute the processing across several machines, making large-scale image migration scalable and significantly reducing the time needed for data processing. In addition, the size of the cluster can be tailored to the size of the job so that it can be completed within a given time frame.


Joint APARSEN / SCAPE Satellite Event


Joint APARSEN / SCAPE Satellite Event: Long-term Accessibility of Digital Resources in Theory and Practice

Author: Manuela Holzmayer

The two projects APARSEN and SCAPE organised a joint satellite event on 21 May 2014, following the 3rd LIBER Digital Curation Workshop in Vienna.

Participants were invited to the baroque premises of Palais Mollard, part of the Austrian National Library, to listen to presentations given by partners from the three EU co-funded projects APARSEN, SCAPE and 4C.

Welcoming words were spoken by Max Kaiser from the Austrian National Library, who pointed out the importance of digital long-term preservation at the library. He also drew attention to the digitisation work carried out at the Austrian National Library, available for instance through ANNO (AustriaN Newspapers Online), a virtual collection of historical Austrian newspapers and magazines, and through the Bildarchiv Austria.

Sabine Schrimpf from the German National Library is involved in research activities in the economic and legal areas of the APARSEN project. She gave a short introduction to APARSEN and presented work being done within the project on the topic of “Digital Rights Management in the context of long-term preservation”. Key challenges in dealing with digital rights and DRM were explained and recommendations were made.

The full APARSEN report can be found here.

Ross King (AIT) gave a short introduction to the SCAPE project and provided details about the application of SCAPE results regarding the problem of scalable quality control in digitisation workflows. Ross explained how the accuracy and reliability of image quality assurance components is assessed by means of annotated files.

David Wang from SBA Research introduced the audience to the 4C project. He reported on an evaluation of current digital curation cost and benefit models, carried out to identify the needs of stakeholders within the context of a coordination action by the Collaboration to Clarify the Costs of Curation (4C). Gaps between the evaluated models were found and recommendations were presented on how to overcome them.

A list of the models evaluated can be found here.

Sven Schlarb (ONB) gave an overview of the different application scenarios at the Austrian National Library. In the web archiving area, he presented digital object processing workflows that determine the characteristics of files stored in web archive container files; in relation to the Austrian Books Online project, he explained how different outcomes of the SCAPE project can be used to support quality assurance in large digitisation projects.

Krešimir Đuretec (ifs) presented the Planning and Watch sub-project of the SCAPE project. He covered tools for collection profiling, monitoring and planning, and explained how these tools can be integrated with an organisation’s digital repositories to help with preservation policies.

Ruben Riestra from INMARK, an APARSEN partner, took the approach of “Answering Key Questions” in digital preservation, such as “Who should preserve?”, “What should be preserved?”, “When should preservation start?”, “How to preserve?”, “How much will it cost?” and “Who should pay?”. He pointed out that there is a market for digital preservation services. The APARSEN Virtual Centre of Excellence that is currently taking shape will act as a bridge between the supply and demand sides of digital preservation.

Participants were additionally invited to visit the Globe Museum, which is part of the Austrian National Library and situated in Palais Mollard as well. It is the only one of its kind worldwide.

The presentation slides are available here.

APARSEN – Alliance Permanent Access to the Records of Science in Europe Network – is a Network of Excellence that brings together an extremely diverse set of practitioner organisations and researchers in order to bring coherence, cohesion and continuity to research into barriers to the long-term accessibility and usability of digital information and data, exploiting our diversity by building a long-lived Virtual Centre of Digital Preservation Excellence. The objective of this project may be simply stated, namely to look across the excellent work in digital preservation which is carried out in Europe and to try to bring it together under a common vision.

SCAPE – Scalable Preservation Environments – develops scalable services for planning and execution of institutional preservation strategies on an open source platform that orchestrates semi-automated workflows for large-scale, heterogeneous collections of complex digital objects. SCAPE enhances the state of the art of digital preservation in three ways: by developing infrastructure and tools for scalable preservation actions; by providing a framework for automated, quality-assured preservation workflows; and by integrating these components with a policy-based preservation planning and watch system. These concrete project results are being validated within three large-scale Testbeds from diverse application areas.