Archiving Web Sites - Provide

The primary purpose of most digital collections is to give users access to the information they contain. To provide access to your content, you must decide how the content will be made accessible and who will be allowed to access it. To serve your users well, you need to understand their needs and their preferred methods of accessing the content. You will also have to take measures to ensure that the content you provide complies with copyright and digital rights management laws.

Explore Web Archive Access Tools

Wayback is an open-source Java implementation of the Internet Archive's Wayback Machine. It can be used to provide access to web pages that have been stored in WARC format.
  • Mignify Web Data Extractor. Internet Memory Research. http://mignify.com/
    Users "provide a set of reference pages specifying the format of the data to be extracted (from Price information to Description, copyright or any needed information)," and Mignify Web Data Extractor then crawls sources, extracts "your desired data with help of your reference pages," converts "unstructured to structured data for efficient analysis," and moves "data to your desired location."
  • NutchWAX. http://archive-access.sourceforge.net/projects/nutch/
    "NutchWAX ('Nutch + Web Archive eXtensions') searches web archive collections. The Web Archive eXtensions (WAX) include adaptation of the Nutch fetcher step to go against web archives rather than crawl the open net -- adaptation currently does Internet Archive ARC files only -- and plugins to add extra fields to the index that return an Archive Records' location in the repository, its collection name, etc."
  • WayBack - Internet Archive. http://archive-access.sourceforge.net/projects/wayback/
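
To give a sense of what access tools such as Wayback read, here is a minimal sketch of parsing a single uncompressed WARC record using only the Python standard library. It is an illustration under stated assumptions, not how Wayback itself is implemented: the function name `parse_warc_record` and the sample record are hypothetical, and real archives are usually gzip-compressed and hold many records per file.

```python
def parse_warc_record(raw: bytes):
    """Split one uncompressed WARC record into (version, headers, payload).

    Assumes the record starts with a version line, followed by
    "Name: value" header lines, a blank line, and then a payload of
    exactly Content-Length bytes.
    """
    head, _, rest = raw.partition(b"\r\n\r\n")
    lines = head.decode("utf-8").split("\r\n")
    version = lines[0]  # e.g. "WARC/1.0"
    headers = {}
    for line in lines[1:]:
        # Split on the first colon only; values such as URIs contain colons.
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    length = int(headers["Content-Length"])
    payload = rest[:length]
    return version, headers, payload


# Hypothetical sample record, built by hand for illustration only.
body = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html>hello</html>"
sample = (
    b"WARC/1.0\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Target-URI: http://example.org/\r\n"
    b"WARC-Date: 2012-03-01T12:00:00Z\r\n"
    b"Content-Length: " + str(len(body)).encode("ascii") + b"\r\n"
    b"\r\n" + body + b"\r\n\r\n"
)

version, headers, payload = parse_warc_record(sample)
print(version)
print(headers["WARC-Target-URI"], headers["WARC-Date"])
```

The target URI and capture date recovered here are the kind of fields an access tool needs in order to look up an archived page by address and time.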

Read

  • Aschenbrenner, Andreas, and Andreas Rauber. "Mining Web Collections." In Web Archiving, edited by Julien Masanès, 153-76. New York, NY: Springer, 2006.
    "It is the ambition of this chapter to highlight the intricate interrelations between Web archive construction, usage, and preservation, to illustrate the myriad of issues involved in Web archive usage, and to convey the importance of planning and organisation of Web archives with respect to their later usage."
  • Costa, Miguel, and Mário J. Silva. "Understanding the Needs of Web Archive Users." In Proceedings of the 10th International Web Archiving Workshop (IWAW 2010), Vienna, Austria, September 22-23, 2010, edited by Julien Masanès, Andreas Rauber and Marc Spaniol, 9-16, 2010. http://arquivo-web.fccn.pt/sobre-o-arquivo/understanding-the-information-needs-of-web-archive
    "A complete characterization of web archive users must respond to three questions: why, what and how do users search? This study focuses on the first two: what are the user intents and which topics are most interesting to them? Answers to these questions are essential for guiding the development of web archives towards better user satisfaction. We used three instruments to collect quantitative and qualitative data, namely, search logs, an online questionnaire and a laboratory study. The obtained results are coincident. Users perform mostly navigational searches and do not restrict searches by date. Other findings show that users prefer full-text over URL search and the oldest documents over the newest. We discuss all these findings and their implications in the design of search engines for web archives."
  • LAWA (Longitudinal Analytics of Web Archive data). http://www.lawa-project.eu/
    LAWA aims to develop "a sustainable infra-structure, scalable methods, and easily usable software tools for aggregating, querying, and analyzing heterogeneous data at Internet scale. Particular emphasis will be given to longitudinal data analysis along the time dimension for Web data that has been crawled over extended time periods."
  • Meyer, Eric T., Arthur Thomas, and Ralph Schroeder. "Web Archives: The Future(s)." Oxford Internet Institute and International Internet Preservation Consortium, 2011. http://ssrn.com/abstract=1830025
    "In this report, the authors consider the possible future uses of web archives. This report is structured first, to engage in some speculative thought about the possible futures of the web as an exercise in prompting us to think about what we need to do now in order to make sure that we can reliably and fruitfully use archives of the web in the future. Next, we turn to considering the methods and tools being used to research the live web, as a pointer to the types of things that can be developed to help understand the archived web. Then, we turn to a series of topics and questions that researchers want or may want to address using the archived web. In this final section, we tentatively identify some of the short, medium and long term challenges individuals, organizations, and international bodies can target to increase our ability to explore these topics and answer these questions."
  • Niu, Jinfang. "Functionalities of Web Archives." D-Lib Magazine 18, no. 3/4 (2012). http://mirror.dlib.org/dlib/march12/niu/03niu2.html
    "The functionalities that are important to the users of web archives range from basic searching and browsing to advanced personalized and customized services, data mining, and website reconstruction. The author examined ten of the most established English language web archives to determine which functionalities each of the archives supported, and how they compared. A functionality checklist was designed, based on use cases created by the International Internet Preservation Consortium (IIPC), and the findings of two related user studies. The functionality review was conducted, along with a comprehensive literature review of web archiving methods, in preparation for the development of a web archiving course for Library and Information School students. This paper describes the functionalities used in the checklist, the extent to which those functionalities are implemented by the various archives, and discusses the author's findings."
  • Rauber, Andreas, Andreas Aschenbrenner, Oliver Witvoet, Robert M. Bruckner, and Max Kaiser. "Uncovering Information Hidden in Web Archives: A Glimpse at Web Analysis Building on Data Warehouses." D-Lib Magazine 8, no. 12 (2002). http://www.dlib.org/dlib/december02/rauber/12rauber.html
    "The Internet has turned into an important aspect of our information infrastructure and society, with the Web forming part of our cultural heritage. Several initiatives thus set out to preserve it for the future. The resulting Web archives are by no means only a collection of historic Web pages. They hold a wealth of information that waits to be exploited, information that may be substantial to a variety of disciplines. With the time-line and metadata available in such a Web archive, additional analyses that go beyond mere information exploration become possible. In the context of the Austrian On-Line Archive (AOLA), we established a Data Warehouse as a key to this information. The Data Warehouse makes it possible to analyze a variety of characteristics of the Web in a flexible and interactive manner using on-line analytical processing (OLAP) techniques. Specifically, technological aspects such as operating systems and Web servers used, the variety of file types, forms or scripting languages encountered, as well as the link structure within domains, may be used to infer characteristics of technology maturation and impact on community structures."
  • Rosenthal, David S. H., Thomas Lipkis, Thomas S. Robertson, and Seth Morabito. "Transparent Format Migration of Preserved Web Content." D-Lib Magazine 11, no. 1 (2005). http://www.dlib.org/dlib/january05/rosenthal/01rosenthal.html
    "The LOCKSS digital preservation system collects content by crawling the web and preserves it in the format supplied by the publisher. Eventually, browsers will no longer understand that format. A process called format migration converts it to a newer format that the browsers do understand. The LOCKSS program has designed and tested an initial implementation of format migration for Web content that is transparent to readers, building on the content negotiation capabilities of HTTP."
  • Schneider, Steven M., and Kirsten A. Foot. "The Web as an Object of Study." New Media & Society 6, no. 1 (2004): 114-22.
    "We identify three sets of approaches that have been employed in web-related research over the last decade. These approaches are not necessarily mutually exclusive, and some studies cited below employed more than one approach. Distinguishing between these approaches helps to establish the trajectory of web studies; highlighting the strengths and weaknesses of each focuses attention on the methodological challenges that are associated with the field of web studies."