Archiving Web Sites - Prepare

Q. How should I prepare to archive web sites?

Before you dive headfirst into collecting and preserving web sites, take some time to assess your current situation. Look at what web content you may already have collected and what content you are considering adding to your collection. Make sure you understand all aspects of web archiving, including the human resources, technology, and costs involved, before you begin. It is also a good idea to understand some of the history of the Web and the basics of how the Web works.

Take action

  • Identify the monetary, human, and technological resources you will need and take stock of what you already have available
  • Perform needs and resource assessments (a brief survey sketch follows this list)
  • Prepare clearly defined policies for all web archiving processes
  • Review use cases, watch videos, and read literature to gain a greater understanding of web archiving principles
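
A quick technical survey of candidate sites can feed into a needs and resource assessment. The following is a minimal sketch in Python: it checks each candidate URL for reachability and consults its robots.txt, two factors that affect crawl planning. The seed URLs are placeholders, and a real assessment would also weigh site size, content types, and update frequency.

    # Minimal sketch of a seed-list survey to support a needs assessment.
    # The seed URLs are placeholders; substitute your own candidate sites.
    import urllib.error
    import urllib.request
    import urllib.robotparser

    SEEDS = [
        "http://example.org/",
        "http://example.com/",
    ]

    def survey(url):
        """Report reachability and robots.txt status for one candidate site."""
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                status = resp.status
        except OSError as exc:
            return f"{url}: unreachable ({exc})"
        robots = urllib.robotparser.RobotFileParser(url.rstrip("/") + "/robots.txt")
        try:
            robots.read()
            allowed = robots.can_fetch("*", url)
        except OSError:
            allowed = True  # robots.txt unreadable; no explicit prohibition found
        return f"{url}: HTTP {status}, robots.txt permits crawling: {allowed}"

    if __name__ == "__main__":
        for seed in SEEDS:
            print(survey(seed))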

Review Examples of Web Archive Collections

  • Center for History and New Media and American Social History Project/Center for Media and Learning.  The September 11 Digital Archive.  http://911digitalarchive.org/ 

    The September 11 Digital Archive uses electronic media to collect, preserve, and present the history of September 11, 2001 and its aftermath. The Archive contains more than 150,000 digital items, a tally that includes more than 40,000 emails and other electronic communications, more than 40,000 first-hand stories, and more than 15,000 digital images.  In September 2003, the Library of Congress accepted the Archive into its collections, an event that both ensured the Archive's long-term preservation and marked the library's first major digital acquisition.

  • Federal Web Harvests. U.S. National Archives and Records Administration. http://webharvest.gov/collections/
    The National Archives and Records Administration (NARA) preserved a one-time snapshot of agency public web sites as they existed on or before January 20, 2001, as an archival record in the National Archives of the United States. NARA also conducted a harvest (i.e., capture) of Federal Agency public web sites in 2004 and of Congressional web sites in 2006, 2008 and 2010. In January 2005, NARA issued "Guidance on Managing Web Records," which addresses agencies' responsibilities for identifying, managing and scheduling web materials they identify as Federal records. Accordingly, each agency is now responsible, in coordination with NARA, for determining how to manage its web records, including whether to preserve a periodic snapshot of its entire web site.
  • Internet Archive. http://archive.org/index.php 
    The Internet Archive, a 501(c)(3) non-profit, is building a digital library of Internet sites and other cultural artifacts in digital form.  Like a paper library, it provides free access to researchers, historians, scholars, and the general public.
  • Lecher, Hanno E. "Small Scale Academic Web Archiving: DACHS." In Web Archiving, edited by Julien Masanès, 213-25. New York, NY: Springer, 2006.
    "The main objectives of the DACHS2 are to identify and archive Internet resources relevant for Chinese Studies in order to ensure their long-term accessibility. Selection plays an important role in this process, and special emphasis is put on social and political discourse as reflected by articulations on the Chinese Internet."
  • Library of Congress Web Archives (LCWA). http://lcweb2.loc.gov/diglib/lcwa/
    The Library of Congress Web Archives (LCWA) is composed of collections of archived web sites selected by subject specialists to represent web-based information on a designated topic.  It is part of a continuing effort by the Library to evaluate, select, collect, catalog, provide access to, and preserve digital materials for future generations of researchers.  The early development project for Web archives was called MINERVA.
  • Library of Congress.  "United States Election 2002 Web Archive."  Last updated August 5, 2011.  http://lcweb2.loc.gov/diglib/lcwa/html/elec2002/elec2002-overview.html
    The Election 2002 Web Archive includes Web sites associated with United States 2002 mid-term Congressional elections, gubernatorial elections, and mayoral elections in 15 major United States cities (including Washington, DC).
  • National Diet Library (Japan). "Survey on Comprehensive Collection, Storage, and Archiving of Japanese Web Sites." 2006. http://www.ndl.go.jp/en/aboutus/bulkresearch2005summary_e.html
    "From October 2004 to March 2005, a survey of web data in Japan was conducted for the purpose of studying the feasibility of and methodology for collecting, storing and archiving Japanese web sites. According to the survey, the total amount of web data in Japan as of March 2005 was estimated at 18.4 TB, and the total number of files at 450 million. These results are presented below, along with the results of studies on web archiving requirements."
  • Our Digital Island: A Tasmanian Web Archive. State Library of Tasmania. http://odi.statelibrary.tas.gov.au/
    "Our Digital Island provides access to Tasmanian Web sites that have been preserved for posterity by the LINC Tasmania."
  • September 11 Archive. Internet Archive.  http://archive.org/details/911 
    The 9/11 Television News Archive is a library of news coverage of the events of 9/11/2001 and their aftermath as presented by U.S. and international broadcasters.  A resource for scholars, journalists, and the public, it presents one week of news broadcasts for study, research and analysis.
  • UK Government Web Archive. The National Archives (UK).  http://www.nationalarchives.gov.uk/webarchive/ 
    The National Archives is preserving government information published on the Web by archiving UK Central Government Websites.
  • UK Web Archive.  http://www.webarchive.org.uk/ukwa/
    Here you can see how sites have changed over time, locate information no longer available on the live Web and observe the unfolding history of a spectrum of UK activities represented online.  Sites that no longer exist elsewhere are found here and those yet to be archived can be saved for the future by nominating them.  The Archive contains sites that reflect the rich diversity of lives and interests throughout the UK. Search is by Title of Website, Full Text or URL, or browse by Subject, Special Collection or Alphabetical List.
  • WebArchiv: Archive of the Czech Republic. http://en.webarchiv.cz/
  • WebBase Project. Stanford University. http://dbpubs.stanford.edu:8091/~testbed/doc2/WebBase/
    "The Stanford WebBase project has been collecting topic focused snapshots of Web sites. All the resulting archives are available to the public via fast download streams. For example, we collected pages from 350 sites every day for several weeks after the Katrina hurricane disaster. We also collect pages from government Web sites on a regular basis. In addition, the project examines how our archives can be explored by historians, sociologists, and public policy professionals. "

Review Examples of Web Archiving Projects and Initiatives

  • ARCOMEM (Collect-All ARchives to COmmunity MEMories). http://www.arcomem.eu/
    Intended outcomes include: "innovative models and tools for Social Web driven content appraisal and selection, and intelligent content acquisition; novel methods for Social Web analysis, Web crawling and mining, event and topic detection and consolidation, and multimedia content mining; reusable components for archive enrichment and contextualization; two complementary example applications, the first for media-related Web archives and the second for political archives; and a standards-oriented ARCOMEM demonstration system."
  • BlogForever. http://blogforever.eu/
    "BlogForever will create a software platform capable of aggregating, preserving, managing and disseminating blogs.Any user or organization will be able to use the BlogForever software & guidelines to create a digital repository containing their own selection of blogs."
  • LiWA: Living Web Archives. http://liwa-project.eu/
    The project focuses on "long term interpretability as archives evolve," "improved archive fidelity by filtering out irrelevant noise," and "considering a wide variety of content."
  • Memento. http://www.mementoweb.org/
    "Memento proposes a technical framework aimed at better integrating the current and the past Web. The framework adds a time dimension to the HTTP protocol and, inspired by content negotiation, introduces the notion of datetime negotiation. The proposed framework can lead to more Web browsing fun as old versions of Web resources (e.g. in Web Archives and in Content Management Systems) become easier to access. But Memento also suggest a generic approach for versioning Web resources that can help bootstrap a variety of novel, temporal Web applications."
  • Netarchive.dk. http://netarkivet.dk/
    "Since 2005 the collection and preservation of the Danish part of the internet is included in the Danish Legal Deposit Law. The task is undertaken by the two legal deposit libraries in Denmark, State and University Library and The Royal Library. Netarchive.dk cannot be accessed by the general public.The archive is only accessible to researchers who have requested and been granted special permission to use the collection for specific research purposes. This website, Netarkivet.dk, is designed to inform researchers, website owners, and other interested parties about the Danish web archive. For the time being most of the website is in Danish."
  • PANDORA (Preserving and Accessing Networked Documentary Resources of Australia). National Library of Australia. http://pandora.nla.gov.au/
    "PANDORA, Australia's Web Archive, is a growing collection of Australian online publications, established initially by the National Library of Australia in 1996, and now built in collaboration with nine other Australian libraries and cultural collecting organisations."

Familiarize Yourself with Related Tools and Services


Watch

  • "Web Archiving."  November 30, 2009.  Library of Congress, 3:11.  http://www.youtube.com/watch?v=T0943YkhLWU

    "Web content changes all the time.  If we don't save that content before it disappears, a major part of our cultural history will be lost.  The Library of Congress is working to provide permanent access to web content of historical importance.  It selects websites for collection, requests permissions from the website owners, addresses the technology of collecting web sites and preserves the web sites and makes them available.  This video examines those four challenges."
  • "Web Archiving and the IIPC."  2011.  International Internet Preservation Consortium, 5:23.  http://vimeo.com/26276709

    Scholars from around the world discuss the necessity of archiving the Web for future access.  This video is also available in German, Spanish, French, Japanese and Arabic.

Read

  • Ball, Alex. "Web Archiving." Edinburgh, UK: Digital Curation Centre, 2010. http://lac-repo-live7.is.ed.ac.uk/bitstream/1842/3327/1/Ball%20sarwa-v1....
    "Web archiving is important not only for future research but also for organisations’ records management processes. There are technical, organisational, legal and social issues that Web archivists need to address, some general and some specific to types of content or archiving operations of a given scope. Many of these issues are being addressed in current research and development projects, as are questions concerning how archived Web material may integrate with the live Web."
  • Bergman, Michael K.  "The Deep Web: Surfacing Hidden Value."  Journal of Electronic Publishing 7 no. 1 (2001).  doi: 10.3998/3336451.0007.104.  http://quod.lib.umich.edu/cgi/t/text/text-idx?c=jep;view=text;rgn=main;idno=3336451.0007.104 (subscription required to access this resource)
    A study at the NEC Research Institute, published in Nature, estimated that the search engines with the largest number of web pages indexed (such as Google or Northern Light) each index no more than sixteen per cent of the surface Web.  Because such engines miss the deep Web entirely, Internet searchers are in effect searching only 0.03%, or about one in 3,000, of the pages available to them today.  Bergman estimates the deep Web to be 400 to 550 times larger than the surface Web, which is how a sixteen per cent slice of the surface Web shrinks to that figure.  Clearly, simultaneous searching of multiple surface and deep web sources is necessary when comprehensive information retrieval is needed.
  • Berners-Lee, Tim and Dan Connolly.  Hypertext Markup Language – 2.0.  Network Working Group, 1995. http://www.ietf.org/rfc/rfc1866.txt
    This document specifies an Internet standards track protocol for the Internet community and requests discussion and suggestions for improvements.
  • Bragg, Molly, Kristine Hanna, Lori Donovan, Graham Hukill, and Anna Peterson. "The Web Archiving Life Cycle Model." Internet Archive, March 2013. http://archive-it.org/static/files/archiveit_life_cycle_model.pdf
    "The model is an attempt to distill the different steps and phases an institution experiences as they develop and manage their web archiving program."
  • Brown, Adrian.  Archiving Websites: A Practical Guide for Information Management Professionals.  Facet Publishing, 2006.
    This book is targeted at policy-makers, information management professionals, and web site owners and webmasters.  It provides an overview of best practice that can be applied to anything from archiving a national domain to an organizational web site.  The chapters include: the development of web archiving, selection, collection methods, quality assurance and cataloguing, preservation, delivery to users, legal issues, managing a web archiving programme, and future trends.
  • Brügger, Niels. "Step-by-step guide to archiving a website." In Archiving Websites: General Considerations and Strategies. Århus, Denmark: The Centre for Internet Research, 2005. http://cfi.au.dk/fileadmin/www.cfi.au.dk/publikationer/archiving_underside/guide.pdf
    "Since an archived website to a certain degree is only shaped in the archiving, it should be accompanied by a document containing methodical considerations of why and how the website has been archived. The following step-by-step guide is meant as an aid to the outline of such a document. In addition, it will naturally also act as a practical aid in connection with the actual archiving (and, of course, the following is to be seen in the context of the previous pages’ general deliberations and strategies, which it condenses in an itemised, tabular form.
    The guide is divided into two main parts: 1) prior to archiving, 2) the archiving process."
  • Cho, Junghoo and Hector Garcia-Molina.  "The Evolution of the Web and Implications for an Incremental Crawler."  In Proceedings of the 26th International Conference on Very Large Data Bases, 200-209. San Francisco, CA: Morgan Kaufmann, 2000. http://www.vldb.org/conf/2000/P200.pdf
    This paper studies how to build an effective incremental crawler.  The crawler selectively and incrementally updates its index and/or local collection of web pages, instead of periodically refreshing the collection in batch mode.  The incremental crawler can improve the freshness of the collection significantly and bring in new pages in a more timely manner.  It first presents results from an experiment conducted on more than half a million web pages over 4 months to estimate how web pages evolve over time.  Based on these experimental results, it compares various design choices for an incremental crawler and discusses their trade-offs.  It proposes an architecture for the incremental crawler, which combines the best design choices.  (A small scheduler sketch illustrating the incremental idea appears after this reading list.)
  • Day, Michael. "Collecting and Preserving the World Wide Web: A Feasibility Study Undertaken for the JISC and Wellcome Trust." Joint Information Systems Committee (JISC) and Wellcome Trust, 2003. http://www.jisc.ac.uk/uploaded_documents/archiving_feasibility.pdf
    This document reports on an "evaluation and feasibility study of Web archiving" supported by the Joint Information Systems Committee (JISC) and the Library of the Wellcome Trust. "The aims of this study are to provide the JISC and Wellcome Trust with:
    • An analysis of existing Web archiving arrangements to determine to what extent they address the needs of the UK research and FE/HE communities. In particular this is focused on an evaluation of sites available through the Internet Archive's Wayback Machine, to see whether these would meet the needs of their current and future users.
    • To provide recommendations on how the Wellcome Library and the JISC could begin to develop Web archiving initiatives to meet the needs of their constituent communities."
  • Farrell, Susan, ed. "A Guide to Web Preservation." 2010. http://jiscpowr.jiscinvolve.org/wp/guide/
    This guide is based on the earlier (2008) "PoWR: The Preservation of Web Resources Handbook."
  • Fitch, Kent.  "Web site archiving: an approach to recording every materially different response produced by a website."  Paper presented at the AusWeb 2003: The Ninth Australian World Wide Web Conference, Sanctuary Cove, Australia.  http://ausweb.scu.edu.au/aw03/papers/fitch/paper.html
    This paper discusses an approach to capturing and archiving all materially distinct responses produced by a web site, regardless of their content type and how they are produced.  This approach does not remove the need for traditional records management practices but rather augments them by archiving the end results of changes to content and content generation systems.  It also discusses the applicability of this approach to the capturing of web sites by harvesters.  (A short sketch of detecting materially different responses appears after this reading list.)
  • Gillies, James and Robert Cailliau.  How the Web was born: The story of the World Wide Web.  Oxford: Oxford University Press, 2000.
    Chapters include The Foundation, Setting the Scene at CERN, Bits and PCs, Enquire Within Upon Everything, What Are We Going To Call This Thing?, Sharing What We Know, The Beginning of the Future, and It's Official.
  • Kenney, Anne R., Nancy Y. McGovern, Peter Botticelli, Richard Entlich, Carl Lagoze, and Sandra Payette. "Preservation Risk Management for Web Resources: Virtual Remote Control in Cornell's Project Prism." D-Lib Magazine 8, no. 1 (2002). http://www.dlib.org/dlib/january02/kenney/01kenney.html
    "Project Prism's approach begins with characterizing the nature of preservation risks in the Web environment, develops a risk management methodology for establishing a preservation monitoring and evaluation program, and leads to the creation of management tools and policies for virtual remote control. The approach will demonstrate how Web crawlers and other automated tools and utilities can be used to identify and quantify risks; to implement appropriate and effective measures to prevent, mitigate, recover from damage to and loss of Web-based assets; and to support post-event remediation."
  • Lyman, Peter.  "Archiving the World Wide Web."  In Building a National Strategy for Preservation: Issues in Digital Media Archiving.  Council on Library and Information Resources, April 2002.  http://www.clir.org/pubs/reports/pub106/web.html
    This section of the Building a National Strategy for Preservation report analyzes the cultural, technical, economic, and legal issues surrounding Web archiving.
  • McGovern, Nancy, Anne R. Kenney, Richard Entlich, William R. Kehoe, and Ellie Buckley. "Virtual Remote Control: Building a Preservation Risk Management Toolbox for Web Resources." D-Lib Magazine 10, no. 4 (2004). http://www.dlib.org/dlib/april04/mcgovern/04mcgovern.html
    "Unlike most web preservation projects, Cornell University Library's Virtual Remote Control (VRC) initiative is based on monitoring websites over time—identifying and responding to detected risk as necessary, with capture as a last resort." "VRC leverages risk management as well as the fundamental precepts of records management to define a series of stages through which an organization would progress in selecting, monitoring, and curating target web resources. The first part of this article presents the stages of the VRC approach, identifying both human and automated responses at each stage. The second part describes the development of a toolbox to enable the VRC approach. The conclusion sets out our intentions for the future of VRC."
  • Masanès, Julien, ed. Web Archiving. New York, NY: Springer, 2006.
    "Julien Masanès, Director of the European Archive, has assembled contributions from computer scientists and librarians that altogether encompass the complete range of tools, tasks and processes needed to successfully preserve the cultural heritage of the Web. His book serves as a standard introduction for everyone involved in keeping alive the immense amount of online information, and it covers issues related to building, using and preserving Web archives both from the computer scientist and librarian viewpoints."
  • Masanès, Julien.  "Towards Continuous Web Archiving: First Results and an Agenda for the Future."  D-Lib Magazine 8 no. 12 (2002).  doi: 10.1045/december2002-masanes.  http://www.dlib.org/dlib/december02/masanes/12masanes.html

    This article outlines the contribution of the national library of France (BnF) to the Web archiving discussion.  BnF began a research project on Web archiving in late 1999.  Their work on Web archiving is divided into two parts.  The first part is to improve crawlers for continuous and adapted archiving.  This means being able to automatically focus the crawler for satisfactory archiving.  Apart from gaining hands-on experience with existing tools, this part of the project, which is presented in this article, consists of defining and testing good parameters toward that aim.  The second part of their work is testing every step of the process for depositing web content.

  • NDSA Content Working Group. "National Digital Stewardship Alliance Web Archiving Survey Report." June 19, 2012. http://www.digitalpreservation.gov/ndsa/working_groups/documents/ndsa_we...
    "From October 3 through October 31, 2011, the Content Working Group conducted a survey of organizations in the United States that are actively involved in, or planning to start, programs to archive content from the web. The goal of the survey was to better understand the landscape of web archiving activities in the United States, including identifying the organizations or individuals involved, the types of web content being preserved, the tools and services being used, and the types of access being provided. This summary report examines participant responses for the purposes of discerning trends, themes, and emerging practices and challenges in web-based content acquisition and preservation."
  • O'Neill, Edward T., Brian F. Lavoie, and Rick Bennett.  "Trends in the Evolution of the Public Web."  D-Lib Magazine 9 no. 4 (2003).  doi: 10.1045/april2003-lavoie.  http://www.dlib.org/dlib/april03/lavoie/04lavoie.html
    This article examines three key trends in the development of the public Web — size and growth, internationalization, and metadata usage — based on data from the OCLC Office of Research Web Characterization Project, an initiative that explores fundamental questions about the Web and its content through a series of Web samples conducted annually since 1998.
  • PADI: Preserving Access to Digital Information. Web Archiving. http://www.nla.gov.au/padi/topics/92.html [Not updated or maintained since 2010, but provides an extensive annotated list of resources up to that date.]
  • "PoWR: The Preservation of Web Resources Handbook." ULCC, UKOLN and JISC, 2008. http://jiscpowr.jiscinvolve.org/wp/files/2008/11/powrhandbookv1.pdf
    This Handbook is one of the outputs from the JISC-funded PoWR (Preservation Of Web Resources) project.
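
At its core, the incremental crawler of Cho and Garcia-Molina (cited above) is a revisit scheduler: pages estimated to change often are refreshed sooner, instead of the whole collection being refreshed in batch. The Python sketch below illustrates only that scheduling idea; the pages and change rates are invented, and the paper's actual design is considerably more refined.

    # Sketch of an incremental revisit scheduler in the spirit of
    # Cho and Garcia-Molina: pages with higher estimated change rates
    # are revisited sooner.  The pages and rates are illustrative only.
    import heapq

    pages = {
        "http://example.com/news":  2.0,   # estimated changes per day
        "http://example.com/blog":  0.5,
        "http://example.com/about": 0.05,
    }

    def next_visit(last_visit, change_rate):
        """Schedule the next visit one expected change interval ahead."""
        return last_visit + 1.0 / change_rate

    # Priority queue ordered by scheduled revisit time, in days from now.
    queue = [(next_visit(0.0, rate), url) for url, rate in pages.items()]
    heapq.heapify(queue)

    # Drain a few scheduled visits; a real crawler would fetch each page,
    # compare it with the stored copy, and re-estimate its change rate.
    for _ in range(6):
        due, url = heapq.heappop(queue)
        print(f"t={due:6.2f} days: revisit {url}")
        heapq.heappush(queue, (next_visit(due, pages[url]), url))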
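
Fitch's criterion of archiving every materially different response (cited above) can be approximated by hashing a normalised copy of each response and storing a new version only when the digest changes. In the sketch below, collapsing whitespace stands in for whatever normalisation a particular site actually requires; real criteria would also ignore timestamps, session identifiers, and similar noise.

    # Sketch of Fitch's idea: archive a response only when it differs
    # materially from the last archived version.  "Materially different"
    # is approximated here by hashing the body with whitespace collapsed.
    import hashlib
    import re

    last_digest = {}  # url -> digest of the last archived response

    def materially_changed(url, body):
        """Record and return True if the normalised body has changed."""
        normalised = re.sub(rb"\s+", b" ", body).strip()
        digest = hashlib.sha256(normalised).hexdigest()
        if last_digest.get(url) == digest:
            return False
        last_digest[url] = digest
        return True

    # The second response differs only in whitespace, so it is not archived.
    print(materially_changed("/page", b"<html>  Hello </html>"))   # True
    print(materially_changed("/page", b"<html> Hello  </html>"))   # False
    print(materially_changed("/page", b"<html> Goodbye </html>"))  # True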
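
Finally, the Virtual Remote Control work of Kenney et al. and McGovern et al. (cited above) treats monitoring as the first line of defence, with capture as a last resort. The sketch below is a heavily simplified illustration of that stance: it polls a hypothetical watch list and flags HTTP errors as risk signals; the actual VRC methodology defines far richer risk indicators and response stages.

    # Sketch of VRC-style monitoring: poll watched resources, record their
    # status and Last-Modified dates, and flag risk signals for follow-up.
    # The watch list and the simple risk rule are illustrative only.
    import urllib.error
    import urllib.request

    WATCH_LIST = ["http://example.org/report.html"]  # placeholder resources

    def check(url):
        """Return a (url, risk, detail) observation for one watched resource."""
        request = urllib.request.Request(url, method="HEAD")
        try:
            with urllib.request.urlopen(request, timeout=15) as resp:
                modified = resp.headers.get("Last-Modified", "unknown")
                return (url, "ok", f"HTTP {resp.status}, Last-Modified: {modified}")
        except urllib.error.HTTPError as exc:
            # Client and server errors suggest the resource is at risk;
            # a capture might now be warranted.
            return (url, "at-risk", f"HTTP {exc.code}")
        except OSError as exc:
            return (url, "at-risk", f"unreachable: {exc}")

    for url in WATCH_LIST:
        print(check(url))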
