
Archiving Web Sites - Selection Method

Q. How should I select specific web resources to archive?

Selection of web resources usually begins by identifying the entities (e.g., functions, individuals, organizational units, types of transactions) that warrant documentation over time, and then focusing on the subset of available web content that is most likely to document those entities.

Take action

  • Choose appropriate collection methods based on the types of content, the organization and structure of the target content, and the relationship between content collectors and content providers.
  • Develop selection criteria.
  • Consider the scope and content of collections, the time and frequency of collection, and the collection type (repeated vs. ad hoc vs. one-off vs. comprehensive).
  • Define the boundary of the collection.
  • Define the level of collection (page level vs. site level vs. domain level).
  • Identify entry points (manually vs. automatically).
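
The collection levels listed above (page, site, domain) can be made concrete as a scope test applied to each URL a crawler discovers. The following is a minimal sketch, not a description of any particular crawler: the function name in_scope, the level names, and the example URLs are all hypothetical, and "domain" is assumed here to mean the seed's host plus any of its subdomains.

```python
from urllib.parse import urlsplit


def in_scope(candidate: str, seed: str, level: str) -> bool:
    """Return True if `candidate` falls inside the boundary defined by `seed`.

    `level` is one of "page", "site", or "domain" (illustrative names only).
    """
    c, s = urlsplit(candidate), urlsplit(seed)
    c_host = (c.hostname or "").lower()
    s_host = (s.hostname or "").lower()

    if level == "page":
        # Only the seed page itself; URL fragments (#...) are ignored.
        return (c_host, c.path or "/", c.query) == (s_host, s.path or "/", s.query)
    if level == "site":
        # Any page on exactly the same host as the seed.
        return c_host == s_host
    if level == "domain":
        # Assumption: the seed's host names the domain, so the host itself
        # and every subdomain of it are in scope.
        return c_host == s_host or c_host.endswith("." + s_host)
    raise ValueError(f"unknown level: {level!r}")


# Entry points (seeds) chosen manually by a curator; links found during
# a crawl would be filtered against them.
seed = "https://example.org/"
discovered = [
    "https://example.org/relief/shelters.html",
    "https://blog.example.org/2005/09/update",
    "https://notexample.org/",
]
site_level = [u for u in discovered if in_scope(u, seed, "site")]
domain_level = [u for u in discovered if in_scope(u, seed, "domain")]
```

With these hypothetical inputs, the site-level boundary keeps only the first URL, while the domain-level boundary also keeps the blog subdomain. Neither keeps the unrelated host.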


Review use cases

  • Hurricanes Katrina and Rita Web Archive 
    The Internet Archive and many individual contributors created a comprehensive list of websites documenting the historic devastation and massive relief effort due to Hurricane Katrina. The sites were crawled between September 4 and November 8, 2005. This collection, containing more than 61 million searchable documents, will be preserved by the Internet Archive and made accessible to historians, researchers, scholars and the general public.
  • Center for History and New Media and American Social History Project/Center for Media and Learning.  The September 11 Digital Archive. 
    The September 11 Digital Archive uses electronic media to collect, preserve, and present the history of September 11, 2001 and its aftermath. The Archive contains more than 150,000 digital items, a tally that includes more than 40,000 emails and other electronic communications, more than 40,000 first-hand stories, and more than 15,000 digital images. In September 2003, the Library of Congress accepted the Archive into its collections, an event that both ensured the Archive's long-term preservation and marked the library's first major digital acquisition.
  • National Library of Australia.  "Selection Guidelines."  Last updated April 27, 2011. 
    Links to the selection guidelines of the participating agencies in PANDORA.


  • K-12 Web Archiving: Preserving the Present.
    "Through the K-12 web archiving program, a collaboration between the Library of Congress and the Internet Archive, students — children and teenagers — archive websites that represent their lives and interests. They not only develop critical-thinking skills and learn how to solve problems with others, they also develop an awareness of the transitory nature of web content. The students use Archive-It, a web-based web archiving service from the Internet Archive, to capture sites and manage, describe and browse their collections. The Library of Congress archives the sites the students collect, and those collections become primary sources of information for future researchers. The students' experience of creating primary sources leads them to consider the authenticity and value of other primary sources. In the spring of 2010, a team from the Library of Congress visited one of the program's participating classes at the James Moran Middle School in Wallingford, Connecticut. Over two days Library staff interviewed the students and their teacher, Paul Bogush."


  • Brown, Adrian.  "Selection."  Chap. 3 in Archiving Websites: A Practical Guide for Information Management Professionals.  London: Facet Publishing, 2006.
    Outlines the structural, temporal and informational qualities of the Web that influence the selection process.  Elaborates on a diagram of the discrete steps in the selection process, defining various selection methods and criteria.
  • Lee, Christopher A. "Collecting the Externalized Me: Appraisal of Materials in the Social Web." In I, Digital: Personal Collections in the Digital Era, edited by Christopher A. Lee, 202-238. Chicago, IL: Society of American Archivists, 2011.
    "With the adoption of highly interactive web technologies (frequently labeled 'Web 2.0'), forms of individual documentation and expression also often are inherently social and public. Such online environments allow for personal documentation, but they also engage external audiences in ways not previously possible. This opens up new opportunities and challenges for collecting personal materials, particularly within the context of archival appraisal. This chapter explores various ways in which principles of archival appraisal can be operationalized in an environment in which collecting takes the form of submitting queries and following links."
  • Lee, Christopher A., and Helen R. Tibbo. "Capturing the Moment: Strategies for Selection and Collection of Web-Based Resources to Document Important Social Phenomena." In Archiving 2008: Final Program and Proceedings, June 24-27, 2008, Bern, Switzerland, 300-305. Springfield, VA: Society for Imaging Science and Technology, 2008.
    "The VidArch project is capturing YouTube videos and web pages associated with the 2008 U.S. presidential election. We are also exploring strategies and building tools for curators of digital collections to appraise and describe such materials. Blogs are an increasingly important source for documenting online deliberations. Blogs can provide commentary, but they can also serve as “contextual information bridges” for identifying and capturing resources to which the pages link. Web archiving literature usually defines collecting in terms of setting up a set of seeds for crawls based on specific URLs. However, a substantial portion of material on the Web is accessible through posing queries. Curators of digital collections will need tools and methods for combining information from queries and crawls to identify and collect materials. The VidArch project is developing and testing such approaches, in order to support what Hans Booms would call a “documentation plan” for reflecting the heterogeneous and interlinked conversation space surrounding contemporary events."
  • Longitudinal Analytics of Web Archive data (LAWA).
    "To support innovative Future Internet applications, we need a deep understanding of Internet content characteristics (size, distribution, form, structure, evolution, dynamic). The LAWA project on Longitudinal Analytics of Web Archive data will build an Internet-based experimental testbed for large-scale data analytics. Its focus is on developing a sustainable infra-structure, scalable methods, and easily usable software tools for aggregating, querying, and analyzing heterogeneous data at Internet scale. Particular emphasis will be given to longitudinal data analysis along the time dimension for Web data that has been crawled over extended time periods."
  • Masanès, Julien.  "Selection for Web Archives."  Chap. 3 in Web Archiving: Issues and Methods, edited by Julien Masanès, 71-91.  New York: Springer, 2006. (subscription required to access this resource)
    Covers the three phases of selection: preparation, discovery, and filtering.
  • Masanès, Julien. "Towards Continuous Web Archiving: First Results and an Agenda for the Future." D-Lib Magazine 8, no. 12 (2002).
    "In this article, I will outline the contribution of the national library of France (BnF)" which "began a research project on Web archiving in late 1999. Our project experiments have been ongoing even as the legal deposit law has been in the process of being updated—a process that has not yet ended. Our work on Web archiving is divided into two parts. The first part is to improve crawlers for continuous and adapted archiving. This means being able to automatically focus the crawler for satisfactory archiving. Apart from getting existing, hands-on tools, this part of our project, which is presented in this article, consists of defining and testing good parameters toward that aim. The second part of our work is testing every step of the process for depositing web content. In our view, deposit is a necessary part of archiving the Web, because a large amount of very rich Web content is out of the reach of crawlers."
  • Masanès, Julien. "Web Archiving Methods and Approaches: A Comparative Study." Library Trends 54, no. 1 (2005): 72-90.
    "This article will present various approaches undertaken today by different institutions; it will discuss their focuses, strengths, and limits, as well as a model for appraisal and identifying potential complementary aspects amongst them. A comparison for discovery accuracy is presented between the snapshot approach done by the Internet Archive (IA) and the event-based collection done by the Bibliothèque Nationale de France (BNF) in 2002 for the presidential and parliamentary elections."
  • Qin, Jialun, Yilu Zhou, and Michael Chau.  "Building domain-specific web collections for scientific digital libraries: a meta-search enhanced focused crawling method."  In Proceedings of the 4th ACM/IEEE-CS joint conference on Digital libraries.  New York, NY: ACM, 2004.
    Collecting domain-specific documents from the Web using focused crawlers has been considered one of the most important strategies to build digital libraries that serve the scientific community. However, because most focused crawlers use local search algorithms to traverse the Web space, they could be easily trapped within a limited sub-graph of the Web that surrounds the starting URLs and build domain-specific collections that are not comprehensive and diverse enough to scientists and researchers. In this study, we investigated the problems of traditional focused crawlers caused by local search algorithms and proposed a new crawling approach, meta-search enhanced focused crawling, to address the problems. We conducted two user evaluation experiments to examine the performance of our proposed approach and the results showed that our approach could build domain-specific collections with higher quality than traditional focused crawling techniques.
  • Schneider, Steven M., et al.  "Building thematic Web collections: Challenges and experiences from the September 11 Web Archive and the Election 2002 Web Archive."  Paper presented at the 3rd Workshop on Web Archives.  2003.
    One method for creating large-scale collections of Web materials is to use a “thematic” approach. In this paper, we introduce the concept of a thematic Web collection, discuss the experience of our organizations which have collaborated in the development and presentation of two thematic Web collections, identify challenges associated with thematic archiving, and comment on the value of thematic archiving from a library, archivist and scholarly perspective.

Last updated on 08/26/13, 9:55 pm by callee

