
Archiving Web Sites - Get

Q. How can I collect web content that falls within my collecting mission?

The following table outlines six ways to collect web content. All six approaches involve multiple selection decisions: which sources to engage, how often to collect content, what collecting parameters to use, and how much effort to invest in fixing specific problem cases. For a discussion of five of these approaches (all but capture at the source), see [Lee].

1. Ask the provider
   Explanation: Through direct contact, the collector can request and negotiate a direct transfer of the data that reside on the provider’s server(s).
   Advantages: Can yield information not directly accessible through other means, and can obtain data directly from the source (e.g., a whole database, high-resolution images) rather than only what is served through the Web.
   Disadvantages: Requires the cooperation of the provider.

2. See if someone else has it[i]
   Explanation: If content has been cached by a search service, harvested by the Internet Archive, or collected by a peer institution, obtain a copy of the content from them (a code sketch of this approach appears after the table note).
   Advantages: Allows for post hoc recovery.
   Disadvantages: Coverage and success of recovery are subject to the priorities and behavior of systems that were designed for other purposes.

3. Follow links
   Explanation: Start with seed URLs, then recursively follow the links found there, possibly feeding newly discovered URLs back into the seed list (the approach used by search engine bots and many web crawlers).
   Advantages: Tools and techniques are very well established and well understood.
   Disadvantages: Many dimensions of interest (e.g., provenance, topic, time period) are not reflected in the link patterns of web content.

4. Pull results of queries
   Explanation: The collecting institution issues queries to known sources (e.g., collecting videos from YouTube through queries for specific named individuals).
   Advantages: Can benefit from the structure and standardized interfaces of the content providers.
   Disadvantages: Strongly dependent on the interface and ranking algorithms of the content provider’s system.

5. Receive results of pushed queries
   Explanation: A subscription model of tapping into alert services or “feeds” that are pushed to the collecting institution.
   Advantages: Particularly good for forms of communication (e.g., blogs) that are “post-centric” rather than “page-centric.”
   Disadvantages: Feeds of content will often lose formatting and contextual information that could be important to retain.

6. Capture data and changes at the source


[i] See Warrick—Recover Your Lost Website, http://warrick.cs.odu.edu/.
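
The “see if someone else has it” approach can often be tested programmatically. The following Python sketch asks the Internet Archive’s Wayback Machine availability endpoint whether an archived copy of a given URL exists. The endpoint URL and the shape of its JSON response reflect the Internet Archive’s public documentation at the time of writing and may change; treat this as an illustration rather than a definitive client.

```python
"""Check whether the Internet Archive holds a copy of a given URL (sketch)."""
import json
import urllib.parse
import urllib.request


def closest_wayback_snapshot(url, timestamp=None):
    """Return metadata for the closest archived snapshot of `url`, or None."""
    params = {"url": url}
    if timestamp:  # e.g. "20090601" to prefer snapshots near June 2009
        params["timestamp"] = timestamp
    query = urllib.parse.urlencode(params)
    with urllib.request.urlopen(
        "https://archive.org/wayback/available?" + query, timeout=30
    ) as response:
        payload = json.load(response)
    # The documented response nests the best match under archived_snapshots/closest.
    return payload.get("archived_snapshots", {}).get("closest")


if __name__ == "__main__":
    snapshot = closest_wayback_snapshot("http://example.com/")
    if snapshot and snapshot.get("available"):
        print("Archived copy:", snapshot["url"], "captured", snapshot["timestamp"])
    else:
        print("No archived copy found.")
```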

 

Stated more simply, there are two fundamental approaches to capturing web content for the purpose of building digital collections: recursive link-following and query submission. The former has been the most common: one identifies a set of seed uniform resource locators (URLs) and then recursively follows links within a specified set of constraints (e.g., number of hops, specific domains). When collecting content from specific, database-driven web spaces, query submission is often the more effective approach.
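
As a minimal illustration of link-following, the Python sketch below performs a breadth-first crawl from a set of seed URLs, bounded by a maximum hop count and a set of allowed hosts; the function and parameter names are chosen here for illustration only. A production crawler such as Heritrix would additionally honor robots.txt, throttle requests, handle non-HTML content, and write captures to an archival container format such as WARC.

```python
"""Breadth-first link-following from seed URLs, bounded by hops and hosts (sketch)."""
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collect href values from anchor tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, allowed_hosts, max_hops=2):
    """Fetch pages reachable from `seeds` within `max_hops` link hops."""
    queue = deque((url, 0) for url in seeds)
    seen = set(seeds)
    pages = {}
    while queue:
        url, hops = queue.popleft()
        try:
            with urlopen(url, timeout=30) as response:
                html = response.read().decode("utf-8", errors="replace")
        except OSError:
            continue  # skip unreachable or malformed URLs
        pages[url] = html
        if hops >= max_hops:
            continue  # constraint: do not follow links beyond the hop limit
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)
            # Constraint: stay within the specified domains.
            if urlparse(absolute).hostname in allowed_hosts and absolute not in seen:
                seen.add(absolute)
                queue.append((absolute, hops + 1))
    return pages


if __name__ == "__main__":
    captured = crawl(["http://example.com/"], allowed_hosts={"example.com"})
    print(len(captured), "pages captured")
```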

Effective web collecting strategies will often involve a combination of link-following and text-based queries. For example, several projects have demonstrated methods for further scoping a topic-based crawl, based on automated analysis of the content of pages or their place within a larger network of pages.[i] There have also been many successful efforts to automatically populate web entry forms in order to collect pages that cannot be reached directly through link-following.[ii] Four fundamental parameters for any web collecting initiative are:

  • environments crawled (e.g., blogosphere, YouTube);
  • access points from those environments used as crawling or selection criteria (e.g., number of views, primary relevance based on term matching, number of in-links, channel or account from which an item was submitted);
  • threshold values for scoping capture within given access points (e.g., one hundred most relevant query results, at least five in-links);[iii] and
  • frequency of crawls.

It is very likely that the most appropriate approaches will vary the environments, access points, and thresholds in different ways, depending on the materials and collecting goals. For example, different types of web materials change or disappear from the Web at very different rates, which implies the need for different crawl frequencies.[iv]
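
One way to make those four parameters explicit, and therefore documentable and repeatable, is to record them as a small, structured harvest specification. The Python sketch below is hypothetical (the HarvestSpec fields and the within_scope check are not drawn from any particular system); it simply shows environments, access points, thresholds, and crawl frequency expressed as data that a crawler or query-harvesting script could consume.

```python
"""Hypothetical harvest specification capturing the four collecting parameters."""
from dataclasses import dataclass


@dataclass
class HarvestSpec:
    environment: str                  # environment crawled, e.g. "YouTube", "blogosphere"
    queries: list                     # query terms used as access points
    max_results_per_query: int = 100  # threshold: top-N most relevant results
    min_views: int = 0                # threshold on a usage-based access point
    crawl_interval_days: int = 7      # frequency of repeated harvests
    notes: str = ""


def within_scope(spec, record):
    """Apply threshold values to one candidate result record.

    `record` is assumed to be a dict with 'rank' and 'views' keys produced by
    whatever query interface the environment exposes.
    """
    return (record["rank"] <= spec.max_results_per_query
            and record["views"] >= spec.min_views)


if __name__ == "__main__":
    spec = HarvestSpec(
        environment="YouTube",
        queries=["hurricane katrina recovery"],
        max_results_per_query=100,
        min_views=500,
        crawl_interval_days=1,
        notes="Daily harvest; rapidly changing material.",
    )
    sample = {"rank": 12, "views": 20000}
    print(within_scope(spec, sample))  # True: within both thresholds
```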

There are multiple methods for obtaining micro-content in the form of feeds; current examples include Really Simple Syndication (RSS), Atom, and Twitter streams. Such content feeds can be a huge boon for collecting archivists, but they can also miss much of the contextual information that is so important to archivists and (presumably) future users. For example, the RSS feed from a blog often “undoes the idiosyncratic feel of many weblogs by stripping them of visual elements such as layout or logos, as well as eliminating the context produced by blogrolls (blog authors’ links to other weblogs) or the author’s biographical information (and any advertising).” [Gillmor]
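
As a minimal sketch of feed-based collecting, the Python code below pulls an RSS feed, parses its items with the standard library, and keeps the raw feed bytes alongside the parsed entries so that at least the feed itself is preserved verbatim. It assumes RSS 2.0 element names (item, title, link, pubDate, description); Atom feeds differ, and a dedicated library such as feedparser handles both more robustly. Even this raw capture would not recover the layout, blogrolls, or author information described above; those would have to be captured from the rendered pages.

```python
"""Pull post-centric content from an RSS 2.0 feed, retaining the raw feed (sketch)."""
import urllib.request
import xml.etree.ElementTree as ET


def harvest_rss(feed_url):
    """Return (raw_bytes, entries) for an RSS 2.0 feed."""
    with urllib.request.urlopen(feed_url, timeout=30) as response:
        raw = response.read()  # keep verbatim for the archival record
    root = ET.fromstring(raw)
    entries = []
    for item in root.iter("item"):
        entries.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "published": item.findtext("pubDate", default=""),
            "description": item.findtext("description", default=""),
        })
    return raw, entries


if __name__ == "__main__":
    raw_feed, posts = harvest_rss("http://example.com/feed.xml")
    print(len(raw_feed), "bytes of raw feed;", len(posts), "items parsed")
```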



 

 

 

[i] Soumen Chakrabarti, Martin van den Berg, and Byron Dom, "Focused Crawling: A New Approach to Topic-Specific Resource Discovery," in Proceedings of the Eighth International World Wide Web Conference: Toronto, Canada, May 11–14, 1999 (Amsterdam: Elsevier, 1999), 545–62; Donna Bergmark, "Collection Synthesis," in Proceedings of the Second ACM/IEEE-CS Joint Conference on Digital Libraries: July 14–18, 2002, Portland, Oregon, ed. Gary Marchionini and William R. Hersh (New York: ACM Press, 2002), 253–6; Donna Bergmark, Carl Lagoze, and Alex Sbityakov, "Focused Crawls, Tunneling, and Digital Libraries," in Research and Advanced Technology for Digital Libraries: 6th European Conference, ECDL 2002, Rome, Italy, September 2002: Proceedings, ed. Maristella Agosti and Constantino Thanos (Berlin: Springer, 2002), 91–106; Gautam Pant and Padmini Srinivasan, "Learning to Crawl: Comparing Classification Schemes," ACM Transactions on Information Systems 23, no. 4 (2005): 430–62; and Gautam Pant, Kostas Tsioutsiouliklis, Judy Johnson, and C. Lee Giles, "Panorama: Extending Digital Libraries with Topical Crawlers," in JCDL 2004: Proceedings of the Fourth ACM/IEEE Joint Conference on Digital Libraries: Global Reach and Diverse Impact: Tucson, Arizona, June 7–11, 2004, ed. Hsinchun Chen, Michael Christel, and Ee-Peng Lim (New York: ACM Press, 2004), 142–50.

[ii] Sriram Raghavan and Hector Garcia-Molina, "Crawling the Hidden Web," in Proceedings of 27th International Conference on Very Large Data Bases, September 11–14, 2001, Roma, Italy, ed. Peter M. G. Apers, Paolo Atzeni, Stefano Ceri, Stefano Paraboschi, Kotagiri Ramamohanarao, and Richard T. Snodgrass (Orlando, FL: Morgan Kaufmann, 2001), 129–38; Alexandros Ntoulas, Petros Zerfos, and Junghoo Cho, "Downloading Textual Hidden Web Content through Keyword Queries," in Proceedings of the 5th ACM/IEEE Joint Conference on Digital Libraries: Denver, CO, USA, June 7–11, 2005: Digital Libraries, Cyberinfrastructure for Research and Education (New York: ACM Press, 2005), 100–9; and Xiang Peisu, Tian Ke, and Huang Qinzhen, "A Framework of Deep Web Crawler," in Proceedings of the 27th Chinese Control Conference, ed. Dai-Zhan Cheng and Min Wu (Beijing, China: Beijing hang kong hang tian da xue chu ban she, 2008), 582–86.

[iii] Capra et al., "Selection and Context Scoping."

[iv] Bernard Reilly, Carolyn Palaima, Kent Norsworthy, Leslie Myrick, Gretchen Tuchel, and James Simon, "Political Communications Web Archiving: Addressing Typology and Timing for Selection, Preservation and Access" (paper presented at the Third ECDL Workshop on Web Archives, Trondheim, Norway, August 21, 2003); Wallace Koehler, "A Longitudinal Study of Web Pages Continued: A Consideration of Document Persistence," Information Research 9, no. 2 (2004).


Read

  • Bragg, Molly and Lori Donovan. "Archiving Social Networking Sites w/ Archive-It." https://webarchive.jira.com/wiki/pages/viewpage.action?pageId=3113092
  • Cooper, Brian F., and Hector Garcia-Molina. "InfoMonitor: Unobtrusively Archiving a World Wide Web Server." International Journal on Digital Libraries 5, no. 2 (2005): 106-19.
    "It is important to provide long-term preservation of digital data even when those data are stored in an unreliable system such as a filesystem, a legacy database, or even the World Wide Web. In this paper we focus on the problem of archiving the contents of aWeb site without disrupting users who maintain the site. We propose an archival storage system, the InfoMonitor, in which a reliable archive is integrated with an unmodified existing store. Implementing such a system presents various challenges related to the mismatch of features between the components such as differences in naming and data manipulation operations.  We examine each of these issues as well as solutions for the conflicts that arise.  We also discuss our experience using the InfoMonitor to archive the Stanford Database Group’sWeb site."
  • Fitch, Kent. "Web Site Archiving - an Approach to Recording Every Materially Different Response Produced by a Website." Paper presented at the Ninth Australian World Wide Web Conference, Hyatt Sanctuary Cove, Gold Coast, July 5-9, 2003. http://ausweb.scu.edu.au/aw03/papers/fitch/paper.html
    "This paper discusses an approach to capturing and archiving all materially distinct responses produced by a web site, regardless of their content type and how they are produced. This approach does not remove the need for traditional records management practices but rather augments them by archiving the end results of changes to content and content generation systems. It also discusses the applicability of this approach to the capturing of web sites by harvesters."
  • Gillmor, Dan. We the Media: Grassroots Journalism by the People, for the People. 1st ed. Sebastopol, CA: O'Reilly, 2004.
  • Marill, Jennifer, Andrew Boyko, and Michael Ashenfelder. "Web Harvesting Survey." International Internet Preservation Coalition, 2004. http://www.netpreserve.org/resources/web-harvesting-survey
    "The Metrics and Testbed Working Group of the IIPC conducted a survey which is an attempt to identify and classify many of the general conditions found on Web sites that influence the harvesting of content and the quality of an archival crawl. It is intended to provide a high-level overview of common Web crawling conditions, roughly prioritized by their significance, as background information for institutions beginning to engage in web harvesting. We also offer examples of the various issues, and characterize in which of the several phases of the harvesting process the described problems can occur."
  • Lee, Christopher A. "Collecting the Externalized Me: Appraisal of Materials in the Social Web." In I, Digital: Personal Collections in the Digital Era, edited by Christopher A. Lee, 202-238. Chicago, IL: Society of American Archivists, 2011.
    "With the adoption of highly interactive web technologies (frequently labeled “Web 2.0”), forms of individual documentation and expression also often are inherently social and public. Such online environments allow for personal documentation, but they also engage external audiences in ways not previously possible. This opens up new opportunities and challenges for collecting personal materials, particularly within the context of archival appraisal. This chapter explores various ways in which principles of archival appraisal can be operationalized in an environment in which collecting takes the form of submitting queries and following links."
  • Library of Congress. Quality and Functionality Factors For Archived Web Sites and Pages. http://www.digitalpreservation.gov/formats/content/webarch_quality.shtml
    "This discussion concerns Web sites as they may be collected and archived for research access and long-term preservation. What is at stake is harvesting sites as they present themselves to users at a particular time. The formats discussed here are those that might hold the results of a crawl of a Web site or set of Web sites, a dynamic action resulting from the use of a software package (e.g., Heritrix) that calls up Web pages and captures them in the form disseminated to users."
  • McCown, Frank, Catherine C. Marshall, and Michael L. Nelson. "Why Websites Are Lost (and How They're Sometimes Found)." Communications of the ACM 52, no. 11 (2009): 141-45.
    "We have surveyed 52 individuals who have "lost" their own personal website (through a hard drive crash, bankrupt ISP, etc.) or tried to recover a lost website that once belonged to someone else. Our survey investigates why websites are lost and how successful individuals have been at recovering them using a variety of methods, including the use of search engine caches and web archives. The findings suggest that personal and third party loss of digital data is likely to continue as methods for backing up data are overlooked or performed incorrectly, and individual behavior is unlikely to change because of the perception that losing digital data is very uncommon and the responsibility of others."
  • McCown, Frank, and Michael L. Nelson. "What Happens When Facebook Is Gone?" In Proceedings of the 9th ACM/IEEE-CS Joint Conference on Digital Libraries: June 12-15, 2009, Austin, Texas, USA, edited by Fred Heath and Mary Lynn Rice-Lively, 251-54. New York, NY: ACM Press, 2009. http://www.cs.odu.edu/~mln/pubs/jcdl09/archiving-facebook-jcdl2009.pdf
    "Web users are spending more of their time and creative en- ergies within online social networking systems. While many of these networks allow users to export their personal data or expose themselves to third-party web archiving, some do not. Facebook, one of the most popular social networking websites, is one example of a \walled garden" where users' activities are trapped. We examine a variety of techniques for extracting users' activities from Facebook (and by ex- tension, other social networking systems) for the personal archive and for the third-party archiver. Our framework could be applied to any walled garden where personal user data is being locked."
  • Marchionini, Gary, Chirag Shah, Christopher A. Lee, and Robert Capra. "Query Parameters for Harvesting Digital Video and Associated Contextual Information." In Proceedings of the 9th ACM/IEEE-CS Joint Conference on Digital Libraries, 77-86. New York, NY: ACM Press, 2009. http://www.ils.unc.edu/callee/p77-marchionini.pdf
    "Video is increasingly important to digital libraries and archives as both primary content and as context for the primary objects in collections. Services like YouTube not only offer large numbers of videos but also usage data such as comments and ratings that may help curators today make selections and aid future generations to interpret those selections. A query-based harvesting strategy is presented and results from daily harvests for six topics defined by 145 queries over a 20-month period are discussed with respect to, query specification parameters, topic, and contribution patterns. The limitations of the strategy and these data are considered and suggestions are offered for curators who wish to use query-based harvesting."
  • Schrenk, Michael. Webbots, Spiders, and Screen Scrapers: A Guide to Developing Internet Agents with PHP/CURL. San Francisco, CA: No Starch Press, 2007. [See especially: Downloading Web Pages - http://www.nostarch.com/download/webbots_ch3.pdf]
    "This chapter will show you how to write simple PHP scripts that download web pages. More importantly, you’ll learn PHP’s limitations and how to overcome them with PHP/CURL , a special binding of the cURL library that facilitates many advanced network features. cURL is used widely by many computer languages as a means to access network files with a number of protocols and options."

