
Archiving Web Sites - Identify

Q. What do I need to identify in order to archive web sites?

An important part of managing your digital collections is identifying everything with which you are working.  This includes identifying what digital content you have, what you are already preserving, and what content you may be acquiring.  You will also need to identify the file formats you have and assess the risk associated with these formats.  

In cases where the goal is to document the lives of individuals, there are two distinct selection strategies for homing in on materials related to those individuals [Lee]:

  • Work from the individual outward (e.g., ask the person or find information on his/her computer that helps to identify points of entry to his/her online presence, such as logins, browsing histories, and favorite sites). [See Garfinkel and Cox]
  • Work from the wider web inward toward the individual (e.g., use web searches to locate information that leads to elements of his/her web presence).

One of the primary challenges of collecting information about or by given individuals from the Web is “web presence identification” [Bekkerman]—determining what pages on the Web are actually by or about a given individual.

For many institutions, it is important to identify web resources that are "at risk."

Another essential activity is identifying what constitutes a record to be retained from a web site.

Take action

  • Identify the web content you have, what you are already preserving, and what content you may be acquiring
  • Identify the digital file formats you will be collecting with the web sites and assess risks associated with these file formats
  • Use file format identification tools to identify file formats you already have in your collection
  • Record date information, such as the date the files were received, the file creation date, and the file update date
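
The last two actions can be sketched in a few lines of Python. This is a minimal illustration, not a substitute for a dedicated format identification tool: the signature table is a small hand-picked subset, and the function names and record fields (`identify_format`, `describe_file`, `date_received`, and so on) are assumptions made for this example.

```python
import os
from datetime import datetime, timezone

# Illustrative subset of leading-byte ("magic number") signatures.
# Dedicated identification tools rely on much larger signature registries.
SIGNATURES = [
    (b"%PDF-", "PDF"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"\xff\xd8\xff", "JPEG"),
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
    (b"PK\x03\x04", "ZIP container"),
]


def identify_format(path):
    """Return a format label based on the file's first bytes, or 'unknown'."""
    with open(path, "rb") as f:
        head = f.read(16)
    for magic, label in SIGNATURES:
        if head.startswith(magic):
            return label
    return "unknown"


def describe_file(path, received=None):
    """Build a simple inventory record for one file (hypothetical field names)."""
    info = os.stat(path)
    # If no receipt date is supplied, assume the file is being received now.
    received = received or datetime.now(timezone.utc)
    modified = datetime.fromtimestamp(info.st_mtime, timezone.utc)
    return {
        "path": path,
        "format": identify_format(path),
        "size_bytes": info.st_size,
        "date_received": received.isoformat(),
        # File creation time is not exposed consistently across operating
        # systems, so this sketch records only the last-modified time.
        "date_modified": modified.isoformat(),
    }
```

Calling `describe_file` on each file in a harvested site yields one record per file. Files reported as "unknown" are the ones to examine with a fuller identification tool before assessing format risk.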


  • Bekkerman, Ron, and Andrew McCallum, "Disambiguating Web Appearances of People in a Social Network," in Proceedings of the 14th International Conference on World Wide Web, WWW 2005: Chiba, Japan, May 10–14, 2005, ed. Allan Ellis and Tatsuya Hagino, 463-70. New York: ACM Press, 2005.
    "Say you are looking for information about a particular person. A search engine returns many pages for that person's name but which pages are about the person you care about, and which are about other people who happen to have the same name? Furthermore, if we are looking for multiple people who are related in some way, how can we best leverage this social network? This paper presents two unsupervised frameworks for solving this problem: one based on link structure of the Web pages, another using Agglomerative/Conglomerative Double Clustering (A/CDC), an application of a recently introduced multi-way distributional clustering method. To evaluate our methods, we collected and hand-labeled a dataset of over 1000 Web pages retrieved from Google queries on 12 personal names appearing together in someones in an email folder. On this dataset our methods outperform traditional agglomerative clustering by more than 20%, achieving over 80% F-measure."
  • Collaboration and Transformation Shared Interest Group. "Best Practices Study of Social Media Records Policies." Fairfax, VA: American Council for Technology. March 2011.
    "The purpose of this study is to build a discussion around the use of Web 2.0 collaborative technologies, also known as social media, to help government and its citizens connect more closely, collaboratively, and openly. The study involved interviews at 10 agencies regarding records management processes addressing the use of social media. The C&T SIG sought to explore and capture government best practices of retention policies for social media used to support agency missions."
  • Garfinkel, Simson, and David Cox. "Finding and Archiving the Internet Footprint." Paper presented at the First Digital Lives Research Conference: Personal Digital Archives for the 21st Century, London, UK, February 9-11, 2009.
    "With the move to “cloud” computing, archivists face the increasingly difficult task of finding and preserving the works of an originator so that they may be readily used by future historians. This paper explores the range of information that an originator may have left on computers “out there on the Internet,” including works that are publicly identified with the originator; information that may have been stored using a pseudonym; anonymous blog postings; and private information stored on web-based services like Yahoo Calendar and Google Docs. Approaches are given for finding the content, including interviews, forensic analysis of the originator’s computer equipment, and social network analysis. We conclude with a brief discussion of legal and ethical issues."
  • Koehler, Wallace. "A Longitudinal Study of Web Pages Continued: A Consideration of Document Persistence." Information Research 9, no. 2 (2004).
    "It is well established that Web documents are ephemeral in nature. The literature now suggests that some Web objects are more ephemeral than others. Some authors describe this in terms of a Web document half-life, others use terms like 'linkrot' or persistence. It may be that certain 'classes' of Web documents are more or less likely to persist than are others. This article is based upon an evaluation of the existing literature as well as a continuing study of a set of URLs first identified in late 1996. It finds that a static collection of general Web pages tends to 'stabilize' somewhat after it has 'aged'. However 'stable' various collections may be, their instability nevertheless pose problems for various classes of users. Based on the literature, it also finds that the stability of more specialized Web document collections (legal, educational, scientific citations) vary according to specialization. This finding, in turn, may have implications both for those who employ Web citations and for those involved in Web document collection development."
  • Kumar, B.T. Sampath, and Manoj Kumar. "Decay and half-life period of online citations cited in open access journals." International Information & Library Review 44, No. 4 (2012): 202-211.
    "This study investigates the decay and half-life of online citations cited in four open access journals published between 2000 and 2009. A total of 1158 online citations cited in 1086 research articles published in two science and social science journals spanning a period of 10 years (2000–2009) were extracted. Study found that 24.58% (267 out of 1086) of articles had online citations and these articles contained a substantially very less number of online citations (2.98%) compared to previous study results. 30.56% (26% in Science and 52.73% in Social Science) of online citations were not accessible and remaining 69.44% of online citations were still accessible. The ‘HTTP 404 error message-page not found’ was the overwhelming message encountered and represented 67.79% of all HTTP message. Domains associated with .ac and .net had higher successful access rates while .org and .com/.co had lowest successful access rates. The half-life of online citations was computed to be approximately 11.5 years and 9.07 years in Science and Social science journal articles respectively."
  • Lee, Christopher A. "Collecting the Externalized Me: Appraisal of Materials in the Social Web." In I, Digital: Personal Collections in the Digital Era, edited by Christopher A. Lee, 202-238. Chicago, IL: Society of American Archivists, 2011.
    "With the adoption of highly interactive web technologies (frequently labeled “Web 2.0”), forms of individual documentation and expression also often are inherently social and public. Such online environments allow for personal documentation, but they also engage external audiences in ways not previously possible. This opens up new opportunities and challenges for collecting personal materials, particularly within the context of archival appraisal. This chapter explores various ways in which principles of archival appraisal can be operationalized in an environment in which collecting takes the form of submitting queries and following links."
  • McCluskey, Michael. "Website content persistence and change: Longitudinal analysis of pro-white group identity." Journal of Information Science (2012): 1-10.
    "Despite the ability of websites to quickly evolve, little attention has been paid to persistence and change in site content. Longitudinal examination of 163 pro-white advocacy group websites, in which establishing a core group identity is a critical strategic goal, showed a half-life of 2.40 years and 34% remained active after five years. Analysis of text content from 28 sites collected annually from 2007 to 2012 (n=1947) showed that persistence was more likely for advocacy group identity, while examples of group goals were transient. Content persistence trends reflect broader phenomena of ideologically oriented website persuasive material."
  • Mardani, A.H., and M. Sangari. "An Analysis of the Availability and Persistence of Web Citations in Iranian LIS Journals." International Journal of Information Science and Management 11, No. 1 (2013).
    "To discover the current situation and characteristics of web citations accessibility, the present study examined the accessibility of 4,253 web citations in six key Iranian LIS journals published from 2006 to 2010. The proportion percentage of web citations increased from 11% in 2006 to 30% in 2010. The most widely cited top level domains in URLs include the .edu and .org with respectively 37% and 23%. This study provides further evidence that organizations websites have become increasingly vulnerable to URL decay. The results show that only 3467 web citations remain accessible in 2011, of which 71% allowed easy and long-term access to the authors' information intended in URLs. Long time inaccessibility to the authors' intended information was shown to be mostly from URLs that returned the 404 error and also the URLs that had gone through information update. An about 4 year half-life was estimated for Iran's LIS Publications. Ultimately, the results suggest that the decay of URLs is a grave problem in the publication of Iran's LIS researchers and cannot be overlooked. These authors need to gain the necessary knowledge about using web citations as major sources of information for their publications."
  • Moreau, Luc. "The Foundations for Provenance on the Web." Foundations and Trends in Web Science 2, No. 2/3 (2010): 99-241.
    "Using multiple data sources, we have compiled the largest bibliographical database on provenance so far. This large corpus allows us to analyse emerging trends in the research community. Specifically, using the CiteSpace tool, we identify clusters of papers that constitute research fronts, from which we derive characteristics that we use to structure our foundational framework for provenance on the Web. We note that such an endeavour requires a multi-disciplinary approach, since it requires contributions from many computer science sub-disciplines, but also other non-technical fields given the human challenge that is anticipated. To develop our vision, it is necessary to provide a definition of provenance that applies to the Web context. Our conceptual definition of provenance is expressed in terms of processes, and is shown to generalise various definitions of provenance commonly encountered." The "Open Provenance Model is an emerging community-driven representation of provenance, which has been actively used by some twenty teams to exchange provenance information according to the Open Provenance Vision. Having identified an open approach and a model for provenance, we then look at techniques that have been proposed to expose provenance over the Web. We also study how Semantic Web technologies have been successfully exploited to express, query and reason over provenance."
  • Saberi, M.K., and H. Abedi. "Accessibility and decay of web citations in five open access ISI journals." Internet Research 22, No. 2 (2012): 234-247.
    "After acquiring all the papers published by these journals during 2002-2007, their web citations were extracted and analyzed from an accessibility point of view. Moreover, for initially missed citations complementary pathways such as using Internet Explorer and the Google search engine were employed." "The study revealed that at first check 73 per cent of URLs are accessible, while 27 per cent have disappeared. It is notable that the rate of accessibility increased to 89 per cent and the rate of decay decreased to 11 per cent after using complementary pathways. The '.net' domain, with an availability of 96 per cent (a decay of 4 per cent) has the greatest stability and persistence among all domains, while the most stable file format is PDF, with an availability of 93 per cent (a decay of 7 per cent)."
  • U.S. National Archives and Records Administration. Guidance on Managing Records in Web 2.0/Social Media Platforms. October 20, 2010. 2011/2011-02.html


Last updated on 08/26/13, 9:52 pm by callee


