Archiving Web Sites - Identify

Q. What do I need to identify in order to archive web sites?

An important part of managing your digital collections is identifying everything with which you are working.  This includes identifying what digital content you have, what you are already preserving, and what content you may be acquiring.  You will also need to identify the file formats you have and assess the risk associated with these formats.  

When the goal is to document the lives of individuals, there are two distinct selection strategies for homing in on materials related to them [Lee]:

  • Work from the individual outward (e.g., ask the person or find information on his/her computer that helps to identify points of entry to his/her online presence, such as logins, browsing histories, and favorite sites). [See Garfinkel and Cox]
  • Work from the wider web inward toward the individual (e.g., use web searches to locate information that leads to elements of his/her web presence).

One of the primary challenges of collecting information about or by given individuals from the Web is “web presence identification” [Bekkerman]—determining what pages on the Web are actually by or about a given individual.

For many institutions, it is important to identify web resources that are "at risk" of being lost.

Another essential activity is identifying what constitutes a record to be retained from web sites.

Take action

  • Identify the web content you have, what you are already preserving, and what content you may be acquiring
  • Identify the digital file formats you will be collecting with the web sites and assess risks associated with these file formats
  • Use file format identification tools to identify file formats you already have in your collection
  • Record date information, such as the date the files were received, the file creation date, and the file update date
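
The last two actions above can be roughly sketched in a few lines of Python. This is a minimal illustration only, not a substitute for a dedicated file format identification tool: the short signature table, the function names (`guess_format`, `describe_file`, `inventory`), and the record fields are all assumptions made for the example. It also records only the last-modified time, because a true creation date is not available in a portable way on every file system.

```python
import os
from datetime import datetime, timezone

# Illustrative signature table only (an assumption for this sketch).
# Real identification tools rely on much larger format registries.
MAGIC_SIGNATURES = [
    (b"%PDF-", "PDF"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"\xff\xd8\xff", "JPEG"),
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
    (b"PK\x03\x04", "ZIP container"),
]


def guess_format(header: bytes) -> str:
    """Return a format name based on the leading bytes, or 'unknown'."""
    for signature, name in MAGIC_SIGNATURES:
        if header.startswith(signature):
            return name
    return "unknown"


def describe_file(path, received=None):
    """Build one inventory record: guessed format, size, and date information."""
    with open(path, "rb") as handle:
        header = handle.read(16)
    stat = os.stat(path)
    # st_mtime is the last-modified time; creation time is platform dependent.
    modified = datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc)
    # If no received date is supplied, fall back to the time of the inventory.
    received = received or datetime.now(timezone.utc)
    return {
        "path": path,
        "format": guess_format(header),
        "size_bytes": stat.st_size,
        "date_modified": modified.isoformat(),
        "date_received": received.isoformat(),
    }


def inventory(root, received=None):
    """Walk a directory of harvested files and describe each one."""
    records = []
    for folder, _subdirs, files in os.walk(root):
        for name in sorted(files):
            records.append(describe_file(os.path.join(folder, name), received))
    return records
```

Calling `inventory("harvest/")` on a hypothetical directory of captured files would return one dictionary per file, which could then be written to a spreadsheet or catalog. Files reported as "unknown" are candidates for closer inspection with a fuller identification tool.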


