
Archiving Web Sites - Get

Q. How can I collect web content that falls within my collecting mission?

The following table outlines six ways to collect web content. All six approaches involve multiple selection decisions: which sources to engage, how often to collect content, what collecting parameters to use, and how much effort to invest in fixing specific problem cases. For a discussion of five of these approaches (it does not address capture at the source), see [Lee].

Approach: Ask the provider
  Explanation: Through direct contact, the collector can request and negotiate a direct transfer of the data that reside on the provider's server(s).
  Advantages: Can yield information not directly accessible through other means, and can obtain data directly from the source (e.g., a whole database, high-resolution images) rather than what is served through the Web.
  Disadvantages: Requires the cooperation of the provider.

Approach: See if someone else has it[i]
  Explanation: If the content has been cached by a search service, harvested by the Internet Archive, or collected by a peer institution, obtain a copy of the content from them.
  Advantages: Allows for post hoc recovery.
  Disadvantages: Coverage and success of recovery depend on systems that were designed for other purposes and on their priorities.

Approach: Follow links
  Explanation: Start with seed URLs, then recursively follow links, possibly feeding new URLs back into the seed list (the approach used by search engine bots and many web crawlers).
  Advantages: Tools and techniques are very well established and well understood.
  Disadvantages: Many dimensions of interest (e.g., provenance, topic, time period) are not reflected in the link patterns of web content.

Approach: Pull results of queries
  Explanation: The collecting institution issues queries to known sources (e.g., collecting videos from YouTube by querying for specific named individuals).
  Advantages: Can benefit from the structure and standardized interfaces of the content providers.
  Disadvantages: Strongly dependent on the interface and ranking algorithms of the content provider's system.

Approach: Receive results of pushed queries
  Explanation: Subscription model of tapping into alert services or "feeds" that are pushed to the collecting institution.
  Advantages: Particularly good for communication forms (e.g., blogs) that are "post-centric" rather than "page-centric."
  Disadvantages: Feeds of content will often lose formatting and contextual information that could be important to retain.

Approach: Capture data and changes at the source
  Explanation: Capture content and the changes made to it directly on the server(s) where it is produced, rather than through the web interface (see the Cooper and Garcia-Molina and Fitch readings below).


[i] See Warrick—Recover Your Lost Website, http://warrick.cs.odu.edu/.

 

Stated more simply, there are two fundamental approaches to capturing web content for purposes of building digital collections: recursive link-following and query submission. The former has been the most common; it involves identifying a set of seed uniform resource locators (URLs) and then recursively following links within a specified set of constraints (e.g., number of hops, specific domains). When collecting content from specific, database-driven web spaces, query submission is often the most effective approach.
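
To make the link-following approach concrete, here is a minimal, illustrative Python sketch (not drawn from the sources cited here): it starts from a seed URL, stays within an allowed domain, and stops after a fixed number of hops. The seed, domain, and hop limit are placeholder assumptions; a production crawl would also honor robots.txt, throttle requests, and write results to an archival format such as WARC.

# Illustrative only: placeholder seed/domain; a real crawl also needs robots.txt
# handling, politeness delays, and archival storage (e.g., WARC output).
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collect href values from anchor tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, allowed_domains, max_hops=2):
    """Yield (url, html) pairs, following links up to max_hops from the seeds."""
    queue = deque((url, 0) for url in seeds)
    seen = set(seeds)
    while queue:
        url, hops = queue.popleft()
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", errors="replace")
        except (OSError, ValueError):
            continue  # skip pages that cannot be fetched
        yield url, html  # hand the page to whatever stores it (e.g., a WARC writer)
        if hops >= max_hops:
            continue
        extractor = LinkExtractor()
        extractor.feed(html)
        for link in extractor.links:
            absolute = urljoin(url, link)
            if urlparse(absolute).netloc in allowed_domains and absolute not in seen:
                seen.add(absolute)
                queue.append((absolute, hops + 1))


if __name__ == "__main__":
    # Hypothetical scoping: one seed, one domain, two hops.
    for page_url, _page_html in crawl(["https://example.org/"], {"example.org"}, max_hops=2):
        print("captured", page_url)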

Effective web collecting strategies will often involve a combination of link-following and text-based queries. For example, several projects have demonstrated methods for further scoping a topic-based crawl, based on automated analysis of the content of pages or their place within a larger network of pages.[i] There have also been many successful efforts to automatically populate web entry forms in order to collect pages that cannot be reached directly through link-following.[ii] Four fundamental parameters for any web collecting initiative are: environments crawled (e.g., blogosphere, YouTube); access points from those environments used as crawling or selection criteria (e.g., number of views, primary relevance based on term matching, number of in-links, channel or account from which an item was submitted); threshold values for scoping capture within given access points (e.g., one hundred most relevant query results, at least five in-links);[iii] and frequency of crawls. The most appropriate approach will very likely combine environments, access points, and thresholds in different ways, depending on the materials and collecting goals. For example, different types of web materials change or disappear from the Web at very different rates, which implies the need for different crawl frequencies.[iv]
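
As an illustration of how these four parameters might be recorded, the following hypothetical Python sketch captures environments, access points, thresholds, and crawl frequency for two imagined harvests. The field names and values are assumptions made for the example, not a standard schema.

# Hypothetical description of query-based harvests using the four parameters
# named above (environment, access point, threshold, crawl frequency).
# Field names and values are illustrative assumptions, not a standard schema.
from dataclasses import dataclass, field


@dataclass
class HarvestSpec:
    environment: str           # e.g. "youtube", "blogosphere"
    access_point: str          # e.g. "relevance", "in_links", "views", "channel"
    threshold: int             # scoping value applied to the access point
    crawl_frequency_days: int  # how often the harvest is repeated
    queries: list = field(default_factory=list)


specs = [
    # Keep the 100 most relevant YouTube results per query, harvested daily.
    HarvestSpec("youtube", "relevance", 100, 1, ["some named individual"]),
    # Keep blog pages with at least five in-links, harvested weekly.
    HarvestSpec("blogosphere", "in_links", 5, 7),
]

for spec in specs:
    print(spec)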

There are multiple methods for obtaining micro-content in the form of feeds; current examples include Really Simple Syndication (RSS), Atom, and Twitter. Such content feeds can be a huge boon for collecting archivists, but they can also miss much of the contextual information that is so important to archivists and (presumably) future users. For example, the RSS feed from a blog often “undoes the idiosyncratic feel of many weblogs by stripping them of visual elements such as layout or logos, as well as eliminating the context produced by blogrolls (blog authors’ links to other weblogs) or the author’s biographical information (and any advertising).” [Gillmor]
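
Below is a minimal sketch of feed-based collecting, assuming a hypothetical RSS 2.0 feed URL. It saves the raw feed XML (the fullest record the feed itself offers) and lists the individual posts, each of which could then be captured separately to recover some of the context the feed strips away.

# Illustrative only: the feed URL is a placeholder, and an RSS 2.0 layout is
# assumed. The raw XML is kept because the parsed items alone lose context.
import xml.etree.ElementTree as ET
from urllib.request import urlopen

FEED_URL = "https://example.org/blog/feed.xml"  # hypothetical feed

raw_xml = urlopen(FEED_URL, timeout=10).read()

# Keep the raw response; it is the fullest record the feed itself offers.
with open("feed-snapshot.xml", "wb") as f:
    f.write(raw_xml)

# List the individual posts so each linked page can be captured separately.
root = ET.fromstring(raw_xml)
for item in root.iter("item"):
    title = item.findtext("title", default="")
    link = item.findtext("link", default="")
    print(title, "->", link)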



 

 

 

[i] Soumen Chakrabarti, Martin van den Berg, and Byron Dom, "Focused Crawling: A New Approach to Topic-Specific Resource Discovery," in Proceedings of the Eighth International World Wide Web Conference: Toronto, Canada, May 11–14, 1999 (Amsterdam: Elsevier, 1999), 545–62; Donna Bergmark, "Collection Synthesis," in Proceedings of the Second ACM/IEEE-CS Joint Conference on Digital Libraries: July 14–18, 2002, Portland, Oregon, ed. Gary Marchionini and William R. Hersh (New York: ACM Press, 2002), 253–6; Donna Bergmark, Carl Lagoze, and Alex Sbityakov; "Focused Crawls, Tunneling, and Digital Libraries," in Research and Advanced Technology for Digital Libraries: 6th European Conference, ECDL 2002, Rome, Italy, September 2002: Proceedings, ed. Maristella Agosti and Constantino Thanos (Berlin: Springer, 2002), 91–106; Gautam Pant and Padmini Srinivasan, "Learning to Crawl: Comparing Classification Schemes," ACM Transactions on Information Systems 23, no. 4 (2005): 430–62; and Gautam Pant, Kostas Tsioutsiouliklis, Judy Johnson, and C. Lee Giles, "Panorama: Extending Digital Libraries with Topical Crawlers," in JCDL 2004: Proceedings of the Fourth ACM/IEEE Joint Conference on Digital Libraries: Global Reach and Diverse Impact: Tucson, Arizona, June 7–11, 2004, ed. Hsinchun Chen, Michael Christel and Ee-Peng Lim (New York: ACM Press, 2004), 142–50.

[ii] Sriram Raghavan and Hector Garcia-Molina, "Crawling the Hidden Web," in Proceedings of 27th International Conference on Very Large Data Bases, September 11–14, 2001, Roma, Italy, ed. Peter M. G. Apers, Paolo Atzeni, Stefano Ceri, Stefano Paraboschi, Kotagiri Ramamohanarao and Richard T. Snodgrass (Orlando, FL: Morgan Kaufmann, 2001), 129–38;  Alexandros Ntoulas, Petros Zerfos, and Junghoo Cho, "Downloading Textual Hidden Web Content through Keyword Queries," In Proceedings of the 5th ACM/IEEE Joint Conference on Digital Libraries: Denver, Co, USA, June 7–11, 2005: Digital Libraries, Cyberinfrastructure for Research and Education (New York: ACM Press, 2005), 100–9; and Xiang Peisu, Tian Ke, and Huang Qinzhen, "A Framework of Deep Web Crawler," in Proceedings of the 27th Chinese Control Conference, ed. Dai-Zhan Cheng and Min Wu (Beijing, China: Beijing hang kong hang tian da xue chu ban she, 2008), 582–86.

[iii] Capra et al., "Selection of Context Scoping."

[iv] Bernard Reilly, Carolyn Palaima, Kent Norsworthy, Leslie Myrick, Gretchen Tuchel, and James Simon, "Political Communications Web Archiving: Addressing Typology and Timing for Selection, Preservation and Access" (paper presented at the Third ECDL Workshop on Web Archives, Trondheim, Norway, August 21, 2003); Wallace Koehler, "A Longitudinal Study of Web Pages Continued: A Consideration of Document Persistence," Information Research 9, no. 2 (2004).

Explore Tools and Services

Read

  • Bragg, Molly and Lori Donovan. "Archiving Social Networking Sites w/ Archive-It." https://webarchive.jira.com/wiki/pages/viewpage.action?pageId=3113092
  • Cooper, Brian F., and Hector Garcia-Molina. "InfoMonitor: Unobtrusively Archiving a World Wide Web Server." International Journal on Digital Libraries 5, no. 2 (2005): 106-19.
    "It is important to provide long-term preservation of digital data even when those data are stored in an unreliable system such as a filesystem, a legacy database, or even the World Wide Web. In this paper we focus on the problem of archiving the contents of aWeb site without disrupting users who maintain the site. We propose an archival storage system, the InfoMonitor, in which a reliable archive is integrated with an unmodified existing store. Implementing such a system presents various challenges related to the mismatch of features between the components such as differences in naming and data manipulation operations.  We examine each of these issues as well as solutions for the conflicts that arise.  We also discuss our experience using the InfoMonitor to archive the Stanford Database Group’sWeb site."
  • Fitch, Kent. "Web Site Archiving - an Approach to Recording Every Materially Different Response Produced by a Website." Paper presented at the Ninth Australian World Wide Web Conference, Hyatt Sanctuary Cove, Gold Coast, July 5-9, 2003. http://ausweb.scu.edu.au/aw03/papers/fitch/paper.html
    "This paper discusses an approach to capturing and archiving all materially distinct responses produced by a web site, regardless of their content type and how they are produced. This approach does not remove the need for traditional records management practices but rather augments them by archiving the end results of changes to content and content generation systems. It also discusses the applicability of this approach to the capturing of web sites by harvesters."
  • Gillmor, Dan. We the Media: Grassroots Journalism by the People, for the People. 1st ed. Sebastopol, CA: O'Reilly, 2004.
  • Lee, Christopher A. "Collecting the Externalized Me: Appraisal of Materials in the Social Web." In I, Digital: Personal Collections in the Digital Era, edited by Christopher A. Lee, 202-238. Chicago, IL: Society of American Archivists, 2011.
    "With the adoption of highly interactive web technologies (frequently labeled “Web 2.0”), forms of individual documentation and expression also often are inherently social and public. Such online environments allow for personal documentation, but they also engage external audiences in ways not previously possible. This opens up new opportunities and challenges for collecting personal materials, particularly within the context of archival appraisal. This chapter explores various ways in which principles of archival appraisal can be operationalized in an environment in which collecting takes the form of submitting queries and following links."
  • Library of Congress. Quality and Functionality Factors For Archived Web Sites and Pages. http://www.digitalpreservation.gov/formats/content/webarch_quality.shtml
    "This discussion concerns Web sites as they may be collected and archived for research access and long-term preservation. What is at stake is harvesting sites as they present themselves to users at a particular time. The formats discussed here are those that might hold the results of a crawl of a Web site or set of Web sites, a dynamic action resulting from the use of a software package (e.g., Heritrix) that calls up Web pages and captures them in the form disseminated to users."
  • McCown, Frank, Catherine C. Marshall, and Michael L. Nelson. "Why Websites Are Lost (and How They're Sometimes Found)." Communications of the ACM 52, no. 11 (2009): 141-45.
    "We have surveyed 52 individuals who have "lost" their own personal website (through a hard drive crash, bankrupt ISP, etc.) or tried to recover a lost website that once belonged to someone else. Our survey investigates why websites are lost and how successful individuals have been at recovering them using a variety of methods, including the use of search engine caches and web archives. The findings suggest that personal and third party loss of digital data is likely to continue as methods for backing up data are overlooked or performed incorrectly, and individual behavior is unlikely to change because of the perception that losing digital data is very uncommon and the responsibility of others."
  • McCown, Frank, and Michael L. Nelson. "What Happens When Facebook Is Gone?" In Proceedings of the 9th ACM/IEEE-CS Joint Conference on Digital Libraries: June 12-15, 2009, Austin, Texas, USA, edited by Fred Heath and Mary Lynn Rice-Lively, 251-54. New York, NY: ACM Press, 2009. http://www.cs.odu.edu/~mln/pubs/jcdl09/archiving-facebook-jcdl2009.pdf
    "Web users are spending more of their time and creative en- ergies within online social networking systems. While many of these networks allow users to export their personal data or expose themselves to third-party web archiving, some do not. Facebook, one of the most popular social networking websites, is one example of a \walled garden" where users' activities are trapped. We examine a variety of techniques for extracting users' activities from Facebook (and by ex- tension, other social networking systems) for the personal archive and for the third-party archiver. Our framework could be applied to any walled garden where personal user data is being locked."
  • Marchionini, Gary, Chirag Shah, Christopher A. Lee, and Robert Capra. "Query Parameters for Harvesting Digital Video and Associated Contextual Information." In Proceedings of the 9th ACM/IEEE-CS Joint Conference on Digital Libraries, 77-86. New York, NY: ACM Press, 2009. http://www.ils.unc.edu/callee/p77-marchionini.pdf
    "Video is increasingly important to digital libraries and archives as both primary content and as context for the primary objects in collections. Services like YouTube not only offer large numbers of videos but also usage data such as comments and ratings that may help curators today make selections and aid future generations to interpret those selections. A query-based harvesting strategy is presented and results from daily harvests for six topics defined by 145 queries over a 20-month period are discussed with respect to, query specification parameters, topic, and contribution patterns. The limitations of the strategy and these data are considered and suggestions are offered for curators who wish to use query-based harvesting."
  • Marill, Jennifer, Andrew Boyko, and Michael Ashenfelder. "Web Harvesting Survey." International Internet Preservation Coalition, 2004. http://www.netpreserve.org/sites/default/files/resources/WebArchivingSurvey.pdf
    "This survey is an attempt to identify and classify many of the conditions found on Web sites that influence the harvesting of content and the quality of an archival crawl. This table is based on Ketil Albertsen’s report, 'A taxonomy for the "the deep web",' and on discussions of the Library of Congress Web harvesting team (LCWHT)."
  • Schrenk, Michael. Webbots, Spiders, and Screen Scrapers: A Guide to Developing Internet Agents with PHP/CURL. San Francisco, CA: No Starch Press, 2007. [See especially: Downloading Web Pages - http://www.nostarch.com/download/webbots_ch3.pdf]
    "This chapter will show you how to write simple PHP scripts that download web pages. More importantly, you’ll learn PHP’s limitations and how to overcome them with PHP/CURL , a special binding of the cURL library that facilitates many advanced network features. cURL is used widely by many computer languages as a means to access network files with a number of protocols and options."


Storage Media - Identifying and Redacting Sensitive Information

Q. How can I identify and redact sensitive information?

Acquiring Data from Storage Media - Get

Q. How should I get data off digital storage media?

Take action

  • Find or obtain the right equipment to read data from the drive - including power, I/O hardware (e.g. USB), and possibly a fan (for hard drives)
  • Be sure to attach drives in a read-only configuration - mount as read-only and use a hardware write-blocker if you can
  • Generate disk images as well as copies of individual files (a rough imaging sketch follows this list)
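
The following rough Python sketch illustrates the imaging step in the list above, assuming a hypothetical write-blocked device at /dev/sdb. It is only meant to show the idea behind the dedicated imaging tools listed below (dd, dc3dd, Guymager, FTK Imager), which should be preferred in practice.

# Illustrative only: /dev/sdb stands in for a write-blocked source device, and
# dedicated imaging tools (dd, dc3dd, Guymager, FTK Imager) are preferable in practice.
import hashlib

SOURCE = "/dev/sdb"           # hypothetical read-only, write-blocked device
IMAGE = "accession-001.dd"    # raw (dd-style) disk image to create
CHUNK = 1024 * 1024           # copy 1 MiB at a time

sha256 = hashlib.sha256()
with open(SOURCE, "rb") as src, open(IMAGE, "wb") as dst:
    while True:
        block = src.read(CHUNK)
        if not block:
            break
        dst.write(block)
        sha256.update(block)

# Record the digest so the image's fixity can be verified later.
with open(IMAGE + ".sha256", "w") as f:
    f.write(sha256.hexdigest() + "  " + IMAGE + "\n")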

 

Explore Tools

  • dc3dd
    a patched version of dd with added features for computer forensics
  • dd (Unix)
    dd is a free, standard Unix/Linux utility that can be used to generate raw disk images.
  • DiscFerret
    DiscFerret is a combination of hardware and software that allows a standard desktop computer to read, analyse and decode the data on almost any floppy disc, and most MFM and RLL hard disc drives.  http://discferret.com/wiki/DiscFerret
  • FC5025
    A USB 5.25" floppy controller that plugs into any computer's USB port and enables you to attach a 5.25" floppy drive.  http://www.deviceside.com/fc5025.html
  • FTK Imager - AccessData. http://www.accessdata.com/support/product-downloads
    FTK Imager is a commercial (but free to download) tool for creating disk images in raw (dd), SMART, E01, or AFF format and for basic navigation of disk images. One can also use FTK Imager to extract files and metadata from disk images. NOTE: This is not the same product as FTK, which has many other features and requires paying for a license to run.
  • Guymager (Linux). http://guymager.sourceforge.net/
    Free, open source disk imaging tool.
  • Kryoflux
    A USB-based floppy controller designed specifically for reliability, precision, and getting low-level reads suitable for software preservation. http://www.kryoflux.com/

Read

  • Brezinski, Dominique, and Tom Killalea. "Guidelines for Evidence Collection and Archiving." Request for Comments 3227. 2002. http://www.ietf.org/rfc/rfc3227.txt
    This document was designed to "provide System Administrators with guidelines on the collection and archiving of evidence relevant to...a security incident," but it provides a good summary of the main steps and considerations related to forensic acquisitions that can be useful to information professionals.
  • Carrier, Brian. File System Forensic Analysis. Boston, MA: Addison-Wesley, 2005. [Note: The appendix about The Sleuth Kit and Autopsy is now quite out of date. For more current information, see instead: http://www.sleuthkit.org/]
    "This is an advanced cookbook and reference guide for digital forensic practitioners. File System Forensic Analysis focuses on the file system and disk. The file system of a computer is where most files are stored and where most evidence is found; it also the most technically challenging part of forensic analysis. This book offers an overview and detailed knowledge of the file system and disc layout. The overview will allow an investigator to more easily find evidence, recover deleted data, and validate his tools. The cookbook section will show how to use the many open source tools for analysis, many of which Brian Carrier has developed himself."
  • Farmer, Dan, and Wietse Venema. Forensic Discovery. Upper Saddle River, NJ: Addison-Wesley, 2005. [Note: The appendix about the Coroner's Toolkit and related software is now quite out of date.  For more current information, see instead the Forensics Wiki - http://www.forensicswiki.org/.]
    "The authors draw on their extensive firsthand experience to cover everything from file systems, to memory and kernel hacks, to malware. They expose a wide variety of computer forensics myths that often stand in the way of success. Readers will find extensive examples from Solaris, FreeBSD, Linux, and Microsoft Windows, as well as practical guidance for writing one's own forensic tools."
  • Jarocki, John. "Forensics 101: Acquiring an Image with FTK Imager." June 18, 2009. http://computer-forensics.sans.org/blog/2009/06/18/forensics-101-acquiring-an-image-with-ftk-imager/

 

  • Jones, Keith J., Richard Bejtlich, and Curtis W. Rose. Real Digital Forensics: Computer Security and Incident Response. Upper Saddle River, NJ: Addison-Wesley, 2006. [See especially: "Acquiring a Forensic Duplication" (161-204), "Common Forensic Analysis Techniques" (207-246), "Forensic Duplication and Analysis of Personal Digital Assistants" (515-570), "Forensic Duplication of USB and Compact Flash Memory Devices" (571-576), "Forensic Analysis of USB and Compact Flash Memory Devices" (577-594).]
    "In this book, a team of world-class computer forensics experts walks you through six detailed, highly realistic investigations and provides a DVD with all the data you need to follow along and practice. From binary memory dumps to log files, this DVD's intrusion data was generated by attacking live systems using the same tools and methods real-world attackers use. The evidence was then captured and analyzed using the same tools the authors employ in their own investigations. This book relies heavily on open source tools, so you can perform virtually every task without investing in any commercial software. You'll investigate environments ranging from financial institutions to software companies and crimes ranging from intellectual property theft to SEC violations. As you move step by step through each investigation, you'll discover practical techniques for overcoming the challenges forensics professionals face most often."
  • Thomas, Susan, Renhart Gittens, Janette Martin, and Fran Baker. "Capturing directory structures." In Workbook on Digital Private Papers. 2007. http://www.paradigm.ac.uk/workbook/record-creators/capturing-directory-structures.html
    "Capturing the directory structure of an archive creates a record of the [original] order of digital materials accessioned by the repository. This can be achieved using screenshots, but generating a textual file allows the archivist to record all the information in one file that can be searched."

 

 


Storage Media - Managing Acquired Data

Q: What should I do with data that I've acquired from storage media (process, metadata, tools and workflow)?

Watch

  • Chan, Peter.  "Processing Born Digital Materials Using AccessData FTK at Special Collections, Stanford University Libraries."  YouTube, 14:46, posted by peterchanws, March 11, 2011. http://www.youtube.com/watch?v=hDAhbR8dyp8 
    This video covers: how to create a case in FTK, technical metadata, obsolete file formats, viewing image file thumbnails, restricted files, filters, series, bookmarks, and labels.
  • Shaw, Seth.  "Managing Storage Media: Authenticity."  YouTube, 2:01, July 2, 2012.  Posted by CDCGUNC.  February 15, 2013.  http://www.youtube.com/watch?v=Z7wzmQS5rlM 
    Seth Shaw is the Electronic Records Archivist at Duke University. He described the importance of preserving the authenticity of records acquired on storage media.
  • Shaw, Seth.  "Managing Storage Media: Resources."  YouTube, 1:00, July 2, 2012.  Posted by CDCGUNC.  February 15, 2013.    http://www.youtube.com/watch?v=cAQntgMcVhY  
    Seth Shaw is the Electronic Records Archivist at Duke University.  He shared some resources for dealing with acquired storage media.

Read

  • AIMS Working Group. "AIMS Born-Digital Collections: An Inter-Institutional Model for Stewardship." 2012. http://www2.lib.virginia.edu/aims/whitepaper/
    "The AIMS project evolved around a common need among the project partners — and most libraries and archives — to identify a methodology or continuous framework for stewarding born-digital archival materials." "The AIMS Framework was developed to define good practice in terms of archival tasks and objectives necessary for success. The Framework, as defined in the White Paper found below, presents a practical approach but also a recognition that there is no single solution for many of the issues that institutions face when dealing with born-digital collections. Instead, the AIMS project partners developed this framework as a further step towards best practice for the profession."
  • bwFla (Baden-Wuerttemberg Functional Longterm Archiving and Access) Project. http://bw-fla.uni-freiburg.de/wordpress/?page_id=7
    "The bwFla project (Baden-Wuerttemberg Functional Longterm Archiving and Access) is a two-year state sponsored project with the goal of defining and providing a practical implementation of archival workflows for the rendering of digital objects (i.e. easily accessed by users) in its original environment (i.e. application). Thereby, the project focuses on supporting the user during object INGEST to identify, provide and describe all secondary objects required as well as create necessary technical meta data for long-term ACCESS through emulation. The emulation proposed uses an INGEST workflow, which requires no further migration of other objects. The further aim of these newly developed workflows is to have them integrated into existing library and archival systems.
  • Elford, Douglas, Nicholas Del Pozo, Snezana Mihajlovic, David Pearson, Gerard Clifton, and Colin Webb. "Media Matters: Developing Processes for Preserving Digital Objects on Physical Carriers at the National Library of Australia." Paper presented at the 74th IFLA General Conference and Council, Québec, Canada, August 10-14, 2008. http://www.ifla.org/IV/ifla74/papers/084-Webb-en.pdf
    "The National Library of Australia has a relatively small but important collection of digital materials on physical carriers, including both published materials and unpublished manuscripts in digital form. To date, preservation of the Library’s physical format digital collections has been largely hand-crafted, but this approach is insufficient to deal effectively with the volume of material requiring preservation. The Digital Preservation Workflow Project aims to produce a semi-automated, scalable process for transferring data from physical carriers to preservation digital mass storage, helping to mitigate the major risks associated with the physical carriers: deterioration of the media and obsolescence of the technology required to access them. The workflow system, expected to be available to Library staff from June 2008, also aims to minimise the time required for acquisition staff to process relatively standard physical media, while remaining flexible to accommodate special cases when required. The system incorporates a range of primarily open source tools, to undertake processes including media imaging, file identification and metadata extraction. The tools are deployed as services within a service-oriented architecture, with workflow processes that use these services being coordinated within a customised system architecture utilising Java based web services. This approach provides flexibility to add or substitute tools and services as they become available and to simplify interactions with other Library systems."
  • Garfinkel, Simson L. "AFF: A New Format for Storing Hard Drive Images." Communications of the ACM 49, no. 2 (2006): 85-87. http://simson.net/clips/academic/2006.CACM.AFF.pdf
    "...we designed a new file format for our forensic work. Called the Advanced Forensics Format (AFF), this format is both open and extensible. Like the EnCase format, AFF stores the imaged disk as a series of pages or segments, allowing the image to be compressed for significant savings. Unlike EnCase, AFF allows metadata to be stored either inside the image file or in a separate, companion file. Although AFF was specifically designed for use in projects involving hundreds or thousands of disk images, it works equally well for practitioners who work with just one or two images. And in the event the disk image is corrupted, AFF internal consistency checks are designed to allow the recovery of as much image data as possible. The AFF format is unencumbered by any patents or trade secrets, and the open source implementation is distributed under a license that allows the code to be freely integrated into either open source or propriety programs."
  • Garfinkel, Simson. “Digital Forensics XML and the DFXML Toolset.” Digital Investigation 8 (2012): 161-174.
    "Digital Forensics XML (DFXML) is an XML language that enables the exchange of structured forensic information. DFXML can represent the provenance of data subject to forensic investigation, document the presence and location of file systems, files, Microsoft Windows Registry entries, JPEG EXIFs, and other technical information of interest to the forensic analyst. DFXML can also document the specific tools and processing techniques that were used to produce the results, making it possible to automatically reprocess forensic information as tools are improved. This article presents the motivation, design, and use of DFXML. It also discusses tools that have been creased that both ingest and emit DFXML files."
  • Garfinkel, Simson L. "Forensic feature extraction and cross-drive analysis." Digital Investigation 3S (2006): S71-81. http://simson.net/clips/academic/2006.DFRWS.pdf [Specifically: Sections 1-3, p.S71-75]
    "This paper introduces Forensic Feature Extraction (FFE) and Cross-Drive Analysis (CDA), two new approaches for analyzing large data sets of disk images and other forensic data. FFE uses a variety of lexigraphic techniques for extracting information from bulk data; CDA uses statistical techniques for correlating this information within a single disk image and across multiple disk images. An architecture for these techniques is presented that consists of five discrete steps: imaging, feature extraction, first-order cross-drive analysis, cross-drive correlation, and report generation. CDA was used to analyze 750 images of drives acquired on the secondary market; it automatically identified drives containing a high concentration of confidential financial records as well as clusters of drives that came from the same organization. FFE and CDA are promising techniques for prioritizing work and automatically identifying members of social networks under investigation. We believe it is likely to have other uses as well."
  • Gengenbach, Martin J. "'The Way We Do It Here': Mapping Digital Forensics Workflows in Collecting Institutions." A Master's Paper for the M.S. in L.S. degree. August 2012. http://digitalcurationexchange.org/system/files/gengenbach-forensic-workflows-2012.pdf
    "This paper presents the findings of semi-structured interviews with archivists and curators applying digital forensics tools and practices to the management of born-digital content. The interviews were designed to explore which digital forensic tools are in use, how they are implemented within a digital forensics workflow, and what further challenges and opportunities such use may present. Findings indicate that among interview participants these tools are beneficial in the capture and preservation of born-digital content, particularly with digital media such as external hard drives, and optical or floppy disks. However, interviews reveal that metadata generated from the use of such tools is not easily translated into the arrangement, description, and provision of access to born-digital content."
  • Kirschenbaum, Matthew G., Erika Farr, Kari M. Kraus, Naomi L. Nelson, Catherine Stollar Peters, Gabriela Redwine, and Doug Reside."Approaches to Managing and Collecting Born-Digital Literary Materials for Scholarly Use." College Park, MD: University of Maryland, 2009. http://mith.umd.edu/wp-content/uploads/whitepaper_HD-50346.Kirschenbaum.WP.pdf
    This white paper reports on "a series of site visits and planning meetings for personnel working with the born-digital components of three significant collections of literary material: the Salman Rushdie papers at Emory University’s Manuscripts, Archives, and Rare Books Library (MARBL), the Michael Joyce Papers (and other collections) at the Harry Ransom Humanities Research Center at The University of Texas at Austin, and the Deena Larsen Collection at the Maryland Institute for Technology in the Humanities (MITH) at the University of Maryland."
  • Lee, Christopher A., Matthew Kirschenbaum, Alexandra Chassanoff, Porter Olsen, and Kam Woods. "BitCurator: Tools and Techniques for Digital Forensics in Collecting Institutions." D-Lib Magazine 18, no. 5/6 (May/June 2012). http://www.dlib.org/dlib/may12/lee/05lee.html
    This paper introduces the BitCurator Project, which aims to incorporate digital forensics tools and methods into collecting institutions' workflows. BitCurator is a collaborative effort led by the School of Information and Library Science (SILS) at the University of North Carolina at Chapel Hill and Maryland Institute for Technology in the Humanities (MITH) at the University of Maryland. The project arose from a perceived need in the library/archives community to develop digital forensics tools with interfaces, documentation, and functionality that can support the workflows of collecting institutions. This paper describes current efforts, ongoing work, and implications for future development of forensic-based, analytic software for born-digital materials.
  • Underwood, William, Marlit Hayslett, Sheila Isbell, Sandra Laib, Scott Sherrill, and Matthew Underwood. “Advanced Decision Support for Archival Processing of Presidential Electronic Records: Final Scientific and Technical Report.” Technical Report ITTL/CSITD 09-05. October 2009. http://perpos.gtri.gatech.edu/publications/TR%2009-05-Final%20Report.pdf
    "The overall objective of this project is to develop and apply advanced information technology to decision problems that archivists at the Presidential Libraries encounter when processing electronic records. Among issues and problems to be addressed are areas responsive to national security, including automated content analysis, automatic summarization, advanced information retrieval, advanced support of decision making for access restrictions and declassification, information security, and Global Information Grid technology, which are also important research areas for the U.S. Army." "A method for automatic document type recognition and metadata extraction has been implemented and successfully tested. The method is based on the method for automatically annotating semantic categories such as person’s names, dates, and postal addresses. It extends this method by: (1) identifying about 100 types of intellectual elements of documents, (2) parsing these elements using context-free grammars defining the documentary form of document types, (3) interpreting the pragmatics of the form of the document to identify some or all of the following metadata: the chronological date, author(s), addressee(s), and topic. This metadata can be used for indexing and searching collections of records by person, organization and location names, topics, dates, author’s and addressee’s names and document types. It can also be used for automatically describing items, file units and record series."
  • Woods, Kam and Christopher A. Lee. “Acquisition and Processing of Disk Images to Further Archival Goals." In Proceedings of Archiving 2012 (Springfield, VA: Society for Imaging Science and Technology, 2012), 147-152. http://www.ils.unc.edu/callee/p147-woods.pdf
    "Disk imaging can provide significant data processing and information extraction benefits in archival ingest and preservation workflows, including more efficient automation, increased accuracy in data triage, assurance of data integrity, identifying personally identifying and sensitive information, and establishing environmental and technical context. Information located within disk images can also assist in linking digital objects to other data sources and activities such as versioning information, backups, related local and network user activity, and system logs. We examine each of these benefits and discuss the incorporation of modern digital forensics technologies into archival workflows."
  • Woods, Kam, Christopher Lee, and Sunitha Misra. “Automated Analysis and Visualization of Disk Images and File Systems for Preservation.” In Proceedings of Archiving 2013 (Springfield, VA: Society for Imaging Science and Technology, 2013), 239-244.
    "We present work on the analysis and visualization of disk images and associated filesystems using open source digital forensics software as part of the BitCurator project. We describe methods and software designed to assist in the acquisition of forensically-packaged disk images, analysis of the filesystems they contain, and associated triage tasks in archival workflows. We use open source forensics tools including fiwalk, bulk extractor, and The Sleuth Kit to produce technical metadata. These metadata are then reprocessed to aid in triage tasks, serve as documentation of digital collections, and to support a range of preservation decisions."
