Storage Media - Identifying and Redacting Sensitive Information

Q. How can I identify and redact sensitive information?

Born-digital materials can include various forms of sensitive information (e.g., credit card numbers, account information). Information professionals must take measures to ensure that information known to be sensitive is not disclosed to the wrong individuals.

Take action

Consider:

  • Personal data
  • Sensitive data
  • Confidential data
  • Informed consent
  • Anonymity
  • Copyright

Explore Tools

  • BitCurator Environment. http://wiki.bitcurator.net

    The BitCurator environment provides a variety of tools that can be used to identify, extract, and redact sensitive information (including Bulk Extractor and iredact.py, listed below).

  • Bulk Extractor. Simson Garfinkel. http://www.forensicswiki.org/wiki/Bulk_extractor

    Bulk Extractor "scans a disk image, a file, or a directory of files and extracts useful information without parsing the file system or file system structures. The results are stored in feature files that can be easily inspected, parsed, or processed with automated tools. bulk_extractor also creates histograms of features that it finds, as features that are more common tend to be more important. bulk_extractor is distinguished from other forensic tools by its speed and thoroughness. Because it ignores file system structure, bulk_extractor can process different parts of the disk in parallel." It "automatically detects, decompresses, and recursively re-processes compressed data that is compressed with a variety of algorithms." Because it ignores the filesystem on a drive, Bulk Extractor "can be used to process any digital media. We have used the program to process hard drives, SSDs, optical media, camera cards, cell phones, network packet dumps, and other kinds of digital information."

  • DBAN (Darik's Boot and Nuke). http://www.dban.org/

    "DBAN is a self-contained boot disk that automatically deletes the contents of any hard disk that it can detect. This method can help prevent identity theft before recycling a computer. It is also a solution commonly used to remove viruses and spyware from Microsoft Windows installations. DBAN prevents all known techniques of hard disk forensic analysis. It does not provide users with a proof of erasure, such as an audit-ready erasure report."

  • Firefly (and Firefly4Mac). University of Illinois. http://www.cites.illinois.edu/ssnprogram/firefly/

    Scans drives for social security numbers (SSNs) and credit card numbers.

  • iredact.py. Simson Garfinkel. http://www.forensicswiki.org/wiki/Fiwalk

    A program written in Python that "allows the removal of specific files matching specific criteria."
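
To make concrete the kind of pattern scanning that tools such as Firefly and Bulk Extractor perform, the following is a minimal Python sketch. It is not the method used by any tool listed above: the regular expressions, the function names (luhn_ok, find_sensitive, redact), and the replacement labels are all assumptions chosen for illustration. It looks for SSN-shaped strings and for 13 to 19 digit runs that pass the Luhn checksum used by payment card numbers, then replaces each hit with a placeholder.

```python
import re

# Illustrative patterns only (assumptions, not taken from any listed tool).
SSN_RE = re.compile(r"\b(\d{3})-(\d{2})-(\d{4})\b")
CARD_RE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")  # 13 to 19 digits, optional separators


def luhn_ok(digits: str) -> bool:
    """Return True if a string of digits passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_sensitive(text: str):
    """Return a list of (kind, offset, matched_text) tuples."""
    hits = []
    for m in SSN_RE.finditer(text):
        area, group, serial = m.groups()
        # Skip number ranges that are never assigned as SSNs.
        if (area not in ("000", "666") and not area.startswith("9")
                and group != "00" and serial != "0000"):
            hits.append(("ssn", m.start(), m.group()))
    for m in CARD_RE.finditer(text):
        digits = re.sub(r"[ -]", "", m.group())
        if luhn_ok(digits):  # the checksum filters out most random digit runs
            hits.append(("card", m.start(), m.group()))
    return hits


def redact(text: str) -> str:
    """Replace every hit with a placeholder, working from the end backwards."""
    out = text
    for kind, start, value in sorted(find_sensitive(text),
                                     key=lambda h: h[1], reverse=True):
        out = out[:start] + "[REDACTED " + kind.upper() + "]" + out[start + len(value):]
    return out
```

A sketch like this only works on text that has already been extracted; it will miss numbers stored in compressed or binary formats, which is one reason to prefer purpose-built tools for real collections. Any hits should be reviewed by a person before material is released or altered.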

Read

  • Beek, Christiaan. "Introduction to File Carving." McAfee. 2011. http://www.mcafee.com/us/resources/white-papers/foundstone/wp-intro-to-file-carving.pdf

    "'File carving,' or sometimes simply 'carving,' is the process of extracting a collection of data from a larger data set. Data carving techniques frequently occur during a digital investigation when the unallocated file system space is analyzed to extract files. The files are 'carved' from the unallocated space using file type-specific header and footer values. File system structures are not used during the process. File carving is a powerful technique for recovering files and fragments of files when directory entries are corrupt or missing. The block of data is searched block by block for residual data matching the file type-specific header and footer values."

  • Byers, Simon. "Information Leakage Caused by Hidden Data in Published Documents." IEEE Security and Privacy 2, no. 2 (2004): 23-27.

    "This article demonstrates mining for hidden text in published data and concludes that user behavior - in combination with default program settings - creates an uncomfortable state of affairs for Microsoft Word users concerned about information security. The article also presents some countermeasures."

  • Cook, Timothy. "A Regular Expression Search Primer for Forensic Analysts." SANS Institute, 2012. http://www.sans.org/reading_room/whitepapers/forensics/regular-expression-search-primer-forensic-analysts_33929

    "Often forensic texts and articles assume a level of experience and comfort with Linux command line string searching and text manipulation that a reader does not possess. This assumption tends to leave the reader to their own devices to puzzle out how to locate and extract specific string content from files. The focus of this paper is to introduce the reader to Linux string search and text manipulation commands and provide specific use cases and search patterns that will be of use to Forensic Analysts. The intent of this paper is to serve as an introduction to regular expressions and some Linux commands that can be used to locate and extract text for individuals who either do not have Linux command line experience or who use the Linux command line infrequently and can benefit from a refresher."

  • Farmer, Dan, and Wietse Venema. "Persistence of deleted file information." In Forensic Discovery. Upper Saddle River, NJ: Addison-Wesley, 2005. http://www.porcupine.org/forensics/forensic-discovery/chapter7.html

    "In this chapter we study how deleted file information can escape destruction intact for months or even years, and how deleted file attribute information can provide insight into past system activity. We examine several systems and discover how well past activity can be preserved in unallocated disk space. At the end of the chapter we explain why deleted file information can be more persistent than ordinary file information."

  • Garfinkel, Simson L. "Forensic feature extraction and cross-drive analysis." Digital Investigation 3S (2006): S71-81. http://simson.net/clips/academic/2006.DFRWS.pdf [Specifically: Sections 1-3, p.S71-75]

    "This paper introduces Forensic Feature Extraction (FFE) and Cross-Drive Analysis (CDA), two new approaches for analyzing large data sets of disk images and other forensic data. FFE uses a variety of lexigraphic techniques for extracting information from bulk data; CDA uses statistical techniques for correlating this information within a single disk image and across multiple disk images. An architecture for these techniques is presented that consists of five discrete steps: imaging, feature extraction, first-order cross-drive analysis, cross-drive correlation, and report generation. CDA was used to analyze 750 images of drives acquired on the secondary market; it automatically identified drives containing a high concentration of confidential financial records as well as clusters of drives that came from the same organization. FFE and CDA are promising techniques for prioritizing work and automatically identifying members of social networks under investigation. We believe it is likely to have other uses as well."

  • Garfinkel, Simson L., and James Migletz. "The New XML Office Document Files: Implications for Forensics." 2009. http://simson.net/clips/academic/2009.IEEE.DOCX.pdf

    "Two new office document file formats (Office Open XML and OpenDocument Format) make it easier to glean time stamps and unique document identifiers while also improving opportunities for file carving and data recovery."

  • Garfinkel, Simson L., and Abhi Shelat. "Remembrance of Data Passed: A Study of Disk Sanitization Practices." IEEE Security and Privacy 1 (2003): 17-27. http://cdn.computerscience1.net/2005/fall/lectures/8/articles8.pdf

    "Many discarded hard drives contain information that is both confidential and recoverable, as the authors’ own experiment shows. The availability of this information is little publicized, but awareness of it will surely spread."

  • Jones, Jeffrey R. "Document Metadata and Computer Forensics." James Madison University, Department of Computer Science, 2006. http://www.infosec.jmu.edu/reports/jmu-infosec-tr-2006-003.pdf

    "Metadata contained within documents serves a valid purpose in many circumstances, such as facilitating the collaboration among a group of people. However, many are not aware of the type of information stored with their documents, spreadsheets, and presentations. Due diligence is required by responsible users to ensure that sensitive information is not leaked to third-parties. Until then, forensic investigators could have access to a plethora of hidden document information. This paper examines how metadata is used in PDF documents and documents, spreadsheets, and presentations created in Microsoft Office and OpenOffice.org. Several instances are examined where metadata has led to the discovery of hidden information. This paper also shows how metadata is stored in documents, spreadsheets, and presentations created in the aforementioned applications. Finally, this paper will test and discuss the functionality of several tools available to users and investigators that test for the presence of metadata."

  • "The Top Ten Hidden Data Threats." ManTech International. http://docdet.mantech.com/docdet/Presskit/The%20Top%20Ten%20Hidden%20Data%20Threats.pdf

    Illustrates common cases of accidentally disclosing "hidden data" within files.

  • Wright, Craig, Dave Kleiman, and Shyaam Sundhar. "Overwriting Hard Drive Data: The Great Wiping Controversy." In Information Systems Security: 4th International Conference, ICISS 2008, Hyderabad, India, December 16-20, 2008: Proceedings, edited by R. Sekar and A.K. Pujari, 243–57. Berlin: Springer, 2008. http://www.vidarholen.net/~vidar/overwriting_hard_drive_data.pdf

    "Often we hear controversial opinions in digital forensics on the required or desired number of passes to utilize for properly overwriting, sometimes referred to as wiping or erasing, a modern hard drive. The controversy has caused much misconception, with persons commonly quoting that data can be recovered if it has only been overwritten once or twice. Moreover, referencing that it actually takes up to ten, and even as many as 35 (referred to as the Gutmann scheme because of the 1996 Secure Deletion of Data from Magnetic and Solid-State Memory published paper by Peter Gutmann) passes to securely overwrite the previous data. One of the chief controversies is that if a head positioning system is not exact enough, new data written to a drive may not be written back to the precise location of the original data. We demonstrate that the controversy surrounding this topic is unfounded."
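
The header and footer carving technique described in the Beek paper above can be sketched in a few lines of Python. This is a simplified illustration, not the procedure of any tool or paper listed here: the function name carve_jpegs and the max_size limit are assumptions, and the sketch handles only JPEG files using their standard start (FF D8 FF) and end (FF D9) markers.

```python
JPEG_HEADER = b"\xff\xd8\xff"  # JPEG start-of-image marker plus next marker byte
JPEG_FOOTER = b"\xff\xd9"      # JPEG end-of-image marker


def carve_jpegs(data: bytes, max_size: int = 10 * 1024 * 1024):
    """Return (offset, carved_bytes) for each header-to-footer span in data.

    max_size is a hypothetical safety limit that discards implausibly
    large spans, such as a header paired with a distant, unrelated footer.
    """
    carved = []
    pos = 0
    while True:
        start = data.find(JPEG_HEADER, pos)
        if start == -1:
            break
        end = data.find(JPEG_FOOTER, start + len(JPEG_HEADER))
        if end == -1:
            break  # no footer remains anywhere after this header
        end += len(JPEG_FOOTER)
        if end - start <= max_size:
            carved.append((start, data[start:end]))
            pos = end        # continue searching after the carved span
        else:
            pos = start + 1  # skip this header and keep looking
    return carved
```

In practice the input would be the raw bytes of a disk image, read in manageable chunks. The sketch ignores fragmented files and images with embedded thumbnails (which carry their own end marker), so carved output may be truncated or corrupt. It also shows why deleted material can still surface during processing and may need review for sensitive content.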
