What file formats will I need to be familiar with?

Digital data can come in hundreds of different file formats. Being familiar with as many of them as you can will help you manage the data sets in your care.

 

Take action

  • Determine what formats you will be working with
  • Become familiar with digital data file formats, such as the formats listed below by type of data

Quantitative tabular data with extensive metadata: a dataset with variable labels, code labels, and defined missing values, in addition to the matrix of data

  • SPSS portable format (.por)
  • delimited text and command (‘setup’) file (SPSS, Stata, SAS, etc.) containing metadata information
  • some structured text or mark-up file containing metadata information, e.g. DDI XML file

Quantitative tabular data with minimal metadata: a matrix of data with or without column headings or variable names, but no other metadata or labelling

  • comma-separated values (CSV) file (.csv)
  • tab-delimited file (.tab)
  • delimited text of a given character set, with SQL data definition statements where appropriate

Geospatial data: vector and raster data

  • ESRI Shapefile (essential: .shp, .shx, .dbf ; optional: .prj, .sbx, .sbn)
  • geo-referenced TIFF (.tif, .tfw)
  • CAD data (.dwg)
  • tabular GIS attribute data
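Because a Shapefile is a set of sibling files rather than a single file, it can help to confirm that the essential components listed above are all present before accepting a deposit. The following is a minimal sketch, assuming the components share one base name in one directory; the function name check_shapefile is hypothetical and not part of any GIS library.

```python
from pathlib import Path

# Component extensions as listed above for the ESRI Shapefile format.
ESSENTIAL = (".shp", ".shx", ".dbf")
OPTIONAL = (".prj", ".sbx", ".sbn")


def check_shapefile(shp_path):
    """Return (missing_essential, present_optional) for one shapefile.

    Hypothetical helper: it only looks at file names and does not
    open or validate the contents of any component.
    """
    base = Path(shp_path)
    # Collect the extensions of all sibling files with the same base name.
    present = {
        p.suffix.lower()
        for p in base.parent.iterdir()
        if p.is_file() and p.stem == base.stem
    }
    missing = [ext for ext in ESSENTIAL if ext not in present]
    optional = [ext for ext in OPTIONAL if ext in present]
    return missing, optional
```

Calling check_shapefile("data/roads.shp") on a directory that holds roads.shp and roads.dbf but no roads.shx would report ".shx" as missing. A name-based check like this does not replace format validation; it only catches incomplete transfers early.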

Qualitative data: textual

  • eXtensible Mark-up Language (XML) text according to an appropriate Document Type Definition (DTD) or schema (.xml)
  • Rich Text Format (.rtf)
  • plain text data, ASCII (.txt)

Digital image data

  • TIFF version 6 uncompressed (.tif)

Digital audio data

  • Free Lossless Audio Codec (FLAC) (.flac)

Digital video data

  • MPEG-4 (.mp4)
  • motion JPEG 2000 (.jp2)

Documentation

  • Rich Text Format (.rtf)
  • PDF/A or PDF (.pdf)
  • OpenDocument Text (.odt)
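
One way to start on the first action item, determining what formats you will be working with, is to count the file extensions found in a data set. The sketch below assumes the data set is a directory tree on disk; the function name inventory_formats is hypothetical. An extension is only a hint about a file's format, so the result is a starting point for review, not an identification.

```python
from collections import Counter
from pathlib import Path


def inventory_formats(root):
    """Count files under `root` by lowercased extension.

    Hypothetical helper: relies on file names only and does not
    inspect file contents.
    """
    counts = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            # Files without an extension are grouped under one label.
            counts[path.suffix.lower() or "(no extension)"] += 1
    return counts


if __name__ == "__main__":
    # Print the most common extensions first for the current directory.
    for ext, n in inventory_formats(".").most_common():
        print(f"{ext}\t{n}")
```

Extensions that do not appear in the lists above are candidates for closer inspection with a format identification tool.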

 

Review use cases

  • Arc/Info Binary Coverage Format Analysis.  Last updated June 14, 2006.  http://avce00.maptools.org/docs/v7_bin_cover.html 
    This is an attempt to document the binary vector coverage files used by Arc/Info V7.x for Unix and Windows NT.
  • Arc/Info Export (E00) Format Analysis.  Last updated February 24, 2000.  http://avce00.maptools.org/docs/v7_e00_cover.html 
    This is an updated version of the (world famous) "ANALYSIS OF ARC EXPORT FILE FORMAT FOR ARC/INFO (REV 6.1.1)."
  • JHOVE - JSTOR/Harvard Object Validation Environment  http://hul.harvard.edu/jhove/ 
    JHOVE provides functions to perform format-specific identification, validation, and characterization of digital objects.
  • PRONOM: The Technical Registry http://www.nationalarchives.gov.uk/PRONOM/Default.aspx

    PRONOM is a resource for anyone requiring impartial and definitive information about the file formats, software products and other technical components required to support long-term access to electronic records and other digital objects of cultural, historical or business value.

 

Read

  • Abrams, Stephen.  "File Formats."  Digital Curation Centre, October 2007.  http://www.dcc.ac.uk/sites/default/files/documents/resource/curation-manual/chapters/file-formats/file-formats.pdf   
    The DCC Digital Curation Manual instalments provide detailed and practical information aimed at digital curation practitioners. They are designed to assist data creators, curators and re-users to better understand and address the challenges they face and to fulfil the roles they play in creating, managing, and preserving digital information over time.  Each instalment will place the topic on which it is focused in the context of digital curation by providing an introduction to the subject, case studies, and guidelines for best practice.
  • Van den Eynden, Veerle, Louise Corti, et al.  "Managing and sharing data."  Colchester, UK Data Archive, 2011.  http://www.data-archive.ac.uk/media/2894/managingsharing.pdf

    Section on Formatting Your Data (pages 11-13) lists recommended file formats.

  • Florida Center for Library Automation.  "Recommended Data Formats for Preservation Purposes."  http://fclaweb.fcla.edu/uploads/recFormats.pdf 
    This table is intended to help Florida university administrators develop guidelines for preparing and submitting files to the Florida Digital Archive.
  • Rog, Judith and Carolina van Wijk.  "Evaluating File Formats for Long-term Preservation."  National Library of the Netherlands, 2007.  http://www.kb.nl/sites/default/files/docs/KB_file_format_evaluation_method_27022008.pdf

    Describes the quantifiable file format risk assessment method, which can be used to define digital preservation strategies for specific file formats, and intends to inspire other cultural heritage institutions to define their own quantifiable file format evaluation method.

  • Arms, Carolyn and Carl Fleischhauer.  "Digital Formats: Factors for Sustainability, Functionality and Quality."  Proceedings Society for Imaging Science and Technology, 2005.  http://memory.loc.gov/ammem/techdocs/digform/Formats_IST05_paper.pdf 
    The Library of Congress is drafting a decision-support framework pertaining to the preservation of digital content.  The framework is presented through a Web site that identifies and documents digital content formats that are promising (or unpromising) for long-term sustainability, together with some explanatory essays.
  • DataONE.  "Document and store data using stable file formats."  http://www.dataone.org/best-practices/document-and-store-data-using-stable-file-formats

    Outlines best practice for file formats.

  • Lord, Philip and Alison Macdonald.  "e-Science Curation Report: Data curation for e-Science in the UK: an audit to establish requirements for future curation and provision."  Joint Information Systems Committee, 2003.  http://www.jisc.ac.uk/uploaded_documents/e-ScienceReportFinal.pdf 

    This study examined the current provision and future needs of curation of primary research data in the UK, particularly within the e-Science context.  It summarises the strategic and policy analyses and outlines proposals for the organisational structuring of curation provision and provides a table showing which recommendations address the findings.  Pages 31-34 include section 4.10 Heterogeneity and categories of data.

 

Last updated on 09/27/13, 2:52 pm by tlchristian

 
