Recommended File Formats

Introduction

The University of Wisconsin Digital Collections Center (UWDCC) is committed to ensuring the long-term accessibility and sustainability of its collections. Towards that end, we have adopted a number of recommended formats for the purposes of submitting, processing, distributing, and preserving digital resources. We encourage submitters to provide resources in these formats, and we will internally process, distribute, and preserve our resources in these formats.

These recommended formats are derived from the Library of Congress Recommended Formats Statement 2019-2020.

Text and documents

For text submissions, we recommend the following page layout formats:

  1. XML in TEI P5 format
  2. PDF/A (ISO 19005)
  3. PDF (highest quality available, with features such as searchable text, embedded fonts, lossless compression, high resolution images, device-independent specifications of colorspace, content tagging; includes document formats such as PDF/X).

Plain text files should be submitted in UTF-8 encoding.
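
As an illustration only, a submitter could confirm that a plain text file is valid UTF-8 before sending it. The sketch below uses the Python standard library; the function names are hypothetical and are not part of any UWDCC tooling.

```python
from pathlib import Path


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


def file_is_utf8(path: str) -> bool:
    """Read a file in binary mode and test whether it is valid UTF-8."""
    return is_valid_utf8(Path(path).read_bytes())
```

Note that pure ASCII files also pass this check, since ASCII is a subset of UTF-8.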

Still Images

UWDC follows the recommendations of the Federal Agencies Digital Guidelines Initiative (FADGI) for content digitized in-house. For image submissions from project partners, we request files no smaller than 1500 pixels on the longest side, at 300 dpi for non-textual content; we prefer 600 dpi for textual (bitonal) content. We prefer color images. We accept 16-bit and 8-bit depth, with Adobe RGB as the preferred color space. Currently we downsample to 8-bit, but we may change this practice and keep 16-bit preservation masters. We also accept 8-bit grayscale images.
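
The size and resolution requests above can be expressed as a small check. This is a minimal sketch, assuming the pixel dimensions and dpi have already been read from the file by some other tool; the function name and its messages are hypothetical.

```python
def check_image(width_px: int, height_px: int, dpi: int, textual: bool = False) -> list[str]:
    """Return notes on where an image falls short of the stated requests."""
    notes = []
    # Requested minimum: 1500 pixels on the longest side.
    if max(width_px, height_px) < 1500:
        notes.append("longest side is under 1500 pixels")
    if textual:
        # 600 dpi is a preference for textual (bitonal) content.
        if dpi < 600:
            notes.append("below the preferred 600 dpi for textual content")
    elif dpi < 300:
        # 300 dpi is requested for non-textual content.
        notes.append("below the requested 300 dpi for non-textual content")
    return notes
```

An empty list means the image meets the stated size and resolution requests; bit depth and color space are not covered by this sketch.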

These are the file formats we prefer, in descending order of preference:

  1. TIFF (lossless)
  2. JPEG 2000 (lossless)
  3. PNG (lossless)
  4. JPEG

Audio

We prefer the final production version of audio resources over pre-production versions; files in native sampling frequency rather than up-sampled frequency; uncompressed files rather than compressed; with embedded metadata rather than without (metadata TBD). If a file is compressed, we will only accept standard compression schemes.

We prefer the following formats in order of preference:

  1. Highest native resolution PCM WAVE file of final version produced (44.1 kHz / 16-bit or higher) (lossless)
  2. FLAC (lossless) (44.1 kHz / 16-bit or higher)
  3. MP3 (minimum of 128 kbps for recorded speech; minimum of 192 kbps for musical recordings)
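
The minimums listed above could be checked along the following lines. This is a sketch under stated assumptions: it relies on Python's standard `wave` module, which reads PCM WAVE files only, and it takes the MP3 bitrate as a number supplied by the caller rather than parsing the file. The function names are hypothetical.

```python
import wave


def wav_meets_minimum(path: str, min_rate_hz: int = 44_100, min_bits: int = 16) -> bool:
    """Check a PCM WAVE file against the 44.1 kHz / 16-bit minimum."""
    with wave.open(path, "rb") as wav:
        rate_hz = wav.getframerate()
        bits = wav.getsampwidth() * 8  # sample width is reported in bytes
    return rate_hz >= min_rate_hz and bits >= min_bits


def mp3_bitrate_ok(kbps: int, musical: bool) -> bool:
    """128 kbps minimum for recorded speech, 192 kbps for musical recordings."""
    return kbps >= (192 if musical else 128)
```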

Video

We prefer the final production version of video resources over pre-production versions, with the original production resolution and frame rate (e.g., 1080p24 or 720p60).

File-based, in order of preference:

  1. 10-bit MKV / FFV1 files
    • Video Stream: 4:2:2 YUV, 29.97 fps, 720 x 486 or higher (for high-def video, for example), 4:3 aspect ratio
    • Audio Stream: 48 kHz / 24-bit PCM
    • Packaging data XML files (asset map, packing list, volume index)
  2. MP4 / H.264
    • Video Stream: 4:2:2 YUV, 29.97 fps, 720 x 486, 4:3 aspect ratio
    • Audio Stream: 48 kHz / AAC (Low Complexity) compressed at 256 kbps
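
For readers who produce these files with FFmpeg, the stream settings above map roughly onto the command lines built below. This is a sketch only: it assumes an FFmpeg build that includes the `ffv1`, `libx264`, and native `aac` encoders, it leaves resolution and frame rate untouched so the source values carry through, and it does not describe UWDCC's own workflow. The function names are hypothetical.

```python
def ffv1_mkv_command(src: str, dst: str) -> list[str]:
    """Argument list for a 10-bit 4:2:2 FFV1 video with 24-bit PCM audio."""
    return [
        "ffmpeg", "-i", src,
        "-c:v", "ffv1", "-level", "3",   # FFV1 version 3
        "-pix_fmt", "yuv422p10le",       # 4:2:2 YUV, 10 bits per sample
        "-c:a", "pcm_s24le",             # 24-bit PCM audio
        "-ar", "48000",                  # 48 kHz audio sample rate
        dst,
    ]


def h264_mp4_command(src: str, dst: str) -> list[str]:
    """Argument list for a 4:2:2 H.264 video with 256 kbps AAC audio."""
    return [
        "ffmpeg", "-i", src,
        "-c:v", "libx264",
        "-pix_fmt", "yuv422p",           # 4:2:2 YUV, 8 bits per sample
        "-c:a", "aac", "-b:a", "256k",   # FFmpeg's native AAC encoder, 256 kbps
        "-ar", "48000",
        dst,
    ]
```

Each list can be passed to `subprocess.run`; the destination name should end in `.mkv` or `.mp4` respectively so that FFmpeg selects the matching container.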

Websites

The following is substantially copied from the Library of Congress Recommended Formats for websites.

Websites are generally crawled and content is downloaded for packaging and preservation using a number of tools designed for this purpose. Resources that may be helpful to content creators can be found in the Library of Congress Guide to Creating Preservable Websites (https://www.loc.gov/programs/web-archiving/for-site-owners/creating-preservable-websites/).

Website creators can improve the archivability of web content by following best practices such as:

  • Websites should not contain measures (such as content behind logins or only accessible through search functions) that control access to or prevent capture of the digital work.
  • Robots.txt restrictions should be set so as not to block crawlers from capturing important content, such as images and style sheets, which allow for replay of the site as it looked at the time of capture.
  • Tools currently available cannot capture all web content, so certain types of web content may not be preservable through web capture at this time. These include:
    • Multi-media rich content
    • Streaming media
    • Deep web content
    • Databases
    • Dynamically generated content (server-side or client-side JavaScript)
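
A site owner could test the robots.txt recommendation above with Python's standard `urllib.robotparser` module. The sketch below reports which of a set of URLs a crawler would be blocked from fetching; the function name and the example URLs are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def blocked_urls(robots_txt: str, urls: list[str], user_agent: str = "*") -> list[str]:
    """Return the URLs that the given robots.txt text disallows for a crawler."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(user_agent, url)]


# Example: a rule that hides the style sheets a crawler needs for faithful replay.
rules = "User-agent: *\nDisallow: /css/\n"
pages = [
    "https://example.org/css/site.css",
    "https://example.org/images/logo.png",
]
print(blocked_urls(rules, pages))
```

Any style sheet or image that appears in the result would be missing from the capture and could affect how the archived site replays.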

We use the Web ARChive (WARC) format to package and preserve websites.

  • Metadata
    1. Refer to the WARC specification (ISO 28500) for mandatory and recommended metadata fields
    2. When displaying archived content, the following should be clearly indicated:
      1. archiving institution,
      2. date and time of capture,
      3. statements about functionality within the archive to distinguish from the live site
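
To make the metadata point concrete, the sketch below assembles a single `warcinfo` record by hand, showing the four named fields the WARC specification requires on every record (WARC-Type, WARC-Record-ID, WARC-Date, Content-Length). It is an illustration under stated assumptions, not the tooling we use: real captures are written by crawling software, and the function name and payload here are hypothetical.

```python
import uuid
from datetime import datetime, timezone

# Named fields that the WARC specification marks as mandatory for all records.
MANDATORY_FIELDS = ("WARC-Type", "WARC-Record-ID", "WARC-Date", "Content-Length")


def build_warcinfo_record(payload: bytes, captured: datetime) -> bytes:
    """Serialize one warcinfo record; `captured` must be timezone-aware."""
    headers = {
        "WARC-Type": "warcinfo",
        "WARC-Record-ID": f"<urn:uuid:{uuid.uuid4()}>",
        # WARC-Date is a UTC timestamp.
        "WARC-Date": captured.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "Content-Type": "application/warc-fields",
        "Content-Length": str(len(payload)),
    }
    head = "WARC/1.1\r\n"
    head += "".join(f"{name}: {value}\r\n" for name, value in headers.items())
    head += "\r\n"  # blank line separates the header from the content block
    # Each record ends with two CRLF pairs after the content block.
    return head.encode("utf-8") + payload + b"\r\n\r\n"
```

The WARC-Date value recorded here is the kind of capture date and time that should be shown to users when archived content is displayed.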