Text and Data Mining Platforms

Text and data mining (TDM) is a process that analyzes large volumes of text or other data to extract relevant information or derive new insights. Depending on the discipline and the type of data, the process may also be referred to as content mining, text analytics, distant reading, etc. TDM frequently relies on natural language processing and machine learning techniques.
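As a small illustration of what "mining" text can mean in practice, the sketch below counts word frequencies across a collection of documents. It is not tied to any platform listed on this page; the function name, the stopword list, and the two-sentence corpus are hypothetical and chosen only for demonstration.

```python
import re
from collections import Counter


def term_frequencies(documents, stopwords=frozenset()):
    """Count how often each word appears across a collection of documents."""
    counts = Counter()
    for text in documents:
        # Lowercase the text and keep runs of letters or apostrophes as tokens.
        tokens = re.findall(r"[a-z']+", text.lower())
        # Skip very common words that carry little meaning on their own.
        counts.update(t for t in tokens if t not in stopwords)
    return counts


# Hypothetical two-document corpus, for illustration only.
corpus = [
    "The library licenses newspapers and journals.",
    "Researchers mine newspapers for patterns.",
]
freqs = term_frequencies(corpus, stopwords={"the", "and", "for"})

# Print the most frequent terms first.
for word, count in freqs.most_common(3):
    print(word, count)
```

Real projects usually work with far larger collections and more capable tooling, but the basic pattern of tokenizing, filtering, and counting is a common starting point.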

The UW-Madison Libraries provides access to text and data mining platforms through licensing or subscription. Users can access or build data sets and use the algorithms and tools these platforms offer for analysis. Please see the description of each platform below for more information.

Policies for Mining Licensed Content

If you are interested in mining library-licensed resources, please reach out to us through our contact form before your project starts. Please note that unauthorized content mining of our licensed resources can lead the vendor to shut down access to those resources for the entire UW-Madison community, not only for the IP address of the user involved. Please see our Responsible Use of Licensed Electronic Resources page for more information.

Best Practices & What to Consider Before Starting a Project

  • Check the terms of use and license – When accessing data from any source, it is always important to make sure you understand the source’s terms of use and license. The Libraries acquires materials and data sets from a variety of sources, or vendors. Each vendor is different – be sure to check what actions related to text and data mining are allowed.
    • Can the text or data be downloaded?
    • How much data can be accessed or downloaded? At a time? Total?
    • What analysis techniques may be applied to it?
    • What can be shared from the data set or your work?
  • Check our existing datasets, tools, and resources to see if a tool we already license can support your project.
  • Time and cost – Some vendors may require individual project licenses for text and data mining and may charge for access to their datasets. Working with a vendor can be slow, so reach out as early as possible.
  • Open access alternatives may be available. Depending on the types of materials sought for analysis, there may be open datasets available that are not in our library databases. For assistance with finding alternatives, reach out to your subject librarian or Research Data Services.

Text and Data Mining Platforms Available at UW-Madison Libraries

The English Corpora

The English Corpora site contains links to numerous free and searchable online corpora. Included are: the Corpus of Contemporary American English (COCA); the Corpus of Historical American English (COHA); the British National Corpus; the TIME Corpus of American English; the Google Books (American English) Corpus; Corpus del Español; and Corpus do Português. Update schedules vary by corpus.

Gale Digital Scholar Lab

Gale Digital Scholar Lab is an online tool for building data sets from content in the UW-Madison Libraries’ subscriptions to Gale Primary Sources databases. These data sets can be analyzed using the text analysis and data visualization tools built into the Digital Scholar Lab. Digital humanities analysis methods include Named Entity Recognition, Topic Modelling, Parts of Speech, and more. Learn how to access the lab.

HathiTrust Research Center

HathiTrust is a partnership of academic and research institutions, offering a collection of millions of titles digitized from libraries around the world. The HathiTrust Research Center (HTRC) enables large-scale analysis of works in the HathiTrust Digital Library (HTDL) to facilitate non-profit research and educational uses of the collection. Learn how to access the HathiTrust Research Center.

TDM Studio

TDM Studio is an online text and data mining tool for research, teaching and learning. It allows users to collect datasets from content available through UW-Madison Libraries’ subscription to ProQuest. Content available includes current and historical newspapers, dissertations and theses, scholarly journals, and primary sources from collections in the fields of science, technology, medicine, public policy, history and literature. Learn how to access TDM Studio.