Text and data mining (TDM) is a process that analyzes large volumes of text or other data to extract relevant information or derive new insights. Depending on the discipline and the type of data, the process may also be referred to as content mining, text analytics, or distant reading. TDM frequently draws on machine learning and natural language processing techniques.
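As a rough illustration of what "extracting information from text" can mean at its simplest, the sketch below counts word frequencies across a tiny corpus. The three sample sentences, the stop-word list, and the function names are all invented for this example; none of them come from the platforms described on this page, and real projects typically rely on dedicated libraries and far larger data sets.

```python
import re
from collections import Counter

# Hypothetical three-document corpus, invented purely for illustration.
documents = [
    "The library licenses newspapers and journals for research.",
    "Researchers mine newspapers to study language change.",
    "Language change is visible across decades of journals.",
]

# A tiny hand-picked stop-word list (an assumption; real projects use fuller lists).
STOP_WORDS = {"the", "and", "for", "to", "is", "of", "across"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def term_frequencies(docs):
    """Count how often each non-stop-word appears across all documents."""
    counts = Counter()
    for doc in docs:
        counts.update(t for t in tokenize(doc) if t not in STOP_WORDS)
    return counts


frequencies = term_frequencies(documents)
print(frequencies.most_common(5))
```

Word counts like these are only a starting point; methods such as topic modeling or named entity recognition build on the same basic idea of turning text into data that can be counted and compared.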
The UW-Madison Libraries provide access to text and data mining platforms through licensing or subscription. Users can access or build data sets and use the platforms' algorithms and tools for analysis. Please see the description of each platform below for more information.
If you are interested in mining library-licensed resources, please reach out to us through our contact form before your project starts. Please note that unauthorized content mining of our licensed resources can cause a provider to shut down access to those resources for the entire UW-Madison community, not just for the IP address of the user involved. Please see our Responsible Use of Licensed Electronic Resources page for more information.
The English Corpora site contains links to numerous free, searchable online corpora. Included are the Corpus of Contemporary American English (COCA); the Corpus of Historical American English (COHA); the British National Corpus; the TIME Corpus of American English; the Google Books (American English) Corpus; Corpus del Español; and Corpus do Português. Update schedules vary by corpus.
Gale Digital Scholar Lab is an online tool for collecting data sets composed of content from the UW-Madison Libraries’ subscriptions to Gale Primary Sources databases. These data sets can be analyzed using the text analysis and data visualization tools built into the Digital Scholar Lab. Digital humanities analysis methods include Named Entity Recognition, Topic Modelling, Parts of Speech, and more. Learn how to access the lab.
HathiTrust is a partnership of academic and research institutions, offering a collection of millions of titles digitized from libraries around the world. The HathiTrust Research Center (HTRC) enables large-scale analysis of works in the HathiTrust Digital Library (HTDL) to facilitate non-profit research and educational uses of the collection. Learn how to access the HathiTrust Research Center.
TDM Studio is an online text and data mining tool for research, teaching, and learning. It allows users to collect data sets from content available through the UW-Madison Libraries’ subscription to ProQuest. Available content includes current and historical newspapers, dissertations and theses, scholarly journals, and primary sources from collections in the fields of science, technology, medicine, public policy, history, and literature. Learn how to access TDM Studio.