Web of Science CHTC Tutorial

What follows is a guide to deploying a code base developed by the UW-Madison Library Technology Group (LTG) for researchers interested in performing citation analysis on the Clarivate Web of Science (WOS) dataset. The dataset contains millions of article records and billions of cited references within those articles. That size makes large-scale analysis of citation patterns possible, but it also poses challenges for standard methods of analysis.

This workflow addresses those scale issues by drawing on the computing resources available at UW-Madison’s Center for High Throughput Computing (CHTC). Together, the code base and the CHTC resources give researchers the precision and power to locate particular items in a massive dataset while maintaining complete metadata detail for every record.

Who would use this code?

The process outlined in this guide is meant as a general introduction for any researcher interested in performing citation analysis with computational tools. The code base is designed to extract a subset of article records from the WOS dataset and then trace each reference within each article. This allows researchers to find highly specific items that are related to one another from within the dataset’s massive network of citations.

The inputs

  • Custom search results obtained through the WOS user interface
  • These search results serve as the criteria for matching against records in the WOS dataset

The outputs

  • A selection of article records drawn from the full dataset
  • The most complete metadata records available for those article records
  • Full metadata records for the references that are cited by the original article records
  • Unambiguous references to cited articles, linked by IDs
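
The relationship between the inputs and outputs above can be pictured with a small Python sketch. It is not the LTG code base: the record IDs, the field names (title, year, references), and the two helper functions are hypothetical, and a tiny in-memory dictionary stands in for the WOS dataset. It only illustrates the idea of matching search results against the dataset and then following each cited reference by ID.

```python
# Toy stand-in for the WOS dataset: record ID -> metadata.
# All IDs and field names here are hypothetical, for illustration only.
dataset = {
    "A1": {"title": "Article one", "year": 2015, "references": ["B1", "B2"]},
    "A2": {"title": "Article two", "year": 2018, "references": ["B2", "X9"]},
    "B1": {"title": "Cited work one", "year": 1950, "references": []},
    "B2": {"title": "Cited work two", "year": 1987, "references": ["B1"]},
}


def select_records(dataset, search_ids):
    """Keep the full metadata for every search result found in the dataset."""
    return {rid: dataset[rid] for rid in search_ids if rid in dataset}


def trace_references(dataset, selected):
    """Return full metadata for each cited record, keyed by the citing ID."""
    traced = {}
    for rid, record in selected.items():
        traced[rid] = {
            ref_id: dataset[ref_id]
            for ref_id in record["references"]
            if ref_id in dataset  # cited items absent from the dataset are skipped
        }
    return traced


# Input: IDs taken from a search; "Z0" has no match in the toy dataset.
search_ids = ["A1", "A2", "Z0"]
selected = select_records(dataset, search_ids)
traced = trace_references(dataset, selected)

for rid, refs in traced.items():
    print(rid, "cites", sorted(refs))
```

Because every reference is resolved by ID rather than by a free-text citation string, each cited item points to exactly one record, and that record carries its own full metadata and its own reference list for further tracing.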

The value of the outputs

The outputs of the analysis can reveal large-scale citation patterns extending from the present back through the year 1900. Researchers can follow citation chains that span more than a century while preserving all metadata for every record in each chain. The results therefore support both broad network analysis and close examination of specific article records.