About the Dataset

Clarivate, the publisher of the Web of Science, supplies the dataset in XML format. The UW-Madison Libraries have created a derivative version, reserialized as JSON, to make processing more efficient. The JSON version takes up fewer bytes on disk. It also stores one article record per line, which makes it substantially easier to stream through the data with a low memory footprint using Unix-style data processing techniques. This allows the CHTC cluster servers to process the data quickly and extract the necessary information.
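Because each line holds one complete record, a file can be read one line at a time without loading all of it into memory. The following is a minimal sketch of that approach, assuming each non-empty line is a standalone JSON value encoded as UTF-8. The function names and any file path passed in are hypothetical and are not part of the dataset's tooling.

```python
import json
from typing import Iterable, Iterator


def iter_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed record per non-empty line.

    Assumption: every non-empty line is a complete JSON document.
    """
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def count_records(path: str) -> int:
    """Count the records in a file while holding only one line in memory.

    `path` is a hypothetical location of an uncompressed JSON file.
    """
    with open(path, encoding="utf-8") as handle:
        return sum(1 for _ in iter_records(handle))
```

Passing an open file handle to iter_records keeps memory use roughly proportional to the largest single record rather than to the size of the file.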

Every article record includes an ID number (also referred to as accession number in the Web of Science UI) as well as author and publication information, references cited by the article, subheadings, subjects, keywords, abstract text, etc.
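The exact field names used in the JSON records are not listed here, so one way to get oriented is to tally the top-level keys found in a sample of lines. The sketch below assumes each record is a JSON object. The function name and the default sample size are illustrative choices, not part of the dataset.

```python
import json
from collections import Counter
from itertools import islice
from typing import Iterable


def tally_top_level_keys(lines: Iterable[str], limit: int = 1000) -> Counter:
    """Count how often each top-level key appears in the first `limit` lines.

    Assumption: each non-empty line parses to a JSON object (a dict).
    Lines that parse to anything else are skipped.
    """
    counts: Counter = Counter()
    for line in islice(lines, limit):
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if isinstance(record, dict):
            counts.update(record.keys())
    return counts
```

Calling tally_top_level_keys on an open file handle and printing the result of most_common() shows which fields are present in every sampled record and which are optional.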

Each file in this dataset contains at most 100,000 records, and the files are organized by publication year. The first group of files contains articles published from 1900 to 1944; each subsequent year through 2022 is grouped individually. The JSON files for the later years are larger, averaging around 10-15 GB when not compressed.

If you want to view the dataset in the original XML format, the XML files are stored on the CHTC data staging server alongside the JSON files. They are compressed and have a .zip file extension.
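The XML can be inspected without unpacking an archive to disk by using Python's standard zipfile module. This is a rough sketch, assuming each archive contains one or more members whose names end in .xml. The internal layout of the archives is not described here, and the function names are hypothetical.

```python
import zipfile
from typing import Iterator, List


def list_xml_members(archive) -> List[str]:
    """Return the names of archive members that end in .xml.

    `archive` may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(archive) as zf:
        return [
            info.filename
            for info in zf.infolist()
            if info.filename.lower().endswith(".xml")
        ]


def iter_member_chunks(archive, member: str, chunk_size: int = 1 << 20) -> Iterator[bytes]:
    """Yield the decompressed bytes of one member in fixed-size chunks,
    so the whole XML file never has to fit in memory at once."""
    with zipfile.ZipFile(archive) as zf:
        with zf.open(member) as stream:
            while True:
                chunk = stream.read(chunk_size)
                if not chunk:
                    break
                yield chunk
```

The chunks can be fed to an incremental XML parser or simply printed to check the structure of the first few records.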