Datasets: Inside Mac App Store Applications Metadata

Overview and Data Collection

We started collecting and sharing this macOS app dataset to support our internal research, since we have not yet found a suitable existing one.

Full application metadata for over 87,000 samples was collected through the public iTunes Search API for the US, Germany, and Ukraine between December 2023 and January 2024.
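As an illustration of what a request to the iTunes Search API can look like, the sketch below builds a query URL for macOS apps in each of the three countries. The search term and the `build_search_url` helper are hypothetical; the exact queries used to collect this dataset are not described here.

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = "https://itunes.apple.com/search"


def build_search_url(term: str, country: str, limit: int = 200) -> str:
    """Return an iTunes Search API URL for macOS apps in one country."""
    params = {
        "term": term,             # hypothetical search keyword
        "country": country,       # two-letter country code, e.g. "us", "de", "ua"
        "media": "software",
        "entity": "macSoftware",  # restrict results to Mac apps
        "limit": limit,           # the API allows at most 200 results per request
    }
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"


# One URL per country covered by the dataset.
urls = [build_search_url("calendar", country) for country in ("us", "de", "ua")]
```

The block only constructs URLs and sends no requests. Fetching them would return JSON whose `results` array holds one metadata object per app.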

For the convenience of further analysis, we present the data as three separate datasets.

Metadata Dataset

The data is provided in its original form, without any additional cleaning or transformation. It is organized into 43 columns covering essential details such as app names, descriptions, and genres, along with release and version information, user ratings, and asset links. The data is grouped by the country queried through the API and is presented in three corresponding CSV files.
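A minimal sketch of reading one of the per-country files with the Python standard library is shown below. The column names (`trackId`, `trackName`, `description`) are an assumption based on the field names the iTunes Search API returns, and the in-memory sample stands in for a real file, which has 43 columns.

```python
import csv
import io


def read_country_rows(fileobj, country):
    """Yield each CSV row as a dict, tagged with the country it came from."""
    for row in csv.DictReader(fileobj):
        row["country"] = country
        yield row


# Tiny in-memory stand-in for one per-country file.
# The quoted description spans two lines, as real descriptions often do.
sample = io.StringIO(
    'trackId,trackName,description\n1,"Demo App","Line one\nline two"\n'
)
rows = list(read_country_rows(sample, "us"))
```

When reading a real file, open it with `newline=""` and `encoding="utf-8"` so that multi-line quoted fields and non-ASCII text are parsed correctly.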

Release Notes Dataset

This dataset is a combined and refined subset of the metadata dataset. Its purpose is to isolate the release notes text of the macOS apps for further analysis.

Key fields related to release notes were selected, and entries were additionally classified by language using the langdetect library. The relevant fields include the app name, release date, current version release date, language, and the release notes themselves. Because the dataset focuses primarily on the release notes text, it was additionally deduplicated by this attribute, resulting in over 24,000 apps.
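The selection, language detection, and deduplication steps described above could be sketched roughly as follows. This is not the authors' actual pipeline: the column names are assumed to follow the iTunes Search API field names, keeping the first occurrence of each text is one possible deduplication policy, and `build_release_notes` is a hypothetical helper.

```python
def detect_language(text):
    """Detect the language with langdetect; return "unknown" on failure."""
    # Imported lazily so the rest of the module works without langdetect.
    from langdetect import DetectorFactory, detect
    from langdetect.lang_detect_exception import LangDetectException

    DetectorFactory.seed = 0  # langdetect is non-deterministic without a seed
    try:
        return detect(text)
    except LangDetectException:
        return "unknown"


def build_release_notes(rows, detector=detect_language):
    """Select release-note fields, drop duplicate texts, and tag the language."""
    seen = set()
    result = []
    for row in rows:
        notes = (row.get("releaseNotes") or "").strip()
        if not notes or notes in seen:
            continue  # skip empty notes and texts already kept
        seen.add(notes)
        result.append({
            "trackName": row.get("trackName"),
            "releaseDate": row.get("releaseDate"),
            "currentVersionReleaseDate": row.get("currentVersionReleaseDate"),
            "language": detector(notes),
            "releaseNotes": notes,
        })
    return result
```

Calling `build_release_notes(rows)` on metadata rows uses langdetect by default. Any callable that maps a text to a language code can be passed as `detector` instead, which is convenient for testing.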

The dominant languages are English (75%) and German (13%); the others have fewer than 600 entries (3%) each.
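Language shares like these can be recomputed from the language column with a simple count. The sketch below assumes each record is a dict with a `language` key and uses toy data rather than the real dataset.

```python
from collections import Counter


def language_shares(records):
    """Return the percentage of records per language, most common first."""
    counts = Counter(record["language"] for record in records)
    total = sum(counts.values())
    return {
        lang: round(100 * count / total, 1)
        for lang, count in counts.most_common()
    }


# Toy data: four English records and one German record.
demo = [{"language": "en"}] * 4 + [{"language": "de"}]
shares = language_shares(demo)
```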

The data is organized in a single CSV file.

Descriptions Dataset

By analogy with the Release Notes dataset, we also derived a separate dataset of app descriptions from the metadata dataset. The corresponding fields were selected, and each record was assigned the description language detected by the langdetect library. The resulting dataset has five columns: app name, bundle ID, ID, language, and description text. In the final step, the descriptions were deduplicated, resulting in almost 39,000 unique records.

The prevalent language is English (78%), followed by German (16%); each of the remaining languages accounts for less than 2%.