
Datasets: Inside Mac App Store Applications Metadata

Tech Note • Software Analysis

Overview and Data Collection

We created this dataset to meet our internal research needs and to start collecting and sharing macOS app data, since we have yet to find a suitable existing dataset.

Full application metadata for over 87,000 samples was sourced from the public iTunes Search API for the US, Germany, and Ukraine between December 2023 and January 2024.
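As an illustration of the collection step, here is a minimal sketch of one query against the public iTunes Search API. The search term, the `macSoftware` entity filter, and the helper names `build_search_url` and `fetch_results` are assumptions made for this example; the exact queries and pagination strategy used to build the dataset are not described here.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

SEARCH_URL = "https://itunes.apple.com/search"
# Two-letter country codes for the US, Germany, and Ukraine.
COUNTRIES = ("us", "de", "ua")


def build_search_url(term, country, limit=200):
    """Build a Search API URL restricted to macOS software for one country."""
    params = {
        "term": term,
        "country": country,
        "entity": "macSoftware",  # assumption: filter results to Mac apps
        "limit": limit,
    }
    return f"{SEARCH_URL}?{urlencode(params)}"


def fetch_results(term, country):
    """Fetch one page of results; each item is a dict of app metadata."""
    with urlopen(build_search_url(term, country), timeout=30) as response:
        return json.load(response)["results"]
```

Calling `fetch_results("photo editor", "de")` would return the raw metadata records for the German store, which can then be written out unchanged.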

For the convenience of further analysis, we present the data as three separate datasets.

Metadata Dataset

The data is provided in its original form, without any additional cleaning or transformations. It is organized into 43 columns that cover essential details such as app names, descriptions, and genres, along with release and version information, user ratings, and asset links. The data is grouped by the country queried through the API and is presented in three corresponding CSV files.
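A minimal sketch of how the three per-country files could be read into one list of records, with the store kept as an extra field. The file names in `STORE_FILES` are hypothetical placeholders, not the actual names of the published files.

```python
import csv
from pathlib import Path

# Hypothetical file names; substitute the real CSV names from the dataset.
STORE_FILES = {
    "us": "metadata_us.csv",
    "de": "metadata_de.csv",
    "ua": "metadata_ua.csv",
}


def load_metadata(data_dir, store_files=STORE_FILES):
    """Read every per-country CSV and tag each row with its store code."""
    rows = []
    for store, filename in store_files.items():
        path = Path(data_dir) / filename
        with open(path, newline="", encoding="utf-8") as handle:
            for row in csv.DictReader(handle):
                row["store"] = store  # added field, not part of the original 43 columns
                rows.append(row)
    return rows
```

Keeping the store code on each row makes it possible to compare the three stores after the files are combined.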

Metadata distribution by store

Release Notes Dataset

This dataset is a combined and refined subset of the metadata dataset. It was created to isolate the release notes text of macOS apps for further analysis.

Key fields related to release notes were selected, and entries were additionally classified by language using the langdetect library. The relevant fields include the app name, release date, current version release date, language, and the release notes themselves. Because the dataset focuses primarily on the release notes text, it was also deduplicated by this attribute, resulting in over 24,000 apps.
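The selection, language tagging, and deduplication steps described above could look roughly like the sketch below. It assumes the rows keep the iTunes Search API field names (`trackName`, `releaseDate`, `currentVersionReleaseDate`, `releaseNotes`); the function names and the `"unknown"` fallback label are illustrative choices, not the authors' actual pipeline.

```python
def detect_with_langdetect(text):
    """Return a language code using langdetect (pip install langdetect)."""
    from langdetect import DetectorFactory, detect
    from langdetect.lang_detect_exception import LangDetectException

    DetectorFactory.seed = 0  # langdetect is non-deterministic unless seeded
    try:
        return detect(text)
    except LangDetectException:
        return "unknown"  # e.g. text made up only of digits or symbols


def build_release_notes(rows, detect_language=detect_with_langdetect):
    """Keep the release-note fields, tag language, drop duplicate note texts."""
    seen = set()
    records = []
    for row in rows:
        notes = (row.get("releaseNotes") or "").strip()
        if not notes or notes in seen:
            continue  # skip empty notes and texts already kept
        seen.add(notes)
        records.append({
            "trackName": row.get("trackName"),
            "releaseDate": row.get("releaseDate"),
            "currentVersionReleaseDate": row.get("currentVersionReleaseDate"),
            "language": detect_language(notes),
            "releaseNotes": notes,
        })
    return records
```

The detector is passed in as a parameter so that a different language identifier can be swapped in without touching the deduplication logic.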

The dominant languages are English (75%) and German (13%); each of the other languages has fewer than 600 entries (3%).
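Language shares like these can be recomputed from the language column with a simple count. This is a minimal sketch, assuming each record is a dict with a `language` key; the function name `language_shares` is hypothetical.

```python
from collections import Counter


def language_shares(records):
    """Return (language, count, percent) tuples, most frequent first."""
    counts = Counter(record["language"] for record in records)
    total = sum(counts.values())
    return [
        (language, count, round(100 * count / total, 1))
        for language, count in counts.most_common()
    ]
```

The same helper applies unchanged to any record set that carries a detected language per entry.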

Top release notes languages

The data is organized in a single CSV file.

Release notes samples

Descriptions Dataset

By analogy with the Release Notes dataset, we also derived a separate dataset of app descriptions from the metadata dataset. The corresponding fields were selected, and each record was assigned the description language detected by the langdetect library. The resulting dataset has five columns: app name, bundle id, id, language, and description text. In the final step, the descriptions were deduplicated, resulting in almost 39,000 unique records.

The prevalent language is English (78%), followed by German (16%); every other language accounts for less than 2%.

Descriptions distribution by languages

The data is organized in a single CSV file.

Description samples
Sep 16, 2024

macOS Applications Metadata

Metadata for over 87,000 macOS apps, sourced from the public iTunes Search API for the US, Germany, and Ukraine. It contains essential details such as app names, descriptions, and genres, along with release and version information, user ratings, and more.
