
Screen2AX: Vision-Based Approach for Automatic macOS Accessibility Generation

Paper • Machine Learning

Desktop accessibility metadata provides textual representations of applications that support both AI agents and users who rely on screen readers. Our investigation shows that only 33% of macOS applications provide complete accessibility support. Yet prior work on structured screen representation has not attempted to reconstruct the full hierarchical structure of desktop interfaces.
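For context, macOS exposes this metadata through the Accessibility (AX) API as a tree of elements carrying roles, titles, and children. Below is a minimal sketch of walking such a tree using the pyobjc ApplicationServices bindings; the process ID and depth limit are illustrative assumptions, and the calling process must be granted Accessibility permission.

# Minimal sketch: walking a macOS accessibility (AX) tree with pyobjc.
# Requires the pyobjc-framework-ApplicationServices package and
# Accessibility permission for the calling process.
from ApplicationServices import (
    AXUIElementCreateApplication,
    AXUIElementCopyAttributeValue,
)

def ax_attr(element, name):
    """Return an AX attribute value, or None if it is missing."""
    err, value = AXUIElementCopyAttributeValue(element, name, None)
    return value if err == 0 else None  # 0 == kAXErrorSuccess

def dump_tree(element, depth=0, max_depth=5):
    """Print role and title for each node, recursing into AXChildren."""
    role = ax_attr(element, "AXRole")
    title = ax_attr(element, "AXTitle")
    print("  " * depth + f"{role or '?'}: {title or ''}")
    if depth >= max_depth:
        return
    for child in ax_attr(element, "AXChildren") or []:
        dump_tree(child, depth + 1, max_depth)

# 12345 is a placeholder PID for the application under inspection.
app = AXUIElementCreateApplication(12345)
dump_tree(app)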

We present Screen2AX, the first framework that automatically generates real-time, tree-structured accessibility metadata from a screenshot. Screen2AX combines vision-language and object detection models to detect, describe, and organize UI elements hierarchically, mirroring macOS's system-level accessibility structure.
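In broad strokes: an object detector proposes element bounding boxes and roles, a vision-language model describes each element, and a grouping step nests elements into a tree. The sketch below is an assumption-laden illustration of that flow, not the paper's implementation: the detector and VLM are hypothetical stand-ins, and the naive containment rule replaces the paper's learned hierarchy construction.

# Minimal sketch of a detect -> describe -> group pipeline.
# The detector and vlm arguments are hypothetical model wrappers;
# grouping by bounding-box containment is an illustrative simplification.
from dataclasses import dataclass, field

@dataclass
class UIElement:
    box: tuple          # (x, y, w, h) in screenshot pixels
    role: str           # e.g. "AXButton"
    description: str    # natural-language label from the VLM
    children: list = field(default_factory=list)

def contains(outer, inner):
    """True if inner's box lies entirely inside outer's box."""
    ox, oy, ow, oh = outer.box
    ix, iy, iw, ih = inner.box
    return ox <= ix and oy <= iy and ix + iw <= ox + ow and iy + ih <= oy + oh

def build_tree(elements):
    """Nest each element under the smallest element that contains it."""
    by_area = sorted(elements, key=lambda e: e.box[2] * e.box[3])
    roots = []
    for i, el in enumerate(by_area):
        parent = next((p for p in by_area[i + 1:] if contains(p, el)), None)
        (parent.children if parent else roots).append(el)
    return roots

def screen_to_ax(screenshot, detector, vlm):
    """Detect UI elements, describe them, and organize them into a tree."""
    detections = detector.detect(screenshot)  # boxes + roles
    elements = [
        UIElement(d.box, d.role, vlm.describe(screenshot, d.box))
        for d in detections
    ]
    return build_tree(elements)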

We compiled and publicly released three datasets covering 302 macOS applications, annotated with accessibility trees. Screen2AX reconstructs accessibility hierarchies with an F1 score of 79% and substantially improves GUI grounding in complex desktop environments: it improves grounding accuracy by 2.2× over native accessibility metadata and outperforms the state-of-the-art OmniParser V2 on the ScreenSpot benchmark when grounding from textual screen representations.
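For readers unfamiliar with element-level F1, a common scheme matches predicted elements to ground-truth elements by role and bounding-box overlap. The sketch below follows that scheme; the IoU threshold of 0.5 is an assumption for illustration, not a detail taken from the paper.

# Illustrative element-level F1: a prediction is a true positive if some
# unmatched ground-truth element of the same role overlaps it with
# IoU >= 0.5 (threshold is an assumption, not the paper's setting).
def iou(a, b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    x1, y1 = max(ax, bx), max(ay, by)
    x2, y2 = min(ax + aw, bx + bw), min(ay + ah, by + bh)
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = aw * ah + bw * bh - inter
    return inter / union if union else 0.0

def element_f1(predicted, ground_truth, thresh=0.5):
    """predicted / ground_truth: lists of (box, role) pairs."""
    unmatched = list(ground_truth)
    tp = 0
    for box, role in predicted:
        match = next(
            (g for g in unmatched if g[1] == role and iou(box, g[0]) >= thresh),
            None,
        )
        if match:
            unmatched.remove(match)
            tp += 1
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(ground_truth) if ground_truth else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0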

@misc{muryn2025screen2axvisionbasedapproachautomatic,
      title={Screen2AX: Vision-Based Approach for Automatic macOS Accessibility Generation}, 
      author={Viktor Muryn and Marta Sumyk and Mariya Hirna and Sofiya Garkot and Maksym Shamrai},
      year={2025},
      eprint={2507.16704},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2507.16704},  
}
