How can I detect Linux GUI elements such as close/minimize/maximize buttons for cursor snapping in Python?
01:12 25 May 2026

I am developing a Python application that lets me control the desktop cursor using hand gestures. I am using MediaPipe to track the hand position and map it to screen coordinates.

The basic cursor movement works, but I have a usability problem: when the user tries to click small GUI elements, such as the close, minimize, or maximize buttons of a window, it is difficult to keep the hand perfectly still. Because of small hand tremors and tracking noise, the cursor moves around and it becomes hard to click accurately.

I already tried smoothing the cursor movement with filtering, but this does not fully solve the problem. My next idea is to implement a “magnetic snap” behavior:

  1. Detect GUI elements near the cursor.

  2. If the cursor is close enough to one of them, snap/stick the cursor to the center of that element.

  3. Release the snap when the hand moves far enough away.
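To make the intended behavior concrete, here is a minimal sketch of the snap logic, assuming the target rectangles are already known (which is exactly the part I am missing). The class name `MagneticSnap` and the two radius values are placeholders for illustration, not tuned numbers. The release radius is larger than the snap radius so that the cursor does not flicker in and out of the snap at the boundary.

```python
import math
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

Rect = Tuple[int, int, int, int]  # (x, y, width, height) in screen pixels


def center(rect: Rect) -> Tuple[float, float]:
    """Return the center point of a rectangle."""
    x, y, w, h = rect
    return (x + w / 2.0, y + h / 2.0)


@dataclass
class MagneticSnap:
    snap_radius: float = 25.0     # placeholder: capture distance in pixels
    release_radius: float = 60.0  # placeholder: should be larger than snap_radius
    locked: Optional[Rect] = None  # target the cursor is currently stuck to

    def update(self, raw: Tuple[float, float],
               targets: Sequence[Rect]) -> Tuple[float, float]:
        """Map the raw (tracked) position to the position the cursor should take."""
        rx, ry = raw

        # Step 3: stay snapped until the hand moves far enough away,
        # or until the locked target disappears from the target list.
        if self.locked is not None:
            cx, cy = center(self.locked)
            still_there = self.locked in targets
            if still_there and math.hypot(rx - cx, ry - cy) <= self.release_radius:
                return (cx, cy)
            self.locked = None

        # Steps 1 and 2: find the nearest target within the snap radius.
        best, best_dist = None, self.snap_radius
        for rect in targets:
            cx, cy = center(rect)
            dist = math.hypot(rx - cx, ry - cy)
            if dist <= best_dist:
                best, best_dist = rect, dist

        if best is not None:
            self.locked = best
            return center(best)
        return (float(rx), float(ry))
```

Each frame, the smoothed hand position would go through `update()` together with the current list of target rectangles, and the returned point would be used as the cursor position.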

The part I am struggling with is detecting the GUI elements themselves.

I would like to detect elements such as:

  • window close button

  • minimize button

  • maximize button

  • possibly taskbar/dock icons

  • possibly search bars or text fields

The application is written in Python and is currently targeting Linux. I am testing on Ubuntu and Linux Mint with GNOME. I can switch between X11 and Wayland if needed, but I would first like to understand which approach is possible or recommended on each.

Is there a Python library or Linux API that can expose the positions of these GUI elements? For example, can this be done using X11/EWMH, AT-SPI accessibility, or another approach?

I am not necessarily looking for a full implementation, but I would like to know which API/library is the correct direction for this problem.

What I have considered so far:

  • Smoothing the cursor movement, but it does not make small targets easy enough to click.

  • Estimating positions of window control buttons from window geometry, but this seems unreliable because different desktop environments and themes place buttons differently.

  • Using X11/EWMH with something like python-xlib or ewmh to read window geometry and metadata, but I am not sure whether this is reliable for title-bar buttons.

  • Using _NET_FRAME_EXTENTS to estimate the title-bar area and button positions, but I am not sure how portable this is across window managers.

  • Using AT-SPI accessibility APIs, but I am not sure whether they expose window title-bar buttons or only application-level controls.

  • On Wayland, I am not sure whether this kind of global GUI inspection is possible because applications are more isolated.
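For reference, this is roughly what the geometry-based estimate looks like, as a minimal sketch. It assumes the four `_NET_FRAME_EXTENTS` values (left, right, top, bottom, in that order per EWMH) and the client window's root-relative position have already been read, for example with python-xlib. The function name, the square button size equal to the title-bar height, and the right-aligned minimize/maximize/close order are all my own assumptions, which is precisely why I doubt this approach. It would presumably also find nothing for windows that draw their own title bar, where the top extent may be 0.

```python
from typing import Dict, Sequence, Tuple

Rect = Tuple[int, int, int, int]  # (x, y, width, height) in screen pixels


def estimate_titlebar_buttons(
    client_x: int,
    client_y: int,
    client_width: int,
    frame_extents: Tuple[int, int, int, int],
    button_order: Sequence[str] = ("minimize", "maximize", "close"),
) -> Dict[str, Rect]:
    """Guess title-bar button rectangles from client geometry and frame extents.

    frame_extents is (left, right, top, bottom), as in _NET_FRAME_EXTENTS.
    The layout (square buttons, right-aligned, in button_order) is an
    assumption and will not match every theme or window manager.
    """
    left, right, top, _bottom = frame_extents
    if top <= 0:
        # No server-side title bar reported, so there is nothing to estimate.
        return {}

    # Outer frame rectangle: the client area grown by the decoration widths.
    frame_x = client_x - left
    frame_y = client_y - top
    frame_width = client_width + left + right

    size = top  # assume square buttons as tall as the title bar
    x = frame_x + frame_width - size * len(button_order)

    buttons: Dict[str, Rect] = {}
    for name in button_order:
        buttons[name] = (x, frame_y, size, size)
        x += size
    return buttons
```

For a client area at (100, 130), 800 pixels wide, with extents (1, 1, 30, 1), this places the guessed close button at (871, 100) with a size of 30 by 30 pixels, flush with the right edge of the frame.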

What is the recommended approach for detecting clickable GUI targets on Linux for this kind of cursor snapping behavior?

More specifically:

  • Is X11/EWMH the right approach for detecting window control button positions?

  • Can AT-SPI expose close/minimize/maximize buttons, or only widgets inside applications?

  • Is this possible on Wayland, or would it require a desktop-environment-specific API/extension?

  • Are there existing Python libraries that can help with this, or is this usually implemented with custom platform-specific code?

python opencv mediapipe