I am developing a Python application that lets me control the desktop cursor using hand gestures. I am using MediaPipe to track the hand position and map it to screen coordinates.
The basic cursor movement works, but I have a usability problem: when the user tries to click small GUI elements, such as the close, minimize, or maximize buttons of a window, it is difficult to keep the hand perfectly still. Because of small hand tremors and tracking noise, the cursor moves around and it becomes hard to click accurately.
I already tried smoothing the cursor movement with filtering, but this does not fully solve the problem. My next idea is to implement a “magnetic snap” behavior:
Detect GUI elements near the cursor.
If the cursor is close enough to one of them, snap/stick the cursor to the center of that element.
Release the snap when the hand moves far enough away.
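To make the intended behavior concrete, here is a minimal sketch of the snap/release logic, assuming I already have a list of target centers in screen coordinates (which is exactly the part I am missing). The class name MagneticSnap and both radii are hypothetical placeholders, not values I have tuned.

```python
from dataclasses import dataclass
from math import hypot
from typing import Optional, Sequence, Tuple

Point = Tuple[float, float]


def _distance(a: Point, b: Point) -> float:
    return hypot(a[0] - b[0], a[1] - b[1])


@dataclass
class MagneticSnap:
    """Snap to a nearby target and hold it until the hand moves far away.

    Both radii are made-up defaults (in pixels); release_radius should be
    larger than snap_radius so the cursor does not flicker at the boundary.
    """
    snap_radius: float = 30.0
    release_radius: float = 60.0
    locked: Optional[Point] = None

    def update(self, raw: Point, targets: Sequence[Point]) -> Point:
        """Return the cursor position to use for this frame."""
        # While locked, stay on the target until the raw (hand-driven)
        # position leaves the larger release radius.
        if self.locked is not None:
            if _distance(raw, self.locked) <= self.release_radius:
                return self.locked
            self.locked = None

        # Not locked: snap to the nearest target if it is close enough.
        nearest = min(targets, key=lambda t: _distance(raw, t), default=None)
        if nearest is not None and _distance(raw, nearest) <= self.snap_radius:
            self.locked = nearest
            return nearest
        return raw
```

I would call update() once per tracking frame with the smoothed hand position and move the real cursor to the returned point. Using two different radii is a design choice to get hysteresis; a single threshold would probably make the cursor jump in and out of the snap.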
The part I am struggling with is detecting the GUI elements themselves.
I would like to detect elements such as:
window close button
minimize button
maximize button
possibly taskbar/dock icons
possibly search bars or text fields
The application is written in Python and is currently targeting Linux. I am testing on Ubuntu/Linux Mint with GNOME. I can switch between X11 and Wayland if needed, but I would prefer to understand which approach is possible or recommended.
Is there a Python library or Linux API that can expose the positions of these GUI elements? For example, can this be done using X11/EWMH, AT-SPI accessibility, or another approach?
I am not necessarily looking for a full implementation, but I would like to know which API/library is the correct direction for this problem.
What I have considered so far:
Smoothing the cursor movement, but it does not make small targets easy enough to click.
Estimating positions of window control buttons from window geometry, but this seems unreliable because different desktop environments and themes place buttons differently.
Using X11/EWMH with something like python-xlib or ewmh to read window geometry and metadata, but I am not sure whether this is reliable for title-bar buttons.
Using _NET_FRAME_EXTENTS to estimate the title-bar area and button positions, but I am not sure how portable this is across window managers.
Using AT-SPI accessibility APIs, but I am not sure whether they expose window title-bar buttons or only application-level controls.
On Wayland, I am not sure whether this kind of global GUI inspection is possible because applications are more isolated.
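For reference, this is roughly the geometry-based estimate I had in mind for the _NET_FRAME_EXTENTS idea. It is only a sketch: it assumes I have already read the client window rectangle in root coordinates and the four extents (left, right, top, bottom, in that order) through some X11 binding, and it assumes the buttons are square, as tall as the title bar, and packed against the right edge. Those layout assumptions are exactly what I expect to break across themes, and I would expect the extents to be of little use for windows that draw their own title bar. The helper names are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Rect(NamedTuple):
    x: int
    y: int
    width: int
    height: int


def title_bar_rect(client: Rect, extents: Tuple[int, int, int, int]) -> Rect:
    """Estimate the title-bar area from the client rect and frame extents.

    extents is (left, right, top, bottom), the order used by
    _NET_FRAME_EXTENTS. The whole top border is treated as the title bar.
    """
    left, right, top, _bottom = extents
    return Rect(client.x - left, client.y - top,
                client.width + left + right, top)


def guess_button_centers(bar: Rect, count: int = 3) -> List[Tuple[int, int]]:
    """Guess centers of `count` right-aligned square buttons in the bar.

    This is a pure heuristic (an assumption, not something any API
    reports): button side == bar height, no padding, right-aligned.
    """
    if bar.height <= 0:
        return []  # no server-side title bar to estimate from
    size = bar.height
    center_y = bar.y + size // 2
    right_edge = bar.x + bar.width
    return [(right_edge - size * i - size // 2, center_y)
            for i in range(count)]


# Example with made-up numbers: an 800x600 client with a 40 px title bar.
bar = title_bar_rect(Rect(100, 140, 800, 600), (2, 2, 40, 2))
centers = guess_button_centers(bar)
```

The returned points would be fed into the snapping logic as candidate targets, but I have no way to know which point is close, minimize, or maximize, or whether the buttons are on the left instead.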
What is the recommended approach for detecting clickable GUI targets on Linux for this kind of cursor snapping behavior?
More specifically:
Is X11/EWMH the right approach for detecting window control button positions?
Can AT-SPI expose close/minimize/maximize buttons, or only widgets inside applications?
Is this possible on Wayland, or would it require a desktop-environment-specific API/extension?
Are there existing Python libraries that can help with this, or is this usually implemented with custom platform-specific code?