Enterprise AI Analysis: Poster: A Multi-Modal Autonomous Tutoring System for Indian Percussion

AI-POWERED PERCUSSION TUTORING

Poster: A Multi-Modal Autonomous Tutoring System for Indian Percussion

Mridangam and tabla are key percussion instruments in Indian classical music. This poster presents our broad goal of achieving a fully autonomous percussion tutoring system that uses multi-sensor fusion (camera, audio, motion, and haptic force or pressure sensing). In this manuscript we present our proposed design and preliminary explorations for the mridangam, and discuss how they can translate to tabla analysis.

Authors: Vignesh A. M. Raja, Prateek Prasanna, Anu Bourgeois, Ashwin Ashok
Published: 02 March 2026
DOI: 10.1145/3789514.3796243

Key Performance Indicators & Cultural Impact

Our system demonstrates significant advancements in automating the instruction of complex percussion, aiming to preserve rich cultural heritage through accessible digital tools and robust performance analysis.

92% Audio Onset F1-Score
80% Vision Precision
3 Core Subsystems Integrated
Cultural Preservation Initiative

Deep Analysis & Enterprise Applications


Introduction and System Architecture

Mridangam and tabla are unique percussion instruments that feature complex finger placements and damping techniques executed at low (2-4 strokes/sec) to high (8-16 strokes/sec) speeds. Unlike isolated Western drumming, mridangam and tabla strokes involve simultaneous multi-finger configurations in which visually similar positions produce acoustically distinct sounds based on subtle pressure differences. Our proposed autonomous tutoring system (Figure 1) comprises three tightly coupled subsystems: Audio Onset Detection, which uses spectral analysis via Librosa [2] for beat onset time-stamping and beat classification; Spatial Segmentation, which employs SAM2 [3] for zero-shot segmentation of the drum's inner and outer membranes, with RANSAC-based circle fitting to maintain accuracy under occlusion; and Hand Pose Estimation, which uses MediaPipe [1] to extract finger landmarks and project them onto the segmented zones. A rule-based classifier then maps finger configurations to distinct strokes.
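To make the vision side of this architecture concrete, below is a minimal sketch of how MediaPipe fingertip landmarks could be projected onto segmented membrane zones and mapped to strokes by simple rules. The circle geometry (assumed to be recovered from the SAM2 masks and RANSAC fit), the classify_frame helper, the sample video file name, and the rule table are illustrative assumptions, not the authors' actual configuration or the real stroke fingerings.

# Sketch of the vision pipeline: MediaPipe fingertip landmarks are projected
# onto drum-membrane zones and mapped to a stroke by placeholder rules.
# Zone circles and the rule table below are illustrative assumptions.
import cv2
import mediapipe as mp

mp_hands = mp.solutions.hands

# Hypothetical membrane geometry in pixel coordinates, e.g. recovered from
# SAM2 masks plus RANSAC circle fitting: (center_x, center_y, radius).
INNER_CIRCLE = (320, 240, 60)    # inner patch
OUTER_CIRCLE = (320, 240, 150)   # full membrane

def zone_of(px, py):
    """Return the membrane zone a point falls in: 'inner', 'outer' or None."""
    for name, (cx, cy, r) in (("inner", INNER_CIRCLE), ("outer", OUTER_CIRCLE)):
        if (px - cx) ** 2 + (py - cy) ** 2 <= r ** 2:
            return name
    return None

# Placeholder rule table (not the actual fingerings): which fingertips touch
# which zone -> stroke syllable.
RULES = {
    frozenset({("index", "inner")}): "Nam",
    frozenset({("index", "outer"), ("middle", "outer")}): "Chapu",
    frozenset({("index", "inner"), ("middle", "inner"), ("ring", "inner")}): "Thom",
}

FINGERTIPS = {
    "index": mp_hands.HandLandmark.INDEX_FINGER_TIP,
    "middle": mp_hands.HandLandmark.MIDDLE_FINGER_TIP,
    "ring": mp_hands.HandLandmark.RING_FINGER_TIP,
    "little": mp_hands.HandLandmark.PINKY_TIP,
}

def classify_frame(frame_bgr, hands):
    """Classify a single video frame into a stroke label (or None)."""
    h, w = frame_bgr.shape[:2]
    result = hands.process(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
    if not result.multi_hand_landmarks:
        return None
    touches = set()
    for hand in result.multi_hand_landmarks:
        for finger, idx in FINGERTIPS.items():
            lm = hand.landmark[idx]                # normalized [0, 1] coords
            z = zone_of(lm.x * w, lm.y * h)
            if z is not None:
                touches.add((finger, z))
    return RULES.get(frozenset(touches))

if __name__ == "__main__":
    cap = cv2.VideoCapture("mridangam_practice.mp4")   # hypothetical clip
    with mp_hands.Hands(max_num_hands=2, min_detection_confidence=0.5) as hands:
        ok, frame = cap.read()
        while ok:
            label = classify_frame(frame, hands)
            if label:
                print(label)
            ok, frame = cap.read()
    cap.release()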

Enterprise Process Flow

Multi-Sensor Setup
Audio-Based Pipeline
Vision-Based Pipeline
Time-Synced Fusion
Rule-Based Classification
Syllable Recognition
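The audio-based pipeline and the time-synced fusion step in the flow above can be sketched in a few lines, assuming the audio and video tracks share a common start time: librosa's onset detector yields onset timestamps, each onset is mapped to the nearest video frame at a known frame rate, and the per-frame vision label at that instant is paired with the onset. The 30 fps value, file names, and the classify_frame-style labels are assumptions for illustration, not details from the poster.

# Sketch of time-synced audio/vision fusion: detect stroke onsets in the
# audio track with librosa, then look up the vision-based stroke label for
# the video frame closest to each onset. Assumes both tracks start together.
import librosa

FPS = 30.0   # assumed video frame rate

def detect_onsets(audio_path):
    """Return stroke onset times (seconds) from the audio track."""
    y, sr = librosa.load(audio_path, sr=None)
    return librosa.onset.onset_detect(y=y, sr=sr, units="time")

def fuse(onset_times, frame_labels, fps=FPS):
    """Pair each audio onset with the vision label of the nearest frame.

    frame_labels: list of per-frame stroke labels (index = frame number),
    e.g. produced by a classify_frame-style vision pipeline.
    """
    events = []
    for t in onset_times:
        idx = min(int(round(t * fps)), len(frame_labels) - 1)
        events.append((t, frame_labels[idx]))
    return events

# Example usage with hypothetical file names and labels:
# onsets = detect_onsets("lesson_take1.wav")
# events = fuse(onsets, frame_labels)  # -> [(0.52, "Nam"), (0.94, "Thom"), ...]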

Challenges and Cultural Impact

Digitizing mridangam performance presents unique challenges: capturing Micro-Movement Nuances like sliding motions (Gumki) that standard vision models miss, handling High-Speed Occlusion where hands block the drum face, and resolving Acoustic Ambiguity where identical regions produce different timbres. Solving these enables a broader impact: Cultural Preservation. By digitizing the pedagogical process itself, we ensure that the intricate mechanics of oral traditions can be passed to future generations even if human masters become inaccessible. Furthermore, the framework's modular design allows it to be translated to other instruments like the Tabla, creating a scalable platform for archiving global percussion heritage.
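Returning to the high-speed occlusion challenge, the architecture addresses it with RANSAC-based circle fitting on the SAM2 membrane mask; the sketch below shows one plausible form of that step. Three boundary points from the visible arc are repeatedly sampled, the circle through them is solved analytically, and the candidate with the most inliers is kept. The contour extraction, tolerance, and iteration count are assumptions for illustration.

# Illustrative RANSAC circle fit: even when the hand occludes part of the
# drum face, the visible arc of the SAM2 membrane mask is enough to recover
# the full circle. Thresholds and iteration counts below are assumptions.
import numpy as np

def circle_from_3_points(p1, p2, p3):
    """Circumscribed circle (cx, cy, r) of three 2-D points, or None if collinear."""
    ax, ay = p1; bx, by = p2; cx, cy = p3
    d = 2.0 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    if abs(d) < 1e-9:
        return None
    ux = ((ax**2 + ay**2) * (by - cy) + (bx**2 + by**2) * (cy - ay)
          + (cx**2 + cy**2) * (ay - by)) / d
    uy = ((ax**2 + ay**2) * (cx - bx) + (bx**2 + by**2) * (ax - cx)
          + (cx**2 + cy**2) * (bx - ax)) / d
    r = np.hypot(ax - ux, ay - uy)
    return ux, uy, r

def ransac_circle(points, iters=200, tol=3.0, rng=None):
    """Fit a circle to noisy or partial boundary points, ignoring outliers."""
    rng = rng or np.random.default_rng(0)
    points = np.asarray(points, dtype=float)
    best, best_inliers = None, 0
    for _ in range(iters):
        sample = points[rng.choice(len(points), 3, replace=False)]
        circle = circle_from_3_points(*sample)
        if circle is None:
            continue
        cx, cy, r = circle
        dist = np.abs(np.hypot(points[:, 0] - cx, points[:, 1] - cy) - r)
        inliers = int(np.sum(dist < tol))
        if inliers > best_inliers:
            best, best_inliers = circle, inliers
    return best   # (cx, cy, r) of the most-supported circle

# Typical use: points = boundary pixels of the (partially occluded) membrane
# mask, e.g. from cv2.findContours; the returned circle gives the zone
# geometry used by the fingertip-to-zone test sketched earlier.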

Preliminary Evaluation

We evaluated the feasibility of our approach on mridangam strokes using a self-created dataset containing all 7 fundamental strokes (70-75 samples each). Our system achieved a 92% audio onset F1-score (i.e., how accurately beat onsets are detected). Vision precision for the seven fundamental strokes averaged 80%, successfully discriminating visually similar configurations.

92% Audio Onset F1-Score Achieved
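An onset F1-score of this kind is typically computed by matching detected onsets to ground-truth onsets within a small timing tolerance. The poster does not state the tolerance used, so the 50 ms window and the greedy matching in the sketch below are assumptions for illustration.

# Sketch of an onset F1 computation: greedily match each detected onset to
# an unused ground-truth onset within a timing tolerance, then compute
# precision, recall and F1. The 50 ms tolerance is an assumed, common value.
def onset_f1(reference, estimated, window=0.05):
    reference = sorted(reference)
    estimated = sorted(estimated)
    matched = [False] * len(reference)
    tp = 0
    for t in estimated:
        for i, ref in enumerate(reference):
            if not matched[i] and abs(t - ref) <= window:
                matched[i] = True
                tp += 1
                break
    precision = tp / len(estimated) if estimated else 0.0
    recall = tp / len(reference) if reference else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# e.g. onset_f1([0.50, 1.02, 1.55], [0.52, 1.00, 1.58, 2.10]) -> 0.857...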

Acknowledgments

This work has been supported by the U.S. Army Research Laboratory (ARL W911NF-23-2-0224). The content reflects the authors' views, not official ARL/U.S. Government policy. The U.S. Government retains reproduction rights.

References

  • [1] Camillo Lugaresi et al. 2019. MediaPipe: A Framework for Building Perception Pipelines. In Proc. CVPR Workshops.
  • [2] Brian McFee et al. 2015. librosa: Audio and Music Signal Analysis in Python. In Proc. SciPy, Vol. 8. 18-25.
  • [3] Nikhila Ravi et al. 2024. SAM 2: Segment Anything in Images and Videos. arXiv:2408.00714 (2024).

Calculate Your Potential ROI

See how an AI-powered tutoring system can reclaim valuable time and reduce training costs within your organization.


Your AI Implementation Roadmap

A typical rollout of our multi-modal AI tutoring system from concept to operational impact.

Phase 1: Discovery & Customization

Initial consultation to understand your specific percussion training needs and dataset availability. Customization of models for instrument variants or unique teaching methodologies.

Phase 2: Data Integration & Model Training

Integration with existing sensor hardware (or recommendations for new setup) and deployment of initial multi-modal data capture. Iterative model training with your specific instructional content.

Phase 3: Pilot Deployment & Feedback

Deployment of the tutoring system in a controlled pilot environment with a select group of students/instructors. Gathering feedback for performance tuning and user experience enhancements.

Phase 4: Full-Scale Rollout & Ongoing Support

Full implementation across your educational or institutional setting. Continuous monitoring, updates, and dedicated support to ensure optimal system performance and user adoption.

Ready to Transform Percussion Education?

Leverage cutting-edge AI to preserve cultural arts and enhance learning. Schedule a personalized consultation to explore how our autonomous tutoring system can benefit your institution.

Ready to Get Started?

Book Your Free Consultation.
