Research Intern: Hand Pose-Language Representation Learning

Job Number: P25INT-42
Honda Research Institute USA (HRI-US) is seeking a self-motivated and independent intern to join our research team working on multimodal machine learning for egocentric perception and skill understanding. This role will focus on developing learning methods that align hand pose information with language representations, enabling fine-grained hand motion and pose cues to be more effectively incorporated into modern downstream models, including large language model-based systems. The intern will work with egocentric video and hand pose data, including data captured from wearable devices such as Meta Aria glasses, to build representations that support tasks such as error detection, temporal action segmentation, long-horizon video understanding, and skill assessment. This is a research-oriented internship with the goal of producing high-impact results suitable for publication at a top-tier conference.
Location: San Jose, CA


Key Responsibilities


  • Conduct literature reviews to identify relevant prior work, formulate research hypotheses, and propose promising technical directions in hand pose-language representation learning.
  • Design and implement machine learning methods for aligning hand pose, egocentric video, and language using large-scale multimodal datasets.
  • Develop or adapt hand pose encoders that learn semantically rich representations for downstream multimodal reasoning and language-conditioned tasks.
  • Investigate pretraining strategies that align structured hand motion and pose information with natural language, analogous to image-language pretraining in vision-language models (see the sketch after this list).
  • Design rigorous experiments, evaluation protocols, and ablation studies to assess representation quality and downstream task performance.
  • Apply and evaluate learned representations on downstream tasks such as error detection, temporal action segmentation, action detection, long-horizon (procedural) video understanding, skill estimation, and dexterity assessment.
  • Build data processing, training, and evaluation pipelines for multimodal learning with hand pose, egocentric video, and text.
  • Analyze experimental results, synthesize technical insights, and communicate findings through internal presentations, technical reports, and research publications.
  • Collaborate closely with researchers to refine problem formulations, iterate on model designs, and identify high-impact publication opportunities (e.g., CVPR, ICCV, ICLR, ECCV, NeurIPS).


Minimum Qualifications


  • Currently pursuing a Ph.D. in Computer Science, Electrical Engineering, Robotics, or a related field. 
  • Strong background in machine learning and deep learning, with hands-on experience training neural networks for representation learning or multimodal learning.
  • Proficiency in Python and deep learning frameworks such as PyTorch.
  • Demonstrated understanding of one or more of the following areas: computer vision, sequence modeling, multimodal learning, human pose modeling, action recognition, or large language models. 
  • Experience designing and running empirical ML experiments, including data preprocessing, model training, evaluation, ablation studies, and model fine-tuning.
  • Familiarity with fine-tuning strategies for deep learning models or foundation models in research settings.
  • Strong written and verbal communication skills, with the ability to read and implement ideas from recent research papers. 


Bonus Qualifications

  • Research experience in egocentric video understanding, human/hand pose estimation, action recognition, temporal action segmentation, or vision-language learning. 
  • Experience with video understanding tasks and human activity datasets, especially in the context of action recognition and/or action segmentation. 
  • Experience working with structured motion signals such as 2D/3D keypoints, skeletal data, trajectories, or sensor-based representations.
  • Experience with large-scale pretraining of vision-language models (VLMs), large language models (LLMs), or fine-tuning of foundation models. 
  • Familiarity with efficient fine-tuning techniques such as LoRA and adapters. 
  • Familiarity with foundation models or methods for integrating non-visual modalities into language-driven systems. 
  • Experience with alignment across modalities such as video, pose, and language.
  • Familiarity with 3D vision or SLAM, particularly for modeling hand or human motion in a world coordinate frame. 
  • Experience with large-scale training, distributed experimentation, or multimodal dataset curation.
  • Track record of research publications in leading venues such as CVPR, ICCV, ECCV, NeurIPS, ICML, ICLR, ACL, or EMNLP.
  • Interest in applying AI to human skill modeling, worker assistance, and human-centered intelligent systems.    


Years of Work Experience Required: 0
Desired Start Date: 9/8/2026
Internship Duration: 3 Months
Position Keywords: Multimodal Learning, Hand Pose Modeling, Egocentric Video Understanding, Pose-Language Alignment, Representation Learning, Large Language Models

Alternate Way to Apply

Send an e-mail to careers@honda-ri.com with the following:
- Subject line including the job number(s) you are applying for 
- Recent CV 
- A cover letter highlighting relevant background (optional)

Please do not contact our office to inquire about your application status.