Action Understanding Using Narration - Honda Research Institute USA

Action Understanding Using Narration

Job Number: P24INT-56
Honda Research Institute USA (HRI-US) is seeking a highly motivated, independent PhD research intern to join our team in advancing the frontiers of human action understanding and computer vision in long procedural videos. The project focuses on learning from narration as supervision for downstream tasks such as action segmentation and proficiency estimation in procedural videos. This role is ideal for a researcher with a strong background in video understanding and vision-language models. The intern will work on real-world challenges involving long-horizon human activity videos and contribute to high-impact publications and patents.
San Jose, CA

 

Key Responsibilities

 

  • Conduct cutting-edge research in learning from narration to detect key points or evaluate proficiency in procedural video understanding.
  • Design and implement novel algorithms for aligning video and text representations, or for training multimodal models using open-source language models.
  • Perform literature review, formulate hypotheses, run experiments, and analyze results.
  • Lead or contribute to research paper writing, including potential submission to top-tier computer vision or machine learning conferences (e.g., CVPR, ICCV, NeurIPS, ECCV).
  • Write well-structured, efficient code using deep learning frameworks such as PyTorch.

 

Minimum Qualifications

 

  • Currently enrolled in a PhD program in Computer Vision, Machine Learning, Artificial Intelligence, or a closely related field.
  • Publication record in top-tier conferences (e.g., CVPR, ICCV, ECCV, WACV, NeurIPS, ICLR).
  • Prior experience with multimodal language models (e.g., Q-Formers, LoRA, and LLMs) or vision-language representation alignment (e.g., CLIP).
  • Excellent programming skills, the ability to write reproducible research code, and proficiency in deep learning frameworks, especially PyTorch.
  • Strong written and verbal communication skills.
  • Ability to independently drive research, from ideation to experimentation and publication.

 

Bonus Qualifications

  • Familiarity with procedural-video downstream tasks such as proficiency estimation or action segmentation/detection, as well as multimodal learning and video understanding.

 

Years of Work Experience Required: 0
Desired Start Date: 1/5/2026
Internship Duration: 3 Months
Position Keywords

Video understanding, vision-language models, learning from narration, multimodal learning, computer vision intern

Alternate Way to Apply

Send an e-mail to careers@honda-ri.com with the following:
- Subject line including the job number(s) you are applying for
- Recent CV
- A cover letter highlighting relevant background (optional)

Please do not contact our office to inquire about your application status.