Research Intern: Long Video Understanding

Job Number: P25INT-47

Honda Research Institute USA (HRI-US) is seeking a highly motivated and independent intern to join our team in advancing the frontiers of human action understanding and computer vision. This role focuses on temporal video understanding in long, untrimmed procedural videos and sits at the intersection of applied research and engineering, with an emphasis on building efficient models for real-world systems. The intern will work on challenging problems involving egocentric and exocentric video streams. Potential project topics include, but are not limited to, developing methods for real-time/online action segmentation, as well as multi-view action understanding, including cross-view transfer and learning view-invariant representations. Additional directions include multimodal alignment (e.g., hand/body pose and video) for action segmentation and proficiency estimation in procedural videos. Depending on project outcomes, the intern may also contribute to high-impact publications and patents.

San Jose, CA

Key Responsibilities

Design and implement approaches for temporal video understanding and action segmentation in long, untrimmed procedural videos, including:
- Multi-view action segmentation
- Cross-view knowledge transfer
- Learning view-invariant representations
- Multimodal alignment (e.g., human/hand pose with egocentric and exocentric video streams)
- Low-latency models for real-time/online inference
Evaluate model performance through rigorous benchmarking and error analysis, and build tools for efficient experimentation, visualization, and reproducibility.
Conduct literature reviews, formulate hypotheses, design experiments, and analyze results.
Build well-structured, efficient, and reproducible codebases using modern deep learning frameworks (e.g., PyTorch).
Collaborate with the team to iterate on ideas, models, and system designs.
Optionally contribute to research publications (e.g., CVPR, ICCV, ECCV, NeurIPS).

Minimum Qualifications

Currently enrolled in an MS or PhD program in Computer Vision, Machine Learning, Robotics, Artificial Intelligence, or a closely related field.
Experience with video understanding, including long-form video or temporal modeling, and familiarity with video encoders (e.g., VideoMAE, LaViLa, TimeSformer) and/or fine-tuning approaches.
Strong programming skills, with the ability to write clean, efficient, and reproducible code, and proficiency in deep learning frameworks (e.g., PyTorch).
Familiarity with model efficiency considerations (e.g., latency, memory, or throughput).
Solid problem-solving skills and the ability to independently drive projects from ideation to experimentation (and optionally publication).
Strong written and verbal communication skills.

Bonus Qualifications

Familiarity with procedural untrimmed video understanding and downstream tasks such as online action segmentation/detection.
Experience working with procedural human action datasets (e.g., long-form, untrimmed videos) and associated annotations.
Experience in multi-view and multimodal learning (e.g., ego and exo views, hand pose, body pose).
Background in learning video representations, modern video understanding methods, and efficient fine-tuning techniques.

Years of Work Experience Required: 0
Desired Start Date: 8/31/2026
Internship Duration: 3 Months
Position Keywords: Long video understanding, multiview learning, multimodal language models, procedural video understanding, temporal reasoning, multimodal learning, hand pose, body pose, representation learning

Alternate Way to Apply

Send an e-mail to careers@honda-ri.com with the following:
- Subject line including the job number(s) you are applying for
- Recent CV
- A cover letter highlighting relevant background (optional)

Please do not contact our office to inquire about your application status.
