- Conduct literature reviews to identify relevant prior work, formulate research hypotheses, and propose promising technical directions in hand pose-language representation learning.
- Design and implement machine learning methods for aligning hand pose, egocentric video, and language using large-scale multimodal datasets.
- Develop or adapt hand pose encoders that learn semantically rich representations for downstream multimodal reasoning and language-conditioned tasks.
- Investigate pretraining strategies that align structured hand motion and pose information with natural language, analogous to image-language pretraining in vision-language models (see the contrastive-alignment sketch after this list).
- Design rigorous experiments, evaluation protocols, and ablation studies to assess representation quality and downstream task performance.
- Apply and evaluate learned representations on downstream tasks such as error detection, temporal action segmentation, action detection, long (procedural) video understanding, skill estimation, and dexterity assessment (a linear-probe sketch follows this list).
- Build data processing, training, and evaluation pipelines for multimodal learning with hand pose, egocentric video, and text.
- Analyze experimental results, synthesize technical insights, and communicate findings through internal presentations, technical reports, and research publications.
- Collaborate closely with researchers to refine problem formulations, iterate on model designs, and identify high-impact publication opportunities (e.g., CVPR, ICCV, ICLR, ECCV, NeurIPS).
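The alignment and pretraining responsibilities above follow the pattern of CLIP-style contrastive learning applied to pose and text. Below is a minimal sketch in PyTorch, assuming a transformer encoder over 21-joint hand keypoint sequences and a symmetric InfoNCE objective; `PoseEncoder`, `clip_style_loss`, and all shapes and hyperparameters are illustrative assumptions, not a prescribed design:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PoseEncoder(nn.Module):
    """Transformer over hand keypoint sequences (illustrative shapes)."""
    def __init__(self, n_joints=21, coord_dim=3, d_model=256, n_layers=4, embed_dim=512):
        super().__init__()
        self.proj_in = nn.Linear(n_joints * coord_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.proj_out = nn.Linear(d_model, embed_dim)

    def forward(self, poses):                    # poses: (B, T, n_joints, coord_dim)
        x = self.proj_in(poses.flatten(2))       # (B, T, d_model)
        x = self.encoder(x).mean(dim=1)          # temporal average pool -> (B, d_model)
        return F.normalize(self.proj_out(x), dim=-1)

def clip_style_loss(pose_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired pose/text embeddings."""
    logits = pose_emb @ text_emb.t() / temperature            # (B, B) similarities
    targets = torch.arange(len(logits), device=logits.device) # matched pairs on diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Toy usage with random paired data; a real text encoder would replace text_emb.
encoder = PoseEncoder()
poses = torch.randn(32, 64, 21, 3)                     # 32 clips of 64 frames
text_emb = F.normalize(torch.randn(32, 512), dim=-1)   # stand-in text embeddings
loss = clip_style_loss(encoder(poses), text_emb)
```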
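For the downstream-evaluation responsibility, one common protocol is a linear probe on frozen embeddings: only a linear classifier is trained, so the score reflects representation quality rather than fine-tuning capacity. A hedged sketch, where `linear_probe` and the tensor shapes are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def linear_probe(train_emb, train_y, test_emb, test_y, n_classes, epochs=200, lr=1e-2):
    """Fit a linear classifier on frozen embeddings; report held-out accuracy."""
    clf = torch.nn.Linear(train_emb.shape[1], n_classes)
    opt = torch.optim.Adam(clf.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        F.cross_entropy(clf(train_emb), train_y).backward()
        opt.step()
    with torch.no_grad():                         # evaluation only, no gradients
        return (clf(test_emb).argmax(dim=1) == test_y).float().mean().item()
```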
Minimum Qualifications
- Currently pursuing a Ph.D. in Computer Science, Electrical Engineering, Robotics, or a related field.
- Strong background in machine learning and deep learning, with hands-on experience training neural networks for representation learning or multimodal learning.
- Proficiency in Python and deep learning frameworks such as PyTorch.
- Demonstrated understanding of one or more of the following areas: computer vision, sequence modeling, multimodal learning, human pose modeling, action recognition, or large language models.
- Experience designing and running empirical ML experiments, including data preprocessing, model training, evaluation, ablation studies, and model fine-tuning.
- Familiarity with fine-tuning strategies for deep learning models or foundation models in research settings.
- Strong written and verbal communication skills, with the ability to read and implement ideas from recent research papers.
Bonus Qualifications
- Research experience in egocentric video understanding, human/hand pose estimation, action recognition, temporal action segmentation, or vision-language learning.
- Experience with video understanding tasks and human activity datasets, especially in the context of action recognition and/or action segmentation.
- Experience working with structured motion signals such as 2D/3D keypoints, skeletal data, trajectories, or sensor-based representations.
- Experience with large-scale pretraining of vision-language models (VLMs), large language models (LLMs), or fine-tuning of foundation models.
- Familiarity with parameter-efficient fine-tuning techniques such as LoRA and adapters (see the sketch after this list).
- Familiarity with foundation models or methods for integrating non-visual modalities into language-driven systems.
- Experience with multimodal alignment across modalities such as video, pose, and language.
- Familiarity with 3D vision or SLAM, particularly for modeling hand or human motion in a world coordinate frame.
- Experience with large-scale training, distributed experimentation, or multimodal dataset curation.
- Track record of research publications in leading venues such as CVPR, ICCV, ECCV, NeurIPS, ICML, ICLR, ACL, or EMNLP.
- Interest in applying AI to human skill modeling, worker assistance, and human-centered intelligent systems.
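For the LoRA item in the list above, a minimal sketch of a low-rank adapter wrapped around a frozen linear layer, assuming PyTorch; the class name, rank, and scaling below are illustrative rather than tied to any particular library:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update: W x + (alpha/r) * B A x."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():   # freeze the pretrained weights
            p.requires_grad_(False)
        self.lora_a = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, r))  # zero-init
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * (x @ self.lora_a.t() @ self.lora_b.t())
```

Zero-initializing `lora_b` makes the wrapped layer start out identical to the frozen base, so training perturbs the pretrained model gradually instead of disrupting it at step one.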
Years of Work Experience Required
0

Desired Start Date
9/8/2026

Internship Duration
3 Months

Position Keywords
Multimodal Learning, Hand Pose Modeling, Egocentric Video Understanding, Pose-Language Alignment, Representation Learning, Large Language Models