Sesame

Sesame is hiring: ML Vision Engineer (Visual Language Modeling) in Bellevue

Sesame, Bellevue, WA, United States, 98009


About Sesame

Sesame believes in a future where computers are lifelike - with the ability to see, hear, and collaborate with us in ways that feel natural and human. With this vision, we're designing a new kind of computer, focused on making voice companions part of our daily lives. Our team brings together founders from Oculus and Ubiquity6, alongside proven leaders from Meta, Google, and Apple, with deep expertise spanning hardware and software. Join us in shaping a future where computers truly come alive.

About the Role

Vision understanding is a critical addition to conversational AI, giving it the context to feel truly present with users. We are seeking a skilled Machine Learning Vision Engineer specializing in vision-language modeling to develop and implement advanced models that bridge vision and language understanding. The ideal candidate will have a strong background in multimodal deep learning, vision-language pretraining, and transformer-based architectures, with a passion for creating AI systems that can interpret and generate visual-text representations.

Responsibilities:

  • Design, train, and deploy models integrating vision and language, including contrastive learning, captioning, and image/video understanding.

  • Develop algorithms that enable deep fusion of textual and visual data, leveraging architectures like CLIP, BLIP, and Vision Transformers (ViTs).

  • Optimize training strategies, loss functions, and data pipelines to improve performance across various multimodal tasks.

  • Fine-tune and adapt large-scale pre-trained vision-language models for downstream applications.

  • Work closely with researchers, software engineers, and product teams to integrate visual-language AI capabilities into real-world applications.

  • Stay at the forefront of advancements in vision-language modeling, contributing to novel techniques and methodologies.

  • Construct and preprocess large-scale multimodal datasets to enhance model generalization and robustness.

  • Maintain clear and comprehensive documentation of model architectures, training procedures, and evaluation results.

Required Qualifications:

  • Bachelor’s, Master’s, or Ph.D. in Computer Science, Artificial Intelligence, Machine Learning, or a related field.

  • Proven experience in developing vision-language models and multimodal machine-learning systems.

  • Strong proficiency in deep learning frameworks such as PyTorch or JAX.

  • Experience with transformer-based architectures, contrastive learning, and generative models (e.g., LLaVA, Flamingo, GPT-4V).

  • Familiarity with large-scale dataset handling, including multimodal datasets.

  • Knowledge of retrieval-augmented generation (RAG), vision-language alignment, and model interpretability.

  • Experience in deploying multimodal models in production environments.

  • Strong analytical and problem-solving skills and attention to detail when handling complex data structures.

  • Excellent communication skills, with the ability to articulate technical concepts to diverse audiences.

  • Ability to work collaboratively in a fast-paced, interdisciplinary team environment.

Benefits:

  • 401k matching

  • 100% employer-paid health, vision, and dental benefits

  • Unlimited PTO and sick time

  • Flexible spending account matching (medical FSA)

Sesame is committed to a workplace where everyone feels valued, respected, and empowered. We welcome all qualified applicants, embracing diversity in race, gender, identity, orientation, ability, and more. We provide reasonable accommodations for applicants with disabilities—contact careers@sesame.com for assistance.
