Job Information
MD Anderson Cancer Center Principal Machine Learning Engineer in Houston, Texas
MD Anderson is expanding our Enterprise MLOps and Analytics Platform capabilities to enhance support for MLOps and ModelOps, facilitating the operationalization, monitoring, and management of in-house and third-party AI solutions. This expansion is integral to strengthening our overall AI Governance framework.
We are actively seeking a Principal MLOps Engineer to lead and establish a team of MLOps engineers tasked with designing and enhancing our Enterprise MLOps and Analytics Platform. The Principal MLOps Engineer will be pivotal in managing the operations and overseeing the entire MLOps and ModelOps tech stack, ensuring robust end-to-end lifecycle management and governance of machine learning models throughout the organization. As the principal figure in MLOps, this role involves defining the team dynamics, shaping the culture, and implementing processes and technologies essential for supporting ML solutions within our hybrid data and compute framework. Collaborating closely with IT, cybersecurity, and compliance teams, the Principal MLOps Engineer will be instrumental in creating a secure and compliant infrastructure for the scalable management of AI/ML models.
Key responsibilities include:
Lead and mentor a team of MLOps Engineers to create a scalable MLOps and Analytics platform within a hybrid compute environment, including Kubernetes and Azure.
Design, implement, and oversee CI/CD pipelines, ensuring the infrastructure is conducive to ML model training, deployment, and monitoring while upholding security, scalability, reliability, and performance.
Develop, refine, and standardize Model Governance integrations, performance tracking for bias and impact, and a model catalog with standardized scorecards and deployments.
Innovate automated validation, deployment, observability, and management tools for scalable and reproducible AI solutions.
Design fallback and decommissioning strategies for AI solutions to ensure operational continuity.
Deliver training on AI solutions to enhance understanding and application across the organization.
Engage with technology trends, contribute to tech communities, and foster a culture of continuous learning and innovation.
Technical Expertise
Demonstrate deep understanding of the AI/ML Platform infrastructure and cloud architecture.
Experience developing and deploying AI/ML algorithms into production.
Strong proficiency in Python and C++ or C#, complemented by experience with machine learning libraries such as TensorFlow, PyTorch, and Scikit-learn.
Knowledge of DevOps practices and CI/CD pipelines, including tools like Azure DevOps or GitHub Actions.
Proficiency in working with containers such as Docker and container orchestration systems like Kubernetes. Familiarity with process orchestration/DAG tools.
Experience with AI/ML algorithms and packages (e.g., scikit-learn, PyTorch, TensorFlow).
Experience with data, code, and model artifact management processes and MLOps tools.
Experience with on-premises, cloud-based, and hybrid computing environments, as well as cloud-native tools and services.
Knowledge of ISO standards for software and/or AI development lifecycle management.
Analytical Skills
Experience with project management methodologies (e.g., SAFe Agile, PRINCE2, Lean).
Deep understanding of AI/ML model lifecycle management.
Proficient in decision-making, problem-solving, and the successful execution of AI/ML solutions in a healthcare environment.
Manage AI/ML projects throughout their lifecycle, ensuring timely delivery, budget adherence, and quality standards compliance.
Experience leading an ML engineering and/or data scientist team focused on developing, deploying, and maintaining production-ready models.
Experience working closely with third-party vendors and partners to integrate new AI solutions into existing infrastructure and workflows.
Experience implementing risk identification and mitigation strategies, including contingency planning, to prevent project delays and complications.
Preference for working knowledge of hospital workflows.
Preference for experience with healthcare data privacy and security protocols, such as HIPAA.
Preference for experience with flowcharts, business process models, mapping tools, data flow diagrams, and process flow diagrams.
Professionalism: Oral and Written
Work closely with data scientists, ML engineers, software engineers, and other stakeholders to understand requirements and integrate machine learning models into the overall system.
Create and maintain comprehensive documentation for CI/CD pipelines, deployment processes, and infrastructure configurations.
Experience reporting on project progress, impact, and risks to leadership and stakeholders, providing strategic advice to help prioritize AI/ML solution use cases.
Experience with stakeholder management to drive adoption, address concerns, and prioritize solution support.
Other duties as assigned
Education Required: Bachelor's degree in Computer Science, Software Engineering, Data Science, Physics, Math & Statistics, or another related engineering discipline.
Preferred Education: Master's Level Degree
Experience Required: Seven years of experience in machine learning engineering, data science, data engineering, and/or software engineering. With Master's degree, five years of experience required. With PhD, three years of experience required.
Preferred Experience: Experience working on production-quality, healthcare-focused machine learning solutions.
Proficiency in architecting, implementing, and using MLOps solutions.
It is the policy of The University of Texas MD Anderson Cancer Center to provide equal employment opportunity without regard to race, color, religion, age, national origin, sex, gender, sexual orientation, gender identity/expression, disability, protected veteran status, genetic information, or any other basis protected by institutional policy or by federal, state or local laws unless such distinction is required by law. http://www.mdanderson.org/about-us/legal-and-policy/legal-statements/eeo-affirmative-action.html
Additional Information
Requisition ID: 166432
Employment Status: Full-Time
Employee Status: Regular
Work Week: Days
Minimum Salary: US Dollar (USD) 160,500
Midpoint Salary: US Dollar (USD) 203,000
Maximum Salary: US Dollar (USD) 245,500
FLSA: exempt and not eligible for overtime pay
Fund Type: Hard
Work Location: Remote (within Texas only)
Pivotal Position: Yes
Referral Bonus Available?: Yes
Relocation Assistance Available?: Yes
Science Jobs: No