Need SRE Contractor AI/ML Infrastructure and Ops Engineer (AI/ML Training) // Remote

Role: SRE Contractor AI/ML Infrastructure and Ops Engineer (AI/ML Training)

Location: Remote

 

 

Job Description –

SRE Contractor - AI/ML Infrastructure and Ops Engineer (AI/ML Training)

As the Infrastructure and Ops Engineer, you will work on operations for UAIS (United AI Studio, an enterprise AI/ML platform), in particular the AI/ML training initiative that supports thousands of learners on the platform. This individual contributor (IC) role requires experience working on large-scale AI/ML platforms and ensuring their stability, reliability, scalability, and performance. Experience with modern infrastructure and DevOps tools and paradigms, as well as hands-on knowledge of major cloud services such as Azure, AWS, and GCP, is a must.

 

Primary Responsibilities:

 

    • Continuous support: Provide continuous SRE support to thousands of geographically distributed learners on the UAIS platform: respond to tickets, triage support requests, and liaise with customers.
    • Automation & DevOps: Improve existing Infrastructure as Code (IaC) according to DevOps best practices.
    • Systems Monitoring: Develop and maintain monitoring frameworks for UAIS infrastructure in relation to the AI/ML training program.
    • Security & Compliance: Collaborate with cybersecurity teams to ensure all systems and operations comply with industry standards and are secure against evolving threats.
    • Capacity Planning & Cost Optimization: Forecast and manage capacity requirements for the AI/ML training environment, while identifying opportunities to reduce costs without compromising performance.

 

Required Qualifications:

 

    • Bachelor’s degree in computer science, information technology, or a related field.
    • 5+ years of infrastructure experience: Proven experience working on large-scale, cloud-based, enterprise-level software platforms and a deep understanding of multi-cloud architectures, specifically Azure, AWS, and GCP, with hands-on experience in cloud management.
    • 3+ years of practical experience with Infrastructure-as-Code and CI/CD tools such as Terraform, GitHub Actions, and the like.
    • 2+ years of practical experience with containerization and orchestration technologies (Docker, Kubernetes).
    • 2+ years of practical experience in scripting and automation: advanced proficiency in scripting languages such as Python and Bash to support automation and system integration efforts.

 

Preferred Qualifications:

 

    • Security & Compliance Knowledge: Strong understanding of security best practices and experience ensuring compliance with relevant regulatory frameworks.
    • Machine Learning and LLM Operations: Exposure to modern tools and techniques in the MLOps and LLMOps fields.
    • Exposure to AI/ML-specific infrastructure tools (e.g., MLflow, Kubeflow) for managing and deploying models at scale.
    • Exposure to a Regulated Industry: Experience working within healthcare or another regulated industry, with a solid understanding of the unique challenges and compliance requirements.
    • Ability to work independently, manage multiple projects simultaneously, and adapt to

 

 

Thanks,

Rahul Srivastava

TekisHub® Consulting Services

Work: 302-613-2500 Ext 262 

Email: rahul.kumar@tekishub.com

 
