Project Role : Infrastructure Architect
Project Role Description : Lead the definition, design and documentation of technical environments. Deploy solution architectures, conduct analysis of alternative architectures, create architectural standards, define processes to ensure conformance with standards, institute solution-testing criteria, define a solution's cost of ownership, and promote a clear and consistent business vision through technical architectures.
Must have skills : Red Hat OS Administration
Good to have skills : NA
Minimum 3 year(s) of experience is required
Educational Qualification : 15 years full time education
Summary: We are looking for an experienced Linux administrator (Red Hat, Ubuntu, etc.) who is also an Azure CycleCloud expert, with strong expertise in managing HPC environments, VM scale sets, and hybrid infrastructure integration with Linux. The ideal candidate will have deep knowledge of Red Hat OS administration, the InfiniBand link layer, and Azure networking, and experience integrating Azure NetApp Files (ANF) as a storage solution. You will be responsible for configuring, managing, and troubleshooting CycleCloud clusters to ensure that high-performance computing (HPC) workloads run smoothly, while supporting continuous optimization and improvement.
Key Responsibilities:
• System Installation & Configuration: Install, configure, and maintain Linux-based operating systems, software, and hardware.
• Server Maintenance: Regularly monitor system health and performance, apply routine updates and patches, and troubleshoot issues to ensure continuous server uptime.
• Security Management: Implement and enforce security protocols, conduct system audits, manage firewalls, and ensure that systems comply with organizational security policies.
• Performance Monitoring: Monitor server performance using tools like Nagios, Zabbix, or others, identify bottlenecks, and optimize system resources for peak performance.
• Backup and Recovery: Set up and manage backup systems, implement disaster recovery strategies, and ensure data integrity.
• Automation & Scripting: Develop and maintain shell scripts, automation workflows, and deployment pipelines to enhance operational efficiency.
• User Management: Manage user accounts, permissions, and groups, ensuring appropriate access controls.
• Log Management: Analyze and manage system logs, perform log rotation, and address any alerts or errors.
• Collaboration: Work with cross-functional teams such as network administrators, developers, and security teams to implement best practices and resolve issues.
• Troubleshooting: Provide troubleshooting support for Linux servers, identifying hardware, software, and network issues, and resolving them in a timely manner.
• Documentation: Create and maintain technical documentation, including system configurations, processes, and procedures.
• Capacity Planning: Assist in server capacity planning and the scaling of infrastructure as needed to accommodate business growth.
• CycleCloud Management:
   o Configure, manage, and maintain Azure CycleCloud clusters for HPC and distributed workloads, ensuring high performance, scalability, and reliability.
   o Manage VM scale sets through Azure CycleCloud to scale compute resources efficiently for various workloads.
   o Integrate Azure CycleCloud (AZCC) with hybrid infrastructures, including Linux-based environments, to ensure seamless operations across on-premises and cloud-based resources.
• InfiniBand Management:
   o Understand in depth and manage the InfiniBand link layer used in Azure CycleCloud (AZCC) for high-performance networking.
   o Troubleshoot and resolve any issues related to InfiniBand connections, ensuring low-latency, high-throughput communication between nodes in the HPC environment.
• Azure CycleCloud Web UI Management:
   o Manage the CycleCloud Web UI to configure and monitor clusters, jobs, and compute resources.
   o Ensure the web interface provides accurate reporting and insights into cluster status, usage, and performance.
• CycleCloud App Services:
   o Manage and monitor CycleCloud App services to support various applications and ensure that they are properly integrated into the cloud infrastructure.
   o Troubleshoot and optimize app services within the CycleCloud ecosystem to meet performance and availability requirements.
• Azure ANF Integration:
   o Manage Azure NetApp Files (ANF) integration with CycleCloud to provide a scalable and high-performance storage solution for HPC environments.
   o Ensure that ANF is effectively utilized to meet storage and performance demands.
• Azure Networking:
   o Manage Azure networking components such as Virtual Networks (VNets), Network Security Groups (NSGs), VPNs, and Load Balancers to ensure optimal connectivity and security for CycleCloud clusters.
• Scripting & Automation:
   o Write and maintain Bash scripts to automate common tasks such as cluster configuration, monitoring, and job submission.
   o Implement automated workflows to enhance efficiency and ensure consistency across the CycleCloud environment.
• Troubleshooting & Support:
   o Provide expert-level troubleshooting for CycleCloud and Azure environments, diagnosing issues related to HPC workload execution, performance, or infrastructure.
   o Collaborate with other teams to resolve technical issues in a timely manner and ensure minimal downtime.
Required Skills & Qualifications:
• Education: Bachelor’s degree in Computer Science, Information Technology, or a related field (or equivalent experience).
• Experience: Minimum of 3-5 years of experience in Linux system administration.
• Linux Expertise: Proficient in managing Linux distributions such as Red Hat, CentOS, Ubuntu, Debian, etc.
• Shell Scripting: Strong experience with scripting languages such as Bash, Python, or Perl.
• Networking: Solid understanding of networking concepts, including DNS, DHCP, HTTP, FTP, and SSH.
• Deep understanding of NIS centralized user authentication.
• Virtualization & Cloud: Experience with virtualization tools (e.g., VMware, KVM) and cloud platforms (e.g., Azure).
• Security: Knowledge of Linux security best practices, firewalls, SELinux, and system hardening techniques.
• Performance Tuning: Experience with performance monitoring tools (e.g., top, iostat) and system optimization.
• Backup Solutions: Familiarity with backup and disaster recovery solutions, such as rsync, tar, Bacula, or others.
• Troubleshooting: Excellent problem-solving skills with the ability to analyze and resolve complex system issues.
• Communication: Strong verbal and written communication skills to collaborate with teams and document processes effectively.
• Experience with Azure CycleCloud (AZCC): Hands-on experience with managing Azure CycleCloud clusters, particularly in high-performance computing (HPC) environments.
• InfiniBand Expertise: Deep understanding of the InfiniBand link layer and its integration into Azure CycleCloud, particularly for HPC workloads.
• VM Scale Set Management: Experience managing VM scale sets within Azure CycleCloud.
• Hybrid Infrastructure Integration: Skill in integrating Linux-based environments with CycleCloud for hybrid infrastructure management.
• Azure Networking: Strong knowledge of Azure networking concepts and practices, including VNets, NSGs, VPNs, and Load Balancers.
• Azure ANF Integration: Proven experience with Azure NetApp Files (ANF) integration for storage solutions.
• Bash Scripting: Proficiency in Bash scripting for automating tasks and processes within the CycleCloud environment.
• Troubleshooting & Problem-Solving: Strong analytical and troubleshooting skills to resolve complex issues in CycleCloud and Azure environments.
• Cloud Security Knowledge: Understanding of security best practices, particularly around networking, storage, and access control in the Azure environment.
Additional Information:
- The candidate should have a minimum of 3 years of experience in Red Hat OS Administration.
- This position is based at our Gurugram office.
- 15 years of full-time education is required.