The Scientific Computing (SC) program within the Information Management & Technology (IMT) function provides end-to-end technology platforms ranging from corporate IT systems to leading-edge high-performance data processing tools and platforms. The team manages a hybrid on-premises and multi-cloud environment of computing and data services that delivers millions of core hours of computing time and hosts over 200 petabytes of data. This edge-to-cloud computational and data fabric includes several high-performance computing clusters, multiple hyperscale public cloud tenancies, and a tiered mass and high-performance storage infrastructure with a national footprint. The group also hosts CSIRO’s corporate applications on a highly versatile and robust virtualised platform, supports organisational data management practices, and provides a suite of applications for managing research data throughout its lifecycle.
- Provide system support for advanced research computing environments, including the installation, integration, management and monitoring of hardware systems and interconnects, operating system, platform software, and management software to ensure the systems are operating at optimal performance and stability levels.
- Monitor and manage workloads on Kubernetes and HPC to ensure optimal use of resources.
- Collaborate with other teams to design, implement and maintain fit-for-purpose computing platforms based on private and public cloud resources, including Kubernetes and HPC.
- Contribute to the implementation and maintenance of a container-based scientific application environment for applications such as artificial intelligence and machine learning, e.g. Jupyter.
- Contribute to the development of best practice guides and policies, such as application workload management, quota management and cyber security.
- Contribute to service management processes, including change management, configuration management and incident management.
- Ability to monitor system usage and performance statistics to ensure optimal use of resources.
- Ability to analyse complex problems, interpret operational needs, and develop integrated, creative solutions.
Qualifications & Experience:
- Experience in Kubernetes platform administration and container management for scientific applications and workflows, including monitoring, logging, and the management of networks, secrets, data and applications.
- Experience in HPC system administration, parallel computing architectures and Linux operating systems.
- Experience in managing a computing environment for machine learning, deep learning and artificial intelligence, such as Jupyter notebooks and relevant Python packages.
- Experience with high-level languages such as Python, shell scripting and Go.
Vacancy Type: Full Time
Job Location: Geelong, Victoria, Australia
Application Deadline: N/A