Senior Site Reliability Engineer (SRE)

Qdrant

5 months ago

Worldwide

Qdrant is an Open-Source Vector Database.

We help businesses take advantage of modern AI technologies. We are developing neural search solutions that allow everyone to use state-of-the-art neural network encoders at production scale. At the same time, we help companies integrate our technology into their infrastructure. Our flagship product is the open-source vector database: https://github.com/qdrant/qdrant

Among the technical challenges we are facing is the implementation of our cloud infrastructure to serve our engine as a scalable cloud API solution. We are looking for a Site Reliability Engineer to ensure the stable and secure operation of our managed solutions. If you're passionate about Site Reliability Engineering, Python, Go, Kubernetes, and contributing to the growth of a cutting-edge Database as a Service, we want to hear from you! Apply now and become a key player in shaping the reliability and scalability of our DBaaS platform.

Tasks

Infrastructure Automation: Design, implement, and manage infrastructure code using Terraform, focusing on the reliability and scalability of our Database as a Service (DBaaS) platform.
Programming Mastery: Utilize Python and Go to improve our service quality and develop automation scripts and tools for monitoring, deployment, and maintenance tasks specific to database operations.
Kubernetes Expertise: Demonstrate a deep understanding of Kubernetes, ensuring optimal performance, scalability, and reliability for our DBaaS platform.
Operator Frameworks: Develop and maintain Kubernetes Operators for automating database platform operations, enhancing the reliability of our services.
Multi-Cloud Management: Architect and maintain infrastructure in multi-cloud environments (AWS, GCP, Azure) to provide a resilient and available DBaaS solution.
Monitoring and Incident Response: Implement effective monitoring solutions tailored for database services and collaborate on incident response procedures to maintain the high availability of our systems.
Service Level Objectives (SLOs) and Agreements (SLAs): Define, measure, and maintain SLOs and SLAs specific to database performance and reliability, actively monitoring and optimizing systems to meet these targets.

Requirements

Site Reliability Engineering Focus: Proven experience in a Site Reliability Engineering or similar role, with a strong emphasis on database systems.
Programming Languages: Proficiency in Python and Go; experience with other languages is a plus.
Kubernetes Skills: Proven hands-on experience managing and optimizing Kubernetes clusters, particularly in the context of database services.
Operator Frameworks: Strong background in developing and maintaining Kubernetes Operators, with a focus on database automation.
Infrastructure as Code (IaC): Solid understanding and experience with Terraform, Ansible, or Pulumi, specifically applied to database infrastructure.
Multi-Cloud Expertise: Experience working with multi-cloud environments (AWS, GCP, Azure), ensuring seamless database operations across platforms.
Container Orchestration: Deep understanding of containerization concepts and orchestration tools (Docker, Kubernetes) within the DBaaS context.
SLOs and SLAs: Demonstrated experience in defining, implementing, and meeting Service Level Objectives and Agreements, particularly in the context of database reliability.
Problem Solving: Strong analytical and problem-solving skills, with keen attention to detail.
Communication Skills: Excellent communication and collaboration skills, with the ability to convey complex technical concepts to diverse audiences.

Benefits

Working in a passionate international team
Competitive salary plus perks
Flexible working hours
Company events
Choose any hardware
Remote-first / home office
Relocation option