A fast-growing provider of AI-powered solutions is scaling its operations. With a strong customer base and increasing demand, the existing engineering team is under pressure to handle both infrastructure improvements and customer-facing support.
To meet this growth, the company is looking to add an Infrastructure Engineer to a team of two (as the third engineer), supporting Kafka, Redis, OpenSearch, RabbitMQ, and ClickHouse for its products.
Tasks
- Manage, monitor, and optimize ClickHouse clusters in production, including schema design, query performance tuning, replication configuration, and capacity planning;
- Operate and maintain Kafka clusters, OpenSearch deployments, and other distributed systems, ensuring high availability and optimal performance;
- Deploy, configure, and manage containerized applications and stateful workloads on Kubernetes, implementing best practices for resource management and scaling;
- Implement and maintain GitOps workflows for infrastructure and application deployments, ensuring version-controlled and automated deployment processes;
- Design and implement comprehensive monitoring, logging, and alerting solutions for distributed systems, enabling proactive issue detection and rapid troubleshooting;
- Conduct performance analysis, identify bottlenecks, and implement optimizations across distributed systems to meet SLA requirements and improve system resilience;
- Create and maintain technical documentation, runbooks, and operational procedures while collaborating with development teams to ensure smooth integration and operations.
Requirements
- Hands-on experience operating distributed systems in production environments, with strong understanding of distributed computing concepts, data consistency, and fault tolerance;
- Solid experience with ClickHouse, including cluster management, MergeTree engine families, data modeling, query optimization, and replication strategies;
- Practical experience deploying and managing applications on Kubernetes, including StatefulSets, persistent volumes, networking, and security configurations;
- Working knowledge of Apache Kafka (brokers, topics, partitions, consumer groups) and OpenSearch or similar search and analytics engines;
- Experience with GitOps practices and Infrastructure as Code tools (Terraform, Helm, or similar), with ability to manage infrastructure through declarative configuration;
- Proficiency with monitoring and observability platforms (Prometheus, Grafana, or similar) and experience implementing metrics collection and alerting strategies;
- Hands-on experience with at least one major cloud platform (AWS, GCP, or Azure), including compute, storage, and networking services;
- Strong scripting and programming skills in Python, Go, or Bash for automation, tooling development, and operational tasks.
Nice to have:
- Experience with other distributed data systems (Redis, Spark, Flink, etc.);
- Knowledge of data streaming patterns and event-driven architectures;
- Strong analytical and troubleshooting skills with ability to diagnose complex distributed systems issues, coupled with clear communication skills for cross-functional collaboration.
Benefits
Working conditions:
- This role is available only to candidates based in Croatia, Serbia, Portugal, or Poland;
- Duration: 1 year+, with extension possibility;
- Locations: Serbia, Portugal, Croatia, Poland;
- Overlap: until 11:00 AM PST at most;
- Employment Type: Full-time