A fast-growing provider of AI-powered solutions is scaling its operations. With a strong customer base and increasing demand, the existing engineering team is under pressure to handle both infrastructure improvements and customer-facing support.
To meet this growth, the company is looking to add an Infrastructure Engineer to its team of two (as the third engineer), supporting Kafka, Redis, OpenSearch, RabbitMQ, and ClickHouse for its products.
Tasks
- Manage, monitor, and optimize ClickHouse clusters in production, including schema design, query performance tuning, replication configuration, and capacity planning;
- Operate and maintain Kafka clusters, OpenSearch deployments, and other distributed systems, ensuring high availability and optimal performance;
- Deploy, configure, and manage containerized applications and stateful workloads on Kubernetes, implementing best practices for resource management and scaling;
- Implement and maintain GitOps workflows for infrastructure and application deployments, ensuring version-controlled and automated deployment processes;
- Design and implement comprehensive monitoring, logging, and alerting solutions for distributed systems, enabling proactive issue detection and rapid troubleshooting;
- Conduct performance analysis, identify bottlenecks, and implement optimizations across distributed systems to meet SLA requirements and improve system resilience;
- Create and maintain technical documentation, runbooks, and operational procedures while collaborating with development teams to ensure smooth integration and operations.
Requirements
- Strong programming skills in Python, TypeScript, or similar.
- Experience building software that uses LLM or generative AI APIs.
- Hands-on experience using AI coding assistants such as GitHub Copilot, Cursor, Claude Code, or similar.
- Understanding of LLM fundamentals, such as tokenization, context limits, temperature, and prompt structure.
- Experience with RAG, agents, or function calling in production.
- Ability to design experiments and measure impact on quality, cost, and performance.
- Good communication skills and ability to work with cross-functional teams.
Nice to have
- Experience with cloud environments like AWS, GCP, or Azure.
- Background in NLP, ML, or developer tools.
Other:
Location: Serbia, Portugal, Poland
Working conditions: CET business hours