Description
At RH we believe deeply that the "right" people are our greatest asset. We value people with high energy, who possess the ability to energize others. People who are smart, creative and have a point of view. People who see the answer in every problem, versus those who see the problem in every answer. People who are driven, determined and won't take "no" for an answer. We value team players, people who are more concerned with what's right, rather than who's right.
RESPONSIBILITIES:
We are looking for a Principal Site Reliability Engineer (SRE) to provide strategic support for, and hands-on execution of, our infrastructure, security, continuous integration, deployment, IT operations, scaling, and metrics practices, and to run day-to-day operations of the production and development infrastructure for a cloud-based commerce/enterprise platform.
If you possess a "can do" attitude, are driven by research and problem-solving, and thrive on challenges, this opportunity will interest you.
You’ll work closely with the Development and QA teams to continuously improve existing features and roll out new services, ensuring the high availability of our platform.
You’re comfortable with infrastructure and configuration, but also happy to roll up your sleeves, fix code, write tests, debug, and ship features.
REQUIREMENTS:
- Obsess about site reliability and performance, and look for ways to continuously improve both
- Own and lead initiatives to define, design, and implement solutions that help prevent issues impacting availability/performance and reduce time to resolution
- Understand the overall ecommerce architecture and identify opportunities to optimize it with an eye on availability/performance
- Identify and execute on automation opportunities in the context of code deployment, problem identification, and resolution
- Act as a subject matter expert on SRE/DevOps best practices with CloudFormation, Auto Scaling Groups, build tools, monitoring, and configuration management
- Analyze best practices and emerging concepts in DevOps, infrastructure automation, Akamai configuration management, and enterprise security
- Continuously improve observability capabilities (e.g., Prometheus, Grafana, Splunk) to ensure the right leading indicators are monitored and appropriate response workflows are set up.
- Review and audit existing solutions, designs, and system architecture
- Profile and troubleshoot existing solutions, and improve the performance of the systems under coverage
- Create technical documentation and maintain the CI/CD pipeline (Jenkins)
Job Qualifications:
- BS/MS (MS preferred) in Computer Science or equivalent work experience
- 4+ years of experience supporting mission-critical workloads, such as ecommerce, in a distributed architecture environment
- Solid technical know-how and proven record of problem-solving in a distributed architecture setting
- Excellent critical thinking skills and a demonstrated strong work ethic
- Solid team player with the ability to collaborate cross-functionally with tech and business
- Excellent communication skills, with a demonstrated ability to explain complex technical issues to technical and non-technical audiences and a collaborative, partnership mentality