Senior SRE
Debtsy
At January, we're transforming the lives of borrowers by bringing humanity to consumer finance. Our data-driven products empower financial institutions to streamline their collections, providing borrowers with straightforward and compassionate solutions to regain financial stability and control over their lives. We're not just expanding access to credit. We're restoring dignity and paving the way for millions to achieve financial freedom.
About the Role
As a Senior SRE, you will ensure the reliability, scalability, and performance of January's production and internal systems as we scale from thousands to millions of borrowers. You'll establish SRE practices from the ground up: architecting resilient infrastructure, implementing proactive monitoring solutions, and building sustainable on-call processes that evolve with our rapid growth. Your work will directly tackle our current scaling challenges, including database optimization, async workflow infrastructure, and data pipeline reliability, while ensuring our engineering team can ship with confidence.
What You’ll Work On
Lead incident response and establish sustainable on-call practices, including comprehensive runbooks, blameless postmortems, and systematic improvements that reduce MTTR
Develop and maintain self-service observability solutions using modern monitoring tools that provide actionable insights for troubleshooting and performance optimization
Create and maintain infrastructure as code (using Terraform, CloudFormation) that allows for consistent, scalable, and secure cloud environments on AWS
Partner closely with feature teams to architect resilient infrastructure for critical components (databases, networking, async workflows, data pipelines) that scale seamlessly
Work closely with DevX to design and implement robust CI/CD pipelines with advanced deployment strategies (blue/green, canary) that enable teams to ship confidently and rapidly
Advocate for best practices early in feature design, ensuring we design with reliability in mind and future-proof our services
What You Bring to the Table
Expertise leading incident response for high-availability production systems, conducting thorough root cause analysis, and fostering a blameless postmortem culture
Experience designing highly available deployment architectures across multiple targets (e.g. EC2, Fargate), with expertise in auto-scaling, health checks, and graceful degradation strategies
Track record of implementing effective monitoring & observability solutions (e.g. Datadog, Prometheus, ELK), and evangelizing best practices
Strong knowledge of AWS cloud services and infrastructure-as-code practices using tools like Terraform
Experience with CI/CD pipelines and automation to enable reliable, efficient deployments
Excellent communication skills with experience documenting processes and collaborating across engineering teams
We encourage you to apply even if your experience isn’t an exact match. We value professional development and on-the-job learning!
We are currently hiring for this position in our New York office.
As a New York City-based company, we are dedicated to transparent, fair, and equitable compensation practices that reflect our commitment to fostering an environment where all team members are valued and supported. We encourage individuals from all backgrounds to apply.
We are an equal opportunity employer committed to diversity and inclusion in the workplace. We do not discriminate based on race, color, religion, sex, sexual orientation, gender identity, national origin, disability status, age, veteran status, or any other legally protected characteristic.