Principal Site Reliability Engineer
Oracle Cloud Infrastructure
About Oracle Cloud:
Oracle Cloud is a comprehensive suite of cloud services—including infrastructure, platform, and applications—designed to help organizations build, deploy, and manage workloads securely at scale. At Oracle, we are building the most intelligent future of cloud computing. Our team is composed of talented, motivated, and diverse individuals committed to empowering our customers to accomplish their most important missions using Oracle Cloud Fusion Applications. We center our work around our customers’ needs, striving to continuously enhance our cloud capabilities based on their challenges.
About the Team:
Join the Fusion Site Reliability Engineering Middleware (FSRE-MW) team, a critical group dedicated to maintaining the high availability of Oracle’s Cloud Fusion Applications. We minimize the frequency and duration of customer-impacting events through large-scale incident management and automation. As a team, we combine the agility of a start-up with the scale and customer focus of a leading enterprise software company.
As a Principal Site Reliability Engineer, you will be a key member of a high-impact team focused on the availability, performance, and operational excellence of Fusion SRE Middleware. You will take ownership of production environments—including systems and the Fusion Middleware stack—and support mission-critical business operations for Cloud Fusion Applications. Your role will emphasize automation and optimization of operations across multiple production environments, recommending AI-driven solutions to enhance availability, performance, and supportability. You will harness AI-based tools and predictive analytics to proactively identify issues, automate incident responses, and continuously improve system resilience. Additionally, you will provide escalation support for complex production problems, guide junior engineers, participate in major incident bridges, and help build and refine processes and procedures using AI-powered insights to drive smarter, data-driven decisions.
Our team is front-and-center in reducing event duration, leveraging operational experience, best practices, and tool development to automate incident management and drive continual improvement.
About the Role:
We seek a Principal SRE to join our globally distributed team, responsible for detecting, triaging, and mitigating service-impacting events rapidly and effectively through automation and AI-powered insights. You will be part of a regional team, minimizing Fusion services’ downtime through exceptional incident management and system operations, with a strong emphasis on scalability, performance, security, and AI-driven optimization. In this dynamic role, you will gain deep insight into the inner workings of Oracle Cloud Fusion Apps, using AI tools to predict, identify, and address potential issues before they impact services. You’ll influence cross-functional leaders and drive programs that boost service availability while leveraging AI to enhance real-time decision-making and improve operational efficiency.
Career Level: IC4
As a world leader in cloud solutions, Oracle uses tomorrow’s technology to tackle today’s challenges. We’ve partnered with industry leaders in almost every sector, and we continue to thrive after 40+ years of change by operating with integrity.
We know that true innovation starts when everyone is empowered to contribute. That’s why we’re committed to growing an inclusive workforce that promotes opportunities for all.
Oracle careers open the door to global opportunities where work-life balance flourishes. We offer competitive benefits based on parity and consistency and support our people with flexible medical, life insurance, and retirement options. We also encourage employees to give back to their communities through our volunteer programs.
We’re committed to including people with disabilities at all stages of the employment process. If you require accessibility assistance or accommodation for a disability at any point, let us know by emailing accommodation-request_mb@oracle.com or by calling +1 888 404 2494 in the United States.
Oracle is an Equal Employment Opportunity Employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, national origin, sexual orientation, gender identity, disability and protected veterans’ status, or any other characteristic protected by law. Oracle will consider for employment qualified applicants with arrest and conviction records pursuant to applicable law.
Oracle SQL/PLSQL, Database Administrator and WebLogic with expertise in performance tuning.
Disclaimer:
Certain US customer or client-facing roles may be required to comply with applicable requirements, such as immunization and occupational health mandates.
Range and benefit information provided in this posting are specific to the stated locations only.
US: Hiring Range in USD from: $86,400 to $199,500 per annum. May be eligible for bonus and equity.
Oracle maintains broad salary ranges for its roles in order to account for variations in knowledge, skills, experience, market conditions and locations, as well as reflect Oracle's differing products, industries and lines of business.
Candidates are typically placed into the range based on the preceding factors as well as internal peer equity.
Oracle US offers a comprehensive benefits package which includes the following:
1. Medical, dental, and vision insurance, including expert medical opinion
2. Short term disability and long term disability
3. Life insurance and AD&D
4. Supplemental life insurance (Employee/Spouse/Child)
5. Health care and dependent care Flexible Spending Accounts
6. Pre-tax commuter and parking benefits
7. 401(k) Savings and Investment Plan with company match
8. Paid time off: Flexible Vacation is provided to all eligible employees assigned to a salaried (non-overtime eligible) position. Accrued Vacation is provided to all other employees eligible for vacation benefits. For employees working at least 35 hours per week, the vacation accrual rate is 13 days annually for the first three years of employment and 18 days annually for subsequent years of employment. Vacation accrual is prorated for employees working between 20 and 34 hours per week. Employees working fewer than 20 hours per week are not eligible for vacation.
9. 11 paid holidays
10. Paid sick leave: 72 hours of paid sick leave upon date of hire. Refreshes each calendar year. Unused balance will carry over each year up to a maximum cap of 112 hours.
11. Paid parental leave
12. Adoption assistance
13. Employee Stock Purchase Plan
14. Financial planning and group legal
15. Voluntary benefits including auto, homeowner and pet insurance
The role will generally accept applications for at least three calendar days from the posting date or as long as the job remains posted.
Work with the Site Reliability Engineering (SRE) team on the shared full-stack ownership of a collection of services and/or technology areas. Understand the end-to-end configuration, technical dependencies, and overall behavioral characteristics of production services. Responsible for the design and delivery of the mission-critical stack, with a focus on security, resiliency, scale, and performance. Authority for end-to-end performance and operability. Partner with development teams in defining and implementing improvements in service architecture. Articulate technical characteristics of services and technology areas and guide development teams to engineer and add premier capabilities to the Oracle Cloud service portfolio. Understand and communicate the scale, capacity, security, performance attributes, and requirements of the service and technology stack. Demonstrate a clear understanding of automation and orchestration principles. Act as the ultimate escalation point for complex or critical issues that have not yet been documented as Standard Operating Procedures (SOPs). Use a deep understanding of service topologies and their dependencies to troubleshoot issues and define mitigations. Understand and explain the effect of product architecture decisions on distributed systems. Bring professional curiosity and a desire to develop a deep understanding of services and technologies.
Key Responsibilities:
- Automation: Develop and optimize operations through AI-powered automation. Apply machine learning and orchestration principles to every possible opportunity, reducing manual intervention and technical debt. Enhance operational outcomes with scalable, AI-driven automation solutions that anticipate issues and optimize system performance proactively.
- Middleware Technology Expert: Lead L3 WebLogic Administration, managing server lifecycle, configuring and deploying applications, and monitoring server and application resources. Leverage AI-driven monitoring tools to proactively detect and resolve issues across application and infrastructure layers, ensuring efficient and automated troubleshooting.
- Service Ownership: Act as a Service Owner for Fusion Apps customers, sharing full-stack ownership of critical services in partnership with Service Development and Operations. Utilize AI-based analytics to predict potential service disruptions and optimize service delivery to improve customer satisfaction and minimize downtime.
- Technical Expertise: Provide deep technical guidance and serve as the ultimate escalation point for complex issues not documented in SOPs. Participate in major incident management as a subject matter expert, leveraging your understanding of service topologies, AI-driven insights, and dependencies to troubleshoot and resolve issues quickly and effectively.
- Ownership Scope: Understand end-to-end configuration, dependencies, and behavioral characteristics of production services. Use AI-powered telemetry and monitoring systems to ensure mission-critical delivery with a focus on system health, security, resiliency, scale, and performance.
- Service Requirements: Provide strategic direction and prioritization to Product Management and Service Development teams, guiding the addition of AI-enhanced capabilities to Oracle SaaS/ERP services. Act as an escalation point for undocumented or critical issues, leveraging AI tools to aid in faster resolution and proactive service improvements.
Professional Skills Requirements:
- Excellent written and verbal communication, facilitation, and interpersonal skills.
- Strong collaboration, customer service, empathy, flexibility, and conflict resolution abilities.
- Ability to communicate clearly with technical and non-technical stakeholders.
- Effective at working independently and managing multiple projects or responsibilities.
- Highly motivated with the ability to thrive in fast-paced, team-oriented environments.
- Strong analytical and problem-solving skills.
- Adaptability to evolving priorities and deadlines.
- Strong global teamwork skills.
- Proven ability to handle multiple, competing priorities.
Required Qualifications:
- Bachelor’s degree in Computer Science or a related field, or equivalent experience.
- 8+ years of overall experience in the IT industry.
- 6+ years of experience in Site Reliability Engineering (SRE), DevOps, or Systems Engineering.
- 6+ years of hands-on automation experience using Python or Unix Shell Scripting.
Required Skills - Database & Middleware
- Excellent proficiency in Oracle Database, SQL, PL/SQL, and performance tuning.
- Hands-on expertise with Oracle WebLogic Server.
- Strong background in WebLogic performance tuning and monitoring.
- Proven expertise in designing and implementing solutions for telemetry, monitoring, scalability, performance, and reliability at both platform and application layers.
- Correlate WebLogic/JVM metrics (heap, GC, threads, connection pools) with Oracle Database performance indicators.
- Perform JVM Heap sizing, Garbage Collection tuning, and thread analysis.
- Analyze database, middleware, and application metrics to resolve performance bottlenecks.
- Administration experience with web servers such as OHS (Oracle HTTP Server) or Apache.
- Deep understanding of performance concepts (response time, throughput, resource utilization).
- Perform capacity planning and scalability analysis based on workload growth and usage patterns.
Good to Have:
- Experience with Fusion Apps functional flows.
- Java programming experience and an understanding of structured SQL statements.
- Knowledge of Oracle Business Intelligence Enterprise Edition (OBIEE) and Oracle Service-Oriented Architecture (SOA).
If you’re ready to shape the future of cloud services at Oracle, we want to connect!
Apply today to join our innovative team.