EOP - System Reliability Engineer - TS/SCI Required

Remote
Full Time
EOP - IT & Cyber
Experienced
cFocus Software seeks a System Reliability Engineer to join our program supporting the Executive Office of the President (EOP). This position is remote and requires a TS/SCI clearance.

Qualifications:
  • 5+ years of experience and a Bachelor's Degree in Computer Programming, Science, Engineering, or a related technical discipline, or the equivalent combination of education, technical training, or work/military experience, including:
  • 3+ years of related systems programming experience
  • Experience maintaining an operational environment and using monitoring tools and dashboard interfaces (e.g., Kibana, Grafana)
  • Experience working with container images and platforms (Kubernetes/Docker)
  • Strong understanding of DevOps and software/application development processes
  • Understanding of GitLab, Jenkins, ArgoCD, and other DevOps/Continuous Integration tools for Kubernetes
  • Understanding of microservice design and architectural pattern best practices
  • Understanding of Python, Bash, and Shell scripting
  • Knowledge of network technologies, common infrastructure components, load balancers, firewalls, and virtual and physical infrastructure design
  • Problem-solving and troubleshooting skills
  • Communication and interpersonal skills
  • Must possess excellent time management skills and the drive to work unsupervised
  • Experience deploying to on-prem/data center infrastructure
  • Experience using Jira and Confluence on a daily basis
  • Experience building processes for deploying to a Kubernetes-based environment using GitLab and Helm
  • Understanding of access management and security groups (e.g., IAM, S3 buckets, SSH, VPN)
  • Ability to write and run unit and functional tests
  • Technical Skills: Proficiency in programming languages (such as Python, Go, or Bash) is essential for scripting and automation tasks. Knowledge of Linux/Unix systems is also crucial, as SREs often work in these environments.
  • Problem-Solving: Analytical and problem-solving skills are necessary to diagnose and resolve complex system issues effectively.
  • Understanding of SRE Principles: Familiarity with key SRE concepts such as Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets is important for measuring and maintaining system reliability.
  • Reliability and Availability: SRE practices help ensure that services are consistently available and reliable, which is critical for user satisfaction and business success.
  • Scalability: SREs implement strategies that allow systems to scale efficiently as demand increases, ensuring that performance remains optimal even under heavy load.
  • Cost Management: By optimizing resource usage and reducing downtime, SREs contribute to cost savings for organizations.
  • Programming and Scripting: Proficiency in languages like Python, Go, or Ruby is crucial for automating tasks and managing infrastructure.
  • Operating Systems: A strong understanding of Linux/Unix systems is essential for troubleshooting and managing servers.
  • Cloud Computing: Familiarity with cloud platforms like AWS, Azure, or Google Cloud is vital for deploying and managing applications in distributed environments.
  • Containers & Orchestration: Understanding containerization tools like Docker and managing containerized workloads with Kubernetes is crucial for cloud-native applications.
  • Monitoring and Logging: Proficiency in tools like Prometheus, Grafana, or Elasticsearch, Logstash, and Kibana (ELK) Stack is necessary for tracking metrics, setting up alerts, and analyzing logs.
  • Networking: Knowledge of networking protocols and configurations is essential for maintaining system health and performance.
  • Configuration Management: Skills in managing and maintaining system configurations are critical for ensuring system reliability.
  • Incident Response: Ability to respond quickly and effectively to incidents, including documenting and learning from them.
  • Security Best Practices: Understanding security protocols and best practices to protect systems from vulnerabilities.
  • These skills are essential for SREs to maintain high availability and performance, balancing the demands of development and operations.
  • Support required during core business hours of 8am – 5pm, Monday through Friday.
  • On-call availability on evenings or weekends if needed for outages, application upgrades, security patches, or other unplanned activities.
Duties:
  • Monitor system health, availability, and performance using centralized monitoring and logging tools.
  • Administration of accounts (role-based access and rights).
  • Manage accessibility to the application through EOP’s authentication systems.
  • Manage the workflow templates to ensure consistent and predictable task flows.
  • Configure workflow management for new or adjusted workflows based on user requests, while adhering to EOP template standards.
  • Maintain configurations and configurable fields for users and workflows.
  • Maintain the test environment to mimic production and conduct test and evaluation in the environment prior to deployments.
  • Design and maintain secure and reliable backups, ensuring High Availability (HA) and resiliency.
  • Develop a Disaster Recovery (DR) or Incident Response (IR) plan for specific applications and services in the event of a disaster or unexpected downtime.
  • Maintain unique instances that support various offices.
  • Configure and support integrations with complementary systems.
  • Establish and improve system monitoring while maintaining established security protocols within development, test, and production systems.
  • Architect, build, and maintain on-premises and/or cloud infrastructure to support team and customer initiatives.
  • Maintain and improve existing infrastructure (build out autoscaling, support new services, optimize for cost efficiencies/authentication/search, etc.).
  • Administer production, staging and development environments.
  • Manage and aggregate server logs and monitor for security and system related incidents.
  • Monitor and analyze system performance, such as server load and resource usage.
  • Maintain and improve existing build and deployment processes using CI/CD tools.
  • Apply configuration management disciplines to maintain software revisions, security patches, hardening, and documentation.
  • Enforce best practices for security and reliability, and drive security initiatives, like access control and vulnerability testing.
  • Maintain up-to-date documentation of designs and configurations, ensuring team members have continuity of recurring tasks.
  • Maintain status of operations at all times: perform after-action reporting on all outages and work with engineering teams to determine the solution and root cause. Present findings to management for prioritization and tasking.
  • Create and determine required metrics for dashboards and service health.
  • Follow up on engineering tasks for operational solutions, and validate completion.
  • Manage operational readiness board – present at weekly meetings and determine if development services are ready for automation based on best practices and maintainability.
  • Track and ensure routine operations maintenance tasks are completed in a timely manner.
  • Align to the customer's strategies for configuration of workflows, without compromising the integrity of the workflow tool and templates.
  • Build, maintain, and utilize the customer's enterprise Development, Security, and Operations (DevSecOps) pipeline.
  • Work with other service providers to support areas of common interest.
  • On-call support may be required.