mirror of https://github.com/IRS-Public/direct-file.git synced 2025-06-28 04:25:52 +00:00

initial commit

This commit is contained in:
sps-irs 2025-05-29 13:12:11 -04:00 committed by Alexander Petros
parent 2f3ebd6693
commit e0d5c84451
3413 changed files with 794524 additions and 1 deletions

docs/rfc/2023_and_me.md
Engineering decisions to be made early in the Dec 2022 - Spring 2023 development push
===========================
Background:
As we prepare to build a tax filing service, engineering decisions need to be made on a vast array of topics ranging from the overall architecture to language choices to hosting platforms to deployment tools. Some of these choices can be easily changed as the team grows and requirements evolve, while others will have high switching costs and put us on a path that will be hard to deviate from. The goal of this document is to identify the areas that are in greatest need of thoughtful treatment early in the process, so that we can prioritize them accordingly. Also provided are some initial thoughts on likely directions, associated tradeoffs, and a brief list of the things we need to know in order to proceed with a given choice.
Insight on each topic area should take into consideration our expectations about what skill sets we'll likely have available, what tools we can get authorization to use, and any lessons that can be derived from the prototype that was built in spring/summer 2022.
Topics:
* Division of responsibilities between front end and back end
* Modularization of back end (API for front end interaction? separation of MeF integration?)
* Language choices
* Front end web frameworks
* Data storage product
* Definition of a user account
* Authentication levels and tools
* Rough expectations for document ingest features and resulting dependencies
* Rough expected scope of Taxpert interface support and resulting dependencies
* Rough expectations for integration with third party customer support tool / CRM
Division of responsibilities between front end and back end
===========================
Modularization of back end (API for front end interaction? separation of MeF integration?)
===========================
Language choices
===========================
Front end web frameworks
===========================
Data storage product
===========================
* what are we storing?
* how long do we need to store it?
* what are the security requirements?
Definition of a user account
===========================
For joint filers, we need to make a choice as to whether a user account represents an individual human or the joint filing couple. There are examples of services in which this determination can be made independently by each user (for example, I may choose to share an email account with my spouse), but in our case, identity is a crucial concept that needs to be well defined with respect to user accounts.
Considerations:
* Handling access and information history from year to year as people change filing statuses and become married/unmarried.
* Future data prepopulation features that may require identity proofing.
Authentication levels and tools
===========================
We believe that, at least up through and including filing season 2024 (TY2023), we will:
* Allow taxpayers to provide data
* Pass back information from the IRS indicating whether returns are accepted or rejected
We will NOT grant users access to any sensitive tax data that they have not explicitly provided to the application.
For this set of features, we contend that the application must support similar levels of authentication as existing non-government applications which access the MeF API. It is important for users to demonstrate that they have the access keys (e.g. username, password, MFA methods) that were defined in any previous sessions during which they provided data, but it is not necessary to prove that they are a particular human.
Therefore, IAL1 is appropriate, while IAL2 is not.
In future years when we provide mechanisms to ingest data from sources other than user input, authentication should be enhanced to accommodate the data access restrictions appropriate for those data sources. Examples may include:
* SSO or password authentication for financial institutions
* IAL2 identity proofing for access to personal tax data held by the IRS
Rough expectations for document ingest features and resulting dependencies
===========================
Assumption: Initial scope does not include any automated ingestion of documents like W-2s, so we do not expect to rely on OCR libraries or similar.
Rough expected scope of Taxpert interface support and resulting dependencies
===========================
Assumption: Tax rules and logic for combining fields need to be editable by people who are not software engineers.
Rough expectations for integration with third party customer support tool / CRM
===========================

# Design and a11y review process
The below steps are specifically the design and accessibility testing part of the overall acceptance flow process. The code submitter should have done their due diligence for any user interface changes so that a design/a11y reviewer can focus on finer details. Although there are several steps, if we do this regularly it will be a light lift and will avoid any design/a11y debt.
## Verify that any changes match the design intention
- [ ] Check that the design translated visually
- [ ] Check interaction behavior
- [ ] Check different states (empty, one, some, error)
- [ ] Check for landmarks, page heading structure, and links
- [ ] Try to break the intended flow
- [ ] Confirm this works at 5Mbps down and 1Mbps up. You might use [Firefox dev tools](https://firefox-source-docs.mozilla.org/devtools-user/network_monitor/throttling/index.html) or Chrome's [WebRTC Network Limiter](https://chrome.google.com/webstore/detail/webrtc-network-limiter/npeicpdbkakmehahjeeohfdhnlpdklia)
## Verify that any changes meet accessibility targets
- [ ] Check responsiveness in mobile, tablet, and desktop at 200% zoom
- [ ] Check keyboard navigability
* Test general usability, landmarks, page header structure, and links with a screen reader (different from what the original dev used in their checklist):
- [ ] [VoiceOver](https://dequeuniversity.com/screenreaders/voiceover-keyboard-shortcuts#vo-mac-basics) in Safari
- [ ] [JAWS](https://dequeuniversity.com/screenreaders/jaws-keyboard-shortcuts#jaws-the_basics) in Chrome
- [ ] [NVDA](https://dequeuniversity.com/screenreaders/nvda-keyboard-shortcuts#nvda-the_basics) in Chrome
* Use an a11y tool to check these changes conform to at least WCAG 2.1 AA
- [ ] [WAVE](https://wave.webaim.org/)
- [ ] [axe](https://www.deque.com/axe/devtools/)
- [ ] [ANDI](https://www.ssa.gov/accessibility/andi/help/install.html#install)
- [ ] Browser inspector ([firefox](https://firefox-source-docs.mozilla.org/devtools-user/accessibility_inspector/#accessing-the-accessibility-inspector) / [chrome](https://developer.chrome.com/docs/lighthouse/accessibility/))
- [ ] For storybook-only components, use its [accessibility addon](https://medium.com/storybookjs/instant-accessibility-qa-linting-in-storybook-4a474b0f5347#c703).

docs/rfc/eng-rfc-process.md
Created At: May 5, 2024
Updated At: May 14, 2024
RFC: Evolving How We Align on Technical Decisions
# Problem Statement
Direct File has matured as an application over the past year (read: we went to production) and has grown its product team to 70+ members. However, we have not invested a lot of resources into adapting our processes around technical decision-making to our burgeoning scale. This has manifested itself in last-mile delivery delays for various initiatives, primarily because the right stakeholders were not in the room at the appropriate moments or did not know that a decision was being made until after the fact.
Similarly, as Direct File grows in product scope, our relationship with the larger IRS IT organization will both change and become increasingly important. Integrating our pilot-era ways of working with the processes of the IRS enterprise more broadly will require a different approach than what served us during the pilot.
# BLUF: Proposed Process Changes
1. Distinguish RFCs from ADRs and clarify when to leverage each one
2. Establish a dedicated meeting time, with defined decision-makers, for reviewing RFCs that did not achieve alignment during async review
3. Include IRS IT SMEs within the RFC process and identify the criteria under which they should be engaged for certain changes
# Definitions
Request For Comments (RFC): In the engineering context, a formal document that recommends a technical specification for a set of product requirements. This can include anything from a change solely at the application level to a system-wide architecture that engages with multiple external services. It typically contains a set of working definitions, an articulation of key product requirements, a proposed implementation, and an analysis of alternatives. If the RFC is approved, it becomes the touchpoint for technical implementation and should be revised as new requirements appear. Some RFCs are designated indefinitely with Experimental or Draft status.
Architecture Decision Record (ADR): A “lightweight” document that captures the rationale of an Architectural Decision (AD), i.e. a justified design choice that addresses a functional or non-functional requirement that is architecturally significant. Each ADR captures a single AD and its rationale; the collection of ADRs created and maintained in a project constitutes its decision log. An ADR typically contains a title, status, context, decision, and consequences.
# Proposal
## Goal
The goal of this proposal is to find the right balance between:
1) Functioning in a product-centric, agile manner;
2) Not constraining ourselves with unnecessary overhead; and
3) Clearly understanding the when, who and how of engaging with our IRS IT counterparts at the appropriate times.
The remainder of this proposal deals with some ways we can balance these needs and achieve these goals.
## Deepen our understanding of and engagement with the IRS IT organization and get our IRS IT SMEs involved as early as possible
We should engage with, at a minimum, our embedded Cyber SME and Technical Advisor as initial reviewers for any major system change, rather than as stakeholders that must be looped in only when the "paperwork" stage is reached, i.e. post-decision. Our IRS SMEs are our colleagues and champions within the enterprise - the earlier they are aware of changes that might require their support, the better the process will be for getting the features to production.
IRS IT SMEs should be involved when a system-wide change is being proposed, in particular one that might involve an update to any part of our ATO and especially control implementation statements in our System Security Plan (SSP). These changes typically require updating the `application.boundary` or `application.context` compliance documentation at some stage.
### When should I loop in IRS IT SMEs?
Examples of "system-wide changes" include, but are not limited to:
- Provisioning new cloud infrastructure, both compute and storage
- Requesting changes to how our network boundaries are configured
- Adding new API endpoints or modifying where a URI can be located (i.e. changing the structure of the endpoint)
- New vulnerability scan findings that cannot be remediated in a timely manner
- Changing any part of our authorization or authentication flow
- Storing or accessing SBU/PII/FTI in an environment that isn't the application database or cache (in particular SaaS tools)
- Deviation from IRS IRMs
- Data or container security
- Integrating with other parts of the IRS (enterprise services, on-prem, cloud, etc.)
- Establishing connections external to the IRS
- Requesting new tools (whether or not they are ESP or COE approved)
- Major version upgrades of software framework components
- Standing up new deployed services
## Leverage RFCs as the primary mechanism to propose technical specifications
We currently conflate the concept of an ADR with an RFC. ADRs are static artifacts that are post-decisional; an engineer should be able to read the entire decision log of ADRs and roughly understand why the system is configured as it is. RFCs, on the other hand, are living documents meant for discussion and iteration. They are better suited for soliciting input and collecting feedback on a proposed approach, which is the right first step for proposing a technical specification. The outcome of an RFC might be an ADR, if the scale of the proposed change merits it.
In practice, this means that once engineers are given a sufficiently large or complex set of requirements, they articulate their proposed approach in an RFC, rather than a combination of GitLab tickets, Slack threads, and markdown files committed to the codebase with the word "ADR" in the title. This forces the engineer to spend more time substantiating their reasoning for a certain implementation and weighing various alternatives, as well as investigating the various upstream and downstream dependencies of the proposal. It also requires the engineer to consolidate their thoughts into a single source-of-truth (SOT) document, reducing the cognitive overhead, for all participants, of tracking the outcome across multiple surfaces (GitHub, GitLab, Slack, Teams, etc.).
Writing an RFC **does not** negate the need to prototype various solutions; rather, prototyping should be considered part of the RFC process as a way to demonstrate the feasibility of a given approach and that alternatives were considered.
Importantly, designing larger features and broader architecture has historically been limited to a small number of developers relative to the size of the engineering organization, limiting the ability of other engineers to contribute and grow as system designers. This is mostly due to the velocity and speed necessary to get the pilot out the door. RFCs provide a mechanism through which other engineers can own their features end-to-end, instead of relying on another engineer to propose the implementation which they then action.
## Add a synchronous, cross-org forum for discussing RFCs that are not resolved asynchronously
In addition to moving to a world where RFCs are the formal documents that facilitate discussion, we should also move away from a model where major engineering decisions are both reviewed and approved by a single functional team in a predominantly asynchronous manner, i.e. in GitHub PRs. Instead, we should move towards a model where a broader audience can weigh in on proposed changes and discuss outstanding questions synchronously.
Making RFC review into a blocking mechanism is not the goal. Synchronous time should be leveraged **only** when there are outstanding questions on a proposal that require live, cross-team discussion. While async review and approval is always the first and best option, practically speaking, at a certain level of system change, live discussion is inevitable if not necessary. We should embrace that reality, not fight it and rely on back-channels and 100-comment Slack threads to facilitate alignment on major changes.
Tactically, this would involve adding a standing RFC-review meeting that is 1) team-agnostic and open to the entire Product organization; and 2) always includes our Cyber SME and Technical Advisor as participants to make sure that all dependencies are considered. **An agenda should be circulated to participants 36 hours in advance and the meeting can be canceled if there is no agenda.**
One key benefit here is that a cross-organization, discussion-based approach to RFCs reduces knowledge silos across the product organization and allows engineers to better 1) understand what is happening across different teams; and thus 2) flag cross-cutting concerns that might not have been addressed during the primary review phase (e.g. changes to authn/authz affect many different teams, but not every team might be involved as the primary reviewers).
### Why a standing meeting instead of as needed/ad-hoc?
While the flexibility of ad-hoc better mirrors our historical and current practices around engineering meetings, there are a few reasons why a standing meeting with the sole purpose of reviewing RFCs is beneficial, at least in the first instance:
1. The right people are always in the room: the blended team model creates a world where no single individual has access to everyone's calendar. By maintaining a standing meeting, everyone must put recurring blocks on their respective calendars, greatly increasing the chance that if they are a stakeholder, they will be able to attend.
1. In this vein, we want to ensure that our key IRS IT counterparts - those with a known stake in facilitating the delivery of the technical output - have their concerns addressed before proceeding to implementation. This reduces our overall delivery lead time by removing "unknown unknowns" and proactively identifying (and accounting for) process-based roadblocks much earlier in the delivery process.
2. Resolving opposing views: major engineering changes often have several viable paths, and it is rare to have all outstanding questions answered asynchronously. A standing meeting releases both the author and reviewer from "finding a time to hash it out live" in favor of using a dedicated mechanism like RFC review (with an agenda and time limit on topics) to facilitate the discussion. This reduces unnecessary friction within and across teams, and enables other members of the organization to manage the discussion.
3. Context sharing and maintaining visibility for other teams and leadership: As Direct File grows, it is unrealistic that the people who might have reviewed PRs during the pilot will have the time to do so in Year 2, 3, etc. This doesn't mean, however, that they want to be divorced from the technical discussions that are happening. A standing meeting provides a dedicated space for those members/leadership to keep a finger on the pulse of what is happening without reviewing a dozen RFCs a week.
4. It is easier to start with a standing meeting and move to ad-hoc later than vice versa. Especially as we build the organizational muscles around a process like RFC review, it is helpful to have the meeting in place instead of requiring individuals to advocate for ad-hoc meetings out of the gate. During filing season, for instance, I expect us to leverage ad-hoc meetings significantly more. Conversely, during May-September, when a lot of planning and technical designs are chosen, we would benefit from a standing meeting to make sure we aren't crossing wires and are moving in lockstep.
# Appendix I: Step-by-Step examples of how this all works in practice
If implemented, the expected development lifecycle would look roughly as follows:
**note: Each team/group/pod maintains autonomy in terms of how they want to define and implement the various steps, as long as 1) async and sync RFC review is incorporated into their development; and 2) IRS IT SMEs are engaged at the appropriate moments. The below will not map perfectly onto any given team's cadence, and instead aims to approximate the most process-heavy approach from which teams can choose what they would like to incorporate.**
1. Product requirements for a feature set are specified in a ticket (by someone)
2. The Directly Responsible Engineer (DRE) provides an initial, rough estimate of the scope and sizing of the work, as well as the documentation required to drive alignment on an organizationally acceptable approach:
1. If a system-wide change (see below for criteria) is involved, an RFC and ADR will be required before moving to any implementation. **IRS IT SMEs should be looped in early as key stakeholders and reviewers.**
2. If the feature set is not a system-wide change, the DRE has discretion about whether an RFC would be a helpful tool to facilitate design and/or gain consensus within a team or across teams. Some feature sets are complex enough to benefit from an RFC; others are not. Once the RFC is drafted, reviewed, and approved, the DRE can begin implementation.
3. If an RFC is not needed, the DRE can immediately begin implementation and put up a PR with a description of the work and link back to the ticket.
3. If an RFC is needed, the DRE drafts a written proposal as a means to solicit feedback on the proposed technical approach. The document should live in the `docs/rfc` directory and be committed to the codebase in a PR in a text format like Markdown, with associated artifacts (diagrams, etc.) included as needed.
1. All initial discussion can happen asynchronously and ad-hoc.
2. If a system-wide change is being proposed, DevOps and our IRS IT colleagues (in particular the Cyber SME and Technical Advisor) should be looped in at this stage as reviewers.
3. If a system-wide change is not being proposed, the DRE and reviewers should use their discretion as to whether IRS IT should be engaged during the RFC stage. **If they are not engaged, the assumption is that they will not need to be engaged during or after implementation.**
4. If all questions (including those from IRS IT colleagues) are sufficiently addressed in the written RFC, the RFC can be approved and the DRE can move to implementation.
5. If there are outstanding questions in the RFC that cannot be resolved asynchronously, the RFC is slotted for discussion during the standing "RFC Review" meeting and circulated for discussion to all RFC Review participants.
1. During the meeting, the DRE presents a summary of the proposed changes and the group discusses the main outstanding questions and aligns on a path forward.
2. The DRE updates the RFC as needed coming out of this meeting.
3. n.b. **this is the only synchronous portion of this process, everything else is asynchronous**
6. In the event that an ADR is needed, after the RFC stage is complete an ADR is drafted and committed to the codebase in the `docs/adr` directory. This should all occur asynchronously and should be merged in short order with minimal review cycles.
1. No alignment is needed on the ADR as it simply codifies the outcome of the RFC and RFC review.
7. Once the RFC and/or ADR stages are complete, the DRE can begin implementation. At the same time, they also coordinate with IRS IT and DevOps to understand whether any additional documentation aside from the RFC and ADR is necessary to initiate or facilitate IRS IT or IEP processes.
# Appendix II: Deciding between RFC, ADR and normal PRs
This section provides a basic decision tree for deciding between the following processes (in order of number of parties that need to coordinate to make a change, from least to most):
- Ticket with a PR
- ADR
- RFC
- A combination of the above
In general, default to the process requiring the least coordination available if you can't decide.
1. Are any of the criteria of the 'When should I loop in IRS IT SMEs?' section above met? -> RFC + ADR + loop in IRS IT SMEs as early as possible. See the long list of examples in the aforementioned section.
2. Is the feature set cross-cutting, requiring multiple teams/pods to weigh in? -> RFC (ADR optional) + confirm with IRS IT SMEs whether there are upstream dependencies/compliance considerations that require documentation updates. Examples include:
1. User-permissions
2. SADI
3. Major changes to MeF Integration
4. Major changes to the submission flow
5. Microservice messaging (queues, pub/sub)
3. Is the feature set within the domain of a single pod but cross-cutting between teams? -> RFC helpful but not required (ADR optional as well)
1. Addition of or modification to core functionality of a given microservice
4. Is the feature set within the domain of a single pod and within the domain of a single team? -> RFC optional; depends on whether the DRE feels it would be helpful to have a document separate from the PR description or adding detail to the ticket
1. Changes to MeF/PDF/Fact Graph conversion
2. Major additions or modifications to the UX flow
3. Implementing new tax logic within the flow/fact graph
4. Implementing or modifying retry logic for certain backend processes (e.g. sending email)
5. Requesting infrastructure configuration changes for previously-provisioned resources, such as changing the redrive policy on SQS for certain queues
5. Does the feature set have pre-existing, well defined product and technical requirements? -> PR/ticket is sufficient, no need for RFC or ADR
1. Modifying pre-existing tax logic within the flow/fact graph
2. Adding new repositories, interfaces, classes, services, etc. that clean up parts of the codebase
3. General refactoring
4. Spring-ifying the backend services
5. Updating dependencies
6. Remediating security findings

# Code generation for a React frontend
Our system is built on a Fact Graph and a Flow. The Facts have no awareness of presentation concerns, while the Flow encapsulates ordering and grouping of questions, along with pointers to human-language translations of user-facing questions.
### Tom's suggestions:
Facts will have some properties:
- A way of getting graph specified validation into the system
- A way of referring to other constraining facts (e.g., a charitable deduction constrained by marital status)
- A name or identifier for being found and focused later
This tends to indicate that there should be some version of the fact graph in memory on the client side.
Each page will also need to have some properties:
- A series of templates for different page styles and layouts -- how will people know which to use?
- Well defined results for front and back actions
- The ability to jump to an arbitrary page (maybe by fact alone)
- A post action that sends the current state to the server
- Analytics
Because of the well-ordered nature of the flow, and the need to be able to identify where facts are, we should use something like an FSM flow for pages. The actions above represent transitions to other states.
Page Information Singleton:
There can only ever be one of these in the system. The page information singleton contains the generated configuration that the front end understands. This configuration contains things like the IDs and friendly names, and a dictionary of `<fact name, page>` so that individual facts can be found.
Page Manager FSM:
Each page is some data about what to display and how. It has a view template that lays out the page, zero or more fact templates that are applied to that template, a defined forward action, a defined back action, an ID for the page, and a friendly name for the URL. As the user works their way through the flow, the page manager will pull the information about which page to display from the Page Information Singleton.
The implication here is that an external system generates the React frontend system. We start with a template react frontend that has the css and some of the basic tools in it. This external system then generates all of the pages, the frontend configuration for the page information singleton, and there are probably a few other pieces. Most of the fuzz is around the review step and how that will look in an autogenerated system. What I don't think will look good is if we just link directly back to the page in the flow without some indication that you are in the review process. I would also want the forward and back to go back to the review section and not start you into the flow again.
Testing:
We should be able to, in an automated way, run through the entire frontend application and verify that all of the paths work, all of the validators work, all of the constraints work, and that the product produces correct results.
Questions:
Will a page ever need alternate layouts? I am imagining a review section in which we may want forward and backward to work completely differently, and for the site to maybe look a bit different.
Should the site be generated on some other backend or should it kind of generate itself? I can imagine, if the need for multiple templates happens, that it would be nice to have a system that can dynamically generate itself. I can see massive problems with it too (like testing completeness).

docs/rfc/rfc-template.md
# RFC: [Title Here]
- Created: [mm/dd/yyyy]
- Approver(s):
- PR: if applicable
- Gitlab Issue: if applicable
# Primary author(s)
[primary authors]: #primary-authors
who owns this document and should be contacted about it?
# Collaborators
[collaborators]: #collaborators
anyone who contributed but isn't a primary author.
# Problem Statement
[problem statement]: #problem-statement
Why are we doing this? What use cases does it support? What is the expected outcome?
# Prior Art
What existing solutions are close but not quite right? How will this project replace or integrate with the alternatives?
# Goals and Non-Goals
[goals and non-goals]: #goals-and-non-goals
What problems are you trying to solve? What problems are you not trying to solve?
# Key terms or internal names
Define terms and language which might be referenced in the background and Suggested Solution below
# Background & Motivation
[background and motivation]: #background--motivation
What is the current state of the world? Why is this change being proposed?
# Suggested Solution
[design]: #suggested-solution
What exactly are you doing? Include architecture and process diagrams.
This is typically the longest part of the RFC.
# Timeline
[timeline]: #timeline
What is the proposed timeline for the implementation?
# Dependencies
[dependencies]: #dependencies
What existing internal and external systems does this one depend on? How will it use them?
# Alternatives Considered
[alternatives]: #alternatives-considered
What other approaches did you consider?
# Operations and Devops
[operations]: #operations-and-devops
What operational work is needed to implement this idea? What kind of IRS SMEs should be involved? What is the "lift" from an operational perspective?
What additional telemetry and monitoring would be required to implement this?
# Security/Privacy/Compliance
[security privacy compliance]: #security-privacy-compliance
What security/privacy/compliance aspects should be considered?
# Risks
[risks]: #risks
What known risks exist? What factors may complicate your project?
Include: security, complexity, compatibility, latency, service immaturity, lack of team expertise, etc.
# Revisions
[revisions]: #revisions
RFC Updates for major changes, including status changes.
- created
- updated
- revisions based on feedback
- final status

docs/rfc/sqs-listeners.md
@ -0,0 +1,91 @@
# RFC: SQS listener configuration
- Created: 9/10/2024
- Status: created
# Problem Statement
[problem statement]: #problem-statement
Which approach will we use to improve the graceful shutdown of SQS listeners in
Direct File's microservices?
# Goals and Non-Goals
[goals and non-goals]: #goals-and-non-goals
- Primary goal: Eliminate SQS exceptions seen on routine application shutdown
during deployments
- Secondary goal: Reduce complexity of SQS listener configuration
# Background & Motivation
[background and motivation]: #background--motivation
Currently, the class that sets up the SQS Connection (named
`SqsConnectionSetupService` in most, if not all, cases) in each microservice
includes a `@PreDestroy` method to stop and close the JMS SQS connection factory
during application shutdown. Even with this method in place, as old containers
shut down during rolling application deployments, they generate consumer
prefetch and receive message exceptions.
The `SqsConnectionSetupService` classes also contain a significant amount of
boilerplate code and must be updated anytime a microservice needs to subscribe
to another SQS queue.
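For context, the per-service setup described above looks roughly like the following. This is a hypothetical reconstruction using the JMS API from `amazon-sqs-java-messaging-lib`; the class, queue, and handler names are illustrative only, and the exact `javax`/`jakarta` namespaces depend on the library versions in use.

```java
import com.amazon.sqs.javamessaging.ProviderConfiguration;
import com.amazon.sqs.javamessaging.SQSConnection;
import com.amazon.sqs.javamessaging.SQSConnectionFactory;
import jakarta.annotation.PreDestroy;
import jakarta.jms.JMSException;
import jakarta.jms.MessageConsumer;
import jakarta.jms.Session;
import org.springframework.stereotype.Service;
import software.amazon.awssdk.services.sqs.SqsClient;

@Service
public class SqsConnectionSetupService {

    private final SQSConnection connection;

    public SqsConnectionSetupService() throws JMSException {
        SQSConnectionFactory factory =
                new SQSConnectionFactory(new ProviderConfiguration(), SqsClient.create());
        connection = factory.createConnection();
        Session session = connection.createSession(false, Session.CLIENT_ACKNOWLEDGE);
        // One consumer per subscribed queue; subscribing to another queue
        // means more boilerplate here.
        MessageConsumer consumer =
                session.createConsumer(session.createQueue("example-queue"));
        consumer.setMessageListener(message -> { /* dispatch to a handler */ });
        connection.start();
    }

    // Stops and closes the connection on shutdown. Even with this hook,
    // in-flight prefetch/receive calls can still throw as containers are
    // terminated during rolling deployments.
    @PreDestroy
    public void shutdown() throws JMSException {
        connection.stop();
        connection.close();
    }
}
```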
# Suggested Solution
[design]: #suggested-solution
Create a Spring Boot starter (e.g., `irs-spring-boot-starter-sqs-jms`) that uses
`org.springframework:spring-jms` to manage the SQS connection factory. Having
Spring JMS manage the lifecycle of the listener containers should improve the
graceful shutdown. At the very least, implementing this solution will allow us
to test this and confirm whether we see an improvement.
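Having Spring JMS manage the listener lifecycle matters because a graceful shutdown has a specific shape: stop accepting new messages, let in-flight work finish, and only then close the connection. A toy Python sketch of that sequence (illustrative only — Direct File's services are Java/Spring, and every name here is hypothetical):

```python
import queue
import threading
import time

class Listener:
    """Toy message listener mimicking a JMS listener container's
    stop-then-drain-then-close shutdown sequence."""

    def __init__(self, source):
        self.source = source
        self.processed = []
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run)

    def start(self):
        self._thread.start()

    def _run(self):
        while not self._stop.is_set():
            try:
                # Short poll so a shutdown request is observed promptly.
                msg = self.source.get(timeout=0.05)
            except queue.Empty:
                continue
            self.processed.append(msg)  # "in-flight" work completes

    def shutdown(self):
        # 1) stop receiving new messages; 2) wait for in-flight work to
        # finish; 3) only then would a real implementation close the
        # underlying connection. Killing the container before steps 1-2
        # is what produces the prefetch/receive exceptions.
        self._stop.set()
        self._thread.join()

q = queue.Queue()
for i in range(3):
    q.put(f"msg-{i}")

listener = Listener(q)
listener.start()
time.sleep(0.3)        # let the listener drain the queue
listener.shutdown()
print(listener.processed)  # all queued messages drained before close
```

The point of the starter is that Spring JMS's listener containers already implement this ordering, so each microservice no longer has to hand-roll it in a `@PreDestroy` method.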
Encapsulating the configuration in a Spring Boot starter will simplify the setup
for Direct File Spring Boot microservices that subscribe to SQS queues.
Implementation steps:
- Add the starter and integrate with the backend in one PR to start
- Integrate other apps with the starter in separate PRs (email-service, status,
submit
# Timeline
[timeline]: #timeline
About 1 sprint to implement the Spring Boot starter and update the various
microservices to use it. Initial prototyping and testing have already been done.
# Dependencies
[dependencies]: #dependencies
This solution will add a new Java dependency: `org.springframework:spring-jms`.
# Alternatives Considered
[alternatives]: #alternatives-considered
- Spring Cloud AWS. Spring Cloud AWS could also simplify our SQS configuration
and likely have a similar benefit in terms of improving graceful shutdown.
If we were already using Spring Cloud AWS, I would lean towards using it for
SQS. However, Spring Cloud AWS does not yet support Spring Boot `3.3.x`. Given
that we already use JMS for the SQS listeners, introducing the Spring JMS
project should be a lower lift.
- Explore other modifications to `SqsConnectionSetupService`. Perhaps there is a
way to modify our existing SQS connection setup to improve the graceful
shutdown, but it is unclear what that implementation would look like, and at
this point, there is likely value in exploring whether there is an existing
framework/library that can solve this problem for us.
# Operations and Devops
[operations]: #operations-and-devops
No operational/devops work expected.
# Security/Privacy/Compliance
[security privacy compliance]: #security-privacy-compliance
No security/privacy/compliance impacts expected.
# Risks
[risks]: #risks
Adds a new dependency, which may require some time for the team to get familiar
with, but this risk should be minimized by the fact that the applications would
still be using JMS, rather than making more significant changes to how the
application manages SQS subscriptions/listening.
# Revisions
[revisions]: #revisions
RFC Updates for major changes, including status changes.
- created
- updated
- revisions based on feedback
- final status

Created At: July 19, 2024
Updated At: May 26, 2025
# RFC: Status Service 2.0 (f/k/a One Status to Rule them All)
# Problem Statement
Today, the editability of a tax return within the Client is computed differently from both the status banner shown to the taxpayer in the Client and the exportability of a tax return into a State filing tool. This discrepancy between editability and exportability/status representation has been a constant topic of conversation since last filing season and a major source of confusion from an engineering and operational perspective. Moreover, the current design adds a large amount of unnecessary overhead in REST calls and database queries to and within the Status service, which cannot be scaled to accommodate increased load. At 50x more scale in FS25, this risks performance-related problems under load in production.
This RFC proposes a unification of these statuses into a single, pre-existing status that controls all operations related to post-submission processing from the standpoint of the Client, Backend and State-API.
# Background
## Application Considerations
The statefulness of the Status service derives from the 2023 need to demo a full POC of Direct File. The current design is built directly on the POC in a way that causes multiple datastores across microservices to store redundant data.
The exportability of a tax return for State-API purposes and the status representation logic in the Client is based on logic within the Status service (specifically the TaxReturnXMLController that interfaces with the AcknowledgementService). Both operations are sent through internal REST calls from State-API and Backend, respectively, to the Status service, which handles all the business and database logic.
However, the editability of a tax return is wholly controlled by the Backend service and Backend database, specifically in the TaxReturnService.isTaxReturnEditable() logic, which under the hood queries the TaxReturnSubmissionRepository against the TaxReturnSubmission and SubmissionEvent tables.
## Architectural Considerations
The Status service has a single pod running in each region. It doesn't have autoscaling enabled in any way, which means that the only way we can scale it is vertically. For reference, the Backend service can scale to 96 pods max (48 per region). The client makes this call every 60 seconds whenever a user has an active session and is on the dashboard (post-submission), until a non-pending status is returned. For a given 15 minute session, this means that the client could make 15 calls to poll the status, if the return is determined to be pending for that time period. This creates a situation where nearly all requests are unnecessary and just add overhead to the Backend and Status services.
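The polling arithmetic above can be made concrete. A quick Python sketch (the 60-second interval and 15-minute session come from the text; the 10,000-concurrent-session figure is an illustrative assumption, not a Direct File projection):

```python
POLL_INTERVAL_S = 60   # client polls /status every 60 seconds
SESSION_MINUTES = 15   # representative post-submission session length

# Calls made by one client whose return stays pending for the session
polls_per_session = (SESSION_MINUTES * 60) // POLL_INTERVAL_S
print(polls_per_session)  # 15 status calls per pending session

# Illustrative only: 10,000 concurrently pending sessions spread across
# a 60-second poll interval would mean ~166.7 status requests per second
# landing on a single, vertically-scaled Status pod per region.
concurrent_sessions = 10_000
requests_per_second = concurrent_sessions / POLL_INTERVAL_S
print(round(requests_per_second, 1))
```

Because the return's status almost never changes between polls, nearly all of that traffic is redundant load on the Backend and Status services.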
# Why we should expect the Status service to become a bottleneck in FS25 and beyond
The current design is not positioned to scale well in FS25 and beyond. Today, exporting a return and fetching the status are the two most expensive API calls we have throughout our systems, typically lasting between 3-10 seconds (depending on system load). Under an order of magnitude more load, we would expect to see further elevated latency for these operations, likely on the order of 10-20 seconds. This is because of increased resource consumption/contention in the Status service, particularly at the database level. In other words, the Status service will become a bottleneck that will have cascading effects onto the State-API, Backend and Client, and directly onto the taxpayer.
If we have a huge rush of traffic from both the client (via the Backend) and State-API (first day, last day of filing season), we will greatly increase the number of cross-service REST calls and database queries in a way that would hamper performance for other parts of the Status service, for instance polling MeF for acknowledgements (its primary purpose). It will also provide poor user experiences for filers trying to export their returns, as well as cause confusion if/when the editability of a return within the Client is different than the banner indicating the return's status.
# Proposal
There are a few ways to solve this bottleneck problem:
1. Deprecate the Status service's endpoint completely and instead rely on the Backend database as the source of truth for statuses as well as horizontal scaling capacity of the backend for throughput purposes.
2. Convert the Status service's endpoint to consume submissionIds instead of taxReturnIds and cache the results in Redis (with a short-ish TTL)
3. Remove the Backend as the proxy for requests to the Status service and make calls directly to the Status service
4. Decrease the rate at which the client polls the /status endpoint
Each solution has pros and cons; however, I would advocate that we implement Solution 1. Solution 2 is a fallback, and both are discussed below.
## What does the system look like if we implement Solution 1?
* Fetching return status into the Client and exporting a return to the State filing tool are significantly more performant through a combination of caching, no internal REST calls, reduced number of database calls, and reliance on the Backend horizontal pod autoscaling.
* Tax return editability in the Client is managed by the Backend database
* Tax return exportability to State filing tools is managed by the Backend database and accessed via a Backend internal API
* Tax return editability and exportability are managed by the same domain object (SubmissionEvent) and are thus much easier to reason about.
* There are no internal API calls to the Status service
* The same guardrails for always preferring accepted submissions is in place, guarding us against duplicate submissions.
# Solution 1 Implementation Proposal
* [ ] \[STEP 1, START HERE\] Migrate the AcknowledgementService.getLatestSubmissionIdByTaxReturnIdPreferringAcceptedSubmission logic to the TaxReturnService and refactor the underlying query to rely on SubmissionEvents instead of Completed/Pending objects.
* [ ] This interface should be the single SOT for determining if a return is accepted or rejected, for frontend editability and state-api export purposes.
* [ ] It should probably be named something like `getLatestSubmissionEventByTaxReturnIdPreferringAcceptedSubmissions`
* [ ] Migrate the Status service's AcknowledgementController.get() interface into the Backend app and expose the internal endpoint to the State-API.
* [ ] The query logic at AcknowledgementController.java#L56 can be deprecated and instead leverage the new interface in the TaxReturnService described in step 1.
* [ ] Alternatively, implement this logic within the State API service itself for fetching XML from S3 depending on the outcome of the `/status` call?
* [ ] Migrate the Status service's TaxReturnXMLController into the Backend and expose the internal endpoint to the State-API.
* [ ] The getTaxReturnStatus logic (TaxReturnXmlController.java#L32) can be deprecated and instead leverage the new interface in the TaxReturnService described in step 1
* [ ] Migrate the SubmissionEvent table to either:
* [ ] Add a single rejection_codes column with a JSON type that can persist JSON that mirrors StatusResponseBody.rejectionCodes; or
* [ ] Add a 1:M join between SubmissionEvent and RejectionCodes (identical to Completed and Error today in the Status service, e.g. AcknowledgementService.java#L441)
* [ ] Modify the status-change SQS message payload to include Errors such that all information contained in a StatusResponseBody is included in the SQS message and can be persisted in the Backend database
* [ ] When consuming SQS message payloads, persist rejectionCodes appropriately into the SubmissionEvent or RejectionCodes table (as decided above). After the data is persisted to the database, cache the results as a submissionId:StatusResponseBody key-value pair in Redis.
* [ ] Modify the TaxReturnService.getStatus in TaxReturnService.java#L793 to:
* [ ] First, query for the latest submission ID of the tax return and use this ID as a key to check Redis to see if there is a corresponding StatusResponseBody value
* [ ] Cache hit: return the StatusResponseBody value to the client
* [ ] Cache miss: either due to TTL expiration or the return hasn't been received/acknowledged by MeF. Check the database using
`TaxReturnService.getLatestSubmissionEventByTaxReturnIdPreferringAcceptedSubmissions`
* [ ] If the status is accepted/rejected, construct a StatusResponseBody payload from a combination of the SubmissionEvent status (+ RejectionCode if applicable) and return it to the client
* [ ] If the status is anything else, the tax return hasn't been sent to/received by MeF and the appropriate StatusResponseBody is `pending`.
* [ ] Deprecate all Controllers and internal APIs in the Status service. Goodbye!
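The read path in the steps above reduces to a cache-aside lookup keyed on the latest submission ID, with the accepted-preferring SubmissionEvent query as the fallback and `pending` as the default. A Python sketch of that control flow (illustrative only; a dict stands in for Redis and for the SubmissionEvent table, and all names are hypothetical):

```python
cache = {}  # stands in for Redis: submissionId -> StatusResponseBody

# Stands in for the SubmissionEvent table. Note "sub-1" has both a
# rejected and an accepted event (the duplicate-submission case).
events = {
    "sub-1": [{"status": "rejected", "codes": ["IND-031"]},
              {"status": "accepted", "codes": []}],
    "sub-2": [],  # nothing received/acknowledged by MeF yet
}

def latest_event_preferring_accepted(submission_id):
    """Guardrail: an accepted event always wins over a rejected one."""
    rows = events.get(submission_id, [])
    accepted = [e for e in rows if e["status"] == "accepted"]
    if accepted:
        return accepted[-1]
    return rows[-1] if rows else None

def get_status(submission_id):
    if submission_id in cache:                 # cache hit
        return cache[submission_id]
    event = latest_event_preferring_accepted(submission_id)  # miss: DB
    if event is None:                          # not yet acked by MeF
        return {"status": "pending", "rejectionCodes": []}
    body = {"status": event["status"], "rejectionCodes": event["codes"]}
    cache[submission_id] = body                # warm the cache
    return body

print(get_status("sub-1"))  # accepted despite a rejected duplicate
print(get_status("sub-2"))  # pending; nothing persisted yet, not cached
```

Only terminal statuses are cached here; a `pending` result is recomputed on each call so the client sees the acceptance/rejection as soon as it lands.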
# Solution 2 Implementation Proposal
* [ ] Add a new API to the AcknowledgementController in AcknowledgementController.java#L45 that accepts a single submission ID instead of a tax return ID. This API will be similar to the current `get()` method within the Controller, except that it will have a different query pattern: instead of looking for the most recent submission of a tax return, it will look up the submission itself via its submission ID.
* [ ] Logic around handling cases where a submission is both accepted and rejected should remain in place (i.e., we should always return accepted if a duplicate condition exists)
* [ ] NOTE: An alternative here is to change the current Status service API to only accept submission IDs (which can be easily accommodated by the Backend as described below) and then expose an internal API in the backend `getMostRecentSubmission` that the State API service calls before calling the `status` endpoint to fetch the status of the latest submission and check if the return can be exported
* [ ] Convert the TaxReturnService.getStatus in TaxReturnService.java#L793 to query for the latest submission ID of the tax return. Before calling the internal status endpoint, the backend first checks Redis to see whether a latest-submission-ID to AcknowledgementStatus key-value pair exists.
* [ ] If a cache hit, return the AcknowledgementStatus response to the client without making a call to the Status service
* [ ] If a cache miss, make an HTTP request to the Status service `/status` endpoint to fetch the AcknowledgementStatus, cache the results and return the result to the client
* [ ] When an Acknowledgement is received within the Status service, update the Redis cache with the accepted/rejected status
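Solution 2's flow differs mainly in where the cache is consulted and written: the Backend checks Redis before making the cross-service REST call, and the Status service warms the cache when an acknowledgement arrives. A Python sketch (illustrative only; all names are hypothetical, a dict stands in for Redis, and a real implementation would put a short TTL on cached `pending` entries rather than keeping them forever as this toy does):

```python
cache = {}       # Redis stand-in: submissionId -> status
rest_calls = 0   # counts cross-service REST calls to the Status service

def status_service_get(submission_id):
    """Stand-in for the Status service /status endpoint."""
    global rest_calls
    rest_calls += 1
    return "pending"  # no acknowledgement received yet

def backend_get_status(submission_id):
    if submission_id in cache:      # cache hit: skip the REST call
        return cache[submission_id]
    status = status_service_get(submission_id)  # miss: call Status service
    cache[submission_id] = status
    return status

def on_acknowledgement(submission_id, status):
    """Status service updates the cache when MeF acknowledges."""
    cache[submission_id] = status

backend_get_status("sub-9")          # miss: one REST call, caches "pending"
on_acknowledgement("sub-9", "accepted")
print(backend_get_status("sub-9"))   # served from cache, no second call
print(rest_calls)
```

Compared with Solution 1, the Status service and its datastore stay in the picture; the cache only reduces how often they are hit.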

# How to generate a tsc perf build
1. Make sure that the normal lint steps run first to generate all the factgraph-related artifacts
2. `cd` into the package whose tsc build you'd like to test (probably df-client-app/)
3. Run `npx tsc --generateTrace ~/my-tsc-trace-output-directory` replacing the output directory with whatever you'd like
4. Open Chrome and navigate to `chrome://tracing/`
5. Click the "Load" button and open the trace generated in step 3 (a JSON file called `trace.json` in your output directory)