SEC Filings Parsing

One aspect that makes parsing an SEC filing challenging is that each company uses a slightly different format for the filing. We have therefore implemented a custom process that cleans up the HTML content as far as is necessary to bring every filing into a common format. We then extract the headings from the Table of Contents, which is always present at the start of the filing, and use those headings to extract the relevant sections from the filing (Risk Factors and MD&A).

The process is implemented in the SECFilingParser class and relies heavily on the BeautifulSoup package to parse the HTML content.
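
As a rough illustration of the clean, extract-TOC, extract-sections idea, the following is a minimal sketch and not the actual SECFilingParser code. The function names (`toc_headings`, `extract_sections`) and the `ITEM_RE` pattern are hypothetical, and the sketch assumes that the Table of Contents is the first `<table>` in the document and that section headings start with "Item N". Real filings vary far more than this.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical pattern: headings such as "Item 1A. Risk Factors" or "Item 7."
ITEM_RE = re.compile(r"^item\s+(\d+[a-z]?)\b", re.IGNORECASE)


def _label(text):
    """Return a normalised label such as 'Item 1A', or None if text is not a heading."""
    match = ITEM_RE.match(text)
    return f"Item {match.group(1).upper()}" if match else None


def toc_headings(table):
    """Collect heading labels from the rows of the Table of Contents."""
    headings = []
    for row in table.find_all("tr"):
        cells = [cell.get_text(" ", strip=True) for cell in row.find_all("td")]
        # Replace non-breaking spaces and collapse runs of whitespace.
        text = " ".join(" ".join(cells).replace("\xa0", " ").split())
        label = _label(text)
        if label and label not in headings:
            headings.append(label)
    return headings


def extract_sections(html):
    """Return a mapping of heading label to section text (assumption-laden sketch)."""
    soup = BeautifulSoup(html, "html.parser")

    # Clean: drop elements that never carry filing content.
    for tag in soup(["script", "style"]):
        tag.decompose()

    # Extract TOC: assumed here to be the first table in the filing.
    toc = soup.find("table")
    headings = toc_headings(toc) if toc is not None else []
    if toc is not None:
        toc.decompose()  # avoid matching the TOC rows again in the body

    # Normalise the remaining body into non-empty, trimmed lines.
    raw_lines = soup.get_text(separator="\n").splitlines()
    lines = [line.replace("\xa0", " ").strip() for line in raw_lines]
    lines = [line for line in lines if line]

    # Extract sections: a TOC heading starts a section, the next one ends it.
    sections = {}
    current = None
    for line in lines:
        label = _label(line)
        if label in headings:
            current = label
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return {label: "\n".join(body) for label, body in sections.items()}
```

Calling `extract_sections(html)` on a filing that fits these assumptions returns a dictionary keyed by labels such as "Item 1A" and "Item 7". Using the Table of Contents as the source of valid headings keeps arbitrary body lines from being treated as section boundaries unless they begin with a listed item label.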

The following simplified diagram shows the overall steps for the parsing:

---
config:
  layout: elk
  theme: base
  look: classic
---
flowchart LR
    F["fa:fa-file SEC Filing"] --> C["fa:fa-broom Clean"]
    C --> E
    E["fa:fa-list Extract TOC"] --> RF["fa:fa-triangle-exclamation Extract Risk Factors"]
    E --> MDA["fa:fa-comment-dots Extract MD&A"]

    classDef lightBlue fill:#ADD8E6,stroke:black
    class F lightBlue

    classDef strokeBlack stroke:black
    class E,C strokeBlack

    classDef green fill:#4CAF50,stroke:black
    class MDA,RF green