SEC Filings Parsing
One aspect that makes parsing an SEC filing challenging is that each company uses a slightly different format for the filing. We have therefore implemented a custom process that cleans up the HTML content as much as necessary to bring every filing into a common format. We then extract the headings from the Table of Contents, which is always present at the start of the filing, and use those headings to extract the relevant sections from the filing.
The process is implemented in the SECFilingParser class and relies heavily on the BeautifulSoup package to parse the HTML content.
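The cleaning and Table of Contents steps can be sketched roughly as follows. This is a minimal illustration, not the actual SECFilingParser code: the helper names `clean_html` and `extract_toc_headings`, the sample markup, and the assumption that Table of Contents entries are in-page anchor links (`href="#..."`) are all hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical, heavily reduced filing used only for illustration.
SAMPLE_FILING = """
<html><body>
<style>p { margin: 0 }</style>
<table>
  <tr><td><a href="#item1a">Item 1A.&nbsp;Risk   Factors</a></td><td>12</td></tr>
  <tr><td><a href="#item7">Item 7. Management's Discussion and Analysis</a></td><td>30</td></tr>
</table>
<p id="item1a">Item 1A. Risk Factors</p>
<p>Our business faces risks.</p>
</body></html>
"""


def clean_html(html: str) -> BeautifulSoup:
    """Parse the filing and drop tags that carry no readable content."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style"]):
        tag.decompose()
    return soup


def extract_toc_headings(soup: BeautifulSoup) -> list[str]:
    """Collect heading texts from in-page anchor links.

    Assumption: every Table of Contents entry is an <a href="#..."> link.
    Real filings vary, so a production parser needs more heuristics.
    """
    headings = []
    for link in soup.find_all("a", href=True):
        if link["href"].startswith("#"):
            # Collapse runs of whitespace, including non-breaking spaces.
            text = " ".join(link.get_text().split())
            if text:
                headings.append(text)
    return headings


soup = clean_html(SAMPLE_FILING)
headings = extract_toc_headings(soup)
```

Normalizing whitespace at this stage matters because the same heading often appears with different spacing in the Table of Contents and in the body of the filing.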
The following simplified diagram shows the overall parsing steps:
---
config:
  layout: elk
  theme: base
  look: classic
---
flowchart LR
F["fa:fa-file SEC Filing"] --> C["fa:fa-broom Clean"]
C --> E
E["fa:fa-list Extract TOC"] --> RF["fa:fa-triangle-exclamation Extract Risk Factors"]
E --> MDA["fa:fa-comment-dots Extract MD&A"]
classDef lightBlue fill:#ADD8E6,stroke:black
class F lightBlue
classDef strokeBlack stroke:black
class E,C strokeBlack
classDef green fill:#4CAF50,stroke:black
class MDA,RF green
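The last step in the diagram, using the Table of Contents headings to cut out a section such as Risk Factors or MD&A, could look something like the sketch below. It is an illustration under stated assumptions rather than the project's implementation: `extract_section` is a hypothetical helper, the filing is assumed to be already reduced to plain text, and each heading is assumed to appear verbatim in the body, with its last occurrence marking the start of the section (the first occurrence being the Table of Contents entry).

```python
# Hypothetical plain-text filing: the headings appear once in the
# Table of Contents and once more at the start of each section.
FILING_TEXT = (
    "Table of Contents\n"
    "Item 1A. Risk Factors\n"
    "Item 7. MD&A\n"
    "Item 1A. Risk Factors\n"
    "Demand may fall.\n"
    "Item 7. MD&A\n"
    "Revenue grew.\n"
)
HEADINGS = ["Item 1A. Risk Factors", "Item 7. MD&A"]


def extract_section(text: str, headings: list[str], wanted: str) -> str:
    """Return the text between `wanted` and the heading that follows it.

    Assumption: the last occurrence of a heading starts its section.
    A later cross-reference to the heading would break this heuristic.
    """
    index = headings.index(wanted)
    start = text.rfind(wanted)
    if start == -1:
        raise ValueError(f"heading not found in text: {wanted!r}")
    start += len(wanted)

    # The section ends where the next heading starts, or at end of text.
    end = len(text)
    if index + 1 < len(headings):
        next_start = text.rfind(headings[index + 1])
        if next_start > start:
            end = next_start
    return text[start:end].strip()


risk_factors = extract_section(FILING_TEXT, HEADINGS, "Item 1A. Risk Factors")
mda = extract_section(FILING_TEXT, HEADINGS, "Item 7. MD&A")
```

Bounding each section by the next heading in Table of Contents order is what makes the heading list useful: the parser does not need to know in advance how long a section is.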