How I DIY’d my Budget Using Python: Selenium and Beautiful Soup


I’m an avid fan of personality tests. I love reading through a pre-packaged type description that I swear is unique to me and only me. I love how they make me feel as though a pair of psychologists has stolen a glance at my soul using answers I myself provided. But I know not everyone agrees. For the naysayers, whether they say the mainstream options are too cliché or discount the results because they’re self-reported, I present: the Budget Type Index.

For those now expecting to receive a PDF with your personalized profile, I hope you’re not too disappointed when I tell you this is something you’ll have to make for yourself. Okay, fine, this isn’t really a personality test in the traditional sense, but I do think a well-crafted budget can tell you just as much about yourself as any of those alluring online quizzes, if not more.

When I set out to do this project, I wanted to understand where my money was going and, by extension, what I prioritized. I had previously been manually aggregating all my spending across my multiple bank accounts, including frequently used non-bank services like Venmo. I had searched for a service that would not only automate this process and show me my historical data, but also do it without a monthly fee. Nothing quite fit all these criteria, so I created my own using Python.

For anyone else also looking to measure and manage their spending, gathering the data is the first, and most important, step. I’ve broken down the rest of this article based on the two tools I’ve used:

  1. Selenium
  2. Beautiful Soup

I’m happy to help if you want to build your own budgeting tool — feel free to reach out at jenniferrkim7@gmail.com even if we don’t know each other!

Selenium

Selenium automates browsers. Originally created to test web applications, it is now also widely used for web scraping. I’ve included the entirety of my code below before breaking down how I used this tool.

Getting set up

You’ll first need to install two software packages.

  • The Selenium package

Install by typing the following in your command prompt:

pip install selenium
  • The web driver of the browser you’re using

The Chrome driver (which is what I’m using) can be found here. There are different drivers for different versions of Chrome. To find out which version you’re using, click the three vertical dots at the top right of your browser to open the menu, go to Settings, and click “About Chrome” — this will display your Chrome version. Download the applicable driver and make sure it’s in your Python PATH.

A more thorough installation explanation, including links to drivers of other browsers, can be found in the docs here.
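Once both are installed, a quick sanity check confirms the driver launches. This snippet is just an example, not part of my budget code; it assumes the Chrome driver is on your PATH:

from selenium import webdriver

# launches Chrome using the driver found on your PATH
browser = webdriver.Chrome()
browser.get("https://www.python.org")  # any page works for a smoke test
print(browser.title)
browser.quit()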

Size matters

Now that you have the necessary packages, you can start specifying which web elements the driver should select. One thing that can affect the location of these elements is the size of your window. For maximum consistency, I like to maximize my window before starting any processes.

# from line 8
browser.maximize_window()

Locating elements

To retrieve transactions, we first want Selenium to log into the bank’s website. We can identify which elements need to be selected by inspecting the website’s HTML. To pull it up, go to the website and find the login box. Right-click the Online ID field and select “Inspect”.

[Screenshot: right-clicking the Online ID field and selecting “Inspect”]

The element inspector will pop up, with the field you chose highlighted (in this case, Online ID).

[Screenshot: the element inspector, with the Online ID field’s name attribute circled in red]

There are eight different ways to locate your element in Selenium. These are by:

  • Name
elem = driver.find_element_by_name("INSERT-NAME")

This is the one I decided to use, as indicated by the poorly drawn red circle in the screenshot above.

  • ID
elem = driver.find_element_by_id("INSERT-ID")

This is considered to be the most accurate method, as each element’s ID is unique.

  • Link Text
[Screenshot: a New York Times headline link circled as an example]
elem = driver.find_element_by_link_text("INSERT-NAME-OF-LINK-ON-PAGE")

# example if I wanted to select the link circled above
elem = driver.find_element_by_link_text("Vote Is Likely to Fall Largely Along Party Lines")
  • Partial Link Text
elem = driver.find_element_by_partial_link_text("DONT-NEED-FULL-LINK")

# example if I still wanted to select the NYT link above
elem = driver.find_element_by_partial_link_text("Vote Is Likely to Fa")
  • CSS Selector
elem = driver.find_element_by_css_selector("INSERT-CSS-SYNTAX")

Good examples on CSS selectors can be found here: https://saucelabs.com/resources/articles/selenium-tips-css-selectors

  • Tag Name
# referring to an HTML tag. first element with tag is returned.
elem = driver.find_element_by_tag_name("INSERT-TAG-NAME")
  • Class Name
elem = driver.find_element_by_class_name("INSERT-CLASS-NAME")
  • XPath
elem = driver.find_element_by_xpath("INSERT-XPATH")

XPath is a language used for locating nodes in an XML document. It is useful when there is no suitable id or name attribute for your target element. The basic format is as follows:

xpath = //tagname[@attribute="value"]

You can read more on XPath here.

Be mindful that each of these methods selects only the first matching element it finds. To select multiple elements, use the same methods but replace the word “element” with “elements” (e.g. driver.find_elements_by_name("INSERT-NAME")); the plural form returns a list, as sketched below.
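For instance, a quick hypothetical sketch that collects every link on the current page:

# the plural form returns a list of matching elements to iterate over
links = driver.find_elements_by_tag_name("a")
for link in links:
    print(link.text)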

Inputting keys

After you find the login element, the next step is to input your credentials. This is done with the function send_keys().

import time

browser.find_element_by_name("onlineId1").send_keys("YOUR-USERNAME")
time.sleep(2)
password = browser.find_element_by_name("passcode1")
password.send_keys("YOUR-PASSWORD")

Remember to protect yourself by not committing your password anywhere.

I added a wait to tell Selenium to pause for two seconds between entering my username and password using time.sleep(). I found that without it, Selenium moved too fast and the browser had a hard time keeping up.
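As an aside, if you’d rather not guess at sleep durations, Selenium also supports explicit waits, which poll until a condition is met. I stuck with time.sleep(), but a sketch of the alternative (targeting the same onlineId1 field) looks like this:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# wait up to 10 seconds for the username field to appear before proceeding
wait = WebDriverWait(browser, 10)
username_field = wait.until(EC.presence_of_element_located((By.NAME, "onlineId1")))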

I would typically press the Enter key after typing in my credentials, so I wanted to do the same in Selenium. Luckily, Selenium has a list of standard keyboard keys in its Keys class. In this case, I used Keys.RETURN:

from selenium.webdriver.common.keys import Keys

password.send_keys(Keys.RETURN)

Now you’re in!

To see whether you located the elements and entered your credentials correctly, try running your code. A new Chrome instance will pop up, and you can watch the browser run automatically. This instance is separate from the browser you use regularly: it contains no cookies and disappears after you’re done. If you do need cookies, you can check out how to add them on this website.
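For reference, the driver exposes an add_cookie() method that takes a dictionary; note that you must navigate to the cookie’s domain before adding one. A minimal sketch with placeholder values:

# you must visit the domain before adding a cookie for it
browser.get("https://www.bankofamerica.com")
browser.add_cookie({"name": "EXAMPLE-NAME", "value": "EXAMPLE-VALUE"})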

I can see that my code ran correctly when this Chrome instance takes me to my bank account home page. I see two links: one to my checking account and the other to my credit card. To click these links, I use find_element_by_link_text and select using the click() method.

browser.find_element_by_link_text('Bank of America Travel Rewards Visa Platinum Plus - ****').click()

Once you are on the page with the transactions you want, retrieve the page_source from the web driver and store it in a variable. This will be used for parsing later.

boa_travel_html = browser.page_source

Now the only thing left to do is to repeat with your other bank accounts.
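One way to keep that repetition tidy (purely illustrative; the link texts here are made up) is to loop over your account links and stash each page source in a dictionary:

# hypothetical sketch: visit each account link and save its page source
account_links = [
    "Adv Plus Banking - ****",
    "Bank of America Travel Rewards Visa Platinum Plus - ****",
]
pages = {}
for link_text in account_links:
    browser.find_element_by_link_text(link_text).click()
    pages[link_text] = browser.page_source
    browser.back()  # return to the account list before the next iteration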

iFrames

The process was nearly the same for my other account at Barclays, aside from a pesky iFrame. An iFrame, or inline frame, is an HTML document embedded inside another HTML document on a website. I first suspected one might be getting in my way when I received a NoSuchElementException despite clearly locating the element I wanted by its name. Luckily, Selenium has an easy way to move into an iFrame using the switch_to method.

browser.switch_to.frame(browser.find_element_by_tag_name("iframe"))
browser.find_element_by_name("uxLoginForm.username")

Continue to retrieve the page source using the same method as in the Bank of America example.

Headless browser

Once you know that your code works, you can expedite the process by getting rid of the browser that pops up the next time you run your program.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument("--headless")

driver = webdriver.Chrome(options=chrome_options)
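Everything else in the script stays the same; the browser simply runs invisibly. A quick check that the headless instance loads pages (the URL is just an example):

driver.get("https://www.python.org")
print(driver.title)  # confirms the page loaded without a visible window
driver.quit()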

You now have all your necessary data. It may not be in a very readable format, but making it usable is what Beautiful Soup is for.

Beautiful Soup

Beautiful Soup is a Python package for parsing HTML files. Now that we have the necessary HTML pages, we can use Beautiful Soup to parse them for the information we need. Again, I’ve included the code in its entirety before diving in below.
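The first step is turning a saved page source into a soup object. A minimal sketch, using Python’s built-in html.parser:

from bs4 import BeautifulSoup

# parse the raw HTML captured by Selenium into a navigable tree
boa_travel_soup = BeautifulSoup(boa_travel_html, "html.parser")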

Parsing transaction information

It’s time to check out the HTML page you retrieved earlier. Because the jumbled plain-text page is so… jumbled, I chose to navigate the HTML via the source itself, by right-clicking each of the transactions on the bank website and selecting “Inspect.” This highlighted the transaction in the web page element inspector (used earlier to identify login boxes with Selenium).

[Screenshot: a transaction row’s td tags nested within a tr tag in the element inspector]

The data I wanted to gather included the date, the description of the transaction, and the dollar amount. As seen above, these pieces of information were nested in multiple “td” tags within the parent “tr” tags. I used a combination of find and find_all functions to move along the tree until I arrived at the tag containing the text I wanted. The snippet below is how I retrieved the date.

# narrowed down to largest parent container
containers = rows.find_all('tr', class_=['trans-first-row odd', 'trans-first-row even', 'even', 'odd'])

dateli = []
descli = []
amtli = []
pending_counter = 0

for container in containers:
    date = container.find('td', headers='transaction-date').get_text(strip=True)
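The description and amount follow the same pattern inside the loop. The headers values below are hypothetical stand-ins (inspect your own bank’s HTML for the real ones), but the shape of the code is identical:

    # hypothetical header names; check your bank's HTML for the real ones
    desc = container.find('td', headers='transaction-description').get_text(strip=True)
    amt = container.find('td', headers='transaction-amount').get_text(strip=True)
    dateli.append(date)
    descli.append(desc)
    amtli.append(amt)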

Since how you use Beautiful Soup is so specific to the web page you’re looking at (as evidenced by the separate functions I wrote for each page I retrieved), I won’t run through my code line by line. Instead, I want to point out the irregularities and interesting tidbits I found, to help make your process as efficient as possible.

Class is class_

All the Beautiful Soup find functions take HTML attributes as keyword arguments. This is straightforward for most attributes, but because class is a reserved keyword in Python, you use class_ to represent its HTML counterpart.

containers = rows.find_all('tr', class_=['trans-first-row odd', 'trans-first-row even', 'even', 'odd'])

Lambda functions in soup.find()

The find functions can also take other functions as arguments. For a quick way to locate specific tags that fit multiple criteria, try inserting a lambda function.

# Example 1
rows = boa_travel_table.find(lambda tag: tag.name=='tbody')
# Example 2
boa_checking_table = boa_checking_soup.find(lambda tag: tag.name == 'table' and tag.has_attr('summary') and tag['summary'] == 'Account Activity table, made up of Date, Description, Type, Status, Dollar Amount, and Available balance columns.')

Example 1 is pretty simple. I could also have done without the lambda function to find the same element using this:

rows = boa_travel_table.find('tbody', class_ = 'trans-tbody-wrap')

Example 2 is where the lambda function’s power really shines. By combining multiple criteria and using Beautiful Soup’s has_attr, my power to search for exactly what I want increases dramatically. Another good example of lambda’s usefulness (and an explanation of lambda!) can be found here, where the author uses Python’s isinstance function to conduct Beautiful Soup searches.

Beautiful Soup’s text vs. string

In Rows 8–19 of my Beautiful Soup code above, I narrowed down the tags (or containers as I like to call them) to the largest one that contained all three pieces of information I wanted to extract (date, description, amount) for each transaction. To extract data from these drilled down containers, I used soup.tag.get_text().

date = container.find('td', headers = 'transaction-date').get_text(strip=True)

If you read through the Beautiful Soup documentation, you may have seen soup.tag.string used to extract text instead. This is what I first used, but I quickly found it did not work in this situation: .string returns a NavigableString object, and only when that string is the only thing present inside the tag; otherwise it returns None.

soup.tag.get_text(), on the other hand, can access all of its children’s strings (even descendants that aren’t direct children) and returns them as a single Unicode string. Therefore, if the text you want to extract lives within a child tag (you can see the a tag within the td tag in the screenshot below), you should use soup.tag.get_text().

[Screenshot: the transaction date text nested in an a tag inside its td tag]

If you prefer slightly cleaner code, you can also use soup.tag.text. This property calls get_text() and does basically the same thing, but I prefer the original get_text() as it supports keyword arguments like separator, strip, and types. For this project, I passed strip=True to strip whitespace out of the text.
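To see the difference concretely, here’s a tiny throwaway demo (the markup is made up, just shaped like a transaction cell):

from bs4 import BeautifulSoup

# two children means .string can't decide what to return, so it's None
td = BeautifulSoup('<td><span>01/02/2020</span><span>pending</span></td>', 'html.parser').td
print(td.string)                # None
print(td.get_text(strip=True))  # '01/02/2020pending'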

You now have the power to retrieve all your financial data from your sources by running a single program. This is your start to creating your own Budget Type Index and finding out more about yourself through your spending habits. Head off to collect your data points, and become the best financial version of yourself!
