Accounting image

Introduction

Data science skills are good for exploring the world around us through a quantitative lens, and can also lead to a rewarding and stimulating career. But data science can also help you organize your personal business. For example, I was recently interested to see the balance history of my checking account. My transaction history is available for download from my bank, but the balance history is not.

However, I realized I could use the popular Python package pandas to wrangle the transaction data and compute a balance history. Then I learned a visualization tool that I’d never used, but heard a lot of good things about, Plotly, to create an interactive graph of my balance history with all the information I wanted to see. This included a line plot of the daily balance, with hover text displaying the description and amount of individual transactions upon mouseover, as well as a plot of monthly average balance. In the end, I found the result quite helpful in giving me a quick and informative look at my financial history.

If you’re interested to learn a bit more about wrangling and visualizing data in Python, I hope that you can gain some of those skills from reading this post. Perhaps you may even use the code for the same purpose I did. This notebook could be adapted to visualize a balance history using any dated transaction history. And if you happen to bank with a certain major national chain, you will find you can download your transaction history as a CSV file in exactly the same format as the synthetic data shown here, and use the code directly. Enjoy!


Loading packages and printing versions

First things first.

import sys
import pandas as pd
import numpy as np
import plotly
import plotly.graph_objects as go
print('The Python version is {}\n'.format(sys.version))
print('The pandas version is {}\n'.format(pd.__version__))
print('The NumPy version is {}\n'.format(np.__version__))
print('The Plotly version is {}\n'.format(plotly.__version__))
The Python version is 3.7.4 (default, Aug 13 2019, 15:17:50) 
[Clang 4.0.1 (tags/RELEASE_401/final)]

The pandas version is 1.0.1

The NumPy version is 1.18.1

The Plotly version is 4.5.4

Loading and exploring the data

I’ve created synthetic transaction data for a person with a very simple financial profile: they get a bimonthly paycheck, pay rent and bills on the first business day of every month, and have a few other expenses at irregular times, including a large purchase that might be a car, for example. This is much simpler than what a real transaction record would probably look like, but it works for illustrating the approach here. If you’re interested in the details of using pandas to create the synthetic data, please see the appendix below.

Let’s load the data and do some basic profiling.

Note: I needed to use the index_col=False argument in read_csv to load the data I downloaded from my bank, as there was a missing column header for the first column.

transactions_df = pd.read_csv('../data/synthetic_transaction_data.csv')
transactions_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41 entries, 0 to 40
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Unnamed: 0   41 non-null     object 
 1   Date         41 non-null     object 
 2   Description  41 non-null     object 
 3   Credit       24 non-null     float64
 4   Debit        17 non-null     float64
dtypes: float64(2), object(3)
memory usage: 1.7+ KB
transactions_df.head()
Unnamed: 0 Date Description Credit Debit
0 Cleared 2019-12-31 Payroll 2500.0 NaN
1 Cleared 2019-12-15 Payroll 2500.0 NaN
2 Cleared 2019-12-02 Rent and bills NaN 3500.0
3 Cleared 2019-11-30 Payroll 2500.0 NaN
4 Cleared 2019-11-29 Purchase NaN 750.0

Looks like there are 41 rows in the data, and each row represents a transaction. The first column here looks like it consists entirely of the string Cleared:

transactions_df['Unnamed: 0'].unique()
array(['Cleared'], dtype=object)

Yes it does. This likely just means the transaction has cleared, or in other words been settled. The other columns include the date of the transaction, a description, and separate columns for whether the transaction was a credit to the account (i.e. money flowing in to the account) or a debit (money flowing out). Each transaction is either debit or credit but not both, with the unused column having a missing value in each row.

Convert the Date column to a datetime data type in pandas to enable pandas’ powerful time series capabilities, then look at the range of dates.

transactions_df['Date'] = pd.to_datetime(transactions_df['Date'])
print(transactions_df['Date'].min(), transactions_df['Date'].max())
2019-01-01 00:00:00 2019-12-31 00:00:00

We can see the data span one year in time.

Wrangling the data and calculating a balance history

Right now the transaction amounts are contained in separate columns. Ideally we’d have one column that includes all transaction amounts, with a sign to tell the difference between credits and debits. This would make it easier to add credits and subtract debits from the balance through time. Let’s create such a column, by filling the missing values with zero, adding the credits, and subtracting the debits:

transactions_df['Transaction'] = transactions_df['Credit'].fillna(0)\
- transactions_df['Debit'].fillna(0)
transactions_df.head(5)
Unnamed: 0 Date Description Credit Debit Transaction
0 Cleared 2019-12-31 Payroll 2500.0 NaN 2500.0
1 Cleared 2019-12-15 Payroll 2500.0 NaN 2500.0
2 Cleared 2019-12-02 Rent and bills NaN 3500.0 -3500.0
3 Cleared 2019-11-30 Payroll 2500.0 NaN 2500.0
4 Cleared 2019-11-29 Purchase NaN 750.0 -750.0

Looks like this worked.

Now, one of the things I’ve seen people make with Plotly is an interactive plot with hover text, that appears over data points when the user mouses over them. I decided I wanted the amount and description of each transaction to appear as hover text on my plot, so I created a new column with this string.

transactions_df['Description_amount'] = \
transactions_df['Transaction'].astype(str).str.cat(
    transactions_df['Description'], sep=': ')
transactions_df.head()
Unnamed: 0 Date Description Credit Debit Transaction Description_amount
0 Cleared 2019-12-31 Payroll 2500.0 NaN 2500.0 2500.0: Payroll
1 Cleared 2019-12-15 Payroll 2500.0 NaN 2500.0 2500.0: Payroll
2 Cleared 2019-12-02 Rent and bills NaN 3500.0 -3500.0 -3500.0: Rent and bills
3 Cleared 2019-11-30 Payroll 2500.0 NaN 2500.0 2500.0: Payroll
4 Cleared 2019-11-29 Purchase NaN 750.0 -750.0 -750.0: Purchase

Looking at the earliest transactions in the record, which come last in the reverse chronological order of the data, we can see that some dates have more than one transaction. Since we want a daily balance, we will need a way to combine rows that have the same date.

transactions_df.tail()
Unnamed: 0 Date Description Credit Debit Transaction Description_amount
36 Cleared 2019-02-01 Rent and bills NaN 3500.0 -3500.0 -3500.0: Rent and bills
37 Cleared 2019-01-31 Payroll 2200.0 NaN 2200.0 2200.0: Payroll
38 Cleared 2019-01-15 Purchase NaN 1000.0 -1000.0 -1000.0: Purchase
39 Cleared 2019-01-15 Payroll 2200.0 NaN 2200.0 2200.0: Payroll
40 Cleared 2019-01-01 Rent and bills NaN 3500.0 -3500.0 -3500.0: Rent and bills

To collapse the data down to a date level, we can groupby date. Within each day, the way to aggregate transactions is to add them up, giving the net change in balance on that day. We’ll also need a way to combine the text strings that describe the transactions, to have the hover text on the daily balance plot. Concatenating strings with a newline separator is a readable way to do this, and it turns out the HTML line break syntax <br> is what we’ll need for our Plotly graph.

transactions_date_group = transactions_df.groupby('Date').agg({
    'Description_amount': '<br>'.join,
    'Transaction':'sum'
})
transactions_date_group.head()
Description_amount Transaction
Date
2019-01-01 -3500.0: Rent and bills -3500.0
2019-01-15 -1000.0: Purchase<br>2200.0: Payroll 1200.0
2019-01-31 2200.0: Payroll 2200.0
2019-02-01 -3500.0: Rent and bills -3500.0
2019-02-14 -800.0: Purchase -800.0

The grouping by date resulted in a chronological ordering from earliest to latest, and we can see that multiple transactions on the same date have been combined. We are almost done with calculating a balance history. The last steps require knowing the balance at some point in time. This information should be readily available on the day the balance history was obtained. Let’s say the hypothetical person here had $22,834 in the bank on 12/31/2019, after all transactions had cleared for that day.

ending_balance = 22834

What about the starting balance? If we subtract all the credits and add all the debits in the record to the ending balance, we will have the balance before the first transaction in the record. In order to get this starting balance, we can subtract the sum of the Transaction column we created from the ending balance.

starting_balance = ending_balance - transactions_date_group['Transaction'].sum()
starting_balance
19984.0

To get a balance through time, we need the starting balance with all credits added, and debits subtracted, for each day in the record up to and including that day. This would result in a daily balance according to the clearing date of each transaction. In order to get a sum of all previous rows up to and including the current row, we can use pandas’ cumsum method on the Transaction Series. Then we add this to the starting balance and we have the balance history.

transactions_date_group['Running balance'] = starting_balance\
+ transactions_date_group['Transaction'].cumsum()
transactions_date_group.head()
Description_amount Transaction Running balance
Date
2019-01-01 -3500.0: Rent and bills -3500.0 16484.0
2019-01-15 -1000.0: Purchase<br>2200.0: Payroll 1200.0 17684.0
2019-01-31 2200.0: Payroll 2200.0 19884.0
2019-02-01 -3500.0: Rent and bills -3500.0 16384.0
2019-02-14 -800.0: Purchase -800.0 15584.0

It looks like this has worked. We can check by confirming the running balance at the end of the first day in the record is the starting balance plus the sum of transactions for that day:

first_date_index = transactions_date_group.index[0]
check_1 = starting_balance + transactions_date_group.loc[first_date_index,
                                                         'Transaction']
check_2 = transactions_date_group.loc[first_date_index, 'Running balance']

print(check_1)
print(check_2)

assert check_1 == check_2
16484.0
16484.0

Looks like our check passed! Now we have enough information to plot a daily balance with descriptive hover text. However, I also wanted to know what the monthly average balance was, where the average was taken over the end-of-day balance for all days in the month. For this we need a bit more data manipulation.

Monthly average balance

Right now the frequency of our balance history is irregular. There are only records for days that had transactions. If we want an accurate monthly average, representing the average balance across all the days of the month, it would be helpful to interpolate our time series of balance history, filling in missing days with the most recent balance. A quick way to do this in pandas is to take our balance history that we already calculated and create a new DataFrame with a DatetimeIndex that we specify to have all the days of the year. Then we can easily fill in missing values to have the daily balance.

First let’s create the daily index.

daily_index = pd.date_range(transactions_date_group.index.min(),
                            transactions_date_group.index.max())

Now let’s create the new DataFrame and reindex it using our daily index. The days with no transaction data will be filled with np.nan.

df_daily = pd.DataFrame(transactions_date_group['Running balance'])
df_daily = df_daily.reindex(daily_index, fill_value=np.nan)
df_daily.head(5)
Running balance
2019-01-01 16484.0
2019-01-02 NaN
2019-01-03 NaN
2019-01-04 NaN
2019-01-05 NaN

Looks like we’ve got a row for each day! Now, we need to fill missing values, with the most recent balance. Pandas has an option to “forward fill” missing values, which works well with our chronologically ordered DataFrame.

df_daily = df_daily.fillna(method='ffill')
df_daily.head(5)
Running balance
2019-01-01 16484.0
2019-01-02 16484.0
2019-01-03 16484.0
2019-01-04 16484.0
2019-01-05 16484.0

Finally, because we have a DatetimeIndex, calcuating the monthly average balance is a snap with pandas resampling capabilities.

df_monthly = \
pd.DataFrame(df_daily['Running balance'].resample(rule='1M').mean())
df_monthly.head()
Running balance
2019-01-31 17213.032258
2019-02-28 17134.000000
2019-03-31 17761.419355
2019-04-30 18800.666667
2019-05-31 20335.612903

This has given us the average monthly balance, associated with the last day of the month.

Creating the plot

We now have all the information we need to create a descriptive plot of daily balance and monthly average balance. In order to do this, I found that Plotly’s Graph Objects API provided the flexibility I needed to customize the plot to my liking, although Plotly has another more lightweight interface called Plotly Express.

Unpacking the code below, we first create a figure, similar to plotting with Matplotlib, another popular visualization library in Python. Then we add two “traces” to the figure, each of which is a Scatter object. These are the plots of daily and monthly average balance.

The DatetimeIndex from each DataFrame we created is used for the x-coordinates in each plot, and the y-coordinates are the quantities we want to visualize over time, with some rounding for visual presentation purposes. Both plots will have lines and markers (mode='lines+markers'), and are named so they can be represented in the legend.

Aside from these specifications, the daily balance plot has additional options including the custom hover text we created (text), and a hoverinfo argument stating that we want the x- and y-coordinates displayed upon mouseover, along with the custom text. Finally this plot is given an hv line shape, which means that in order to trace the line from point to point through time, first a step is taken in the horizontal direction, then the vertical direction. This is what we want, because the balance remains the same after each day with a transaction, moving horizontally through time on the graph, and then moves vertically to the new balance on the day of the next transaction. The effect of this is similar to filling missing values forward in time, like we did with the daily-interpolated balances above.

The last touch before showing the figure is to format the hover text, so that it doesn’t show too many decimal places. And then “Voila!”, we have our visualization of balance history!

fig = go.Figure()
fig.add_trace(go.Scatter(x=transactions_date_group.index,
                         y=transactions_date_group['Running balance'].round(0),
                         mode='lines+markers',
                         name='Daily balance',
                         text=transactions_date_group['Description_amount'],
                         hoverinfo='x+y+text',
                         line_shape='hv'))

fig.add_trace(go.Scatter(x=df_monthly.index,
                         y=df_monthly['Running balance'].round(0),
                         mode='lines+markers',
                         name='Monthly mean balance'))

fig.update_yaxes(hoverformat="$d")

fig.show()

Notice how you can interact with the plot, including mousing over to see details, as well as dragging to select a portion to zoom in on. Double-click to zoom back out.

As you can see, there are many ways to customize visualizations with Plotly and I’ve likely only scratched the surface here. For me, the most important part was the hover text that I was able to customize to show all the information I wanted. But I also realized it’s nice to be able to zoom in to particular regions of the plot to get a more detailed look.

If you were looking to learn more about data wrangling with pandas and visualization with Plotly, or even do the specific task performed here, I hope you found this blog post helpful. For more data science and machine learning resources, check out my other posts as well as my book Data Science Projects with Python: A case study approach to successful data science projects using Python, pandas, and scikit-learn.

Appendix: Creating the synthetic data

Start with a bimonthly paycheck (SM = semi-month end frequency (15th and end of month)). This person got a raise starting in April. Good for them!

paycheck_times = pd.Series(pd.date_range(start=pd.Timestamp(2019,1,1),
                                         end=pd.Timestamp(2019,12,31),
                                         freq='SM'))
paycheck_amounts = np.empty(paycheck_times.shape)
paycheck_amounts[:6] = 2200
paycheck_amounts[6:] = 2500
paycheck_df = pd.DataFrame({'Unnamed: 0':'Cleared',
                            'Date':paycheck_times,
                            'Description':'Payroll',
                            'Credit':paycheck_amounts})
paycheck_df.head(8)
Unnamed: 0 Date Description Credit
0 Cleared 2019-01-15 Payroll 2200.0
1 Cleared 2019-01-31 Payroll 2200.0
2 Cleared 2019-02-15 Payroll 2200.0
3 Cleared 2019-02-28 Payroll 2200.0
4 Cleared 2019-03-15 Payroll 2200.0
5 Cleared 2019-03-31 Payroll 2200.0
6 Cleared 2019-04-15 Payroll 2500.0
7 Cleared 2019-04-30 Payroll 2500.0

Include some rent. For simplicity’s sake, I’ll assume all this person’s monthly expenses come on the first business day of the month (BMS or business month start), although a realistic transaction record would likely be more complicated.

rent_times = pd.Series(pd.date_range(start=pd.Timestamp(2019,1,1),
                                     end=pd.Timestamp(2019,12,31),
                                     freq='BMS'))
rent_amounts = np.ones(rent_times.shape) * 3500
rent_df = pd.DataFrame({'Unnamed: 0':'Cleared',
                        'Date':rent_times,
                        'Description':'Rent and bills',
                        'Debit':rent_amounts})
rent_df.head()
Unnamed: 0 Date Description Debit
0 Cleared 2019-01-01 Rent and bills 3500.0
1 Cleared 2019-02-01 Rent and bills 3500.0
2 Cleared 2019-03-01 Rent and bills 3500.0
3 Cleared 2019-04-01 Rent and bills 3500.0
4 Cleared 2019-05-01 Rent and bills 3500.0

Finally, some purchases.

purchase_times = pd.Series([
    pd.Timestamp(2019,1,15),
    pd.Timestamp(2019,2,14),
    pd.Timestamp(2019,7,1),
    pd.Timestamp(2019,9,5),
    pd.Timestamp(2019,11,29)
])
purchase_amounts = np.array([1000, 800, 10500, 300, 750])
purchase_df = pd.DataFrame({'Unnamed: 0':'Cleared',
                            'Date':purchase_times,
                            'Description':'Purchase',
                            'Debit':purchase_amounts})
purchase_df.head()
Unnamed: 0 Date Description Debit
0 Cleared 2019-01-15 Purchase 1000
1 Cleared 2019-02-14 Purchase 800
2 Cleared 2019-07-01 Purchase 10500
3 Cleared 2019-09-05 Purchase 300
4 Cleared 2019-11-29 Purchase 750

Merge all transactions into one DataFrame.

transactions_df = pd.concat([paycheck_df, rent_df, purchase_df])
transactions_df = transactions_df.sort_values(by='Date',
                                              ascending=False,
                                              ignore_index=True)
transactions_df.head()
Unnamed: 0 Date Description Credit Debit
0 Cleared 2019-12-31 Payroll 2500.0 NaN
1 Cleared 2019-12-15 Payroll 2500.0 NaN
2 Cleared 2019-12-02 Rent and bills NaN 3500.0
3 Cleared 2019-11-30 Payroll 2500.0 NaN
4 Cleared 2019-11-29 Purchase NaN 750.0
transactions_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41 entries, 0 to 40
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   Unnamed: 0   41 non-null     object        
 1   Date         41 non-null     datetime64[ns]
 2   Description  41 non-null     object        
 3   Credit       24 non-null     float64       
 4   Debit        17 non-null     float64       
dtypes: datetime64[ns](1), float64(2), object(2)
memory usage: 1.7+ KB
transactions_df.to_csv('../data/synthetic_transaction_data.csv', index=False)