Pandas (for data to plot)#
The standard data manipulation framework in Python (many others make use of at least part of its functionalities or are compatible with it)
We are not going in-depth - we are just showing some basic functionalities which are important for plotting.
import os
from pathlib import Path
import numpy as np
import pandas as pd
IN_COLAB = "COLAB_GPU" in os.environ
DATA_DIR = Path("data")
fname = Path("data") / "growth" / "fake_growth_data.csv"
if IN_COLAB:
fname = (
"https://raw.githubusercontent.com/biosustain/dsp_workshop_dataviz_python"
"/refs/heads/main/data/growth/fake_growth_data.csv"
)
1. DATA I/O (Input/Output)#
1. Read Data#
Let us load in some fake data made to fit some growth data.
df = pd.read_csv(fname)
Inspect the first few lines.
df.head()
| condition | SFN concentration (µM) | replicate | time (h) | Bacterial growth (OD600) | |
|---|---|---|---|---|---|
| 0 | Anaerobic | DMSO | 1 | 0.0 | 0.053628 |
| 1 | Anaerobic | DMSO | 1 | 2.0 | 0.034735 |
| 2 | Anaerobic | DMSO | 1 | 4.0 | 0.137467 |
| 3 | Anaerobic | DMSO | 1 | 6.0 | 0.300051 |
| 4 | Anaerobic | DMSO | 1 | 8.0 | 0.533996 |
You could also omit .head() and it will automatically shorten the output based on your rendering settings.
These settings are controlled by pandas display options, such as
pd.options.display.max_rowsandpd.options.display.max_columns.
df
| condition | SFN concentration (µM) | replicate | time (h) | Bacterial growth (OD600) | |
|---|---|---|---|---|---|
| 0 | Anaerobic | DMSO | 1 | 0.0 | 0.053628 |
| 1 | Anaerobic | DMSO | 1 | 2.0 | 0.034735 |
| 2 | Anaerobic | DMSO | 1 | 4.0 | 0.137467 |
| 3 | Anaerobic | DMSO | 1 | 6.0 | 0.300051 |
| 4 | Anaerobic | DMSO | 1 | 8.0 | 0.533996 |
| ... | ... | ... | ... | ... | ... |
| 463 | Aerobic | 2.5 | 3 | 16.0 | 0.943037 |
| 464 | Aerobic | 2.5 | 3 | 18.0 | 0.956014 |
| 465 | Aerobic | 2.5 | 3 | 20.0 | 0.945902 |
| 466 | Aerobic | 2.5 | 3 | 22.0 | 0.908015 |
| 467 | Aerobic | 2.5 | 3 | 24.0 | 0.917409 |
468 rows × 5 columns
1.2 Write it to a File#
E.g. after some data manipulation you wish to save the data.
Write a dataframe to a .csv file.
df.to_csv("fake_data.csv")
pd.read_csv() and .to_csv() are the general input/output functions (among some more).
Meaning: they work for other common files too (.txt, .tsv) and some more domain-specific
files that are however in a (somewhat) tabular format (e.g. .vcf).
In other words - as long as it looks like a table, you can read /write to it,
regardless of the file extension.
To demonstrate, let us save it to a .tsv file. The difference is that it is
tab-separated values instead of comma-separated values in a .csv file.
df.to_csv("fake_data.tsv", sep="\t")
It can make sense to omit the index column, as Pandas automatically creates
one when you read it, even if it already exists.
(Otherwise you specify e.g. index_col=0 when reading it - try it out to see the difference).
df.to_csv("fake_data.tsv", sep="\t", index=False)
pd.read_csv("fake_data.tsv", sep="\t", index_col=None).head()
| condition | SFN concentration (µM) | replicate | time (h) | Bacterial growth (OD600) | |
|---|---|---|---|---|---|
| 0 | Anaerobic | DMSO | 1 | 0.0 | 0.053628 |
| 1 | Anaerobic | DMSO | 1 | 2.0 | 0.034735 |
| 2 | Anaerobic | DMSO | 1 | 4.0 | 0.137467 |
| 3 | Anaerobic | DMSO | 1 | 6.0 | 0.300051 |
| 4 | Anaerobic | DMSO | 1 | 8.0 | 0.533996 |
You can also read/write zipped file. Pandas will try to detect it automatically, but you can specify it yourself:
df.to_csv("fake_data.tsv.gz", sep="\t", index=False, compression="gzip")
2. Data Manipulation#
2.1 Rename#
The column names are written in a human-readable format, with spaces and brackets.
It can be a good practice however to write them in a more programming-friendly way.
We can easily do it by using a dictionary with the syntax
{<old-name>: <new-name>}
df_col_renamed = df.rename(
columns={
"SFN concentration (µM)": "sfn_conc_mumolar",
"time (h)": "time_h",
"Bacterial growth (OD600)": "bact_growth_od600",
}
)
df_col_renamed
| condition | sfn_conc_mumolar | replicate | time_h | bact_growth_od600 | |
|---|---|---|---|---|---|
| 0 | Anaerobic | DMSO | 1 | 0.0 | 0.053628 |
| 1 | Anaerobic | DMSO | 1 | 2.0 | 0.034735 |
| 2 | Anaerobic | DMSO | 1 | 4.0 | 0.137467 |
| 3 | Anaerobic | DMSO | 1 | 6.0 | 0.300051 |
| 4 | Anaerobic | DMSO | 1 | 8.0 | 0.533996 |
| ... | ... | ... | ... | ... | ... |
| 463 | Aerobic | 2.5 | 3 | 16.0 | 0.943037 |
| 464 | Aerobic | 2.5 | 3 | 18.0 | 0.956014 |
| 465 | Aerobic | 2.5 | 3 | 20.0 | 0.945902 |
| 466 | Aerobic | 2.5 | 3 | 22.0 | 0.908015 |
| 467 | Aerobic | 2.5 | 3 | 24.0 | 0.917409 |
468 rows × 5 columns
We can do this - or many other operations - inplace for brevity.
df.rename(
columns={
"SFN concentration (µM)": "sfn_conc_mumolar",
"time (h)": "time_h",
"Bacterial growth (OD600)": "bact_growth_od600",
},
inplace=True,
)
df
| condition | sfn_conc_mumolar | replicate | time_h | bact_growth_od600 | |
|---|---|---|---|---|---|
| 0 | Anaerobic | DMSO | 1 | 0.0 | 0.053628 |
| 1 | Anaerobic | DMSO | 1 | 2.0 | 0.034735 |
| 2 | Anaerobic | DMSO | 1 | 4.0 | 0.137467 |
| 3 | Anaerobic | DMSO | 1 | 6.0 | 0.300051 |
| 4 | Anaerobic | DMSO | 1 | 8.0 | 0.533996 |
| ... | ... | ... | ... | ... | ... |
| 463 | Aerobic | 2.5 | 3 | 16.0 | 0.943037 |
| 464 | Aerobic | 2.5 | 3 | 18.0 | 0.956014 |
| 465 | Aerobic | 2.5 | 3 | 20.0 | 0.945902 |
| 466 | Aerobic | 2.5 | 3 | 22.0 | 0.908015 |
| 467 | Aerobic | 2.5 | 3 | 24.0 | 0.917409 |
468 rows × 5 columns
2.2 Filtering by Conditions#
As we can see, we have data for aerobic and anaerobic conditions. If we are only interested in looking at e.g. the aerobic data, we can do it like this:
df_aerobic = df[df["condition"] == "Aerobic"]
df_aerobic
| condition | sfn_conc_mumolar | replicate | time_h | bact_growth_od600 | |
|---|---|---|---|---|---|
| 234 | Aerobic | DMSO | 1 | 0.0 | 0.051324 |
| 235 | Aerobic | DMSO | 1 | 2.0 | 0.012470 |
| 236 | Aerobic | DMSO | 1 | 4.0 | 0.116809 |
| 237 | Aerobic | DMSO | 1 | 6.0 | 0.413621 |
| 238 | Aerobic | DMSO | 1 | 8.0 | 0.710586 |
| ... | ... | ... | ... | ... | ... |
| 463 | Aerobic | 2.5 | 3 | 16.0 | 0.943037 |
| 464 | Aerobic | 2.5 | 3 | 18.0 | 0.956014 |
| 465 | Aerobic | 2.5 | 3 | 20.0 | 0.945902 |
| 466 | Aerobic | 2.5 | 3 | 22.0 | 0.908015 |
| 467 | Aerobic | 2.5 | 3 | 24.0 | 0.917409 |
234 rows × 5 columns
We can also filter by multiple conditions. For example, get all anaerobic data from DMSO, where the OD600 is below a value of 0.4:
df_multi_condition = df[
(df["condition"] == "Anaerobic")
& (df["sfn_conc_mumolar"] == "DMSO")
& (df["bact_growth_od600"] < 0.4)
]
df_multi_condition
| condition | sfn_conc_mumolar | replicate | time_h | bact_growth_od600 | |
|---|---|---|---|---|---|
| 0 | Anaerobic | DMSO | 1 | 0.0 | 0.053628 |
| 1 | Anaerobic | DMSO | 1 | 2.0 | 0.034735 |
| 2 | Anaerobic | DMSO | 1 | 4.0 | 0.137467 |
| 3 | Anaerobic | DMSO | 1 | 6.0 | 0.300051 |
| 13 | Anaerobic | DMSO | 2 | 0.0 | 0.048281 |
| 14 | Anaerobic | DMSO | 2 | 2.0 | 0.023912 |
| 15 | Anaerobic | DMSO | 2 | 4.0 | 0.083990 |
| 16 | Anaerobic | DMSO | 2 | 6.0 | 0.268920 |
| 26 | Anaerobic | DMSO | 3 | 0.0 | 0.020199 |
| 27 | Anaerobic | DMSO | 3 | 2.0 | 0.038043 |
| 28 | Anaerobic | DMSO | 3 | 4.0 | 0.126597 |
| 29 | Anaerobic | DMSO | 3 | 6.0 | 0.367225 |
2.3 Adding Data By Conditions#
Part of data transformation is adding in new variables based on other columns in the
table. This can be also just for plotting. For example, we could wish to color our
plot on whether our bacteria are in their stationary phase or not. Then, we would add
a variable called is_stationary() that we use to subset parts of our data for
plotting (we will show how later with seaborn). We use NumPy for this purpose with
the syntax
(<condition(s)>, <output-if-true>, <output-if-false>).
df["is_stationary"] = np.where(df["bact_growth_od600"] >= 0.55, True, False)
df
| condition | sfn_conc_mumolar | replicate | time_h | bact_growth_od600 | is_stationary | |
|---|---|---|---|---|---|---|
| 0 | Anaerobic | DMSO | 1 | 0.0 | 0.053628 | False |
| 1 | Anaerobic | DMSO | 1 | 2.0 | 0.034735 | False |
| 2 | Anaerobic | DMSO | 1 | 4.0 | 0.137467 | False |
| 3 | Anaerobic | DMSO | 1 | 6.0 | 0.300051 | False |
| 4 | Anaerobic | DMSO | 1 | 8.0 | 0.533996 | False |
| ... | ... | ... | ... | ... | ... | ... |
| 463 | Aerobic | 2.5 | 3 | 16.0 | 0.943037 | True |
| 464 | Aerobic | 2.5 | 3 | 18.0 | 0.956014 | True |
| 465 | Aerobic | 2.5 | 3 | 20.0 | 0.945902 | True |
| 466 | Aerobic | 2.5 | 3 | 22.0 | 0.908015 | True |
| 467 | Aerobic | 2.5 | 3 | 24.0 | 0.917409 | True |
468 rows × 6 columns
2.4 Aggregating Data#
Maybe we are more interested in the maximum or average OD or alike in our experiment.
We could just use .max() for example but it would give us the maximum value across
all variables. If however we wish to distinguish between different cases
(like anaerobic/aerobic), then it makes sense to group our data first.
Here, we show how to aggregate across all replicates:
df_rep_agg = (
df.groupby(["time_h", "condition", "sfn_conc_mumolar"])["bact_growth_od600"]
.mean()
.reset_index()
)
df_rep_agg
| time_h | condition | sfn_conc_mumolar | bact_growth_od600 | |
|---|---|---|---|---|
| 0 | 0.0 | Aerobic | 10 | 0.035428 |
| 1 | 0.0 | Aerobic | 15 | 0.069004 |
| 2 | 0.0 | Aerobic | 2.5 | 0.062911 |
| 3 | 0.0 | Aerobic | 20 | 0.046474 |
| 4 | 0.0 | Aerobic | 5 | 0.057184 |
| ... | ... | ... | ... | ... |
| 151 | 24.0 | Anaerobic | 15 | 0.616655 |
| 152 | 24.0 | Anaerobic | 2.5 | 0.660576 |
| 153 | 24.0 | Anaerobic | 20 | 0.597215 |
| 154 | 24.0 | Anaerobic | 5 | 0.642585 |
| 155 | 24.0 | Anaerobic | DMSO | 0.655811 |
156 rows × 4 columns
We put our command in brackets so we can write it out in multiple lines to show it
better. We also reset the index at the end to transform the pd.Series object back into
a pd.DataFrame object.
We can also aggregate multiple things at once:
df_multi_agg = (
df.groupby(["time_h", "condition", "sfn_conc_mumolar"])
.agg(
{
"is_stationary": ["count"],
"bact_growth_od600": [
"min",
"mean",
"max",
],
}
)
.reset_index()
)
df_multi_agg
| time_h | condition | sfn_conc_mumolar | is_stationary | bact_growth_od600 | |||
|---|---|---|---|---|---|---|---|
| count | min | mean | max | ||||
| 0 | 0.0 | Aerobic | 10 | 3 | 0.029107 | 0.035428 | 0.043397 |
| 1 | 0.0 | Aerobic | 15 | 3 | 0.066276 | 0.069004 | 0.071293 |
| 2 | 0.0 | Aerobic | 2.5 | 3 | 0.034301 | 0.062911 | 0.096214 |
| 3 | 0.0 | Aerobic | 20 | 3 | 0.023262 | 0.046474 | 0.074124 |
| 4 | 0.0 | Aerobic | 5 | 3 | 0.051937 | 0.057184 | 0.061837 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 151 | 24.0 | Anaerobic | 15 | 3 | 0.602570 | 0.616655 | 0.624859 |
| 152 | 24.0 | Anaerobic | 2.5 | 3 | 0.633187 | 0.660576 | 0.678301 |
| 153 | 24.0 | Anaerobic | 20 | 3 | 0.572835 | 0.597215 | 0.617822 |
| 154 | 24.0 | Anaerobic | 5 | 3 | 0.584576 | 0.642585 | 0.689661 |
| 155 | 24.0 | Anaerobic | DMSO | 3 | 0.618712 | 0.655811 | 0.696756 |
156 rows × 7 columns
2.5 Adding Data by Aggregation#
Previously, we just added the is_stationary variable based on a guess. We could also
do it more programmatically. We do it here in two steps:
Adding a column indicating the maximum OD600 for that condition-sfn-replicate, minus a tolerance
Conditioning on that new column
df["stationary_od"] = df.groupby(["condition", "sfn_conc_mumolar", "replicate"])[
"bact_growth_od600"
].transform(lambda col: col.max() - 0.1)
df["is_stationary"] = np.where(
df["bact_growth_od600"] >= df["stationary_od"], True, False
)
df
| condition | sfn_conc_mumolar | replicate | time_h | bact_growth_od600 | is_stationary | stationary_od | |
|---|---|---|---|---|---|---|---|
| 0 | Anaerobic | DMSO | 1 | 0.0 | 0.053628 | False | 0.556991 |
| 1 | Anaerobic | DMSO | 1 | 2.0 | 0.034735 | False | 0.556991 |
| 2 | Anaerobic | DMSO | 1 | 4.0 | 0.137467 | False | 0.556991 |
| 3 | Anaerobic | DMSO | 1 | 6.0 | 0.300051 | False | 0.556991 |
| 4 | Anaerobic | DMSO | 1 | 8.0 | 0.533996 | False | 0.556991 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 463 | Aerobic | 2.5 | 3 | 16.0 | 0.943037 | True | 0.856014 |
| 464 | Aerobic | 2.5 | 3 | 18.0 | 0.956014 | True | 0.856014 |
| 465 | Aerobic | 2.5 | 3 | 20.0 | 0.945902 | True | 0.856014 |
| 466 | Aerobic | 2.5 | 3 | 22.0 | 0.908015 | True | 0.856014 |
| 467 | Aerobic | 2.5 | 3 | 24.0 | 0.917409 | True | 0.856014 |
468 rows × 7 columns
That tolerance could also be calculated in a similar way, e.g. based on the standard deviation (we leave that as an exercise if you want to try it out yourself).
# Your turn
There are many more useful transformations, but we end here by having shown some that you could already use in more advanced plots.