Visualizing tabular data

s2t2 · s2t2 · commit d9ac8e7d9aeb · 2024-12-27T18:20:21.000-05:00
diff --git a/docs/_quarto.yml b/docs/_quarto.yml
@@ -92,6 +92,15 @@ book:
           text: "Exporting Data Frames"
 
 
+    - part: "Visualizing Tabular Data"
+      chapters:
+        - href: notes/dataviz-tabular/overview.qmd
+          text: "Data Visualization Overview"
+        - href: notes/dataviz-tabular/trendlines.qmd
+          text: "Charts with Trendlines"
+        - href: notes/dataviz-tabular/candlesticks.qmd
+          text: "Candlestick Charts"
+
 
     - part: "Time-series Data Analysis"
       chapters:
diff --git a/docs/notes/dataviz-tabular/candlesticks.qmd b/docs/notes/dataviz-tabular/candlesticks.qmd
@@ -0,0 +1,45 @@
+---
+format:
+  html:
+    code-fold: false
+jupyter: python3
+execute:
+  cache: true # re-render only when source changes
+---
+
+# Candlestick Charts, Revisited
+
+We have [previously studied](https://prof-rossetti.github.io/intro-software-dev-python-book/notes/dataviz/candlesticks.html) how to create candlestick charts using OHLC data. We can do this with tabular data as well.
+
+Converting OHLC data to `DataFrame` object:
+
+```{python}
+from pandas import DataFrame
+
+ohlc_data = [
+    {"date": "2030-03-16", "open": 236.2800, "high": 240.0550, "low": 235.9400, "close": 237.7100, "volume": 28092196},
+    {"date": "2030-03-15", "open": 234.9600, "high": 235.1850, "low": 231.8100, "close": 234.8100, "volume": 26042669},
+    {"date": "2030-03-12", "open": 234.0100, "high": 235.8200, "low": 233.2300, "close": 235.7500, "volume": 22653662},
+    {"date": "2030-03-11", "open": 234.9600, "high": 239.1700, "low": 234.3100, "close": 237.1300, "volume": 29907586},
+    {"date": "2030-03-10", "open": 237.0000, "high": 237.0000, "low": 232.0400, "close": 232.4200, "volume": 29746812}
+]
+df = DataFrame(ohlc_data)
+df.head()
+```
+
+Because the `Candlestick` class comes from the Graph Objects sub-library, instead of passing the `DataFrame` object, we pass the columns directly:
+
+```{python}
+from plotly.graph_objects import Figure, Candlestick
+
+stick = Candlestick(x=df["date"],
+                    open=df["open"],
+                    high=df["high"],
+                    low=df["low"],
+                    close=df["close"]
+)
+
+fig = Figure(data=[stick])
+fig.update_layout(title="Example Candlestick Chart")
+fig.show()
+```
diff --git a/docs/notes/dataviz-tabular/overview.qmd b/docs/notes/dataviz-tabular/overview.qmd
@@ -0,0 +1,224 @@
+---
+format:
+  html:
+    code-fold: false
+jupyter: python3
+execute:
+  cache: true # re-render only when source changes
+---
+
+# Data Visualization with Tabular Data
+
+We have [previously covered](https://prof-rossetti.github.io/intro-software-dev-python-book/notes/dataviz/overview.html) how to create data visualizations using the `plotly` package.
+
+In that introductory chapter, we passed simple lists to the chart-making functions, however the `plotly` package provides an easy-to-use, intuitive interface when working with tabular data.
+
+Now that we know how to work with `DataFrame` objects, let's revisit each of the previous examples, but this time using tabular data.
+
+### Line Charts, Revisited
+
+Starting with some example data, like before, this time we construct a `DataFrame` object from the data (because the data is in an eligible format, in this case a list of dictionaries):
+
+```{python}
+from pandas import DataFrame
+
+line_data = [
+    {"date": "2020-10-01", "stock_price_usd": 100.00},
+    {"date": "2020-10-02", "stock_price_usd": 101.01},
+    {"date": "2020-10-03", "stock_price_usd": 120.20},
+    {"date": "2020-10-04", "stock_price_usd": 107.07},
+    {"date": "2020-10-05", "stock_price_usd": 142.42},
+    {"date": "2020-10-06", "stock_price_usd": 135.35},
+    {"date": "2020-10-07", "stock_price_usd": 160.60},
+    {"date": "2020-10-08", "stock_price_usd": 162.62},
+]
+
+df = DataFrame(line_data)
+df.head()
+```
+
+If we construct a `DataFrame` from this data, we get to skip the mapping step, and move directly to the chart-making step.
+
+Now we have a few options about how to pass this data to the chart-making function. We can use a `Series` oriented approach, or a `DataFrame` oriented approach.
+
+#### `Series` Oriented Approach
+
+In the `Series` oriented approach, we pass the columns to the chart-making function, because when representing a column, the series is list-like:
+
+```{python}
+from plotly.express import line
+
+fig = line(x=df["date"], y=df["stock_price_usd"], height=350,
+          title="Stock Prices over Time",
+          labels={"x": "Date", "y": "Stock Price ($)"}
+)
+fig.show()
+```
+
+#### `DataFrame` Oriented Approach
+
+Alternatively, we can use a `DataFrame` oriented approach where we pass the `DataFrame` as the first parameter to the chart-maker function.
+
+
+```{python}
+from plotly.express import line
+
+fig = line(df, x="date", y="stock_price_usd", height=350,
+          title="Stock Prices over Time",
+          labels={"date": "Date", "stock_price_usd": "Stock Price ($)"}
+)
+fig.show()
+```
+
+Notice, when we pass the `DataFrame` as the first parameter, now the `x` and `y` parameters refer to string column names in that `DataFrame` to be plotted on the `x` and `y` axis, respectively. The `labels` parameter keys now reference the column names as well.
+
+For the remaining examples, we will use this `DataFrame` oriented approach.
+
+### Bar Charts, Revisited
+
+Constructing a `DataFrame` from the raw data:
+
+```{python}
+bar_data = [
+    {"genre": "Thriller", "viewers": 123456},
+    {"genre": "Mystery", "viewers": 234567},
+    {"genre": "Sci-Fi", "viewers": 987654},
+    {"genre": "Fantasy", "viewers": 876543},
+    {"genre": "Documentary", "viewers": 283105},
+    {"genre": "Action", "viewers": 544099},
+    {"genre": "Romantic Comedy", "viewers": 121212}
+]
+df = DataFrame(bar_data)
+df.head()
+```
+
+Charting the data:
+
+```{python}
+from plotly.express import bar
+
+fig = bar(df, x="genre", y="viewers", height=350,
+          title="Viewership by Genre",
+          labels={"genre": "Genre", "viewers": "Viewers"}
+)
+fig.show()
+```
+
+
+#### Horizontal Bar Chart, Revisited
+
+With categorical data, a horizontal bar chart can be a better choice than a vertical bar chart. Ideally, the bars are sorted so the largest are on top. This helps tell the story of which are the "top genres".
+
+Before charting, we use a `pandas` sorting operation to get the bars in the right order:
+
+```{python}
+df.sort_values(by="viewers", inplace=True)
+df.head()
+```
+
+:::{.callout-warning title="Important Note"}
+Notice, here in order to get bars in DESCENDING order, we sort the data in ASCENDING order.
+
+Oddly, and counter-intuitively, `plotly` plots the data in reverse order as it was passed in.
+:::
+
+
+```{python}
+fig = bar(df, y="genre", x="viewers", orientation="h", height=350,
+          title="Viewership by Genre",
+          labels={"genre": "Genre", "viewers": "Viewers"}
+)
+fig.show()
+```
+
+### Scatter Plots, Revisited
+
+Constructing a `DataFrame` from raw data:
+
+```{python}
+scatter_data = [
+    {"income": 30_000, "life_expectancy": 65.5},
+    {"income": 35_000, "life_expectancy": 62.1},
+    {"income": 50_000, "life_expectancy": 66.7},
+    {"income": 55_000, "life_expectancy": 71.0},
+    {"income": 70_000, "life_expectancy": 72.5},
+    {"income": 75_000, "life_expectancy": 77.3},
+    {"income": 90_000, "life_expectancy": 82.9},
+    {"income": 95_000, "life_expectancy": 80.0},
+]
+df = DataFrame(scatter_data)
+df.head()
+```
+
+Plotting the data:
+
+```{python}
+from plotly.express import scatter
+
+fig = scatter(df, x="income", y="life_expectancy", height=350,
+          title="Life Expectancy by Income",
+          labels={"income": "Income", "life_expectancy": "Life Expectancy (years)"}
+)
+fig.show()
+```
+
+
+
+
+### Pie Charts, Revisited
+
+Constructing a `DataFrame` from raw data:
+
+```{python}
+pie_data = [
+    {"company": "Company X", "market_share": 0.55},
+    {"company": "Company Y", "market_share": 0.30},
+    {"company": "Company Z", "market_share": 0.15}
+]
+df = DataFrame(pie_data)
+df.head()
+```
+
+
+```{python}
+from plotly.express import pie
+
+fig = pie(df, labels="company", values="market_share", height=350,
+          title="Market Share by Company",
+)
+fig.show()
+```
+
+
+
+
+### Histograms, Revisited
+
+Constructing a `DataFrame` from raw data:
+
+```{python}
+histo_data = [
+    {"user": "User A", "average_opinion": 0.1},
+    {"user": "User B", "average_opinion": 0.4},
+    {"user": "User C", "average_opinion": 0.4},
+    {"user": "User D", "average_opinion": 0.8},
+    {"user": "User E", "average_opinion": 0.86},
+    {"user": "User F", "average_opinion": 0.75},
+    {"user": "User G", "average_opinion": 0.90},
+    {"user": "User H", "average_opinion": 0.99},
+]
+df = DataFrame(histo_data)
+df.head()
+```
+
+Charting the data:
+
+```{python}
+from plotly.express import histogram
+
+fig = histogram(df, x="average_opinion", height=350, nbins=5,
+          title="User Average Opinions",
+          labels={"average_opinion": "Average Opinion"}
+)
+fig.show()
+```
diff --git a/docs/notes/dataviz-tabular/trendlines.qmd b/docs/notes/dataviz-tabular/trendlines.qmd
@@ -0,0 +1,56 @@
+---
+format:
+  html:
+    code-fold: false
+jupyter: python3
+execute:
+  cache: true # re-render only when source changes
+---
+
+# Scatter Plot with Trendlines, Revisited
+
+We have [previously studied](https://prof-rossetti.github.io/intro-software-dev-python-book/notes/dataviz/trendlines.html) how to create scatter plots with trendlines. We can do this with tabular data as well.
+
+Constructing a `DataFrame` from raw data:
+
+```{python}
+from pandas import DataFrame
+
+scatter_data = [
+    {"income": 30_000, "life_expectancy": 65.5},
+    {"income": 35_000, "life_expectancy": 62.1},
+    {"income": 50_000, "life_expectancy": 66.7},
+    {"income": 55_000, "life_expectancy": 71.0},
+    {"income": 70_000, "life_expectancy": 72.5},
+    {"income": 75_000, "life_expectancy": 77.3},
+    {"income": 90_000, "life_expectancy": 82.9},
+    {"income": 95_000, "life_expectancy": 80.0},
+]
+df = DataFrame(scatter_data)
+df.head()
+```
+
+Linear trends:
+
+```{python}
+from plotly.express import scatter
+
+fig = scatter(df, x="income", y="life_expectancy", height=350,
+                title="Life Expectancy by Income",
+                labels={"x": "Income", "life_expectancy": "Life Expectancy (years)"},
+                trendline="ols", trendline_color_override="red"
+)
+fig.show()
+```
+
+Non-linear trends:
+
+
+```{python}
+fig = scatter(df, x="income", y="life_expectancy", height=350,
+                title="Life Expectancy by Income",
+                labels={"x": "Income", "life_expectancy": "Life Expectancy (years)"},
+                trendline="lowess", trendline_color_override="red"
+)
+fig.show()
+```