|
| 1 | +# Analyze Table Data Using ChatGPT |
| 2 | + |
| 3 | +This example shows how to generate suggestions for analyzing tabular data in MATLAB® using ChatGPT™. |
| 4 | + |
| 5 | +First, build a prompt that describes table data to ChatGPT. Next, generate insights and suggestions for data analysis in MATLAB. |
| 6 | + |
| 7 | +# Setup |
| 8 | + |
| 9 | +Using the OpenAI® API requires an OpenAI API key. For information on how to obtain an OpenAI API key, as well as pricing, terms and conditions of use, and information about available models, see the OpenAI documentation at [https://platform.openai.com/docs/overview](https://platform.openai.com/docs/overview). |
| 10 | + |
| 11 | +To connect to the OpenAI API from MATLAB using LLMs with MATLAB, specify the OpenAI API key as an environment variable and save it to a file called ".env". |
| 12 | + |
| 13 | + |
| 14 | + |
| 15 | +For MATLAB to find the key, the ".env" file must be on the MATLAB search path. |
| 16 | + |
| 17 | +Load the environment file using the `loadenv` function. |
| 18 | + |
| 19 | +```matlab |
| 20 | +loadenv(".env") |
| 21 | +``` |
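|  | + |
|  | +Before calling the API, it can help to confirm that the key was actually loaded. The following is a minimal sketch, assuming the ".env" file defines a variable named `OPENAI_API_KEY` (the name that LLMs with MATLAB reads by default). The variable `keyIsSet` is a hypothetical helper name used only for this check. |
|  | + |
|  | +```matlab |
|  | +% Read the key from the environment. With a character vector input, |
|  | +% getenv returns an empty character vector if the variable is not set. |
|  | +apiKey = getenv('OPENAI_API_KEY'); % variable name is an assumption |
|  | +keyIsSet = strlength(apiKey) > 0; |
|  | +if ~keyIsSet |
|  | +    warning("OPENAI_API_KEY is not set. Check the contents and location of the .env file.") |
|  | +end |
|  | +``` |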
| 22 | + |
| 23 | +# Describe Table to ChatGPT |
| 24 | + |
| 25 | +Create a table containing data that represents domestic airline flights in the United States in 2008. |
| 26 | + |
| 27 | +Later in this example, a sample of this data is sent to the model as part of the system prompt. |
| 28 | + |
| 29 | +```matlab |
| 30 | +airlineData = readtable("airlinesmall_subset.xlsx",Sheet="2008"); |
| 31 | +``` |
| 32 | + |
| 33 | +Calculate summary statistics to describe the table variables. Include statistics that might be useful for string and numeric data. |
| 34 | + |
| 35 | +```matlab |
| 36 | +summaryStruct = summary(airlineData,Statistics=["nummissing" "numunique" "min" "max" "mean"]); |
| 37 | +``` |
| 38 | + |
| 39 | +Convert the summary statistics to JSON\-formatted text. |
| 40 | + |
| 41 | +```matlab |
| 42 | +summaryString = string(jsonencode(summaryStruct,ConvertInfAndNaN=false)); |
| 43 | +``` |
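|  | + |
|  | +The `ConvertInfAndNaN=false` option changes how missing numeric values appear in the text. As a minimal sketch on a made-up structure (the field names and values are illustrative and not taken from the airline data): by default, `jsonencode` writes `NaN` and `Inf` as `null`, whereas with `ConvertInfAndNaN=false` it keeps them as `NaN` and `Infinity`. |
|  | + |
|  | +```matlab |
|  | +% Hypothetical statistics for illustration only. |
|  | +stats = struct("MeanDelay",NaN,"MaxDelay",Inf); |
|  | + |
|  | +% Default behavior: NaN and Inf are written as null. |
|  | +defaultText = jsonencode(stats); |
|  | + |
|  | +% With ConvertInfAndNaN=false: NaN and Inf are kept as NaN and Infinity. |
|  | +literalText = jsonencode(stats,ConvertInfAndNaN=false); |
|  | + |
|  | +disp(defaultText) |
|  | +disp(literalText) |
|  | +``` |
|  | + |
|  | +The second form is not strict JSON. That is likely acceptable here because the text is only read by the model and is not parsed by a JSON decoder. |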
| 44 | + |
| 45 | +To clearly identify rows in the table, add row labels. Then, capture a random 5\-row sample of the data. |
| 46 | + |
| 47 | +```matlab |
| 48 | +dataSample = airlineData; |
| 49 | +dataSample = addvars(dataSample,"Row " + (1:height(dataSample))', ... |
| 50 | + NewVariableNames="RowLabels",Before=1); |
| 51 | +rng default |
| 52 | +randomIdx = randperm(height(dataSample),5); |
| 53 | +randomIdx = sort(randomIdx); |
| 54 | +dataSample = dataSample(randomIdx,:); |
| 55 | +``` |
| 56 | + |
| 57 | +Convert the sample data to JSON\-formatted text. |
| 58 | + |
| 59 | +```matlab |
| 60 | +sampleString = string(jsonencode(dataSample,ConvertInfAndNaN=false)); |
| 61 | +``` |
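|  | + |
|  | +When the input is a table, `jsonencode` produces a JSON array with one object per row, so each sampled flight reaches the prompt as a record that carries its own variable names. The following minimal sketch uses a made-up three-row table (the variable `toy` and its values are hypothetical) to show the shape of the resulting text. |
|  | + |
|  | +```matlab |
|  | +% Hypothetical three-row table for illustration only. |
|  | +toy = table(["AA";"DL";"UA"],[5;NaN;12],VariableNames=["Carrier" "ArrDelay"]); |
|  | + |
|  | +% Add a label column in front, following the same pattern as the sample above. |
|  | +toy = addvars(toy,"Row " + (1:height(toy))',NewVariableNames="RowLabels",Before=1); |
|  | + |
|  | +% Encode the table. Each row becomes one JSON object in an array. |
|  | +toyText = string(jsonencode(toy,ConvertInfAndNaN=false)); |
|  | +disp(toyText) |
|  | +``` |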
| 62 | + |
| 63 | +Combine the summary and sample into a full description of the table. |
| 64 | + |
| 65 | +```matlab |
| 66 | +dataName = "airlineData"; |
| 67 | +dataDescription = "The MATLAB workspace contains a table with the name `" + dataName + "`." + newline + ... |
| 68 | + "Here are the basic summary statistics: " + newline + summaryString + newline + ... |
| 69 | + "Here is a random 5-row sample of the dataset: " + newline + sampleString; |
| 70 | +``` |
| 71 | + |
| 72 | +Create a system prompt for ChatGPT that includes the data description. In the prompt, specify that responses typically include MATLAB code. |
| 73 | + |
| 74 | +```matlab |
| 75 | +systemPrompt = "You are a chat assistant designed to help analyze " + ... |
| 76 | + "tabular data using MATLAB. Your responses are concise and " + ... |
| 77 | + "typically contain MATLAB code snippets or suggest specific MATLAB functions." + ... |
| 78 | + newline + dataDescription; |
| 79 | +``` |
| 80 | + |
| 81 | +Connect to the OpenAI Chat Completion API using the [`openAIChat`](../doc/functions/openAIChat.md) function. Specify the model name. |
| 82 | + |
| 83 | +```matlab |
| 84 | +mdl = openAIChat(systemPrompt,ModelName="gpt-4.1-mini"); |
| 85 | +``` |
| 86 | + |
| 87 | +# Ask ChatGPT Questions About Data |
| 88 | + |
| 89 | +You can ask ChatGPT for insights into your data and suggestions for analysis in MATLAB. For example, you can ask for an overview of the data, or ask how to clean up the data and visualize it. |
| 90 | + |
| 91 | +```matlab |
| 92 | +generate(mdl,"Give me a high level overview of this dataset with a few interesting insights.") |
| 93 | +``` |
| 94 | + |
| 95 | +```matlabTextOutput |
| 96 | +ans = |
| 97 | + "This airlineData dataset contains flight records from the year 2008, with 1753 entries. It includes details such as dates (Month, DayofMonth, DayOfWeek), times (Departure, Arrival times both actual and scheduled), airline carriers, flight numbers, tail numbers, elapsed times, delays, and cancellation/diversion status. |
| 98 | + |
| 99 | + Key variables: |
| 100 | + - Flight identifiers: UniqueCarrier (20 unique carriers), FlightNum, TailNum |
| 101 | + - Timing: DepTime, ArrTime, CRSDepTime, CRSArrTime, ActualElapsedTime, AirTime |
| 102 | + - Delays: ArrDelay, DepDelay, CarrierDelay, WeatherDelay, SecurityDelay, LateAircraftDelay |
| 103 | + - Locations: Origin and Dest airports (182 unique origins, 183 unique destinations) |
| 104 | + - Distances and Taxi times: Distance, TaxiIn, TaxiOut |
| 105 | + - Cancellation and diversion info |
| 106 | + |
| 107 | + Insight highlights: |
| 108 | + - The mean arrival delay is about 10 minutes, with a max delay of 567 minutes indicating some heavy delays. |
| 109 | + - The average flight distance is around 706 miles. |
| 110 | + - A small portion of flights were canceled (around 1.37%) or diverted (about 0.34%). |
| 111 | + - Many delay-related variables have >75% missing values which may reflect only delays when applicable. |
| 112 | + - The departure delay (mean ~11 minutes) is close to arrival delay, indicating delays accumulate through the flight. |
| 113 | + - There is variation in scheduled versus actual times, with some flights departing or arriving earlier/later than scheduled. |
| 114 | + - Taxi out times are generally longer than taxi in times (16.8 min vs 7 min average). |
| 115 | + |
| 116 | + Would you like a specific analysis or visualization for any aspect in this dataset?" |
| 117 | + |
| 118 | + |
| 119 | +``` |
| 120 | + |
| 121 | +```matlab |
| 122 | +generate(mdl,"Describe how I can clean up this data for further analysis in MATLAB.") |
| 123 | +``` |
| 124 | + |
| 125 | +```matlabTextOutput |
| 126 | +ans = |
| 127 | + "To clean up the airlineData table for further analysis in MATLAB, you can follow these steps: |
| 128 | + |
| 129 | + 1. Handle missing values: |
| 130 | + - Identify columns with missing values using `ismissing`. |
| 131 | + - For delay columns (CarrierDelay, WeatherDelay, etc.), replace missing with 0 if appropriate or remove rows with missing critical values. |
| 132 | + 2. Remove or impute outliers if needed. |
| 133 | + 3. Convert categorical variables from cell arrays to categorical type. |
| 134 | + 4. Remove or filter canceled and diverted flights if your analysis excludes them. |
| 135 | + 5. Fix data types for time columns if you need to analyze time (convert to datetime or duration). |
| 136 | + 6. Remove unnecessary columns or rename for clarity. |
| 137 | + |
| 138 | + Here is example code snippets: |
| 139 | + |
| 140 | + ```matlab |
| 141 | + % 1. Replace NaNs in delay columns with zero |
| 142 | + delayCols = {'CarrierDelay','WeatherDelay','SDelay','SecurityDelay','LateAircraftDelay'}; |
| 143 | + for i = 1:length(delayCols) |
| 144 | + col = delayCols{i}; |
| 145 | + airlineData.(col)(ismissing(airlineData.(col))) = 0; |
| 146 | + end |
| 147 | + |
| 148 | + % 2. Convert cellular columns to categorical |
| 149 | + airlineData.UniqueCarrier = categorical(airlineData.UniqueCarrier); |
| 150 | + airlineData.Origin = categorical(airlineData.Origin); |
| 151 | + airlineData.Dest = categorical(airlineData.Dest); |
| 152 | + airlineData.CancellationCode = categorical(airlineData.CancellationCode); |
| 153 | + |
| 154 | + % 3. Remove canceled and diverted flights if needed |
| 155 | + airlineData = airlineData(airlineData.Cancelled==0 & airlineData.Diverted==0, :); |
| 156 | + |
| 157 | + % 4. Convert times to datetime or duration (optional) |
| 158 | + % For example, convert CRSDepTime and CRSArrTime to duration from midnight |
| 159 | + convertTime = @(t) hours(floor(t/100)) + minutes(mod(t,100)); |
| 160 | + airlineData.CRSDepTime = convertTime(airlineData.CRSDepTime); |
| 161 | + airlineData.CRSArrTime = convertTime(airlineData.CRSArrTime); |
| 162 | + |
| 163 | + % 5. Remove or impute other missing values if needed (e.g., DepTime, ArrTime) |
| 164 | + % For example, remove rows with missing DepTime or ArrTime |
| 165 | + airlineData = airlineData(~ismissing(airlineData.DepTime) & ~ismissing(airlineData.ArrTime), :); |
| 166 | + ``` |
| 167 | + |
| 168 | + This should prepare your data for subsequent analysis. Let me know if you need code for specific cleaning or preprocessing tasks." |
| 169 | + |
| 170 | + |
| 171 | +``` |
| 172 | + |
| 173 | +```matlab |
| 174 | +generate(mdl,"Give me a variety of visualizations I can create in MATLAB to explore this data.") |
| 175 | +``` |
| 176 | + |
| 177 | +```matlabTextOutput |
| 178 | +ans = |
| 179 | + "Here are several types of visualizations you can create in MATLAB to explore the airlineData table: |
| 180 | + |
| 181 | + 1. Histogram of Arrival Delays |
| 182 | + ```matlab |
| 183 | + histogram(airlineData.ArrDelay) |
| 184 | + xlabel('Arrival Delay (minutes)') |
| 185 | + ylabel('Frequency') |
| 186 | + title('Histogram of Arrival Delays') |
| 187 | + ``` |
| 188 | + |
| 189 | + 2. Boxplot of Departure Delays by Day of Week |
| 190 | + ```matlab |
| 191 | + boxplot(airlineData.DepDelay, airlineData.DayOfWeek) |
| 192 | + xlabel('Day of Week') |
| 193 | + ylabel('Departure Delay (minutes)') |
| 194 | + title('Departure Delays by Day of Week') |
| 195 | + ``` |
| 196 | + |
| 197 | + 3. Scatter plot of Distance vs Actual Elapsed Time |
| 198 | + ```matlab |
| 199 | + scatter(airlineData.Distance, airlineData.ActualElapsedTime) |
| 200 | + xlabel('Distance (miles)') |
| 201 | + ylabel('Actual Elapsed Time (minutes)') |
| 202 | + title('Distance vs Actual Elapsed Time') |
| 203 | + ``` |
| 204 | + |
| 205 | + 4. Bar chart of number of flights by Month |
| 206 | + ```matlab |
| 207 | + counts = groupcounts(airlineData.Month); |
| 208 | + bar(1:12, counts) |
| 209 | + xlabel('Month') |
| 210 | + ylabel('Number of Flights') |
| 211 | + title('Number of Flights per Month') |
| 212 | + ``` |
| 213 | + |
| 214 | + 5. Boxplot of Arrival Delay by Carrier |
| 215 | + ```matlab |
| 216 | + boxplot(airlineData.ArrDelay, airlineData.UniqueCarrier) |
| 217 | + xlabel('Carrier') |
| 218 | + ylabel('Arrival Delay (minutes)') |
| 219 | + title('Arrival Delay by Carrier') |
| 220 | + ``` |
| 221 | + |
| 222 | + 6. Scatter plot of DepDelay vs ArrDelay with color indicating Cancelled status |
| 223 | + ```matlab |
| 224 | + gscatter(airlineData.DepDelay, airlineData.ArrDelay, airlineData.Cancelled, 'br', 'xo') |
| 225 | + xlabel('Departure Delay (minutes)') |
| 226 | + ylabel('Arrival Delay (minutes)') |
| 227 | + title('Departure vs Arrival Delay by Cancelled Status') |
| 228 | + legend({'Not Cancelled', 'Cancelled'}) |
| 229 | + ``` |
| 230 | + |
| 231 | + 7. Time series of average arrival delay by day |
| 232 | + ```matlab |
| 233 | + dailyAvgDelay = varfun(@mean, airlineData, 'InputVariables', 'ArrDelay', ... |
| 234 | + 'GroupingVariables', {'Month', 'DayofMonth'}); |
| 235 | + plot(datenum(2008, dailyAvgDelay.Month, dailyAvgDelay.DayofMonth), dailyAvgDelay.mean_ArrDelay) |
| 236 | + datetick('x', 'mmm-dd') |
| 237 | + xlabel('Date') |
| 238 | + ylabel('Average Arrival Delay (minutes)') |
| 239 | + title('Daily Average Arrival Delay') |
| 240 | + ``` |
| 241 | + |
| 242 | + If you want code examples for any other specific visualizations or analyses, just ask!" |
| 243 | + |
| 244 | + |
| 245 | +``` |
| 246 | + |
| 247 | +*Copyright 2026 The MathWorks, Inc.* |