Three ChatGPT Tricks to Save You Hours on Data Analysis Projects
Introduction
Data analysis can be a complex and time-consuming process, often requiring a high level of expertise and technical skill. However, recent advances in artificial intelligence have led to the development of tools like OpenAI’s ChatGPT that can significantly streamline this process and make data analysis more accessible. This article will reveal three key tricks to optimally utilize ChatGPT and save hours on your data analysis projects.
1. Provide Context
One of the most crucial aspects of successful interaction with ChatGPT is providing the necessary context. In the realm of data analysis, this involves conveying relevant information about the structure and content of your dataset to ChatGPT.
Start by describing the format and structure of your dataset. What does each row and column represent? Are there any particular features or variables that are more significant than others? Are there any known issues with the data, such as missing values or outliers?
Also, it can be helpful to provide a few rows of data to ChatGPT, either directly or in a descriptive manner. This can help the model understand the nature of your dataset better.
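If your dataset is already loaded locally, you can generate this kind of summary automatically and paste it into your prompt. Here is a minimal sketch (`describe_for_prompt` is a hypothetical helper; the filename comes from the example below):

```python
import io
import pandas as pd

def describe_for_prompt(df: pd.DataFrame, n_rows: int = 5) -> str:
    """Build a compact, paste-ready description of a DataFrame:
    shape, column names with dtypes, missing-value counts, and sample rows."""
    buf = io.StringIO()
    buf.write(f"{df.shape[0]} rows x {df.shape[1]} columns\n\nColumns:\n")
    for col in df.columns:
        missing = int(df[col].isna().sum())
        buf.write(f"- {col} ({df[col].dtype}), {missing} missing\n")
    buf.write(f"\nFirst {n_rows} rows:\n")
    buf.write(df.head(n_rows).to_csv(index=False))
    return buf.getvalue()

# Usage (filename from the example below):
# df = pd.read_csv("AirQuality.csv", delimiter=";")
# print(describe_for_prompt(df))
```

The output drops straight into a prompt, so you don't have to type out the column list by hand.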
Let’s look at an example.
If we were to work with this Air Quality dataset, our prompt to ChatGPT with the context would look like:
I have a dataset called "AirQuality.csv".
This dataset contains the responses of a gas multisensor device deployed on the field in an Italian city. Hourly responses averages are recorded along with gas concentrations references from a certified analyzer. This dataset was taken from UCI Machine Learning Repository: <https://archive.ics.uci.edu/ml/index.php>
The dataset contains 9357 instances of hourly averaged responses from an array of 5 metal oxide chemical sensors embedded in an Air Quality Chemical Multisensor Device. The device was located on the field in a significantly polluted area, at road level, within an Italian city. Data were recorded from March 2004 to February 2005 (one year) representing the longest freely available recordings of on field deployed air quality chemical sensor devices responses. Ground Truth hourly averaged concentrations for CO, Non Metanic Hydrocarbons, Benzene, Total Nitrogen Oxides (NOx) and Nitrogen Dioxide (NO2) were provided by a co-located reference certified analyzer. Evidences of cross-sensitivities as well as both concept and sensor drifts are present as described in De Vito et al., Sens. And Act. B, Vol. 129,2,2008 (citation required) eventually affecting sensors concentration estimation capabilities. Missing values are tagged with -200 value.
This dataset can be used exclusively for research purposes. Commercial purposes are fully excluded.
Attribute Information
0 Date (DD/MM/YYYY)
1 Time (HH.MM.SS)
2 True hourly averaged concentration CO in mg/m^3 (reference analyzer)
3 PT08.S1 (tin oxide) hourly averaged sensor response (nominally CO targeted)
4 True hourly averaged overall Non Metanic HydroCarbons concentration in microg/m^3 (reference analyzer)
5 True hourly averaged Benzene concentration in microg/m^3 (reference analyzer)
6 PT08.S2 (titania) hourly averaged sensor response (nominally NMHC targeted)
7 True hourly averaged NOx concentration in ppb (reference analyzer)
8 PT08.S3 (tungsten oxide) hourly averaged sensor response (nominally NOx targeted)
9 True hourly averaged NO2 concentration in microg/m^3 (reference analyzer)
10 PT08.S4 (tungsten oxide) hourly averaged sensor response (nominally NO2 targeted)
11 PT08.S5 (indium oxide) hourly averaged sensor response (nominally O3 targeted)
12 Temperature in °C
13 Relative Humidity (%)
14 AH Absolute Humidity
Here are the first few entries from it:
```
Date;Time;CO(GT);PT08.S1(CO);NMHC(GT);C6H6(GT);PT08.S2(NMHC);NOx(GT);PT08.S3(NOx);NO2(GT);PT08.S4(NO2);PT08.S5(O3);T;RH;AH;;
10/03/2004;18.00.00;2,6;1360;150;11,9;1046;166;1056;113;1692;1268;13,6;48,9;0,7578;;
10/03/2004;19.00.00;2;1292;112;9,4;955;103;1174;92;1559;972;13,3;47,7;0,7255;;
10/03/2004;20.00.00;2,2;1402;88;9,0;939;131;1140;114;1555;1074;11,9;54,0;0,7502;;
10/03/2004;21.00.00;2,2;1376;80;9,2;948;172;1092;122;1584;1203;11,0;60,0;0,7867;;
10/03/2004;22.00.00;1,6;1272;51;6,5;836;131;1205;116;1490;1110;11,2;59,6;0,7888;;
10/03/2004;23.00.00;1,2;1197;38;4,7;750;89;1337;96;1393;949;11,2;59,2;0,7848;;
11/03/2004;00.00.00;1,2;1185;31;3,6;690;62;1462;77;1333;733;11,3;56,8;0,7603;;
11/03/2004;01.00.00;1;1136;31;3,3;672;62;1453;76;1333;730;10,7;60,0;0,7702;;
```
Here we give ChatGPT the name of our file (so that the Python scripts it writes reference the correct filename), a description of the dataset, and the first few lines of data.
All of this helps ChatGPT deliver much better results.
The key is to make the AI aware of what you’re working with, just as you would do with a human data analyst. Providing the appropriate context can significantly reduce the back-and-forth interactions with ChatGPT and increase the accuracy of its responses.
As a side benefit of providing context without yet setting a task, you get great ideas about what you can do with your data. Here’s one possible response (remember, running the same prompt again at chat.openai.com may produce different results):
This dataset offers a fascinating opportunity to explore and analyze various aspects of air quality, including the relationships among different gases and other factors such as temperature and humidity. Here are some potential ways to analyze and utilize this dataset:
1. **Time Series Analysis**: As this dataset has hourly data for a full year, you can perform a time series analysis to identify patterns or trends in air quality over time. You can examine how the levels of different gases change throughout the day, week, or across different seasons.
2. **Correlation Analysis**: It would be interesting to explore the correlation between different variables. For instance, you might want to see how closely related the sensor readings are to the true hourly averaged gas concentrations. Additionally, you could look at how changes in temperature or humidity affect the levels of different gases.
3. **Predictive Modeling**: Using machine learning techniques, you could try to predict future levels of different gases based on the historical data. This could help identify potential air quality issues before they occur.
4. **Sensor Evaluation**: As the dataset includes both sensor readings and true hourly averages, you can use it to evaluate the performance of the sensors. This could involve measuring how accurately they estimate gas concentrations, or investigating any evidence of sensor drift or cross-sensitivity.
5. **Missing Value Treatment**: With the -200 values indicating missing data, this dataset could also be used to experiment with different ways of handling missing data, such as imputation methods or handling them during model building.
Before proceeding with any of these, it would be important to preprocess the data, including handling the -200 missing value markers and possibly converting the date and time into a suitable format for analysis. Depending on the analysis you want to perform, you might also need to standardize or normalize the data. Also, remember to split your data into training and test datasets when building predictive models.
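That last point deserves emphasis: for time-series data, the train/test split should be chronological rather than random, so the model never sees "future" rows during training. A minimal sketch (`chronological_split` is a hypothetical helper; the filename comes from the pre-processing example below):

```python
import pandas as pd

def chronological_split(df: pd.DataFrame, train_frac: float = 0.8):
    """Split a time-indexed DataFrame into train/test without shuffling,
    keeping the earliest rows for training and the latest for testing."""
    df = df.sort_index()
    cut = int(len(df) * train_frac)
    return df.iloc[:cut], df.iloc[cut:]

# Usage on the cleaned dataset (filename from the example below):
# df = pd.read_csv("Cleaned_AirQuality.csv", parse_dates=["DateTime"], index_col="DateTime")
# train, test = chronological_split(df)
```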
2. Be Specific
As intuitive as ChatGPT may be, it still requires clear, explicit instructions to deliver the best results. Be precise about what you want the AI to do. Are you asking it to analyze the data directly and provide insights, or do you want it to generate a Python script that performs a specific task?
Also, define the format in which you want the results. Do you need a detailed report with full sentences, or would you prefer a bullet-point summary of the insights? Are you looking for a data visualization to highlight a particular trend or pattern?
Now let’s return to our example.
After giving ChatGPT some context and getting its ideas on what to do with the data, we can proceed with the details of the task.
Provide me with the steps and one-piece Python script (with detailed comments) to pre-process the data.
Notice how specific we are being here:
- We want ChatGPT to give us a Python script to pre-process our data.
- We want the script as a single piece (otherwise ChatGPT may return many small snippets you’ll need to copy separately).
- We want the steps explained.
And here’s what we get from ChatGPT after executing the prompt above:
```python
# Import necessary libraries
import pandas as pd
import numpy as np

# Load the dataset
# Notice the delimiter is ';'
# Missing values are marked as -200 as per the dataset description
df = pd.read_csv('AirQuality.csv', delimiter=';', na_values=-200)

# Combine the 'Date' and 'Time' columns into a single 'DateTime' column
df['DateTime'] = pd.to_datetime(df['Date'] + ' ' + df['Time'], format='%d/%m/%Y %H.%M.%S')

# Set 'DateTime' as the index of the dataframe
df.set_index('DateTime', inplace=True)

# Drop the original 'Date' and 'Time' columns as they are not needed anymore
df.drop(['Date', 'Time'], axis=1, inplace=True)

# Replacing European decimal comma with decimal point
# Assuming all numeric columns (from column index 2 to the end) are affected
for column in df.columns[2:]:
    df[column] = df[column].str.replace(',', '.').astype(float)

# Handling missing values
# For simplicity, we'll use forward-fill method to fill gaps
# This method propagates the last valid observation forward
# Other methods could be considered based on the specific analysis to be performed
df.fillna(method='ffill', inplace=True)

# If there are any remaining missing values (for example at the start of the series), we'll use back-fill method
df.fillna(method='bfill', inplace=True)

# Print the first few rows of the processed dataframe to check everything is as expected
print(df.head())

# Save the cleaned dataset for future use
df.to_csv('Cleaned_AirQuality.csv')
```
The more specific your instructions, the more efficient and relevant ChatGPT’s output will be.
In our example, after providing both context (in the first step) and task details, we got a solid result in just a few minutes; without our helpful AI assistant, we would have spent at least half an hour pre-processing this simple dataset alone.
3. Iterate
Working with ChatGPT is not a one-step process. Don’t hesitate to iterate and refine your prompts based on the responses you receive.
In the initial stages, it might seem quicker to perform the analysis yourself, but as you iterate and learn how to effectively prompt the AI, you’ll begin to appreciate the time you’re saving and the enhanced creativity your work can encompass.
Let’s look at how we can use iteration in our example.
We give ChatGPT another task:
Now let's do time-series analysis.
In response, we get a Python script that does what we asked:
```python
# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt

# Load the cleaned dataset
df = pd.read_csv('Cleaned_AirQuality.csv', parse_dates=['DateTime'], index_col='DateTime')

# Plotting the time series data of CO concentration
df['CO(GT)'].plot(figsize=(15, 6))
plt.title('Time Series of CO Concentration')
plt.xlabel('DateTime')
plt.ylabel('CO Concentration')
plt.show()

# Plotting the time series data of Non Metanic Hydrocarbons concentration
df['NMHC(GT)'].plot(figsize=(15, 6))
plt.title('Time Series of Non Metanic Hydrocarbons Concentration')
plt.xlabel('DateTime')
plt.ylabel('Non Metanic Hydrocarbons Concentration')
plt.show()

# More gases can be analysed similarly
```
But when we try to execute this code, we get an error.
Not a problem at all! This is a good place to iterate. As our next step, we provide ChatGPT with the error message we got:
I got this error:
```
TypeError                                 Traceback (most recent call last)
<ipython-input-4-3170e6576ba2> in <cell line: 9>()
      7
      8 # Plotting the time series data of CO concentration
----> 9 df['CO(GT)'].plot(figsize=(15, 6))
     10 plt.title('Time Series of CO Concentration')
     11 plt.xlabel('DateTime')

3 frames
/usr/local/lib/python3.10/dist-packages/pandas/plotting/_matplotlib/core.py in _compute_plot_data(self)
    633         # no non-numeric frames or series allowed
    634         if is_empty:
--> 635             raise TypeError("no numeric data to plot")
    636
    637         self.data = numeric_data.apply(self._convert_to_ndarray)

TypeError: no numeric data to plot
```
ChatGPT’s response to this prompt, together with a look at the dataset, reveals that pre-processing didn’t finish the job: some of the numeric values use a decimal point, while others still use a comma. That becomes our next step in the iteration process. We ask ChatGPT to update the dataset so that all numeric values use a decimal point:
Write a python script that will change commas to points in all numerical values of the dataset ("Cleaned_AirQuality.csv").
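The script ChatGPT returns at this point isn’t reproduced in the article, but the fix might look something like this sketch (`commas_to_points` is a hypothetical helper; the guard keeps genuinely textual columns untouched):

```python
import pandas as pd

def commas_to_points(df: pd.DataFrame) -> pd.DataFrame:
    """Convert decimal commas to points in every string column that
    parses cleanly as numeric, and cast those columns to float."""
    out = df.copy()
    for col in out.select_dtypes(include="object").columns:
        converted = pd.to_numeric(out[col].str.replace(",", ".", regex=False),
                                  errors="coerce")
        # Keep the conversion only if every non-missing entry parsed as a number
        if converted.notna().sum() == out[col].notna().sum():
            out[col] = converted
    return out

# Usage on the cleaned file (filename from the example above):
# df = pd.read_csv("Cleaned_AirQuality.csv", index_col="DateTime")
# commas_to_points(df).to_csv("Cleaned_AirQuality.csv")
```

As a design note, this whole round-trip could also have been avoided at load time: `pd.read_csv` accepts a `decimal=','` parameter for exactly this kind of data.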
After that we get the code we need, execute it, and can then run our time-series analysis code. The ChatGPT magic happens: after providing context and details, and iterating a few times, we get our data visualized and can derive some useful insights!
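With the numeric columns finally clean, the time-series ideas from earlier become short one-liners. For example, resampling the hourly readings to daily means smooths out hour-to-hour noise and makes seasonal patterns easier to see. A sketch, assuming the column names from the dataset (`plot_daily_mean` is a hypothetical helper):

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt

def plot_daily_mean(series: pd.Series, title: str) -> pd.Series:
    """Resample an hourly series to daily means and plot the result."""
    daily = series.resample("D").mean()
    daily.plot(figsize=(15, 6))
    plt.title(title)
    plt.xlabel("Date")
    plt.ylabel(series.name or "Value")
    return daily

# Usage on the cleaned dataset (filename and column from the example above):
# df = pd.read_csv("Cleaned_AirQuality.csv", parse_dates=["DateTime"], index_col="DateTime")
# plot_daily_mean(df["CO(GT)"], "Daily Mean CO Concentration")
```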
Every interaction with ChatGPT offers an opportunity to improve your “prompt engineering” skills—the art of crafting effective prompts that guide AI to produce the desired output. In 2023, this has become an essential skill for anyone using AI tools in their work.
By providing context, being specific, and iterating, you can harness the full potential of ChatGPT for your data analysis tasks. In turn, this could save you hours of manual data crunching and leave you with more time for the exciting part of your job—the actual interpretation and application of your data insights.
Here is the full conversation with ChatGPT from this article: https://chat.openai.com/share/df7f667a-1ce5-473d-ad61-61995fc24b6a