I was recently working on learning more about using Dataframes and Polyglot Notebooks with C# and needed some sample data to work with. At first I started by looking for some examples online, but I wanted to write an article about my experience and needed my own sample set of data.
My first thought was I can generate a sample set with a small script, but that would take valuable time away from what I was trying to learn. Then I remembered ChatGPT. Could it fabricate a sample dataset for me that is somewhat realistic?
I was pleasantly surprised to find that ChatGPT is super easy and efficient at generating sample data files. My first attempts were very basic, but I was getting good results and I turned up the complexity to see how realistic I could get with the data.
I created the following prompt and in a few seconds I had my output.
please generate a csv dataset of home sale prices with the following column headers and include 100 rows. Id,Size,HistoricalPrice,CurrentPrice. The Historical price should range between 178,000 and 751,000, The CurrentPrice should on average be a 10% increase over HistoricalPrice and in 1% of the cases be less than HistoricalPrice. Size is in square feet and should range from 800 to 4000. Size is the primary factor in CurrentPrice and typically larger homes have a higher price.
ChatGPT produced a link to a csv file to download and a link to see the Python code it generated to create my sample data.
This was the Python ChatGPT created to generate my sample.
import pandas as pd
import numpy as np
# Parameters
n_rows = 100
size_min = 800
size_max = 4000
historical_price_min = 178000
historical_price_max = 751000
avg_increase = 1.10
# Generate data
np.random.seed(42) # For reproducibility
sizes = np.random.randint(size_min, size_max + 1, n_rows)
historical_prices = np.random.randint(historical_price_min, historical_price_max + 1, n_rows)
current_prices = historical_prices * (avg_increase + np.random.normal(0, 0.05, n_rows)) # Add some variation around the average increase
# Ensure that 1% of the current prices are less than historical prices
num_decreases = max(1, int(0.01 * n_rows))
decrease_indices = np.random.choice(n_rows, num_decreases, replace=False)
current_prices[decrease_indices] = historical_prices[decrease_indices] * np.random.uniform(0.95, 0.99, num_decreases)
# Create the DataFrame
data = pd.DataFrame({
"Id": range(1, n_rows + 1),
"Size": sizes,
"HistoricalPrice": historical_prices,
"CurrentPrice": current_prices
})
# Save to CSV
csv_file_path = "/mnt/data/home_sale_prices.csv"
data.to_csv(csv_file_path, index=False)
csv_file_path
This is an excerpt of the csv file ChatGPT generated for me to download.
Id,Size,HistoricalPrice,CurrentPrice
1,3974,694588,803736.350654418
2,1660,414584,471629.3702211707
3,2094,740332,843646.5131709137
4,1930,554896,585154.8269442109
5,1895,226984,265602.993193676
...
This is super powerful from a couple of aspects. The first is if you are trying to learn how to solve a problem with Python code. The other is that within a minute I had created a sample file with data that follows my parameters.
Being given the Python code used to generate the sample means I can inspect it to see what the limitations of the output might be. I can also copy that code into a notebook and generate more sample files and much larger sample files.
After inspecting these results a bit, I felt I could refine the data by giving it a little more complexity to help make it a bit more realistic. In a typical market there are cities that are desirable and others have a depressed home market.
In my chat I added the next prompt to refine the dataset.
please add a city column, where some cities are more desirable and have a higher price per square foot and other cities are depressed markets and have a low price per square foot.
This is the new Python script ChatGPT created to generate my new file with cities.
# Define cities and their price factors (higher value means more desirable)
cities = {
"Desirable City A": 1.2,
"Desirable City B": 1.15,
"Neutral City C": 1.0,
"Depressed City D": 0.85,
"Depressed City E": 0.8
}
# Assign cities randomly to the records
city_names = np.random.choice(list(cities.keys()), n_rows)
price_factors = np.array([cities[city] for city in city_names])
# Adjust the current prices based on the city price factors
adjusted_current_prices = current_prices * price_factors
# Update the DataFrame
data = pd.DataFrame({
"Id": range(1, n_rows + 1),
"City": city_names,
"Size": sizes,
"HistoricalPrice": historical_prices,
"CurrentPrice": adjusted_current_prices
})
# Save to CSV with the city column included
csv_file_path_with_cities = "/mnt/data/home_sale_prices_1000_with_cities.csv"
data.to_csv(csv_file_path_with_cities, index=False)
csv_file_path_with_cities
You can see that ChatGPT added a list of cities with weights based on how desirable or depressed the city was. The city names were not super creative, but they were helpful in deciphering the script and result. It would be an easy task to map those names to more creative names, or I could revise my chat prompt to include a list of realistic city names to be included.
Here is an extract from the sample file with city names.
Id,City,Size,HistoricalPrice,CurrentPrice
1,Desirable City A,1315,677883,848369.6614946304
2,Desirable City B,3887,722869,906379.8358689708
3,Desirable City A,3639,463195,591495.6891125031
4,Neutral City C,1135,303995,306023.61005464377
5,Desirable City A,2582,423690,557168.9412129185
6,Desirable City A,3105,509770,669172.6063590873
7,Desirable City B,1296,606988,820661.9553831856
8,Neutral City C,1695,358553,405717.27109660854
9,Depressed City E,3007,738696,619789.9715846126
10,Depressed City E,1651,437568,417511.50304262876
...
I then used Polyglot notebooks to load this sample do some quick calculations and output to a new csv file.
This is the script I created to add the price increase of current vs historical and the current price per square feet.
#r "nuget: Microsoft.Data.Analysis, 0.21.1"
using System.IO;
using System.Linq;
using Microsoft.Data.Analysis;
// Define data path
var dataPath = Path.GetFullPath(@"data/home-sale-prices-1000-with-cities.csv");
var df = df.LoadCsv(dataPath);
df["PriceIncrease"] = df["CurrentPrice"] - df["HistoricalPrice"];
df["PriceIncreasePct"] = df["PriceIncrease"] / df["CurrentPrice"]*100;
df["PricePerSqft"] = df["CurrentPrice"] / df["Size"];
DataFrame.SaveCsv(df, "data/result.csv", ',');
The output to result.csv looked like this.
Id,City,Size,HistoricalPrice,CurrentPrice,PriceIncrease,PriceIncreasePct,PricePerSqft
1,Desirable City A,1315,677883,848369.688,170486.688,20.09580135345459,645.148071
2,Desirable City B,3887,722869,906379.812,183510.812,20.246568322181702,233.182358
3,Desirable City A,3639,463195,591495.688,128300.688,21.690891683101654,162.543472
4,Neutral City C,1135,303995,306023.625,2028.625,0.66289817914366722,269.624329
5,Desirable City A,2582,423690,557168.938,133478.938,23.9566370844841,215.789673
6,Desirable City A,3105,509770,669172.625,159402.625,23.820853233337402,215.514526
7,Desirable City B,1296,606988,820661.938,213673.938,26.036778092384338,633.226807
8,Neutral City C,1695,358553,405717.281,47164.2812,11.624912917613983,239.361221
9,Depressed City E,3007,738696,619790,-118906,-19.184885919094086,206.115723
10,Depressed City E,1651,437568,417511.5,-20056.5,-4.8038199543952942,252.884003
...
Hand coding a script to generate realistic data that fits a hypnosis is not trivial and will always take longer than you want it to. In addition, generating sample data almost always takes time away from your true objective. Using a tool like ChatGPT is fast and very effective. I was able to generate some realistic samples in a couple of minutes where if I hand coded it, I could easily spend an hour or more writing code, testing it and making refinements.
Some good uses for this technique