Assuming you are skilled in C# and the .NET ecosystem, you may be wondering whether you need to learn Python to do data analysis or AI/ML. The short answer is that you can do in C# everything you would do in Python, and it will likely perform better and take you less time to build.
As someone who's built .NET solutions since 1.0, I'm a bit biased. That said, based on my research, Python has historically been easier to learn for non-programmers and is just over 10 years older than .NET. That low barrier to entry for non-programmers has been Python's advantage.
In recent years, Microsoft has made a dramatic shift towards embracing open source. As part of this shift, they introduced .NET Core and then open-sourced all of .NET. More recently, .NET has introduced features like top-level statements and minimal APIs that take much of the ceremony and learning curve out of C# and make it far more accessible to newcomers and non-programmers.
Given this information, if you are already a skilled C# programmer, you do not need to learn Python to be equally effective as someone using Python to complete data analysis tasks. You can quickly build the same functionality in C# that will likely perform drastically faster.
To validate this assertion, let's walk through a simple, somewhat mundane yet routine data analysis task. Let's say you are given a CSV file with data and you need to filter or transform it.
In Python, you would use the data analysis library pandas to perform this task with DataFrames. In the .NET world, an equivalent is the Microsoft.Data.Analysis library.
Why use DataFrames in C#?
- I have a world of knowledge in C#
- I know C# and I have a data analysis task that I need to complete quickly
- My team primarily has C# knowledge and is not strong with Python
- I want to use ML.NET
In our example, we are given a CSV file with housing prices. We need to filter this list for current prices less than 250,000 and write the result to another CSV file for some other downstream task.
We are going to leverage dotnet-script and create a file called dataframe.csx. If you'd like to learn more about dotnet-script, check out my article on scripting with C#.
Note: you could easily do the same thing in a console application with top-level statements, or even use the Polyglot Notebooks extension in VS Code.
In your dataframe.csx file, add the following code.
#r "nuget: Microsoft.Data.Analysis, 0.21.1"
using System.IO;
using System.Linq;
using Microsoft.Data.Analysis;
// Define data path
var dataPath = Path.GetFullPath(@"home-sale-prices.csv");
// Load the data into the data frame
var dataFrame = DataFrame.LoadCsv(dataPath);
// output a description of the data loaded
Console.WriteLine(dataFrame.Description());
// Filter for prices under 250,000
PrimitiveDataFrameColumn<bool> boolFilter = dataFrame["CurrentPrice"].ElementwiseLessThan(250000);
DataFrame filteredDataFrame = dataFrame.Filter(boolFilter);
Console.WriteLine(filteredDataFrame.Description());
// Save the filtered output to a csv file
DataFrame.SaveCsv(filteredDataFrame, "result.csv", ',');
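If the two-step filter in the script looks unfamiliar, the idea is that ElementwiseLessThan produces one true/false value per row and Filter keeps the rows marked true. The following is a conceptual sketch of that idea in plain C# with LINQ. It is not how Microsoft.Data.Analysis implements it, and the variable names are made up for illustration; the three rows are taken from the sample file.

```csharp
using System;
using System.Linq;

// Three sample rows from home-sale-prices.csv (Id and CurrentPrice only).
int[] ids = { 1, 2, 8 };
int[] currentPrices = { 350235, 175939, 225766 };

// Step 1: an element-wise comparison yields one bool per row (the "mask").
bool[] mask = currentPrices.Select(price => price < 250000).ToArray();

// Step 2: keep only the rows whose mask entry is true.
int[] keptIds = ids.Where((id, index) => mask[index]).ToArray();

Console.WriteLine(string.Join(",", mask));    // False,True,True
Console.WriteLine(string.Join(",", keptIds)); // 2,8
```

Keeping the mask as a separate value is what lets you reuse it or combine it with other conditions before filtering.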
You can create the example CSV file by creating a file named home-sale-prices.csv and then pasting the following data into the file.
Id,Size,HistoricalPrice,CurrentPrice
1,4174,302283,350235
2,4507,296769,175939
3,1860,137065,592141
4,2294,323165,586157
5,2130,199299,302906
6,2095,111534,168047
7,4772,140397,438249
8,4092,357750,225766
9,2638,453531,558923
10,3169,363160,565192
11,1466,155591,194262
12,2238,320884,537261
13,1330,123247,435920
14,2482,124300,311152
15,3135,182798,278376
16,4444,109268,287848
17,4171,448951,242787
18,3919,374329,277948
19,4735,294776,205016
20,1130,317851,552690
Then execute your script with dotnet script dataframe.csx in a console window.
You will see a summary of the original DataFrame and the filtered DataFrame in the console window, and a result.csv file will be created with the filtered results.
C:>dotnet script dataframe.csx
Description                     Id    Size     HistoricalPrice  CurrentPrice
Length (excluding null values)  20    20       20               20
Max                             20    4772     453531           592141
Min                             1     1130     109268           168047
Mean                            10.5  3039.05  256847.4         364340.75
Description                     Id    Size     HistoricalPrice  CurrentPrice
Length (excluding null values)  6     6        6                6
Max                             19    4735     448951           242787
Min                             2     1466     111534           168047
Mean                            10.5  3511     277561.84        201969.5
The result.csv file contains the following:
Id,Size,HistoricalPrice,CurrentPrice
2,4507,296769,175939
6,2095,111534,168047
8,4092,357750,225766
11,1466,155591,194262
17,4171,448951,242787
19,4735,294776,205016
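As a sanity check on the output above, the CurrentPrice summary for the filtered rows can be recomputed with plain LINQ. This is only a rough sketch: it hard-codes the six data rows from result.csv rather than reading the file, and it assumes CurrentPrice is always the fourth field.

```csharp
using System;
using System.Globalization;
using System.Linq;

// The six data rows of result.csv as listed above (header omitted).
string[] rows =
{
    "2,4507,296769,175939",
    "6,2095,111534,168047",
    "8,4092,357750,225766",
    "11,1466,155591,194262",
    "17,4171,448951,242787",
    "19,4735,294776,205016",
};

// CurrentPrice is the fourth comma-separated field (index 3).
int[] prices = rows
    .Select(row => int.Parse(row.Split(',')[3], CultureInfo.InvariantCulture))
    .ToArray();

int length = prices.Length;
int max = prices.Max();
int min = prices.Min();
double mean = prices.Average();

Console.WriteLine(
    $"Length {length}, Max {max}, Min {min}, Mean {mean.ToString(CultureInfo.InvariantCulture)}");
// Length 6, Max 242787, Min 168047, Mean 201969.5
```

The values match the CurrentPrice column of the second Description() table, and every price is below the 250,000 threshold.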
If you are curious how fast this is with a larger set of data, you can find a sample with 10k rows in my GitHub repo here.
While this is a very simple example, it shows how you can do this task in essentially four lines of code once you take out the Console.WriteLine calls that I put in for demonstration. Arguably, you could do the same task with Excel, but since we used dotnet-script with DataFrames, it is now easily repeatable.
If you already have a C# skill set and you are not looking to make a full-time career in data science, DataFrames in C# are a valuable tool to have in your toolbelt.
If you do want a full-time career in data science, then you should learn Python, simply because it won the popularity contest long ago and it is unlikely you'll get past the resume screen for a data science job without Python experience listed.
Note: I'm not saying that you don't need to learn languages other than C#. I firmly believe all software engineers should learn multiple languages. Learning multiple languages helps you master your craft and increases your job opportunities. Python is a good choice if you are looking to learn a new language.