One of the features of my blog is a list of articles and videos I have run across on the Internet while researching or just reading. I do this for myself so I can find these articles later, but I also make it available in case others find it useful.
To populate the list, I created an automation that lets me drop in a URL; it scans the page and adds it to the list. This automation used traditional scraping techniques, using XPath to find specific attributes in the HTML and capture the content. It worked most of the time, with about 80-90% accuracy: just good enough that I didn't spend more time refining the algorithm, but not good enough that I was satisfied.
In my recent studies to learn more about how to leverage Semantic Kernel and OpenAI, I had an idea. What if I could use generative AI to scan the pages and extract the data I needed for my list? In concept it would be more adept at handling the variations and could give me much better results.
In looking at this problem, there were two challenges. The first was that raw HTML has a lot of text attributes that could be confusing to the generative AI. The second challenge was that I wanted structured data as the output.
To reduce the text noise in the page, I used some traditional page-scraping techniques. For the body of the page, I used a reverse Markdown library to convert the HTML into a Markdown document. This works well for pages that do not have good meta tags, but I found that a lot of the information I'm looking for is in the meta tags. For those, I used XPath to loop through the tags and convert them to a simple key/value pair text list. The result was text that was much smaller to send into the generative AI, with minimal extraneous content.
With my raw input challenge solved, I needed to figure out how I could get structured data as the output. Luckily, I found that OpenAI has recently added an option for their API that allows you to specify that you want JSON as the response. From my research, without this option it was not guaranteed that you would get JSON back from your prompt, or you might get extraneous text along with the JSON in the response.
Semantic Kernel also supports the new JSON output feature of OpenAI, but it is marked experimental, which means it might be removed in the future. In this case, I'll take my chances and run with it. The worst that can happen is I can't scan pages for my list.
Now it all comes down to the prompt. Getting the prompt right so that I get clean JSON output that makes sense and a rational summary of the page took quite a bit of trial and error. I used a Polyglot Notebook to do this experimentation and now I fully understand why data scientists like Jupyter Notebooks so much. Using notebooks to do the experimentation allowed me to rerun just the blocks of code that I wanted to change and allowed me to document my code with markdown along the way.
Let's get into the code!
The first step is to pull in all our dependencies and config values. You will notice that I'm storing my OpenAI key in a secrets.json file that does not get checked into my repo. You will need to obtain an OpenAI API key and create your own secrets.json file.
To get an OpenAI key, I purchased a block of credits and did not set it to renew so I can limit my cost. I spent $10 on credits and so far I've only used $0.05 of my balance. I like that there is no monthly recurring cost and the per-use cost is really low.
#r "nuget: HtmlAgilityPack, 1.11.65"
#r "nuget: ReverseMarkdown, 4.6.0"
#r "nuget: Microsoft.SemanticKernel, 1.18.2"
#r "nuget: Microsoft.Extensions.Configuration, 8.0.0"
#r "nuget: Microsoft.Extensions.Configuration.FileExtensions, 8.0.1"
#r "nuget: Microsoft.Extensions.Configuration.Json, 8.0.0"
using HtmlAgilityPack;
using ReverseMarkdown;
using Microsoft.Extensions.Configuration;
using System.IO;
using System.Text; // for StringBuilder, used below
using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.ChatCompletion;
using Microsoft.SemanticKernel.Connectors.OpenAI;
public static IConfigurationRoot config = new ConfigurationBuilder()
    .AddJsonFile(Path.GetFullPath("secrets.json"),
        optional: false, reloadOnChange: true)
    .Build();

public static string OpenAIKey = config["openai-key"];
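For reference, secrets.json is just a flat JSON file containing the key referenced above. The value shown here is a placeholder, not a real key:

```json
{
  "openai-key": "sk-your-api-key-here"
}
```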
Next we want to access a page, capture the raw HTML and use XPath to extract the relevant sections of the document. We will use one of my blog articles for the example.
For this task we will use HtmlAgilityPack. This is a library that makes it easy to scan and capture elements from a page. There are several other libraries that can be used for this, but I've had really good luck with HtmlAgilityPack in the past so that is my choice today.
var url = "https://www.bradjolicoeur.com/article/data-analysis-csharp";
var uri = new Uri(url);
// Get the URL specified
var webGet = new HtmlWeb();
var document = await webGet.LoadFromWebAsync(url);
var body = document.DocumentNode.SelectSingleNode("/html/body");
var metaTags = document.DocumentNode.SelectNodes("//meta");
var metaText = string.Empty;
First, we will convert the metaTags to a key/value pair list of text.
if (metaTags != null)
{
    var sb = new StringBuilder();
    foreach (var item in metaTags)
    {
        sb.Append(item.GetAttributeValue("property", ""));
        sb.Append('|');
        sb.Append(item.GetAttributeValue("content", ""));
        sb.AppendLine();
    }
    metaText = sb.ToString();
}
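For a typical article page, the resulting metaText ends up looking something like this. These lines are illustrative, not the actual output for my page; meta tags that have no property attribute produce lines with an empty key:

```text
og:title|Data Analysis with C#
og:image|https://example.com/header-image.jpg
article:published_time|2024-08-17
|width=device-width, initial-scale=1
```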
Then we will convert the body to Markdown so we are working with a simple text document with no HTML tags, JavaScript or other text.
var config = new ReverseMarkdown.Config
{
    UnknownTags = Config.UnknownTagsOption.Drop
};
var converter = new ReverseMarkdown.Converter(config);
string html = body.OuterHtml;
string markdownText = converter.Convert(html);
Now let's stack the two simplified blocks of text into a single text string so we can include it in the prompt.
var sbAllText = new StringBuilder();
sbAllText.AppendLine(metaText);
sbAllText.AppendLine(markdownText);
var textToSummarize = sbAllText.ToString();
We have the page prepared and now we can apply Generative AI to extract our data.
Notice the pragma disable for SKEXP0010. The code will not compile unless you disable this compiler error since we are using an experimental feature of Semantic Kernel.
#pragma warning disable SKEXP0010
static string ModelId = "gpt-4o-mini";
// Create a kernel with OpenAI chat completion
var builder = Kernel.CreateBuilder()
.AddOpenAIChatCompletion(ModelId, OpenAIKey);
Kernel kernel = builder.Build();
// Create and print out the prompt
string prompt = $"""
Consider a JSON schema for Article Summary that includes the following properties: Author:string, PublishDate:datetime, Title:string, Summary:string, KeyWords:string, ImageUrl:string
Please summarize the following text in 30 words or less for software engineers as the audience and output in json:
{textToSummarize}
# How to respond to this prompt
- No other text, just the JSON data
""";
// Submit the prompt and print out the response
string response = await kernel.InvokePromptAsync<string>(
    prompt,
    new(new OpenAIPromptExecutionSettings()
    {
        MaxTokens = 1000,
        ResponseFormat = "json_object"
    })
);
The output I captured was this JSON document.
{
  "Author": "Brad Jolicoeur",
  "PublishDate": "2024-08-17T00:00:00Z",
  "Title": "Data Analysis with C#: Leveraging .NET for High-Performance Tasks",
  "Summary": "C# can perform data analysis tasks as efficiently as Python, often with better performance, especially for skilled C# developers.",
  "KeyWords": "C#, .NET, Data Analysis, AI, ML",
  "ImageUrl": "https://storage.googleapis.com/blastcms-prod/blog-blastcms/aa4cbcee-2aae-4af0-b782-c18fb6a5c114-20240817121519.JPG"
}
I chose to use the gpt-4o-mini model in this case. It seemed to give good results for this use case. I tried a couple of other models, but at this time it seems to be the right option.
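Once the response comes back, the JSON can be deserialized into a typed object for saving to the list. This is a minimal sketch, assuming the model returned the schema as requested; the ArticleSummary record is my own type, not part of Semantic Kernel, and a sample response string stands in for the live API call:

```csharp
using System;
using System.Text.Json;

// A record matching the schema described in the prompt.
public record ArticleSummary(
    string Author,
    DateTime PublishDate,
    string Title,
    string Summary,
    string KeyWords,
    string ImageUrl);

// In the notebook, 'response' comes back from InvokePromptAsync;
// a hard-coded sample is used here so the sketch is self-contained.
string response = """
{
  "Author": "Brad Jolicoeur",
  "PublishDate": "2024-08-17T00:00:00Z",
  "Title": "Data Analysis with C#",
  "Summary": "C# can handle data analysis tasks efficiently.",
  "KeyWords": "C#, .NET, Data Analysis",
  "ImageUrl": "https://example.com/image.jpg"
}
""";

// Property names in the JSON match the record's PascalCase names,
// so the default (case-sensitive) matching works.
var summary = JsonSerializer.Deserialize<ArticleSummary>(response);
Console.WriteLine(summary.Title);
```

If the model occasionally omits a property, the corresponding record member simply comes back null or default, which is easy to check before adding the entry to the list.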
Let's break down the Prompt.
The prompt starts off by describing a JSON schema for the output. I saw examples where the schema was defined in JSON format, but that seemed verbose and I had really good luck with this more descriptive approach.
Consider a JSON schema for Article Summary that includes the following properties: Author:string, PublishDate:datetime, Title:string, Summary:string, KeyWords:string, ImageUrl:string
Then I asked it to summarize the text from the page:
Please summarize the following text in 30 words or less for software engineers as the audience and output in json:
{textToSummarize}
Lastly, I added some hints to hopefully reduce the chances that I get something other than a JSON document with no other text.
# How to respond to this prompt
- No other text, just the JSON data
This may no longer be needed since I'm using the JSON output option with OpenAI. I found this in older examples and left it in since it didn't seem to hurt and I was getting consistent results.
While in hindsight this code is not super complex, it did take me a while to figure out the parts and refine it. The effort was well worth the results. I'm at a nearly 100% success rate scanning pages and getting accurate output. In addition to the parsing success, I get a better summary of the page, and I can extract keywords from the article to build a better listing.
I think the days of spending hours hand-coding extraction logic for unstructured data are likely over. I'm wondering if the days of hand-coding transformations of structured data are also over with this technique. Imagine being able to declaratively convert a CSV file into a JSON file. Conceptually, that is possible with this technique. Whether you will get accurate results is the question.
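As a thought experiment, the same prompt pattern used above could be applied to that CSV idea. Something like this, untested, with a hypothetical Contact schema and a {csvText} variable holding the file contents:

```text
Consider a JSON schema for Contact that includes the following properties: Name:string, Email:string, Phone:string
Please convert the following CSV rows into a JSON array conforming to that schema:
{csvText}
# How to respond to this prompt
- No other text, just the JSON data
```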