Analyzing websites with QtWebkit

published at 21.02.2014 19:11 by Jens Weller

I just finished a little tool that analyzes websites, in this case mostly blog posts, to extract a few fields. I have been working with QtWebkit for this purpose for a while; one larger project I am working on needs it to index certain websites. So this small tool is a good example of how to do that with Qt.

The background of this tool is that one of my voluntary tasks is to post links and updates on isocpp.org. I already do this on twitter, G+, facebook and reddit, so isocpp.org is just one more site. But unlike the social networks, isocpp.org has a few extra rules: there is a style guide I need to follow for posting. Simple copy & paste won't do, so each time I have to click together the correctly styled HTML in the WYSIWYG editor. For some time I had thought about a tool that would simply visit the site and generate the fitting HTML. So, here is how to build such a tool with QtWebkit.

QtWebkit is actually a full browser that can also display websites in your UI; unfortunately you cannot omit the UI part when you only need the engine in the background. With Qt5, QtWebkit is bundled with QtWebkitWidgets and has quite a lot of dependencies when deployed. My little tool does not use QML, Qt Quick, V8, PrintSupport and many more, yet I still need those DLLs, as webkit is linked against them. My tool consists of a line edit for entering the URL, a button that starts loading the URL, and a text box where the result is shown once the page is fully loaded. So when the button is clicked, not a lot happens:

void MainWindow::on_btn_loadweb_clicked()
{
    if(ui->txt_url->text().isEmpty())
        return;
    QUrl url = QUrl::fromUserInput(ui->txt_url->text());
    if(url.isValid())
        page.mainFrame()->load(url);
}

I check that the line edit is not empty and then simply load the entered URL. I also make sure that the user has entered a valid URL. The member variable page is of type QWebPage, which now loads the website into its main QWebFrame. When loading has finished, Qt emits a signal that I have already connected to my slot, which then handles the loaded data:

void MainWindow::onLoadFinished(bool loaded)
{
    if(!loaded)
        return;
    QString formatted_text = "<p>\n\
...
<a href=\"{URL}\">\n\
...
</blockquote>";

First, if the site failed to load, there is nothing to do. Then I define a template for the HTML I need as output. The next part is the actual search in the DOM provided by QWebFrame:

QWebElement root = page.mainFrame()->documentElement().findFirst("article");
if(root.isNull())
    root = page.mainFrame()->documentElement().findFirst("section #main");
...
if(root.isNull())
    root = page.mainFrame()->documentElement();
formatted_text.replace("{URL}", page.mainFrame()->url().toString());
QWebElement header = root.findFirst("h1");
if(header.isNull())
    header = root.findFirst("h2");
if(!header.isNull())
    formatted_text.replace("{TITLE}",header.toPlainText());
else
    formatted_text.replace("{TITLE}","INSERT TITLE");

QWebElement represents a single node in the XML-like DOM. With findFirst I try to get the first node named "article"; some websites/blogs use this to wrap the actual content. Others use something else, so if this fails I search for an element with the id main inside a section. This code continues in a few variations, so that it finds the correct content root element for most blogs. I then grab the first h1; if there is none, I go for h2. I do the same for <p>, to grab the first paragraph. With toPlainText I can get the plain text of any element as it would be displayed on a website. The API also allows access to the attributes, and even inserting new nodes and text would be possible. QWebElement's find functions simply take a CSS selector as search string, so with findAll("a") you could easily build a web crawler.
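The chain of "try this selector, else the next one" could also be written in a data-driven way. The following is only a sketch in plain C++ without Qt, not the code of the tool: FakeDom and find_content_root are hypothetical names, and the map merely stands in for what findFirst would return on a real page.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for the DOM: maps a CSS selector to the text of the
// element it would match. A missing key plays the role of a null QWebElement.
using FakeDom = std::map<std::string, std::string>;

// Try each selector in priority order and return the first match.
// When nothing matches, fall back to the whole document.
std::string find_content_root(const FakeDom& dom,
                              const std::vector<std::string>& selectors,
                              const std::string& whole_document)
{
    for (const std::string& selector : selectors) {
        const auto hit = dom.find(selector);
        if (hit != dom.end())
            return hit->second;
    }
    return whole_document;
}
```

With a list such as {"article", "section #main"}, supporting another blog layout would only mean adding one more selector to the list instead of another if block.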

At the end, formatted_text is displayed in the QPlainTextEdit. I might add a button for copying it to the clipboard, or could even copy the result to the clipboard directly.
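To make the placeholder substitution easier to follow outside of Qt, here is a minimal sketch with std::string. The helpers replace_all and fill_template are hypothetical names of mine; the tool itself uses QString::replace. As an assumption, an empty title stands in for the case where no h1 or h2 was found.

```cpp
#include <string>

// Replace every occurrence of `placeholder` in `text` with `value`,
// similar in spirit to QString::replace(before, after).
std::string replace_all(std::string text,
                        const std::string& placeholder,
                        const std::string& value)
{
    if (placeholder.empty())
        return text; // nothing to search for; also avoids an endless loop
    std::string::size_type pos = 0;
    while ((pos = text.find(placeholder, pos)) != std::string::npos) {
        text.replace(pos, placeholder.size(), value);
        pos += value.size(); // continue searching after the inserted value
    }
    return text;
}

// Fill the {URL} and {TITLE} placeholders of the template.
// Assumption: an empty title means no header element was found,
// so the "INSERT TITLE" marker is written instead.
std::string fill_template(std::string html,
                          const std::string& url,
                          const std::string& title)
{
    html = replace_all(html, "{URL}", url);
    const std::string shown_title =
        title.empty() ? std::string("INSERT TITLE") : title;
    return replace_all(html, "{TITLE}", shown_title);
}
```

Keeping a visible marker like INSERT TITLE in the output makes it obvious in the text box when the page did not provide a usable heading.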

What I've learned through this project is that QtWebkit has a nice API to access websites as a browser sees them. Parsing HTML is difficult, as it is not XML, and most websites are some kind of wild mix. There are alternatives though, since QtWebkit, as a full webkit browser, is quite a beast. wxWidgets offers HTML tag parsing support, and arabica has a soup-based HTML tag parser. But when you need to work on the DOM, those can fail. Also, as more sites rely on javascript, simply downloading the HTML via HTTP might not be enough. I'm looking forward to the upcoming version of QtWebkit, which will be based on blink. As I did not need the "full package", I must also add that QtWebkit comes with a lot of bloat: QML, PrintSupport and QtQuick all have to be included when deploying. All DLLs together are 84 MB on Windows.

 
