Challenge accepted!

published at 23.01.2013 13:10 by Jens Weller
Save to Instapaper Pocket

This all started a few days ago on twitter, when a discussion about implementing a wordcount algorithm came up. To be precise, it orignates in a couple of blogposts, to get the full details, just read the them here. In a short overview, a simple C++ programm was the origin, which could count the words in one or more files. Stephan T. Lavavej postet in to the comments a version working with C++11 regex library. While this implementation is quite elegant, its lacking like the first a bit in performance. Thats where the discussion started on twitter, when James McNellis and Kenny Kerr started to discuss their solution. Which they present in the linked Blogpost. Which presents a nice, and multithreaded solution, written in Windows specific C++ using PPL and a like. At that point I felt challenged, to implement my own version, in pure C++11. So, in this blogpost I will present to you, how this can be achieved in C++11.

But before I present my solution, I'd like to write a little bit about the problems I had. I choose to implement this on Linux with GCC 4.7.2. Which has a good set of C++11 features, and support for std::thread. The first Problem I run into, was a std::string constructor throwing an exception, which was caused by an index error. This was fixed quickly. But then, once everything compiled, I got another exception, this time "Operation not permitted". Well. I would have expected a linker error when you forget to link against pthread. But GCC thinks its better to throw an exception at runtime. Once everything was done, I wanted to compare my code to other solutions, but as James & Kennys solution is windows only, I went for STLs regex approach. After all, its also implemented in pure C++11, so that should work. But, as it turns out, <regex> seems not fully implemented yet in GCC. So, I get an linker error for std::sregex_token_iterator.

To my solution... As I mentioned, I decided to implement this in pure C++11. Using std::thread for the threading, std::chrono for measuring the time, and some other features like the new for loops and auto. As I write this in C++, I decided to put the main Algorithm in a class, so that I might reuse it later in some app or programm. This class has the following interface:

typedef std::unordered_map<std::string, size_t> wordcontainer;

class WordCounter
{
    std::unique_ptr<std::thread> mythread;

    wordcontainer wordcount;
    bool isrunning=false;
    std::mutex mymutex;
    inline bool isWord(const char& c);
    void countWordsThreaded(const std::string file);
public:
    WordCounter();
    virtual ~WordCounter();
    void startCounting(const std::string file);
    void copyWords(wordcontainer& words);
    bool isRunning();
    void join();
};

The method countWordsThreaded is the one which will run inside std::thread. As I plan in reusing the class, I put std::thread in a unique pointer, which after the thread has run, is replaced by a new one. A threadpool would be better here, but that does not yet exist in Standard C++. The member wordcount is an std::unordered_map, which brings quite a good performance gain against std::map. The rest of the interface is quite selfexplaining.

So here is the content of the main function:

size_t num_cores = std::thread::hardware_concurrency();// 1
std::cout << "Programm running on " << num_cores << " Processor cores" << std::endl;
std::vector<WordCounter> cores(num_cores < argc? num_cores:argc);// 2
wordcontainer wordcount;// 3

auto t1 = std::chrono::high_resolution_clock::now();// 4
for(size_t i = 1,j=0; i < argc;++j)//5
{
    if(j == cores.size())//6
        j =0;
    if(!cores[j].isRunning())//7
    {
        cores[j].copyWords(wordcount);
        cores[j].startCounting(std::string(argv[i]));
        ++i;
    }
}
for(WordCounter& w: cores)//8
{
    w.join();
    w.copyWords(wordcount);
}
auto t2 = std::chrono::high_resolution_clock::now();//9
std::cout << "found " << wordcount.size() << " words" <<std::endl;
std::cout << "Calculation took " << std::chrono::duration_cast<std::chrono::milliseconds>(t2-t1).count() << " milliseconds" << std::endl;
/*for(std::pair<const std::string, size_t>& p: wordcount)
{
    std::cout << p.first << " : " << p.second << std::endl;
}*/

So, some code. The Details:

Using std::thread::hardware_concurrency(), I get the number of cores on which the programm might run.
Now, either for every core I create an instance of WordCounter, or if there are more Processorcores then files, for each file one.
Creating the word container.
using std::chrono to measure the time which the algorithm will need to do its job.
The loop who will put the cores to work. While j is increased every cycle, i is only increased, if a new file is put to work on a core.
As j acts as the index to the cores vector containing the WordCounter Objects, it need to be set back to 0 each time it reaches j == cores.size(). Alternatively we could use j % cores.size() as the index, but risk the overflow in a normal algorithm. Maybe also some STL algorithm like find_if could replace the counter of j. But the performance gain would be tiny, so its IMHO not worth the effort.
Check if the current index is still running, if not, give it a new job. First the results from the old job are saved to the main container, then the next file is started. Afterwards i is increased by 1. If all files are processed/in processing the loop ends.
If there is still some thread running, the program must wait for it. Also the last results must be copied into the main container for our words.
Done! The algorithm has done its job, and we know now how many different words we have found. So last thing to do is, take the time again. And do some outputs, so the user can see what the programm has done.

So, the main function is rather simple. As I said, the Algorithm and its threading, i put into a different class, for encapsulation, and later reusability. So, lets look at the core method, countWordsThreaded:

std::lock_guard<std::mutex> lock(mymutex);//1
isrunning = true;
std::ifstream in(file);//2
if(!in)
{
    isrunning = false;
    return;
}
in.seekg(0,std::ios::end);//3
unsigned long size = std::streamoff(in.tellg());
in.seekg(0,std::ios::beg);

std::unique_ptr<char[]> data(new char[size]);//4
in.read(data.get(),size);

for(char* it= data.get(),* beg = data.get(),*lastwordbegin = nullptr;(it - beg) < size;++it)//5
{
    if(lastwordbegin && !isWord(*it) )
    {
        ++wordcount[std::string(lastwordbegin,it)];
        lastwordbegin=nullptr;
    }
    else if(!lastwordbegin && isWord(*it))
        lastwordbegin = it;
}
isrunning = false;

And the details explained:

I need to use a mutex for isrunning, hence the lock_guard, which locks the mutex as long as the function is running. This is necessary, since mythread->joinable() didn't do its job, even when the thread should finished, its still joinable. So I use isrunning to find out wether or not the thread is still running.
Yes! I use std::ifstream, to read the file. FILE* might be a bit better, but I don't want to go down that road...
But still, I can read the file in one block. Maybe in the future one should check here filesize, if it makes sense to load the file in one block into memory.
Reading the file in a block of char[].
The main algorithm, which counts the words.

The rest of WordCounters Implementation is straight forward, in startCounting is a new thread created, if possible:

if(mythread && isRunning())
{
    std::cout << "thread still running" << std::endl;
    return;
}
mythread.reset( new std::thread(&WordCounter::countWordsThreaded,this, file));

While this is straight forward, what do we do with wordcount in copyWords? Is it better to call after the copying clear? Or shall we set it the count to 0 while copying? This would have the advantage, that in the next run, a lot of words would not have to be created, but also will force unneeded memory beeing freed later, and the adding of 0s to not found words by any next run:

if(isRunning())
    mythread->join();//wait for finish
for(std::pair<const std::string,size_t>& pair: wordcount)
{
    words[pair.first] += pair.second;
    pair.second = 0;
}
//wordcount.clear();

Through the effectiveness of std::unordered_map, my tests have shown, that its better to not delete the map after every run. But my testdata is also very similar, so that this not always might be the case. But the gain I had was between 4-8%, which is quite good. But as I said, I could not test my solution against others, so I only can say, I think its pretty effiecent =).

You can download the code, if you like.

Join the Meeting C++ patreon community!
This and other posts on Meeting C++ are enabled by my supporters on patreon!