Summarizing the 1 GB New York Times corpus in one hour on a 1.3 GHz laptop

Today we will try to summarize the 1 GB New York Times corpus in a reasonable time on a single-core 1.3 GHz machine. The corpus contains document IDs, word IDs and the number of occurrences of each word in each document. A separate file contains the words that correspond to the word IDs. The document IDs have no corresponding titles, so it is impossible to tell which document contains the given words. Some words also seem to have been altered intentionally by the dataset creator, since they start with strange prefixes like "zzz_". Despite this, finding a way to summarize such a large dataset quickly is valuable in itself.
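For orientation, the main file follows the common bag-of-words layout: a short header (number of documents, vocabulary size, total number of triples) followed by one "docID wordID count" triple per line, while the vocabulary file lists one word per line, with the line position serving as the wordID. The triples and words below are invented purely to illustrate the shape of the two files, not taken from the corpus:

docword.nytimes.txt:
300000      <- number of documents
100000      <- vocabulary size (approximate, as stated below)
...         <- total number of triples
1 17 2      <- document 1 contains word 17 twice
1 62 1
2 17 4
...

vocab.nytimes.txt:
aardvark    <- wordID 1
abandoned   <- wordID 2
...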

The number of instances in the dataset is given as 8 million, while the number of attributes is 100000—you would agree that this is a very high dimensionality. Working with it in Python would have been convenient, because the language gives us access to scikit-learn's topic extraction algorithms such as non-negative matrix factorization (NMF) and Latent Dirichlet Allocation (LDA). Yet even these topic extraction methods do not scale to very large datasets like this one. In my earlier tests with the small sample datasets provided in the library, the faster of the two methods took ≈5 s and the slower ≈16 s on a single document, where memory bandwidth limitations were not even visible yet. As we will see at the end, this dataset has 300000 documents, which makes fitting these methods prohibitively expensive, even though they could have provided greater insight into the data. Unfortunately, handling this dataset with Python was not feasible on this machine, mainly due to memory bottlenecks, which are common in an interpreted language.

For this reason, we need a faster alternative, ideally one that touches the dataset as lightly as possible. So we can write the following code:

#include <fstream>
#include <sstream>
#include <iostream>
#include <string>
#include <vector>
#include <algorithm>
#include <map>

using namespace std;

int main()
{
    string line;
    ifstream file, vocab_file;
    ofstream out;

    // Store results of the analysis here
    out.open("nytimes_topic_results.txt");

    // Read the vocabulary. The position in the file determines the wordID (1-based).
    vector<string> words;
    vocab_file.open("vocab.nytimes.txt");
    while(getline(vocab_file, line)) {
        words.push_back(line);
    }
    vocab_file.close();

    // Read the docID, wordID, word_count triples.
    file.open("docword.nytimes.txt");
    int docID = -1;
    string word = "";
    int word_count = -1;
    int current_line = 0;
    int word_pos = 0;
    size_t k = 0;
    size_t num_words_to_show = 15;
    vector<pair<int, string>> pv;
    map<int, map<string, int>> docs;

    while(getline(file, line)) {
        if(current_line >= 3) { // the first three lines are the header
            stringstream linestream(line);
            string value;
            while(getline(linestream, value, ' ')) {
                switch(word_pos) {
                    case 0: docID = stoi(value); break;
                    case 1: word = words[stoi(value) - 1]; break; // wordIDs start at 1
                    case 2: word_count = stoi(value); break;
                }
                if(word_pos == 2) { // store only once the full triple has been read
                    docs[docID][word] = word_count;
                }
                word_pos++;
            }
            word_pos = 0;
        }
        current_line++;
    }
    file.close();

    // For each document, sort its words by count and print the most frequent ones.
    for(const auto& doc : docs) {
        k = 0;
        pv.clear();
        for(const auto& me : doc.second) {
            pv.push_back(make_pair(me.second, me.first));
        }
        sort(pv.rbegin(), pv.rend()); // descending by count
        out << "DocID " << doc.first << " -> ";
        for(const auto& pe : pv) {
            if(k < num_words_to_show) {
                out << pe.second << "(" << pe.first << "), ";
                k++;
            }
        }
        out << endl;
    }
    out.close();
    return 0;
}
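Nothing unusual is needed to build this; a plain optimized build along these lines should do (the source file name here is just an assumption, and the output file name is the one hardcoded above):

g++ -O2 -o summarize_nytimes summarize_nytimes.cpp
./summarize_nytimes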

Despite looking big, this C++ code does only a few things. It reads the two data files and writes the result in the form of "docID -> list of the 15 most frequent words". The result can be seen as a file (19.6 MB compressed, 76.2 MB uncompressed), where the summarization has reduced the dataset by a factor of ≈13. We obtain a simple overview of each document without having to spend a lot of time on it.

Note that the code could still be optimized, but here the decision was to keep it simple. We could, for example, replace the full sort of potentially large vectors with a more lightweight alternative, since we only need the top 15 entries per document (see the sketch below). C++ is really fast at reading and writing files, but once these files reach a certain size, the time to keep and update the data in memory while writing seems to outweigh the time for the writing itself. Initially I thought that writing in many smaller pieces at a time might be faster, but at some point the whole would still need to be merged, and the memory cost would likely spike once again. I cannot tell whether this is true without further experimentation (which takes too much time on a slow machine).
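As a sketch of what such a lightweight alternative could look like: std::partial_sort orders only the first top_k entries instead of sorting the whole vector, which is cheaper when a document contains many distinct words. The helper below is illustrative and not part of the program above; top_k_by_count is a made-up name.

#include <algorithm>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Keep only the top_k (count, word) pairs, in descending order of count.
// The rest of the vector is dropped, since it would never be printed anyway.
void top_k_by_count(std::vector<std::pair<int, std::string>>& pv, std::size_t top_k)
{
    if (pv.size() > top_k) {
        std::partial_sort(pv.begin(), pv.begin() + top_k, pv.end(),
                          std::greater<std::pair<int, std::string>>());
        pv.resize(top_k);
    } else {
        std::sort(pv.begin(), pv.end(),
                  std::greater<std::pair<int, std::string>>());
    }
}

With this in place, the per-document loop would call top_k_by_count(pv, num_words_to_show) instead of sort(pv.rbegin(), pv.rend()) and then print pv as before.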

This case shows that choosing the right language for the task can make the difference between something that feels prohibitively expensive and something that is possible within an hour of patience. And since my previous attempt to work with the Reuters RCV1 dataset (2.5 GB) from within Python failed, this will serve as a reminder to always seek alternatives.