Summarizing a short article with Gensim

# Requires gensim < 4.0: the gensim.summarization module was removed in Gensim 4.0.
from gensim.summarization import summarize

with open('article.txt', 'r') as fr, open('summary.txt', 'w') as fw:
    content = fr.read()
    # split=True returns a list of sentences instead of one string;
    # ratio=0.3 keeps roughly 30% of the original sentences.
    summary = summarize(content, split=True, ratio=0.3)
    for i, sentence in enumerate(summary):
        fw.write("%d) %s\n" % (i+1, sentence))

"""
Summary of the article "The astonishing engineering behind America's latest, greatest supercomputer"

# summary.txt
1) You’ll need Summit, a supercomputer nearing completion at the Oak Ridge National Laboratory in Tennessee.
2) Summit will be five to 10 times more powerful than its predecessor, Oak Ridge’s Titan supercomputer, which will continue running its science for about a year after Summit comes online.
3) It's just that at 5 years old, the machine is getting on in years by supercomputer standards.) But it’ll be pieced together in much the same way: cabinet after cabinet of so-called nodes.
4) While each node for Titan, all 18,688 of them, consists of one CPU and one GPU, with Summit it'll be two CPUs working with six GPUs.
5) While not all supercomputers use this setup, known as a heterogeneous architecture, those that do get a boost―each of the 4,600 nodes in Summit can manage 40 teraflops.
6) "So we envision research teams using all of those GPUs on every single node when they run, that's sort of our mission as a facility," says Stephen McNally, operations manager.
7) Performing all those operations sucks up a lot of power and generates a ton of heat.
8) That poses a daunting challenge for Heery, the company charged with preventing Summit from overheating and powering the building that houses it.
9) Another engineering pickle: Each of the supercomputer's 4,600 nodes needs to be cooled individually.
10) You could also cool your electronics in a bath of mineral oil, if you were so inclined.) “Every one of those nodes is using a cold plate technology, where we're putting water through a cold plate that's directly on top,” says Jim Rogers, director for computing and facilities.
"""

Compared with scikit-learn's Latent Dirichlet Allocation (LDA) and non-negative matrix factorization (NMF), which primarily output lists of words, Gensim's summarize function returns entire sentences taken from the text and so preserves as much context as possible.
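To make "returns entire sentences" concrete, here is a minimal sketch of extractive summarization using only the standard library. It is not Gensim's algorithm: it scores each sentence by the average frequency of its words, which is a much cruder heuristic. The function name extractive_summary and the naive regex sentence splitter are assumptions made for illustration only.

```python
import re
from collections import Counter


def extractive_summary(text, ratio=0.3):
    """Return the highest-scoring sentences of `text`, in their original order.

    Illustrative stand-in only: sentences are scored by average word
    frequency, not by the graph-based ranking Gensim uses.
    """
    # Naive split: break after ., ! or ? when followed by whitespace.
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]
    if not sentences:
        return []

    # Word counts over the whole text (lowercased, ASCII letters only).
    freq = Counter(re.findall(r'[a-z]+', text.lower()))

    def score(sentence):
        words = re.findall(r'[a-z]+', sentence.lower())
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    # Keep about `ratio` of the sentences, but always at least one.
    keep = max(1, int(len(sentences) * ratio))
    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    # Restore document order so the summary reads naturally.
    return [sentences[i] for i in sorted(ranked[:keep])]
```

Every item in the returned list is a sentence copied unchanged from the input, which is the property that separates this kind of output from the word lists a topic model produces.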

Note: Applying this approach to a large corpus may require too much memory and could make your machine unresponsive. An attempt to apply it to Benjamin Franklin's autobiography from Project Gutenberg led to this result.
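One possible workaround is to summarize the text in smaller pieces rather than all at once. The sketch below is an assumption, not something tested on the Franklin text: summarize_in_chunks is a hypothetical helper, max_chars is an arbitrary limit, and the summarizer argument is any callable that takes a string and returns a list of sentences. Whether this keeps memory use acceptable depends on the summarizer and the chunk size.

```python
def summarize_in_chunks(text, summarizer, max_chars=50_000):
    """Summarize `text` chunk by chunk and return all selected sentences.

    Hypothetical helper: paragraphs (separated by blank lines) are grouped
    into chunks of roughly `max_chars` characters, and `summarizer` is
    called once per chunk, so it never sees the whole text at once.
    """
    chunk, size, results = [], 0, []
    for paragraph in text.split("\n\n"):
        if not paragraph.strip():
            continue  # skip empty paragraphs
        # Flush the current chunk before it would exceed the limit.
        if chunk and size + len(paragraph) > max_chars:
            results.extend(summarizer("\n\n".join(chunk)))
            chunk, size = [], 0
        chunk.append(paragraph)
        size += len(paragraph)
    if chunk:
        results.extend(summarizer("\n\n".join(chunk)))
    return results


# Possible use with the Gensim call from the example above (gensim < 4.0):
# from gensim.summarization import summarize
# sentences = summarize_in_chunks(
#     content, lambda t: summarize(t, split=True, ratio=0.3))
```

The trade-off is that each chunk is ranked in isolation, so the result is a concatenation of local summaries rather than a single summary of the whole book.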