Multiple Choice Question Generator Project Proposal

Reading in the Information Age

It's often said that we live in an information age. Yet even though the printed word remains one of the most efficient ways to store and convey information, reading is becoming a "lost art." That was the lament of David L. Ulin, 2015 Guggenheim Fellow and former book critic of the Los Angeles Times, who described reading as "an act of contemplation" requiring a state of immersion that is "increasingly elusive in our over-networked culture." Interviewed following the release of the MacBook Air, Steve Jobs predicted that Amazon's Kindle e-book reader and similar products would fail because "people don't read anymore."

It's not just "Literary Fiction" that's losing our attention. TLDR-- "Too long, didn't read"-- has become a popular acronym on Internet forums as a comment on any lengthy post. The newspaper business in particular has suffered a notable decline in recent years, a trend examined by CNBC's Allen Wastler in Newspaper bane: Nobody reads the stories. Humor columnist Elliott Kalan was reportedly fired in 2007 by the free daily paper, Metro, when he wrote an article entitled "Newspapers: Information's Horse & Buggy," making fun of his employer's declining readership.

Multi-tasking myths

Many reasons have been suggested for why people are reading less. The articles referenced here variously mention lack of time, too many distractions, social media, and the constant deluge of tweets and sound bites-- all hallmarks of our 24-7 networked world. With so much information around us, we tend to skim, often missing important details. Web analytics data suggests that many people comment on articles they have not even read. I think an apt expression for this problem is information overload, described with remarkable prescience by Alvin Toffler in his 1970 book, Future Shock.

Mobile devices play a dual role in this milieu, both as a source of information overload and as a means for dealing with it. To learn more about developing applications for mobile devices, I enrolled some time ago in an online course entitled "Building Mobile Experiences." The focus of the course was the use of a range of marketing techniques to design commercially successful mobile applications. Early in the course we were to select a general area or problem domain in which we would develop our project, e.g. photo-sharing, meet-up logistics, etc. I found myself facing a sort of dilemma, as reflected in this message I posted on the class forum:

I'm having trouble identifying my domain of interest, perhaps because it doesn't seem to be amenable to the generative research techniques suggested in the text and lectures. I would describe my area of interest as learning and enhancement of cognitive and analytic skills. I note that a number of mobile apps alluded to in the text and lectures relate to physical exercise. It seems natural to extend the use of apps for self improvement from physical to mental and cognitive areas, but there's a disconnect. The generative research approach, as I understand, is to investigate how people currently behave and perform various tasks in some domain of interest. The results from this investigation are then used to design apps that will help users perform those tasks more efficiently or effectively. Here's my problem. There is a growing body of research suggesting that extensive multi-media consumption actually has a negative effect on learning and cognitive ability. Thus, it seems the simplest way to improve learning and cognitive ability would be to get people to use their mobile devices not more, but less.

Another term increasingly associated with our information-rich environment is multi-tasking. Ironically, the term is no more accurate as a description of human behavior than it was when first coined in the 1960s to describe the operation of computers. While today's multiprocessor computers really do work on multiple problems simultaneously, task-switching would be a better description of both humans and vintage time-sharing computers. More to the point, however, multi-tasking does not usually improve productivity. For that and many other reasons, it is not an effective response to information overload. In fact, a study by researchers at the University of Utah found that the people most likely to engage in multi-tasking are actually those least likely to be good at it.

Getting people to read isn't easy

James Joyce, one of the most influential writers of the 20th century, once said, "The demand that I make of my reader is that he should devote his whole life to reading my works." Not everyone can write like James Joyce, but maybe better writing could motivate people to read more, even for dry utilitarian material like instructions for setting up your TiVo or filling out an insurance form. That is the message of Peter Vogel, author of rtfm*: the little book on how to write so you'll actually get read. Vogel's book is a guide for technical writers-- people who write user manuals, job performance documentation, help system content, and similar material.

Even if we could automatically filter out all the poorly written material we encounter, what remained would still be overwhelming. We would still try to cut corners. Writing for Inside Higher Ed, Rob Weir takes up an often-heard complaint among college instructors-- They Don't Read!-- and presents a variety of suggestions for getting students to read. Most of these suggestions involve creating and grading additional assignments and thus require significant time and effort from the instructor.

A tool for educators

A tool for automatically generating assessments could provide benefits beyond simply reducing the workload of teachers and curriculum designers. For example, using quizzes to encourage completion of reading assignments might result in improved overall student performance. Such a tool could be used to facilitate research in this area.

The most promising educational advantage of a quiz generating tool might not involve assessments at all. In Researchers Find That Frequent Tests Can Boost Learning, Scientific American, August 1, 2015, Annie Murphy Paul describes the use of testing as part of a learning strategy called retrieval practice. A significant body of research indicates that long-term retention improves when students are regularly asked to recall information during the learning process, rather than only during assessment. Researchers have shown retrieval practice to be more effective than re-reading, taking notes, or listening to lectures. It's also possible that making retrieval practice a part of daily classroom activities could reduce the stress and anxiety many students experience when taking tests as assessments.

Analytics

An obvious reason multiple choice questions are popular with educators is that they can be administered and graded automatically, even when they are manually created. Perhaps equally important is that statistical analysis makes it possible to measure the quality of each question by comparing how well it predicts a test taker's overall performance. (More on quiz statistics is available from Moodle.) By integrating analytics with question generation, it should be possible to fine-tune the question generation process itself. With the development of suitable knowledge structures, it may even become possible to judge the quality of the written material on which the questions are based.
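
As a rough sketch of the kind of item analysis involved, the following Python fragment computes a simple discrimination score for each question by comparing the total scores of test takers who answered it correctly against those who answered it incorrectly. The response data is invented for illustration; production systems typically use the closely related point-biserial correlation.

    import statistics

    # One row per test taker; 1 = answered the question correctly, 0 = incorrectly.
    # (Data invented for illustration.)
    responses = [
        [1, 1, 0, 1],
        [1, 0, 0, 1],
        [0, 1, 1, 0],
        [1, 1, 1, 1],
        [0, 0, 0, 0],
    ]

    totals = [sum(row) for row in responses]

    # A question discriminates well if test takers who answer it correctly
    # also tend to score well overall.
    for item in range(len(responses[0])):
        correct = [totals[i] for i, row in enumerate(responses) if row[item] == 1]
        incorrect = [totals[i] for i, row in enumerate(responses) if row[item] == 0]
        gap = statistics.mean(correct) - statistics.mean(incorrect)
        print(f"Question {item + 1}: discrimination = {gap:.2f}")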

Beyond the classroom

Automatically generated quizzes could be useful in a variety of applications outside formal education. Manufacturers and merchants of consumer products could reduce service calls and improve their image by offering discounts and other promotions to customers who read their user manuals. Insurance agents could offer similar perks to clients who read their insurance contracts to familiarize themselves with their coverage. Providers of online services might use quizzes to assure enforceability of click-through agreements. Notably, such agreements were called into question in the recent case of Sgouros v. TransUnion Corp., No. 15-1371 (7th Cir. 2016). In that decision, the appellate court stated, "[W]e cannot presume that a person who clicks on a box that appears on a computer screen has notice of all contents not only of that page but of other content that requires further action (scrolling, following a link, etc.)"

Analytics firms have identified time on site and page visits as important metrics for website owners seeking to gain ad revenue. It's also been shown that study participants who are paid to take a quiz on material they read online, regardless of how well they perform on the quiz, spend much more time on the site than typical visitors. Essentially any context in which descriptive text is used presents a potential application for the proposed tool.

Toward greater civility

As noted above, many people who post remarks online don't even read the stories they comment on. By requiring visitors to pass a short quiz before allowing them to comment on a story, news websites could encourage more informed exchanges on their discussion boards. Compelling participants to demonstrate a modest familiarity with the discussion topic in this way could have another positive effect. In VIRTUOUS OR VITRIOLIC: The effect of anonymity on civility in online newspaper reader comment boards, Arthur D. Santana reports on research showing that anonymous posts are far likelier to contain threats, stereotyping, and other forms of incivility. Further investigation would be needed, but it may even be possible to restore civility to internet forums without requiring users to sacrifice their anonymity.

Current technology

Just as advances in computer hardware and software have long been seen as highly interdependent, the growth of big data has paralleled the rapid rise of interest in the fields of data science and analytics. Much of the newly emerging technology in these areas is directly applicable to the automation of multiple choice question generation.

Natural language processing

Natural language processing (NLP) has seen a number of new developments in recent years. To a great extent, advances in NLP have coincided with improvements in hardware capability. One of the first goals of NLP was translation. Many early translation systems were based on compositional grammar, building on the work of Richard Montague. The essence of Montague's analysis is that the meaning of any expression is determined by the meanings of its parts and how they are assembled grammatically. Relating syntax to semantics in this manner is also a widely adopted approach to compiler design, whereby source code in high-level computer programming languages is translated to executable ("low level") object code. Numerous tools have been developed, beginning in the 1970s with YACC (Yet Another Compiler Compiler), to automatically create compilers from the grammar rules of a programming language. A more recently developed tool, ANTLR (ANother Tool for Language Recognition), imposes certain restrictions on grammar rules but is preferred by many developers because the compiler code it produces is more easily understood.
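
As a rough illustration of how grammar rules translate into parser code-- the translation that tools like YACC and ANTLR perform automatically-- here is a hand-written recursive-descent parser for a toy arithmetic grammar. The grammar and code are invented for this sketch; each grammar rule becomes one function.

    def parse(tokens):
        """Evaluate a token list such as ['2', '+', '3', '*', '4'].

        Grammar:
            expr   -> term (('+' | '-') term)*
            term   -> factor (('*' | '/') factor)*
            factor -> NUMBER
        """
        pos = 0

        def peek():
            return tokens[pos] if pos < len(tokens) else None

        def factor():  # factor -> NUMBER
            nonlocal pos
            value = float(tokens[pos])
            pos += 1
            return value

        def term():  # term -> factor (('*' | '/') factor)*
            nonlocal pos
            value = factor()
            while peek() in ("*", "/"):
                op = tokens[pos]
                pos += 1
                value = value * factor() if op == "*" else value / factor()
            return value

        def expr():  # expr -> term (('+' | '-') term)*
            nonlocal pos
            value = term()
            while peek() in ("+", "-"):
                op = tokens[pos]
                pos += 1
                value = value + term() if op == "+" else value - term()
            return value

        return expr()

    print(parse("2 + 3 * 4".split()))  # 14.0 -- precedence falls out of the grammar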

While it is possible to use such tools for NLP, the subtle complexities of natural language make it very hard to specify a grammar that accurately recognizes a high percentage of well-formed expressions. Programming languages can be designed from the ground up to avoid ambiguity, but structural ambiguity is a particular area of difficulty in NLP. Consider, for example, that a noun phrase in English may be modified by a prepositional phrase. Since a prepositional phrase includes a noun phrase, it may itself be followed by another prepositional phrase, ad infinitum, as in the children's song, "There's a hole in the bottom of the sea." This is similar to arithmetic expressions, which can have an arbitrary number of terms, e.g., a + b x c - d .... Ambiguity as to the order of operations in programming languages is avoided by applying fixed rules of precedence, like the convention for arithmetic expressions reflected in the "PEMDAS" acronym. The problem with English is that prepositional phrases do not always modify the immediately preceding noun, as in "a book on linguistics from the library." Often, too, a prepositional phrase will modify the sentence verb rather than a noun, and the meaning of a sentence depends on where each prepositional phrase is attached. "I saw a girl with a telescope" is a favorite example presented by computational linguists to illustrate prepositional phrase attachment ambiguity.
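
The ambiguity is easy to demonstrate with NLTK. Given a toy grammar (written for this example) that allows a prepositional phrase to attach either to the verb phrase or to a noun phrase, a chart parser returns two distinct trees for the sentence:

    import nltk

    # Toy grammar for this example: a PP may attach to the VP or to an NP.
    grammar = nltk.CFG.fromstring("""
        S -> NP VP
        NP -> Pronoun | Det N | Det N PP
        VP -> V NP | V NP PP
        PP -> P NP
        Pronoun -> 'I'
        Det -> 'a'
        N -> 'girl' | 'telescope'
        V -> 'saw'
        P -> 'with'
    """)

    parser = nltk.ChartParser(grammar)
    for tree in parser.parse("I saw a girl with a telescope".split()):
        print(tree)  # two parses: one per attachment site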

Translation continues to be a major goal in NLP, and a wide variety of formalisms have been developed for dealing with structural ambiguity in natural language. One of the most significant trends, beginning with the Candide project at IBM, was the use of statistical techniques for machine translation. In general, statistical approaches make use of large corpora for which translations in both a source and a target language are available. Statistical analysis has also been applied to the problem of translating between languages for which there are no bilingual corpora, effectively treating the source language as an encrypted version of the target.

Machine learning

Question answering systems, or virtual assistants, are another application of NLP that has gained attention recently. Both translation and question answering have benefited from advances in neural networks, a type of machine learning that attempts to mimic the interconnective activity of neurons in the human brain. Apple's Siri, perhaps the best known virtual assistant, was actually later than its competitors in adopting neural networks for speech recognition. More recently, however, Siri has also incorporated neural networks to improve voice synthesis.

Arguably, speech recognition and generation are only loosely related to NLP. In a more direct application, Google Translate has begun using recurrent neural networks in its Chinese-to-English translation. The Google Neural Machine Translation system combines recurrent (LSTM) layers with an "attention network" that connects the decoder layers to the encoder layers. With the publication of its word2vec neural net model, Google has also promoted the use of vectors to represent words in natural language text. Now widely adopted, pre-trained word vectors improve the accuracy of machine learning systems while reducing the need to invest in large sets of training examples.
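
A brief sketch of the idea using the gensim package (gensim 4.x API; the toy corpus is invented, and real applications would train on a large corpus or load published pre-trained vectors): each word is mapped to a dense vector, and geometric proximity between vectors approximates semantic similarity.

    from gensim.models import Word2Vec

    # Toy corpus invented for illustration; real systems train on millions of
    # sentences or load published pre-trained vectors.
    sentences = [
        ["the", "student", "read", "the", "book"],
        ["the", "teacher", "assigned", "the", "book"],
        ["the", "student", "skimmed", "the", "article"],
    ]
    model = Word2Vec(sentences, vector_size=50, min_count=1, epochs=200)

    # Nearby vectors suggest related words.
    print(model.wv.similarity("book", "article"))
    print(model.wv.most_similar("student", topn=2))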

Learning Executable Semantic Parsers for Natural Language Understanding by Percy Liang, Communications of the ACM, September 2016, Vol. 59 No. 9, Pages 68-76, gives an overview of the current state of the art in grammar-based parsing of natural language. Semantic parsing focuses less on an underlying grammar model for NL expressions themselves than on the logical forms to which they can be mapped. The logical forms, in turn, may be structured as executable programs, database queries, etc., depending on the application. Liang describes a system in which supervised machine learning techniques are used to optimize the selection of grammar rules used to parse the input expressions.
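
Stripped of the machine learning, the core idea can be illustrated in a few lines of Python: a rule maps a natural language pattern to an executable logical form whose result is the answer. The pattern, rule set, and knowledge base below are all invented for illustration and bear no relation to Liang's actual system.

    import re

    capitals = {"France": "Paris", "Japan": "Tokyo"}  # toy knowledge base

    # Each rule pairs an utterance pattern with an executable "logical form"
    # (here just a Python function over the knowledge base).
    rules = [
        (re.compile(r"what is the capital of (\w+)\??$", re.I),
         lambda m: capitals.get(m.group(1))),
    ]

    def parse_and_execute(utterance):
        for pattern, denotation in rules:
            match = pattern.match(utterance)
            if match:
                return denotation(match)
        return None

    print(parse_and_execute("What is the capital of France?"))  # Paris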

Sentiment analysis and text summarization

Advances in machine learning have spurred interest in other uses of natural language processing and, in turn, the development of a wide variety of tools for implementing machine learning algorithms. Many such tools are freely available as packages for the Python programming language. Natural Language Toolkit (NLTK) is a comprehensive set of text processing utilities for tokenization, parsing, tagging, and many other NLP tasks. Getting Started on Natural Language Processing with Python, by Nitin Madnani, presents a general introduction to NLP using examples from NLTK. From a blog entitled The Glowing Python, Text summarization with NLTK presents a program that creates summaries of narrative text passages by identifying the sentences with the most frequently occurring words.
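
In the spirit of that blog post (this is a fresh sketch, not the original code), a frequency-based summarizer can be written with NLTK in a dozen lines: score each sentence by how often its content words occur in the whole text, and keep the top scorers.

    from collections import Counter

    import nltk
    from nltk.corpus import stopwords
    from nltk.tokenize import sent_tokenize, word_tokenize

    nltk.download("punkt", quiet=True)      # newer NLTK may also need "punkt_tab"
    nltk.download("stopwords", quiet=True)

    def summarize(text, n_sentences=2):
        """Keep the sentences whose content words occur most often overall."""
        stop = set(stopwords.words("english"))
        words = [w.lower() for w in word_tokenize(text)
                 if w.isalpha() and w.lower() not in stop]
        freq = Counter(words)
        ranked = sorted(sent_tokenize(text),
                        key=lambda s: sum(freq[w.lower()] for w in word_tokenize(s)),
                        reverse=True)
        return " ".join(ranked[:n_sentences])

    text = ("Reading is becoming a lost art. Reading requires immersion. "
            "Multitasking does not improve productivity.")
    print(summarize(text, n_sentences=1))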

A variety of other Python packages can be used with NLTK for more specialized applications. Based on the PhD dissertation of Radim Rehurek, the gensim Python module facilitates semantic modeling from plain text and can be used to create word vectors and text summaries. TensorFlow, an open source library developed by Google for dataflow-based machine learning, was initially released with a Python API and has since been adapted for other programming languages. TensorFlow has become a popular tool for sentiment analysis, where, for example, reviews written in natural language are analyzed to determine whether the reviewer's opinion of the item being reviewed-- movie, product, etc.-- was positive or negative.
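
A minimal sentiment analysis sketch using TensorFlow 2.x's Keras API follows; the data is invented for illustration, and a real classifier would be trained on a large labeled review corpus. Reviews encoded as sequences of word indices are embedded, pooled, and classified as positive or negative.

    import tensorflow as tf

    # Toy reviews encoded as fixed-length sequences of word indices, with
    # labels 1.0 = positive, 0.0 = negative. (Data invented for illustration.)
    reviews = tf.constant([[3, 7, 2, 0], [5, 1, 9, 4], [3, 8, 2, 0], [6, 1, 9, 4]])
    labels = tf.constant([1.0, 0.0, 1.0, 0.0])

    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(input_dim=10, output_dim=8),  # word vectors
        tf.keras.layers.GlobalAveragePooling1D(),               # average over words
        tf.keras.layers.Dense(1, activation="sigmoid"),         # positive/negative
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy")
    model.fit(reviews, labels, epochs=50, verbose=0)
    print(model.predict(tf.constant([[3, 7, 2, 0]])))  # probability review is positive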

Many NLP applications make use of semantic modeling to access background and contextual knowledge in a systematic way, allowing them to extract useful information from input text. The use of standardized models for sharing data on the World Wide Web is often referred to as the Semantic Web. A collaborative effort, documented at schema.org, promotes schemas for structured data on the Internet. A New Look at the Semantic Web by Abraham Bernstein, James Hendler, Natalya Noy, Communications of the ACM, Vol. 59 No. 9, Pages 35-37, describes recent progress in semantic modeling on the Internet, and .NET tools for semantic modeling are available at LinkedDataTools.com. These schemas and tools make it possible to use information on the Internet to augment and refine text analysis for a broad range of NLP tasks.

Question generation

In his master's thesis, Automatic generation of multiple-choice tests (2010), Sergio Curto describes techniques for creating questions from expressions, typically sentences, that conform to recognizable patterns. Significant attention is given to the generation of distractors, or incorrect answers, that are semantically and grammatically matched with the question.

Curto first reviews a number of existing question generators, comparing a variety of features. Several of the systems reviewed are designed for foreign language study, particularly vocabulary, and as such are somewhat limited in scope. For the most part, they rely on manual review of the questions generated. Some of the systems use semi-supervised machine learning to identify expressions in the input text that are inappropriate for question generation. Curto describes three types of multiple choice questions generated by the systems he reviewed. Most of the systems were limited to cloze-type questions, where the answer consists of a word or phrase that has been removed from the question text, or to error detection questions used for language study.
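
A cloze question in its simplest form is easy to sketch in Python (the helper and example sentence are invented for illustration; distractor generation, the hard part, is discussed below):

    # Blank out one occurrence of the chosen answer to form the question stem.
    def make_cloze(sentence, answer):
        return {"stem": sentence.replace(answer, "_____", 1), "answer": answer}

    question = make_cloze(
        "Alvin Toffler described information overload in his 1970 book Future Shock.",
        "information overload",
    )
    print(question["stem"])    # Alvin Toffler described _____ in his 1970 book ...
    print(question["answer"])  # information overload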

Several sources for distractors are discussed-- WordNet®, a lexical database developed at Princeton University; Wikipedia, a popular community-maintained online encyclopedia; online dictionaries; and FrameNet, a lexical database based on semantic frames, developed at the International Computer Science Institute in Berkeley. In addition to patterns and machine learning, some of the systems use word relations to generate questions, but most use the lexical resources for generating distractors.
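
One common recipe for drawing distractors from WordNet, sketched below with NLTK's WordNet interface, is to collect co-hyponyms: other members of the correct answer's parent category, which are semantically related but distinct. The function name is invented for illustration.

    import nltk
    from nltk.corpus import wordnet as wn

    nltk.download("wordnet", quiet=True)

    def wordnet_distractors(word, n=3):
        """Collect co-hyponyms of `word` as distractor candidates."""
        candidates = []
        for synset in wn.synsets(word, pos=wn.NOUN):
            for hypernym in synset.hypernyms():
                for sibling in hypernym.hyponyms():   # same parent category
                    for lemma in sibling.lemma_names():
                        name = lemma.replace("_", " ")
                        if name.lower() != word.lower() and name not in candidates:
                            candidates.append(name)
        return candidates[:n]

    print(wordnet_distractors("telescope"))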

The rule based multiple choice question generator proposed by Curto comprises a processor that accepts input from two sources. The first is a pre-processed version of the NL text from which the questions are to be generated, and the second a file of manually generated rules. The pre-processing identifies relationships among entities or types that correspond to expressions in the text, and the rules specify templates for constructing questions for specific relationships. As implemented, the system uses the same input text as a source for distractors, which are selected based on shared features and proximity with the expression used to create the question.

A somewhat similar approach to Curto's is described by Naveed Afzal of the Mayo Clinic Department of Health Sciences Research in a 2015 paper, Automatic Generation of Multiple Choice Questions using Surface-based Semantic Relations. Afzal's study focused on the biomedical domain, which is more complex than generic domains like online news or language instruction. Like Curto, Afzal uses a preprocessor to identify concepts in the input text and the relationships among them. A significant feature of Afzal's preprocessor is its use of unsupervised machine learning (Relation Extraction), which avoids the expense of creating or otherwise obtaining a manually annotated training corpus. Another advantage of the unsupervised approach is that it does not require prior knowledge of named entities or relation types, which matters particularly in the biomedical domain, where new ones are constantly being added.

Input text in Afzal's study is first preprocessed by the GENIA tagger, which performs part-of-speech tagging and named entity recognition for several specific named entity types relevant to the biomedical domain. The tagged input text is then searched for surface patterns, each pattern comprising two named entities separated by verbs or other content word sequences. Afzal states that the large number of named entities in biomedical literature makes it infeasible to use a dictionary or training set to recognize them, but it is not clear whether any named entities are recognized other than those returned by the GENIA tagger.

After identifying surface patterns in the input text, Afzal evaluated them intrinsically. The extracted patterns were first ranked for relevance using a variety of information-theoretic and statistical measures, and thresholds based on the rankings were established to select the most relevant patterns. The same ranking methods were then applied to patterns extracted from an annotated corpus (GENIA EVENT) in which events relating pairs of named entities had been tagged manually. Finally, precision and recall were measured by comparing the sets of tagged entity pairs with those in the patterns recognized by the unsupervised ranking method. Ranking criteria and thresholds were adjusted to maximize the precision of the unsupervised pattern selection method.

Afzal emphasizes that for a completely automated question generation system, precision is more important than recall. Precision refers to the proportion of selected patterns that are actually relevant, whereas recall measures the proportion of relevant patterns in the entire input text that are selected. In other words, high precision means few errors, but many relevant patterns may be missed. High recall, on the other hand, implies that most or all of the relevant patterns are recognized, but many irrelevant ones may be used as well.
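
The two measures reduce to a few lines of Python (the sets here are invented for illustration):

    # `selected` is the set of patterns the system chose; `relevant` is the
    # set it should have chosen.
    def precision_recall(selected, relevant):
        true_positives = len(selected & relevant)
        precision = true_positives / len(selected) if selected else 0.0
        recall = true_positives / len(relevant) if relevant else 0.0
        return precision, recall

    selected = {"p1", "p2", "p3"}
    relevant = {"p1", "p2", "p4", "p5"}
    print(precision_recall(selected, relevant))  # (0.67, 0.5): fewer errors, more misses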

In addition to evaluating the recognized patterns intrinsically, based on the comparison with the annotated corpus, Afzal submitted questions produced by the system to biomedical experts for extrinsic evaluation. The questions evaluated were generated from a set of templates matched with the recognized patterns. The evaluators rated each question for readability, relevance, usability and other factors, and judged a significant proportion of the questions to be usable without further modification.

Ontology-based approach

An alternative approach to multiple choice question generation is described in Ontology-Based Multiple Choice Question Generation, by Maha Al-Yahya, Information Technology Department, College of Computer & Information Sciences, King Saud University, Riyadh, Saudi Arabia. Instead of NL text, Al-Yahya leverages the Semantic Web, using OWL (Web Ontology Language) ontologies that conform to the World Wide Web Consortium (W3C) Resource Description Framework (RDF). RDF essentially models data as a directed graph in which each edge forms a triple comprising a subject, a predicate (the connecting edge), and an object; OWL adds class and property definitions on top of this structure. This infrastructure readily supports the extraction of valid statements from which multiple choice questions can be generated. Strategies used to extract statements include identifying class memberships, collecting statements in which the subject or the object is an individual, and constructing items from property axioms. Question generation is then accomplished by removing the subject or object from an extracted statement, which becomes the correct answer, and generating appropriate distractors.
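
A minimal sketch of the triple-to-question idea using the rdflib Python package follows; the namespace and facts are invented, and Al-Yahya's system operates on full OWL ontologies rather than bare RDF triples like these.

    from rdflib import Graph, Namespace

    EX = Namespace("http://example.org/")  # invented namespace and facts
    g = Graph()
    g.add((EX.Paris, EX.isCapitalOf, EX.France))
    g.add((EX.Tokyo, EX.isCapitalOf, EX.Japan))

    # Remove the subject from each statement to form a question stem;
    # the removed subject becomes the correct answer.
    for subject, _, obj in g.triples((None, EX.isCapitalOf, None)):
        country = obj.split("/")[-1]
        answer = subject.split("/")[-1]
        print(f"Which city is the capital of {country}?  (answer: {answer})")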

Questions generated for Al-Yahya's study were manually reviewed by instructors with experience in the relevant subject areas. Evaluation was based on a set of specific design rules relating to the form and relevance of the questions. Many of the problems observed were with the distractors rather than the correct answers or the questions themselves. The most common issues were the lack of a plausible distractor; more than one apparently correct answer, presumably meaning one or more of the intended distractors was too plausible; and grammatical inconsistency between the question stem and the corresponding answers.

Al-Yahya points out that some questions were deemed unusable in educational settings because they were too simple. To address this problem, the system would need some means to determine and control the difficulty of questions it generates. A related concern is that most of the questions only measure factual knowledge. Educators typically discuss levels of thinking and cognitive skills in terms of Bloom's taxonomy, described, e.g., by training consultant Donald Clark or education professor Leslie Owen Wilson. To test higher levels in Bloom's taxonomy, Al-Yahya suggests expanding ontologies to include cause and effect relationships or supplementing them with "task templates" relating concepts in different subject domains.

Combining technologies

While automated multiple choice test generation has clearly benefited from advances in NLP and the Semantic Web, further improvements may be realized by new combinations. One possibility would be to start with a learning semantic parser as suggested by Liang, configured to map NL expressions to ontology structures. The resulting ontologies could be used as input for a system like the one described by Al-Yahya. Evaluation of the multiple choice questions thus generated could in turn be used to produce training sets for the semantic parser.

Soylent: A Word Processor with a Crowd Inside, by Michael S. Bernstein, Greg Little, Robert C. Miller, Björn Hartmann, Mark S. Ackerman, David R. Karger, David Crowell, Katrina Panovich, Communications of the ACM, Vol. 58 No. 8, Pages 85-94, suggests how Mechanical Turk could be used interactively within a word processor to implement crowdsourced grammar checking. By combining a crowdsourcing system like Mechanical Turk with the well-established statistical analysis methods for multiple choice exams, the process of creating training sets for a learning semantic parser could itself be automated.

Applications

Following a widely used business model in which a basic open source version of a software application is freely distributed and a more fully featured version requires payment of a fee, an open source module could be developed that, for example, generates questions exclusively from a specific text. Subscription fees could then be charged for access to additional question sets (implemented as ontologies or database views), templates providing interoperability among ontologies for different subject domains, or updates to the features and parameters of the semantic parser on which the multiple choice question generator is built.

An obvious customer base for the proposed tool would be schools and other organizations in the education industry. As suggested above, the broader market might also include retail sellers, insurers, and news publishers that sponsor online forums. In essence, the more it is actually read, the greater the value of any book, news article, user manual, contract, policy statement, or other natural language text. Anyone who produces such material, then, is a potential customer.