Wikipedia is a vast encyclopedia with a unique twist: its content is generated exclusively by the site’s users, making it the world’s largest online collaborative effort. Its articles are written, edited, and expanded by a diverse group of contributors, including students, subject-matter experts, professional researchers, and enthusiastic amateurs. This article introduces algorithms designed to characterize the features of over 38 million Wikipedia web pages, implemented as part of an intelligent answer retrieval system. The data-driven model incorporates demand-driven aggregation of information sources, mining, and analysis, using MongoDB for data storage.
Introduction
Wikipedia is one of the openly editable knowledge repositories used by people all over the web. A wiki is a collection of web pages where any user can contribute to and modify the content. The first wiki was WikiWikiWeb, a website founded in 1995 to facilitate the exchange of ideas between computer programmers. Wikis enable users not only to write new articles but also to comment on and edit existing ones. Our capability for data generation has never been as powerful as it is today. Every day, 2.5 quintillion bytes of data are created; with over 7 billion web pages already online and human knowledge expanding at a skyrocketing pace, managing this volume of data for everyday use is a pressing problem. The era of Big Data has arrived. In a single minute on the web, 216,000 photos are shared on Instagram, 1.8 million posts are liked on Facebook, 72 hours (three days’ worth) of video is uploaded to YouTube, Google performs 2 million searches, hundreds of new websites are created, 204 million emails are sent, and millions of tweets are posted. Handling Big Data has therefore become a central challenge: data has become the new oil.
This article discusses algorithms developed to manage data in a NoSQL (“Not Only SQL”) database on the MongoDB platform. MongoDB is a cross-platform, document-oriented database designed with both scalability and developer agility in mind. Instead of storing data in tables and rows as in a relational database, MongoDB stores JSON (JavaScript Object Notation)-like documents with dynamic schemas, making data integration in certain types of applications easier and faster. As a NoSQL database, MongoDB avoids the traditional table-based relational structure in favor of a more flexible model, offering efficient utilization of RAM, no complex joins, deep query ability, and ease of setup.
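As a rough illustration of this document model, here is a minimal Python sketch using the pymongo driver; the connection string, database, collection, and field names are illustrative assumptions, not details from the system described in this article:

```python
# Minimal sketch of MongoDB's document model, using pymongo.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")  # assumes a local MongoDB instance
pages = client["wiki_demo"]["pages"]                # database and collection names are made up

# Documents in the same collection need not share a schema.
pages.insert_one({"title": "Tesla, Inc.", "infobox": {"CEO": "Elon Musk"}})
pages.insert_one({"title": "JSON", "designed_by": "Douglas Crockford"})  # different fields

# "Deep query ability": dot notation reaches into nested documents.
match = pages.find_one({"infobox.CEO": "Elon Musk"})
print(match["title"])  # -> Tesla, Inc.
```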
Working of the Algorithm
A. Retrieval of Wikipedia Pages
To extract Wikipedia pages, one must first understand their structure. Each Wikipedia page can be downloaded in JSON (JavaScript Object Notation) format, a lightweight data-interchange format. JSON is an open standard format that uses human-readable text to transmit data objects consisting of attribute-value pairs; it is the most common data format for asynchronous browser/server communication, having largely replaced the XML (Extensible Markup Language) originally used by AJAX (Asynchronous JavaScript and XML). JSON is a language-independent data format derived from JavaScript, but code to generate and parse JSON is available in many programming languages. Its official Internet media type is application/json, and its filename extension is .json. The format was originally specified by Douglas Crockford. JSON serializes data as a series of key-value pairs, making it a common method for sending data between the various components of an application.
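As a concrete example, a page’s JSON can be requested through the public MediaWiki API. The sketch below uses the standard api.php query interface; it is a minimal illustration with error handling omitted:

```python
# Minimal sketch: fetch a Wikipedia page's metadata as JSON via the MediaWiki API.
import requests

API_URL = "https://en.wikipedia.org/w/api.php"
params = {
    "action": "query",   # the standard query module
    "format": "json",    # request a JSON response
    "titles": "JSON",    # the page title to look up
    "prop": "info",      # basic page metadata (page ID, namespace, etc.)
}
data = requests.get(API_URL, params=params, timeout=10).json()

# The response keys each page by its page ID.
for page_id, page in data["query"]["pages"].items():
    print(page_id, page["title"], page["ns"])
```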
JSON is easy for humans to read and write, and equally easy for machines to parse and generate (machines do not understand human language directly; programs and data are ultimately encoded in binary form). A Wikipedia page’s JSON contains the page ID, title, revision, namespace, and various other details about the webpage. To locate a required page within this vast network of webpages, the page ID and title are used as convenient search handles. Since Wikipedia’s content is created solely by its users, duplicate pages can occur; therefore, only user-verified pages, identified by their unique page IDs, are displayed.
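At the storage layer, this uniqueness can be enforced directly. Below is a small sketch (collection and field names are assumptions) that uses a unique index so a duplicated page ID is rejected on insert:

```python
# Sketch: enforce one document per Wikipedia page ID with a unique index.
from pymongo import MongoClient
from pymongo.errors import DuplicateKeyError

pages = MongoClient("mongodb://localhost:27017/")["wiki_demo"]["pages"]
pages.create_index("page_id", unique=True)

try:
    pages.insert_one({"page_id": 1234, "title": "Example"})
    pages.insert_one({"page_id": 1234, "title": "Example (duplicate)"})
except DuplicateKeyError:
    print("Duplicate page ID rejected")
```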
The following diagram shows the JSON format of a typical page on Wikipedia. Pages are downloaded in this standard format and saved for further processing; the resulting data can be enormous, potentially reaching terabytes.
Figure 1: A schematic representation of the JSON format of a Wikipedia page. Text mining techniques are applied to the JSON to extract the relevant text
B. Saving the data in MongoDB
MongoDB is one of several database types that emerged in the mid-2000s under the NoSQL banner. Unlike relational databases, which use tables and rows, MongoDB is built on an architecture of collections and documents. Documents consist of key-value pairs and are the basic unit of data in MongoDB; collections, which hold sets of documents, play the role that tables play in relational databases.
Like other NoSQL databases, MongoDB supports dynamic schema design, allowing documents in a collection to have different fields and structures. For storage and data interchange it uses BSON (Binary JSON), a binary representation of JSON-like documents. These properties underlie the flexibility, efficient use of RAM (Random Access Memory), join-free querying, deep query capability, and ease of setup noted earlier.
When the JSON form of a webpage has been processed with text mining techniques, the page ID, title, and infobox are separated out and saved in MongoDB; a page without an infobox is eliminated. MongoDB thus holds the page ID, title, and infobox of each retained page as structured fields, and the now-structured data allows for easy mapping.
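A condensed sketch of this step is shown below. It assumes each downloaded page is a JSON object carrying an id, a title, and (optionally) an infobox; the exact field names vary by dump format and are assumptions here:

```python
# Sketch: keep only pages with an infobox; store page ID, title, and infobox in MongoDB.
import json
from pymongo import MongoClient

pages = MongoClient("mongodb://localhost:27017/")["wiki_demo"]["pages"]

def save_page(raw_json: str) -> bool:
    """Insert one page; return False if it lacks an infobox and is skipped."""
    page = json.loads(raw_json)
    infobox = page.get("infobox")
    if not infobox:                      # pages without an infobox are eliminated
        return False
    pages.insert_one({
        "page_id": page["id"],           # field names assumed; real dumps differ
        "title": page["title"],
        "infobox": infobox,
    })
    return True
```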
Figure 2: The structure of the infobox in Wikipedia assists in data extraction.
C. Storing of Keywords in Database
To design a user-friendly search assistant, it is crucial to allow users the freedom to use any synonym of the actual keyword stored in the database. This requires careful mapping of synonyms to the correct answers. Keywords and their respective synonyms are stored in the database to facilitate this process.
The system functions as follows: the user’s input is fetched from the browser and stripped of unnecessary elements before being processed. Each remaining word is matched against the keywords and their synonyms in the database, and the matching keyword (or the keyword behind a matched synonym) is then used to query MongoDB to retrieve and display the correct answer.
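One plausible layout for this mapping, assuming a dedicated synonyms collection that links each stored keyword to its synonym list (the names below are illustrative, not the authors’ exact schema):

```python
# Sketch: map a word from the user's query back to its canonical stored keyword.
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017/")["wiki_demo"]
db["synonyms"].insert_one({"keyword": "CEO", "synonyms": ["chief", "boss", "head"]})

def normalize(word: str) -> str:
    """Return the canonical keyword if the word is a known synonym; else the word itself."""
    entry = db["synonyms"].find_one({"synonyms": word.lower()})  # matches inside the array
    return entry["keyword"] if entry else word

print(normalize("boss"))  # -> CEO
```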
D. Separation of Key Attributes
For an intelligent answer retrieval system, speed is of utmost importance. To achieve fast responses, the system must quickly and effectively locate the required data, which means filtering out everything but the most relevant parts of the query. Filtering separates keywords from the full question, which typically contains prepositions, pronouns, conjunctions, verbs, adverbs, adjectives, or articles, collectively referred to here as “eliminators.” Only the keywords are needed for effective query-to-data mapping, so eliminators are removed up front.
The designed algorithm works as follows (a minimal code sketch appears after the list):
- An array is created containing all possible prepositions, pronouns, conjunctions, verbs, adverbs, adjectives, or articles (eliminators) that might be used by the user.
- The question asked by the user in the browser is taken as input.
- The input is compared with the array of eliminators.
- All matches found between the user input and the array of eliminators are eliminated.
- The resulting filtered input, now consisting of keywords, is used to query MongoDB to find and provide the answer to the user. If a synonym of a particular word is included in the query, the main word related to it is identified, the synonym is replaced by that word, and the search results are obtained accordingly.
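A minimal sketch of this filtering step, with a deliberately small eliminator array (a real deployment would enumerate far more words):

```python
# Sketch: strip "eliminators" (stop words) from the question, keeping only keywords.
ELIMINATORS = {
    "is", "the", "of", "a", "an", "in", "on", "at", "and", "or",
    "he", "she", "it", "was", "were", "to", "for", "with",
}  # illustrative subset; the real array would be much larger

def extract_keywords(question: str) -> list[str]:
    words = question.lower().strip("?!.").split()
    return [w for w in words if w not in ELIMINATORS]

print(extract_keywords("Who is the CEO of Tesla?"))  # -> ['who', 'ceo', 'tesla']
```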
E. WikiMatrix
In this system, WikiMatrix identifies the root word by matching the remaining query words against their synonyms. WikiMatrix is also a directory that lists wiki software and wiki hosting services and highlights how they differ: it allows custom search queries to find a wiki that exactly meets one’s needs (by default, only general features are searchable) and supports comfortable side-by-side comparisons of the features of many different wiki engines.
A wiki, of course, is more than a list of features. Why not simply use an online, hosted wiki? Because users are not always online, especially on mobile devices. For personal notes meant only for oneself, carrying the notes along (on a USB stick or a mobile device) is just as effective, and if the wiki must be shared, its data can be hosted on a shared directory for a more conventional collaborative wiki.
F. Smart Answer Retrieval System
To design a smart answer retrieval system, the user query entered in the search box is broken down into keywords, with unnecessary words (eliminators) removed by comparing the input against the eliminator array described in the previous section. The remaining keywords are then looked up in the synonym database to find their root forms.
The system faces challenges in mapping these keywords to the extensive text on each page in MongoDB, as this can be time-consuming. To address this, the system initially matches the title and the first few lines of text on the page with the keywords. If a match is found, it further searches within the infobox attribute for additional keywords to provide the answer. This method reduces the time complexity compared to scanning the entire page.
Here’s how the algorithm works (a sketch with a worked example follows the list):
- The user query is broken down into keywords.
- An array of eliminators is used to remove unnecessary words from the keywords.
- The remaining keywords are matched with synonyms in the database using WikiMatrix.
- If a keyword has a synonym in WikiMatrix, it is replaced with the main word.
- Keywords found in the WikiMatrix are assigned flag=1; the remaining keywords in the query are given flag=0.
- Keywords with flag=1 are checked against the MongoDB titles.
- If a match is found, the infobox attribute values for keywords with flag=0 are extracted and displayed as the answer.
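The sketch below ties these steps together under the assumptions used earlier (a pages collection keyed by title, with an embedded infobox document); treating “appears as a stored title” as a stand-in for the WikiMatrix check is a simplification, not the authors’ exact method:

```python
# Sketch of the flag-based lookup: the flag=1 keyword selects a page by title,
# and the flag=0 keywords are matched against that page's infobox keys.
from pymongo import MongoClient

pages = MongoClient("mongodb://localhost:27017/")["wiki_demo"]["pages"]

def answer(keywords: list[str]) -> str | None:
    # flag=1 if the keyword matches a stored title (stand-in for the WikiMatrix check)
    flags = {w: 1 if pages.find_one({"title": w.title()}) else 0 for w in keywords}
    entity = next((w for w, f in flags.items() if f == 1), None)
    if entity is None:
        return None
    page = pages.find_one({"title": entity.title()})
    # Match the remaining keywords against infobox keys, case-insensitively.
    for word, flag in flags.items():
        if flag == 0:
            for key, value in page.get("infobox", {}).items():
                if key.lower() == word.lower():
                    return value
    return None

# Worked example from the text: "Who is the CEO of Tesla?"
pages.insert_one({"page_id": 1, "title": "Tesla", "infobox": {"CEO": "Elon Musk"}})
print(answer(["who", "ceo", "tesla"]))  # -> Elon Musk
```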
For example, if the user asks, “Who is the CEO of Tesla?”, the keywords are CEO, Tesla, and who; the rest of the words are eliminators. Tesla is assigned flag 1 since it lies in the WikiMatrix, while the remaining keywords are assigned flag 0. The infobox JSON objects are scanned to find the title Tesla, and once the page is retrieved, the other infobox attributes are searched to find CEO, so the system precisely returns “Elon Musk.” This makes the retrieval system efficient and accurate even for complex queries.
The use of the infobox for smart answers is advantageous because it contains structured, accurate information that is easier to match with keywords. The keys of the infobox are stored in JSON format in MongoDB, making the process efficient and reliable.
Advantages and Limitations
A. Merits
- Complex Data Handling: Capable of managing more complex data structures.
- Content Presentation: Smart answers allow content designers to present intricate information quickly and simply.
- Improved Search: Unlike many existing search engines, which struggle with complex queries, smart answers provide precise information, eliminating the need for users to visit multiple sites.
- Efficiency: Reduces time wasted on searching across various sites.
- Accuracy: Delivers more accurate results.
B. Demerits
- Infobox Dependency: Wikipedia pages lacking an infobox cannot be used in this system, as matching keywords with a large amount of text is impractical.
Scope of Improvement and Future Aspects
A. Scope of Improvement
Future enhancements will focus on integrating bLADE Wiki, a free mobile personal wiki for managing notes. It runs on Windows desktops and Windows Mobile PDAs and smartphones, allowing synchronization across devices. It can also operate from a USB memory stick, providing portability. As a standalone application, it doesn’t rely on network connectivity, as files are stored locally.
B. Future Aspects
- Research Aid: This technique may assist ongoing research.
- Search Engine Application: It can be effectively implemented in search engines.
- Government and Standards Adoption: Governments and standards bodies are considering or adopting elements of smart answers.
- MongoDB Potential: As a cross-platform, document-oriented database, MongoDB makes data integration in certain kinds of applications easier and faster, and its role in systems like this one can grow.
Conclusion
In a world with an ever-increasing demand for speed, there is a constant need to design new and improved algorithms. This work aims primarily to enhance the speed of current search engines and to handle more complex queries than existing algorithms manage, thereby simplifying the search process for the average user.
The algorithm described in this article aims to deliver “Intelligent Answers” to search queries. It efficiently manages the vast database of Wikipedia, allowing end users to obtain the desired answers quickly and easily. Sifting through the numerous blue links that search engines return is typically tedious, and users cannot be certain which link contains the correct answer, often landing on many irrelevant pages before finding the right one.
This algorithm effectively identifies and displays relevant answers, eliminating the need to navigate through numerous unnecessary links, and can thus serve as a valuable component in enhancing the performance of modern search engines.
I am Aditi Choudhary, currently a graduate student at the University of Southern California, where I am furthering my expertise in computer science. Prior to this, I completed my undergraduate studies at SRM University in India, and now, at USC, I am deepening my knowledge in areas such as data analytics and machine learning. Alongside my studies, I have worked as a Software Development Engineer Intern at Amazon Advertising. This unique blend of academia and professional engagement allows me to apply theoretical insights to practical challenges, enhancing my ability to innovate and solve complex problems within the tech community.