Scientific Information Retrieval Essay Example | Topics and Well Written Essays

Introduction

This paper attempts to explain how new computer-based information retrieval (IR) methods have become an essential tool for the scientific community. The vast body of data created by the sciences has created a problem: how can this data be converted into usable knowledge? Data is distributed worldwide, and work from various universities and institutes is published and posted weekly, daily, even hourly in thousands of journals and reports. To make sense of the data and understand their particular area of research, scientists must continually keep up with this published literature. The personal computer (PC) has become a major tool for doing so, and IR methods have improved enormously over the last fifty years to aid these efforts.

Essentials of Information Retrieval

1. IR

Information retrieval is concerned with identifying, within a large document collection, a subset of documents whose content is most relevant to a user's need.

1a. The Boolean query

The common search, or Boolean query, that computer users perform every day is the submission of a term to a search engine. The engine is programmed with a Boolean algorithm that finds the documents containing the term included in the search, and it is supported by an index containing all terms in the database. This simple form of Boolean query, which is efficiently implemented over large databases, suffers from several limitations:

  The number of retrieved documents is typically prohibitively large.
  A substantial part of the retrieved documents is irrelevant to the user's information need.

1b. Similarity queries

A broadly used alternative to the Boolean query is the similarity query, which is typically based on the vector-space model. Under this setting, documents are viewed as (algebraic) vectors over terms. A query, q, may consist of many terms, and even comprise a complete document.
It too is viewed as a body of text, rather than merely as a combination of search terms, and is represented as a vector as well. The retrieval task reduces to searching the database for the document vectors that are most similar to the query vector. Other approaches based on the vector-space model also aim to reduce the dependency of the retrieved documents on the particular choice of query terms, and so improve retrieval. One way to do this is through the reweighting of query terms, where terms occurring within relevant documents receive a higher weight than those occurring in irrelevant ones. This process is called relevance feedback.

1c. Text categorization

A task often addressed by information retrieval systems is text categorization: the labeling of text with category tags from a predefined set of categories. There are two approaches to text categorization.

The first is the knowledge-engineering approach, in which a set of rules that determine the categorization of the database is encoded by an engineer, who writes the rules in consultation with an expert on the information in the database. This method has a drawback: the rules must be continually revised to keep up with the information in the database, which results in what is known as the knowledge engineering bottleneck.

The other is the machine learning (ML) approach, where a text classifier is viewed as a function learnt by an inductive process from a training set of example documents that have already been classified into a predefined set of categories. ML-based classification is partitioned into two types: hard and soft classification. Under hard classification a document is strictly assigned to a single category. In contrast, soft classification entails a ranking by relevance of the categories for each document.
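The contrast between the Boolean query of section 1a and the similarity query of section 1b can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three documents are invented, terms are weighted by raw term frequency (real systems usually use a scheme such as TF-IDF), and the function names `build_index`, `boolean_and` and `similarity_query` are hypothetical rather than taken from any particular search engine.

```python
import math
from collections import Counter, defaultdict

# Toy corpus; the document texts are invented for illustration.
DOCS = {
    "d1": "gene expression in yeast",
    "d2": "protein folding and gene regulation",
    "d3": "planetary data archive for spacecraft instruments",
}


def tokenize(text):
    """Lower-case the text and split it on whitespace."""
    return text.lower().split()


def build_index(docs):
    """Inverted index: term -> set of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def boolean_and(index, terms):
    """Boolean AND query: ids of documents that contain every query term."""
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()


def cosine(a, b):
    """Cosine of the angle between two term-frequency vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similarity_query(docs, query):
    """Rank documents by cosine similarity to the query, best match first."""
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(text))), doc_id)
              for doc_id, text in docs.items()]
    return sorted(((s, d) for s, d in scored if s > 0), reverse=True)


if __name__ == "__main__":
    index = build_index(DOCS)
    print(boolean_and(index, ["gene", "yeast"]))
    print(similarity_query(DOCS, "gene regulation in yeast"))
```

The Boolean query returns an unranked set and drops any document that misses a single term, whereas the similarity query still returns partially matching documents and orders them by score, which is the property the vector-space model is used for.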
Under soft classification, the classifier returns a number between 0 and 1, called the categorization status value (CSV).[1]

History of the internet

To touch on this subject only briefly: J.C.R. Licklider of MIT wrote a series of memos in which he envisioned what he called the "Galactic Network". Leonard Kleinrock of MIT published the first paper on packet switching in 1961. Fast forward to the last decade or two. In 1992, after numerous reorganizations, the Internet Activities Board was again reorganized and renamed the Internet Architecture Board, operating under the auspices of the Internet Society. The recent development and widespread deployment of the World Wide Web has brought with it a new community, as many of the people working on the WWW have not thought of themselves as primarily network researchers and developers. A new coordination organization was formed, the World Wide Web Consortium (W3C). Initially led from MIT's Laboratory for Computer Science by Tim Berners-Lee (the inventor of the WWW) and Al Vezza, W3C has taken on the responsibility for evolving the various protocols and standards associated with the Web.

Commercialization of the Internet involved not only the development of competitive, private network services, but also the development of commercial products implementing the Internet technology. In the early 1980s, dozens of vendors were incorporating TCP/IP into their products because they saw buyers for that approach to networking. Unfortunately, they lacked real information both about how the technology was supposed to work and about how the customers planned on using this approach to networking. Many saw it as a nuisance add-on that had to be glued on to their own proprietary networking solutions: SNA, DECNet, Netware, NetBios.
The DoD had mandated the use of TCP/IP in many of its purchases but gave little help to the vendors regarding how to build useful TCP/IP products.[2]

Why do we use the internet?

We use it for various reasons: to communicate, play games, and shop. But it was conceived to share data, and this is where information retrieval takes over. A powerful tool for the internet in any domain is the search engine.

Search Engines

Though a latecomer in the Internet family, the World Wide Web (WWW or the Web) has rapidly gained popularity and become the second most widely used application of the Internet. The publicity the WWW has gained is so great that many people naively equate the WWW with the Internet. The friendly user interface and the hypermedia features of the WWW have been attracting a significant number of users as well as information providers. As a result, the Web has become a sea of all kinds of data, making any query into the huge information reservoir extremely difficult. To overcome this difficulty in retrieving information from the WWW, more than two dozen companies and institutions quickly developed various search aids such as Lycos and Excite. However, since there are usually only one or two search aids for other Internet applications (e.g., Archie for FTP, and Veronica for Gopher), why have at least two dozen search engines been developed for the Web so far? The sheer number invites research. For instance, what features do various Web search engines offer? How do they differ from one another in performance? Is there a single Web search engine that outperforms all others in information retrieval?[3]

A very simple experiment was performed for this paper.
The search term "Internet search engines" was submitted to Google.[4] I received 1,780,000 results in 0.14 s. The first page of results was:

A list of search engines on the web (only the URL of each entry is included)

  www.freeality.com
  www.searchengineguide.com/searchengines.htm
  www.altavista.com
  www.internettutorials.net/engines.html
  www.metacrawler.com
  www.dogpile.com/
  www.lycos.com
  en.wikipedia.org/wiki/List_of_search_engines
  www.jewishagency.org/JewishAgency/English/Aliyah/Links/Banking+238.htm
  www.csd99.k12.il.us/NORTH/LIBRARY/search.htm

I then submitted the same term, "Internet search engines", to Google Scholar.[5] I received 202,000 results in 0.18 s. The first page of results was:

  Internet search engines - fluctuations in document accessibility
  W Mettrop, P Nieuwenhuysen - Journal of Documentation, 2001 - infotoday.com
  1. Internet search engines: Fluctuations in document accessibility. Wouter Mettrop. ... 8. Internet search engines: research on indexing functionality.

  Interactive Internet search: keyword, directory and query reformulation mechanisms compared - ualberta.ca [PDF]
  P Bruza, R McArthur, S Dennis - Proceedings of the 23rd annual international ACM SIGIR, 2000 - portal.acm.org
  ... Google was chosen as the mechanism to support standard internet query search because it seems to be one of the more effective newer generation search engines. ...

  Rising to the Top: Evaluating the Use of the HTML META Tag to Improve Retrieval of World Wide Web Documents through Internet Search Engines
  TP Turner, L Brackbill - Library Resources and Technical Services, 1998 - bubl.ac.uk

  C Hölscher, G Strube - Computer Networks, 2000 - Elsevier
  ... domain knowledge can be identified. Author Keywords: Expertise; Information retrieval; Internet search engines; Logfile analysis. ...
  The dynamics of competition in the internet search engine market - cdlib.org [PDF]
  N Gandal - International Journal of Industrial Organization, 2001 - Elsevier
  ... Author Keywords: Internet; Search engines; Entry; Empirical study. ...

  The anatomy of a large-scale hypertextual Web search engine - tugraz.at [HTML]
  S Brin, L Page - Computer Networks and ISDN Systems, 1998 - Elsevier
  ... Search engines index tens to hundreds of millions of Web pages involving a comparable number of distinct terms. ...

  An analysis of Internet search engines: assessment of over 200 search queries
  NG Tomaiuolo, JG Packer - Computers in Libraries, 1996 - portal.acm.org
  ... Source, Computers in Libraries archive Volume 16, Issue ...

  Lycos: design choices in an Internet search service
  MI Mauldin - IEEE Expert [see also IEEE Intelligent Systems and Their ...], 1997 - ieeexplore.ieee.org
  ... In December 1993, three more Internet search engines became available: JumpStation, World Wide Web Worm, and RBSE Spider (Repository-Based Software ...

A set of articles pertaining to search engines and information retrieval

Jacsó reports that Google Scholar is a good preliminary tool for a primary resource search, but not for extensive scholarly searches where detailed bibliographic information is required. Because it is free and requires no subscription, it is a widely used search engine, with an average of 129,000 hits per day.[6]

The internet is the road on which information is carried; still, the information must be organized into databases where scientists can find what they are looking for. There are many freely available indices of scientific literature; examples include the LANL e-Print archive, NCSTRL, UCSTRI, ML Papers, LTRS, NZDL,
Cora and CoRR. Data is now available in full text in many databases, but not without a price.

Directories

The first requirement when using these search engines is to find appropriate directories in which to look up our terms and retrieve information. How would I use the World Wide Web (WWW) to do this? Commercially available computers come with browsers that start us off. We then find a directory that sends us to sites we may find useful. One example is Web World, which lists "quality sites on the web" in a directory organized by category (http://www.webworldindex.com/):

  Arts: Art Animation, Art Galleries, Artists, Crafts, Design Arts, Performing Arts, Visual Arts, Shopping and Services, ...
  Business: Business to Business, Financial Services, Employment, Finance and Investment, Law Firms, Shopping and Services, ...
  Computers: Chats and Forums, Graphics, Hardware, Internet, Programming, Resources, Software, Shopping and Services, ...
  Education: Adult and Continuing Education, Career and Vocational, Distance Learning, Shopping and Services, ...
  Entertainment: Actors and Actresses, Movies and Film, Music, Performing Arts, Television, Shopping and Services, ...
  Government: Chats and Forums, Documents, Law, Military, Politics, US Government, Taxes, Web Directories, ...
  Health: Alternative Medicine, Beauty, Diseases and Conditions, Fitness, Nutrition, Shopping and Services, ...
  Kids and Teens: Arts, Books, Directories, Games, Pre School, Teen Life, Shopping and Services, ...
  News: Internet Broadcasts, Magazines, Newspapers, News and Media, Radio, Weather, ...
  People: Advice, Celebrities, Cultures and Groups, Families, Food and Drink, Relationships, Religion and Spirituality, ...
  Recreation and Sports: Automotive, Hobbies, Outdoors, Sports, Tour Operators, Travel, Shopping and Services, ...
  Reference: Almanacs, Calendars, Dictionaries, Encyclopedia, Libraries, Phone Numbers and Addresses, Thesauri, ...
  Science: Agriculture, Animals Insects and Pets, Biology, Engineering, Economics, Physics, Psychology, Social Science, ...
  USA Real Estate: Alabama, California, Colorado, Florida, Indiana, Maryland, Pennsylvania, Texas, Utah, Washington, ...
  Regional: Countries, Regions, US States, ...

Here we get a large body of knowledge covering a whole range of interests, which may be of value to any individual who is using the web for self-fulfillment or to research a subject they want to learn more about. Scientists don't work like this. Most professional researchers want specific information, such as whether someone else is working in the same field as they are.

NASA (the National Aeronautics and Space Administration) developed its directory, the Planetary Data System (PDS), in the 1980s. It was developed both to prescribe the metadata to be collected for the planetary science data archive and to design the data set catalog, a high-level inventory of the data holdings in the archive. The data model, implemented in a relational schema for the catalog database, supports sophisticated constraint-based searches for data sets based on their relationships to other modeled entities, such as the spacecraft, instruments and target bodies that were involved in the collection of the data.

The Semantic Web now provides technologies that allow information to be easily read and consumed by computer software. These technologies, such as XML, the Resource Description Framework (RDF), and RDF Schema (RDFS), provide information that allows computer processing of, and reasoning about, web information. This capability, however, depends on the existence of domain ontologies.

The Planetary Data System (PDS) is the official science data archive for NASA's planetary science community. As such, it contains tens of terabytes of data collected from over thirty years of solar system exploration, and it will grow exponentially in the next few years.
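A constraint-based search over a relational catalog, of the kind described above for the PDS data set catalog, can be sketched with Python's built-in sqlite3 module. Everything specific here is an assumption made for illustration: the table and column names, the spacecraft and target identifiers, and the function `find_data_sets` are hypothetical and do not reproduce the actual PDS schema or its contents.

```python
import sqlite3

# Hypothetical miniature catalog; table and column names are assumptions,
# not the real PDS schema. All rows are invented.
SCHEMA = """
CREATE TABLE data_set (id TEXT PRIMARY KEY, title TEXT);
CREATE TABLE instrument (id TEXT PRIMARY KEY, spacecraft TEXT);
CREATE TABLE data_set_instrument (
    data_set_id TEXT REFERENCES data_set(id),
    instrument_id TEXT REFERENCES instrument(id)
);
CREATE TABLE data_set_target (
    data_set_id TEXT REFERENCES data_set(id),
    target TEXT
);
"""


def make_catalog():
    """Create an in-memory catalog and fill it with invented rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO data_set VALUES (?, ?)", [
        ("DS1", "Camera images of target X"),
        ("DS2", "Spectra of target Y"),
        ("DS3", "Camera images of target Y"),
    ])
    conn.executemany("INSERT INTO instrument VALUES (?, ?)", [
        ("CAM", "ORBITER-A"),
        ("SPEC", "ORBITER-B"),
    ])
    conn.executemany("INSERT INTO data_set_instrument VALUES (?, ?)", [
        ("DS1", "CAM"), ("DS2", "SPEC"), ("DS3", "CAM"),
    ])
    conn.executemany("INSERT INTO data_set_target VALUES (?, ?)", [
        ("DS1", "X"), ("DS2", "Y"), ("DS3", "Y"),
    ])
    return conn


def find_data_sets(conn, spacecraft, target):
    """Ids of data sets collected by the given spacecraft for the given target."""
    rows = conn.execute(
        """
        SELECT DISTINCT ds.id
        FROM data_set AS ds
        JOIN data_set_instrument AS di ON di.data_set_id = ds.id
        JOIN instrument AS i ON i.id = di.instrument_id
        JOIN data_set_target AS dt ON dt.data_set_id = ds.id
        WHERE i.spacecraft = ? AND dt.target = ?
        ORDER BY ds.id
        """,
        (spacecraft, target),
    ).fetchall()
    return [row[0] for row in rows]


if __name__ == "__main__":
    catalog = make_catalog()
    print(find_data_sets(catalog, "ORBITER-A", "Y"))
```

The point of the sketch is that the search is expressed through relationships (data set to instrument to spacecraft, data set to target) rather than through keywords, which is what distinguishes this kind of catalog from a general web directory.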
PDS Data Model

The model centered on data sets (i.e. collections of data products) and a data set's relationships to other planetary science entities. Figure 2 shows the progression of the data model's development, from structure diagrams, through the Entity-Relationship model, to implementation in a relational schema. Finally, in order to include the captured information on archive volumes as text files, the Object Description Language (ODL) was used.

Data Model Development [7]

An ontology is the product of an attempt to formulate an exhaustive and rigorous conceptual schema about a domain. It is typically a hierarchical data structure containing all the relevant entities and their relationships and rules within that domain (e.g., a domain ontology). Stanford's ontology tool, Protégé,[8] was used to capture the object classes and their attributes from both the relational schema and the PDS data dictionary. The Resource Description Framework (RDF) was applied to make the data found in PDS available both to human readers and on the web. RDF/XML allowed data from the RDF set to be converted into graphs with a user-friendly end product; text conversion was done by Lucene.[7]

Lucene

Lucene is a free, open source information retrieval library originally created in Java by Doug Cutting. It is supported by the Apache Software Foundation and is released under the Apache Software License. Lucene has been ported to programming languages including Delphi, Perl, C#, C++, Python, Ruby and PHP.
While suitable for any application that requires full-text indexing and searching capability, Lucene has been widely recognized for its utility in the implementation of Internet search engines and local, single-site searching. At the core of Lucene's logical architecture is the idea of a document containing fields of text. This flexibility allows Lucene's API to be independent of file format. Text from PDF, HTML and Word documents, as well as many others, can be indexed so long as the textual information can be extracted.[9]

Bioinformatics

2a. Text mining

The biomedical sciences are turning their attention away from small-scale projects involving only a small number of genes or proteins and toward large-scale endeavors involving genome-wide analysis, making it necessary to capture extended biological networks from the literature. Most information on biological discovery is stored in descriptive full text. Manually sorting through the quantity that needs review is both expensive and difficult. Text mining is a new method for this task; one text-mining system, Textpresso, is characterized by two key features: an ontology and the capability to search full text. The ontology is organized into categories that facilitate searches over broader classes of entities.

2b. The Textpresso Ontology

An ontology is a catalog of the types of objects and concepts, even abstract concepts, that are found in a domain of interest. Abstracts, titles and full text are processed in order to mark them up semantically according to the ontology constructed. An ontology helps clarify a domain's semantics for everyday use, as has been demonstrated by the Gene Ontology (GO; Gene Ontology Consortium 2000). GO terms are rich in biological meaning and synonyms, even though they are used in natural-language prose. They form the basis of three corresponding Textpresso categories and were added to its 30 categories.
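The idea of marking up text with ontology categories, and then searching by category rather than by exact word, can be illustrated with a small sketch. It is not Textpresso's implementation: it uses Python's `re` module in place of Textpresso's own tooling, and the three categories, their patterns, the gene-like names and the functions `mark_up` and `search` are all hypothetical.

```python
import re

# Hypothetical categories and patterns, invented for illustration only.
CATEGORY_PATTERNS = {
    "gene": re.compile(r"\b[a-z]{3}-\d+\b"),
    "regulation": re.compile(r"\b(?:up|down)[- ]?regulat(?:es|ed|ion)\b",
                             re.IGNORECASE),
    "association": re.compile(r"\b(?:binds(?: to)?|interacts with)\b",
                              re.IGNORECASE),
}


def mark_up(sentence):
    """Return (start, end, category, text) for every category match, in order."""
    tags = []
    for category, pattern in CATEGORY_PATTERNS.items():
        for match in pattern.finditer(sentence):
            tags.append((match.start(), match.end(), category, match.group()))
    return sorted(tags)


def search(sentences, required_categories):
    """Sentences that contain at least one term from every required category."""
    required = set(required_categories)
    return [s for s in sentences
            if required <= {category for _, _, category, _ in mark_up(s)}]


if __name__ == "__main__":
    # Invented example sentences with made-up gene-like names.
    corpus = [
        "abc-1 interacts with xyz-2 in the test sample.",
        "Expression is downregulated after heat shock.",
    ]
    for tag in mark_up(corpus[0]):
        print(tag)
    print(search(corpus, ["gene", "association"]))
```

A query such as "any gene in the same sentence as any association term" then matches sentences regardless of which particular gene name or verb was used, which is the benefit of grouping terms into categories.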
The ontology is populated by 14,500 Perl (Practical Extraction and Report Language) regular expressions, each of which covers from one to eight words. Over two years, 3.9 million terms in 16,000 abstracts and 2,000 full-text papers have been marked up at the Textpresso lab at Caltech.[10]

In this paper we have reviewed the essentials of information retrieval. The history of the internet was included to provide an understanding of how information reaches end users all over the world. Scientists themselves have started using databases made available over the internet, and many of the most advanced databases, such as PDS, have worked hard to make their data digestible by humans and have adopted techniques such as RDF to do this. After the Human Genome Project, biologists were quick to realize that, with the vast body of data being created, efforts must be made to organize this information in order to transform it into integrated knowledge. It is possible that we will soon see the web fragment into domains of knowledge and information, with so many disciplines building their own webs that academics and researchers will be talking and working only amongst themselves.

References

1. Hagit Shatkay. "Hairpins in bookstacks: Information retrieval from biomedical text." Briefings in Bioinformatics, Vol. 6, No. 3, pp. 222-238, September 2005.
2. Barry M. Leiner, Vinton G. Cerf, David D. Clark, Robert E. Kahn, Leonard Kleinrock, Daniel C. Lynch, Jon Postel, Larry G. Roberts, Stephen Wolff. "A Brief History of the Internet." Internet Society. Web site, accessed October 29, 2008. http://www.isoc.org/internet/history/brief.shtml#leiner
3. Heting Chu, Marylin Rosenthal. "Search Engines for the World Wide Web: A Comparative Study and Evaluation Methodology." ASIS National Conference Proceedings, 1996. Web site, accessed October 29, 2008. http://www.asis.org/annual-96/ElectronicProceedings/chu.html
4. Google, the search engine. Web site, accessed 28 October 2008. http://www.google.com/
5. Google Scholar (beta), advanced search methods. Web site, accessed 28 October 2008. http://scholar.google.com/intl/en/scholar/refinesearch.html
6. Online Information Review, Vol. 32, No. 1, 2008, pp. 102-114. Emerald Group Publishing Limited, 1468-4527.
7. J. Steven Hughes, Daniel Crichton, Sean Kelly, Chris Mattmann. "The Semantic Planetary Data System." PV-2005: Ensuring Long-Term Preservation and Adding Value to Scientific and Technical Data, 21-23 November 2005, The Royal Society, Edinburgh, UK. Web site, accessed 2 November 2008. http://trs-new.jpl.nasa.gov/dspace/bitstream/2014/37655/1/05-2559.pdf
8. Protégé, ontology editing software. Web site, accessed 2 November 2008. http://protege.stanford.edu/
9. "Lucene." Web site, accessed 2 November 2008. http://en.wikipedia.org/wiki/Apache_Lucene
10. Hans-Michael Müller, Eimear E. Kenny, Paul W. Sternberg. "Textpresso: An Ontology-Based Information Retrieval and Extraction System for Biological Literature."
