Changes: Document retrieval

Latest revision as of 18:22, 13 March 2008

This article needs rewriting to enhance its relevance to psychologists.
Please help to improve this page yourself if you can.

Document retrieval is defined as the matching of some stated user query against a set of free-text records. These records could be any type of mainly unstructured text, such as newspaper articles, real estate records or paragraphs in a manual. User queries can range from multi-sentence full descriptions of an information need to a few words.

Document retrieval is sometimes referred to as text retrieval, or treated as a branch of it. Text retrieval is a branch of information retrieval in which the information is stored primarily in the form of text. The advent of full text searching made the job of the indexer redundant during the 1980s. Text databases became decentralized thanks to the personal computer and the CD-ROM. Text retrieval is a critical area of study today, since it is the fundamental basis of all internet search engines.

Description

Document retrieval systems find information that meets given criteria by matching text records (documents) against user queries, as opposed to expert systems, which answer questions by inferring over a logical knowledge database. A document retrieval system consists of a database of documents, a classification algorithm to build a full text index, and a user interface to access the database.

A document retrieval system has two main tasks:

1. Find documents relevant to user queries.
2. Evaluate the matching results and sort them according to relevance, using algorithms such as PageRank.
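
The two tasks above can be sketched as a match step followed by a ranking step. This is only a minimal illustration, not the method of any particular system: the function names, the sample documents, and the scoring rule (counting occurrences of query terms) are all assumptions chosen for brevity, and link-based algorithms such as PageRank are not implemented here.

```python
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def find_matches(query, documents):
    """Task 1: return ids of documents that contain at least one query term."""
    terms = set(tokenize(query))
    return [doc_id for doc_id, text in documents.items()
            if terms & set(tokenize(text))]


def rank(query, documents, doc_ids):
    """Task 2: sort matches by how often the query terms occur (assumed scoring rule)."""
    terms = set(tokenize(query))

    def score(doc_id):
        counts = Counter(tokenize(documents[doc_id]))
        return sum(counts[t] for t in terms)

    # Highest score first; ties are broken by document id for a stable order.
    return sorted(doc_ids, key=lambda d: (-score(d), d))


# Hypothetical sample collection.
documents = {
    "a": "house for sale with garden",
    "b": "garden tools garden furniture and garden sheds",
    "c": "annual report of the council",
}

matches = find_matches("garden house", documents)
ranked = rank("garden house", documents, matches)
```

Keeping matching and ranking as separate functions mirrors the two tasks in the list; a real system would normally replace the linear scan in `find_matches` with an index lookup.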

Internet search engines are classical applications of document retrieval. Most retrieval systems currently in use fall somewhere between simple Boolean systems and systems using statistical or natural language processing techniques.

Variations

There are two main classes of indexing schemata for document retrieval systems: form based (or word based), and content based indexing. The document classification scheme (or indexing algorithm) in use determines the nature of the document retrieval system.

Form based document retrieval addresses the exact syntactic properties of a text, comparable to substring matching in string searches. The text is generally unstructured and not necessarily in a natural language; the system could, for example, be used to process large sets of chemical representations in molecular biology. A suffix tree algorithm is an example of form based indexing.
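
As a rough illustration of form based indexing, the sketch below indexes every suffix of a text so that exact substrings can be located by binary search. Note that it uses a suffix array (a sorted list of suffix start positions) rather than a suffix tree, because the array is much shorter to write down; the function names and the sample sequence are assumptions made for this example.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of text, sorted lexicographically."""
    # Naive construction that compares whole suffixes; fine for a short illustration.
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, suffix_array, pattern):
    """Return the sorted start positions at which pattern occurs in text."""
    width = len(pattern)

    # Binary search for the first suffix whose leading characters are >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        start = suffix_array[mid]
        if text[start:start + width] < pattern:
            lo = mid + 1
        else:
            hi = mid

    # All suffixes that begin with the pattern sit next to each other from here on.
    positions = []
    while lo < len(suffix_array):
        start = suffix_array[lo]
        if text[start:start + width] != pattern:
            break
        positions.append(start)
        lo += 1
    return sorted(positions)


# Hypothetical sequence standing in for a non-natural-language record.
sequence = "GATTACAGATTA"
suffixes = build_suffix_array(sequence)
hits = find_occurrences(sequence, suffixes, "ATTA")
```

The index is built once per text and can then answer any number of substring queries without rescanning the text.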

The content based approach exploits semantic connections between documents and parts thereof, and semantic connections between queries and documents. Most content based document retrieval systems use an inverted index algorithm.
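
An inverted index maps each term to the set of documents that contain it, so a query is answered by looking up its terms instead of scanning every document. The sketch below is a minimal version under simple assumptions: whitespace tokenization, lowercase matching, and a Boolean AND over the query terms. The function names and sample documents are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each term to the set of ids of the documents in which it appears."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search_all_terms(index, query):
    """Return ids of documents that contain every query term (Boolean AND)."""
    terms = query.lower().split()
    if not terms:
        return set()
    # Copy the first posting set so the index itself is never modified.
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical sample collection.
documents = {
    1: "memory and attention in children",
    2: "attention deficits in adults",
    3: "memory consolidation during sleep",
}

index = build_inverted_index(documents)
both = search_all_terms(index, "memory attention")
```

This lookup only matches literal terms; the semantic connections described above would require further processing on top of such an index.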

Example: PubMed

The PubMed^[1] form interface features a "related articles" search, which compares words from the documents' titles, abstracts, and MeSH terms using a word-weighted algorithm.^[2]
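
To make the idea of a word-weighted comparison concrete, the sketch below scores two records by the words they share across the title, abstract, and MeSH fields. It is not PubMed's actual algorithm: the field weights, the scoring formula, the function names, and the sample records are all assumptions invented for illustration.

```python
from collections import Counter

# Assumed field weights: illustrative values only, not taken from PubMed.
FIELD_WEIGHTS = {"title": 3.0, "mesh": 2.0, "abstract": 1.0}


def weighted_words(record):
    """Give each word a weight by adding the field weight for every occurrence."""
    weights = Counter()
    for field, field_weight in FIELD_WEIGHTS.items():
        for word in record.get(field, "").lower().split():
            weights[word] += field_weight
    return weights


def similarity(record_a, record_b):
    """Score two records by multiplying the weights of each word they share."""
    a, b = weighted_words(record_a), weighted_words(record_b)
    return sum(a[w] * b[w] for w in a.keys() & b.keys())


def related_articles(target_id, records):
    """Return the other record ids ordered from most to least similar to the target."""
    target = records[target_id]
    others = [rid for rid in records if rid != target_id]
    return sorted(others, key=lambda rid: (-similarity(target, records[rid]), rid))


# Hypothetical records; the ids and texts are made up for this example.
records = {
    "r1": {"title": "sleep and memory", "mesh": "sleep memory",
           "abstract": "effects of sleep on memory consolidation"},
    "r2": {"title": "memory consolidation", "mesh": "memory",
           "abstract": "sleep supports memory"},
    "r3": {"title": "visual attention", "mesh": "attention",
           "abstract": "attention in visual search"},
}

related = related_articles("r1", records)
```

Weighting title and MeSH words more heavily than abstract words is one plausible design choice; a production system would also discount very common words.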

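References

[1] PubMed Search Interface. http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&list_uids=11825203&dopt=Abstract

[2] Computation. http://www.ncbi.nlm.nih.gov/entrez/query/static/computation.html

External links

This page uses Creative Commons Licensed content from Wikipedia (view authors).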

@@ Line 1: / Line 1: @@
 {{ProfPsy}}
+{{PsyPerspective}}
-'''Document retrieval''' is defined as the matching of some stated user query against useful parts of free-text records. These records could be any type of mainly [[unstructured text]], such as [[bibliographic record]]s, [[newspaper article]]s, or paragraphs in a manual. User queries could range from multi-sentence full descriptions of an information need to a few words and the vast majority of retrieval systems currently in use range from simple Boolean systems through to systems using [[statistical]] or [[natural language processing]] techniques.
+'''Document retrieval''' is defined as the matching of some stated user query against a set of free-text records. These records could be any type of mainly [[natural language|unstructured text]], such as newspaper articles, real estate records or paragraphs in a manual. User queries can range from multi-sentence full descriptions of an information need to a few words.
+Document retrieval is sometimes referred to as '''text retrieval''', or treated as a branch of it. Text retrieval is a branch of [[information retrieval]] in which the information is stored primarily in the form of [[natural language|text]]. The advent of [[full text search|full text searching]] made the job of the indexer redundant during the [[1980s]]. Text databases became decentralized thanks to the [[personal computer]] and the [[CD-ROM]]. Text retrieval is a critical area of study today, since it is the fundamental basis of all [[internet]] [[search engine]]s.
+==Description==
+Document retrieval systems find information that meets given criteria by matching text records (''documents'') against user queries, as opposed to [[expert system]]s, which answer questions by [[Inference|inferring]] over a logical knowledge database. A document retrieval system consists of a database of documents, a classification algorithm to build a full text index, and a user interface to access the database.
+A document retrieval system has two main tasks:
+# Find documents relevant to user queries.
+# Evaluate the matching results and sort them according to relevance, using algorithms such as [[PageRank]].
+Internet [[search engines]] are classical applications of document retrieval. Most retrieval systems currently in use fall somewhere between simple Boolean systems and systems using [[statistical]] or [[natural language processing]] techniques.
+==Variations==
+There are two main classes of indexing schemata for document retrieval systems: ''form based'' (or ''word based''), and ''content based'' indexing. The document classification scheme (or [[Search_engine_indexing|indexing algorithm]]) in use determines the nature of the document retrieval system.
+Form based document retrieval addresses the exact syntactic properties of a text, comparable to substring matching in string searches. The text is generally unstructured and not necessarily in a natural language; the system could, for example, be used to process large sets of chemical representations in molecular biology. A [[suffix tree]] algorithm is an example of form based indexing.
+The content based approach exploits semantic connections between documents and parts thereof, and semantic connections between queries and documents. Most content based document retrieval systems use an [[inverted index]] algorithm.
+==Example: PubMed==
+The [[PubMed]]<ref>[http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&list_uids=11825203&dopt=Abstract PubMed Search Interface]
+</ref> form interface features a "related articles" search, which compares words from the documents' titles, abstracts, and [[Medical Subject Headings|MeSH]] terms using a word-weighted algorithm.<ref>[http://www.ncbi.nlm.nih.gov/entrez/query/static/computation.html Computation]</ref>
 == See also ==
+* [[Computer searching]]
 * [[Document classification]]
 * [[Information retrieval]]
 * [[Search engine]]
+* [[Latent semantic indexing]]
+== References ==
+<references/>
+== External links ==
+[[Category:Information retrieval]]
+[[Category:Electronic documents]]
+<!--
-[[Category:information science]]
+[[zh:文献检索]]
-[[zh:&#25991;&#29486;&#26816;&#32034;]]
+-->
 {{enWP|Document retrieval}}