Urdu Script Recognition - Assignment Example

Urdu OCR

1- Executive Summary

This report presents a detailed analysis of the digital image processing techniques used in Urdu optical character recognition (OCR), a technology that is very useful in today's information technology infrastructure. It gives a detailed overview of the paper “Recognition of Printed Urdu Script” by U. Pal and Anirban Sarkar, and discusses the algorithms and techniques that paper presents for improved Urdu character recognition.

2- Introduction

Optical Character Recognition (OCR) is the branch of computer science concerned with reading text from paper and translating the images into a form that a computer can process (for instance, by converting them into ASCII codes). An OCR system allows us to take a magazine, book, or article, feed it directly into an electronic data file, and then edit the file with a word processor (webopedia, 2009). The Urdu script is similar to Arabic, which is used widely in different countries. No such work had been done on Urdu previously, and this research offers improved recognition of Urdu script. Pal & Sarkar (2003) also developed a prototype of the system that attained 97.8% character-level accuracy on average (Pal & Sarkar, 2003).

3- Main Structure of OCR

All OCR systems comprise an optical scanner for reading text and sophisticated software for analyzing images. Most OCR systems use a combination of hardware (specialized circuit boards) and software to recognize characters, while some low-priced systems do it entirely in software.
Advanced OCR systems can read text in a large variety of fonts, but they still have trouble with handwritten text (webopedia, 2009). The potential of OCR systems is considerable because they enable users to harness the power of computers to process printed documents. OCR is already used extensively in the legal profession, education, research, and print media (webopedia, 2009). However, less work has been done on recognition of other languages (e.g., Arabic, Hindi, Urdu).

4- Urdu Script overview and difficulties with OCR implementation

Urdu is written in a complex script. The alphabet has 39 letters, and there are 10 numeral characters. One main difficulty is the compound characters formed by combining several characters: an OCR system finds it hard to detect the individual characters within these compound shapes. Another characteristic of the Urdu script is the similarity of shapes among the letters of the alphabet, which is also a major difficulty in recognizing the script (Pal & Sarkar, 2003). Character recognition for English is comparatively simple because the spaces between characters can be detected. The figure below shows the alphabet in detail:

Figure 1 Urdu Alphabets: Source (Pal & Sarkar, 2003)

5- Proposed System

Pal & Sarkar (2003) have proposed an OCR system that recognizes individual characters. It recognizes Urdu script using a combination of contour features, topological features, and features based on the “water reservoir”, a new concept in this area. The techniques implemented for character recognition are robust and simple (Pal & Sarkar, 2003). Pal & Sarkar (2003) have also used a special “segmentation” technique.
This character segmentation technique offers a great advantage in handling the wide variety of Urdu characters that occur frequently in images taken from poor-quality source documents. The system can perform effectively if it is fine-tuned for a wider variety of images containing characters in varied sizes and fonts (Pal & Sarkar, 2003). However, the technique is useful for only a few letters and does not work for the whole alphabet.

6- OCR Algorithms

This section gives an overview of the algorithms used in the paper.

6.1- Water Reservoir Principle

Pal & Sarkar (2003) use the water reservoir principle in their OCR system. If water is poured from one side of a component, the cavity regions of the component where water is stored are considered reservoirs. Top (bottom) reservoirs are the reservoirs obtained when water is poured from the top (bottom) of the component. Similarly, if water is poured from the right (left) side, the cavity regions where water accumulates are called right (left) reservoirs. This technique helps in recognizing different letters of Urdu. Not all reservoirs obtained in this way from a component are used for further processing. Figure 2 shows the water reservoir technique:

Figure 2 Water Reservoir, source: (Pal & Sarkar, 2003)

6.2- Skew detection and correction

The scanned image is converted into a two-tone (binary) image with a histogram-based thresholding technique, and whether a pixel is kept or removed is decided by that threshold. Object pixels are represented by 1 and background (white) pixels by 0. The two-tone image usually shows protrusions and dents in the characters, as well as isolated object pixels over the background; these are cleaned up with a logical smoothing technique.
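The conversion to a two-tone image described in Section 6.2 can be illustrated with a minimal sketch. It assumes grayscale intensities normalized to the range 0 to 1 (0 = black ink) and a fixed threshold of 0.5; the essay does not specify how the threshold is derived from the histogram, so the fixed value is an assumption. The function names `binarize` and `remove_isolated_pixels` are hypothetical, and the isolated-pixel cleanup is only a simple stand-in for the "logical smoothing" step, not the rule used by Pal & Sarkar (2003).

```python
def binarize(gray, threshold=0.5):
    """Map a grayscale image (rows of floats in [0, 1], 0 = black ink)
    to a two-tone image: 1 = object (ink) pixel, 0 = background."""
    return [[1 if value < threshold else 0 for value in row] for row in gray]


def remove_isolated_pixels(binary):
    """Clear object pixels that have no object pixel among their 8 neighbours.
    Assumption: a simple stand-in for smoothing, not the paper's exact rule."""
    height = len(binary)
    width = len(binary[0]) if height else 0
    cleaned = [row[:] for row in binary]
    for r in range(height):
        for c in range(width):
            if binary[r][c] != 1:
                continue
            has_neighbour = any(
                binary[nr][nc] == 1
                for nr in range(max(0, r - 1), min(height, r + 2))
                for nc in range(max(0, c - 1), min(width, c + 2))
                if (nr, nc) != (r, c)
            )
            if not has_neighbour:
                cleaned[r][c] = 0
    return cleaned


# Toy 4x5 "scan": a 2x2 ink blob plus one stray dark pixel in the top-right corner.
gray = [
    [0.9, 0.9, 0.9, 0.9, 0.1],
    [0.9, 0.2, 0.1, 0.9, 0.9],
    [0.9, 0.1, 0.2, 0.9, 0.9],
    [0.9, 0.9, 0.9, 0.9, 0.9],
]
binary = binarize(gray)
cleaned = remove_isolated_pixels(binary)
```

In this toy input the stray pixel in the top-right corner becomes an object pixel after thresholding and is cleared by the cleanup pass, while the 2x2 blob is left untouched.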
Ordinary use of a scanner can introduce skew in the document image. Thresholding is also useful for removing noise and unwanted pixels from the scanned image: whether a pixel is removed is decided by the threshold value, which can normally be set to any value between 0 and 1. The skew angle is the angle that the text lines of the document image make with the horizontal direction. Skew correction can be achieved by first estimating the skew angle and then rotating the image by that angle in the opposite direction (Pal & Sarkar, 2003).

Figure 3 Process of Skew detection, Source (Pal & Sarkar, 2003)

7- Detection Process

This section elaborates on the main detection process.

7.1- Line and character segmentation

The OCR system presented by Pal & Sarkar (2003) first detects individual text lines and then segments the characters in each line. Words are not segmented from a line for recognition purposes. The lines of a text block are separated by finding the valleys of the projection profile, which is computed by counting the number of black pixels in each row. The valley between two consecutive peaks in this profile marks the boundary between two text lines, and a text line lies between two consecutive boundary lines (Pal & Sarkar, 2003).

Figure 4 lines detection, source (Pal & Sarkar, 2003)

7.2- Feature selection

For the initial classification of characters, the system considers contour features, topological features, and features obtained from the concept of water reservoirs. The topological features used include the existence of holes. Contour features comprise characteristics of different profiles obtained from a portion of the character's contour. Water-reservoir-based features are the main features employed in the recognition scheme (Pal & Sarkar, 2003).
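The row-projection idea from Section 7.1 can be sketched as follows. This is a simplified illustration, not the implementation of Pal & Sarkar (2003): it assumes a binary image with 1 for object pixels and treats any row whose ink count falls below `min_ink` as a valley. The names `row_profile`, `segment_lines`, and `min_ink` are hypothetical.

```python
def row_profile(binary):
    """Horizontal projection profile: number of object (1) pixels in each row."""
    return [sum(row) for row in binary]


def segment_lines(binary, min_ink=1):
    """Return (start_row, end_row) pairs, end inclusive, one per text line.
    Assumption: a row with fewer than min_ink object pixels is a valley."""
    lines = []
    start = None
    for r, count in enumerate(row_profile(binary)):
        if count >= min_ink and start is None:
            start = r                     # first inked row of a new line
        elif count < min_ink and start is not None:
            lines.append((start, r - 1))  # valley reached: close the line
            start = None
    if start is not None:                 # line running to the bottom edge
        lines.append((start, len(binary) - 1))
    return lines


# Toy page: two "text lines" separated by two blank rows.
page = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [1, 1, 0, 1, 1, 0],
    [0, 1, 1, 1, 0, 0],
]
print(row_profile(page))    # [0, 3, 4, 0, 0, 4, 3]
print(segment_lines(page))  # [(1, 2), (5, 6)]
```

On real scans the valleys are rarely completely empty, so a small positive `min_ink` or a smoothed profile would probably be needed; that tuning is outside what the essay describes.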
7.3- Character recognition

The technique established by Pal & Sarkar (2003) recognizes Urdu characters in two stages. In the first stage, the characters are grouped into small subsets by a feature-based classification tree. In the second stage, more sophisticated features are used to distinguish the similar characters that share a leaf node of the tree (Pal & Sarkar, 2003).

8- Suggestions

I think the paper is excellent, but I would like to offer some suggestions. Skeletonization is the process of peeling off as many pixels of a pattern as possible without changing its general shape. In simpler words, after pixels have been removed, the pattern should still be recognizable. The resulting skeleton must be as thin as possible, connected, and centered (Azar, 1997).

Figure 5 Image before applying Hilditch algorithm

Figure 6 Image after applying Hilditch algorithm

Images source: (Azar, 1997)

If Hilditch's algorithm is applied to Urdu images, recognition should become simpler, because single-pixel-wide text is easier to separate and recognize than bold text.

9- Conclusion

Pal & Sarkar (2003) proposed an OCR system for printed Urdu documents. The system detects individual text lines with an accuracy of 98.3 percent, and its character segmentation accuracy is 96.9 percent. Most segmentation errors were caused by touching and compound characters; occasionally, errors were also caused by overlapping characters. This report has presented an overview of the OCR system proposed by Pal & Sarkar (2003). I have outlined its main operational steps and working structure, and I have also proposed some ideas to improve the technique.

References

Azar, D. (1997). Hilditch's Algorithm for Skeletonization. Retrieved October 10, 2009, from McGill University: http://cgm.cs.mcgill.ca/~godfried/teaching/projects97/azar/skeleton.html

Pal, U., & Sarkar, A. (2003). Recognition of Printed Urdu Script. IEEE Proceedings of the Seventh International Conference on Document Analysis and Recognition (ICDAR 2003).

webopedia. (2009). optical character recognition. Retrieved 10 08, 2009, from http://www.webopedia.com/TERM/O/optical_character_recognition.html