1. Introduction
1.1 Need for Project
Many mute people use sign language. Sign language uses gestures instead of sound to convey meaning, simultaneously combining hand shapes, orientations and movements of the hands, arms or body with facial expressions to express a speaker's thoughts fluidly. Signs are used to communicate words and sentences to an audience. Sign language is the only means of communication for such people, so it becomes important for others to understand it.
The project's aim is to detect and recognize user-defined gestures from the webcam with reasonable accuracy, where the input to the pattern recognition system is an image of the hand.
1.2 Literature Survey
1. Background subtraction
Concepts used:
1. Threshold values for background subtraction using skin colour extraction
2. Modified skin colour detection algorithm
Published in:
International Journal of Innovative Research in Computer and Communication Engineering
2. Contour analysis
Concepts used:
1. Contour extraction
2. Polygon approximation
Published in:
Contour Extraction and Compression-Selected Topics
3. Convexity defects
Concepts used:
Fingertips location
Published in:
International Journal of Computer Applications (0975 – 8887)
1.3 Project Scope
In this software, we process the hand movements captured by a webcam and convert them into an audible format. For this processing we use ASL (American Sign Language) as the standard sign language and a webcam for gesture recognition.
The main task of the project is to produce a gesture recognition system that can recognize ASL gestures using a web camera. This has practical use in the real world: if a mute person demonstrates a hand posture in front of the web camera, the system detects which ASL letter it is. Once the individual letters are known, the word that a sequence of gestures represents can be understood.
ASL is the standard and widely used language of the mute community. It is a manual or visual language, meaning that information is expressed not with combinations of sounds but with combinations of handshapes, palm orientations, movements of the hands, arms and body, and facial expressions.
1.4 Technology to be used
- OpenCV 3.0 libraries
- Haar Cascade
- Visual Studio 2013
2.2 Requirements
2.2.3 Functional Requirements
1. Hand Tracking
a. This feature provides the ability to track the movement of the user's hand.
b. Hand tracking involves steps such as background subtraction and face removal.
2. Hand Location
a. This feature sends the location of the hand from Gesture Recognition Engine to User Interface.
b. After the hand is tracked, this function returns the position of the hand in two dimensions.
3. Hand Gesture Recognition
a. This feature calculates necessary attributes of input image for its interpretation.
b. The gesture is recognized using the following attributes:
i. Palm center and radius
ii. Fingertips location
iii. Convex hull
c. The calculated attributes are tested against a database to find the closest match.
4. Gesture to Action mapping
a. This feature maps recognized gestures to predefined actions, in this case to words.
5. Text to speech conversion
a. This feature converts recognized words into an audible format (see the sketch below).
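As a rough illustration of requirement 5, the sketch below uses the Windows Speech API (SAPI), which fits the Visual Studio 2013 toolchain listed earlier; the report does not name a speech engine, so SAPI is an assumption, and the spoken word is a placeholder for whatever the gesture-to-action unit produces.

// Minimal text-to-speech sketch using the Windows Speech API (SAPI).
// SAPI is an assumed engine choice; link against sapi.lib.
#include <windows.h>
#include <sapi.h>
#include <string>

bool speakWord(const std::wstring& word)
{
    if (FAILED(::CoInitialize(NULL)))
        return false;

    ISpVoice* pVoice = NULL;
    HRESULT hr = ::CoCreateInstance(CLSID_SpVoice, NULL, CLSCTX_ALL,
                                    IID_ISpVoice, (void**)&pVoice);
    if (SUCCEEDED(hr))
    {
        // Speak synchronously; SPF_ASYNC would avoid blocking the
        // video-processing loop.
        hr = pVoice->Speak(word.c_str(), SPF_DEFAULT, NULL);
        pVoice->Release();
    }
    ::CoUninitialize();
    return SUCCEEDED(hr);
}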
2.2.4 Non Functional Requirements
1. Performance
a. The application involves high-speed image processing and hence requires a significant amount of processing power.
2. Extensibility
a. The gesture-to-action mapping unit can map gestures to any kind of action.
b. This could be used to play games, control computers, etc.
3. Methodology
3.1 Literature Survey (Detailed)
1. Background subtraction
Concepts used:
1. Threshold values for background subtraction using skin colour extraction
2. Modified skin colour detection algorithm
http://www.ijircce.com/upload/2015/june/33_Video.pdf
Published in:
International Journal of Innovative Research in Computer and Communication Engineering
Authors:
Rajeshwari J, Veena H L, Dr.K.Karibasappa
Date of conference:
6, June 2015
Conference location:
Visvesvaraya Technological University, Bangalore, India
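To make the skin-colour extraction idea concrete, here is a minimal OpenCV sketch that segments skin by thresholding in YCrCb space and cleans the mask with morphology; the colour space and threshold values are illustrative assumptions, not the ones from the paper.

// Skin-colour segmentation sketch: threshold in YCrCb space and
// clean up with morphology. Threshold values are illustrative only.
#include <opencv2/opencv.hpp>

cv::Mat skinMask(const cv::Mat& frameBGR)
{
    cv::Mat ycrcb, mask;
    cv::cvtColor(frameBGR, ycrcb, cv::COLOR_BGR2YCrCb);

    // Commonly cited skin range in the Cr/Cb channels; a real system
    // would tune these or learn them per user.
    cv::inRange(ycrcb, cv::Scalar(0, 133, 77), cv::Scalar(255, 173, 127), mask);

    // Remove speckle noise and fill small holes in the hand region.
    cv::Mat kernel = cv::getStructuringElement(cv::MORPH_ELLIPSE, cv::Size(5, 5));
    cv::morphologyEx(mask, mask, cv::MORPH_OPEN, kernel);
    cv::morphologyEx(mask, mask, cv::MORPH_CLOSE, kernel);
    return mask;
}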
2. Contour analysis
Concepts used:
1. Contour extraction
2. Polygon approximation
http://cdn.intechopen.com/pdfs-wm/138.pdf
Published in:
Contour Extraction and Compression-Selected Topics
Author:
Andrzej Dziech
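A minimal sketch of these two concepts with OpenCV: extract contours from a binary mask, keep the largest one (assumed to be the hand), and simplify it with the Douglas-Peucker approximation in approxPolyDP; the largest-blob heuristic and the epsilon factor are illustrative assumptions.

// Contour extraction and polygon approximation sketch.
#include <opencv2/opencv.hpp>
#include <vector>

std::vector<cv::Point> largestContourPolygon(const cv::Mat& mask)
{
    std::vector<std::vector<cv::Point> > contours;
    // findContours modifies its input in OpenCV 3.0, hence the clone.
    cv::findContours(mask.clone(), contours, cv::RETR_EXTERNAL,
                     cv::CHAIN_APPROX_SIMPLE);
    if (contours.empty())
        return std::vector<cv::Point>();

    // Assume the hand is the largest skin-coloured blob.
    size_t best = 0;
    for (size_t i = 1; i < contours.size(); ++i)
        if (cv::contourArea(contours[i]) > cv::contourArea(contours[best]))
            best = i;

    // Epsilon proportional to the perimeter; the factor is a tuning knob.
    std::vector<cv::Point> poly;
    double eps = 0.01 * cv::arcLength(contours[best], true);
    cv::approxPolyDP(contours[best], poly, eps, true);
    return poly;
}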
3. Convexity defects
Concepts used:
Fingertips location
http://arxiv.org/ftp/arxiv/papers/1312/1312.7560.pdf
Published in:
International Journal of Computer Applications (0975 – 8887)
Authors:
Amiraj Dhawan, Vipul Honrao
Date of conference:
17, June 2013
Location of conference:
Fr. C. Rodrigues Institute of Technology, Vashi Navi Mumbai, India
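To show how convexity defects yield fingertip locations, here is a minimal OpenCV sketch: the valleys between extended fingers appear as deep defects of the hand contour's convex hull, and the defect start points on the hull approximate the fingertips; the depth threshold is an illustrative assumption.

// Convexity-defects sketch for locating fingertip candidates.
#include <opencv2/opencv.hpp>
#include <vector>

void fingertipCandidates(const std::vector<cv::Point>& handContour,
                         std::vector<cv::Point>& tips,
                         std::vector<cv::Point>& valleys)
{
    // convexityDefects needs the hull as indices into the contour.
    std::vector<int> hullIdx;
    cv::convexHull(handContour, hullIdx, false, false);

    std::vector<cv::Vec4i> defects;
    cv::convexityDefects(handContour, hullIdx, defects);

    for (size_t i = 0; i < defects.size(); ++i)
    {
        // Vec4i = (start index, end index, farthest index, depth * 256).
        float depth = defects[i][3] / 256.0f;
        if (depth > 20.0f)  // illustrative depth threshold in pixels
        {
            tips.push_back(handContour[defects[i][0]]);    // on the hull
            valleys.push_back(handContour[defects[i][2]]); // between fingers
        }
    }
}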
4. Study of technology to be used
OpenCV
OpenCV is released under a BSD license and is hence free for both academic and commercial use. It has C++, C, Python and Java interfaces and supports Windows, Linux, Mac OS, iOS and Android. OpenCV was designed for computational efficiency, with a strong focus on real-time applications. Written in optimized C/C++, the library can take advantage of multi-core processing; enabled with OpenCL, it can also exploit the hardware acceleration of the underlying heterogeneous compute platform. Adopted all around the world, OpenCV has a user community of more than 47 thousand people and an estimated number of downloads exceeding 9 million. Usage ranges from interactive art and mine inspection to stitching maps on the web and advanced robotics.
Haar Cascade
Object Detection using Haar feature-based cascade classifiers is an effective object detection method proposed by Paul Viola and Michael Jones in their paper, "Rapid Object Detection using a Boosted Cascade of Simple Features" in 2001. It is a machine learning based approach where a cascade function is trained from a lot of positive and negative images. It is then used to detect objects in other images.
Here we will work with face detection. Initially, the algorithm needs a lot of positive images (images of faces) and negative images (images without faces) to train the classifier. Features then need to be extracted from them; for this, Haar features are used. Each one is just like a convolutional kernel: a single value obtained by subtracting the sum of pixels under the white rectangle from the sum of pixels under the black rectangle.
All possible sizes and locations of each kernel are then used to calculate a large number of features; even a 24x24 window results in over 160,000 of them. Each feature calculation needs the sum of pixels under the white and black rectangles. To solve this, the authors introduced integral images, which reduce the calculation of a pixel sum, however large the region, to an operation involving just four values. This makes feature evaluation extremely fast.
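To make the four-value claim concrete, here is a small sketch that evaluates a two-rectangle Haar feature from an integral image via cv::integral; the rectangle positions are arbitrary placeholders.

// Integral-image sketch: the sum of any rectangle is recovered from
// four entries of the integral image, which is what makes Haar
// feature evaluation so fast.
#include <opencv2/opencv.hpp>

// Sum of pixels in rectangle r, using an integral image of type CV_32S
// (cv::integral produces a (rows+1) x (cols+1) table).
int rectSum(const cv::Mat& ii, const cv::Rect& r)
{
    return ii.at<int>(r.y, r.x)
         + ii.at<int>(r.y + r.height, r.x + r.width)
         - ii.at<int>(r.y, r.x + r.width)
         - ii.at<int>(r.y + r.height, r.x);
}

// Two-rectangle Haar feature: sum(white) - sum(black).
int haarFeature(const cv::Mat& gray, const cv::Rect& white, const cv::Rect& black)
{
    cv::Mat ii;
    cv::integral(gray, ii, CV_32S);
    return rectSum(ii, white) - rectSum(ii, black);
}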
Among all the features calculated, however, most are irrelevant. Consider two good features as examples: the first exploits the property that the region of the eyes is often darker than the region of the nose and cheeks, while the second relies on the eyes being darker than the bridge of the nose. The same windows applied to the cheeks or any other region tell us nothing. So how are the best features selected out of 160,000+? This is achieved by AdaBoost.
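Since the report uses Haar cascades (for example in the face-removal step of hand tracking), here is a minimal face-detection sketch with OpenCV's CascadeClassifier; the cascade file name refers to the standard frontal-face model shipped with OpenCV, and the tuning parameters are illustrative.

// Face-detection sketch with a pre-trained Haar cascade, as used for
// the face-removal step of hand tracking.
#include <opencv2/opencv.hpp>
#include <vector>

std::vector<cv::Rect> detectFaces(const cv::Mat& frameBGR)
{
    // Standard model shipped in OpenCV's data directory.
    static cv::CascadeClassifier cascade("haarcascade_frontalface_default.xml");

    cv::Mat gray;
    cv::cvtColor(frameBGR, gray, cv::COLOR_BGR2GRAY);
    cv::equalizeHist(gray, gray);

    std::vector<cv::Rect> faces;
    // Scale factor and min-neighbours are the usual tuning knobs.
    cascade.detectMultiScale(gray, faces, 1.1, 3, 0, cv::Size(30, 30));
    return faces;
}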