As computer vision advances, so do the methodologies for object detection and localization.
Earlier object localization methods relied on forms of object detection built on
discriminative classifiers, which model the dependence of unobserved target variables on a
set of observed variables [19]. Discriminative methods proved superior because fewer
variables need to be computed, but they lacked flexibility. Boosting methods for supervised
learning were then introduced, in which each round adds to the current weak learned function;
this proved useful for learning more robust classifiers and was easy to implement.
Discriminative and boosting methodologies have given rise to processes such as cascading
classifiers. In a cascade, the information collected by one classifier is passed on to the
next classifier in the chain [19]. Cascading classifiers are the primary means of detecting
objects in the proposed algorithm, and they are combined with vector and displacement
tracking to obtain the basic form of localization: identifying the spatial extent of the
object and drawing a bounding box around it.
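To make the cascade idea concrete, the following is a minimal sketch of one common form of
cascade, in which a window is rejected by the first stage it fails and only surviving windows
reach later stages. It is an illustration under stated assumptions, not the classifier used
in the proposed algorithm: the stage functions (`mean_brightness`, `contrast`), their
thresholds, and the toy patches are all hypothetical stand-ins for trained boosted stages.

```python
from typing import Callable, List, Sequence, Tuple

Window = Sequence[float]
Stage = Tuple[Callable[[Window], float], float]  # (score function, threshold)


def run_cascade(stages: List[Stage], window: Window) -> Tuple[bool, int]:
    """Pass a window through each stage in order.

    Returns (accepted, stages_evaluated). A window is rejected by the first
    stage whose score falls below its threshold, so later stages never run.
    """
    for index, (score, threshold) in enumerate(stages, start=1):
        if score(window) < threshold:
            return False, index
    return True, len(stages)


# Hypothetical stages: toy stand-ins for trained weak classifiers.
def mean_brightness(window: Window) -> float:
    return sum(window) / len(window)


def contrast(window: Window) -> float:
    return max(window) - min(window)


STAGES: List[Stage] = [(mean_brightness, 0.3), (contrast, 0.5)]

# Toy image patches given as flat lists of intensities in [0, 1].
dark_patch = [0.1, 0.1, 0.2, 0.1]    # fails the first stage
flat_patch = [0.5, 0.5, 0.5, 0.5]    # passes the first stage, fails the second
object_patch = [0.1, 0.9, 0.8, 0.2]  # passes both stages

for name, patch in [("dark", dark_patch), ("flat", flat_patch), ("object", object_patch)]:
    print(name, run_cascade(STAGES, patch))
```

Because most windows in an image contain no object, rejecting them in the cheap early stages
is what keeps a cascade fast enough for a live webcam feed.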
Most current algorithms require strong supervision: they rely on training data labeled
with firm confidence to learn recognition models, under the assumption that an instance of
the studied object is present in each image. However, advances in deep learning
(convolutional neural networks) have given rise to weakly supervised localization algorithms,
which are trained only to detect the presence or absence of an object within an image and
which apply segmentation modelling to differentiate between similar objects in the same
image. Many of the algorithms mentioned are key to localizing objects with webcams [7].
Among previous applications that localize objects with or without a webcam, companies like
Infsoft have successfully implemented their own methods of localizing users for indoor
positioning systems. Rather than using webcams, their systems build in multiple redundancies
to account for any loss in Wi-Fi strength, using Bluetooth-based beacons and VLC (Visible
Light Communication). In this case, VLC was implemented with special LEDs and light sources
that send a flickering light signal detectable by a photodetector or smartphone camera, which
can then derive spatial positions from the location of the light-emitting source and the
incidence angle of the light [9].
While these methods account for the loss of a GPS or Wi-Fi signal, issues still arise with
Bluetooth and VLC. Firstly, although Bluetooth can provide a cheap way of localizing an
object, it requires the user to install an app for client-based solutions, needs additional
hardware, and has a short range. Secondly, the VLC-based approach would draw heavily on
smartphone-based applications, would lack flexibility because it can only localize objects
indoors, and would be cost ineffective because additional resources are required to implement
light-based systems that support tracking and back channels. Additional issues arise from the
lack of discrete forms of localization: because the approach relies solely on light,
performance can suffer in areas subject to varying light.
GiPstech is another company that provides newer indoor localization techniques, using
variations in geomagnetic fields as well as inertial changes to localize a user to within
three feet. The redundancy of backing up a Wi-Fi/GPS-based system with Bluetooth beacons has
been eliminated, while RF-based algorithms are still used. Most of the RF-based clients used
by GiPstech require an app installed on a smartphone in order to localize and map the
movement and behavior of the user. Other supporting systems that further mitigate
localization errors, such as GiPstech’s inertial engine, use sensors already present in a
smartphone to measure the user's movement, providing a 2D trajectory reconstruction, distance
measurement, and step count. Combined, these yield a relatively affordable system that is
nonetheless invasive and sensitive to environmental change [8].
In surveying these two companies, along with other startups that provide indoor
localization systems and basic tracking and surveillance systems, we saw that most systems
required a client to be installed on a smartphone and strictly limited localization to the
location of the smartphone. Most of these systems were also resource intensive: rather than
using a webcam to classify various cascades in multi-object localization, they required a
plethora of sensors and redundant systems to provide backup in the event that a Wi-Fi or GPS
signal returned stochastic errors. The system proposed here can localize a wider range of
objects and users while further reducing the resources required. An immediate increase in
flexibility can also be realized, as the sole determinant in localizing an object or user is
the placement of the cameras in a triangulated system.
Project Planning:
Upon undertaking this project, we understood that the programming languages supporting
computer vision and machine learning techniques were primarily C++ and Python. Most of the
supported machine learning libraries fell within these two languages, so we had to choose
between the two. Ultimately, Python was chosen for the plethora of machine learning libraries
it supports and for the existing material available for us to build on.
In order to familiarize ourselves with Python, we set up weekly meetings with our advisor
to discuss initial implementations of the project and began working on it, troubleshooting
through trial and error. We found that implementing the proposed algorithm required at least
a basic understanding of the machine learning libraries we would be using, along with their
corresponding image processing algorithms. We began by learning how to track defined objects
such as faces, eyes, and mouths using cascades that define such features. In addition to the
software, we were to design and propose a system of cameras placed so that they could
triangulate and localize an incoming object passing through their field of view. To
understand how triangulation works, we studied research papers and algorithms that explain
it. We were presented with multiple camera/webcam design ideas, but we will focus on the
hardware in the next part of the design project. The bulk of our initial analysis pertained
to designing a method by which spatial coordinates could be found once an object was tracked,
as well as to object segmentation.
3.) System Design:
Prior to implementing a design for this project, the design constraints and the other
designs that were considered must be discussed. Design constraints, as well as external
factors such as cost and environmental, social, and governmental factors, could pose issues
that result in an entirely new design.
Design Constraints:
Technical Constraints:
From a technical standpoint, the constraints we could initially face stem from the actual
tracking of an object prior to localization. In a tracking system that uses object
segmentation, objects that are to be detected by color may be detected erroneously. In the
majority of color-based tracking algorithms, images must be thresholded to obtain a
single-channel mask image that isolates the color of the object to be detected. Thresholding
is performed in the HSV (Hue, Saturation, Value) color space rather than the more commonly
used RGB color space. In HSV, the tint of the object's color is assigned a value for hue [1],
the intensity of the color is assigned a value for saturation, and the brightness of the
color is assigned a value for value. Although this method of color tracking provides high
flexibility in tracking an object of a certain color across varying shades, the environment
in which the object is to be localized may itself contain shades that fall within the
detection criteria. As an object traverses the field of view of the triangulated cameras,
points in the surrounding environment that fall within the target shade can be detected
erroneously and become part of the data accumulated for the localized object. Testing
therefore requires that the colored object be placed in an environment that contrasts with
its color, which would eliminate the erroneous detections we would otherwise expect.
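The thresholding step and its failure mode can be sketched with the Python standard library
alone. This is a minimal illustration, not the project's implementation: the helper names
(`in_hsv_range`, `threshold`, `bounding_box`), the green range bounds, and the 4x3 toy image
are assumptions chosen for the example. Note that `colorsys` reports hue, saturation, and
value on a 0 to 1 scale, which differs from the 8-bit scales used by OpenCV.

```python
import colorsys
from typing import List, Optional, Tuple

Pixel = Tuple[int, int, int]          # 8-bit (R, G, B)
HSV = Tuple[float, float, float]      # each channel in [0, 1]


def in_hsv_range(pixel: Pixel, lower: HSV, upper: HSV) -> bool:
    """True if the pixel's hue, saturation, and value all lie within the bounds."""
    r, g, b = (channel / 255.0 for channel in pixel)
    hsv = colorsys.rgb_to_hsv(r, g, b)
    return all(lo <= x <= hi for x, lo, hi in zip(hsv, lower, upper))


def threshold(image: List[List[Pixel]], lower: HSV, upper: HSV) -> List[List[int]]:
    """Build a binary mask: 1 where the pixel falls in the HSV range, else 0."""
    return [[1 if in_hsv_range(p, lower, upper) else 0 for p in row] for row in image]


def bounding_box(mask: List[List[int]]) -> Optional[Tuple[int, int, int, int]]:
    """Return (x_min, y_min, x_max, y_max) over all mask pixels, or None if empty."""
    points = [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    if not points:
        return None
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs), max(ys)


# Assumed bounds for "green"; real bounds would be tuned per object and lighting.
GREEN_LOWER: HSV = (0.25, 0.4, 0.2)
GREEN_UPPER: HSV = (0.45, 1.0, 1.0)

GRAY: Pixel = (128, 128, 128)
GREEN: Pixel = (0, 200, 0)          # the object to localize
DARK_GREEN: Pixel = (30, 120, 40)   # a background shade that also falls in range

clean_scene = [
    [GRAY, GRAY, GRAY, GRAY],
    [GRAY, GREEN, GREEN, GRAY],
    [GRAY, GRAY, GRAY, GRAY],
]
cluttered_scene = [
    [GRAY, GRAY, GRAY, DARK_GREEN],
    [GRAY, GREEN, GREEN, GRAY],
    [GRAY, GRAY, GRAY, GRAY],
]

clean_box = bounding_box(threshold(clean_scene, GREEN_LOWER, GREEN_UPPER))
cluttered_box = bounding_box(threshold(cluttered_scene, GREEN_LOWER, GREEN_UPPER))
print(clean_box, cluttered_box)
```

In the cluttered scene, the single background pixel in a similar shade is absorbed into the
mask and stretches the bounding box beyond the object, which is the erroneous detection
described above.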
In addition to the constraints on localizing objects by color and implementing object
segmentation, camera quality can pose an issue for localization. The webcams/cameras used to
localize the object must be able to correct an overexposed or underexposed image. If a camera
cannot adjust its aperture and ISO speed within a short time, over- and underexposure can
cause problems at initialization, when a reference to the object to be localized is provided,
and tracking can fail. In addition, once the camera adjusts, the object can continue to be
tracked at the same spot, so that it is not localized as it moves through the triangulated
space. With non-color CCTV cameras, localization can be difficult or impossible because the
image cannot be thresholded to isolate the object in the frame. Such cases can only work when
the outside light is low, such that the object in the HSV space appears white against a
darker background, and even then any form of object segmentation, let alone tracking, remains
limited. Camera performance is key to calibrating the distance of the object along the X, Y,
and Z axes, as a higher ISO can produce a grainier image that skews values at the initial
localization of the object and during further localization in the simulation. Limitations in
frame rate can also skew calibrated distances, as localization loses continuity in the three
spatial coordinates and may yield data that is not discernible due to stochastic or erratic
points in the simulation.
In the triangulation approach with multiple cameras, the tilt and viewing angle of the
cameras within the area to be triangulated can skew the spatial location of the moving
object. If the reference location of the object is not initially provided to the feed, the
algorithm cannot obtain the known distance of the object in relation to the focal lengths of
the cameras [15]. From a technical standpoint, the majority of the constraints are highly
dependent on camera performance, along with the HSV image processing, either of which could
hinder accurate localization in a triangulated system.
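The role of the reference location can be illustrated with the standard pinhole-camera
relation and a simple two-camera ray intersection. This is a hedged sketch rather than the
method used in the project: the function names, the planar two-camera layout, and the example
numbers are all assumptions. Under the pinhole model, one reference view of an object of
known width at a known distance fixes the focal length in pixels, after which distance
follows from the object's apparent width.

```python
import math
from typing import Tuple


def calibrate_focal_length(pixel_width: float, known_distance: float,
                           real_width: float) -> float:
    """Focal length in pixels from one reference view at a known distance."""
    return pixel_width * known_distance / real_width


def distance_to_object(real_width: float, focal_length_px: float,
                       pixel_width: float) -> float:
    """Distance from the camera, using similar triangles in the pinhole model."""
    return real_width * focal_length_px / pixel_width


def triangulate(baseline: float, angle_a: float, angle_b: float) -> Tuple[float, float]:
    """Intersect two bearing rays from cameras at (0, 0) and (baseline, 0).

    Angles are in radians, measured at each camera from the baseline toward
    the object. Returns the object's (x, y) position in the camera plane.
    """
    tan_a, tan_b = math.tan(angle_a), math.tan(angle_b)
    x = baseline * tan_b / (tan_a + tan_b)
    return x, x * tan_a


# Assumed reference view: a 0.2 m wide object appears 200 px wide at 1.0 m.
focal = calibrate_focal_length(pixel_width=200.0, known_distance=1.0, real_width=0.2)

# Later the same object appears 100 px wide, so it is twice as far away.
distance = distance_to_object(real_width=0.2, focal_length_px=focal, pixel_width=100.0)

# Two cameras 2 m apart, each seeing the object 45 degrees off the baseline.
position = triangulate(2.0, math.radians(45), math.radians(45))

# The same scene with camera A tilted by 5 degrees, which shifts the estimate.
tilted = triangulate(2.0, math.radians(50), math.radians(45))
print(focal, distance, position, tilted)
```

Without the reference view there is no value for `focal`, so apparent width cannot be
converted to distance; and an uncorrected tilt at either camera changes the bearing angle and
therefore the intersection point.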
Financial Constraints:
In contrast to localization systems that have already been implemented, the proposed
system is significantly cheaper because it uses only cameras as the medium for localizing
objects. However, cost effectiveness becomes an issue when the system is analyzed in terms of
camera performance. We saw that, in order to mitigate localization errors, cameras that
process image data significantly faster and offer faster ISO adjustment and aperture control
would provide sufficient results with little to no error. Cameras with these characteristics
tend to be expensive, but further advances in the algorithm, image processing techniques, and
filtering promise to reduce such costs by making lower-grade cameras usable. Further
constraints on integrating this system into current surveillance and indoor systems would lie
mainly in camera technology.
Designs Considered:
Before settling on the current OpenCV-based implementation, we considered two machine
learning libraries that would have suited the design of the algorithm: OpenCV and TensorFlow.
The initial design was based on TensorFlow because of its ability to classify multiple
classes in a video feed and the large variety of pre-trained models in its object detection
API. We found that, with an object detection API like TensorFlow's, the ability to classify
multiple classes within an image showed promise for detection in more cluttered scenes, where
multiple classes of objects could exist and be localized independently. In terms of object
detection, a trained class required very few samples compared with most computer vision
libraries, and the library could support more ‘handwritten’ pattern recognition algorithms.
However, we found that TensorFlow was more of a generic neural network framework that
required training to detect the features of an image, in contrast to OpenCV. In addition,
OpenCV provided more support for edge detection and object segmentation, which is key in
mitigating existing errors in object localization by means of a camera/webcam.
For the cameras used in the triangulation technique, the initial idea was to use a 360
camera in order to have a centralized system that could function independently of its
surroundings, rather than situating multiple cameras in a fixed area. A 360 camera would
eliminate the need for triangulation but would be less cost effective and could be more prone
to the constraints above, given the 360 camera technology currently available on the market.
Using external webcams instead would make the results less prone to detection errors and
could provide accurate
Final Design:
For our final design, we will create a simple graphical user interface (GUI)