FACE DETECTION AND RECOGNITION USING GOOGLE-NET ARCHITECTURE

Face detection in secure places is an important application of machine vision. This paper presents a system for performing face detection and recognition using the existing Google-Net architecture and transfer learning, so that the network learns images on the basis of a pre-trained architecture. The main characteristics, the architecture, and the data set are explained in detail. The network design aims to maximize the system's accuracy, accurately detect faces that are saved to the database, and specify the effect of the weights used within the nodes of the hidden layer, which is considered the most time-consuming task within the architecture. Experimental results show that training with 10 epochs and 100 samples yields 98.37% training accuracy, whereas other epoch counts either provide less accuracy or consume more time; the number of epochs and training samples can be modified according to the system requirements. Other factors such as illumination, the color of the background, and face rotation or scaling are discussed as impact factors.


I. INTRODUCTION
Various authentication techniques are suggested for securing systems and data because user information might readily leak. According to the authentication approach, anyone must be able to "declare that they are who they claim to be."
Identity authorization for anything is irrelevant to the authentication process. Identity may also be stolen or revealed by close friends or coworkers. A few systems allow users to be identified by an object such as an ID card or passport. Such an object is useful for its intended purpose but relatively simple to steal or copy. As a result, biometrics, which are memorable and distinctive, have been considered as a choice for a secure process. Biometric technology may rely on the face, fingerprint, palm print, iris, or even hand geometry. The face image appears to be one of the most widely employed characteristics of such technologies because of its collectability, usability, and acceptability. Face recognition can be defined as a non-intrusive technology in terms of collectability and usability because it does not involve direct contact with test subjects. In terms of acceptability, numerous works report correct recognition rates greater than 95%. Other biometrics, including fingerprints, also have a high matching accuracy rate. Yet it is possible to create a fake three-dimensional fingerprint using a plastic sheet.
Additionally, for the recognition to be accurate, the fingerprint images stored in a database must be of high quality, which consumes a lot of database space when applied to a system with a large number of users. In contrast, iris scanning represents the most secure and reliable biometric technology; however, iris recognition equipment is pricey [1].
CNNs are a widely utilized and recognized approach for image classification. There are several CNN types, and while they all have the same basic structure, their topologies can vary. The convolutional layers contain a number of kernels, each of which calculates a feature representation of the image. The resulting feature maps hold these feature representations, including the values for the face. A pooling layer is then used to lower the resolution of the feature maps. A network typically contains several convolutional and pooling layers. The first convolution layer extracts what are referred to as low-level features, such as the edges in the image. The following convolution layers are responsible for extracting more abstract features. The fully connected (FC) layers usually come after the convolutional and pooling layers. The task of the FC layers is to connect the neurons of one layer to those of the next. The output layer is the last layer of a CNN. One function that can be applied in this final layer is the softmax function, which produces the final output. Fig. 1 depicts a CNN architecture [12] [13].
Figure 1: CNN architecture [14]

The goal of transfer learning (TL) is to apply knowledge obtained from solving one problem to a different but related problem. Consider, for example, a system that needs to identify human faces in an image. Pre-trained models, i.e., models that have already been trained on many faces, can be used to solve related problems without requiring a significant amount of data [15].

II. RELATED WORK
Face recognition and detection have been investigated in many studies. Some of these studies are briefly described below.
Taigman, Yaniv, et al. (2015) described a standard face recognition pipeline with 4 phases: detect ⇒ align ⇒ represent ⇒ classify. They revisit both the alignment and representation steps by using explicit 3D face modeling to apply a piecewise affine transformation and generate a face representation from a 9-layer deep NN [2].
Yi Sun et al. (2015) suggested a system that uses both face identification and face verification supervisory signals to shape the deep feature representation. The two signals correspond to the two requirements of ideal features for face recognition, i.e., decreasing intra-personal variations and increasing inter-personal variations. The combination of the two supervisory signals results in considerably better features than either signal alone [3].
This is an open access article under the CC BY 4.0 license (http://creativecommons.org/licenses/by/4.0/).
The deep CNN of Schroff et al. is trained to directly optimize the embedding itself. The authors trained it with triplets of roughly aligned matching and non-matching face patches produced by a new online triplet mining technique [5].
Qi, Xianbiao, and Lei Zhang (2019) suggested a straightforward yet efficient centralized coordinate learning (CCL) approach that forces the features to be dispersedly spanned in the coordinate space while guaranteeing that the classification vectors lie on a hypersphere. The authors jointly formulate the learning of the classification vectors and the facial features [6].
Wang, Pin, Peng Wang, and En Fan (2021) used depth features and artificial features to extract the spatiotemporal properties of a video using a CNN and merged them with the features of the trajectory. Two techniques were developed to address the issue of low resolution preventing face images in surveillance video from being properly identified: the SPP-based CNN model and the multi-foot input CNN model [7].
Li, Zheng, Xuemei Lei, and Shuang Liu (2022) constructed a lightweight NN that needs only a few weight representations and inexpensive operators, so it could be used in embedded systems. The lightweight NN created in that study consists of six convolutional layers. Pooling is achieved via convolution with a step size of two, and the batch normalization algorithm is employed to normalize the NN input [9].
Priya, R. L., et al. (2021) suggested a face recognition-based attendance system. A face-recognition algorithm can be trained over a set of images representing all the identities in order to decide whether a new image belongs to one of these identities or is that of a stranger [10].
Sabu M. Thampi et al. (2022) put forth a technique that employs face recognition technology for machine aid throughout search operations. It could be applied to the development of applications that use CCTV footage from a camera network to locate a lost individual. Their method employs face recognition with one-shot learning to locate stranded individuals in large crowds [11].
Many other researchers employed AI methods in networks [16], IoT [17], the metaverse [18], and driverless cars [19]. Table I summarizes the related work.

III. PROPOSED SYSTEM
The goal is to design a network whose architecture maximizes the system's accuracy, accurately detects faces that are saved in the database, and specifies the effect of the weights used within the nodes of the hidden layer, which is considered the most time-consuming task within the architecture. An existing pre-trained or well-trained classification algorithm based on CNN can classify images into different classes with high accuracy, but it has one limitation: it cannot classify data that it has not seen before. For this purpose, extra data should be added to the database. Within any system of identification and authorization using face detection, it is important to build a database of images that will be used in the identification process. The work is carried out on the different datasets explained in the data set section, and the system's main steps are shown in Algorithm 1. In addition, Fig. 3 depicts the general proposed system.
The suggested system architecture is depicted in Fig. 4.

IV. PREPROCESSING
The first stage of the proposed system is the preprocessing stage, where the input data set images are prepared for further processing. This stage is considered the most important step before the feature extraction since the behavior of the

A. Image Resize
Resizing is a critical step in image processing when an image does not have the single fixed size that the system requires, because the number and quality of the extracted features would otherwise vary from image to image. Resizing updates every pixel of the image. When the image is reduced, the unnecessary pixels are discarded; when it is enlarged, new pixels must be added, and their values are estimated. In the proposed work, only one operation is applied: the size is adjusted to 224 × 224 (a downsizing operation), which suits the design of our system, as shown in Fig. 5.
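The downsizing step described above can be illustrated with a minimal sketch. The paper does not state which interpolation method is used, so nearest-neighbor sampling is an assumption here, and the function name `resize_nearest` and the synthetic test image are hypothetical.

```python
def resize_nearest(image, out_h=224, out_w=224):
    """Resize a 2-D grayscale image (a list of rows) by nearest-neighbor sampling.

    Assumption: the interpolation method is not given in the paper;
    nearest-neighbor is used only because it is the simplest to show.
    """
    in_h, in_w = len(image), len(image[0])
    resized = []
    for y in range(out_h):
        # Map each output row back to the source row it samples from.
        src_row = image[min(in_h - 1, y * in_h // out_h)]
        resized.append(
            [src_row[min(in_w - 1, x * in_w // out_w)] for x in range(out_w)]
        )
    return resized


# Usage: downsize a synthetic 448 x 448 gradient image to 224 x 224.
source = [[(x + y) % 256 for x in range(448)] for y in range(448)]
small = resize_nearest(source)
print(len(small), len(small[0]))
```

When the image is reduced, the source pixels that are never sampled are simply dropped, which corresponds to the "unnecessary pixels" mentioned in the text.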

B. Noise Removal
This step reduces the noise in the image to a minimal level, as shown in Fig. 6. A median filter is used to smooth the image, reducing or removing the visibility of noise, which arises from sources such as the camera lens, heat, movement, and dust. Noise removal matters in the preprocessing stage of the proposed system because it removes wrong or unwanted data (e.g., noise) before the data are represented as features, where they could affect the system's accuracy.
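A median filter of the kind described above can be sketched as follows. The window size (3 × 3) and the border handling (replicating edge pixels) are assumptions, since the paper does not specify them; `median_filter` is a hypothetical helper name.

```python
from statistics import median


def median_filter(image, size=3):
    """Apply a size x size median filter to a 2-D grayscale image (list of rows).

    Assumptions: an odd window size (3 x 3 by default) and replicated
    borders; neither choice is stated in the paper.
    """
    h, w = len(image), len(image[0])
    r = size // 2
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            # Collect the neighborhood, clamping coordinates at the borders.
            window = [
                image[min(max(y + dy, 0), h - 1)][min(max(x + dx, 0), w - 1)]
                for dy in range(-r, r + 1)
                for dx in range(-r, r + 1)
            ]
            row.append(int(median(window)))
        out.append(row)
    return out


# Usage: a single "salt" pixel in a flat region is removed by the filter.
noisy = [[10] * 5 for _ in range(5)]
noisy[2][2] = 255
clean = median_filter(noisy)
print(clean[2][2])
```

Because the median ignores isolated extreme values, impulse noise is suppressed without averaging it into neighboring pixels.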

C. Histogram equalization
This method is used to improve image contrast. It does so by effectively spreading out the most frequent intensity values, i.e., stretching out the image intensity range. Histogram equalization is part of the preprocessing stage, which is needed before the CNN trains on or tests an image. Preprocessing aims to reduce the computational cost, provide a faster ID system, and keep sufficient data for face representation.

Max-pooling layers are placed between groups of the inception modules in the suggested system, which has nine inception modules. These max-pooling layers downsample the input as it passes through the network by reducing the width and height of the input data. Decreasing the input size between the inception modules is another efficient way of lowering the computational load on the network. The average pooling layer averages all feature maps that the previous inception module has created, lowering the input width and height to 1 × 1. A dropout layer (40% dropout) is applied immediately before the linear layer. Dropout is a regularization method used throughout training to stop the network from overfitting. It randomly reduces the number of interconnected neurons in a NN: at each training step, each neuron has a chance of being excluded, or dropped, from the combined contribution of the related neurons. The linear layer has 1000 hidden units, the same as the number of classes found in the ImageNet dataset.
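The histogram equalization step described at the start of this subsection can be sketched as below. This is the standard cumulative-histogram formulation for 8-bit grayscale images; whether the authors use exactly this variant is an assumption, and `equalize_histogram` is a hypothetical name.

```python
def equalize_histogram(image, levels=256):
    """Equalize the histogram of a 2-D grayscale image (list of rows of ints).

    Assumption: the classic CDF-based mapping is used; the paper does
    not say which variant of histogram equalization is applied.
    """
    total = len(image) * len(image[0])

    # Count how often each intensity occurs.
    hist = [0] * levels
    for row in image:
        for value in row:
            hist[value] += 1

    # Cumulative distribution of intensities.
    cdf, running = [], 0
    for count in hist:
        running += count
        cdf.append(running)

    cdf_min = next(c for c in cdf if c > 0)
    denom = total - cdf_min
    if denom == 0:
        # A constant image has no contrast to stretch.
        return [row[:] for row in image]

    # Map each intensity so the output range spans 0 .. levels - 1
    # (integer arithmetic with rounding to the nearest level).
    lut = [
        max(0, ((c - cdf_min) * (levels - 1) + denom // 2) // denom)
        for c in cdf
    ]
    return [[lut[value] for value in row] for row in image]


# Usage: a low-contrast image confined to 100..130 is stretched to 0..255.
dull = [[100, 100, 110, 110], [120, 120, 130, 130]]
equalized = equalize_histogram(dull)
print(equalized)
```

The lookup table is built once per image, so the cost is linear in the number of pixels.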

D. GOOGLE-NET architecture
The softmax layer, the last layer, calculates a probability distribution over the set of values contained in its input vector. The softmax function is an activation function: it turns the input into a vector in which each value indicates the probability of a class or event, and the values of this vector add up to one. The primary issue of a very large network is that it suffers from vanishing gradients. When the weight updates produced by backpropagation become minimal in the lowest layers because the gradient value is small, this is known as the "vanishing gradient" problem.
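The softmax function described above, softmax(z)_i = exp(z_i) / sum_j exp(z_j), can be written in a few lines. The sketch below subtracts the maximum score before exponentiating, a common numerical-stability measure that is an implementation choice of this example and is not taken from the paper; the input scores are made up for illustration.

```python
import math


def softmax(scores):
    """Convert a list of real-valued scores into probabilities that sum to one."""
    # Subtracting the maximum avoids overflow and does not change the result.
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Usage: three illustrative class scores (hypothetical values).
probs = softmax([2.0, 1.0, 0.1])
predicted_class = probs.index(max(probs))
print(probs, predicted_class)
```

The class with the largest score always receives the largest probability, so the predicted identity is the index of the maximum output.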

V. EXPERIMENTAL RESULTS
Training over multiple epochs is the main way the model learns to represent the samples in the dataset with less error. Choosing the right number of epochs is a critical task when training on the dataset, so the effect of different numbers of epochs, and their connection to the number of persons in the database, is compared.
The following tables and figure show the effects of increasing the number of epochs and person samples within the data set; increasing the number of epochs too far results in overfitting of the model. When a dynamic number of epochs is used, the weights change many times and the training curves go from underfitting to overfitting. Trial and error shows that not only the number of epochs but also the amount of data may affect the accuracy of the system. According to the results obtained from the training data, 100 persons with ten epochs provided the highest validation accuracy, 98.37%, as reported in Table II and shown in figures (5, 6, 7, and 8). The system can be physically applied in intrusion detection systems, where it gives access rights only to authorized individuals and detects intruders. Table III shows the execution of samples within the system database with epochs of 10 and 100.
Many factors may affect the matching accuracy. The system fails to classify some images because it uses background elements and the shape of the person rather than facial features only. Using a white background may also help the process. Balancing the elapsed training time against the training accuracy, this work chose 10 epochs and 100 samples (i.e., persons), since the accuracy obtained is high and the time consumption is much lower than for the other configurations, as shown in the figures and tables above. The number of epochs and the number of samples are not fixed in this work; they can be modified if the system administrator decides that time consumption is not the biggest issue. Table IV shows other impact factors that may or may not affect the classification accuracy.

A. Dataset Used
The data set used in this work consists of two different groups, which are:
- A collected dataset of known Middle Eastern figures. The core issue of this work is the use of deep learning to detect and recognize faces. The primary problem of deep learning is how to train on the data, so training data must be prepared and the faces in each image labeled with their class. Because it is difficult to find public datasets that meet our requirements, the data were collected by the authors. Since the work mainly investigates the influence of face proportion on confidence and accuracy, the data that need to be collected are images showing faces at different proportions.
- The global dataset that the previous author used to recognize faces. It consists of 200 images that are grouped into four different groups with different snapshots of the people (different ages and genders) at 180 degrees.

Factor | Discussion
Illumination | Since the image or frame is preprocessed before the processing is applied, the illumination is not a factor that impacts the classification negatively or positively.
One-color background | If the background is a single color, it may affect the classification accuracy positively, since far fewer features are obtained from it than from a complicated background.
Scaling and rotation | The CNN generates features that are invariant to scaling or rotation of the face (within acceptable levels).
ISSN:2222-758X e-ISSN: 2789-7362
All CNNs share an identical fundamental structure consisting of several types of layers (convolutional, pooling, fully connected, and output).
Sun, Yi, et al. (2016) suggested DeepID3, the name of two extremely deep NN designs for face recognition. Stacked convolution and inception layers from VGG-net and Google-Net are combined to make these two architectures suitable for face recognition [4].
Schroff, Florian, et al. (2016) suggested a technique that, as opposed to other DL methods, utilizes a deep CNN [5].
Du, Hang, et al. (2021) examined near-infrared to visible (NIR-VIS) face recognition, which seeks to match a set of face images taken from 2 separate modalities, as the most prevalent case in heterogeneous face recognition. While current DL-based approaches for NIR-VIS face recognition have made notable progress, they face new challenges because of the COVID-19 pandemic, during which individuals are advised to wear facial masks to stop the transmission of the virus. The masked face in the NIR probe image makes this task, which the authors designate as NIR-VIS masked face recognition, difficult [8].
the most time-consuming task within the architecture. The existing pre-trained or well-trained classification algorithm based

Figure 2: General process of the proposed system

Figure 3: General system overview

Figure 4: Proposed system architecture

Google-Net is a 22-layer deep CNN, a variant of the Inception Network developed by researchers at Google. It is also utilized for other computer vision tasks such as adversarial training, face detection, and recognition. The Google-Net architecture consists of 22 layers (27 layers in total when the pooling layers are included), among which are 9 inception modules (Fig. 7). This well-known model was used to build the proposed model, with modifications to the original specifications. Table II shows the original specification of the layers used for the Google-Net model.

TABLE II: GOOGLE-NET architecture.

The proposed design's first convolution layer has a filter (patch) size of 7 × 7, significantly larger than the other patch sizes in the network. The main goal of this layer is to reduce the input image size immediately while preserving spatial information by using large filters. A greater number of feature maps is created, yet the input image size (width and height) is decreased by a factor of 4 at the 2nd Conv layer and by a factor of 8 prior to reaching the first inception module. For dimensionality reduction, the second convolution layer uses a 1 × 1 Conv block and has a depth of 2. Dimensionality reduction through 1 × 1 Conv blocks lowers the processing load by reducing the number of operations needed in each layer. There are two max-pooling layers between the groups of inception modules.

When gradients vanish, the network simply stops learning during training. To counter this, the third (Inception 4a) and sixth (Inception 4d) inception modules of the architecture, which are intermediate, are given auxiliary classifiers. Auxiliary classifiers are used only during training and are removed during inference. An auxiliary classifier's task is to perform classification based on the inputs in the middle of the network and to add its training loss back to the network's overall loss.
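How the auxiliary classifiers contribute to the overall training loss can be sketched as follows. The weight of 0.3 on the auxiliary losses is the value reported for the original Google-Net; this paper does not state the weight it uses, so treat it as an assumption. The function names and the probability vectors are hypothetical.

```python
import math


def cross_entropy(probs, target):
    """Negative log-likelihood of the true class under predicted probabilities."""
    return -math.log(probs[target])


def training_loss(main_probs, aux_probs_list, target, aux_weight=0.3):
    """Combine the main loss with the down-weighted auxiliary classifier losses.

    Assumption: aux_weight=0.3 follows the original Google-Net setup;
    the weight used in this paper is not given.
    """
    main = cross_entropy(main_probs, target)
    aux = sum(cross_entropy(p, target) for p in aux_probs_list)
    return main + aux_weight * aux


# Usage with illustrative softmax outputs for a 3-class problem (true class 0).
main_out = [0.7, 0.2, 0.1]
aux_outs = [[0.5, 0.3, 0.2], [0.6, 0.3, 0.1]]  # two auxiliary classifiers
loss_train = training_loss(main_out, aux_outs, target=0)
# At inference the auxiliary classifiers are removed, leaving the main output only.
loss_main_only = training_loss(main_out, [], target=0)
print(loss_train, loss_main_only)
```

Because the auxiliary losses are attached to intermediate modules, their gradients reach the lower layers along a shorter path, which is the motivation given above for countering vanishing gradients.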

Figure 10: Elapsed time of data 10 epochs in different samples

TABLE I: Related Work Comparisons.

TABLE III: Training With Ten Epochs and N Number of Samples.
This matching accuracy can be used with proper Internet of Things techniques to give access authorization to individuals.

TABLE IV: Execution of samples within the system database with epochs of 10 and 100 persons

TABLE V: Other impact factors that affect the classification accuracy (columns: Factor, Discussion)