ILA4: Overcoming Missing Values in Machine Learning Datasets – An Inductive Learning Approach

Abstract

This article introduces ILA4, a new algorithm designed to handle datasets with missing values. ILA4 is inspired by a series of ILA algorithms that also handle missing data, with further enhancements. ILA4 is applied to datasets of varying completeness and compared with other known approaches for handling datasets with missing values. In the majority of cases, ILA4 produced a favourable performance that is on a par with many established techniques for treating missing values, including algorithms based on the Most Common Value (MCV), the Most Common Value Restricted to a Concept (MCVRC), and those that utilize the Delete strategy. ILA4 was also compared with three well-known algorithms: Logistic Regression, Naïve Bayes, and Random Forest; the accuracy obtained by ILA4 is comparable to or better than the best results obtained from these three algorithms.

Keywords: Missing Data, Inductive Learning, Noise Data, Incompleteness, Delete Strategy, Most Common Value.

1. Introduction

Treating missing values in datasets used as resources for machine learning is an important and crucial task, especially when it is essential to use all of the available data. Although large volumes of data are available, the ratio of missing values therein is often too large for the learning model, so researchers have to decide either to ignore records with missing values or to find a way to treat this issue and substitute the missing entries with correct values. The first choice does not constitute the right approach, because it is quite conceivable that some of the missing values are significant for the induction process. For example, gender is an essential feature in patient data, and missing gender values would significantly affect the likelihood of accurate diagnoses in breast cancer diagnosis scenarios. Other scenarios include the bank account holder's annual income when deciding whether to grant a loan. Additionally, if the ratio of missing values in a dataset is high, deletion will significantly diminish the depth of the learning data and hence hinder the model's accuracy; in many machine learning scenarios, deletion is probably the least favourable approach [1].

There is a strong argument for solving the missing values problem with more effective, less sweeping solutions, providing a reliable operational base for predictive models, which traditionally are designed for complete datasets, so that these models can be applied to incomplete datasets rather than simply deleting instances and bypassing them. This is in tune with the argument that allocating significance to missing values maximizes the predictive model's effectiveness [2]. The authors accept that it is impossible to solve all missing value cases comprehensively; however, as shown in [3], when missing data constitute less than 1% of the total data the scenario is considered trivial, and a ratio of up to 5% is deemed manageable. Rates over the 5% threshold and approaching 15% necessitate the application of multifaceted treatment methods, and, finally, ratios over 15% tend to impact the machine learning model's accuracy quite adversely.
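To make the thresholds reported in [3] concrete, the following is a minimal sketch (not from the cited work) that computes a dataset's overall missingness ratio and maps it to those bands, assuming the data is held in a pandas DataFrame; the function name and toy data are hypothetical.

```python
import pandas as pd

def missingness_severity(df):
    """Classify a dataset's overall missingness using the bands reported in [3]."""
    ratio = df.isna().sum().sum() / df.size  # fraction of missing cells
    if ratio < 0.01:
        band = "trivial (< 1%)"
    elif ratio <= 0.05:
        band = "manageable (1-5%)"
    elif ratio <= 0.15:
        band = "requires multifaceted treatment (5-15%)"
    else:
        band = "likely to harm model accuracy (> 15%)"
    return ratio, band

# Toy frame with two missing cells out of nine (about 22% missing).
toy = pd.DataFrame({"a": [1, None, 3], "b": [4, 5, None], "c": [7, 8, 9]})
print(missingness_severity(toy))
```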

 The impact of missing values in datasets is significant in several ways, including but not limited to: 

(i) Decreased efficiency, resulting in fewer extracted patterns and classes and weaker statistical content.
(ii) Complications in data preparation and analysis, as the majority of learning models are designed for complete datasets.
(iii) Discrepancies between incomplete and full datasets, which quite often result in biased learning, including overfitting and under-fitting.
(iv) The causes of data missingness are also varied, including incorrect measurements, faulty sensors/algorithms, human error, censored/anonymized data, and many others [1], [4], [5].

 The known approaches for dealing with missing data values can be categorized as follows [4]:

(i) The Delete strategy, which ignores data instances with missing feature/attribute values.
(ii) Uniform treatment approaches, which apply the same solution in all scenarios (the Any Value, Ignored Value, Common Value, and Special Value strategies).
(iii) Case-by-case approaches. In contrast to (ii) above, these are case/scenario-specific and apply the Pessimistic Value, Predicted Value, and Distributed Value strategies.

In terms of the timing/stage of treatment of missing data values, techniques fall into two options. Pre-induction: missing values are replaced with corresponding attribute values from the source dataset before the induction phase (a minimal sketch of such pre-induction treatments is given at the end of this introduction). During induction: the missing data is treated during the induction process; here, the technique for dealing with missing data values is embedded as a step within the inductive learning algorithm. This requires careful analysis of the considered algorithm so that it maximizes accuracy and handles missing data values effectively and efficiently.

This article adopts the latter scenario and is inspired by previous work [6]; it utilizes the ILA algorithm from earlier research [7]. The ILA algorithm, originally designed for complete datasets, is modified to handle datasets with missing values, and new, bespoke techniques for handling missing data values are introduced and tailored to ILA. The resulting algorithm, ILA4, constitutes the central contribution of this work and produces favourable accuracy when learning from datasets with missing values. This article analyzes related work in the area, presents examples to illustrate the model and the underlying idea, and reports experiments that demonstrate the strengths of the proposed algorithm from different perspectives. ILA is a robust inductive algorithm designed to generate a small number of the most general rules; the rules generated by ILA have high accuracy in predicting unseen and test examples. Enhancing ILA to deal with datasets with missing values therefore yields a new algorithm that leverages ILA's power with the added ability to deal with noisy data. This algorithm is called ILA4, and it is the main focus of this article.

This article is organized as follows: related work and similar algorithms and models are briefly discussed in section 2. The ILA family of algorithms is introduced in section 3, followed by a description of how ILA works in section 3.1. Section 4 gives a complete description of the new model for treating missing values and how it is integrated with ILA to form ILA4, and section 5 presents an illustrative example of the suggested approach. Section 6 discusses the experiments and results that show the feasibility of the new system from many perspectives, and section 7 concludes the article.
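The sketch referenced above is given here. It is an illustrative reading of the pre-induction treatments named in this article (Delete, Most Common Value, and Most Common Value Restricted to a Concept), not the authors' implementation; the pandas column names and toy data are hypothetical.

```python
import pandas as pd

def delete_strategy(df):
    # Delete strategy: drop every instance (row) that contains a missing value.
    return df.dropna()

def mcv_strategy(df, class_col):
    # Most Common Value: fill each missing value with the most common value of
    # its attribute, computed over the whole dataset.
    out = df.copy()
    for col in out.columns:
        if col != class_col:
            out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out

def mcvrc_strategy(df, class_col):
    # MCV Restricted to a Concept: as above, but the most common value is
    # computed only over rows that share the same class value.
    out = df.copy()
    for col in out.columns:
        if col != class_col:
            out[col] = out.groupby(class_col)[col].transform(
                lambda s: s.fillna(s.mode().iloc[0]) if not s.mode().empty else s
            )
    return out

# Hypothetical toy data in the spirit of the weather example used later.
toy = pd.DataFrame({
    "Outlook": ["Clear", None, "Cloudy", "Clear"],
    "Humidity": ["Stifling", "Comfy", None, "Stifling"],
    "Class": ["N", "P", "P", "N"],
})
print(mcvrc_strategy(toy, class_col="Class"))
```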
2. Related Work

The problem of missing values and their imputation (MVI) has been a focus of machine learning research for several years. This is due to the huge amounts of data available throughout the world, a significant percentage of which is incomplete or contains missing values. Many studies have addressed this problem, each tackling it from a different perspective.

The authors in [8] proposed a framework to deal with missing values. This framework consists of three central units: i) mean pre-imputation, wherein missing data values are tentatively replaced using a fast linear mean imputation method; ii) application of confidence intervals, wherein the base imputation method is used to impute each missing data value and confidence intervals are used to filter the imputed values; and iii) boosting, a unit that accepts only the best, high-quality imputed values. A set of sixteen databases was used to test this framework. Experimental results show that, on average, the framework-based method yields the highest accuracy rates among the considered methods.

The authors in [9] proposed a method that uses an additional category to substitute missing data values. While this approach is simpler to apply, it is not without inherent issues when analyzing the resulting data. The work in [10] evaluated the performance of different missing data imputation methods in the context of predicting recurrence in breast cancer patients. The studied techniques included statistical techniques such as multiple imputation, hot-deck imputation, and mean imputation, as well as machine learning methods such as k-nearest neighbors, self-organizing maps, and multilayer perceptrons. The study concluded that the imputation methods based on machine learning techniques outperformed statistical imputation methods in the prediction process.

The authors in [3] conducted a study on several datasets to assess the effect of case deletion, mean imputation, median imputation, and KNN imputation on the misclassification error rate when fixing missing values. The study used a parametric classifier (LDA: Linear Discriminant Analysis) and a non-parametric classifier (KNN: K-Nearest-Neighbor) to evaluate the misclassification rate. The authors concluded that there is not much difference between case deletion and imputation methods for either classifier when the number of instances containing missing values is small. However, this is not the case for datasets with a high percentage of instances with missing values; KNN imputation seems to perform better than the other methods because it is more robust to bias as the percentage of missing values increases.

In later research [11], the authors studied the effect of missing data using six different imputation methods. Five of these are single imputation methods, including mean, hot deck, Naive-Bayes, the framework with hot deck, and the framework with Naive-Bayes, while the sixth is a multiple imputation method, namely polytomous regression. These methods were tested on fifteen datasets to check the classification accuracy of six popular classifiers: RIPPER, C4.5, KNN, SVM with polynomial and RBF kernels, and Naive-Bayes. In one experiment, Naive-Bayes imputation gave the best results for the RIPPER classifier on datasets with a high amount of missing data, polytomous regression imputation was the best for SVM with the polynomial kernel, and the application of the imputation framework was the best for the RBF kernel and KNN. In another experiment, some classifiers such as C4.5 and Naive-Bayes were found to be missing-data resistant, i.e., they can produce accurate classifications in the presence of missing data, while other classifiers such as KNN, SVM, and RIPPER benefit from the imputation.
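Hot-deck imputation, mentioned among the statistical techniques above, fills each missing entry with an observed value copied from a similar "donor" record. The following is a minimal, hedged sketch of one simple random hot-deck variant (donor drawn at random from same-class rows with that attribute observed); it is an illustration only, not the variant evaluated in the cited studies.

```python
import numpy as np
import pandas as pd

def random_hot_deck(df, class_col, seed=0):
    """Simple random hot-deck: each missing cell is copied from a randomly
    chosen donor row of the same class that has that attribute observed."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    for col in out.columns:
        if col == class_col:
            continue
        for idx in out.index[out[col].isna()]:
            donors = out[(out[class_col] == out.at[idx, class_col]) & out[col].notna()]
            if not donors.empty:
                out.at[idx, col] = donors[col].iloc[rng.integers(len(donors))]
    return out
```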
The work in [4] reviewed Rubin's multiple imputation method, which consists of three steps: a) completed datasets are created by imputing the unobserved data using independent draws from an imputation model; b) data analysis is performed by treating each generated dataset as a real complete dataset; and c) the results from the complete-data analyses are combined in an appropriate way to obtain the so-called repeated-imputation inference. The emphasis of this review [4] is on the building of imputation models. Three applications of Rubin's method are reviewed within the context of medical studies: estimating reporting delays after AIDS is diagnosed, studying the effects of missing data and noncompliance in randomized experiments, and handling non-response in the United States National Health and Nutrition Examination Surveys (NHANES). The authors found that Rubin's multiple imputation method is a powerful tool when dealing with non-response in public-use data files, as in the third application. When used for handling incomplete data and analyzing complete data, as in the first and second applications, it is considered an effective statistical tool because of its conceptual and implementation simplicity.

The authors in [5] empirically compared several approaches for dealing with unknown values in datasets used by conventional separate-and-conquer rule learning algorithms. They distinguish three general methods comprising eight strategies: a) the Delete strategy: ignore examples with unknown values; b) the Ignored Value, Any Value, Special Value, and Common Value strategies: treat unknown values uniformly for all examples; and c) the Pessimistic Value, Predicted Value, and Distributed Value strategies: treatment of unknown values depends on the example. All eight strategies were evaluated using the UCI repository of machine learning databases. Among several findings of this study, the authors concluded that, in general:
● Predicting the suitability of a particular strategy for a specific dataset is not straightforward.
● The Delete strategy underperforms many other strategies significantly, as it reduces the depth of the learning resource.
● When the rows with missing attribute values are few, the choice of treatment strategy becomes significant.
● The Pessimistic and Any Value strategies tend to be biased against the use of imputed attributes, especially in contrast to the Distributed and Predicted Value strategies, which can maintain a potential preference for these attributes.
● The Pessimistic and Special Value strategies combined together were the best approach for more than half of the datasets.

In [12], two MVI methods were suggested to handle missing values across multiple variables: i) multivariate normal imputation (MVNI), which uses a Bayesian approach to predict missing values from a multivariate normal distribution and assumes that the variables in the imputation model jointly follow such a distribution, and ii) the Fully Conditional Specification (FCS) method [13], which achieves more flexibility by selecting a suitable regression model per variable, thereby relaxing the MVNI assumption. The limitations of such approaches, based as they are on distinct theoretical assumptions, mean that they fail to cater to time-series characteristics such as trends and seasonality; hence, their application is potentially unreliable for time-series sensor data.
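Returning to Rubin's multiple imputation reviewed in [4] above, the sketch below is a minimal illustration (not from the cited work) of the three-step pipeline: m imputed datasets drawn via scikit-learn's IterativeImputer, a per-dataset analysis, and pooling with Rubin's combining rules. The analyzed quantity (a single column mean) is chosen purely for illustration.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

def rubin_pool(data, column=0, m=5):
    """Multiple imputation of `data` (a NumPy array with NaNs) followed by
    Rubin's rules, pooling the mean of one column as the quantity of interest."""
    estimates, variances = [], []
    n = data.shape[0]
    for i in range(m):
        # a) impute with independent draws (sample_posterior=True varies the draws)
        imputer = IterativeImputer(sample_posterior=True, random_state=i)
        completed = imputer.fit_transform(data)
        # b) analyze each completed dataset as if it were fully observed
        col = completed[:, column]
        estimates.append(col.mean())
        variances.append(col.var(ddof=1) / n)  # sampling variance of the mean
    # c) combine: pooled estimate, within- and between-imputation variance
    q_bar = np.mean(estimates)
    w = np.mean(variances)
    b = np.var(estimates, ddof=1)
    total_var = w + (1 + 1 / m) * b
    return q_bar, total_var
```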
For time-series data, and from an industrial application perspective with particular emphasis on Internet of Things (IoT) and intelligent manufacturing data, the authors in [14] contrasted MVI techniques that utilize data from geographically close sensors when a specific sensor's data is missing with techniques that use highly correlated data from sensors of a similar mode regardless of geographical location. They assert that, due to the inherently large volumes of transmitted data, it is necessary to make contingencies for significant data losses from multiple sensors due to a single event; this scenario renders data imputation from close or correlated sensors infeasible. Hence the authors study MVI for univariate time-series data and propose an iterative framework that uses multiple segmentation for significant gaps (reconstructing segments and subsequently concatenating them) to address this issue. Experimental results show notable improvements over known methods, especially in terms of root-mean-square-error metrics.

By experimenting on a metabolomics dataset, using a survey/comparative analysis approach to contrast imputation methods in a simulation framework and, in turn, validating said approaches by application to real data, the authors of [15] evaluated over 30 methods to verify each approach's ability to: i) reconstruct biochemical pathways from data-driven correlation networks, and ii) increase statistical power while preserving the strength of established metabolic quantitative trait loci. The authors concluded that k-nearest-neighbor-based methodologies performed well in terms of effectiveness and computational cost, followed closely by methods that utilize multiple imputation by chained equations, which demand more computational processing.

The authors in [16] applied unsupervised machine learning to impute missing values. They developed a Rough K-means centroid-based approach, a novel solution targeting inconsistencies of missing values via a combination of soft computing and clustering methods. They compared their method with approaches that utilize the Rough K-means parameter, K-means parameter, Fuzzy C-means centroid, K-means centroid, and Fuzzy C-means parameter models, and reported favourable results on the UCI benchmark datasets, viz. Yeast, Pima, Wisconsin, and Dermatology.

Using a combination of decision-tree-based modeling, fuzzy clustering, and an iterative learning approach, the authors in [17] proposed DIFC, a novel MVI method. Experiments against five state-of-the-art approaches:
• IFC - iterative fuzzy clustering
• DMI - decision trees and decision forests by splitting and merging
• IBLLS - iterative bi-cluster-based least square framework
• SVR - support vector regression
• EMI - estimation of mean values and covariance matrices
on the known UCI datasets AutoMPG, Adult, Housing, Yeast, Pima, CMC, and GermanCA, with both categorical and numerical missing values of varying frequency and type, show that DIFC is more effective in terms of accuracy as well as being more pliant in handling different missing value scenarios.

Other, more comprehensive surveys on MVI include [18], which reviewed over 100 studies spanning the 10 years up to 2017; the authors addressed many technical issues faced as part of MVI, including dataset choice, missingness rates, methodologies, and evaluation metrics, thus highlighting limitations in the existing literature.
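Since k-nearest-neighbour imputation recurs across the studies above as a strong, relatively cheap baseline, the following minimal sketch shows how such an imputer can be applied with scikit-learn's KNNImputer; it stands in for the various kNN-based methods reviewed here rather than reproducing any of them, and the toy matrix is hypothetical.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy numeric matrix with missing entries (np.nan); the studies above use
# much larger metabolomics, sensor, or UCI datasets.
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, np.nan, 6.0],
    [np.nan, 8.0, 9.0],
    [4.0, 5.0, 6.0],
])

# Each missing value is replaced by the mean of that feature over the k
# nearest rows, measured on the jointly observed features.
imputer = KNNImputer(n_neighbors=2, weights="uniform")
X_imputed = imputer.fit_transform(X)
print(X_imputed)
```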
In [19], the authors assert the potential value of machine learning MVI approaches but caution against their typically higher time costs in comparison to statistics-based approaches. In their work, the authors reviewed several MVI approaches focusing on efficiency in comparison to predefined baseline MVI algorithms, including k-nearest neighbor, support vector machines, mode/median, and Naive Bayes. Their work covered both method outcomes and the potential for extension.

3. ILA Family

ILA was released in 1998 as an inductive learning algorithm [7], after its introduction in the mid-90s as a structure for linking expert systems and decision support systems [20]. A modified version of the framework was built with a new algorithm, DCL, in 1999 [21]. DCL greatly improves the classification of unseen examples by creating rules with the OR operator on the left-hand side (LHS). At about the same time, a significantly modified version of ILA, ILA-2, was developed [22]. While the original ILA produces rules with 100% certainty, ILA-2 can generate rules with uncertainty. To utilize the power of machines with parallel processing capabilities, a modified ILA version, called PILA, was introduced in 2000. A PhD study conducted in 2009 found that ILA can be used on distributed databases as well. In the last two years, ILA was extended with new functions that focus on the relevant features within a dataset, creating a reduced dataset containing only those features; the resulting algorithm, called ILA3, significantly improved the efficiency of ILA while maintaining accuracy at an acceptable level [23]. Finally, ILA has been found useful in a variety of research fields, such as text-to-speech, nominal and verbal Arabic sentence classification, and intrusion detection [24].

3.1 Description of ILA

The main processes within ILA are pertinent to understanding how the treatment of missing values is tailored to it. ILA is an inductive learning algorithm that produces a set of classification rules by analyzing discrete training data with no missing values. The algorithm works iteratively: in each iteration, it searches for a rule that can classify the maximum number of training instances. Once such a rule is identified, the training instances it covers are flagged as processed and discarded in subsequent cycles, and the rule is added to the accumulated set of rules. In other words, ILA applies a rule-per-class scheme wherein rule induction separates the examples of the current class from the examples of the remaining classes. The net result of this process resembles an ordered list of rules rather than a decision tree.

To analyze the ILA processes, we will use a sample of weather data (with slight modification), as shown in Table 1. This dataset contains fourteen rows (m=14), each with four attributes (k=4) and two possible classification values in one decision (class) attribute, {P, N} (n=2). In the data, the attributes "Outlook", "Temperature", "Humidity" and "Stormy" have the possible values {Clear, Cloudy, Frost}, {Warm, Regular, Cold}, {Stifling, Comfy} and {True, False}, respectively.

Table 1. Weather Data.
Example  Outlook  Temperature  Humidity  Stormy  Class
1        Clear    Warm         Stifling  False   N
2        Clear    Warm         Stifling  True    N
3        Cloudy   Warm         Stifling  False   P
4        Frost    Regular      Stifling  False   P
5        Frost    Cold         Comfy     False   P
6        Frost    Cold         Comfy     True    N
7        Cloudy   Cold         Comfy     True    P
8        Clear    Regular      Stifling  False   N
9        Clear    Cold         Comfy     False   P
10       Frost    Regular      Comfy     False   P
11       Clear    Regular      Comfy     True    P
12       Cloudy   Regular      Stifling  True    P
13       Cloudy   Warm         Comfy     False   P
14       Frost    Regular      Stifling  True    N

Since n = 2, the algorithm generates two Sub-Tables (STs), as shown below.

Table 2. The two Sub-Tables generated from the original training set, partitioned by decision class.

ST1
Original Example  New Example  Outlook  Temperature  Humidity  Stormy  Decision
1                 1            Clear    Warm         Stifling  False   N
2                 2            Clear    Warm         Stifling  True    N
6                 3            Frost    Cold         Comfy     True    N
8                 4            Clear    Regular      Stifling  False   N
14                5            Frost    Regular      Stifling  True    N

ST2
Original Example  New Example  Outlook  Temperature  Humidity  Stormy  Decision
3                 1            Cloudy   Warm         Stifling  False   P
4                 2            Frost    Regular      Stifling  False   P
5                 3            Frost    Cold         Comfy     False   P
7                 4            Cloudy   Cold         Comfy     True    P
9                 5            Clear    Cold         Comfy     False   P
10                6            Frost    Regular      Comfy     False   P
11                7            Clear    Regular      Comfy     True    P
12                8            Cloudy   Regular      Stifling  True    P
13                9            Cloudy   Warm         Comfy     False   P

Table 2 is the basis for the second step of the algorithm, as follows. Set j=1; in this case the attribute combinations are {Outlook}, {Temperature}, {Humidity} and {Stormy}. It is noted that no value of these single-attribute combinations appears in ST1 without also appearing in ST2. Next, j is increased to 2 and the combinations of two attributes are: {Outlook, Temperature}, {Outlook, Humidity}, {Outlook, Stormy}, {Temperature, Humidity}, {Temperature, Stormy}, and {Humidity, Stormy}. For the combination {Outlook, Humidity}, the value "Clear, Stifling" has the highest occurrence (3 times), appearing in the first ST but not in the second. As such, the max-combination is set to "Clear, Stifling", rows 1, 2, and 4 are flagged as classified, and the following production rule (Rule 1) is extracted:

Rule 1: IF Outlook is Clear AND Humidity is Stifling THEN the decision is N.

The ILA algorithm repeatedly applies the same steps to the remaining unmarked rows (3 and 5) in ST1. By repeating the steps above, we find the attribute value "Frost, True" under {Outlook, Stormy} occurring the maximum number of times in ST1 (twice) and not in ST2. Hence, rows 3 and 5 are marked as classified, and Rule 2 is added to the set of rules:

Rule 2: IF Outlook is Frost AND Stormy is True THEN the decision is N.

With rows 3 and 5 flagged as classified, no more data within ST1 requires treatment, and we can consider the next ST (ST2). With j set to 1, the "Cloudy" attribute value of {Outlook} occurs 4 times (the maximum occurrence for j = 1), in rows 1, 4, 8, and 9 of ST2 but not in ST1. So these rows are flagged as classified, and Rule 3 is added to the list of rules:

Rule 3: IF Outlook is Cloudy THEN the decision is P.

In the remaining rows of ST2 (i.e., 2, 3, 5, 6, and 7), no single value of any attribute appears in ST2 and not in ST1. So j is increased by 1 to 2, the two-attribute combinations are generated, and the value "Frost, False" of the combination {Outlook, Stormy} is found to occur 3 times in ST2 and not in ST1. This is the maximum occurrence, so rows 2, 3, and 6 are flagged as classified, and the following rule (Rule 4) is generated:

Rule 4: IF Outlook is Frost AND Stormy is False THEN the decision is P.
In the rows that remain unclassified in this ST (i.e., rows 5 and 7), it is found that the value "Clear, Comfy" of the combination {Outlook, Humidity} occurs twice in ST2 and not in the other ST. Rows 5 and 7 are therefore flagged as classified and the following rule is generated:

Rule 5: IF Outlook is Clear AND Humidity is Comfy THEN the decision is P.

With all of the rows in ST2 now flagged as classified and no other STs to process, the algorithm terminates.
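To summarize the procedure just walked through, the following is a compact illustrative sketch, our own reading of the published algorithm rather than the authors' code, of ILA's core loop: partition the examples by class, repeatedly find an attribute-value combination that occurs a maximal number of times in the current sub-table and never in the others, emit it as a rule, and mark the covered rows.

```python
from itertools import combinations
from collections import Counter

def ila(examples, attributes, class_attr):
    """examples: list of dicts mapping attribute -> discrete value (complete data).
    Returns an ordered list of rules, each a (conditions_dict, class_value) pair."""
    rules = []
    for cls in sorted({e[class_attr] for e in examples}):
        sub = [e for e in examples if e[class_attr] == cls]      # current ST
        others = [e for e in examples if e[class_attr] != cls]   # all other STs
        unmarked = list(range(len(sub)))
        j = 1
        while unmarked and j <= len(attributes):
            # Count value combinations of size j over the unmarked rows of this ST.
            counts = Counter()
            for i in unmarked:
                for attrs in combinations(attributes, j):
                    counts[(attrs, tuple(sub[i][a] for a in attrs))] += 1
            # Keep only combinations that never occur in the other sub-tables.
            def occurs_elsewhere(attrs, values):
                return any(all(o[a] == v for a, v in zip(attrs, values)) for o in others)
            valid = {k: c for k, c in counts.items() if not occurs_elsewhere(*k)}
            if not valid:
                j += 1          # no j-attribute combination is exclusive; enlarge j
                continue
            (attrs, values), _ = max(valid.items(), key=lambda kv: kv[1])
            rules.append((dict(zip(attrs, values)), cls))
            # Mark every unmarked row covered by the new rule as classified.
            unmarked = [i for i in unmarked
                        if any(sub[i][a] != v for a, v in zip(attrs, values))]
    return rules
```

Applied to the weather data of Table 1 (encoded as a list of dicts), this loop produces rules of the same form as Rules 1-5 above, although tie-breaking order may differ.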
4. The Suggested Approach for Treating Missing Values

As mentioned previously, methods for treating missing dataset values are many and varied, including those that perform data treatment before the process of rule generation and those designed with specific induction algorithms in mind. For the latter, it is essential to analyze the induction algorithm and fully understand its characteristics and internal processes in order to design the most effective and efficient treatment approach. The weather example in section 3.1 shows that ILA works by creating ST partitions from the dataset table, one partition per class value; it then highlights the highest-occurring combination within the current sub-table that is explicitly excluded from the other STs/partitions and generates a rule from that combination. As indicated earlier, ILA operates on datasets that are both complete and discrete. When handling missing values, the modified algorithm replaces the missing value in the ST being processed with values from the corresponding combination with the maximum occurrences, used as a reference for rule generation. Hence, there is no significant adverse effect on the induction power of ILA.

Table 3 shows a dataset snapshot used to illustrate this idea. Considering the {Outlook} attribute, we can see that the values "Cold" and "Cloudy" occur in the second ST and not in the first. We can also see that "Cold" occurs 4 times and "Cloudy" 3 times. When missing values under this attribute are also counted towards these values, "Cold" occurs 5 times and "Cloudy" 6 times. Hence the value "Cloudy" is the candidate selected to generate the rule and to replace all missing {Outlook} values. For compound attribute values such as "Clear, Warm", this combination appears 3 times in the first ST, as "Clear, Warm", "Clear, ---", and "---, Warm", and does not occur in the second ST. This combination will be the basis for the missing value replacement rule, as it has the maximum occurrence rate.

ILA4 repeatedly works on each ST until all rows are classified; some rows containing missing values, whose observed value combinations also occur in other STs, are left unmarked/unclassified. The remaining rows are subsequently treated analogously to generate rules: the missing values within said rows are filled with values from the domain of the attribute at hand, chosen such that the resulting rows do not feature in other STs. With this method, rules are generated and the said rows are subsequently classified. To demonstrate the main aspects of the idea, consider the 3rd row in ST1 (Table 4), with attribute values "--- Cold Comfy --- N", and assume this row remains unclassified. The only complete combination in this row, "Cold, Comfy", is also present in the other ST (ST2). In this case, "Cold, Comfy" cannot be selected for rule generation; the first occurrence of "---" in the row, corresponding to {Outlook}, can be replaced with either "Cloudy" or "Frost", but not "Clear", and the second occurrence of "---", under the {Stormy} attribute, can only be replaced with the value "True".

Values nominated for missing value replacement must produce the maximum occurrence of the treated row's combination in the current ST. As this row remains unmarked in the example above, any candidate from the three possible values can be nominated as a replacement for a missing value within the row at hand; in this way, a rule is obtained and the row is classified. The flowchart in Figure 1 gives a high-level description of the algorithm, and the detailed stepwise algorithm is illustrated in Algorithm 1. It is noteworthy that step 11 is the primary modification of the original algorithm (ILA) that constitutes the new algorithm, ILA4, as shown in Figures 1 and 2 below. By design, ILA4 is expected to be somewhat more time-consuming and less efficient; this does not detract from the algorithm's effectiveness, as it processes data offline, where processing speed is less of a factor than in online algorithms. This efficiency issue was addressed for the original ILA algorithm in [23].

Table 3. Dataset with missing values

Figure 1. ILA4 Composition
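Step 11, the key modification described above, changes how combinations are counted and matched when values are missing: a missing cell acts as a wildcard when tallying a combination's occurrences in the current ST, while the exclusivity check against the other STs considers only actually observed values. The sketch below is our hedged interpretation of that matching logic, not the authors' Algorithm 1.

```python
MISSING = None  # a missing cell; the paper denotes missing values by "---"

def matches(row, attrs, values, missing_is_wildcard):
    """Does `row` exhibit the attribute-value combination (attrs, values)?"""
    for a, v in zip(attrs, values):
        cell = row[a]
        if cell == v:
            continue
        if missing_is_wildcard and cell is MISSING:
            continue  # a missing cell may tentatively stand in for the value
        return False
    return True

def count_in_current_st(sub_table, attrs, values):
    # Occurrence count in the current ST: rows whose missing cells could take
    # the sought values are credited too, as in the "Clear, ---" example above.
    return sum(matches(r, attrs, values, missing_is_wildcard=True) for r in sub_table)

def occurs_in_other_sts(other_rows, attrs, values):
    # Exclusivity check: only combinations actually observed in another ST
    # disqualify a candidate; missing cells there are not counted as matches.
    return any(matches(r, attrs, values, missing_is_wildcard=False) for r in other_rows)
```

With these two counts, the candidate replacement values for a partially observed row are exactly those that keep the completed combination out of the other STs while maximizing its occurrence in the current one, as described above.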

Figure 2. ILA4 Summary

Table 4. Two STs

Sub-Table 1
Original Example  New Example  Outlook  Temp     Humidity  Stormy  Decision
1                 1            Clear    Warm     Stifling  --      N
2                 2            Clear    --       --        True    N
6                 3            --       Cold     Comfy     --      N
8                 4            Clear    Regular  --        False   N
14                5            Damp     Regular  Stifling  --      N

Sub-Table 2
Original Example  New Example  Outlook  Temp     Humidity  Stormy  Decision
3                 1            --       Warm     Stifling  --      P
4                 2            Damp     --       Stifling  False   P
5                 3            --       Cold     --        False   P
7                 4            Cloudy   --       Comfy     True    P
9                 5            Clear    Cold     Comfy     False   P
10                6            --       --       Comfy     False   P
11                7            Clear    Regular  Comfy     --      P

5. Illustrative Case Study

As an illustration of the operation of ILA4, and to compare its results with those obtained from the original complete dataset, let us reconsider the weather training set given in Table 1, but with some values deleted at random, as depicted in Table 5.

Table 5. Samples of Weather Data with Missing Values.

Example  Outlook  Temperature  Humidity  Stormy  Class
1        Clear    Warm         Stifling  ---     N
2        Clear    ---          ---       True    N
3        ---      Warm         Stifling  ---     P
4        Frost    ---          Stifling  False   P
5        ---      Cold         ---       False   P
6        ---      Cold         Comfy     ---     N
7        Cloudy   ---          Comfy     True    P
8        Clear    Regular      ---       False   N
9        Clear    Cold         Comfy     False   P
10       ---      ---          Comfy     False   P
11       Clear    Regular      Comfy     ---     P
12       Cloudy   Regular      ---       True    P
13       Cloudy   Warm         Comfy     False   P
14       Frost    Regular      Stifling  ---     N

Because n = 2, the first step of ILA4 generates two STs, as shown in Table 6.

Table 6. Training Set STs, including missing values, partitioned by Decision Class.

ST1
Old Example  New Example  Outlook  Temperature  Humidity  Stormy  Decision
1            1            Clear    Warm         Stifling  ---     N
2            2            Clear    ---          ---       True    N
6            3            ---      Cold         Comfy     ---     N
8            4            Clear    Regular      ---       False   N
14           5            Frost    Regular      Stifling  ---     N

ST2
Old Example  New Example  Outlook  Temperature  Humidity  Stormy  Decision
3            1            ---      Warm         Stifling  ---     P
4            2            Frost    ---          Stifling  False   P
5            3            ---      Cold         ---       False   P
7            4            Cloudy   ---          Comfy     True    P
9            5            Clear    Cold         Comfy     False   P
10           6            ---      ---          Comfy     False   P
11           7            Clear    Regular      Comfy     ---     P
12           8            Cloudy   Regular      ---       True    P
13           9            Cloudy   Warm         Comfy     False   P

According to step 2 of ILA, the first ST in Table 6 is processed. For j=1, the attribute combinations are {Outlook}, {Temperature}, {Humidity} and {Stormy}. It is noted that no value of these combinations features in ST1 but not in ST2. Therefore, j is increased to 2 and the new combinations are: {Outlook, Temperature}, {Outlook, Humidity}, {Outlook, Stormy}, {Temperature, Humidity}, {Temperature, Stormy}, and {Humidity, Stormy}. Considering the {Outlook, Temperature} combination, we find the values "Clear, Warm" and "Frost, Regular", which occur in ST1 and not in any other table. For "Clear, Warm", according to step 4, we have one occurrence of the value combination "Clear, ---", which contains a missing value and lies under the same combination with the common value Clear; the missing value can therefore be temporarily replaced by Warm, which results in two occurrences. For the second combination, "Frost, Regular", only one occurrence is available. If we keep looking for the combination with the highest occurrence, we find "Clear, Stifling" under {Outlook, Humidity} and "Regular, Stifling" under {Temperature, Humidity}, each with three occurrences. Since both have the same number of occurrences, the algorithm considers the first combination, marks the rows in which it appears (i.e. 1, 2, and 4) as classified, and generates the following rule:

Rule 1: IF Outlook is Clear AND Humidity is Stifling THEN the decision is N.

Next, as per ILA, steps 4 through 9 are repeated against the remaining unmarked examples in the first ST (rows 3 and 5). When reapplying steps 4 to 9, we find a single occurrence of "Regular, Stifling" under {Temperature, Humidity}; this combination has a maximum occurrence of 1 in the first ST and does not occur in the second. Therefore, row 5 is marked as classified, and the following rule (Rule 2) is added to the ruleset:

Rule 2: IF Temperature is Regular AND Humidity is Stifling THEN the decision is N.

As per step 11 of the algorithm, only one row (row 3) remains unmarked, with the only value combination "Cold, Comfy" under {Temperature, Humidity}, which is also present in ST2.
That means we cannot generate a rule using this combination. In this case, according to the suggested approach, and as discussed earlier, the missing value for {Outlook} within row 3 can be replaced by the values "Cloudy" or "Frost" but not by "Clear". Furthermore, the missing value within the same row corresponding to the {Stormy} attribute may only be replaced by the value "True". As explained above, the algorithm selects values that yield the maximum occurrence of the processed row's combination within the current ST. In our example, because only row 3 remains unmarked, any candidate value from the three is suitable to substitute a missing value in the row at hand, for example "Frost" for {Outlook}; hence the following rule is generated and the row is marked as classified:

Rule 3: IF Outlook is Frost AND Temperature is Cold AND Humidity is Comfy THEN the decision is N.

As all rows of ST1 are now classified, we move to the next ST (ST2). With j set to 1, and after all missing values under {Outlook} (in rows 1, 3, and 6) are replaced with "Cloudy", the occurrence of the "Cloudy" attribute value of {Outlook} becomes 6 (the maximum occurrence for j = 1), in rows 1, 3, 4, 6, 8 and 9 of ST2, with no occurrences in ST1. According to step 4, these rows are flagged as classified, and Rule 4 is subsequently added to the list of rules:

Rule 4: IF Outlook is Cloudy THEN the decision is P.

In the remaining rows of ST2 (i.e., 2, 5, and 7), no single value of any attribute appears in ST2 and not in ST1. So j is increased by 1 to 2, the two-attribute combinations are generated, and the value "Comfy, False" of the combination {Humidity, Stormy} is found to occur twice, in rows 5 and 7, after replacing the missing value under {Stormy} in row 7 with "False". This is the maximum occurrence, so rows 5 and 7 are flagged as classified, and the following rule is generated:

Rule 5: IF Humidity is Comfy AND Stormy is False THEN the decision is P.

For the last row remaining unclassified in this ST (i.e., row 2), it is found that the value "Frost, False" of the combination {Outlook, Stormy} occurs once in ST2 and not in the other ST. So row 2 is flagged as classified, and the following rule is generated:

Rule 6: IF Outlook is Frost AND Stormy is False THEN the decision is P.

Now, since all of the rows in ST2 are classified and no ST remains, the algorithm reaches its termination point. Comparing the results obtained by ILA4 in this example with the results obtained by applying ILA to the original dataset, it is noted that ILA4 generates 6 rules here, whereas ILA generates 5. The accuracy of the rules generated by ILA4 when applied to the original dataset is 92.3%, which is reasonable for such a minimal example; a small sketch of how such an ordered rule list is applied to classify examples is given below. In the following section, to show the feasibility and accuracy of ILA4, a number of experiments are conducted on datasets of different sizes.
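Before moving to the experiments, the following minimal sketch (an illustration, not the authors' evaluation code) shows how rules of the form generated above can be applied as an ordered list, first match wins, and how accuracy against a labelled dataset can be measured; the example rule encoding is hypothetical.

```python
def apply_rules(rules, example, default=None):
    """rules: ordered list of (conditions_dict, class_value); first match wins."""
    for conditions, cls in rules:
        if all(example.get(a) == v for a, v in conditions.items()):
            return cls
    return default  # no rule fired

def rule_accuracy(rules, examples, class_attr):
    hits = sum(apply_rules(rules, e) == e[class_attr] for e in examples)
    return hits / len(examples)

# Hypothetical usage with the first two rules of the case study.
rules = [
    ({"Outlook": "Clear", "Humidity": "Stifling"}, "N"),        # Rule 1
    ({"Temperature": "Regular", "Humidity": "Stifling"}, "N"),  # Rule 2
]
example = {"Outlook": "Clear", "Temperature": "Warm", "Humidity": "Stifling",
           "Stormy": "False", "Class": "N"}
print(apply_rules(rules, example))  # -> "N"
```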
6. Experimental Results

As explained above, it is necessary to replace missing data values so that the induction algorithm can be applied to the large volumes of incomplete datasets; such algorithms are designed from the outset to operate exclusively on complete datasets. The priority in this process is to compromise the accuracy of the induction models as little as possible. A set of experiments on several datasets of different sizes was conducted to evaluate ILA4 from different perspectives. These experiments are discussed in detail in the following paragraphs.

One way to evaluate an induction algorithm that deals with datasets with missing values is to take a dataset with complete values, remove some values at random at specific percentages, run the algorithm against the resulting datasets, and then compare the results with the rules generated from the complete dataset. To this end, ILA4 was applied to several datasets of small and large sizes. The test datasets (Balance, Vote, and Monk1) were sourced from the UCI Repository of Machine Learning Databases and Domain Theories, as described in Table 7.

Table 7. Domains

Domain Characteristic                        Monk1            Vote                      Balance
Number of attributes                         6+1              16+1                      4+1
Number of examples                           124              300                       625
Average values per attribute                 2.83             2                         5
Number of class values                       2                2                         3
Distribution of examples among class values  0: 50%, 1: 50%   democrat: 61.33%,         L: 46.08%, B: 07.84%,
                                                              republican: 38.67%        R: 46.08%

In this experiment, a predetermined percentage of values (10%, 30%, and 50%) of the original datasets is replaced with nulls. ILA4 is then applied to these datasets, and the rules generated by ILA4 from the now incomplete datasets are compared with the rules from the original dataset; the closer these results are, the more effective ILA4 is. The results are shown in Table 8. These results are the averages over five experiments and include the number of generated rules, the average number of conditions per rule, and the execution time in seconds. It is apparent from the results that these values are not far from the results of applying ILA4 to the original dataset.

Table 8. ILA4 results against the test datasets with varying missing value rates

                              Monk1                      Vote                       Balance
% missing values              0%    10%   30%   50%      0%    10%   30%   50%      0%    10%   30%   50%
Number of rules               32    32    36    39       42    43    46    50       303   307   311   323
Average number of conditions  3.28  3.31  3.25  3.21     3.45  3.44  3.48  3.4      3.41  3.38  3.43  3.42
Execution time (s)            1.72  1.8   2.2   2.5      4.17  4.19  4.21  4.26     0.87  0.9   1.1   1.3

It is essential to measure the accuracy of the suggested system. Accuracy here means the degree to which ILA4 can replace the missing values with correct ones, such that the generated rules correctly classify as many examples in the dataset as possible. Table 9 shows the ILA4 algorithm's accuracy when applied to the three datasets, distinguished by missing value percentage. As shown in the table, the algorithm's accuracy decreases as the percentage of missing values increases. Nonetheless, we still see good accuracy (around 90%) from ILA4, even with 50% missing values. Note that the accuracy of applying ILA4 to the original dataset (with 0% missing values) is always 100%; the original dataset is used here as a benchmark for the other datasets.

Table 9. ILA4 accuracy against original test datasets with varying missing value rates

                  Monk1                    Vote                     Balance
% missing values  0%    10%   30%   50%    0%    10%   30%   50%    0%    10%   30%   50%
Accuracy (%)      100   99    95    91     100   97    96    90     100   98    94    89

In the second set of experiments, the accuracy of ILA4 is analyzed against known methods that treat missing data values, including the Delete strategy, the Most Common Value, and the Most Common Value Restricted to a Concept. Using the same datasets from Table 7 above, the main steps in the experiment are as follows (see the protocol sketch after this list):
i. Randomly replace values in the original datasets with nulls (10%, 30%, and 50%).
ii. Repeat the experiment five times and consider the average accuracy.
iii. Apply ILA against these datasets.
iv. Compare the results with those obtained from applying ILA4 against the same datasets.
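The masking-and-evaluation protocol in steps i-iv above can be summarized as in the following hedged sketch (random cell masking at a given rate, averaged over repetitions); it is a general illustration, not the authors' exact experimental harness, and the `learn_and_score` callable is a placeholder for the learner under test (e.g. ILA4 or ILA with one of the three strategies).

```python
import numpy as np
import pandas as pd

def mask_at_random(df, rate, class_col, seed):
    """Replace `rate` of the non-class cells with nulls, chosen uniformly at random."""
    rng = np.random.default_rng(seed)
    out = df.copy().astype(object)  # object dtype so any cell can hold a null
    feature_cols = [c for c in out.columns if c != class_col]
    cells = [(i, c) for i in out.index for c in feature_cols]
    n_mask = int(rate * len(cells))
    for k in rng.choice(len(cells), size=n_mask, replace=False):
        i, c = cells[k]
        out.at[i, c] = None
    return out

def average_accuracy(df, class_col, rate, learn_and_score, repeats=5):
    """Repeat the mask/learn/score cycle and average the accuracy, as in steps i-iv."""
    scores = [learn_and_score(mask_at_random(df, rate, class_col, seed=r), df)
              for r in range(repeats)]
    return sum(scores) / repeats
```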
The experiments show the effect of ILA4, with the suggested bespoke method for treating missing data values, on the three datasets, contrasted with the effect of ILA on the same datasets using the three methods for treating missing values (Delete, MCV and MCVRC). The performance of ILA4 is shown in Table 8, and Tables 10, 11 and 12 show the results of applying each ILA strategy to the datasets used previously.

Table 10. ILA performance against the test datasets with varying missing data rates – MCV strategy

                              Monk1               Vote                Balance
% missing values              10%   30%   50%     10%   30%   50%     10%   30%   50%
Number of rules               34    40    42      45    47    55      313   318   343
Average number of conditions  3.41  3.57  3.71    3.5   3.52  3.41    3.4   3.42  3.45
Execution time (s)            1.39  2.0   2.2     3.7   4.1   3.8     0.5   0.8   1.0

Table 11. ILA performance against the test datasets with varying missing data rates – MCVRC strategy

                              Monk1               Vote                Balance
% missing values              10%   30%   50%     10%   30%   50%     10%   30%   50%
Number of rules               31    38    41      45    44    53      310   307   335
Average number of conditions  3.4   3.1   3.2     3.4   3.51  3.44    3.39  3.45  3.46
Execution time (s)            1.4   1.9   2.1     3.8   3.9   4.1     0.6   0.8   0.9

Table 12. ILA performance against the test datasets with varying missing data rates – Delete strategy

                              Monk1               Vote                Balance
% missing values              10%   30%   50%     10%   30%   50%     10%   30%   50%
Number of rules               28    33    28      40    39    28      298   287   136
Average number of conditions  3.1   2.9   3.4     3.1   3.53  3.47    3.8   3.41  3.47
Execution time (s)            0.9   0.7   0.4     2.9   2.5   1.1     0.8   0.6   0.4

As shown in the comparative analysis of ILA4 with the three methods above (Table 13), the Delete method performs worst among the approaches (execution time notwithstanding), since it removes every row with missing values. This is logical because the loss of data instances from the dataset reduces the depth of the learning resource available to the model. Next in performance is the Most Common Value strategy, as it preserves the original number of data instances for learning. The best comparative performance among the three comes from the MCVRC strategy, which substitutes missing values per class value instead of doing so over the entire dataset.

Table 13. The ILA vs ILA4 accuracy (Vote, Balance and Monk1 datasets)

                                     Monk1              Vote               Balance
% missing values                     10%   30%   50%    10%   30%   50%    10%   30%   50%
ILA4                                 99%   95%   91%    97%   96%   90%    98%   94%   89%
ILA with the Most Common Value
strategy Restricted to a Concept     94%   91%   84%    92%   87%   83%    93%   90%   80%
ILA with the Most Common Value
strategy                             90%   87%   79%    85%   82%   78%    91%   84%   75%
ILA with the Delete strategy         86%   81%   70%    80%   79%   71%    84%   81%   72%

ILA4 produces the best results overall because it handles the majority of issues that impede the other methods: it capitalizes on the strength of the MCV strategy for missing value substitution while taking the current class value into account, and without compromising the processes within ILA. As expected, the three methods execute more efficiently than ILA4; this is because the inductive algorithm within those methods receives a dataset that has been preprocessed so that all missing values are already substituted, whereas ILA4 has to handle missing values as part of the induction phase. Table 13 illustrates the comparative accuracy of ILA4 and of ILA with each of the three methods; the table highlights the advantage of ILA4 in the majority of test cases.

To compare ILA4 with some other well-known algorithms, two further experiments were conducted. These experiments were performed on a machine with an Intel Core i9-9900K Coffee Lake 8-core, 16-thread CPU with 16 MB cache, using Weka 3.9.4; 5-fold cross-validation is used for test dataset construction. In the first experiment, ILA4 is compared with three algorithms: Logistic Regression, Naïve Bayes, and Random Forest. Three datasets are used in this experiment, namely the Breast Cancer, Marketing, and Mushroom datasets.
These datasets were downloaded from https://sci2s.ugr.es/keel/missing.php and have different sizes (one small and two large) and different percentages of examples with missing values. Table 14 shows the detailed description of these datasets.

Table 14. Description of the Breast Cancer, Marketing, and Mushroom datasets

Domain Characteristic                        Breast Cancer                 Marketing           Mushroom
Number of attributes                         9+1                           13+1                22+1
Number of examples                           286                           8993                8124
Average values per attribute                 5.67                          5.77                5.68
Number of class values                       2                             9                   2
Percentage of examples with missing values   3.15%                         23.54%              30.53%
Distribution of examples among class values  no-recurrence-events: 70.2%,  1: 19%, 2: 09%,     p: 51.8%,
                                             recurrence-events: 29.8%      3: 07%, 4: 09%,     e: 48.2%
                                                                           5: 08%, 6: 12%,
                                                                           7: 11%, 8: 15%,
                                                                           9: 10%

The experiment shows that the results of ILA4 are comparable with those of the other three algorithms. Table 15 shows the detailed results. It is noted from the table that the accuracy of correctly classified instances of ILA4 is on a par with the other algorithms, albeit with a slightly higher execution time.

Table 15. ILA4 results against other algorithms

Breast Cancer
Algorithm           ILA4    Logistic  Naïve Bayes  Random Forest
Accuracy (%)        73.1    65.4      71.7         68.2
Execution time (s)  0.03    0.01      0            0.01
TP Rate             0.731   0.654     0.717        0.682
FP Rate             0.378   0.561     0.452        0.528
Precision           0.722   0.626     0.702        0.656
Recall              0.731   0.654     0.717        0.682
F-Measure           0.729   0.636     0.707        0.664

Marketing
Algorithm           ILA4    Logistic  Naïve Bayes  Random Forest
Accuracy (%)        37.2    34.3      31.9         32.3
Execution time (s)  15.13   13.86     0            2.07
TP Rate             0.372   0.343     0.319        0.323
FP Rate             0.099   0.097     0.095        0.095
Precision           0.353   0.312     0.297        0.302
Recall              0.42    0.343     0.319        0.323
F-Measure           0.357   0.316     0.300        0.310

Mushroom
Algorithm           ILA4    Logistic  Naïve Bayes  Random Forest
Accuracy (%)        99.3    99.9      95.7         100
Execution time (s)  1.3     1.2       0.02         0.23
TP Rate             0.993   0.999     0.957        1.000
FP Rate             0.023   0.001     0.046        0.000
Precision           0.988   0.999     0.959        1.000
Recall              0.985   0.999     0.957        1.000
F-Measure           0.986   0.999     0.957        1.000

In the second experiment, ILA4 is compared further with other well-known decision tree and rule induction algorithms, namely J48 and JRip, on two datasets: tic-tac-toe and Hayes-Roth. Table 16 shows the characteristics of these datasets.

Table 16. Description of the tic-tac-toe and Hayes-Roth datasets

Domain Characteristic                        Tic-tac-toe          Hayes-Roth
Number of attributes                         9                    5
Number of examples                           958                  132
Average values per attribute                 3                    3.2
Number of class values                       2                    3
Percentage of examples with missing values   0.3%                 2.27%
Distribution of examples among class values  positive: 65.3%,     1: 38.6%, 2: 28.6%,
                                             negative: 34.7%      3: 22.7%

This experiment shows that the results of ILA4 are also comparable with those of these algorithms. Table 17 shows the detailed results. It is noted from the table that ILA4 performs well against J48 and JRip in terms of accuracy, but with a slightly higher execution time.

Table 17. ILA4 results against the J48 and JRip algorithms on the tic-tac-toe and Hayes-Roth datasets

                    Tic-tac-toe               Hayes-Roth
Algorithm           ILA4    J48     JRip      ILA4     J48     JRip
Accuracy (%)        94.56   83.8    98.01     87.304   74.242  81.818
Execution time (s)  0.08    0       0.02      0.07     0       0
TP Rate             0.945   0.838   0.980     0.889    0.742   0.818
FP Rate             0.134   0.222   0.032     0.061    0.162   0.114
Precision           0.945   0.836   0.980     0.889    0.766   0.818
Recall              0.945   0.838   0.980     0.889    0.742   0.818
F-Measure           0.945   0.836   0.980     0.889    0.745   0.818
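For reference, the Weka-based comparisons above correspond to a benchmarking loop like the following scikit-learn sketch for the three baseline classifiers (Logistic Regression, Naïve Bayes, Random Forest) under 5-fold cross-validation. It is offered as an illustration of the protocol rather than a reproduction of the authors' Weka setup; dataset loading and categorical encoding are left abstract, and the weighted-average metrics are our choice.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

def benchmark(X, y, folds=5):
    """5-fold cross-validated accuracy, precision, recall and F-measure for the
    three baseline classifiers used in the paper's comparison."""
    models = {
        "Logistic": LogisticRegression(max_iter=1000),
        "Naive Bayes": GaussianNB(),
        "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    }
    scoring = ["accuracy", "precision_weighted", "recall_weighted", "f1_weighted"]
    results = {}
    for name, model in models.items():
        cv = cross_validate(model, X, y, cv=folds, scoring=scoring)
        results[name] = {m: cv[f"test_{m}"].mean() for m in scoring}
    return results
```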
7. Summary and Conclusions

The work presented here introduced a new approach for replacing discrete missing values within datasets used by machine learning techniques. The method is integrated with the ILA inductive learning algorithm, which has demonstrated inductive effectiveness. Extending ILA with these new features creates scope for handling more datasets that contain missing values; this addition distinguishes ILA4 from similar standard algorithms. Experimental tests and comparative effectiveness and accuracy analyses were performed against common approaches for substituting missing values, including MCV, MCVRC, and the Delete strategy. The experiments demonstrated the viability of the proposed methodology and system, with favourable results regarding both the number and the complexity of the generated rules. In terms of execution time, there is a modest cost inherent in the design of the new method, as data preprocessing is integrated within the inductive process; nonetheless, performance on this metric is also comparable to the established methods.

References

[1] B. Angelov, "Working with Missing Data in Machine Learning," Medium, Dec. 2017. [Online]. Available: https://towardsdatascience.com/working-with-missing-data-in-machine-learning-9c0a430df4ce. [Accessed: 27-Jul-2019].
[2] J. Joseph, "How to Treat Missing Values in Your Data," Apr. 2016. [Online]. Available: https://www.datasciencecentral.com/profiles/blogs/how-to-treat-missing-values-in-your-data-1. [Accessed: 27-Jul-2019].
[3] E. Acuna and C. Rodriguez, "The treatment of missing values and its effect on classifier accuracy," in Classification, Clustering, and Data Mining Applications, Springer, Berlin, Heidelberg, 2004, pp. 639-647. https://doi.org/10.1007/978-3-642-17103-1_60
[4] J. Barnard and X. L. Meng, "Applications of multiple imputation in medical studies: from AIDS to NHANES," Statistical Methods in Medical Research, vol. 8, no. 1, pp. 17-36, 1999. https://doi.org/10.1177%2F096228029900800103
[5] L. Wohlrab and J. Fürnkranz, "A Comparison of Strategies for Handling Missing Values in Rule Learning," Technische Universität Darmstadt, Germany, Technical Report TUD-KE-2009-03, 2009.
[6] S. M. Abu-Soud, "A Novel Approach for Dealing with Missing Values in Machine Learning Datasets with Discrete Values," in 2019 International Conference on Computer and Information Sciences (ICCIS), IEEE, 2019, pp. 1-5. DOI: 10.1109/ICCISCI.2019.8716430
[7] M. Tolun and S. Abu-Soud, "An Inductive Learning Algorithm for Production Rule Discovery," Expert Systems with Applications, vol. 14, no. 3, pp. 361-370, Apr. 1998.
[8] A. Farhangfar, L. A. Kurgan, and W. Pedrycz, "A novel framework for imputation of missing values in databases," IEEE Transactions on Systems, Man, and Cybernetics - Part A: Systems and Humans, vol. 37, no. 5, pp. 692-709, 2007. DOI: 10.1109/TSMCA.2007.902631
[9] V. Tresp, R. Neuneier, and S. Ahmad, "Efficient methods for dealing with missing data in supervised learning," in Advances in Neural Information Processing Systems, 1995, pp. 689-696.
[10] J. M. Jerez, I. Molina, P. J. García-Laencina, E. Alba, N. Ribelles, M. Martín, and L. Franco, "Missing data imputation using statistical and machine learning methods in a real breast cancer problem," Artificial Intelligence in Medicine, vol. 50, no. 2, pp. 105-115, 2010. https://doi.org/10.1016/j.artmed.2010.05.002
[11] A. Farhangfar, L. Kurgan, and J. Dy, "Impact of imputation of missing values on classification error for discrete data," Pattern Recognition, vol. 41, no. 12, pp. 3692-3705, 2008. https://doi.org/10.1016/j.patcog.2008.05.019
[12] J. L. Schafer, Analysis of Incomplete Multivariate Data. CRC Press, 1997.
[13] S. Van Buuren, H. C. Boshuizen, and D. L. Knook, "Multiple imputation of missing blood pressure covariates in survival analysis," Statistics in Medicine, vol. 18, no. 6, pp. 681-694, 1999. https://doi.org/10.1002/(SICI)1097-0258(19990330)18:6<681::AID-SIM71>3.0.CO;2-R
[14] Y. Liu, T. Dillon, W. Yu, W. Rahayu, and F. Mostafa, "Missing Value Imputation for Industrial IoT Sensor Data With Large Gaps," IEEE Internet of Things Journal, vol. 7, no. 8, pp. 6855-6867, Aug. 2020. DOI: 10.1109/JIOT.2020.2970467
[15] K. T. Do, S. Wahl, J. Raffler, et al., "Characterization of missing values in untargeted MS-based metabolomics data and evaluation of missing data handling strategies," Metabolomics, vol. 14, art. 128, 2018. https://doi.org/10.1007/s11306-018-1420-2
[16] P. S. Raja and K. Thangavel, "Missing value imputation using unsupervised machine learning techniques," Soft Computing, vol. 24, pp. 4361-4392, 2020. https://doi.org/10.1007/s00500-019-04199-6
[17] S. Nikfalazar, C.-H. Yeh, S. Bedingfield, et al., "Missing data imputation using decision trees and fuzzy clustering with iterative learning," Knowledge and Information Systems, vol. 62, pp. 2419-2437, 2020. https://doi.org/10.1007/s10115-019-01427-1
[18] W.-C. Lin and C.-F. Tsai, "Missing value imputation: a review and analysis of the literature (2006-2017)," Artificial Intelligence Review, vol. 53, pp. 1487-1509, 2020. https://doi.org/10.1007/s10462-019-09709-4
[19] W. Rashid and M. K. Gupta, "A Perspective of Missing Value Imputation Approaches," in Advances in Computational Intelligence and Communication Technology, Advances in Intelligent Systems and Computing, vol. 1086, Springer, Singapore, 2021. https://doi.org/10.1007/978-981-15-1275-9_25
[20] S. M. Abu-Soud, "A framework for integrating DSS and ES with machine learning," in Industrial and Engineering Applications of Artificial Intelligence and Expert Systems: Proceedings of the Tenth International Conference, CRC Press, p. 231. ISBN: 978-90-5699-615-4
[21] S. M. Abu-Soud and M. R. Tolun, "DCL: A Disjunctive Learning Algorithm for Rule Extraction," in Multiple Approaches to Intelligent Systems (IEA/AIE 1999), Lecture Notes in Computer Science, vol. 1611, Springer, Berlin, Heidelberg, 1999. https://doi.org/10.1007/978-3-540-48765-4_7
[22] M. R. Tolun, H. Sever, M. Uludag, and S. M. Abu-Soud, "ILA-2: An inductive learning algorithm for knowledge discovery," Cybernetics & Systems, vol. 30, no. 7, pp. 609-628, 1999. https://doi.org/10.1080/019697299125037
[23] S. M. Abu-Soud and S. Al Majali, "ILA-3: An Inductive Learning Algorithm with a New Feature Selection Approach," WSEAS Transactions on Systems and Control, vol. 13, art. 21, pp. 171-185, 2018.
[24] S. Abu-Soud, "PaSSIL: A New Keystroke Dynamics System for Password Strengthening Based on Inductive Learning," WSEAS Transactions on Information Science and Applications, vol. 13, art. 13, pp. 126-133, 2016.

Declaration of interests

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
