1.2. Data Quality
1.2.1. Accuracy
Accuracy is generally defined as the degree to which data attributes correctly represent the true value of a concept or event in specific use cases [14]. There are two types of accuracy: syntactic accuracy and semantic accuracy [2]. Syntactic accuracy is defined as the closeness between a value ν and the corresponding elements of a domain D, where D is the definition domain that includes all the correct representations the user defines. Semantic accuracy is defined as the closeness between a value ν and the true value ν' of the real-life phenomenon that ν tries to portray. In Table 1.1, "Müncen" is syntactically inaccurate, whereas "10623" is syntactically accurate but semantically inaccurate. This thesis focuses only on syntactic accuracy.
Syntactic accuracy
Acc(t)=\Large\frac{\sum_{i=1}^{\left|t\right|}acc(r_i,\ D(r_i))}{\left|t\right|} [2]
Error type | acc(r_i, D(r_i)) | Notes |
---|---|---|
String errors | \begin{cases} 1 & \text{if $r_i \in D(r_i)$} \\ 1 - NED(r_i,\ D(r_i)) & \text{otherwise} \end{cases} | NED is the Normalized Edit Distance [6]; this edit distance is known as the Levenshtein distance [13]. |
Numerical errors | \begin{cases} 1 & \text{if $r_i \in D(r_i)$} \\ 1 - \text{$\large\frac{\left|r_i-D(r_i)\right|}{Max(r_i, \ D(r_i))}$} & \text{otherwise} \end{cases} | - |
Overall syntactic accuracy for the ith column
CA_i = Acc(a) = \Large\frac{\sum_{j=1}^{\left|a\right|}acc(r_j,\ D(r_j))}{\left|a\right|}
Calculating syntactic accuracy for the column "City" in Table 1.1
j | r_j (City) | D(r_j) | NED(r_j, D(r_j)) | acc(r_j, D(r_j)) |
---|---|---|---|---|
1 | Müncen | München | 1/7 | 6/7 |
2 | Garching bei München | Garching bei München | 0 | 1 |
3 | Neubiberg | Neubiberg | 0 | 1 |
CA_4 = Acc(City) = \Large\frac{\sum_{j=1}^{3}acc(r_j,\ D(r_j))}{3} = \frac{\frac{6}{7}+1+1}{3} \approx \normalsize0.95
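The syntactic accuracy metric for string errors can be sketched in Python as follows. The thesis's implementation uses the Levenshtein library; here a small edit-distance routine is inlined to keep the sketch self-contained, and normalizing by the longer string's length is the assumption behind NED that matches the 1/7 value in the table above:

```python
def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein (edit) distance."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, 1):
        cur = [i]
        for j, ch_b in enumerate(b, 1):
            cur.append(min(prev[j] + 1,           # deletion
                           cur[j - 1] + 1,        # insertion
                           prev[j - 1] + (ch_a != ch_b)))  # substitution
        prev = cur
    return prev[-1]

def acc_string(value: str, domain_value: str) -> float:
    """Per-cell syntactic accuracy: 1 if the value is in the domain,
    otherwise 1 - NED (edit distance normalized by the longer length)."""
    if value == domain_value:
        return 1.0
    ned = edit_distance(value, domain_value) / max(len(value), len(domain_value))
    return 1.0 - ned

def column_accuracy(values, domain_values) -> float:
    """Overall syntactic accuracy CA_i: mean of the per-cell accuracies."""
    return sum(acc_string(v, d) for v, d in zip(values, domain_values)) / len(values)

# The "City" column from Table 1.1
city   = ["Müncen", "Garching bei München", "Neubiberg"]
domain = ["München", "Garching bei München", "Neubiberg"]
ca = column_accuracy(city, domain)  # (6/7 + 1 + 1) / 3, roughly 0.95
```

With these inputs, `acc_string("Müncen", "München")` reproduces the 6/7 from the table, since the edit distance is 1 and the longer string has 7 characters.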
1.2.2. Completeness
Completeness can be defined as the degree to which data are of sufficient breadth, depth and scope for the task at hand [16]. There are three types of completeness: schema completeness, column completeness and population completeness. This work only considers column completeness, which can be determined by the presence or absence of null values. A null value can mean three different things: a real empty value, a missing value or an ambiguous empty value. In Table 1.2, "Emily" does not have an email, so no incompleteness occurs. "Frank" has an email, but its value is unknown, which is an incompleteness. "Gabrielle" may or may not have an email, so it cannot be determined whether incompleteness occurs.
Completeness
\textit{Completeness rating} = 1 - \left(\large\frac{\textit{Number of incomplete items}}{\textit{Total number of items}} \right) [10]
Overall completeness for the ith column
CC_i = 1 - \large\frac{\textit{Number of incomplete items in the column}}{\textit{Total number of items in the column}}
Calculating completeness for the column "Email" in Table 1.2
Person ID | Email | Cell Completeness |
---|---|---|
1 | danny90@gmail.com | 1 |
2 | NULL | 1 |
3 | NULL | 0 |
4 | NULL | 1 |
CC_5 = 1 - \large\frac{\textit{Number of incomplete items in the "Email" column}}{\textit{Total number of items in the "Email" column}} = \normalsize1 - \large\frac{1}{4} = \normalsize0.75
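The completeness rating can be sketched as below. Because a null value does not by itself imply incompleteness (a real empty value counts as complete), the sketch works from explicit per-cell completeness flags like those in the table rather than inferring them from NULLs:

```python
def completeness_rating(cell_flags) -> float:
    """Completeness rating = 1 - (number of incomplete items / total items).
    Each flag is 1 (complete) or 0 (incomplete)."""
    incomplete = cell_flags.count(0)
    return 1 - incomplete / len(cell_flags)

# Cell completeness for the "Email" column in Table 1.2: three cells are NULL,
# but only the missing value (person 3) counts as incomplete.
email_flags = [1, 1, 0, 1]
cc = completeness_rating(email_flags)  # 1 - 1/4 = 0.75
```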
1.2.3. Consistency
Consistency is the degree to which data has attributes that are non-conflicting and coherent with other data in specific use cases [14]. There are three types of consistency: consistency of integrity constraints, consistency between two related data elements and consistency of format. Consistency of integrity constraints can be further divided into two categories, namely intrarelation constraints and interrelation constraints [3]. In Table 1.3, employees must be above 18 years old. However, "Jack" is below the age of 18, leading to an intrarelation integrity constraint violation. Moving on to Table 1.4, it can be seen that "Harry" has a promotion year of 2022 but a start year of 2024, which violates an interrelation integrity constraint. In the same table, the value "in 5 years" in the column "Promotion Year" violates format consistency, as only numerical values are expected. This thesis applies the first and third types of consistency with slight modifications.
Consistency
\textit{Consistency rating} = 1 - \left(\large\frac{\textit{Number of inconsistent units}}{\textit{Total number of consistency checks performed}} \right) [10]
Overall consistency for the ith column
CCS_i = 1 - \large\frac{\textit{Number of inconsistent units in the column}}{\textit{Total number of consistency checks performed in the column}}
Calculating consistency for the column "Promotion Year" in Table 1.4
Promotion ID | Promotion Year | Cell Consistency |
---|---|---|
321 | 2022 | 0 |
322 | 2026 | 1 |
323 | in 5 years | 0 |
CCS_3 = 1 - \large\frac{\textit{Number of inconsistent units in the "Promotion Year" column}}{\textit{Total number of consistency checks performed in the "Promotion Year" column}} = \normalsize1 - \large\frac{2}{3} \approx \normalsize0.33
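The two checks applied to "Promotion Year" (numeric format, and promotion not before the start year) can be sketched as follows. Only Harry's start year (2024) is given in the text; the start years for the other two promotions are assumed here purely for illustration:

```python
def cell_consistency(promo_year: str, start_year: int) -> int:
    """1 if the value passes both checks, else 0:
    - format consistency: the value must be numeric,
    - interrelation integrity: promotion year must not precede the start year."""
    if not promo_year.isdigit():   # e.g. "in 5 years" fails the format check
        return 0
    return 1 if int(promo_year) >= start_year else 0

# Table 1.4; start years other than Harry's 2024 are assumed for illustration.
rows = [("2022", 2024), ("2026", 2024), ("in 5 years", 2024)]
flags = [cell_consistency(p, s) for p, s in rows]   # [0, 1, 0]
ccs = 1 - flags.count(0) / len(flags)               # 1 - 2/3, roughly 0.33
```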
1.2.4. Weighted Metric
Simple ratios are easy to implement for measuring data quality, but they do not account for the varying importance of different data [15]. In a company, some data may matter more than others for achieving its business goals; Elouataoui et al. [7] share this opinion, stating that this is true in most organizations. Such relevant data should therefore carry more weight in the overall score. Likewise, some attributes of a tuple in structured data may be more important than others. In Table 1.5, we can assess the overall completeness by averaging the five completeness values, resulting in 78%. However, assigning each column a specific weight yields a more practical completeness value for this relation.
Weighted column completeness
\sum_{i=1}^{n}(CW_i \times CC_i) [15]
Weighted column accuracy
\sum_{i=1}^{n}(CW_i \times CA_i)
Weighted column consistency
\sum_{i=1}^{n}(CW_i \times CCS_i)
Calculating weighted column completeness for the relation "Student" in Table 1.5
Column | Student ID | Name | Telephone | Course | |
---|---|---|---|---|---|
Completeness Score | 0.90 | 0.80 | 0.70 | 0.60 | 0.90 |
Weight | 0.30 | 0.25 | 0.10 | 0.20 | 0.15 |
\sum_{i=1}^{5}(CW_i \times CC_i) = (0.90 \times 0.30) + (0.80 \times 0.25) + (0.70 \times 0.10) + (0.60 \times 0.20) + (0.90 \times 0.15) = \normalsize0.795 = 79.5\%
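The weighted column metric is a plain weighted sum; a minimal sketch with the values from Table 1.5 (the same function applies to weighted accuracy and consistency by swapping in CA_i or CCS_i scores):

```python
def weighted_score(scores, weights) -> float:
    """Weighted column metric: sum of CW_i * score_i.
    The column weights are expected to sum to 1."""
    assert abs(sum(weights) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(w * s for w, s in zip(weights, scores))

# Per-column completeness scores and weights for the "Student" relation (Table 1.5)
scores  = [0.90, 0.80, 0.70, 0.60, 0.90]
weights = [0.30, 0.25, 0.10, 0.20, 0.15]
wcc = weighted_score(scores, weights)  # 0.795, i.e. 79.5%
```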
1.3. Usability Evaluation
1.3.1. Evaluation Setup
Quesenbery proposed the 5Es, shown in Figure 1.2, to describe the dimensions of usability: effective, efficient, engaging, error tolerant and easy to learn [11]. The 5Es are easy to remember and have straightforward definitions. Effective refers to how accurately and completely users achieve their goals. Efficient refers to how quickly and accurately a particular task can be done. Engaging represents user satisfaction with the system. Error Tolerant describes how well the system prevents errors and helps users recover from them. Easy to Learn refers to the system's ability to support and guide users during initial orientation and subsequent usage. We can prioritize one usability dimension over the others depending on the specific use case.
1.3.2. Data Collection
The type of data being collected is highly relevant in evaluating a system's usability. There are two types of data: quantitative data and qualitative data. Quantitative data are numeric data that can be obtained from measurements, whereas qualitative data include any information that is not numeric. Table 1.7 shows an example of quantitative and qualitative data based on the usability dimensions, or the 5Es. This thesis primarily collects quantitative data. Any qualitative data obtained through questionnaires will be converted into quantitative data.
To validate time-related usability metrics quantitatively, it is necessary to measure the time taken to complete a task. A digital or analog stopwatch is more accurate and suitable than a clock in this case. Recording or logging the actions carried out by participants when completing an assigned task can be useful for evaluation. Specialized software can keep track of the number of mouse clicks or keystrokes, which is more advantageous than manually recording them on paper.
Using questionnaires to collect data for an evaluation has its pros and cons. Questionnaires provide the same format for all participants, which facilitates consistent data collection and makes it possible to compare results between participants. However, it can be challenging to design a suitable and useful questionnaire, as different evaluations require different metrics. This thesis uses the structure of the System Usability Scale (SUS) [5] as the basis for our questionnaire and modifies it as needed to meet our usability requirements.
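For reference, the standard SUS scoring rule that our modified questionnaire builds on can be sketched as follows; this is Brooke's original scheme [5], not the thesis's modified scoring:

```python
def sus_score(responses) -> float:
    """Standard SUS scoring: ten items rated 1-5. Odd-numbered items
    contribute (rating - 1), even-numbered items (5 - rating); the sum
    is scaled by 2.5 to yield a score between 0 and 100."""
    assert len(responses) == 10 and all(1 <= r <= 5 for r in responses)
    total = sum((r - 1) if i % 2 == 0 else (5 - r)   # i is 0-based, so
                for i, r in enumerate(responses))     # even i = odd item number
    return total * 2.5

# A fully neutral participant (all 3s) lands exactly mid-scale.
score = sus_score([3] * 10)  # 50.0
```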
2. Implementation
This in-house software application is called NEXTREL, an acronym for Next Reliability. The front end is developed on OutSystems, a low-code platform that allows rapid prototyping and iteration within its environment, making it well-suited to the agile development process employed within the development team. This results in a faster and more efficient way of implementing necessary changes compared to traditional software development [1], which is valuable for prototype development that requires feedback from users or the project owner at every iteration. The back end consists of an Oracle database system that provides a robust data model for handling complex relationships between different data points. Consequently, this ensures data integrity and allows complex queries to be executed and their results displayed at the front end.
Additionally, we develop a Python script with various libraries to objectively measure and assess the data quality dimensions using the equations above. The script is also responsible for generating plots to visualize the results. The table below shows all the libraries used and their respective versions.
No. | Library | Usage in This Thesis | Version | Link |
---|---|---|---|---|
1 | pandas | Handling structured data and performing operations on DataFrames for data manipulation and analysis | 2.2.2 | https://pypi.org/project/pandas/2.2.2/ [Accessed: Jul 10, 2024] |
2 | Levenshtein | Computing the Levenshtein (edit) distance for string similarity calculations | 0.26.0 | https://pypi.org/project/Levenshtein/0.26.0/ [Accessed: Jul 10, 2024] |
3 | seaborn | Plotting of customized box, pie and violin plots | 0.13.2 | https://pypi.org/project/seaborn/0.13.2/ [Accessed: Jul 10, 2024] |
4 | matplotlib | Utilization of matplotlib.pyplot, matplotlib.patches and matplotlib.lines to plot and visualize data, create annotations and draw line elements | 3.9.2 | https://pypi.org/project/matplotlib/3.9.2/ [Accessed: Jul 10, 2024] |
5 | numpy | Handling of arrays and generating null values | 1.26.4 | https://pypi.org/project/numpy/1.26.4/ [Accessed: Jul 10, 2024] |
6 | pathlib | Handling file system paths in an object-oriented manner | 1.0.1 | https://pypi.org/project/pathlib/1.0.1/ [Accessed: Jul 10, 2024] |
7 | openpyxl | Reading and writing of Microsoft Excel files | 3.1.5 | https://pypi.org/project/openpyxl/3.1.5/ [Accessed: Jul 10, 2024] |
8 | xlsxwriter | Writing of Microsoft Excel files, particularly writing to multiple sheets within one file | 3.1.1 | https://pypi.org/project/XlsxWriter/3.1.1/ [Accessed: Jul 10, 2024] |
9 | textwrap (included in the Python Standard Library) | Wrapping and formatting text for plotting of data | 3.11.10 | https://docs.python.org/3.11/library/textwrap.html [Accessed: Jul 10, 2024] |
10 | re (included in the Python Standard Library) | Utilization of re.sub to replace substrings matching a pattern and re.split to split strings based on a pattern | 3.11.10 | https://docs.python.org/3.11/library/re.html [Accessed: Jul 10, 2024] |
11 | os (included in the Python Standard Library) | Interacting with the operating system to access files and conduct directory operations | 3.11.10 | https://docs.python.org/3.11/library/os.html [Accessed: Jul 10, 2024] |
12 | random (included in the Python Standard Library) | Generating random numbers and performing random selections | 3.11.10 | https://docs.python.org/3.11/library/random.html [Accessed: Jul 10, 2024] |
13 | string (included in the Python Standard Library) | Utilization of string.digits to check for digits in a string and string.ascii_uppercase to get uppercase English alphabets for certain string replacement | 3.11.10 | https://docs.python.org/3.11/library/string.html [Accessed: Jul 10, 2024] |
3. Evaluation
3.1. Study Setting
Before conducting the study, a clear task description is required to evaluate data quality and system usability. Participants first receive a short introduction to the topic and related information for better context. Their background and experience level with the "Test Sample" workflow are assessed before they are assigned a task related to the workflow: creating a new sample with the same data in both Tool A and Tool B. Some participants start with Tool A and then use Tool B, while others begin with Tool B before moving on to Tool A. This randomized order, inspired by A/B testing, helps to reduce the impact of learning effects, which occur when tests are retaken [4, 12]. After using either tool, participants are shown data generated by that tool in real-life scenarios before completing a questionnaire consisting of ten questions. This process is then repeated with the other tool.
We use the modified questionnaire illustrated in Figure 4.4, which follows the structure of the SUS as seen in Figure 4.3 and the usability dimensions described above. Questions 1 to 7 focus on evaluating usability, specifically the "Error Tolerant" and "Easy to Learn" dimensions, as shown in Figure 4.2. Each of these two dimensions is represented by two questions, i.e. about 28.6% of the usability questions each. Questions 8 to 10, in turn, address the subjective data quality perceived by users.
3.2. Results
To ease the comparison between both tools, namely Excel and NEXTREL, we can summarize the data quality results across all columns for each dimension that was previously calculated. Figure 3.5 shows the result for weighted accuracy, weighted completeness and weighted consistency. Additionally, we can use our modified questionnaire to evaluate the subjective quality of data generated by both tools, as shown in Figure 3.6. With the same questionnaire, we can also gather user feedback to conduct usability evaluation of both tools, as seen in Figure 3.7.
We use violin plots in the results to show the data distribution, which can be positively or negatively skewed. Such skewed distributions are no longer normal, so box plots are not suitable in this situation. For easier interpretation, we modify the violin plots to show the mean instead of the median and the interquartile range.
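The mean-instead-of-median modification can be sketched with matplotlib's built-in violin plot (the thesis uses seaborn for its plots; this is an equivalent illustration, and the score data here are synthetic placeholders, not study results):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripted use
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(42)
# hypothetical questionnaire scores for the two tools, for illustration only
scores = [rng.normal(3.2, 0.8, 30), rng.normal(4.1, 0.5, 30)]

fig, ax = plt.subplots()
# draw the mean marker inside each violin and suppress the median/extrema
parts = ax.violinplot(scores, showmeans=True, showmedians=False,
                      showextrema=False)
ax.set_xticks([1, 2], labels=["Excel", "NEXTREL"])
ax.set_ylabel("Score")
```

The `showmeans`/`showmedians` flags are what swap the central marker; seaborn achieves the same effect by suppressing its default inner box and overlaying the means manually.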
4. Attachments
5. Bibliography
[1] S. Alsaqqa, S. Sawalha and H. Abdel-Nabi, “Agile Software Development: Methodologies and Trends,” International Journal of Interactive Mobile Technologies, vol. 14, no. 11, pp. 246–270, 2020.
[2] C. Batini, D. Barone, F. Cabitza and S. Grega, “A Data Quality Methodology for Heterogeneous Data,” International Journal of Database Management Systems, vol. 3, no. 1, pp. 60–79, 2011.
[3] C. Batini and M. Scannapieca, Data Quality: Concepts, Methodologies and Techniques, Springer Berlin Heidelberg, 2006.
[4] W. C. M. Belzak and J. R. Lockwood, “Estimating Test-Retest Reliability in the Presence of Self-Selection Bias and Learning/Practice Effects,” Applied Psychological Measurement, vol. 48, no. 7-8, pp. 323–340, 2024.
[5] J. Brooke, “SUS: A ’Quick and Dirty’ Usability Scale,” in Usability Evaluation In Industry, CRC Press, 1996.
[6] A. K. Elmagarmid, P. G. Ipeirotis and V. S. Verykios, “Duplicate Record Detection: A Survey,” IEEE Transactions on Knowledge and Data Engineering, vol. 19, no. 1, pp. 1–16, 2007.
[7] W. Elouataoui, I. El Alaoui, S. El Mendili and Y. Gahi, “An Advanced Big Data Quality Framework Based on Weighted Metrics,” Big Data and Cognitive Computing, vol. 6, no. 4, p. 153, 2022.
[8] J. D. Gould, S. J. Boies and C. Lewis, “Making Usable, Useful, Productivity-Enhancing Computer Applications,” Communications of the ACM, vol. 34, no. 1, pp. 74–85, 1991.
[9] J. K. Kies, R. C. Williges and M. B. Rosson, “Coordinating Computer-Supported Cooperative Work: A Review of Research Issues and Strategies,” Journal of the American Society for Information Science, vol. 49, no. 9, pp. 776–791, 1998.
[10] Y. W. Lee, L. L. Pipino, R. Y. Wang and J. D. Funk, Journey to Data Quality, The MIT Press, 2006.
[11] W. Quesenbery, “The Five Dimensions of Usability,” in Content and Complexity, Routledge, 2003.
[12] F. Quin, D. Weyns, M. Galster and C. C. Silva, “A/B Testing: A Systematic Literature Review,” Journal of Systems and Software, vol. 211, p. 112011, 2024.
[13] K. U. Schulz and S. Mihov, “Fast String Correction with Levenshtein Automata,” International Journal on Document Analysis and Recognition, vol. 5, no. 1, pp. 67–85, 2002.
[14] Software Engineering - Software product Quality Requirements and Evaluation (SQuaRE) - Data Quality Model, ISO/IEC 25012, 2008.
[15] R. Vaziri, M. Mohsenzadeh and J. Habibi, “Measuring Data Quality with Weighted Metrics,” Total Quality Management & Business Excellence, vol. 30, no. 5-6, pp. 708–720, 2019.
[16] R. Y. Wang and D. M. Strong, “Beyond Accuracy: What Data Quality Means to Data Consumers,” Journal of Management Information Systems, vol. 12, no. 4, pp. 5–33, 1996.
Disclaimer
This wiki contains only some of the topics discussed in the thesis. It mainly serves as a summary and highlights important points. Please refer to the thesis attached for complete explanations.