Abstract

Big data and machine learning applications significantly impact business performance across companies of all sizes. High-quality data are crucial for automating routine tasks, allowing engineers to carry out their daily responsibilities more efficiently. However, achieving high data quality remains a challenge, because inaccurate, incomplete and inconsistent data can compromise the effectiveness of automation processes. More often than not, these data are managed with software familiar to a wide range of users, such as Microsoft Excel. Such tools are intuitive and thus user friendly, but the extensive freedom they give users can undermine data quality. In this thesis, the data quality and usability of an existing Microsoft Excel tool are compared to those of a prototype for data management. To achieve this, we explore the dimensions of data quality that fit our needs and examine the qualities that define good usability for a system. With this information, we can identify the strengths and weaknesses of the prototype and iterate the user interface development process to improve its design based on valuable feedback. The findings are also applicable in other fields that require high data quality and system usability in our daily lives.


 1. Theoretical Background






1.1. User Interface Development

The development of user interfaces is an iterative process that involves multiple aspects and is essential for improving a system [6]. Such a development life cycle is self-correcting, relying on trial and error as well as feedback from evaluations. Iteration is also important because human behavior, such as patterns of interaction between users and the software interface, is hard to predict, and developers rarely get it right the first time. Moreover, user interface development cannot follow the conventional top-down waterfall model used in software engineering [7]. Developers should therefore use the combination of process designs shown in Figure 1.1. During the prototype design phase, effective formative evaluations provide valuable feedback that helps developers understand which aspects of the interface work well and which do not, allowing them to improve the prototype design. For this, a comprehensive understanding of data quality and usability evaluation is essential.






1.2. Data Quality

1.2.1. Accuracy

Accuracy is generally defined as the degree to which data attributes correctly represent the true value of a concept or event in specific use cases [11]. There are two types of accuracy: syntactic accuracy and semantic accuracy [2]. Syntactic accuracy is the closeness between a value ν and the elements of a definition domain D, which contains all the representations the user defines as correct. Semantic accuracy is the closeness between a value ν and the true value ν' of the real-life phenomenon that ν tries to portray. In Table 1.1, "Müncen" is syntactically inaccurate, whereas "10623" is syntactically accurate but semantically inaccurate. This thesis focuses only on syntactic accuracy.


Formula

Syntactic accuracy

Acc(t)=\Large\frac{\sum_{i=1}^{\left|t\right|}acc(r_i,\ D(r_i))}{\left|t\right|} [2]

For string errors

acc(r_i, D(r_i)) = \begin{cases} 1 & \text{if $r_i \in D(r_i)$} \\ 1 - NED(r_i,\ D(r_i)) & \text{otherwise} \end{cases}

NED is the Normalized Edit Distance [6]. This edit distance is known as the Levenshtein distance [13]. 

For numerical errors

acc(r_i, D(r_i)) = \begin{cases} 1 & \text{if $r_i \in D(r_i)$} \\ 1 - \text{$\large\frac{\left|r_i-D(r_i)\right|}{Max(r_i, \ D(r_i))}$} & \text{otherwise} \end{cases}




Overall syntactic accuracy for the ith column

CA_i = Acc(a) = \Large\frac{\sum_{j=1}^{\left|a\right|}acc(r_j,\ D(r_j))}{\left|a\right|}


Example

Calculating syntactic accuracy for the column "City" in Table 1.1

rj | City                 | D(rj)                | NED(rj, D(rj)) | acc(rj, D(rj))
---|----------------------|----------------------|----------------|---------------
1  | Müncen               | München              | 1/7            | 6/7
2  | Garching bei München | Garching bei München | 0              | 1
3  | Neubiberg            | Neubiberg            | 0              | 1


CA_4 = Acc(City) = \Large\frac{\sum_{j=1}^{3}acc(r_j,\ D(r_j))}{3} = \frac{\frac{6}{7}+1+1}{3} \approx \normalsize0.95
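The calculation above can be sketched in Python. The thesis script uses the `Levenshtein` package listed in Chapter 2; here a small pure-Python edit distance stands in so the sketch is self-contained, normalizing by the longer string length as in the table above.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def cell_accuracy(value: str, domain_value: str) -> float:
    """acc(r_j, D(r_j)): 1 if correct, else 1 minus the normalized edit distance."""
    if value == domain_value:
        return 1.0
    return 1.0 - edit_distance(value, domain_value) / max(len(value), len(domain_value))

def column_accuracy(values, domain_values) -> float:
    """CA_i: mean cell accuracy over one column."""
    return sum(cell_accuracy(v, d) for v, d in zip(values, domain_values)) / len(values)

# The "City" column from Table 1.1:
cities = ["Müncen", "Garching bei München", "Neubiberg"]
domain = ["München", "Garching bei München", "Neubiberg"]
print(round(column_accuracy(cities, domain), 2))  # → 0.95
```

"Müncen" is one insertion away from "München", giving NED = 1/7 and a cell accuracy of 6/7, matching the table.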



1.2.2. Completeness

Completeness can be defined as the degree to which data are of sufficient breadth, depth and scope for the task at hand [16]. There are three types of completeness: schema completeness, column completeness and population completeness. This work considers only column completeness, which can be determined by the presence or absence of null values. A null value can have three different meanings: a real empty value, a missing value, or an ambiguous empty value. In Table 1.2, "Emily" does not have an email, so her null is a real empty value and no incompleteness occurs. "Frank" has an email, but its value is unknown, which is an incompleteness. "Gabrielle" may or may not have an email, so it is ambiguous whether an incompleteness occurs.


Formula

Completeness

\textit{Completeness rating} = 1 - \left(\large\frac{\textit{Number of incomplete items}}{\textit{Total number of items}} \right) [10]


Overall completeness for the ith column

CC_i = 1 - \large\frac{\textit{Number of incomplete items in the column}}{\textit{Total number of items in the column}}


Example

Calculating completeness for the column "Email" in Table 1.2

Person ID | Email             | Cell Completeness
----------|-------------------|------------------
1         | danny90@gmail.com | 1
2         | NULL              | 1
3         | NULL              | 0
4         | NULL              | 1


CC_5 = 1 - \large\frac{\textit{Number of incomplete items in the "Email" column}}{\textit{Total number of items in the "Email" column}} = \normalsize 1 - \large\frac{1}{4} = \normalsize0.75
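A minimal pure-Python sketch of this calculation (the thesis script uses pandas for the real data). The cell-level judgments follow Table 1.2: person 2's NULL is a real empty value and person 4's is ambiguous, so only person 3's NULL counts as incomplete.

```python
def column_completeness(cells) -> float:
    """CC_i = 1 - (number of incomplete items / total number of items)."""
    incomplete = sum(1 for c in cells if c == 0)
    return 1 - incomplete / len(cells)

# "Cell Completeness" column of Table 1.2 (1 = complete, 0 = incomplete):
cell_scores = [1, 1, 0, 1]
print(column_completeness(cell_scores))  # → 0.75
```

Note that naively counting every NULL as incomplete would give 0.25 here; the per-cell judgment of what a NULL means is what makes the metric meaningful.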



 

1.2.3. Consistency

Consistency is the degree to which data has attributes that are non-conflicting and coherent with other data in specific use cases [14]. There are three types of consistency: consistency of integrity constraints, consistency between two related data elements, and consistency of format. Consistency of integrity constraints can be further divided into two categories, namely intrarelation constraints and interrelation constraints [3]. In Table 1.3, employees must be above 18 years old; however, "Jack" is below the age of 18, violating an intrarelation integrity constraint. Moving on to Table 1.4, "Harry" has a promotion year of 2022 but a start year of 2024, which violates an interrelation integrity constraint. Also in Table 1.4, the value "in 5 years" in the column "Promotion Year" violates format consistency, as only numerical values are expected. This thesis applies the first and third types of consistency with slight modifications.


Formula

Consistency

\textit{Consistency rating} = 1 - \left(\large\frac{\textit{Number of inconsistent units}}{\textit{Total number of consistency checks performed}} \right) [10]


Overall consistency for the ith column

CCS_i = 1 - \large\frac{\textit{Number of inconsistent units in the column}}{\textit{Total number of consistency checks performed in the column}}


Example

Calculating consistency for the column "Promotion Year" in Table 1.4

Promotion ID | Promotion Year | Cell Consistency
-------------|----------------|-----------------
321          | 2022           | 0
322          | 2026           | 1
323          | in 5 years     | 0


CCS_3 = 1 - \large\frac{\textit{Number of inconsistent units in the "Promotion Year" column}}{\textit{Total number of consistency checks performed in the "Promotion Year" column}} = \normalsize 1 - \large\frac{2}{3} = \frac{1}{3} \approx \normalsize0.33
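The two checks applied to the "Promotion Year" column can be sketched as below: a format check (numeric values only) and an interrelation integrity constraint (promotion year not before start year). The start years paired with promotions 321 and 323 are illustrative assumptions; only Harry's start year of 2024 is given in the text.

```python
def cell_consistency(promotion_year: str, start_year: int) -> int:
    """1 if the cell passes both consistency checks, 0 otherwise."""
    if not promotion_year.isdigit():       # format consistency violated
        return 0
    if int(promotion_year) < start_year:   # interrelation constraint violated
        return 0
    return 1

# (Promotion Year, start year) pairs mirroring Table 1.4:
rows = [("2022", 2024), ("2026", 2020), ("in 5 years", 2020)]
scores = [cell_consistency(py, sy) for py, sy in rows]
print(round(sum(scores) / len(scores), 2))  # → 0.33
```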



 

1.2.4. Weighted Metric

Simple ratios are easy to implement for measuring data quality, but they do not account for the varying importance of different data [15]. In a company, some data may matter more than others for achieving its business goals, an observation that Elouataoui et al. [7] state holds in most organizations. Such data should therefore carry more weight in the quality metrics. Likewise, within a tuple of structured data, some attributes may be more important than others. In Table 1.5, averaging the five completeness values gives an overall completeness of 78%. However, assigning each column a specific weight yields a more practical completeness value for this relation.


Formula

Weighted column completeness

\sum_{i=1}^{n}(CW_i \times CC_i)  [15]


Weighted column accuracy

\sum_{i=1}^{n}(CW_i \times CA_i)


Weighted column consistency

\sum_{i=1}^{n}(CW_i \times CCS_i)


Example

Calculating weighted column completeness for the relation "Student" in Table 1.5

Column             | Student ID | Name | Telephone | Email | Course
-------------------|------------|------|-----------|-------|-------
Completeness Score | 0.90       | 0.80 | 0.70      | 0.60  | 0.90
Weight             | 0.30       | 0.25 | 0.10      | 0.20  | 0.15


\sum_{i=1}^{5}(CW_i \times CC_i) = (0.90 \times 0.30) + (0.80 \times 0.25) + (0.70 \times 0.10) + (0.60 \times 0.20) + (0.90 \times 0.15) = \normalsize0.795 = 79.5\%
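The weighted metric is simply a dot product of weights and per-column scores; a minimal sketch with the values from Table 1.5:

```python
# Columns: Student ID, Name, Telephone, Email, Course (Table 1.5)
scores  = [0.90, 0.80, 0.70, 0.60, 0.90]   # CC_i per column
weights = [0.30, 0.25, 0.10, 0.20, 0.15]   # CW_i per column, summing to 1

weighted = sum(w * s for w, s in zip(weights, scores))
print(round(weighted, 3))  # → 0.795
```

The same one-liner yields weighted accuracy or consistency by swapping in CA_i or CCS_i for the scores.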





1.3. Usability Evaluation

1.3.1. Evaluation Setup

Quesenbery proposed the 5Es, seen in Figure 1.2, to describe the dimensions of usability: effective, efficient, engaging, error tolerant and easy to learn [11]. The 5Es are easy to remember and have straightforward definitions. Effective refers to how accurately and completely users achieve their goals. Efficient refers to how quickly and accurately a particular task can be done. Engaging represents user satisfaction with the system. Error tolerant describes how well the system prevents errors and helps users recover from them. Easy to learn refers to the system's ability to support and guide users during initial orientation and subsequent usage. Depending on the specific use case, one usability dimension can be prioritized over another.


1.3.2. Data Collection

The type of data being collected is highly relevant when evaluating a system's usability. There are two types of data: quantitative and qualitative. Quantitative data are numeric data obtained from measurements, whereas qualitative data include any information that is not numeric. Table 1.7 shows examples of quantitative and qualitative data for each of the usability dimensions, or 5Es. This thesis primarily collects quantitative data; any qualitative data obtained through questionnaires are converted into quantitative data.

To validate time-related usability metrics quantitatively, it is necessary to measure the time taken to complete a task; a digital or analog stopwatch is more accurate and suitable than a clock for this purpose. Recording or logging the actions participants carry out when completing an assigned task can also be useful for evaluation. Specialized software can keep track of the number of mouse clicks or keystrokes, which is more reliable than recording them manually on paper.

Using questionnaires to collect data for an evaluation has its pros and cons. Questionnaires present the same format to all participants, which facilitates consistent data collection and makes results comparable between participants. However, it can be challenging to design a suitable and useful questionnaire, as different evaluations require different metrics. This thesis uses the structure of the System Usability Scale (SUS) [5] as the basis for its questionnaire and modifies it as needed to meet our usability requirements.
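For reference, standard SUS scoring [5] maps ten 5-point Likert responses to a 0-100 score: positively worded odd-numbered items contribute (response − 1), negatively worded even-numbered items contribute (5 − response), and the sum is multiplied by 2.5. The sketch below shows this unmodified scheme; the thesis's adapted questionnaire may score differently.

```python
def sus_score(responses) -> float:
    """Standard SUS score from ten 1-5 Likert responses (item 1 first)."""
    assert len(responses) == 10 and all(1 <= r <= 5 for r in responses)
    total = sum((r - 1) if i % 2 == 0 else (5 - r)   # odd items: r-1, even: 5-r
                for i, r in enumerate(responses))
    return total * 2.5

# A maximally positive respondent (agrees with odd items, disagrees with even):
print(sus_score([5, 1, 5, 1, 5, 1, 5, 1, 5, 1]))  # → 100.0
```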


2. Implementation

This in-house software application is called NEXTREL, an acronym for Next Reliability. The front end is developed with a low-code platform called OutSystems, which allows rapid prototyping and iteration within its environment, making it well-suited to the agile development process employed by the team. This results in a faster and more efficient approach to implementing necessary changes than traditional software development [1], which is valuable for prototype development that requires feedback from users or the project owner at every iteration. The back end consists of an Oracle database system that provides a robust data model for handling complex relationships between different data points. This ensures data integrity and allows complex queries to be executed and their results displayed at the front end.

Additionally, we develop a Python script with various libraries to objectively measure and assess the data quality dimensions using the equations above. The script is also responsible for generating plots to visualize the results. The table below lists all the libraries used and their respective versions.

No. | Library | Usage in This Thesis | Version | Link
----|---------|----------------------|---------|-----
1 | pandas | Handling structured data and performing operations on DataFrames for data manipulation and analysis | 2.2.2 | https://pypi.org/project/pandas/2.2.2/ [Accessed: Jul 10, 2024]
2 | Levenshtein | Computing the Levenshtein (edit) distance for string similarity calculations | 0.26.0 | https://pypi.org/project/Levenshtein/0.26.0/ [Accessed: Jul 10, 2024]
3 | seaborn | Plotting customized box, pie and violin plots | 0.13.2 | https://pypi.org/project/seaborn/0.13.2/ [Accessed: Jul 10, 2024]
4 | matplotlib | Using matplotlib.pyplot, matplotlib.patches and matplotlib.lines to plot and visualize data, create annotations and draw line elements | 3.9.2 | https://pypi.org/project/matplotlib/3.9.2/ [Accessed: Jul 10, 2024]
5 | numpy | Handling arrays and generating null values | 1.26.4 | https://pypi.org/project/numpy/1.26.4/ [Accessed: Jul 10, 2024]
6 | pathlib | Handling file system paths in an object-oriented manner | 1.0.1 | https://pypi.org/project/pathlib/1.0.1/ [Accessed: Jul 10, 2024]
7 | openpyxl | Reading and writing Microsoft Excel files | 3.1.5 | https://pypi.org/project/openpyxl/3.1.5/ [Accessed: Jul 10, 2024]
8 | xlsxwriter | Writing Microsoft Excel files, particularly writing to multiple sheets within one file | 3.1.1 | https://pypi.org/project/XlsxWriter/3.1.1/ [Accessed: Jul 10, 2024]
9 | textwrap (Python Standard Library) | Wrapping and formatting text for plotting | 3.11.10 | https://docs.python.org/3.11/library/textwrap.html [Accessed: Jul 10, 2024]
10 | re (Python Standard Library) | Using re.sub to replace substrings matching a pattern and re.split to split strings based on a pattern | 3.11.10 | https://docs.python.org/3.11/library/re.html [Accessed: Jul 10, 2024]
11 | os (Python Standard Library) | Interacting with the operating system to access files and conduct directory operations | 3.11.10 | https://docs.python.org/3.11/library/os.html [Accessed: Jul 10, 2024]
12 | random (Python Standard Library) | Generating random numbers and performing random selections | 3.11.10 | https://docs.python.org/3.11/library/random.html [Accessed: Jul 10, 2024]
13 | string (Python Standard Library) | Using string.digits to check for digits in a string and string.ascii_uppercase to obtain uppercase letters for certain string replacements | 3.11.10 | https://docs.python.org/3.11/library/string.html [Accessed: Jul 10, 2024]


3. Evaluation



3.1. Study Setting

Before conducting the study, a clear task description is required to evaluate data quality and system usability. Participants first receive a short introduction to the topic and related information for context. Their background and experience with the "Test Sample" workflow are assessed before they are assigned a task related to it: creating a new sample with the same data in both Tool A and Tool B. Some participants start with Tool A and then use Tool B, while others begin with Tool B before moving on to Tool A. This randomized order, inspired by A/B testing, helps reduce the impact of learning effects, which occur when tests are retaken [4, 12]. After using either tool, participants are shown data generated by that tool in real-life scenarios before completing a questionnaire of ten questions. The process is then repeated with the other tool.


Figure 3.2


We use the modified questionnaire illustrated in Figure 4.4, which follows the structure of SUS as seen in Figure 4.3 and the usability dimensions described above. Questions 1 to 7 focus on evaluating usability, specifically the "Error Tolerant" and "Easy to Learn" dimensions, as shown in Figure 4.2; each of these two dimensions is represented by two questions, about 28.6% of the usability questions. Questions 8 to 10, in contrast, address the subjective data quality perceived by users.


Figure 3.3


Figure 3.4



3.2. Results

To ease the comparison between both tools, namely Excel and NEXTREL, we can summarize the data quality results across all columns for each dimension that was previously calculated. Figure 3.5 shows the result for weighted accuracy, weighted completeness and weighted consistency. Additionally, we can use our modified questionnaire to evaluate the subjective quality of data generated by both tools, as shown in Figure 3.6. With the same questionnaire, we can also gather user feedback to conduct usability evaluation of both tools, as seen in Figure 3.7.

We use violin plots to show the data distribution in the results, since the distributions can be positively or negatively skewed and are then no longer normal, which makes box plots unsuitable here. For easier interpretation, we modify the violin plots to show the mean instead of the median and the interquartile range.
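A sketch of such a modified violin plot. The thesis script uses seaborn; here matplotlib's own violinplot stands in, since its `showmeans`/`showmedians` flags directly express the modification (mean shown, median and inner box suppressed). The data and labels are illustrative, not the study's results.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt
import numpy as np

# Two illustrative score distributions, one per tool:
rng = np.random.default_rng(0)
scores = [rng.normal(0.70, 0.10, 30),   # e.g. Tool A
          rng.normal(0.85, 0.05, 30)]   # e.g. Tool B

fig, ax = plt.subplots()
# Violin bodies with only the mean marked:
ax.violinplot(scores, showmeans=True, showmedians=False, showextrema=False)
ax.set_xticks([1, 2], ["Tool A", "Tool B"])
ax.set_ylabel("Score")
fig.savefig("violin.png")
```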

Figure 3.5


Figure 3.6


Figure 3.7

4. Attachments

Bachelor's Thesis


End Presentation



5. Bibliography

[1] S. Alsaqqa, S. Sawalha and H. Abdel-Nabi, “Agile Software Development: Methodologies and Trends,” International Journal of Interactive Mobile Technologies, vol. 14, no. 11, pp. 246–270, 2020.
[2] C. Batini, D. Barone, F. Cabitza and S. Grega, “A Data Quality Methodology for Heterogeneous Data,” International Journal of Database Management Systems, vol. 3, no. 1, pp. 60–79, 2011.
[3] C. Batini and M. Scannapieca, Data Quality: Concepts, Methodologies and Techniques, Springer Berlin Heidelberg, 2006.
[4] W. C. M. Belzak and J. R. Lockwood, “Estimating Test-Retest Reliability in the Presence of Self-Selection Bias and Learning/Practice Effects,” Applied Psychological Measurement, vol. 48, no. 7-8, pp. 323–340, 2024.
[5] J. Brooke, “SUS: A ’Quick and Dirty’ Usability Scale,” in Usability Evaluation In Industry, CRC Press, 1996.
[6] A. K. Elmagarmid, P. G. Ipeirotis and V. S. Verykios, “Duplicate Record Detection: A Survey,” IEEE Transactions on Knowledge and Data Engineering, vol. 19, no. 1, pp. 1–16, 2007.
[7] W. Elouataoui, I. El Alaoui, S. El Mendili and Y. Gahi, “An Advanced Big Data Quality Framework Based on Weighted Metrics,” Big Data and Cognitive Computing, vol. 6, no. 4, p. 153, 2022.
[8] J. D. Gould, S. J. Boies and C. Lewis, “Making Usable, Useful, Productivity-Enhancing Computer Applications,” Communications of the ACM, vol. 34, no. 1, pp. 74–85, 1991.
[9] J. K. Kies, R. C. Williges and M. B. Rosson, “Coordinating Computer-Supported Cooperative Work: A Review of Research Issues and Strategies,” Journal of the American Society for Information Science, vol. 49, no. 9, pp. 776–791, 1998.
[10] Y.W. Lee, L. L. Pipino, R.Y. Wang and J. D. Funk, Journey to Data Quality, The MIT Press, 2006.
[11] W. Quesenbery, “The Five Dimensions of Usability,” in Content and Complexity, Routledge, 2003.
[12] F. Quin, D. Weyns, M. Galster and C. C. Silva, “A/B Testing: A Systematic Literature Review,” Journal of Systems and Software, vol. 211, p. 112011, 2024.
[13] K.U. Schulz and S. Mihov, “Fast String Correction with Levenshtein Automata,” International Journal on Document Analysis and Recognition, vol. 5, no. 1, pp. 67–85, 2002.
[14] Software Engineering - Software product Quality Requirements and Evaluation (SQuaRE) - Data Quality Model, ISO/IEC 25012, 2008.
[15] R. Vaziri, M. Mohsenzadeh and J. Habibi, “Measuring Data Quality with Weighted Metrics,” Total Quality Management & Business Excellence, vol. 30, no. 5-6, pp. 708–720, 2019.
[16] R.Y. Wang and D. M. Strong, “Beyond Accuracy: What Data Quality Means to Data Consumers,” Journal of Management Information Systems, vol. 12, no. 4, pp. 5–33, 1996.


Disclaimer

This wiki contains only some of the topics discussed in the thesis. It mainly serves as a summary and highlights important points. Please refer to the thesis attached for complete explanations. 
