Data Cleaning With Python Conditionals
Python Programming Environment.
If you havne’t done so, create an account at PythonAnywhere [url removed, login to view]
If you haven’t done so, watch the video on creating and running programs at Python Anywhere.
Use the PythonAnywhere File Manager to create a file called conditionals.py. Write and execute your program in PythonAnywhere using Python 2.7.
When data values are collected, errors can occur. For example consider a case where you collect data on people’s favorite color using a Google Form survey on the Internet that has two questions like this:
Your favorite color:
Where people can type in any text as their answers. For age, you would expect an integer, but you might get answers like:
5 (invalid, too young, maybe they left out a digit)
16 (a valid answer)
21years (a valid integer but they added the unit “years”)
17.6 (a valid answer but they assumed fractions were OK)
-18 (a valid integer, but negative ages are invalid)
917 (a valid integer but they might have accidently typed the 9 before putting their age)
If your program is doing calculations, many of these inputs could cause problems, such as the program not working, or generating wrong answers. For instance, if you wanted to know the average age of the people responding, if your calculations allowed the 917 value, the average age calculation could yield a value that is much too high because that 917 value skewed the result.
Data scientists have several ways to handle bad data values. One is to discard the data value. Another is to make an assumption of what the right data value is and clean it. In the example above, a data scientist could do the following:
5 - discard it under the assumption we expect respondents to be 10 years of age or older.
16 - accept it as valid data.
21years - remove the “years” to yield a data value of 21.
17.6 - truncate the fractional part to yield a value of 17.
-18 - this is not a clear decision. The data scientist could assume the leading negative sign was a mistake and remove the leading sign to yield a data value of 18, or she could just discard the value as invalid.
917 - discard this as being an invalid age.
Write a Python program that accepts input from the user meant to represent an age. The program applies these data value rules:
Age data values must be positive two-digit integers from 10 to 99.
Input that contains a number followed by text must have the text removed. For example “21years” must yield 21.
Input that is a number with a fractional part must be truncated to the integer part. For example 17.6 must yield 17, and 23.9 must yield 23.
Any other input is invalid.
More than one rule may be applied. For instance, the input 17.6 years old should yield a data value of 17 after the “years old” is removed and 17.6 is truncated to 17.
After accepting the input and applying the rules, your program should print “The age data value is” and then either the valid data value, or the word “invalid”.
Here is an outline to the program
# accept the input into a variable called “original”
# set a variable “valid” to be True. If any rules A-D from above
Are violated, set “valid” to False.
# test if “original” is less than 2 characters long. If so,
set “valid” to False
# set a variable “age” to be the first two characters of “original”
# if any characters in “age” aren’t digits (Google Python’s isdigit()
function), then set “valid” to False.
# if “original” is more than 2 characters long and the third character
is a digit, then it is something like 917, so set “valid” to False
# Finally, if “valid” is still True, then print the data value,
otherwise print it as invalid.
Test your program using the sample input listed above and see that it does the right thing. Make up your own test cases as well.