Monday, June 1, 2015

Introduction to Data Classification

Data Classification

Part of Mike's Big Data, Data Mining, and Analytics Tutorial  

Appropriate data classification helps to answer the questions:
  • What type of data is included in our data set?
  • What operations (equality, order, divisibility) can be applied to the data?
  • What statistical procedures are appropriate for our data set?

What did we measure?

The first step in classifying data is determining what was measured. We start by looking at the actual characteristic being measured and determine whether it can take on only a countable set of values (gender, political preference, number of hairs per 1x1 cm area of skin) or whether it can take on uncountably many possible values (lengths, masses, times, etc.).
In the former case, where the set of possible values is easy to define, there is a good chance that the underlying characteristic is discrete. We can generally determine with certainty whether a specimen is male or female, record a self-reported political preference, and count the number of hairs per 1x1 cm area of skin.
We can’t do the same with continuous characteristics (length: Is it 1 cm, or 1.1, or 1.11, or 1.11111…?). Characteristics like these, which we can only measure to some level of precision, are good candidates for being considered “continuous.”
We then need to determine at what level we have actually measured the characteristic. With gender, we can generally measure at the same level as the underlying characteristic (male/female). With ordinal measurements, we can determine that one level is higher than another, but we lack information to determine “by how much.” With continuous measurements, we can determine both order and magnitude. It is possible to measure ordinal/continuous characteristics at a lower level (ex. lengths are determined to be “ok” or “not ok” -> nominal measurement of a continuous variable).
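As a minimal sketch of the last point, the Python snippet below reduces a continuous length to a nominal “ok”/“not ok” label. The function name classify_length and the specification limits (9.5 cm to 10.5 cm) are hypothetical, chosen only for illustration, and the measured values are made up.

```python
def classify_length(length_cm, lower_cm=9.5, upper_cm=10.5):
    """Reduce a continuous length to a nominal "ok"/"not ok" label.

    The specification limits are hypothetical values for illustration.
    """
    return "ok" if lower_cm <= length_cm <= upper_cm else "not ok"


# Made-up measurements, recorded to one decimal place of precision.
measured_lengths_cm = [9.8, 10.2, 10.7, 9.4]

# After this step only equality comparisons are meaningful:
# the order and magnitude of the original lengths are discarded.
labels = [classify_length(x) for x in measured_lengths_cm]
print(labels)  # ['ok', 'ok', 'not ok', 'not ok']
```

Note that the conversion only goes one way: the original lengths cannot be recovered from the labels.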

Discrete Characteristics and Measurements

There are two possible categories of discrete characteristics. The first is called “nominal” and the second is called “ordinal.” The key differentiator is whether the values can be sorted/ordered in a meaningful way.
As an example, we might determine gender for a sample of people. We can’t really say that Male > Female or Female > Male; we simply have different genders. Another example could be religion. Someone could self-identify as “Muslim” and someone else could identify as “Christian.” Again, we can determine that the two people are different, but we are unlikely to be able to determine any sort of order based on the characteristic of self-reported religion. Measurements/characteristics that can take on only 2 values (ex. gender) are called dichotomous. Measurements/characteristics that can take on 3 or more possible values (ex. religion, political preference, manufactured quality to specification) are called polychotomous or polytomous.
The next category of discrete characteristics covers those where it is possible to define order in addition to equality. This typically applies to “scale,” “count,” and “low-resolution measurement” data. Examples of scales include certain survey scales, income scales (ex. 10K-20K, 20K-30K, 30K-40K), and reflectivity (measured as high reflectivity -> no reflectivity). Count data typically ranges from 0 (absence of an element) to a large, but not readily bounded number. Examples of count data include paint scratches per square foot, number of dents per square meter, and number of bee hives per square mile. Data measured at the ordinal level can be determined to be equal or unequal and can be ordered (ex. higher counts are higher than lower counts, income range of 20K-30K is higher than the 10K-20K range), but the increments between levels might be unequal.
Count data is special in that both ordinal statistical procedures and procedures typically applied to continuous data may be used. Sometimes you might encounter count data referred to as “absolute” data.
Data measured at the nominal and ordinal levels is often referred to as “qualitative” data.
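The difference between the two discrete levels can be sketched in Python. The sample responses below are made up for illustration, and the names RANK and is_higher are hypothetical helpers. The choice of summaries (mode for nominal data, median for ordinal data) is a common convention, not something prescribed in this post.

```python
from statistics import median_low, mode

# Nominal: only equality is meaningful, so the mode is a safe summary.
religions = ["Christian", "Muslim", "Christian"]  # made-up responses
most_common_religion = mode(religions)

# Ordinal: the brackets have a meaningful order, listed lowest to highest.
INCOME_BRACKETS = ["10K-20K", "20K-30K", "30K-40K"]
RANK = {bracket: i for i, bracket in enumerate(INCOME_BRACKETS)}


def is_higher(a, b):
    """Return True if bracket a ranks above bracket b."""
    return RANK[a] > RANK[b]


incomes = ["20K-30K", "10K-20K", "30K-40K", "20K-30K", "10K-20K"]  # made up

# The median needs only order, so it is valid for ordinal data.
# median_low always returns an actual data point, never an average.
median_bracket = INCOME_BRACKETS[median_low(RANK[r] for r in incomes)]

print(most_common_religion)             # Christian
print(is_higher("20K-30K", "10K-20K"))  # True
print(median_bracket)                   # 20K-30K
```

The ranks 0, 1, 2 encode order only. Subtracting or averaging them would assume equal increments between brackets, which ordinal data does not guarantee.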

Continuous Characteristics and Measurements

Continuous data allows us to determine equality, order, and magnitude. Magnitude allows us to concretely say how much bigger one value is than another, and we have equal intervals between possible values (and often, infinitely many possible values). As an example, if we measure one object to be 3 meters long and another object to be 1 meter long, it is meaningful to say that the first object is three times as long as the second.
Data measured at the continuous level is often referred to as “quantitative” data.
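A short Python sketch using the 3 meter and 1 meter example shows the extra operations that magnitude makes meaningful. The variable names are hypothetical.

```python
from statistics import mean

# Lengths from the example above, in meters.
first_m, second_m = 3.0, 1.0

# Equality and order work as they do for ordinal data.
is_longer = first_m > second_m

# Magnitude adds meaningful differences, ratios, and averages.
difference_m = first_m - second_m    # how much longer, in meters
ratio = first_m / second_m           # how many times as long
average_m = mean([first_m, second_m])

print(is_longer, difference_m, ratio, average_m)  # True 2.0 3.0 2.0
```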

Takeaways

Appropriate data classification is critically important because without it none (or very little) of the analysis will be correct. It is important to note that it is generally not possible to measure an attribute at a higher scale than the scale represented by the underlying attribute (ex. it would be meaningless to measure gender on a 0-100 point scale if it is indeed two-valued).

Virtually all of the other posts that I write involving statistical procedures will include at least a brief discussion of data classification, primarily due to its importance in performing the correct analysis and obtaining the correct result.

