A data set is a collection of data. It can come from an array of numbers or from a source file.
Example: A file named ‘data’ contains an array of numbers.
[10, 12, 18, 22, 35]
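As a minimal sketch of the "source file" case, the snippet below writes the array to a file named data and reads it back. The JSON format and the temporary directory are assumptions made for illustration; the tutorial does not say how the file is stored.

```python
import json
import os
import tempfile

numbers = [10, 12, 18, 22, 35]

# Assumption: the file 'data' stores the array as JSON text.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "data")

    # Save the array to the file.
    with open(path, "w") as f:
        json.dump(numbers, f)

    # Load the array back from the file as a Python list.
    with open(path) as f:
        loaded = json.load(f)

print(loaded)
```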
Because the upcoming steps all work with data, it is important to know what kind of data we have.
Data types fall into three categories: numerical, categorical, and ordinal.
Numerical: Numerical data is further divided into two parts:
- Discrete Data – Discrete data takes only specific, countable values, usually whole numbers. E.g., counting the number of students who took a test.
- Continuous Data – Continuous data is not limited to specific values and can take any numerical value within a range. E.g., recording a student’s age, height, and weight.
Categorical: Categorical data consists of labels that sort records into groups or categories; the labels cannot be measured against each other. E.g., a student’s sex or class section.
Ordinal: Ordinal data is like categorical data, but its values have a meaningful order and can be compared with one another. E.g., exam grades, where A ranks above B, or income brackets such as low, medium, and high.
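One way to picture these data types in Python is sketched below. Every name and value is made up for illustration, and the grade scale is an assumed example of ordered categories.

```python
# Hypothetical student record; all values are invented for illustration.
students_who_took_test = 32   # discrete: a count, whole numbers only
height_cm = 162.5             # continuous: any value within a range
sex = "F"                     # categorical: a label with no order

# Ordinal: labels with a meaningful order (assumed scale, lowest first).
GRADE_ORDER = ["D", "C", "B", "A"]


def grade_rank(grade):
    """Return the position of a grade on the assumed scale."""
    return GRADE_ORDER.index(grade)


# Ordinal values can be compared through their rank.
print(grade_rank("A") > grade_rank("B"))
```

The labels themselves are plain strings, so the comparison goes through the rank rather than through the strings directly.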
Knowing the data type helps us decide how to analyse the data when working with it in Machine Learning.
Introduction – Mean, Median, and Mode (MMM)
In Machine Learning and in Mathematics, these three measures are defined as:
Mean – the average value
Median – the mid value
Mode – the most common value
Example: A record that contains the exam marks of 10 students.
Marks = [45, 35, 78, 19, 59, 61, 78, 98, 78, 45]
The mean is the average value.
To determine the mean, add up all the values and then divide the total by the number of values.
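The calculation can also be done by hand with plain Python. This is a small sketch using the same marks as the NumPy examples in this tutorial; the variable names are illustrative.

```python
# Marks list taken from the NumPy examples in this tutorial.
marks = [45, 35, 78, 19, 59, 61, 78, 98, 78, 45]

total = sum(marks)          # add up every value
count = len(marks)          # number of values
mean_value = total / count  # total divided by the number of values

print(total, count, mean_value)
```

Here the total is 596 across 10 values, which gives a mean of 59.6.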
Note: The NumPy module offers a way to accomplish this. Find out more details about the NumPy module through the NumPy Tutorial.
Example: Use the NumPy mean() method to find the average marks.
import numpy as np
Marks = [45, 35, 78, 19, 59, 61, 78, 98, 78, 45]
x = np.mean(Marks)
print(x)
Output - 59.6
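As shown above, it returned the mean of the given data.
The median is the value in the middle after all values are sorted in ascending order.
Note: The numbers must be sorted before the middle value can be identified. The NumPy median() method handles the sorting internally, so the list can be passed unsorted.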
Example: Use the NumPy median() method to find the mid value.
import numpy as np
Marks = [45, 35, 78, 19, 59, 61, 78, 98, 78, 45]
x = np.median(Marks)
print(x)
Output - 60.0
As shown above, it returned the median of the given data.
Note: If there are two numbers in the middle position, add both numbers and divide the sum by 2.
E.g., after sorting: [19, 35, 45, 45, 59, 61, 78, 78, 78, 98]
(59+61)/2 = 120/2 = 60
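The sorting step and the two-middle-values rule can be sketched in plain Python as follows. The helper name median_by_hand is hypothetical and is not part of NumPy.

```python
def median_by_hand(values):
    """Return the middle value of the sorted data (hypothetical helper)."""
    ordered = sorted(values)   # sort in ascending order first
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: exactly one middle value.
        return ordered[mid]
    # Even count: average the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2


marks = [45, 35, 78, 19, 59, 61, 78, 98, 78, 45]
print(sorted(marks))
print(median_by_hand(marks))
```

With 10 marks the two middle values are 59 and 61, so the result is 60.0, matching np.median().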
The mode is the value that appears most often in the data.
Note: The SciPy module offers a way to accomplish this. Learn more about the SciPy module in the SciPy tutorial.
Example: Use the SciPy mode() method to find the most common value.
from scipy import stats
Marks = [45, 35, 78, 19, 59, 61, 78, 98, 78, 45]
x = stats.mode(Marks)
print(x)
As a result, it returned the mode, i.e., 78, along with how many times that number appeared in the data (3 times).
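If SciPy is not available, the same idea can be sketched with the standard library. This uses collections.Counter as an alternative to stats.mode(), not a reproduction of it, and the variable names are illustrative.

```python
from collections import Counter

marks = [45, 35, 78, 19, 59, 61, 78, 98, 78, 45]

# Count how many times each mark appears.
counts = Counter(marks)

# most_common(1) returns a list holding the (value, count) pair
# with the highest count.
mode_value, mode_count = counts.most_common(1)[0]

print(mode_value, mode_count)
```

For these marks the most common value is 78, which appears 3 times.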
If you find anything incorrect in the topic discussed above, or have any further questions, please comment below.