Summary statistics for the red wine data (output of red_df.describe(), first seven columns):

             fixed     volatile       citric     residual    chlorides  free sulfur total sulfur
           acidity      acidity         acid        sugar                   dioxide      dioxide
count  1599.000000  1599.000000  1599.000000  1599.000000  1599.000000  1599.000000  1599.000000
mean      8.319637     0.527821     0.270976     2.538806     0.087467    15.874922    46.467792
std       1.741096     0.179060     0.194801     1.409928     0.047065    10.460157    32.895324
min       4.600000     0.120000     0.000000     0.900000     0.012000     1.000000     6.000000
25%       7.100000     0.390000     0.090000     1.900000     0.070000     7.000000    22.000000
50%       7.900000     0.520000     0.260000     2.200000     0.079000    14.000000    38.000000
75%       9.200000     0.640000     0.420000     2.600000     0.090000    21.000000    62.000000
max      15.900000     1.580000     1.000000    15.500000     0.611000    72.000000   289.000000
Summary statistics for the white wine data (output of white_df.describe(), first seven columns):

             fixed     volatile       citric     residual    chlorides  free sulfur total sulfur
           acidity      acidity         acid        sugar                   dioxide      dioxide
count  4898.000000  4898.000000  4898.000000  4898.000000  4898.000000  4898.000000  4898.000000
mean      6.854788     0.278241     0.334192     6.391415     0.045772    35.308085   138.360657
std       0.843868     0.100795     0.121020     5.072058     0.021848    17.007137    42.498065
min       3.800000     0.080000     0.000000     0.600000     0.009000     2.000000     9.000000
25%       6.300000     0.210000     0.270000     1.700000     0.036000    23.000000   108.000000
50%       6.800000     0.260000     0.320000     5.200000     0.043000    34.000000   134.000000
75%       7.300000     0.320000     0.390000     9.900000     0.050000    46.000000   167.000000
max      14.200000     1.100000     1.660000    65.800000     0.346000   289.000000   440.000000
%time df_csv.to_hdf(target, '/data')
df_hdf = dd.read_hdf(target, '/data')
df_hdf.head()

import pandas as pd
import numpy as np

# red wine quality data, packed in a DataFrame
red_df = pd.read_csv('winequality-red.csv', sep=';', header=0, index_col=False)
# white wine quality data, packed in a DataFrame
white_df = pd.read_csv('winequality-white.csv', sep=';', header=0, index_col=False)
# rose? other fruit wines? plum wine? :(

# for red wines
red_df.describe()
# for white wines
white_df.describe()

Sometimes it is easier to understand the data visually. A histogram of the citric acid samples in the white wine quality data is shown in the figure. You can of course visualize other columns or other datasets: just replace the DataFrame and column name in the plotting code.

[Figure: histogram of the citric acid column of the white wine data]
15.14.3 Detecting Features
Let us try out some elementary machine learning models. These models are not only for prediction. They are also useful for finding which features are most predictive of a variable of interest. Depending on the classifier you use, you may need to transform the data pertaining to that variable.
15.14.3.1 Data Preparation
Let us assume we want to study what features are most correlated with pH. pH of course is real-valued, and continuous. The classifiers we want to use usually need labeled or integer data. Hence, we will transform the pH data, assigning wines with pH higher than average as hi (more basic or alkaline) and wines with pH lower than average as lo (more acidic).
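The labeling rule just described can be sketched in a few lines of standard-library Python. The helper name label_hi_lo and the sample readings are illustrative assumptions, not part of the notebook; values exactly equal to the mean are assigned to hi here, which is only one possible convention.

```python
from statistics import mean

def label_hi_lo(values):
    """Label each value 'hi' if it is at or above the mean, else 'lo'."""
    m = mean(values)
    return ['hi' if v >= m else 'lo' for v in values]

# made-up pH readings, for illustration only
ph_values = [3.51, 3.20, 3.26, 3.16, 3.40]
labels = label_hi_lo(ph_values)
print(labels)
```

Any other threshold (the median, or a domain-specific cut-off) could be substituted for the mean without changing the structure of the helper.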
Now we specify which dataset and variable we want to predict by assigning values to SELECTED_DF and TARGET_VAR, respectively. We like to keep a parameter file where we specify data sources and such. This lets us create generic analytics code that is easy to reuse.
After we have specified what dataset we want to study, we split it into training and test datasets. We then scale (standardize) the data, which helps many classifiers perform better.
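To make the order of operations concrete, here is a minimal standard-library sketch of splitting first and then scaling with statistics computed from the training rows only, so that no information from the test set leaks into the scaler. The helper names (split, fit_scaler, transform) and the data values are assumptions made for illustration; in practice scikit-learn's train_test_split and StandardScaler play these roles.

```python
import random
from statistics import mean, pstdev

def split(rows, test_size=0.2, seed=0):
    """Shuffle a copy of the rows and split it into (train, test) lists."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = max(1, round(len(rows) * test_size))
    return rows[n_test:], rows[:n_test]

def fit_scaler(train):
    """Return per-column (mean, std) computed from the training rows only."""
    columns = list(zip(*train))
    # fall back to 1.0 for constant columns to avoid dividing by zero
    return [(mean(c), pstdev(c) or 1.0) for c in columns]

def transform(rows, params):
    """Standardize every row with the given per-column (mean, std) pairs."""
    return [[(v - m) / s for v, (m, s) in zip(row, params)] for row in rows]

# made-up two-column data, for illustration only
data = [[7.4, 0.70], [7.8, 0.88], [11.2, 0.28], [7.9, 0.60], [6.7, 0.58],
        [7.5, 0.50], [5.6, 0.615], [8.9, 0.62], [8.5, 0.28], [8.1, 0.56]]

train, test = split(data)
params = fit_scaler(train)             # statistics come from the training set only
train_scaled = transform(train, params)
test_scaled = transform(test, params)  # the test set reuses the training statistics
```

After scaling, each training column has mean 0 and standard deviation 1; the test columns generally do not, which is expected.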
Now we pick a classifier. As you can see, there are many to try out, and even more in scikit-learn's documentation and its many examples and tutorials. Random Forests are data science workhorses and the go-to method for many data scientists. Be careful relying on them, though: they tend to overfit. We guard against overfitting by keeping the training and test datasets separate and judging the model on data it has not seen.
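One simple way to see overfitting is to compare accuracy on the training data with accuracy on the held-out test data. The sketch below does this on synthetic data rather than the wine data; the dataset parameters, the random seeds, and the choice of 100 trees are assumptions made only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# synthetic data in which a fraction of the labels is assigned at random,
# so a near-perfect fit on the training set has to include memorized noise
X, y = make_classification(n_samples=1000, n_features=11, n_informative=4,
                           flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

train_acc = clf.score(X_train, y_train)  # accuracy on data the model has seen
test_acc = clf.score(X_test, y_test)     # accuracy on held-out data
print(f"train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}")
```

A large gap between the two numbers suggests the model has memorized the training set; the test accuracy is the more honest estimate of how it will do on new data.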
15.14.4 Random Forest
Now we will test it out with the default parameters.
Note that this code is boilerplate. You can use it interchangeably for most scikit-learn models.

import matplotlib.pyplot as plt

def extract_col(df, col_name):
    return list(df[col_name])

col = extract_col(white_df, 'citric acid')  # can replace with another dataframe or column
plt.hist(col)
plt.xlabel('citric acid')
plt.ylabel('number of wines')
plt.show()
# refresh to make Jupyter happy
red_df = pd.read_csv('winequality-red.csv', sep=';', header=0, index_col=False)
white_df = pd.read_csv('winequality-white.csv', sep=';', header=0, index_col=False)
#TODO: data cleansing functions here, e.g. replacement of NaN

# if the variable you want to predict is continuous, you can map ranges of values
# to integer/binary/string labels
# for example, map the pH data to 'hi' and 'lo' if a pH value is at least or
# less than the mean pH, respectively
M = red_df['pH'].mean()
# create the new classifiable variable
red_df['pH-hi-lo'] = np.where(red_df['pH'] >= M, 'hi', 'lo')
# and remove the predecessor
del red_df['pH']

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn import metrics

# make selections here without digging in code
SELECTED_DF = red_df      # selected dataset
TARGET_VAR = 'pH-hi-lo'   # the predicted variable

# generate nameless data structures
df = SELECTED_DF
target = np.array(df[TARGET_VAR]).ravel()
del df[TARGET_VAR]  # no cheating
#TODO: data cleansing function calls here

# split datasets for training and testing
X_train, X_test, y_train, y_test = train_test_split(df, target, test_size=0.2)

# set up the scaler, fitted on the training data only
scaler = StandardScaler()
scaler.fit(X_train)
# apply the scaler
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# pick a classifier
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor, ExtraTreeClassifier, ExtraTreeRegressor
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
clf = RandomForestClassifier()

# test it out
model = clf.fit(X_train, y_train)
pred = clf.predict(X_test)
Now output the results. For Random Forests, we get a feature ranking. Relative importances usually decay roughly exponentially, so the first few highly ranked features are usually the most important.

Feature ranking:

fixed acidity           0.269778
citric acid             0.171337
density                 0.089660
volatile acidity        0.088965
chlorides               0.082945
alcohol                 0.080437
total sulfur dioxide    0.067832
sulphates               0.047786
free sulfur dioxide     0.042727
residual sugar          0.037459
quality                 0.021075
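To check how quickly the importances fall off, one can accumulate them in rank order and see how many features are needed to reach a given share of the total. The sketch below uses the importances listed in the ranking; the helper name top_features_covering and the 80% threshold are assumptions chosen for illustration.

```python
# importances copied from the feature ranking
ranking = [
    ("fixed acidity", 0.269778),
    ("citric acid", 0.171337),
    ("density", 0.089660),
    ("volatile acidity", 0.088965),
    ("chlorides", 0.082945),
    ("alcohol", 0.080437),
    ("total sulfur dioxide", 0.067832),
    ("sulphates", 0.047786),
    ("free sulfur dioxide", 0.042727),
    ("residual sugar", 0.037459),
    ("quality", 0.021075),
]

def top_features_covering(ranking, share):
    """Return the shortest top-ranked prefix whose importances sum to at least `share`."""
    total = 0.0
    selected = []
    for name, importance in sorted(ranking, key=lambda r: r[1], reverse=True):
        selected.append(name)
        total += importance
        if total >= share:
            break
    return selected, total

names, covered = top_features_covering(ranking, 0.80)
print(len(names), "features cover", round(covered, 3))  # 7 features cover 0.851
```

In this run the top two features alone account for about 44% of the total importance, and seven of the eleven are needed to pass 80%.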
Sometimes it’s easier to visualize. We’ll use a bar chart.
[Figure: bar chart of the feature importances]
15.14.5 Acknowledgement
This notebook was developed by Juliette Zerick and Gregor von Laszewski
15.15 Fingerprint Matching
Python is a flexible and popular language for running data analysis pipelines. In this section we will implement a solution for fingerprint matching.
15.15.1 Overview
Fingerprint recognition refers to the automated method of verifying a match between two fingerprints; it is used to identify individuals and verify their identity. Fingerprints (Figure 1) are the most widely used form of biometric identification.
Figure 1: Fingerprints
Automated fingerprint matching generally requires detecting different fingerprint features (aggregate characteristics of ridges, and minutia points) and then applying a fingerprint matching algorithm, which can perform both one-to-one and one-to-many matching. Based on the number of matches, a proximity score (distance or similarity) can be calculated.
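The idea of turning a number of matches into a proximity score can be illustrated with a deliberately simplified toy matcher. Everything in this sketch is an assumption made for illustration: minutiae are reduced to (x, y) points, two minutiae match if they lie within a fixed pixel tolerance, matching is greedy, and the score is the match count divided by the larger minutia count. Real matchers also account for rotation, translation, and ridge orientation; this is not the algorithm used with the NIST data.

```python
from math import hypot

def count_matches(probe, candidate, tol=5.0):
    """Greedily count probe minutiae that have an unused candidate minutia within `tol`."""
    unused = list(candidate)
    matches = 0
    for px, py in probe:
        for q in unused:
            if hypot(px - q[0], py - q[1]) <= tol:
                unused.remove(q)  # each candidate minutia can be matched only once
                matches += 1
                break
    return matches

def similarity(probe, candidate, tol=5.0):
    """One-to-one score in [0, 1]: matches divided by the larger minutia count."""
    if not probe or not candidate:
        return 0.0
    return count_matches(probe, candidate, tol) / max(len(probe), len(candidate))

def best_match(probe, gallery, tol=5.0):
    """One-to-many: return (name, score) of the best-scoring gallery entry."""
    scored = ((name, similarity(probe, pts, tol)) for name, pts in gallery.items())
    return max(scored, key=lambda item: item[1])

# hypothetical minutia coordinates, for illustration only
probe = [(10, 12), (40, 45), (70, 20), (55, 80)]
gallery = {
    "card_a": [(11, 13), (41, 44), (69, 22), (90, 90)],
    "card_b": [(200, 10), (150, 60), (10, 14)],
}

print(similarity(probe, gallery["card_a"]))  # one-to-one: 0.75
print(best_match(probe, gallery))            # one-to-many: ('card_a', 0.75)
```

A decision threshold on the score then separates "same finger" from "different finger"; where to set it depends on the acceptable false-match and false-non-match rates.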
We use the following NIST dataset for the study:
Special Database 14 - NIST Mated Fingerprint Card Pairs 2 (http://www.nist.gov/itl/iad/ig/special_dbases.cfm)

conf_matrix = metrics.confusion_matrix(y_test, pred)
var_score = clf.score(X_test, y_test)

# the results
importances = clf.feature_importances_
indices = np.argsort(importances)[::-1]

# for the sake of clarity
num_features = X_train.shape[1]
features = [df.columns[i] for i in indices]
feature_importances = [importances[i] for i in indices]

print('Feature ranking:\n')
for i in range(num_features):