Saturday, March 21, 2020

SPSS Modeler


Cross-Industry Standard Process for Data Mining (CRISP-DM): a methodology that guides you through the critical issues of a data mining project and makes sure that the important points are addressed.


Stages in CRISP-DM 

1. Business understanding
2. Data understanding
3. Data preparation
4. Modeling
5. Evaluation
6. Deployment


SPSS Modeler

An SPSS Modeler project is called a "stream", and stream files have the extension .str

Other extensions:
.sav: IBM SPSS Statistics data file format; also used when node data is cached
.sps: IBM SPSS Statistics syntax files
.nod: individual nodes
.slb: SuperNodes

SPSS Modeler requires a rectangular data structure when analyses are performed.
SPSS Modeler has no limit on the number of records, nor on the number of fields.




Notes:  Nodes from the Graphs, Modeling, Output, and Export palettes are terminal nodes and cannot be connected to each other.

Node Right Click Options



- You can drag nodes from the palettes onto the canvas and rename a node using right click --> "Rename and Annotate"
- To add a comment to a node, right click --> "New Comment"
- To group several nodes into one SuperNode: select the nodes --> right click --> "Super Node"
- "Disable Node" ignores the node's effect and passes the data from the previous node straight to the next node


Palettes\ Sources

All data source nodes have three main tabs in common: Data, Filter, and Types. 
- Data tab: used to specify the file name and to set options for the import. 

- Filter tab: used to exclude fields from being imported or to rename fields.

- Types tab: used to instantiate the data and to override the measurement levels assigned by default.
To instantiate the data, click the Read Values button.
A field with measurement level categorical is instantiated to flag if it has two values,
and to nominal if it has more than two values.
If a categorical field has too many unique values (more than 250 by default), its measurement level is set to typeless.
The reason is that a nominal field with more than 250 categories would put too high a load on the modeling techniques.
(The default threshold of 250 can be changed via Tools\Stream Properties\Options, Options tab, General item, in the Maximum members for nominal fields area.)
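
The instantiation rule above can be sketched in plain Python. This is only an illustration of the rule as described in these notes, not Modeler's implementation; the function name and the handling of fields with fewer than two values are assumptions.

```python
from typing import Iterable

# Default threshold from the notes above (configurable in Stream Properties).
MAX_NOMINAL_MEMBERS = 250


def instantiate_categorical(values: Iterable[str],
                            max_members: int = MAX_NOMINAL_MEMBERS) -> str:
    """Return the measurement level a categorical field gets after Read Values."""
    distinct = set(values)
    if len(distinct) > max_members:
        return "typeless"   # too many categories for a nominal field
    if len(distinct) == 2:
        return "flag"       # exactly two values
    # More than two values. Assumption: the notes do not cover fields with
    # fewer than two values, so this sketch also reports them as nominal.
    return "nominal"


print(instantiate_categorical(["M", "F", "M"]))                 # two distinct values
print(instantiate_categorical(["red", "green", "blue"]))        # three distinct values
print(instantiate_categorical([str(i) for i in range(251)]))    # above the threshold
```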
The Var. File, Fixed File and XML dialog boxes have one more tab (the File tab), which is used to specify the file name; in these nodes the Data tab lets you override the fields' storage.

Connects to any database that has an ODBC driver
Imports fixed-format text files




Imports free-format text files




Imports data from Excel files. The Excel file must be closed, otherwise SPSS Modeler raises an error.





Imports the SPSS Statistics data file format (.sav), which is also used when node data is cached.

You can choose between reading variable names or variable labels, and between values or value labels. Reading the labels is advised.


 Cognos TM1 node and the IBM Cognos BI node: These two nodes enable you to integrate with Cognos.



 Generate data based on specific statistical distributions










Palettes\ Record Ops


Groups records using key fields; only one record of each group is retained.
The Distinct node can do more than remove duplicates:
for each group you can select, for example, the Max, First, or Most frequent value, or concatenate the values.
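
A minimal Python sketch of the two Distinct behaviours described above: keeping the first record per key, and keeping the most frequent value per key. The sample records and function names are made up for illustration.

```python
from collections import Counter

# Hypothetical sample data: customer A appears three times.
records = [
    {"customer": "A", "city": "Cairo"},
    {"customer": "A", "city": "Giza"},
    {"customer": "A", "city": "Giza"},
    {"customer": "B", "city": "Luxor"},
]


def distinct_first(rows, key):
    """Keep only the first record seen for each key value."""
    seen = {}
    for row in rows:
        seen.setdefault(row[key], row)
    return list(seen.values())


def distinct_most_frequent(rows, key, field):
    """One record per key, carrying the most frequent value of `field`."""
    groups = {}
    for row in rows:
        groups.setdefault(row[key], []).append(row[field])
    return [{key: k, field: Counter(v).most_common(1)[0][0]}
            for k, v in groups.items()]


print(distinct_first(records, "customer"))
print(distinct_most_frequent(records, "customer", "city"))
```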
Groups records using key fields. Passing data through the Aggregate node changes the record definition: each output record represents one group. Fields can be aggregated using Sum, Mean, Min, Max, SDev, Median, Count, Variance, and 1st and 3rd Quartile.
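
To illustrate how aggregation changes the record definition, here is a small Python sketch that turns one record per sale into one record per region. The data, the function name, and the output field names are illustrative assumptions.

```python
from statistics import mean

# Hypothetical input: one record per sale.
sales = [
    {"region": "North", "amount": 10.0},
    {"region": "North", "amount": 30.0},
    {"region": "South", "amount": 5.0},
]


def aggregate(rows, key, field):
    """Group rows by `key` and compute Sum, Mean and a record count of `field`."""
    groups = {}
    for row in rows:
        groups.setdefault(row[key], []).append(row[field])
    # Output: one record per group instead of one record per sale.
    return [
        {
            key: k,
            f"{field}_Sum": sum(v),
            f"{field}_Mean": mean(v),
            "Record_Count": len(v),
        }
        for k, v in groups.items()
    ]


for row in aggregate(sales, "region", "amount"):
    print(row)
```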
Appending records from multiple datasets = Union
Options: Main dataset, All datasets
When the datasets have different columns, we can include only the columns that match the main dataset, or add all columns from all datasets
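
The two Append options can be sketched in Python as follows. This is an illustration only: the function name and sample data are made up, None stands in for $null$, and the main dataset is assumed to be non-empty.

```python
def append(main, others, all_fields=False):
    """Stack records from several datasets (a union).

    all_fields=False keeps only the fields of the main dataset;
    all_fields=True keeps every field found in any dataset.
    Missing values are filled with None (standing in for $null$).
    """
    fields = list(main[0].keys())  # assumes main has at least one record
    if all_fields:
        for dataset in others:
            for row in dataset:
                for name in row:
                    if name not in fields:
                        fields.append(name)
    combined = []
    for dataset in [main, *others]:
        for row in dataset:
            combined.append({name: row.get(name) for name in fields})
    return combined


main = [{"id": 1, "name": "a"}]
extra = [{"id": 2, "phone": "x"}]
print(append(main, [extra]))                   # fields from the main dataset only
print(append(main, [extra], all_fields=True))  # fields from all datasets
```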

Appending columns from multiple datasets = Join (one to one, or one to many)
Options: Inner join / Full outer join / Partial outer join / Anti-join
When the datasets do not contain the same records, the join type determines which records are retained.
Anti-join = (not in): retains only the records that are unique to a specific dataset.
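
A rough Python sketch of the four join types on a single key field. The function name, the `how` labels, and the sample data are assumptions for illustration; keys in the right dataset are assumed to be unique, and fields without a match are simply left out rather than set to $null$.

```python
def merge(left, right, key, how="inner"):
    """Join two lists of records on `key`.

    how: "inner", "partial" (keep all left records), "full", or "anti"
    (left records that have no match in right).
    """
    if how not in ("inner", "partial", "full", "anti"):
        raise ValueError(f"unknown join type: {how}")
    right_by_key = {row[key]: row for row in right}  # assumes unique keys in right
    left_keys = {row[key] for row in left}
    out = []
    for row in left:
        match = right_by_key.get(row[key])
        if how == "anti":
            if match is None:
                out.append(dict(row))
        elif match is not None:
            out.append({**row, **match})
        elif how in ("partial", "full"):
            out.append(dict(row))
    if how == "full":
        # Add the right-hand records that never matched.
        out.extend(dict(r) for r in right if r[key] not in left_keys)
    return out


customers = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"}]
orders = [{"id": 2, "total": 50}, {"id": 3, "total": 70}]
for how in ("inner", "partial", "full", "anti"):
    print(how, merge(customers, orders, "id", how))
```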

When developing a stream, it is useful to have a Sample node to limit execution time; in production, however, you do not want to sample records.



Palettes\ Field Ops






Used to create a new field using an SPSS formula


Updates or corrects the values of a categorical field; the corrected values can be saved to the same field or to a new field.




Used to define the valid range for continuous fields, the accepted values for categorical fields, and the blank (missing) values. Undefined ($null$) values are invalid unless they are declared as blanks. Blanks can be a range of values or discrete values.
Note: the Aggregate node does not respect the Type node settings and includes all values in its calculations.


Converts the values of a nominal field (rows) into a series of flag fields (columns).
Click "Read Values" to instantiate the data during node setup.
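
The conversion from one nominal field to a series of flag fields can be sketched in Python like this. It is an illustration only: the function name, the T/F flag values, and the `field_value` naming of the new columns are assumptions.

```python
def set_to_flag(rows, field):
    """Add one T/F flag field per distinct value of a nominal field."""
    # Collecting the distinct values plays the role of "Read Values".
    categories = sorted({row[field] for row in rows})
    return [
        {**row,
         **{f"{field}_{c}": "T" if row[field] == c else "F" for c in categories}}
        for row in rows
    ]


# Hypothetical sample data.
rows = [{"id": 1, "color": "red"}, {"id": 2, "color": "blue"}]
for row in set_to_flag(rows, "color"):
    print(row)
```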




Palettes\ Graphs










Palettes\ Modeling





Palettes\ Output



Shows the categories and their frequencies for categorical fields, and a histogram, Max, Min, and Mean for continuous fields









Palettes\ Export



Notes:
- Exporting to a fixed-width flat file is not supported.
- When exporting data to an IBM SPSS Statistics file, field names should not contain spaces.
- To export data to a database, Excel, or SPSS Statistics file, the fields have to be instantiated. This means we need a Type node before the export node.




SPSS default types: Modeler assigns a measurement level based on a field's storage:
- String fields are typed as categorical, without being typed specifically as flag, nominal, or ordinal
- Numeric fields are typed as continuous


Instantiate fields
- A field that has unknown storage is uninstantiated. 
- A field is partially instantiated when its storage is known and its measurement level is initialized based on that storage. Partially instantiated fields have measurement level categorical or continuous.
- When storage, measurement level and values are known, the field is fully instantiated. 

Notes:
- Fields' measurement levels should be correct. If not, the results of any analysis will most likely be worthless.
- For example, if a field success has the values (0, 1), SPSS will type it as continuous, while it should be categorical \ flag.





Control Language for Expression Manipulation (CLEM)

- Native language to build expressions
- Field names are case sensitive in CLEM
- When a field name contains a blank, or is a special field name, it needs to be enclosed in single quotes, for example: 'INCOME 2012', '$RC-churn'.
- String values should be enclosed in single or double quotes
- Fractional numbers must be written with a leading zero ==> 0.5 not .5
- Operators can be:
         • arithmetic:+, -, *, /, ** (raise to the power)
         • relational: >, <, <=, >=, =, /= (unequal)
         • logical: and, or, not (all in lower case)


CLEM expression

(gender= "male") or (job_status="full time" and income > 10000)

datetime_year (connect_date) = 2000 and handset = "ASAD"

Meaning: returns true for a customer who connected in the year 2000 and has the handset ASAD
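
For readers more familiar with a general-purpose language, the two CLEM expressions above correspond roughly to the following Python predicates. The record layout (a dict with these keys) and the function names are assumptions for illustration.

```python
from datetime import date


def matches_first_expression(r):
    # (gender = "male") or (job_status = "full time" and income > 10000)
    return (r["gender"] == "male") or (
        r["job_status"] == "full time" and r["income"] > 10000
    )


def matches_second_expression(r):
    # datetime_year(connect_date) = 2000 and handset = "ASAD"
    return r["connect_date"].year == 2000 and r["handset"] == "ASAD"


# Hypothetical customer record.
customer = {
    "gender": "female",
    "job_status": "full time",
    "income": 12000,
    "connect_date": date(2000, 5, 1),
    "handset": "ASAD",
}
print(matches_first_expression(customer))
print(matches_second_expression(customer))
```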



With the Derive node you can derive a field of one of the following types:
• Formula: An outcome of a formula. For example: a new field TAX derived as:
    TAX = INCOME * 0.20.

• Flag: A T/F field. For example: a new field ADULT,
    T when AGE >= 21, else F

• Nominal: A categorical field. For example: a new field AGECAT,
    1 when AGE <= 35, 2 when AGE > 35 and AGE <= 70, and 3 when AGE >70

• Conditional: An outcome of a formula, but computed conditionally.
   For example: a new field TAX, where
     TAX = 0.1 * INCOME if INCOME <= 100000, and TAX = 10000 + 0.2 * (INCOME - 100000) if INCOME > 100000
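
The four Derive examples above can be written as small Python functions to check the logic. This is only a sketch of the formulas in these notes; the function names are made up.

```python
def tax_formula(income):
    """Formula: TAX = INCOME * 0.20."""
    return income * 0.20


def adult_flag(age):
    """Flag: T when AGE >= 21, else F."""
    return "T" if age >= 21 else "F"


def age_category(age):
    """Nominal: 1 when AGE <= 35, 2 when 35 < AGE <= 70, 3 when AGE > 70."""
    if age <= 35:
        return 1
    if age <= 70:
        return 2
    return 3


def tax_conditional(income):
    """Conditional: 10% up to 100000, then 10000 plus 20% of the excess."""
    if income <= 100000:
        return 0.1 * income
    return 10000 + 0.2 * (income - 100000)


print(tax_formula(50000), adult_flag(18), age_category(40), tax_conditional(150000))
```

Note that the two branches of the conditional formula agree at INCOME = 100000, so the derived TAX has no jump at the boundary.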


Notes:
SPSS Modeler treats user-defined blank values as valid values when a field is derived.
The undefined value ($null$) is referred to as undef. When you specify undef as a value, SPSS Modeler returns $null$.