Title: | Data Sets for Predictive Analytics Exam |
---|---|
Description: | Contains all data sets for Exam PA: Predictive Analytics at <https://exampa.net/>. |
Authors: | Guanglai Li [aut, cre], Sam Castillo [aut] |
Maintainer: | Guanglai Li <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.5.0 |
Built: | 2024-11-10 05:14:21 UTC |
Source: | https://github.com/sdcastillo/exampadata |
The data actuary_salaries contains the salaries of actuaries collected from the DW Simpson survey.
actuary_salaries
actuary_salaries
data.frame, 138 observations of 6 variables:
The industry of the actuary, having values of Casualty, Health, Pension, and Life.
The number of exams passed. Values are ASA, FSA, 5, 4, 3, 2, and 1.
Years of experience, in the range 1 - 20
Annual salary range, in $1,000
Lower end of the annual salary range
Higher end of the annual salary range
Apartment applications as used in ExamPA.net’s Practice Exam
apartment_apps
apartment_apps
data.frame, 1430 observations of 41 variables:
The total number of people who apply for a lease at that apartment building, including all apartment units.
The sale price of each apartment unit.
The number of units in the apartment building.
Year that apartment building was sold or remodeled.
Month that apartment building was sold or remodeled.
Rates the overall material and finish of the building on a scale from 1 to 10 with 10 being the best and 1 being the worst.
Total square feet.
Above ground living area in square feet.
The number of bathrooms in each unit.
Lot size in square feet.
Rates the external quality of the building on a scale from 1 to 10 with 10 being the best and 1 being the worst.
The number of full-size bathrooms in each unit.
Whether or not each unit has a central air conditioning system (1 = yes, 0 = no).
1 = Attached garage.
1 = Basement garage.
1 = Built-in garage.
1 = Detached garage.
1 = No garage.
1 = Dale.
1 = Brookside.
1 = Clear Circle.
1 = College Circle.
1 = Crawford.
1 = Edwards.
1 = Gilbert.
1 = DOTRR.
1 = Meadow.
1 = Mitchel.
1 = North Ames.
1 = North Ridge.
1 = North Ridge Heights.
1 = North West Ames.
1 = Old Town.
1 = Sawyer.
1 = Sawyer West.
1 = Somer St.
1 = Stone Bridge.
1 = SWISU.
1 = Timber.
1 = Veenker.
The mean sale price for all units in that neighborhood.
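The neighborhood columns above are 0/1 indicators, and the last variable is a group mean. A minimal sketch of how such columns could be derived, using toy data and hypothetical column names (`neighborhood`, `sale_price`) that are not taken from `apartment_apps`:

```r
# Toy data; the column names are hypothetical, not taken from apartment_apps.
units <- data.frame(
  neighborhood = c("Gilbert", "Gilbert", "Sawyer", "Sawyer", "Sawyer"),
  sale_price   = c(200000, 220000, 150000, 160000, 170000)
)

# ave() computes the mean of sale_price within each neighborhood and
# returns it aligned with the original rows.
units$neighborhood_mean_price <- ave(units$sale_price, units$neighborhood)

# 0/1 indicator for a single neighborhood, in the "1 = Gilbert" style.
units$is_gilbert <- as.integer(units$neighborhood == "Gilbert")

print(units)
```

Whether the package authors built the columns this way is not stated; the sketch only illustrates the encoding.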
Automotive claims
auto_claim
auto_claim
data.frame, 10296 observations of 29 variables:
Policy number.
Date that policy was signed.
Number of claims.
Aggregate claim loss of policy (in thousands).
Number of child passengers.
Time to commute.
(1) Private or (2) commercial use.
(log) car value.
Whether the policy was retained or not.
Number of policies.
(0-1 dummy variables) Type of car: (base) Panel Truck, (2) Pickup, (3) Sedan, (4) Sports Car, (5) SUV, (6) Van.
Whether the color of the car is (2) red or (1) not.
Whether the policyholder's license was (2) revoked in the past or (1) not.
Number of motor vehicle record points.
Whether there was a claim or not.
Age.
Number of children at home.
Years on the job.
Annual income.
Gender of policyholder : (1) female or (2) male.
Whether the policyholder is (2) married or (1) not.
Whether (2) the policyholder grew up in a single-parent family or (1) not.
(0-1 dummy variables) Job class of policyholder: (base) Unknown, (2) Blue Collar, (3) Clerical, (4) Doctor, (5) Home Maker, (6) Lawyer, (7) Manager, (8) Professional, (9) Student
(0-1 dummy variables) Maximal level of education of policyholder: (base) less than High School, (2) Bachelors, (3) High School, (4) Masters, (5) PhD.
Value of home.
Whether they grew up in the same home as their current home.
(1) Rural or (2) urban area.
Year.
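Several variables above are described as 0-1 dummy variables with a base level. A minimal sketch of that encoding in base R, assuming a hypothetical factor `car_type` whose levels follow the car types listed above (the actual column names in `auto_claim` are not given here):

```r
# Hypothetical factor; the first level listed is treated as the base.
car_type <- factor(
  c("Sedan", "SUV", "Panel Truck", "Van"),
  levels = c("Panel Truck", "Pickup", "Sedan", "Sports Car", "SUV", "Van")
)

# With R's default treatment contrasts, the base level (Panel Truck) gets
# no column of its own. Dropping the intercept leaves one 0/1 column per
# remaining level.
dummies <- model.matrix(~ car_type)[, -1]

print(dummies)
```

A row of all zeros therefore means the base level, here Panel Truck.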
Credit data from UCI Machine Learning Repository.
bank_loans
bank_loans
data.frame, 41188 observations of 21 variables:
age (numeric).
type of job (categorical).
marital status (categorical).
education level (categorical: 'basic.4y', 'basic.6y', 'basic.9y', 'high.school', 'illiterate', 'professional.course', 'university.degree', 'unknown').
has credit in default? (categorical).
has housing loan? (categorical).
has personal loan? (categorical).
contact communication type (categorical).
last contact month of year (categorical).
last contact day of the week (categorical).
last contact duration, in seconds (numeric). Important note - this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
number of contacts performed during this campaign and for this client (numeric, includes last contact).
number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted).
number of contacts performed before this campaign and for this client (numeric).
outcome of the previous marketing campaign (categorical).
employment variation rate.
consumer price index.
consumer confidence index.
euribor 3 month rate.
number of employees.
has the client subscribed to a term deposit?
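Two of the notes above have practical consequences: the value 999 is a sentinel rather than a real number of days, and the call duration should be left out of a realistic model. A minimal sketch on toy rows, assuming the columns are named `pdays`, `duration`, and `y` (an assumption about `bank_loans`, not confirmed by this page):

```r
# Toy rows; the column names are assumed, not confirmed for bank_loans.
calls <- data.frame(
  pdays    = c(999, 3, 999, 6),
  duration = c(120, 0, 340, 95),
  y        = c("no", "no", "yes", "no")
)

# Split the 999 sentinel into an explicit flag plus a missing value, so a
# model does not treat 999 as a genuine number of days.
calls$previously_contacted <- calls$pdays != 999
calls$pdays[calls$pdays == 999] <- NA

# Drop duration: it is only known once the call has ended.
model_data <- calls[, setdiff(names(calls), "duration")]

print(model_data)
```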
Bike sharing demand data set.
bike_sharing_demand
bike_sharing_demand
data.frame, 17376 observations of 10 variables:
Season. 1 - winter, 2 - spring, 3 - summer, 4 - fall.
Year. 0 - 2011, 1 - 2012
Hour.
Whether the day is a holiday.
Day of the week.
Weather situation. 1 - clear or partly cloudy, 2 - mist, 3 - rain or snow.
Normalized temperature in Celsius. The values are derived via (t - t_min)/(t_max - t_min), t_min = -9, t_max = +39.
Normalized humidity. The values are divided by 100 (max).
Normalized windspeed. The values are divided by 67 (max).
Count of rental bikes in each hour.
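The normalized weather variables can be mapped back to physical units by inverting the formulas quoted above. A minimal sketch; the function names are hypothetical, and the constants are the ones stated in the descriptions:

```r
# Invert (t - t_min) / (t_max - t_min); defaults are the values quoted above.
denormalize_temp <- function(x, t_min = -9, t_max = 39) {
  x * (t_max - t_min) + t_min
}

# Humidity and windspeed were divided by their stated maxima.
denormalize_humidity  <- function(x) x * 100
denormalize_windspeed <- function(x) x * 67

denormalize_temp(c(0, 0.5, 1))   # -9 15 39 (degrees Celsius)
denormalize_humidity(0.8)        # 80
denormalize_windspeed(0.5)       # 33.5
```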
Boston housing data set
boston
boston
data.frame, 506 observations of 14 variables:
per capita crime rate by town.
proportion of residential land zoned for lots over 25,000 sq.ft.
proportion of non-retail business acres per town.
Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
nitrogen oxides concentration (parts per 10 million).
average number of rooms per dwelling.
proportion of owner-occupied units built prior to 1940.
weighted mean of distances to five Boston employment centers.
index of accessibility to radial highways.
full-value property-tax rate per $10,000.
pupil-teacher ratio by town.
1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.
lower status of the population (percent).
median value of owner-occupied homes in $1000s.
Data used on June 18, 2020 Exam PA
customer_phone_calls
customer_phone_calls
data.frame, 10000 observations of 14 variables:
Age of the prospective customer. Integer from 17 to 98.
Occupation category. Factor with 11 levels.
Marital status. Factor with levels divorced, married, and single
Indicates whether the prospect has a housing loan. Factor with levels yes and no.
Indicates whether the prospect has a consumer loan. Factor with levels yes and no.
The type of phone the prospect uses. Factor with levels cellular and landline.
The month of the marketing call. Factor with 12 levels.
The day of the week of the marketing call. Factor with five levels.
Consumer price index at the time of the call. Numeric from 92.20 to 94.77.
Consumer confidence index at the time of the call. Numeric from -50.8 to -26.9.
Short-term interest rate at the time of the call. Numeric from 0.634 to 5.045.
Number of employees of ABC Insurance at the time of the call. Numeric from 4964 to 5228.
Indicator of purchase. Integer (1 for purchase, or 0 for no purchase.)
Years of education. Integer from 1 to 16.
Customer value data set from December 2019 PA
customer_value
customer_value
data.frame, 48842 observations of 8 variables:
Age of the prospective policyholder. Integer from 17 - 90
Indicator of the amount of education - it is not the number of years of education, but a higher number does mean more years. Integer from 1 to 16.
Marital status. For married, AF means alternative form while civ means civil. Factor with seven levels.
Occupations have been grouped into five categories. There is no indication regarding what they mean. A sixth group represents cases where the occupation is unknown. Factor with six levels.
Capital gains recorded on investments. Integer from 0 to 99,999.
Number of hours worked per week. Integer from 1 to 99
A proprietary "insurance score" developed by MEB. Real number with two decimal places.
Indicator of whether a policyholder is High or Low value. Factor with two levels.
Titanic passengers as used in ExamPA.net’s practice exam
exam_pa_titanic
exam_pa_titanic
data.frame, 906 observations of 11 variables:
Passenger id
Survived Y/N
Ticket class
Name
male, female
Age
# of siblings or spouses aboard the Titanic.
# of parents or children aboard the Titanic.
Ticket number.
Cost of ticket.
Port of Embarkation. C = Cherbourg, Q = Queenstown, S = Southampton.
Health insurance claims as used in ExamPA.net’s Practice Exam. The data set consists of prior year’s health insurance claims, along with patient demographic information, from Freedom Health.
health_insurance
health_insurance
data.frame, 1338 observations of 7 variables:
Age of policy holder.
M or F.
Body Mass Index: weight in kilograms divided by the square of height in meters.
Number of children.
Smoker status. Yes or No.
Geographic region.
Annual medical claims for this policy.
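For reference, body mass index is conventionally computed as weight in kilograms divided by the square of height in meters. A minimal sketch with a hypothetical helper `bmi()`; the data set stores the index itself, so this is only a reading aid:

```r
# Conventional BMI definition: kg / m^2.
bmi <- function(weight_kg, height_m) {
  weight_kg / height_m^2
}

bmi(70, 1.75)  # about 22.86
```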
Auto crash data set from SOA June 2019 PA
june_pa
june_pa
data.frame, 23137 observations of 14 variables:
Measures the extent of the crash using factors such as number of injuries and fatalities, the number of vehicles involved, and other factors. A positive number with two decimal places.
Calendar year of the crash. Integer 2014 - 2019.
Calendar month of the crash. Integer 1 - 12 (1 = January, 12 = December.)
Time of day, in 4-hour blocks. Integer 1 - 6 (1 = midnight to 4am, 6 = 8pm to midnight.)
Special features of the road where the crash occurred. NONE = no special feature, INTERSECTION = the meeting of at least two roads, RAMP = exit or entrance ramp to a controlled access road, DRIVEWAY = entrance to home or business, OTHER.
Description of the road where the crash occurred. STRAIGHT-LEVEL = no curves or hills, STRAIGHT-GRADE = no curves, but on a hill (up or down), STRAIGHT-OTHER, CURVE-LEVEL = on a curve but no hill, CURVE-GRADE = on a curve and on a hill, CURVE-OTHER, OTHER.
Classification of the road type. STATE HWY = maintained by the state government, US HWY = maintained by the federal government.
Design of the road. TWO-WAY-PROTECTED-MEDIAN = traffic in both directions, separated with a barrier, TWO-WAY-UNPROTECTED-MEDIAN = separated but with no barrier, TWO-WAY-NO-MEDIAN = no separation, ONE-WAY, UNKNOWN.
Material used for the road surface. SMOOTH ASPHALT, COARSE ASPHALT, CONCRETE, GROOVED CONCRETE, OTHER.
Condition of the road. DRY, WET, ICE-SNOW-SLUSH, OTHER.
Lighting. DAYLIGHT, DARK-NOT-LIT = no street lamps in area, DARK-LIT, DUSK, DAWN, OTHER.
Weather conditions. CLEAR, RAIN, CLOUDY, SNOW, OTHER.
Any items that control traffic flow. SIGNAL = lighted stop/go signal, STOP-SIGN, YIELD, NONE, OTHER.
Whether the crash occurred in a work area. YES or NO.
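The time-of-day variable groups the 24 hours into six 4-hour blocks. A minimal sketch of that mapping, assuming the blocks are consecutive and start on the hour (block 1 covers hours 0 to 3, block 6 covers hours 20 to 23); the helper name `hour_to_block()` is hypothetical:

```r
# Map an hour of the day (0-23) to a 4-hour block numbered 1-6.
hour_to_block <- function(hour) {
  stopifnot(all(hour >= 0 & hour <= 23))
  hour %/% 4 + 1
}

hour_to_block(c(0, 3, 4, 20, 23))  # 1 1 2 6 6
```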
Data used on June 16, 2020 Exam PA
patient_length_of_stay
patient_length_of_stay
data.frame, 10000 observations of 13 variables:
Number of days between admission into and discharge from hospital. Integer 1 - 14.
Patient gender. Male or Female.
Patient age (in 10-year age bands). [0, 10), [10, 20), ..., [90, 100)
Patient race. AfricanAmerican, Asian, Caucasian, Hispanic, Other.
Patient weight (in 25-pound weight bands). [0, 25), [25, 50), ..., [175, 200)
Identifier corresponding to the type of hospital admission. 1 = Emergency, 2 = Urgent, 3 = Elective, 4 = Not Available.
Indicates whether upon admission, metformin was prescribed or there was a change in the dosage. Up = dosage was increased, Down = dosage was decreased, Steady = dosage did not change, No = drug was not prescribed.
Indicates whether upon admission, insulin was prescribed or there was a change in the dosage. Up = dosage was increased, Down = dosage was decreased, Steady = dosage did not change, No = drug was not prescribed.
Indicates whether patient had been readmitted after an inpatient stay in the twelve months preceding the encounter. <30 = patient was readmitted in less than 30 days, >30 = patient was readmitted in more than 30 days, No = no record of readmission.
Number of procedures performed in the twelve months preceding the encounter. Integer 0 - 6.
Number of distinct medications administered in the twelve months preceding the encounter. Integer 1 - 67.
Number of inpatient visits of the patient in the twelve months preceding the encounter. Integer 0 - 21.
Number of diagnoses entered to the system in the twelve months preceding the encounter. Integer 1 - 16.
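The age and weight variables are recorded as half-open bands such as [0, 10). A minimal sketch of how raw values could be binned the same way with base R's `cut()`; the input vector is made up, and the labels `cut()` produces omit the space after the comma:

```r
# Made-up ages, binned into 10-year bands closed on the left.
ages <- c(0, 9, 10, 47, 99)
age_band <- cut(ages, breaks = seq(0, 100, by = 10), right = FALSE)

as.character(age_band)
# "[0,10)"   "[0,10)"   "[10,20)"  "[40,50)"  "[90,100)"
```

Setting `right = FALSE` is what makes each interval include its lower bound and exclude its upper bound.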
Data used on June 19, 2020 Exam PA
patient_num_labs
patient_num_labs
data.frame, 10000 observations of 14 variables:
Age of prospective customer. Integer from 17 to 98.
Occupation category. Factor with 11 levels.
Marital status. Factor with levels divorced, married, and single
Indicates whether the prospect has a housing loan. Factor with levels no, yes.
Indicates whether the prospect has a consumer loan. Factor with levels no, yes.
The type of phone the prospect uses. Factor with levels cellular, landline.
The month of the marketing call. Factor with 12 levels.
The day of the week of the marketing call. Factor with five levels.
Consumer price index at the time of the call. Numeric from 92.20 to 94.77.
Consumer confidence index at the time of the call. Numeric from -50.8 to -26.9.
Short-term interest rate at the time of the call. Numeric from 0.634 to 5.045.
Number of employees of ABC Insurance at the time of the call. Numeric from 4964 to 5228.
Indicator of purchase. Integer (1 for purchase, or 0 for no purchase.)
Years of education. Integer from 1 to 16.
Pedestrian activity data set.
pedestrian_activity
pedestrian_activity
data.frame, 11373 observations of 7 variables:
The count of pedestrians during one hour starting at the indicated time.
Hourly weather condition, eleven categories.
Hourly temperature in degrees Fahrenheit.
Hourly precipitation in inches.
Time at beginning of the measuring hour.
Day of the week.
Predicted daily average temperature in degrees Fahrenheit.
SOA Hospital Readmissions Sample Exam, 2019.
readmission
readmission
data.frame, 66782 observations of 9 variables:
The target variable, it is 1 for patients that were readmitted, 0 otherwise.
M indicates male, F indicates female.
There are four categories: Black, Hispanic, Others, White.
The number of emergency room visits prior to the hospital stay associated with the readmission, an integer.
Diagnostic Related Group classification. There are three categories: MED (for medical), SURG (for surgical), UNGROUP.
Length of hospital stay in days, an integer.
The patient's age in years, an integer. (Note that while most Medicare recipients are age 65 or older, there are circumstances in which those under 65 can receive benefits.)
Hierarchical Condition Category risk score. It is designed to be an estimate of a patient's condition and prospective costs. It is a continuous variable, rounded to three decimal places. Higher numbers indicate greater risk.
Complications, with five levels: MedicalMCC.CC, MedicalNoC, Other, SurgMCC.CC, SurgNoC. MCC.CC means complications or comorbidities that may be major. NoC means no complications or comorbidities.
SOA Student Success PA Sample Project, 2019.
student_success
student_success
data.frame, 585 observations of 33 variables:
student's school (binary: GP (Grand Pines) or MHS (Marble Hill School)).
student's sex (binary: female or male).
student's age (numeric: from 15 to 22).
student's home address type (binary: U (Urban) or R (Rural)).
family size (binary: GT3 (> 3) or LE3 (<= 3)).
parent's status (binary: A (Apart) or T (Together)).
mother's education (numeric from 0 - 4. 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education (high school), or 4 - higher education (college)).
father's education. (numeric from 0 - 4. 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education (high school), or 4 - higher education (college)).
mother's job (nominal: teacher, health (health care related), services (civil services, administrative or police), at_home, or other).
father's job (nominal: teacher, health (health care related), services (civil services, administrative or police), at_home, or other).
reason to choose school (nominal: home (close to home), reputation (school reputation), course (course preference), other).
student's guardian (nominal: mother, father, or other).
home to school travel time (numeric: 1 - < 15 min, 2 - 15 to 30 min, 3 - 30 min to 1 hour, or 4 - > 1 hour).
weekly study time (numeric: 1 - < 2 hour, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - > 10 hours).
number of past class failures (numeric: n if 0 <= n < 3, else 3).
extra educational support (binary: yes or no).
extra family supplement (binary: yes or no).
extra paid classes (binary: yes or no).
extra-curricular activities (binary: yes or no).
attended nursery school (binary: yes or no).
wants to take higher education (binary: yes or no).
internet access at home (binary: yes or no).
has a romantic relationship (binary: yes or no).
quality of family relationships (numeric: from 1 - very bad to 5 - excellent).
free time after school (numeric: from 1 - very low to 5 - very high).
going out with friends (numeric: from 1 - very low to 5 - very high).
weekday alcohol consumption (numeric: from 1 - very low to 5 - very high).
weekend alcohol consumption (numeric: from 1 - very low to 5 - very high).
current health status (numeric: from 1 - very bad to 5 - very good).
number of school absences (numeric: from 0 to 75).
first trimester grade (numeric: from 0 to 20).
second trimester grade (numeric: from 0 to 20).
third trimester grade (numeric: from 0 to 20).
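The past-failures variable is capped: values of 3 or more are all stored as 3. A minimal sketch of that rule, with a hypothetical helper `cap_failures()`:

```r
# "n if 0 <= n < 3, else 3" is the elementwise minimum of n and 3.
cap_failures <- function(n) {
  stopifnot(all(n >= 0))
  pmin(n, 3)
}

cap_failures(c(0, 1, 2, 3, 5))  # 0 1 2 3 3
```

A stored value of 3 therefore means three or more failures, not exactly three.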
The travel insurance dataset.
travel_insurance
travel_insurance
data.frame, 10000 observations of 7 variables:
Distance traveled in trip, in km
Number of nights spent on trip
Main reason for the trip. Vacation includes holiday, leisure, or recreation. Visit includes visiting friends or relatives.
Age of adult survey respondent in six age bins. 1: 19-24, 2: 25-34, 3: 35-44, 4: 45-54, 5: 55-64, 6: 65+.
Number of other persons that accompanied the respondent on the trip
Main mode of transportation, car or plane
Total spending on trip, in Canadian $
The travel spending dataset.
travel_spending
travel_spending
data.frame, 4884 observations of 11 variables:
Calendar quarter of trip
Trip province of origin
Distance traveled in trip, in km
Number of nights spent on trip
Main reason for the trip. Vacation includes holiday, leisure, or recreation. Visit includes visiting friends or relatives.
Age of adult survey respondent in six age bins
Gender of adult survey respondent
Household income, in Canadian $
Number of other persons that accompanied the respondent on the trip
Main mode of transportation, car or plane
Total spending on trip, in Canadian $