
104 Cards in this Set

  • Front
  • Back

Data collected by the investigator for the specific study at hand

Primary data

Data collected by someone else and used by the investigator

Secondary data

Examples of sources for secondary data

County health departments


Vital statistics (birth, death certificates)


Hospital, clinic, school records, City and county governments


State government programs


Federal agency statistics (Census, CDC, EPA, etc.)

Limitations of secondary data

When was it collected? For how long?


Is the data set complete?


Are the data consistent/reliable?


Is the information exactly what you need?



Secondary data limitations: When was it collected? For how long?

May be out of date for what you want to analyze


May not have been collected long enough to detect trends


Years from different data sources may not match well

Secondary data limitations: Is the data set complete?

There may be missing information on some observations


Can result in incomplete or biased analysis

Secondary data limitations: Are the data consistent/reliable?

What measures were used and how were they collected?


Were variables discontinued over time?


Were measures collected in a consistent way over time?


Did variables change in definition over time?

Secondary data limitations: Is the information exactly what you need?

In some cases, you may have to use "proxy variables": variables that approximate something you really wanted to measure. Are they correlated with what you actually want to measure?

Advantages of secondary data

No need to reinvent the wheel: If someone has already collected the data, take advantage of it if possible


It has great exploratory value: Exploring research questions and formulating hypotheses to test


Less expensive


Takes less time


It may be very accurate: especially when a government agency has collected the data with a big investment of time and money

Examples of primary data

Surveys


Focus groups


Questionnaires


Personal interviews


Experiments


Observational study

Limitations of primary data

Do you have the time and money for designing your collection instrument, selecting your population or sample, pretesting/piloting the instrument to work out sources of bias, administering the instrument, and creating the database?


Uniqueness: may not be able to compare to other populations


Researcher error: sample bias and measurement errors

How to choose between primary or secondary data

If the appropriate data exist in secondary form, then use them to the extent you can, keeping in mind limitations


But if they do not, and you are able to collect primary data, then primary data collection is the method of choice

Public, secondary data sets commonly used for health studies

County Health rankings


CDC Mortality


CDC BRFSS


NHANES


NHIS


EPA's TRI


HRSA's AHRF

The county health rankings data

Co-developed by Univ. of Wisconsin and Robert Wood Johnson Foundation


Designed primarily as a tool for county and state governments to identify health priority areas


Includes multiple data years


Each county within a state is ranked on health outcome factors and on four factors that may contribute to outcomes

County health rankings: Factors


These four health factors contain the independent variables that may influence health outcomes; the outcomes are the dependent variables

Health behaviors, clinical care, social and economic factors, and physical environment

County Health Rankings Factors: Health behaviors

smoking, obesity, food environment, physical inactivity, exercise opportunities, excessive drinking, motor vehicle death rate, STI rate, teen birth rate

County Health Rankings Factors: Clinical care

uninsured rate, PCP rate, dentists, preventable hospital stays, diabetic screening, mammography screening

County Health Rankings Factors: Social and Economic

High school graduation rate, some college, unemployment, children in poverty, inadequate social support, children in single parent households, violent crime rate



County Health Rankings Factors: Physical environment

Particulate matter, drinking water safety, access to recreational facilities, limited access to healthy foods, fast food restaurants

County Health rankings: Sources and Data

Sources for each variable are provided (US Census, CDC surveys, Area Resource File, etc.)


Full datasets for each state or the nation can be downloaded as Excel files

Limits of county Health rankings

County level, not individuals


Some missing data


Years vary


"Construct validity" of some measures is questionable

CDC Mortality Data

Crude and age-adjusted mortality rates


Provided for every county and every year 1968-2013


Rates can be provided based on diagnosis, age, sex, race, and rural-urban settings
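
As a rough illustration of how a crude rate and an age-adjusted rate can differ, the sketch below uses direct standardization, one common adjustment method. The age groups, death counts, populations, and standard-population weights are made-up numbers for illustration, not CDC data.

```python
# Hypothetical county data: (deaths, population) for each age group.
county = {"0-44": (20, 60_000), "45-64": (90, 25_000), "65+": (390, 15_000)}
# Hypothetical standard-population weights (they must sum to 1).
standard_weights = {"0-44": 0.60, "45-64": 0.26, "65+": 0.14}

def crude_rate(groups, per=100_000):
    """Total deaths divided by total population."""
    deaths = sum(d for d, _ in groups.values())
    population = sum(p for _, p in groups.values())
    return deaths / population * per

def age_adjusted_rate(groups, weights, per=100_000):
    """Weighted sum of age-specific rates (direct standardization)."""
    return sum(weights[g] * d / p for g, (d, p) in groups.items()) * per

crude = crude_rate(county)
adjusted = age_adjusted_rate(county, standard_weights)
print(f"crude: {crude:.1f}, age-adjusted: {adjusted:.1f} per 100,000")
```

Because this hypothetical county is older than the standard population, its age-adjusted rate comes out lower than its crude rate.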

Limits of CDC Mortality Data

Diagnostic systems change


County data, not individual


Suppressed values for small numbers

Strengths of CDC Mortality Data

Long period of time, validated diagnostics


Population rates


Available by subgroups

CDC BRFSS

Behavioral Risk Factor Surveillance System


Annual telephone surveys conducted by CDC on health behaviors, health status, demographics 1984-2014


More than 500,000 interviews per year

CDC BRFSS Limits and strengths

Good for state or national summaries or trends


Addition of cell phones in 2011 makes trends problematic pre to post 2011 (threat to internal validity: selection)


Individual data, not county


Dependent on telephone surveys; about 50% nonresponse

CDC: NHANES

National Health and Nutrition Examination Survey


Annual surveys since 1975


More recent two-year aggregates starting 1999-2000, up to 2013-2014


Each two-year sample has ~12,000


Includes extensive data collection including questionnaire, blood and urine samples, physical examination


Data organized into different files with a common identifier

NHANES Limits

Some measures change from year to year


Some measures on smaller sub-samples


Same persons are not measured over time (cross-sectional)


No public geographic data

NHANES Strengths

Data available at individual level


Hundreds of publications using NHANES data


Lab and exam data are unique

CDC NHIS

National Health Interview Survey


In-person household interviews collected by US Census for CDC since 1957


Tracks national health status, health care access, and progress toward achieving national health objectives (Healthy people 2020)

Four core survey sections of NHIS

Household


Family


Sample Adult


Sample child


-Includes questions on health conditions, health status, activity limitations, health behaviors, health care use, mental health services, communication disorders

NHIS overlap with NHANES and BRFSS

NHIS is not as in-depth as NHANES; different sampling strategy (more representative than BRFSS); bigger sample than NHANES

NHIS Limits

Some measures change from year to year


Same persons are not measured over time (cross-sectional)


No public geographic data

NHIS Strengths

Data available at individual level


Hundreds of publications using NHIS data


"The principal source of information on the health of the civilian noninstitutionalized population of the United States"

EPA: TRI

Toxics Release Inventory


Facilities that manufacture, use, store or dispose of identified toxic chemicals must report quantity of releases since 1988


Research shows that TRI releases are associated with outcomes such as higher cancer rates, CVD, birth defects


Purpose of public reporting is to encourage release reduction and development of safer chemicals

TRI limits

Only facilities with at least 10 full-time employees that manufacture, process or use a listed chemical at a certain amount have to report

Facilities may underreport



TRI Strengths

Individual facility and chemical data each year


Unique data


Encourages lower use of toxics

Area Health Resources File (AHRF)

Produced by HRSA


County level database produced each year


Combination of demographic, some limited health data


Rich source for health care measures (numbers of doctors by specialty, hospitals)

AHRF Limits

County aggregate data, not individual


Some measures not available or updated annually

AHRF Strengths

Unique data for health care professions supply over time

INdicators (Indiana Indicators)

A resource for state, county, regional indicators


Gathered from other sources


Not research-friendly as raw data are not available

Combining secondary data

Research opportunities in combining data across 2 or more of these secondary sources (TRI release and CDC mortality rates; county ranking factors and CDC mortality rates)
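
One way to picture combining two county-level sources is a join on a shared county identifier. The sketch below assumes both sources key their rows by county FIPS code; the release amounts and mortality rates are invented values, not actual TRI or CDC figures.

```python
# Hypothetical TRI totals (pounds released), keyed by county FIPS code.
tri_releases = {"18097": 1200.0, "18003": 300.0, "18157": 50.0}
# Hypothetical age-adjusted mortality rates (deaths per 100,000).
mortality = {"18097": 850.2, "18003": 790.5, "18089": 910.0}

def inner_join(left, right):
    """Keep only the counties that appear in both sources."""
    return {k: (left[k], right[k]) for k in sorted(left.keys() & right.keys())}

combined = inner_join(tri_releases, mortality)
# Counties present in only one source are dropped by the join.
unmatched = sorted(set(tri_releases) ^ set(mortality))

print(combined)
print("unmatched counties:", unmatched)
```

Checking the unmatched list is one way to spot the missing-data and mismatched-year problems listed earlier as limits of secondary data.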

Other secondary data sources

SEER


National Inpatient Sample


Dept. of Energy data


EPA power plant data


Medicare claims


NCHS birth records

All members of a defined group

Population

A portion or a subset of a population

Sample

What makes a good sample?

one that is representative of the population


-Just like the population, only smaller


-Exception for qualitative research

Ways to define a target population for selecting a sample

Total size


Inclusion and exclusion criteria


Sample drawn from the target population using probability or non-probability sampling

Methods of sampling: Probability Sampling

Every member of the population has a known, non-zero probability of being selected


Also called 'representative sampling'

Methods of sampling: Non-probability sampling

Some members have a chance to be selected, others do not


Based on convenience or judgment about the target population to select from, or about the features in the sample you want to be sure to represent


Also called 'non-representative sampling'

Probability sampling: Simple Random Sampling

-Selected by using chance or random numbers


-Each individual in the population has equal chance of being selected


-Therefore, all members of the population must be known and have an equal chance of being selected


Ex. Drawing names from a hat, random numbers
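
A minimal sketch of simple random sampling, assuming the full population list is known. The population of 100 labeled people and the seed are hypothetical.

```python
import random

# Hypothetical sampling frame: every member of the population is listed.
population = [f"person_{i}" for i in range(1, 101)]

rng = random.Random(42)  # fixed seed only so the draw can be repeated
# Draw 10 members without replacement; each has an equal chance of selection.
sample = rng.sample(population, k=10)
print(sample)
```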

Probability sampling: Systematic Sampling

-Select a random starting point and then select every k'th subject in the population


-The population must be known


-All members must have the possibility of being selected
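
A minimal sketch of systematic sampling under the same assumption of a known, listed population. The function name `systematic_sample` and the numbers are illustrative only.

```python
import random

def systematic_sample(population, n, rng):
    """Pick a random start, then take every k-th member."""
    k = len(population) // n      # sampling interval
    start = rng.randrange(k)      # random starting point in the first interval
    return population[start::k][:n]

population = list(range(1, 101))  # hypothetical list of 100 subject IDs
rng = random.Random(3)
sample = systematic_sample(population, 10, rng)
print(sample)
```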

Probability sampling: Stratified Sampling

-Divide the population into at least two different groups with common characteristics (e.g. men and women), called strata (singular: stratum), then randomly draw some subjects (%) from each group


-To be sure that important subgroups are represented


-oversampling is also a deliberate approach to make sure that small groups are represented


-all of population must be known
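
A minimal sketch of stratified sampling with the same sampling fraction in every stratum (proportional allocation). The helper `stratified_sample`, the 60/40 split, and the 10% fraction are assumptions for illustration; oversampling would use a larger fraction for the small strata.

```python
import random
from collections import defaultdict

def stratified_sample(units, stratum_of, fraction, rng):
    """Randomly draw the same fraction of subjects from each stratum."""
    strata = defaultdict(list)
    for unit in units:
        strata[stratum_of(unit)].append(unit)
    sample = []
    for name in sorted(strata):
        members = strata[name]
        n = max(1, round(len(members) * fraction))  # at least one per stratum
        sample.extend(rng.sample(members, n))
    return sample

# Hypothetical population: 60 women ("F") and 40 men ("M").
units = [("F", i) for i in range(60)] + [("M", i) for i in range(40)]
sample = stratified_sample(units, lambda u: u[0], 0.10, random.Random(1))
print(sample)
```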

Probability sampling: Cluster sampling

-Groups are selected, not individuals


-Divide the population into naturally occurring groups (called clusters), randomly select some of the groups, and then collect data from all or random sample of members of the selected groups


-Groups or clusters might be schools, precincts, counties, etc.
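
A minimal sketch of cluster sampling, assuming schools are the clusters and data are collected from every student in each selected school. The school and student names are hypothetical.

```python
import random

# Hypothetical clusters: 10 schools with 20 students each.
schools = {
    f"school_{s}": [f"s{s}_student_{i}" for i in range(1, 21)]
    for s in range(1, 11)
}

rng = random.Random(7)
# Randomly select groups (schools), not individuals.
chosen = rng.sample(sorted(schools), k=3)
# Collect data from all members of the selected groups.
cluster_sample = [student for name in chosen for student in schools[name]]
print(chosen, len(cluster_sample))
```

Drawing a random subset of students within each chosen school instead of taking all of them would turn this into a two-stage (multistage) design.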

Probability sampling: Multistage sampling

-More than one sampling step


-Usually starts with cluster sampling, then do simple random, systematic, or stratified sampling within each cluster


-Or start with stratifying clusters

Non-probability sampling: Convenience sampling

-Use subjects that are easily accessible


-Probably not representative of the population, but may be useful for early exploratory work or if other methods are not available


-Ex. using family members or students in a classroom, mall shoppers

Non-probability sampling: Respondent-assisted sampling

-Persons sampled first identify persons to be sampled next


-Ex. snowball sampling


-Can be useful when target population is unknown or unavailable

Non-probability sampling: Quota sampling

-Divide population into important groups (similarly to stratified sampling) (E.g., age groups for males and females)


-Then sample from each group until target sample sizes (quotas) are reached


-This is non-probability sampling because some members have no chance of selection after quotas are reached

Non-probability sampling: Purposive sampling

-A form of qualitative inquiry


-Used in focus groups or other qualitative methods


-Deliberately want to represent certain opinions or perspectives

Types of probability sampling

Simple random


Systematic


Stratified


Cluster


Multi-stage

Advantages of probability sampling

More rigorous, represents population better

Types of non-probability sampling

Convenience


Respondent-assisted (snowball)


Quota


Purposive


Multi-stage

Advantages of non-probability sampling

More practical, can be used to target certain subgroups

Type of sampling where respondents recruit their peers, with a financial reward offered for each recruitment, and those recruited can also get a reward for recruiting others

Respondent driven sampling (RDS)


(A type of respondent-assisted sampling)

What two factors determine how large a sample should be?

Sample size


Response rates

Sample size

Generally, more is better for quantitative research


-A bigger sample is more likely to represent the population than a small sample


-A bigger sample makes it easier to detect statistically significant effects


-But a really big sample might lead to significant p values by chance

Sample size can be estimated from what 3 considerations?

1. What type I error you select


2. What type II error you select


3. Effect size

Sample size considerations: Type I error

p<0.05 is typical


The probability of concluding that an effect exists in the population when it really does not


-Also called alpha

Sample size considerations: Type II error

0.20 is typical


The probability of concluding that no effect exists in the population when it really does


-This is also called beta

Power

1-Beta=Power


Power is the probability of detecting an effect when one is really there


0.80 or more is typical

Sample size considerations: Effect size

How big the effect really is in the population


-Usually the hardest part to estimate; we often don't know what the real effect is in the population, that's why we're doing the study!


-Bigger effects can be detected with smaller samples, small effects require big samples

Formula for effect size and how to estimate it

Effect size can be estimated based on the type of statistical test (E.g. a t-test to compare two means)


Effect size = (pop mean 1 - pop mean 2)/SD


This requires knowing the population means and SD


Conventional effect sizes: small=0.2,medium=0.5,large=0.8
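
Since the population means and SD are rarely known, the effect size is often estimated from pilot or published sample data. The sketch below does this with a pooled sample SD (commonly called Cohen's d); the two small groups of scores are made up.

```python
import math
import statistics

def cohens_d(a, b):
    """(mean of a - mean of b) divided by the pooled sample SD."""
    na, nb = len(a), len(b)
    pooled_var = (
        (na - 1) * statistics.variance(a) + (nb - 1) * statistics.variance(b)
    ) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(pooled_var)

group_1 = [5, 6, 7, 8, 9]  # hypothetical scores
group_2 = [4, 5, 6, 7, 8]
d = cohens_d(group_1, group_2)
print(round(d, 3))  # between the conventional medium (0.5) and large (0.8)
```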

Four interrelated factors for sampling

Sample size, alpha, beta, effect size


-Once we know any 3, the 4th is determined
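
To show how fixing three of the factors determines the fourth, the sketch below solves for sample size per group when comparing two means. It assumes a two-sided test and uses the normal approximation n = 2(z_alpha/2 + z_beta)^2 / d^2, so exact t-test calculations can give slightly larger numbers. The function name `n_per_group` is illustrative.

```python
import math
from statistics import NormalDist

def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate subjects needed per group for a two-group comparison."""
    z = NormalDist()                       # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)     # two-sided type I error
    z_beta = z.inv_cdf(power)              # power = 1 - beta
    return math.ceil(2 * (z_alpha + z_beta) ** 2 / effect_size ** 2)

# Conventional small, medium, and large effect sizes.
sizes = {d: n_per_group(d) for d in (0.2, 0.5, 0.8)}
print(sizes)
```

With alpha = 0.05 and power = 0.80, the small effect needs several hundred subjects per group while the large effect needs only a few dozen.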

Why are big samples generally better?

small samples may be unable to detect effects

Sample response rates

Generally, higher is better (will represent the population with less bias)


-No absolute rule (response rates of 70-80% are often considered strong but are hard to achieve; response rates >50% are sometimes acceptable, meaning there is more known than unknown)

Ways to deal with survey non-response

Make multiple requests, compare early wave vs late wave responders


Weighting responses: underrepresented groups have higher weight
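
A minimal sketch of weighting, assuming the true population shares of each group are known. Each respondent's weight is the group's population share divided by its share of the respondents, so underrepresented groups count for more. All numbers are invented.

```python
from collections import Counter

# Hypothetical known population shares.
population_share = {"men": 0.5, "women": 0.5}
# Hypothetical respondents as (group, answered yes = 1 / no = 0).
respondents = (
    [("men", 1)] * 12 + [("men", 0)] * 18
    + [("women", 1)] * 49 + [("women", 0)] * 21
)

n = len(respondents)
counts = Counter(group for group, _ in respondents)
# Men are 30% of respondents but 50% of the population, so their weight is > 1.
weights = {g: population_share[g] / (counts[g] / n) for g in counts}

unweighted = sum(y for _, y in respondents) / n
weighted = (
    sum(weights[g] * y for g, y in respondents)
    / sum(weights[g] for g, _ in respondents)
)
print(round(unweighted, 2), round(weighted, 2))
```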

Ways to deal with item non-response

Can delete surveys case-wise (deleting the whole observation) if an item has missing data


Imputation

Estimating a missing item based on other non-missing data

Imputation

A missing observation is assigned the mean value from the non-missing cases


Crude, can be ok if extent of missing data is small

Mean imputation
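
A minimal sketch of mean imputation, with missing items represented as `None`. The values are hypothetical.

```python
import statistics

def mean_impute(values):
    """Replace each missing value with the mean of the non-missing cases."""
    observed = [v for v in values if v is not None]
    fill = statistics.mean(observed)
    return [fill if v is None else v for v in values]

readings = [120, None, 130, 140, None]  # hypothetical survey item
imputed = mean_impute(readings)
print(imputed)
```

Every imputed case gets the same value, which shrinks the variability of the item; that is one reason the method is considered crude.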

A more accurate but more technically difficult type of imputation

Estimating the missing value based on known items from the same person

Definition: The science of learning from data, and of measuring, controlling and communicating uncertainty

Statistics

Summarizing data for easier interpretation


Also called "summary statistics"


Univariate (they describe or summarize one variable at a time)

Descriptive statistics

Making predictions or inferences about a population from sample data


Statistics to infer about a population based on results from a sample

Inferential statistics

3 common descriptive statistics

Measures of central tendency


Measures of variation


Frequency distributions

The average value of a set of scores in the data

Central tendency

3 common measures of central tendency

Mean, median (middle value), mode (most common value)
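
A short sketch of the three measures using Python's standard `statistics` module; the scores are made up.

```python
import statistics

scores = [2, 3, 3, 5, 7, 10]  # hypothetical data

mean = statistics.mean(scores)      # arithmetic average
median = statistics.median(scores)  # middle value (average of the two middle values here)
mode = statistics.mode(scores)      # most common value
print(mean, median, mode)
```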

When are mean, median, and mode useful and when are they not appropriate?

All are useful when the variable is interval or ratio


Means and medians are not appropriate for nominal or ordinal measures (mode may be okay)

Measures of the spread or variability in the data values

Measures of variation

Common measures of variation (5)

Range (highest minus lowest)


Maximum, minimum


Standard deviation (average distance between each data point and the mean)


Variance


Standard error
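
A short sketch of these measures on made-up data. It assumes the sample (n - 1) versions of variance and standard deviation, and takes the standard error of the mean as SD divided by the square root of n.

```python
import math
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data

minimum, maximum = min(values), max(values)
value_range = maximum - minimum              # highest minus lowest
variance = statistics.variance(values)       # sample variance (n - 1)
sd = statistics.stdev(values)                # square root of the variance
se = sd / math.sqrt(len(values))             # standard error of the mean
print(minimum, maximum, value_range, round(variance, 3), round(sd, 3), round(se, 3))
```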

Frequency distribution: how are they organized? What are they useful for? What are they used for?

Organized from low to high

Useful for summarizing counts


Are used mainly for nominal or ordinal variables, but can be used for interval or ratio variables with limited response options
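
A minimal sketch of a frequency distribution for an ordinal survey item, listed from low to high. The response scale and answers are hypothetical.

```python
from collections import Counter

scale = ["disagree", "neutral", "agree"]  # ordinal levels, low to high
responses = ["agree", "neutral", "agree", "disagree",
             "agree", "neutral", "agree", "disagree"]

counts = Counter(responses)
# One row per level: (level, count, proportion of all responses).
table = [(level, counts[level], counts[level] / len(responses)) for level in scale]
for level, count, share in table:
    print(f"{level:<10}{count:>3}{share:>8.0%}")
```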



Familiar statistical significance tests for inferential statistics (7)

Chi-square


t-test


ANOVA (F-test)


r


R^2


odds ratios


confidence intervals

Statistical significance test used for categorical (nominal or ordinal) data


-Not for continuous


-Cell values are counts

Chi-Square
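
A minimal sketch of how the chi-square statistic is built from cell counts: each observed count is compared with the count expected if the two variables were unrelated. The 2x2 table is made up.

```python
def chi_square(table):
    """Return the chi-square statistic and degrees of freedom for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df

# Hypothetical counts: rows = exposed / not exposed, columns = outcome yes / no.
stat, df = chi_square([[30, 20], [20, 30]])
print(stat, df)
```

Here the statistic is 4.0 with 1 degree of freedom, which exceeds the usual 0.05 critical value of about 3.84.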

Statistical significance test used to compare 2 (and only 2!) means


-Dependent variable is continuous


-cell values are means


-Independent variable is categorical (unpaired: 2 different groups; paired: 2 scores from the same group)

T-test
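
A minimal sketch of the unpaired (two different groups) t statistic, assuming equal variances in the two groups so that a pooled variance can be used. The scores are hypothetical, and the p value would still have to be looked up from a t distribution with the returned degrees of freedom.

```python
import math
import statistics

def unpaired_t(a, b):
    """Pooled-variance t statistic and degrees of freedom for two groups."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled_var = (
        (na - 1) * statistics.variance(a) + (nb - 1) * statistics.variance(b)
    ) / df
    se = math.sqrt(pooled_var * (1 / na + 1 / nb))  # SE of the mean difference
    return (statistics.mean(a) - statistics.mean(b)) / se, df

t, df = unpaired_t([5, 6, 7, 8, 9], [4, 5, 6, 7, 8])
print(round(t, 3), df)
```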

Statistical significance test used to compare 2 or more means (usually at least 3)


-Dependent variable is continuous (interval or ratio)


-Independent variable is categorical (usually nominal): 2 or more different groups; 2 or more measures from the same group over time (repeated measures)

ANOVA (F-test)
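
A minimal sketch of the one-way ANOVA F statistic for different groups (not repeated measures): the variance between group means divided by the variance within groups. The three small groups are made up.

```python
import statistics

def one_way_f(groups):
    """Return F and its (between, within) degrees of freedom."""
    all_values = [v for g in groups for v in g]
    grand_mean = statistics.mean(all_values)
    # Between-groups sum of squares: group means vs. the grand mean.
    ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups)
    # Within-groups sum of squares: values vs. their own group mean.
    ss_within = sum((v - statistics.mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within

f, df_b, df_w = one_way_f([[1, 2, 3], [2, 3, 4], [3, 4, 5]])
print(f, df_b, df_w)
```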

ANOVA vs. ANCOVA

Analysis of variance (ANOVA)


Analysis of covariance (ANCOVA)-has more than 1 independent variable


Both are often used with experiments

In ANOVA or ANCOVA tests, an effect of one independent variable on a dependent variable


-One line on a graph or parallel lines

Main effects

In ANOVA or ANCOVA tests, the effect of one independent variable is influenced by another independent variable


-Two nonparallel lines on a graph

Interactions (moderators)

Statistical significance test for a correlation between two variables; both variables are continuous (Pearson) or at least one is ordinal, i.e. ranked (Spearman)


-can vary from -1 to +1


-Measures linear relationship between two variables


-Correlation is not causation

r
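
A minimal sketch of the Pearson correlation computed from its definition (the covariation of x and y divided by the product of their spreads). The paired values are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r = pearson_r(x, y)
print(round(r, 3))
```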

Statistical significance test: association between 1 continuous dependent measure and 2 or more independent measures


-independent measures can be continuous or categorical


-Can vary from 0 to 1


-Estimates the amount of variance in the dependent variable that is 'explained' by the set of independent variables


-Very common technique in applied, non-experimental studies

R squared (Multiple linear regression)
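
To illustrate "variance explained", the sketch below computes R squared from observed values and the values predicted by a regression model. Fitting the model itself is not shown; the predictions are assumed to come from some already-fitted multiple regression and are invented for illustration.

```python
def r_squared(observed, predicted):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot

observed = [3, 5, 7, 9]            # hypothetical dependent variable
predicted = [2.5, 5.5, 6.5, 9.5]   # hypothetical model predictions
r2 = r_squared(observed, predicted)
print(round(r2, 2))  # share of variance in the dependent variable explained
```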

What is it called when an R squared test has a categorical dependent variable?

Multiple logistic regression

parametric vs. non-parametric tests

Most common are parametric (assume some form of population distribution...usually normal)


-With ordinal data, small samples (N<30), or non-normal distributions, non-parametric tests may be better (often based on comparing ranks, either original ranks or the original data converted to ranks)

Common graphs for statistical results

Bar charts


Donut and pie charts


Line charts (not appropriate for categories, but for growth or change over time, or for group differences)

Features of graphs

Can be drawn inappropriately leading to misleading conclusions


-Watch the "scales"


-Omission of labels or units on the axes


-Data values provided or not


-Self-contained legend

Features of tables

In published papers, Table 1 usually presents descriptive statistics, Tables 2+ usually inferential (can be many exceptions)


-Like figures, tables should be clearly described and self-contained. Text and tables should not duplicate too much


-Essential information, enough but not too much

Things to know to choose the right statistical test for quantitative studies (4 things)

Independent and dependent variables


How the variables will be measured


Sample size


Design