104 Cards in this Set
- Front
- Back
Data collected by the investigator for the specific study at hand |
Primary data |
|
Data collected by someone else and used by the investigator |
Secondary data |
|
Examples of sources for secondary data |
County health departments Vital statistics (birth, death certificates) Hospital, clinic, school records, City and county governments State government programs Federal agency statistics (Census, CDC, EPA, etc.) |
|
Limitations of secondary data |
When was it collected? For how long? Is the data set complete? Are the data consistent/reliable? Is the information exactly what you need? |
|
Secondary data limitations: When was it collected? For how long? |
May be out of date for what you want to analyze May not have been collected long enough to detect trends Years from different data sources may not match well |
|
Secondary data limitations: Is the data set complete? |
There may be missing information on some observations Can result in incomplete or biased analysis |
|
Secondary data limitations: Are the data consistent/reliable? |
What measures were used and how were they collected? Were variables discontinued over time? Were measures collected in a consistent way over time? Did variables change in definition over time? |
|
Secondary data limitations: Is the information exactly what you need? |
In some cases, may have to use "proxy variables"- variables that may approximate something you really wanted to measure. Is there correlation to what you actually want to measure? |
|
Advantages of secondary data |
No need to reinvent the wheel: If someone has already collected the data, take advantage of it if possible It has great exploratory value: Exploring research questions and formulating hypotheses to test Less expensive Takes less time It may be very accurate, especially when a government agency has collected the data with a big investment of time and money |
|
Examples of primary data |
Surveys Focus groups Questionnaires Personal interviews Experiments Observational study |
|
Limitations of primary data |
Do you have the time and money for designing your collection instrument, selecting your population or sample, pretesting/piloting the instrument to work out sources of bias, administering the instrument, and database creation? Uniqueness: may not be able to compare to other populations Researcher error: sample bias and measurement errors |
|
How to choose between primary or secondary data |
If the appropriate data exist in secondary form, then use them to the extent you can, keeping in mind limitations But if it does not, and you are able to collect primary data, then it is the method of choice |
|
Public, secondary data sets commonly used for health studies |
County Health rankings CDC Mortality CDC BRFSS NHANES NHIS EPA's TRI HRSA's AHRF |
|
The county health rankings data |
Co-developed by Univ. of Wisconsin and Robert Wood Johnson Foundation Designed primarily as a tool for county and state governments to identify health priority areas Includes multiple data years Each county within a state is ranked on health outcome factors and on four factors that may contribute to outcomes |
|
County health rankings: Factors These four health factors contain the independent variables that may influence health outcomes; outcomes are the dependent variables |
Health behaviors, clinical care, social and economic factors, and physical environment |
|
County Health Rankings Factors: Health behaviors |
smoking, obesity, food environment, physical inactivity, exercise opportunities, excessive drinking, motor vehicle death rate, STI rate, teen birth rate |
|
County Health Rankings Factors: Clinical care |
uninsured rate, PCP rate, dentists, preventable hospital stays, diabetic screening, mammography screening |
|
County Health Rankings Factors: Social and Economic |
High school graduation rate, some college, unemployment, children in poverty, inadequate social support, children in single parent households, violent crime rate |
|
County Health Rankings Factors: Physical environment |
Particulate matter, drinking water safety, access to recreational facilities, limited access to healthy foods, fast food restaurants |
|
County Health rankings: Sources and Data |
Sources for each variable are provided (US Census, CDC Surveys, area resource file, etc.) Full datasets for each state or the nation can be downloaded as Excel files |
|
Limits of county Health rankings |
County level, not individuals Some missing data Years vary "Construct validity" of some measures is questionable |
|
CDC Mortality Data |
Crude and age-adjusted mortality rates Provided for every county and every year 1968-2013 Rates can be provided based on diagnosis, age, sex, race, and rural-urban settings |
|
Limits of CDC Mortality Data |
Diagnostic systems change County data, not individual Suppressed values for small numbers |
|
Strengths of CDC Mortality Data |
But long period of time, validated diagnostics Population rates Available by subgroups |
|
CDC BRFSS |
Behavioral Risk Factor Surveillance System Annual telephone surveys conducted by CDC on health behaviors, health status, demographics 1984-2014 More than 500,000 interviews per year |
|
CDC BRFSS Limits and strengths |
Good for state or national summaries or trends Addition of cell phones in 2011 makes trends problematic pre to post 2011 (threat to internal validity: selection) Individual data, not county Dependent on telephone surveys, about 50% nonresponsive |
|
CDC: NHANES |
National Health and Nutrition Examination Survey Annual surveys since 1975 More recent two-year aggregates starting 1999-2000, up to 2013-2014 Each two-year sample has ~12,000 Includes extensive data collection including questionnaire, blood and urine samples, physical examination Data organized into different files with a common identifier |
|
NHANES Limits |
Some measures change from year to year Some measures on smaller sub-samples Same persons are not measured over time (cross-sectional) No public geographic data |
|
NHANES Strengths |
Data available at individual level Hundreds of publications using NHANES data Lab and exam data are unique |
|
CDC NHIS |
National Health Interview Survey In-person household interviews collected by US Census for CDC since 1957 Tracks national health status, health care access, and progress toward achieving national health objectives (Healthy people 2020) |
|
Four core survey sections of NHIS |
Household Family Sample Adult Sample child -Includes questions on health conditions, health status, activity limitations, health behaviors, health care use, mental health services, communication disorders |
|
NHIS overlap with NHANES and BRFSS |
NHIS is not as in-depth as NHANES; different sampling strategy (more representative than BRFSS); bigger sample than NHANES |
|
NHIS Limits |
Some measures change from year to year Same persons are not measured over time (cross-sectional) No public geographic data |
|
NHIS Strengths |
But available at individual level Hundreds of publications using NHIS data "The principal source of information on the health of the civilian noninstitutionalized population of the United States" |
|
EPA: TRI |
Toxics Release Inventory Facilities that manufacture, use, store or dispose of identified toxic chemicals must report quantity of releases since 1988 Research shows that TRI releases are associated with outcomes such as higher cancer rates, CVD, birth defects Purpose of public reporting is to encourage release reduction and development of safer chemicals |
|
TRI limits |
Only facilities with at least 10 full-time employees that manufacture, process or use a listed chemical at a certain amount have to report
Facilities may underreport |
|
TRI Strengths |
Individual facility and chemical data each year Unique data Encourages lower use of toxics |
|
Area Health Resources File (AHRF) |
Produced by HRSA County level database produced each year Combination of demographic, some limited health data Rich source for health care measures (numbers of doctors by specialty, hospitals) |
|
AHRF Limits |
County aggregate data, not individual Some measures not available or updated annually |
|
AHRF Strengths |
Unique data for health care professions supply over time |
|
INdicators (Indiana Indicators) |
A resource for state, county, regional indicators Gathered from other sources Not research-friendly as raw data are not available |
|
Combining secondary data |
Research opportunities in combining data across 2 or more of these secondary sources (TRI release and CDC mortality rates; county ranking factors and CDC mortality rates) |
|
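A minimal sketch of combining two county-level sources, assuming both can be keyed by county FIPS code. The FIPS codes, variable names, and values below are made up for illustration and are not real TRI or CDC figures.

```python
# Hypothetical county-level records keyed by FIPS code (illustrative values only).
tri_releases = {"18001": 1200.0, "18003": 450.5, "18005": 0.0}   # pounds released
mortality = {"18001": 812.3, "18003": 790.1, "18007": 845.9}     # deaths per 100,000

def merge_by_county(left, right):
    """Inner join: keep only counties present in both sources."""
    shared = sorted(set(left) & set(right))
    return [(fips, left[fips], right[fips]) for fips in shared]

merged = merge_by_county(tri_releases, mortality)
for fips, release, rate in merged:
    print(fips, release, rate)
```

Counties missing from either source drop out of the merged result, which is one way the "is the data set complete?" limitation shows up in practice.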
Other secondary data sources |
SEER National Inpatient Sample Dept. of Energy data EPA power plant data Medicare claims NCHS birth records |
|
All members of a defined group |
Population |
|
A portion or a subset of a population |
Sample |
|
What makes a good sample? |
one that is representative of the population -Just like the population, only smaller -Exception for qualitative research |
|
Ways to define a target population for selecting a sample |
Total size Inclusion and exclusion criteria Sample drawn from the target population using probability or non-probability sampling |
|
Methods of sampling: Probability Sampling |
Every member of the population has a known, non-zero probability of being selected Also called 'representative sampling' |
|
Methods of sampling: Non-probability sampling |
Some members have a chance to be selected, others do not Based on convenience or judgment about the target population to select from, or about the features in the sample you want to be sure to represent Also called 'non-representative sampling' |
|
Probability sampling: Simple Random Sampling |
-Selected by using chance or random numbers -Each individual in the population has equal chance of being selected -Therefore, all members of the population must be known and have an equal chance of being selected Ex. Drawing names from a hat, random numbers |
|
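A minimal sketch of simple random sampling, assuming the whole population is available as a list. The population of 100 labels and the seed are hypothetical.

```python
import random

population = [f"person_{i}" for i in range(1, 101)]  # hypothetical sampling frame of 100 people

rng = random.Random(42)            # fixed seed so the draw is reproducible
srs = rng.sample(population, 10)   # 10 members drawn without replacement, equal chance each
print(srs)
```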
Probability sampling: Systematic Sampling |
-Select a random starting point and then select every k'th subject in the population -The population must be known -All members must have the possibility of being selected |
|
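A minimal sketch of systematic sampling, assuming an ordered list of the full population. The function name `systematic_sample`, the frame of 100 IDs, and k = 10 are illustrative choices.

```python
import random

def systematic_sample(frame, k, rng):
    """Pick a random start in the first k positions, then take every k-th member."""
    start = rng.randrange(k)
    return frame[start::k]

frame = list(range(1, 101))        # hypothetical ordered list of 100 member IDs
sys_sample = systematic_sample(frame, 10, random.Random(7))
print(sys_sample)
```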
Probability sampling: Stratified Sampling |
-Divide the population into at least two different groups with common characteristics (e.g. men and women), then randomly draw some subjects (%) from each group (called strata or stratum) -To be sure that important subgroups are represented -oversampling is also a deliberate approach to make sure that small groups are represented -all of population must be known |
|
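A minimal sketch of stratified sampling, assuming every member's stratum is known in advance. The helper `stratified_sample`, the 60/40 split, and the 10% sampling fraction are hypothetical.

```python
import random

def stratified_sample(members, stratum_of, fraction, rng):
    """Group members by stratum, then draw the same fraction at random from each."""
    strata = {}
    for m in members:
        strata.setdefault(stratum_of(m), []).append(m)
    sample = {}
    for name, group in strata.items():
        n = max(1, round(len(group) * fraction))   # at least one per stratum
        sample[name] = rng.sample(group, n)
    return sample

# Hypothetical population: 60 women and 40 men, as (id, sex) pairs.
people = [(i, "F") for i in range(60)] + [(i, "M") for i in range(60, 100)]
strat = stratified_sample(people, lambda p: p[1], 0.10, random.Random(1))
print({k: len(v) for k, v in strat.items()})
```

Oversampling a small group would amount to using a larger fraction for that stratum than for the others.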
Probability sampling: Cluster sampling |
-Groups are selected, not individuals -Divide the population into naturally occurring groups (called clusters), randomly select some of the groups, and then collect data from all or random sample of members of the selected groups -Groups or clusters might be schools, precincts, counties, etc. |
|
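A minimal sketch of cluster sampling, assuming the clusters are schools and that every member of each selected school is included. The school names and roster sizes are made up.

```python
import random

# Hypothetical clusters: 5 schools, each with a roster of 20 student IDs.
schools = {f"school_{s}": [f"s{s}_{i}" for i in range(20)] for s in range(5)}

rng = random.Random(3)
chosen = rng.sample(sorted(schools), 2)   # randomly select 2 whole clusters
cluster_sample = [student for name in chosen for student in schools[name]]  # keep every member
print(chosen, len(cluster_sample))
```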
Probability sampling: Multistage sampling |
-More than one sampling step -Usually starts with cluster sampling, then do simple random, systematic, or stratified sampling within each cluster -Or start with stratifying clusters |
|
Non-probability sampling: Convenience sampling |
-Use subjects that are easily accessible -Probably not representative of the population, but may be useful for early exploratory work or if other methods are not available -Ex. using family members or students in a classroom, mall shoppers |
|
Non-probability sampling: Respondent-assisted sampling |
-Persons sampled first identify persons to be sampled next -Ex. snowball sampling -Can be useful when target population is unknown or unavailable |
|
Non-probability sampling: Quota sampling |
-Divide population into important groups (similarly to stratified sampling) (E.g., age groups for males and females) -Then sample from each group until target samples sizes (quota) are reached -This is non-probability sampling because some members have no chance of selection after quotas are reached |
|
Non-probability sampling: Purposive sampling |
-A form of qualitative inquiry -Used in focus groups or other qualitative methods -Deliberately want to represent certain opinions or perspectives |
|
Types of probability sampling |
Simple random Systematic Stratified Cluster Multi-stage |
|
Advantages of probability sampling |
More rigorous, represents population better |
|
Types of non-probability sampling |
Convenience Respondent-assisted (snowball) Quota Purposive Multi-stage |
|
Advantages of non-probability sampling |
More practical, can be used to target certain subgroups |
|
Type of sampling where respondents recruit their peers, receiving a financial reward for each recruitment, and those recruited can also get a reward for recruiting others |
Respondent driven sampling (RDS) (A type of respondent-assisted sampling) |
|
What two factors determine how large a sample should be? |
Sample size Response rates |
|
Sample size |
Generally, more is better for quantitative research -A bigger sample is more likely to represent the population than a small sample -A bigger sample makes it easier to detect statistically significant effects -But a really big sample might lead to significant p values by chance |
|
Sample size can be estimated from what 3 considerations? |
1. What type I error you select 2. What type II error you select 3. Effect size |
|
Sample size considerations: Type I error |
p<0.05 is typical The probability of concluding that an effect exists in the population when it really does not -Also called alpha |
|
Sample size considerations: Type II error |
p<0.20 is typical The probability of concluding that no effect exists in the population when it really does -This is also called beta |
|
Power |
1-Beta=Power Power is the ability to detect an effect when one is really there 0.80 or more is typical |
|
Sample size considerations: Effect size |
How big the effect really is in the population -Usually the hardest part to estimate; we often don't know what the real effect is in the population, that's why we're doing the study! -Bigger effects can be detected with smaller samples, small effects require big samples |
|
Formula for effect size and how to estimate it |
Effect size can be estimated based on the type of statistical test (e.g., a t-test to compare two means) Effect size = (pop mean 1 - pop mean 2)/SD This requires knowing the pop means and SD Conventional effect sizes: small=0.2, medium=0.5, large=0.8 |
|
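A minimal sketch of estimating the effect size for a two-mean comparison. Because the population means and SD are rarely known, this version assumes they are approximated from pilot data using sample means and a pooled SD. The function name and the scores are hypothetical.

```python
import math
import statistics

def effect_size(group1, group2):
    """Standardized mean difference: (mean1 - mean2) / pooled SD."""
    n1, n2 = len(group1), len(group2)
    v1, v2 = statistics.variance(group1), statistics.variance(group2)
    pooled_sd = math.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    return (statistics.mean(group1) - statistics.mean(group2)) / pooled_sd

# Hypothetical pilot scores for two groups.
d = effect_size([12, 14, 16, 18, 20], [10, 12, 14, 16, 18])
print(round(d, 3))
```

With these made-up scores the result falls between the conventional medium (0.5) and large (0.8) benchmarks.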
Four interrelated factors for sampling |
Sample size, alpha, beta, effect size -Once we know any 3, the 4th is determined |
|
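A rough sketch of how fixing alpha, power (1 - beta), and effect size determines sample size. It assumes a two-sided comparison of two means and uses a normal approximation, so an exact t-based calculation may give slightly larger numbers. The function name `n_per_group` is illustrative.

```python
import math
from statistics import NormalDist

def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate n per group for a two-sided, two-sample comparison of means."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # critical value for the type I error rate
    z_beta = NormalDist().inv_cdf(power)            # value corresponding to 1 - beta
    return math.ceil(2 * (z_alpha + z_beta) ** 2 / effect_size ** 2)

for label, d in [("small", 0.2), ("medium", 0.5), ("large", 0.8)]:
    print(label, n_per_group(d))
```

The loop shows the pattern on the effect size card: the smaller the effect, the larger the sample needed to detect it.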
Why are big samples generally better? |
small samples may be unable to detect effects |
|
Sample response rates |
Generally, higher is better (will represent the population with less bias) -No absolute rule (response rates of 70-80% are often considered strong but are hard to achieve; response rates >50% are sometimes acceptable, meaning there is more known than unknown) |
|
Ways to deal with survey non-response |
Make multiple requests, compare early wave vs late wave responders Weighting responses: underrepresented groups have higher weight |
|
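A minimal sketch of weighting responses, assuming each group's weight is its population share divided by its share among respondents. The group names, shares, and counts are hypothetical, and real survey weighting is usually more involved.

```python
def response_weights(population_share, sample_counts):
    """Weight for each group = population share / share among respondents."""
    total = sum(sample_counts.values())
    return {g: population_share[g] / (sample_counts[g] / total) for g in sample_counts}

# Hypothetical: men are half the population but only 30 of 100 respondents.
weights = response_weights({"men": 0.5, "women": 0.5}, {"men": 30, "women": 70})
print(weights)
```

The underrepresented group ends up with a weight above 1 and the overrepresented group with a weight below 1.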
Ways to deal with item non-response |
Can delete surveys case-wise (deleting the whole observation) if an item has missing data Imputation |
|
Estimating a missing item based on other non-missing data |
Imputation |
|
A missing observation is assigned the mean value from the non-missing cases Crude, can be ok if extent of missing data is small |
Mean imputation |
|
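A minimal sketch of mean imputation, assuming missing values are stored as `None`. The function name and the responses are illustrative.

```python
import statistics

def mean_impute(values):
    """Replace each missing value (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = statistics.mean(observed)
    return [fill if v is None else v for v in values]

imputed = mean_impute([4, None, 6, 8, None])   # hypothetical item responses
print(imputed)
```

Every missing case receives the same value, so the spread of the variable shrinks. That is one reason the method is considered crude when much of the data is missing.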
A more accurate but more technically difficult type of imputation |
Estimating the missing value based on known items from the same person |
|
Definition: The science of learning from data, and of measuring, controlling and communicating uncertainty |
Statistics |
|
Summarizing data for easier interpretation Also called "summary statistics" Univariate (they describe or summarize one variable at a time) |
Descriptive statistics |
|
Making predictions or inferences about a population from sample data Statistics to infer about a population based on results from a sample |
Inferential statistics |
|
3 common descriptive statistics |
Measures of central tendency Measures of variation Frequency distributions |
|
The average value of a set of scores in the data |
Central tendency |
|
3 common measures of central tendency |
Mean, median (middle value), mode (most common value) |
|
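A short sketch of the three measures using Python's standard `statistics` module. The scores are hypothetical.

```python
import statistics

scores = [2, 3, 3, 5, 7, 10]   # hypothetical interval-level scores

mean_score = statistics.mean(scores)      # arithmetic average
median_score = statistics.median(scores)  # middle value (average of the two middle scores here)
mode_score = statistics.mode(scores)      # most common value
print(mean_score, median_score, mode_score)
```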
When are mean, median, and mode useful and when are they not appropriate? |
All are useful when the variable is interval or ratio Means and medians are not appropriate for nominal or ordinal measures (mode may be okay) |
|
Measures of the spread or variability in the data values |
Measures of variation |
|
Common measures of variation (5) |
Range (highest minus lowest) Maximum, minimum Standard deviation (average distance between each data point and the mean) Variance Standard error |
|
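A short sketch of these measures of variation, assuming the data are a sample, so the variance divides by n - 1. The scores are hypothetical.

```python
import math
import statistics

scores = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical sample

value_range = max(scores) - min(scores)    # highest minus lowest
variance = statistics.variance(scores)     # sample variance (divides by n - 1)
sd = statistics.stdev(scores)              # square root of the sample variance
se = sd / math.sqrt(len(scores))           # standard error of the mean
print(value_range, variance, sd, se)
```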
Frequency distribution: how are they organized? What are they useful for? What are they used for? |
Organized from low to high
Useful for summarizing counts Are used mainly nominal or ordinal variables, but can be used for interval or ratio variables with limited response options |
|
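A minimal sketch of a frequency distribution for an ordinal variable, listed from low to high. The response labels and their ordering are hypothetical.

```python
from collections import Counter

responses = ["agree", "neutral", "agree", "disagree", "agree", "neutral"]  # hypothetical answers
order = ["disagree", "neutral", "agree"]   # low-to-high order for the ordinal scale

counts = Counter(responses)
freq_table = [(level, counts[level]) for level in order]
for level, n in freq_table:
    print(f"{level:10s} {n}")
```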
Familiar statistical significance tests for inferential statistics (7) |
Chi-square t-test ANOVA (F-test) r R^2 odds ratios confidence intervals |
|
Statistical significance test used for categorical (nominal or ordinal) data -Not for continuous -Cell values are counts |
Chi-Square |
|
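A minimal sketch of the chi-square statistic computed from a table of counts. Each observed cell count is compared with the count expected if rows and columns were unrelated. The function name and the 2x2 counts are hypothetical.

```python
def chi_square(table):
    """Pearson chi-square statistic for a table of counts (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand
            stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical 2x2 counts: rows = smoker / non-smoker, columns = outcome yes / no.
stat = chi_square([[30, 20], [20, 30]])
print(stat)
```

For a 2x2 table (1 degree of freedom), the statistic would be compared with the critical value of about 3.84 at alpha = 0.05.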
Statistical significance test used to compare 2( and only 2!) means -Dependent variable is continuous -cell values are means -Independent variable is categorical (unpaired: 2 different groups; paired: 2 scores from the same group) |
T-test |
|
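A minimal sketch of the unpaired t statistic for two group means. It assumes the pooled-variance form, which treats the two groups as having similar spread. The function name and the scores are hypothetical, and turning t into a p value would need a t distribution table or a statistics package.

```python
import math
import statistics

def t_statistic(a, b):
    """Unpaired t statistic with pooled variance, plus its degrees of freedom."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    se = math.sqrt(pooled_var * (1 / na + 1 / nb))   # standard error of the mean difference
    return (statistics.mean(a) - statistics.mean(b)) / se, na + nb - 2

t, df = t_statistic([12, 14, 16, 18, 20], [10, 12, 14, 16, 18])   # hypothetical scores
print(round(t, 3), df)
```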
Statistical significance test used to compare 2 or more means (usually at least 3) -Dependent variable is continuous (interval or ratio) -Independent variable is categorical (usually nominal): 2 or more different groups; 2 or more measures from the same group over time (repeated measures) |
ANOVA (F-test) |
|
ANOVA vs. ANCOVA |
Analysis of variance (ANOVA) Analysis of covariance (ANCOVA)-has more than 1 independent variable Both are often used with experiments |
|
In ANOVA or ANCOVA tests, an effect of one independent variable on a dependent variable -One line on a graph or parallel lines |
Main effects |
|
In ANOVA or ANCOVA tests, the effect of one independent variable is influenced by another independent variable -Two nonparallel lines on a graph |
Interactions (moderators) |
|
Statistical significance test: the correlation between two variables; both variables are continuous (Pearson) or one is continuous and one is categorical (Spearman) -Can vary from -1 to +1 -Measures linear relationship between two variables -Correlation is not causation |
r |
|
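A minimal sketch of the Pearson correlation for two continuous variables, written out by hand to show why r runs from -1 to +1. The function name and the data are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson correlation: co-variation scaled by the spread of both variables."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]                       # hypothetical paired measurements
r_up = pearson_r(x, [2, 4, 6, 8, 10])     # perfectly increasing straight line
r_down = pearson_r(x, [10, 8, 6, 4, 2])   # perfectly decreasing straight line
print(r_up, r_down)
```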
Statistical significance test: association between 1 continuous dependent measure and 2 or more independent measures -Independent measures can be continuous or categorical -Can vary from 0 to 1 -Estimates the amount of variance in the dependent variable that is 'explained' by the set of independent variables -Very common technique in applied, non-experimental studies |
R squared (Multiple linear regression) |
|
What is it called when an R squared test has a categorical dependent variable? |
Multiple logistic regression |
|
parametric vs. non-parametric tests |
Most common are parametric (assume some form of population distribution, usually normal) -With ordinal data, small samples (N<30) or non-normal distributions, non-parametric tests may be better (often based on comparing ranks, either original ranks or ranks converted from the original data) |
|
Common graphs for statistical results |
Bar charts Donut and pie charts Line charts (not appropriate for categories, but for growth or change over time, or for group differences) |
|
Features of graphs |
Can be drawn inappropriately, leading to misleading conclusions -Watch the "scales" -Omission of labels or units on the axes -Data values provided or not -Self-contained legend |
|
Features of tables |
In published papers, Table 1 usually presents descriptive statistics, Tables 2+ usually inferential (can be many exceptions) -Like figures, tables should be clearly described and self-contained. Text and tables should not duplicate too much -Essential information, enough but not too much |
|
Things to know to choose the right statistical test for quantitative studies (4 things) |
Independent and dependent variables How the variables will be measured Sample size Design |