Problem Set 2 - Ancestry

[Problem Set 2 PDF] [Problem Set 2 Answer Key]


Data files and code templates for this problem set are available on Comet at:


You should make a directory for problem set 2 in your working directory with subdirectories for code and results:

mkdir /oasis/projects/nsf/csd524/$USER/ps2
mkdir /oasis/projects/nsf/csd524/$USER/ps2/code
mkdir /oasis/projects/nsf/csd524/$USER/ps2/results
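The three commands above assume the parent directory already exists and will report an error if any of the directories is already there. As an alternative, here is a minimal sketch that uses mkdir -p, which creates missing parent directories and does nothing for directories that already exist. The function name make_ps_dirs is hypothetical and not part of the course materials.

```shell
# Hypothetical helper: create the problem-set layout (code/ and results/)
# under a base path given as the first argument.
# -p creates missing parents and does not fail if a directory already exists.
make_ps_dirs() {
    base="$1"
    mkdir -p "$base/code" "$base/results"
}

# On Comet this would be called as:
#   make_ps_dirs "/oasis/projects/nsf/csd524/$USER/ps2"
```

Running the function a second time is harmless, so it can be reused for later problem sets by changing the base path.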

Installing Python packages

Use the following command to install the Python packages used in this problem set:

pip install --user scikit-learn pandas pyvcf
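After installing, it can help to confirm that the packages are importable. The sketch below is one way to do that, not part of the course materials. It assumes the interpreter is called python (override it with the PYTHON variable if yours differs), and check_py_module is a hypothetical helper name. Note that import names need not match pip names: scikit-learn is imported as sklearn and PyVCF as vcf.

```shell
# Assumption: "python" is the interpreter that pip installed into;
# set PYTHON to something else (e.g. python3) if that is not the case.
PYTHON="${PYTHON:-python}"

# Hypothetical helper: try to import one module and report the result.
check_py_module() {
    if "$PYTHON" -c "import $1" >/dev/null 2>&1; then
        echo "$1: OK"
    else
        echo "$1: MISSING"
    fi
}

# Import names for scikit-learn, pandas, and PyVCF respectively.
for mod in sklearn pandas vcf; do
    check_py_module "$mod"
done
```

A module reported as MISSING usually means the install went to a different Python than the one being run, or that the pip command failed.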

PS2 data

The data directory contains the following files you will use in the problem set:

To see how these files were created from the original 1000 Genomes file, see:

PS2 templates

The templates directory contains:

Using 23andMe data

To include your own 23andMe results in the data used for the PCA problem, see:


and edit the paths appropriately.