Introduction to Data Science (SDS 192) aims to equip students with the knowledge and tools to understand, critically evaluate, manipulate, and explain data. This is an introductory course, and no prior experience is necessary.[1] Students will learn how to read and write code, and also how to create, organize, and collaborate on coding projects while critically examining each project's goals and data sources. We will primarily use the R language, along with supplemental tools.
Each week follows the same basic structure. Monday and Wednesday classes include lectures to introduce new concepts. Each lecture is followed by interactive problem sets designed to reinforce concepts through active learning. Slides from lecture will be posted online after class. The problem sets for any class are “due” at the start of the next class period when the answers will be released; most problem sets can be completed in class. In-class problem sets do not contribute toward your grade. They are intended to reinforce material and help you test your own understanding.
Friday classes are devoted to lab activities or project work time. Students are expected to come to class for these activities. Labs include more involved problem sets that incorporate topics from the current and prior weeks. Students work on labs in groups of two to four people. Labs are reviewed through GitHub Classroom where feedback is provided.
For a full list of assignments and due dates, please see the course schedule.
This is a 4-credit course. You should expect to spend 12 hours total per week on this course: 3.75 hours of in-class instruction plus around 8.25 hours on class material outside of class.
I am a sociologist who studies abuses of power in government. I earned my Ph.D. in sociology at the University of California, Davis, with a designated emphasis in computational social science. I combine computational methods such as social network analysis, natural language processing, geospatial analysis, and machine learning with open source and governmental data to uncover patterns of malfeasance and misfeasance by our public servants. From the political networks of politicians and prohibition gangsters to bias hidden in the text of academic recruitment, I use new methods to work on old problems of corruption and inequality.
I am a visiting assistant professor in the Statistical & Data Sciences (SDS) program. I have experience working with both United States and United Kingdom governmental organizations, applying machine learning to real-world problems. In the UK, I worked with the Alan Turing Institute, the national lab for data science and machine learning, on early-detection systems in foster care to ensure children are receiving adequate services. In the US, I worked with the Internal Revenue Service to build a machine learning system that determined the credibility of incoming fraud reports.
You can send me a message on the course Slack workspace, and I will respond when I am able, typically within 24 hours during the work week. To message me, click the + button next to “Direct Messages” and search for my name.
If your question is not sensitive in nature, consider putting it in the #coding-help or #course-help channel instead. There is a good chance one of your classmates will be able to answer before I can.
Slack questions should be brief or administrative in nature. For more in-depth questions and troubleshooting please attend office hours.
You can schedule a meeting with me on Calendly. Drop-ins are welcome, but priority is given to those who make an appointment. Group appointments to address a shared question are also welcome.
If you are coming to office hours with a coding question, make sure you have the code ready at the start of your appointment. Have your computer booted up and your project open.
If you cannot find an open time slot, please message me for an appointment. I will attempt to find a time that works for both of us.
Students are not expected to buy any materials for this course. Data science is built on free and open collaboration, and there is no shortage of high-quality learning material available. This reader, as well as all assignments, is currently available for free.
Students are required to have a working computer (preferably a laptop) and reliable internet connection for this course. Any recent computer should be sufficient, with the notable exception of Chromebooks. Chromebooks lack access to the majority of the tools used by data scientists.
If you only have access to a Chromebook, please speak with me as soon as possible.
I will not be taking attendance in this course, and you do not need to inform me when you will be absent. If you are sick, please stay home. Given the standards-based grading system (discussed below), no single class, assignment, or even quiz will negatively impact your grade. That said, it will be very difficult to keep up with course material without consistent attendance.
If you miss a class, you should contact a peer to discuss what was missed, and check the course reader website for any upcoming deadlines. I won’t have the capacity to re-deliver missed material in office hours.
Quizzes cannot be made up after the open period has passed. If you have a known scheduling conflict with a quiz, please speak with me as soon as possible to arrange an alternative time.
Please see the SDS department’s official policy regarding remote learning:
In keeping with Smith’s core identity and mission as an in-person, residential college, the Program in Statistical & Data Sciences affirms College policy (as articulated by Provost Michael Thurston and Dean of the College Alex Keller) that students will attend class in person. As such, SDS courses will not provide options for remote attendance. Students who have been determined to require a remote attendance accommodation by the Office of Disability Services will be the only exceptions to this policy. As with any other kind of accommodations under the Americans with Disabilities Act (ADA), please notify your instructor during the first week of classes to schedule a meeting with them to discuss how we can work with you to provide the most accessible course possible.
Data science is inherently collaborative, so I fully expect students to collaborate. You are encouraged to work together on most assignments: ask questions on Slack, create study groups, and share helpful resources you find. However, anything you submit must be your own work. You need to be the person who writes the text and/or code. Multiple students should not submit identical work. Please note: quizzes are the one exception. Collaboration is not allowed on quizzes.
All students, staff, and faculty are bound by the Smith College Honor Code:
Students and faculty at Smith are part of an academic community defined by its commitment to scholarship, which depends on scrupulous and attentive acknowledgement of all sources of information and honest and respectful use of college resources.
Smith College expects all students to be honest and committed to the principles of academic and intellectual integrity in their preparation and submission of course work and examinations. All submitted work of any kind must be the original work of the student who must cite all the sources used in its preparation. (Smith Academic Honor Code)
Any cases of dishonesty or plagiarism will be reported to the Academic Honor Board. Examples of dishonesty or plagiarism include:
Learning to code is similar to learning a new language; you will only learn by doing. No amount of rote copying will advance you beyond the most elementary levels of understanding. Please keep this in mind.
If someone else helps you understand a concept better, give them a nod in the #shoutouts channel on Slack.
As participants in this course we are committed to making participation a harassment-free experience for everyone, regardless of level of experience, gender, gender identity and expression, sexual orientation, disability, personal appearance, body size, race, ethnicity, age, or religion. Examples of unacceptable behavior by participants in this course include the use of sexual language or imagery, derogatory comments or personal attacks, trolling, public or private harassment, insults, or other unprofessional conduct.
As the instructor I have the right and responsibility to point out and stop behavior that is not aligned with this Code of Conduct. Participants who do not follow the Code of Conduct may be reprimanded for such behavior. Instances of abusive, harassing, or otherwise unacceptable behavior may be reported by contacting the instructor.
All students and the instructor are expected to adhere to this Code of Conduct in all settings for this course: seminars, office hours, and over Slack.
This Code of Conduct is adapted from the Contributor Covenant.
Smith College is committed to providing support services and reasonable accommodations to all students with disabilities. To request an accommodation, please register with the Office of Disability Services (ODS) at the beginning of the semester.
This course will be graded using a standards-based grading system. Rather than tallying up the percentage of questions you answer correctly, I assess your responses by using a pre-defined set of course standards and then assign a level of proficiency. Throughout the semester, this course offers multiple opportunities to showcase the depth of your understanding in light of these standards.
In traditional points-style grading, an average is taken of all your assignments, and your final grade is based on that average. This means all assignments are given equal consideration in your final grade.
[Figure: final grade = mean of A1-A5]
In contrast, standards-based grading is focused on your progression through the course. Functionally, only your best score for each standard is kept. All others are effectively forgotten. The hope is that without the worry of “getting a bad grade” when you are new to a concept, you will feel free to safely engage with complicated topics early on, make mistakes, and have opportunities to show improvement without penalization.
[Figure: final grade = max of A1-A5]
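The difference between the two approaches can be sketched in a few lines. The example below is written in Python purely for illustration. The five scores are made up, and it assumes proficiency levels are recorded as numbers from 1 to 4 for a single standard.

```python
from statistics import mean

# Hypothetical proficiency levels (1-4) earned on ONE standard across
# five assignments, A1 through A5. These numbers are invented.
levels = {"A1": 1, "A2": 2, "A3": 2, "A4": 3, "A5": 4}
scores = list(levels.values())

# Points-style grading averages every attempt, so early mistakes
# pull the final result down.
points_style = mean(scores)

# Standards-based grading keeps only the best attempt.
standards_based = max(scores)

print(f"Mean of A1-A5: {points_style}")
print(f"Max of A1-A5:  {standards_based}")
```

With these made-up scores, the early low attempts lower the mean but have no effect on the max.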
A standards-based grading system carries a number of other benefits:
The following table lists all the standards you are evaluated on in this course. There are 12 total standards, separated into 4 categories. Each standard states what conditions must be met to reach each proficiency level. There are four proficiency levels for each standard, each requiring a more complete understanding of the material. These levels are inclusive, meaning that to reach the 4th level, “Exceeds Standard,” you must also meet all the requirements of level 3, “Meets Standard.”
You will have multiple opportunities to demonstrate your understanding of each standard. Any assignment that is reviewed is an opportunity to increase your proficiency level in a standard. In addition to the four levels of proficiency, there is also an extra point available in each standard called “Individual Standard.” You may fulfill this requirement only on quizzes, but only need to reach the “Meets Standard” criteria on a standard to do so.
You can demonstrate proficiency in any reviewed assignment, but can only fulfill the “Individual Standard” criteria on a quiz.
Standard | Does Not Meet Standard | Progressing Toward Standard | Meets Standard | Exceeds Standard | Individual Standard |
---|---|---|---|---|---|
Data Importing | Cannot import data or uses RStudio visual tools to import data. | Manually organizes or modifies data before importing it into R. | Can import raw data into R using the appropriate function for the data source. | Can author API calls or use other remote sources and import data directly into R. | |
Data Cleaning | Cleans data in a non-programmatic way. | Can clean data programmatically on a cell-by-cell basis to prepare it for analysis. | Can assign the correct common data types (logical, integer, numeric, factor, and string) to loaded data and understand the uses of each. Can clean data for analysis in a vectorized way. | Can prepare data for advanced types (dates, time series, etc.). Can prepare data from non-traditional sources such as OCR or web scraping. | |
Data Reshaping | Formats data in a non-programmatic way. | Can derive new measures from existing data and append it to dataframes. | Can pivot data between wide and long formats, and can explain the use case of each. | Can use lists as stores of arbitrary data structures, and subset/combine the data held within them. For example, can use a list to store iteration output, then later combine them. | |
Data Aggregation & Subsetting | Transforms data in a non-programmatic way. | Creates multiple copies of data in several intermediate stages of transformation that are used for different steps of analysis. | Can combine and split data sets using the appropriate merge or subset techniques. | Can split or merge data sets using either SQL-like calls (such as the x_join() series of functions) or approximate string matching. | |
Functions | Copies-and-pastes similar code with small changes. | Creates simple functions with consistent inputs. | Creates simple functions that can handle novel inputs, with logic to handle the data appropriately. | Created functions include conditionals and error checking to test for faulty data and describe the issue. Functions can intake multiple forms of data and handle both appropriately. For an example, see the “Make it Flex” section of Lab 5. | |
Iteration | Copies-and-pastes similar code several times within or between scripts. | Uses for loops or apply functions to iterate through vector data to perform a single data manipulation. | Can use either loops or apply functions to iterate over a vector of data and perform multi-step manipulations. | Can use loops or apply functions and explain the use cases for each. Can iterate over complex data structures such as dataframes or lists. | |
Visualization Structure | Selects inappropriate formats for data visualization. | Selects sub-optimal visualization formats or uses excessive visualizations where a single one would be sufficient. | Selects suitable formats for data visualization (bar, line, boxplot, etc.) and can explain the reasoning behind that choice. | Effectively mixes visualization formats or isolates individual elements to clearly communicate a message. For example, including a miniature table of the most important values within a bar plot. | |
Visualization Aesthetics | Chooses visual cues and colors for purely aesthetic reasons without attention to data representation. | Data visualizations attempt to represent underlying data, but use methods unsuited to the task which leave ambiguity for the viewer. | Data visualizations use color, scale, and shapes effectively to differentiate and communicate underlying data. | Data visualizations are highly customized with bespoke elements, such as callouts, to clearly communicate the message of the visualization. Aesthetics are sensitive to accessibility concerns. | |
Visualization Context | Produces data visualizations that are unclear, confusing, devoid of context, or impossible to understand without reading the text. | Produces data visualizations with readable axis labels, units, and legends (where appropriate). | Produces data visualizations that are clear and understandable with minimal textual explanation. | Produces data visualizations that are self-contained and can be understood on their own without textual explanation. | |
Data Ethics | Does not consider data ethics or investigate data provenance. | Can articulate common pitfalls and relate them to the project at hand. Confirms data types and scales using data documentation. | Reads data documentation to understand data collection/generation and measurements. Can highlight and explain to readers the potential concerns specific to the data or project. | Either creates data documentation for used data, or includes notes in code to the data sources and explains potential pitfalls. Considers and articulates relevant concerns related to the current project unprompted throughout the work cycle. | |
Code Style | Code style is inconsistent and/or lacks documentation. | Code comments explain the broad strokes of intended behavior. Indentation is consistent and predictable. Uses print statements to track the status of code execution. | Consistently comments all code and makes use of the built-in section headings in RStudio. For user-created functions, the inputs and outputs are clearly explained, and examples are provided. | Includes “sanity checks” for data validity in code. For longer scripts or iterations, includes print statements to track execution progress. | |
Git/GitHub | Does not use git for version control. | Uses git and GitHub for version control and can contribute to group repositories with commits, pushes, and pulls. | Uses git and GitHub effectively. Code commits are of appropriate size and commented well. Can branch and merge repositories while resolving any merge conflicts. Does not include sensitive files in commits. | Uses GitHub effectively for collaboration. Can create issues, ask for review, and merge branches in a manner suitable for a collaborative environment. | |
Your completion of these standards is converted into a final letter grade using the following process. Each of the 12 standards is scored on a four-point scale, with one additional point available for meeting the “Individual Standard” on a quiz.
On this scale, there are 60 points total in the course (12 standards * 5 possible points). I sum the highest level of proficiency you reach in each standard over the course of the semester to arrive at your final score. For example, if someone were to reach “Exceeds Standard” in all standards, but could never do so on a quiz, they would receive 48 of 60 points (4 points * 12 standards). Similarly, if someone reaches “Meets Standard” in all topics, including on quizzes, but did not reach “Exceeds Standard” in any topic, they would likewise receive 48 of 60 points.
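The two examples above can be reproduced with a short calculation. This is an illustrative sketch in Python, not the actual gradebook. The function name `total_points` is hypothetical, and it assumes “Meets Standard” is worth 3 points and “Exceeds Standard” is worth 4, as the examples imply.

```python
N_STANDARDS = 12  # number of standards, as used in the text above

def total_points(best_levels, quiz_met):
    """Sum the best level reached in each standard (1-4), plus one
    point for each standard whose Individual Standard was met on a quiz.

    best_levels: list of ints, the highest level reached per standard
    quiz_met:    list of bools, True if Individual Standard was met
    """
    if len(best_levels) != N_STANDARDS or len(quiz_met) != N_STANDARDS:
        raise ValueError("expected one entry per standard")
    return sum(best_levels) + sum(quiz_met)

# "Exceeds Standard" (assumed 4 points) everywhere, never on a quiz.
exceeds_no_quiz = total_points([4] * N_STANDARDS, [False] * N_STANDARDS)

# "Meets Standard" (assumed 3 points) everywhere, including on quizzes.
meets_with_quiz = total_points([3] * N_STANDARDS, [True] * N_STANDARDS)

print(exceeds_no_quiz, meets_with_quiz)  # both are 48 of 60
```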
The summed points will be converted into letter grades using the following table.
Letter | Points |
---|---|
A | 57-60 |
A- | 54-56 |
B+ | 52-53 |
B | 50-51 |
B- | 48-49 |
C+ | 46-47 |
C | 44-45 |
C- | 42-43 |
D+ | 40-41 |
D | 38-39 |
D- | 36-37 |
F | 0-35 |
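For anyone who wants to check their own standing, the table can be read as a simple lookup. The sketch below is in Python for illustration only. The function name `letter_grade` is hypothetical, and the cutoffs are copied from the table above.

```python
# Lowest point total that earns each letter, copied from the table above,
# ordered from highest to lowest.
CUTOFFS = [
    (57, "A"), (54, "A-"),
    (52, "B+"), (50, "B"), (48, "B-"),
    (46, "C+"), (44, "C"), (42, "C-"),
    (40, "D+"), (38, "D"), (36, "D-"),
    (0, "F"),
]

def letter_grade(points):
    """Return the letter grade for a summed point total between 0 and 60."""
    if not 0 <= points <= 60:
        raise ValueError("points must be between 0 and 60")
    # Walk down the cutoffs and stop at the first one the total reaches.
    for floor, letter in CUTOFFS:
        if points >= floor:
            return letter

print(letter_grade(48))  # the 48-point examples above fall in the B- range
```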
Assignments turned in late will not be reviewed, and will not be considered for demonstrating proficiency in course standards. Keep in mind that missing an assignment will not hurt your grade, but it does remove one chance for you to demonstrate your knowledge of course material. If you do not think you will be able to turn in an assignment by the deadline, you may request an extension. To do so, please send me a message explaining why you are unable to complete the assignment in the expected time frame. Extension requests must be made, and accepted, before the assignment due date.
After the due date, late assignments are only reviewed if there are emergency circumstances preventing you from turning the assignment in on time.
Q: So if I reach “Exceeds Standard” and fulfill the individual standard on a quiz for a topic early in the semester, I can just skip those questions for the rest of the class?
A: Theoretically yes, but I would recommend you answer all questions to make sure you’re not letting your knowledge slip.
[1] If this is your first course in the SDS department, you also need to enroll in SDS 100.