We had a chance to sit down with AACRAO 2022 Annual Meeting presenters Sung-Woo Cho, Associate Vice Provost for Academic Data Analytics, and Nathan Greenstein, Assistant Director of Machine Learning, both from the University of Oregon, to discuss their
upcoming session in Portland, OR, "Using Machine Learning to Predict Student Success and Combat Inequity."
Learning opportunities are myriad for members attending this session, ranging from machine learning applications in higher education to ensuring equity and combating bias. To begin the interview, Sung-Woo and Nathan gave us a little background on themselves
and their motivation for presenting at this year's AACRAO Annual Meeting.
This is a relatively new unit, and we're about a year into its existence. So we're in the Office of the Provost and report to the Provost. Our mission is to use the university's data, whether in the form of student records or text data in the form of student responses, surveys, or course evaluations, and most of what we're doing these days is using machine learning to either predict student outcomes using numerical data or to better understand what these students are saying and thinking, particularly as it relates to the instruction that they're getting through text analytics using NLP (Natural Language Processing) methods.
I came to all of this through a slightly strange path. I originally studied cognitive science, so I was first exposed to human learning and natural neural networks in the actual brain, and I gradually started pivoting toward artificial neural networks and other machine learning concepts. I realized that there was a real need to start applying that work to the socially oriented spaces where I'm interested in working - things like education and housing.
In my role at the University of Oregon, we're trying to enable some modernized student success interventions that wouldn't be possible without this ability to, somewhat, predict the future that we think machine learning provides.
We think about this in a couple of ways. We know that proactive interventions are typically more effective than reactive ones. Anticipating and preventing a problem is far better than waiting for the problem to happen and then trying to clean up the mess. The problem could be someone stopping out, failing a key class, or whatever the student success barrier is. Generally speaking, we want to prevent it before it happens instead of trying to react after it happens. That approach is more efficient from a resource perspective, those interventions tend to be much easier, and it's just more effective on the student success side - the outcomes tend to be better if we can prevent something from ever happening. The difficulty is that we don't know when something will happen until it does, unless we can use something like machine learning to predict it beforehand.
The goal is not to feed everyone's data into an algorithm and know their entire future and exactly what will happen to them at every turn; the goal is to allocate our resources as efficiently as possible. So we can see "this is the group of students who are at the greatest risk for a negative outcome." That doesn't mean it is going to happen to them. Actually, by and large, most of the students who get marked as at-risk are still going to be fine. It's a process of identifying the people who are most vulnerable to some interruption taking them away from a given course. If we can find those students who might be most vulnerable to that kind of issue and give them some extra support so that they can achieve the resiliency they need to not be blown off course, then we consider that a much better use of the university's resources, in terms of advising or any other student success intervention, than not doing anything until there is a problem and trying to react to it after the fact.
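To make the resource-allocation idea concrete, here is a minimal sketch in Python. It is an illustration only, not the University of Oregon team's code: the function name `select_for_outreach`, the anonymized IDs, the risk scores, and the advising capacity are all hypothetical. It assumes a model has already produced a risk score per anonymized student and simply picks the highest-risk students that the available capacity can reach.

```python
from typing import Dict, List


def select_for_outreach(risk_scores: Dict[str, float], capacity: int) -> List[str]:
    """Return the `capacity` anonymized IDs with the highest predicted risk.

    Ties are broken by ID so the result is deterministic.
    """
    if capacity < 0:
        raise ValueError("capacity must be non-negative")
    # Sort by descending score, then ascending ID.
    ranked = sorted(risk_scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [student_id for student_id, _ in ranked[:capacity]]


# Hypothetical scores from some upstream model (higher = more vulnerable).
scores = {"s01": 0.12, "s02": 0.81, "s03": 0.47, "s04": 0.66, "s05": 0.05}

# Suppose advisors can proactively reach two students this term.
outreach = select_for_outreach(scores, capacity=2)
print(outreach)  # prints ['s02', 's04']
```

Being selected here means only "offer extra support first"; as the interview stresses, most flagged students would still be fine.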
I'm curious, can you tell me a bit about data acquisition methods and privacy, and also your work in pulling that data cross-departmentally?
With regard to privacy, none of the predictive analytics work or research we've done so far uses any personally identifiable information (PII). In our minds, there was no reason to try to use PII from the start, because there's so much we need to learn first about how these algorithms are performing, and whether they're equitable or not, before we even try to touch PII.
So in terms of privacy, that hasn't been an issue on the predictive analytics side. The text analytics side is also the same, so we have no PII, and if we have text information with PII, we scrub the data first or redact the PII before we use it.
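As a rough illustration of what scrubbing text before analysis can look like, the Python sketch below replaces a few common PII patterns with placeholder tokens. The patterns (email addresses, US-style phone numbers, nine-digit ID numbers) and the function name `redact` are assumptions made for this example; the interview does not describe the team's actual redaction method, and a production pipeline would also need to handle names and other free-form identifiers that simple patterns miss.

```python
import re

# Hypothetical patterns; real data would need a broader, validated set.
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
PHONE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")
ID_NUMBER = re.compile(r"\b\d{9}\b")


def redact(text: str) -> str:
    """Replace matched PII patterns with placeholder tokens."""
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    text = ID_NUMBER.sub("[ID]", text)
    return text


sample = "Email me at jdoe@example.edu or call 541-555-0100. My ID is 951234567."
print(redact(sample))
# prints: Email me at [EMAIL] or call [PHONE]. My ID is [ID].
```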
The question on data acquisition is probably a tougher one. I think the University of Oregon, as with many other institutions, has some silos between units, divisions, and departments with regard to data governance. One of the things our unit has been charged with, though not exclusively, since other units are doing this as well, is to improve the culture of sharing data between units and departments and across campus. We're proactively trying to do this with a more structured version of data governance, thinking about data lakes, and making sure that data is more easily accessible than having to go through an email chain and get a thumbs up from everyone on that chain, which is highly inefficient. Those are the types of things that we're thinking through and will act upon with regard to data accessibility in the future.
I'll just add, for the project we've been working on most recently, we have gotten data from Admissions, Financial Aid, Information Services Group, Housing, Honors College, Orientation Programs Division, First-Year Programs Division, Athletics, and the Registrar. This is just for one single machine learning model, and your question really hits on one of the areas where we are spending a lot of our time: trying to get access to data from across different parts of the university, bring it all together in a way that makes sense, and do so in a way that everyone feels good about. There's a lot of justifiable skepticism about handing over a lot of sensitive data. We've been trying hard to get across the message that we're not collecting personally identifiable information.
I, at no point, have any way of figuring out whose data I'm looking at; even if I have it all in front of me, it is all anonymized. So we're not taking the results of our work and handing it to anyone and saying, "here is what about this student makes us concerned for their future outcomes." We're saying, "for a whole constellation of reasons and a bunch of really complicated mathematical relationships that no one human could understand, which is where the machine learning comes in, this student is potentially vulnerable to being blown off course by some external factor that we can't control." So we're not going to tell this student, or their advisor, or anyone else that this person doesn't have what it takes. Instead, we're going to say that because of structural factors that exist beyond our control, this person might be vulnerable to some interruption, and we're going to try to provide them with extra resiliency to overcome that if it does ever happen. We're not trying to set them up for a negative outcome that they might not have ever experienced; we're trying to give them extra support to weather that storm if it does hit them.
What kind of guardrails or measures are you taking to ensure equity through this process?
We're trying to do as much as we can to improve a given student's success outcome, but we're pretty much setting the universe in which we operate as the universe where we are advancing equity and not harming it.
For this most recent project, we've been working with a group within the university called Undergraduate Education and Student Success. We've been meeting with them regularly to discuss everything that we're doing from the perspective of students and advisors. We don't necessarily have the first-hand experience to know what the potential vulnerabilities are, what the potential sensitivities are, from a student's perspective or an advisor's perspective. So we've been leaning on them heavily to use that first-hand knowledge to inform what we're doing.
From the beginning, it has been one of our mantras to ensure that the technical things we are doing are being done equitably and fairly. Our mission was never to unleash a technical solution and just never think about those issues. That's really a core value of ours right from the beginning.
I'd like to add that we definitely understand some of the skepticism around machine learning and algorithmic bias or algorithmic unfairness. I really applaud that concern because it is ultimately rooted in concern for the well-being of the vulnerable people we're trying to support. We hear those concerns, and we have in the past advocated for responsible machine learning and for taking those algorithmic bias concerns very seriously. So one of the things we do with any machine learning model before ever letting it out into the world is a rigorous audit to see if the model might be performing differently for different groups or unintentionally cementing some negative outcome. One of the big problems with machine learning is that it learns from biased human behavior in the past. You see these examples of setting bail or setting parole in the justice system, and those algorithms are just learning from biased humans who've been making unfair decisions. One of the things we like about machine learning is that the entire thing can be subject to statistical analysis to see if it is performing in a statistically different way for different groups. Is the model more likely to unintentionally cement some negative bias that has existed in the past? Is it likely to have a higher success rate for some groups than for others? We worry not just about what outcomes the model predicts but also about how accurate it is; if the model is really accurate for one group and not very accurate for another group, we'd consider that another inequity because we're providing a different level of service to different demographics.
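The kind of audit described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the team's actual procedure: the function name `audit_by_group`, the group labels, and the toy records are hypothetical, and a real audit would add statistical significance tests and more metrics than the two shown here (per-group accuracy and the share of each group flagged as at-risk).

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (group label, actual outcome, predicted outcome); 1 = negative outcome / flagged
Record = Tuple[str, int, int]


def audit_by_group(records: List[Record]) -> Dict[str, Dict[str, float]]:
    """Compute accuracy and flag rate separately for each group."""
    counts = defaultdict(lambda: {"n": 0, "correct": 0, "flagged": 0})
    for group, actual, predicted in records:
        c = counts[group]
        c["n"] += 1
        c["correct"] += int(actual == predicted)
        c["flagged"] += int(predicted == 1)
    return {
        g: {"accuracy": c["correct"] / c["n"], "flag_rate": c["flagged"] / c["n"]}
        for g, c in counts.items()
    }


# Toy, made-up data for two hypothetical groups.
records = [
    ("A", 1, 1), ("A", 0, 0), ("A", 0, 0), ("A", 1, 1),
    ("B", 1, 0), ("B", 0, 1), ("B", 0, 0), ("B", 1, 1),
]
report = audit_by_group(records)
accuracies = [r["accuracy"] for r in report.values()]
accuracy_gap = max(accuracies) - min(accuracies)

print(report["A"])   # prints {'accuracy': 1.0, 'flag_rate': 0.5}
print(report["B"])   # prints {'accuracy': 0.5, 'flag_rate': 0.5}
print(accuracy_gap)  # prints 0.5
```

In this toy data both groups are flagged at the same rate, yet the model is far less accurate for group B, which is exactly the "different level of service" inequity the speakers describe.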
What are some other practical applications for this work that you've discussed or worked on?
Within the umbrella of machine learning, we're doing important things like trying to predict student outcomes. On the flip side, we're working with text in the form of what students are saying through course evaluation feedback or, in the near future, advising session notes. We're implementing machine learning strategies to better understand and categorize what these students are saying through written text. We have created and are continuing to create a series of practitioner guides based on hundreds of thousands of course evaluation responses from the past few years. These guides are designed to home in on a particular topic such as inclusivity or accessibility; we use machine learning to understand what students are generally saying about those topics, and then also give exemplar quotes from students on a particular topic such as inclusiveness.
Machine learning, for us, is becoming a pivotal tool because we can ingest a ton of information and use these methods to better understand outcomes, predicted outcomes, or themes in what students are saying, and do so much faster. In many cases, it's turning things that would have been impossible, due to the sheer amount of time and labor it would take to churn through hundreds of thousands of student feedback responses, into a real possibility for us.
One other avenue that we're not working on now but is certainly on our radar for the future is identifying potential structural obstacles within the university itself. These could be things like scheduling "gotchas" that tend to delay student graduation, such as a class that fills up or conflicts with another class. Certainly we can find those manually or anecdotally, but there's a ton of data on student degree progression, what courses students are taking and at what times, and there are patterns in that data that could be brought out with machine learning. So we could find some of those "gotchas" and try to adjust the scheduling system to address them.
Maybe there are weeder courses that are disproportionately gatekeeping students from non-traditional backgrounds or something like that, that might never come out in the data just because there's just such an overwhelming amount; every student's performance in every class over every quarter is just too much for any one person to look at, and we might never see those patterns without something like machine learning to distill that down to usable data.
Those attending this session will get to dive deep with presenters Sung-Woo Cho and Nathan Greenstein getting a behind-the-scenes look at how this technology works at the University of Oregon, advice for implementation, and an opportunity for Q&A. Learn more about the work this team is doing here.
Interested in attending? Register for AACRAO's Annual Meeting soon to receive our "early-bird" discount.