JOE BLITZSTEIN

and gathering data. Then, you clean the data, so there’s some data wrangling. There is exploratory data analysis, which involves looking for problems, biases, weird outliers, or strange anomalies in the data, as well as trying to get a sense of some possible conjectures you could formulate.

Then, it goes a little bit into modeling. We took a Bayesian approach to that. There are full courses on Bayesian data analysis, and this was just a short introduction. Then, there’s communicating and visualizing the results.
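To make the Bayesian modeling step concrete, here is a minimal sketch of a conjugate Beta-Binomial update. It is not taken from the course; the poll numbers, the uniform prior, and the function names are assumptions chosen only for illustration.

```python
def beta_binomial_update(prior_alpha, prior_beta, successes, failures):
    """Return the Beta posterior parameters after observing binomial data."""
    return prior_alpha + successes, prior_beta + failures


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


# Hypothetical poll: 60 of 100 respondents favor a candidate.
prior = (1.0, 1.0)  # Beta(1, 1), i.e. a uniform prior on the support rate
posterior = beta_binomial_update(*prior, successes=60, failures=40)

print("posterior parameters:", posterior)
print("posterior mean support:", beta_mean(*posterior))
```

The posterior mean sits slightly closer to 0.5 than the raw proportion because the prior acts like two extra observations, one for and one against.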

The sequence of steps is not linear; you iterate between them. We defined it as the data science process, and we wanted to introduce that process through examples. Covering that process in detail would have taken six courses, but we wanted to put them together into one introductory course on how to think like a data scientist. The course needed to include applications that are of current interest like predicting elections, movie and restaurant ratings, and network analysis, rather than using a lot of canned, stale data sets that no one has cared about in the last 50 years. Interesting data alone is not enough, though: we also wanted to ask relevant questions about the data.

Why is it important for data scientists to understand the data science process instead of just going through the work?

I think it’s important in whatever you do to have a sense of direction, instead of just aimlessly trying things. You want some sense of where things are going. I’m not saying it’s not useful to just grab data and hack around with it. You can learn from doing that, but in terms of doing something that will have long-term scientific value, I think that depends on relevant research questions.

Much of statistics is about distinguishing signal from noise, distinguishing valid from invalid signals, so-called “discoveries”. You need to look for patterns, but you can’t just assume that whatever pattern you find is real. You have to perform some validation, and if you cannot communicate the results in the end, it’s not worth much either.
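One simple way to check whether an apparent pattern could be noise is a permutation test. The sketch below is an illustration under stated assumptions, not a method described in the interview: the two groups of ratings are made up, and `permutation_p_value` is a hypothetical helper name.

```python
import random


def mean(xs):
    return sum(xs) / len(xs)


def permutation_p_value(group_a, group_b, n_permutations=5000, seed=0):
    """Estimate how often a random relabeling of the data produces a
    difference in means at least as large as the observed one."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    count = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            count += 1
    # Add one to numerator and denominator so the estimate is never zero.
    return (count + 1) / (n_permutations + 1)


# Made-up ratings for two hypothetical restaurants.
ratings_a = [4.1, 3.8, 4.5, 4.0, 4.3, 3.9]
ratings_b = [3.2, 3.6, 3.1, 3.5, 3.0, 3.4]

p_value = permutation_p_value(ratings_a, ratings_b)
print("approximate p-value:", p_value)
```

A small p-value says the difference would rarely arise from shuffled labels alone. It does not by itself show that the difference matters in practice.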

All of these ingredients are crucial. Different people can specialize in different parts of the process. No one can be a complete expert at every step, but data scientists in industry are working in teams. To be effective in teamwork, you have to understand some basics of what your teammates are doing. You need to be able to give them feedback, and you need to be able to understand their feedback about what you’re doing. You have to see how the various pieces fit together into the overall process.

With statistics, you can actually do cool math and also feel that you’re analyzing interesting data and doing something useful for the real world.

How did you get interested in data science and teaching this data science class?

It’s probably a combination of a lot of factors. I noticed more and more possible data science ideas and applications ever since the Netflix prize and Nate Silver. The combination of so many datasets that were never available before made me really interested, for my own sake, as well as for teaching it to students. I felt some concern that students might not have the right kind of CS training to be able to participate in all these opportunities. So, I wanted to play a role in fixing that.

Your data science class was very popular this year. Did you expect this level of popularity? How many students ended up enrolling in the class?

I had guessed there would be 100 or 150 students (which would already be a very large course), but we ended up with more than twice that many, about 350 enrollments. We tried to keep the prerequisites reasonable, but it did require at least some very basic background in Stat and CS. We didn’t want to limit enrollment or do a lottery, so we tried to send the message that this was going to be a hard class. You’re going to do a lot of work, but you’ll learn a lot. That was the idea, but I didn’t expect it to have that much demand.

Why do you think there was such a large demand for the class?

It’s hard to know. I think there are some students who took Stat 110 and wanted to have a follow-up, even though the material is different. In Stat 110, we do probability and it’s a fairly mathematical course, but we’re not analyzing data. In a data science course, we’re not doing math, but we are analyzing data. I see Stat 110 and the Data Science course as complementary, in that we are emphasizing stories and a certain way of thinking about the world in both of them.

So, it’s the applied analog, and I have a huge number of Stat 110 students who were interested in going further. Then, Hanspeter had a lot of students interested in his visualization course. The visualization itself is great, but it’s very limited if you don’t actually know how to analyze the data. So, the whole theme of big data attracts a lot of interest.

Data Science course. I want to extend that question: What is the role of storytelling, communication, and visualization in data science?

I think they’re incredibly important parts of it. Anyone with a basic level of CS can scrape a big data set and start computing things. And anyone with the right statistics background, if presented with a clear data set, can start running some regression in a mechanical way. I think there’s a real art to getting interpretable results and then communicating those results, especially in the age of big data where you have thousands of variables. In the old days of regression, you might have two predictors, and it’s a lot easier to see what’s going on. Now, we have thousands of variables and some very complicated models, and it becomes very difficult to see what’s going on.

I think communication includes communicating with yourself too! You are trying to make sense of the data in a way that human beings can understand. If you attend conferences, it’s generally hard to remember anything from the majority of a presentation. Presenters tend to rush through their slides and try to show a lot of results, but are they really explaining what the story is?

So, if statisticians (or anyone) are failing to communicate why their results are important and are failing to explain those results in an interpretable way, that’s just a lot less exciting. Visualization definitely plays an important role in that case. A picture is worth a thousand words. Sometimes instead of staring at a huge table of numbers, a few graphs can give you much more intuitive information.

Do you have any advice for data scientists or people in the industry who may want to become better communicators? What kind of philosophy would you like to impart to make them care more about the storytelling and communication part of data science? Why is the teaching part of data science so important?

I think it’s an important part of clarity of thinking. As a data scientist, you’re going to need to collaborate with many different types of people with many different backgrounds. You have to be able to put yourself in their shoes and explain things in terms of what they’re interested in and what their background is. In many cases, when you can’t explain something clearly, it’s a sign that you haven’t thought it through fully yourself. So, teaching and learning go together. Learning to explain something to someone in an interpretable way makes it a lot clearer in your own understanding.

In terms of concrete advice on developing these communication skills, I think of it in terms of something like the golden rule, which I call the conditional golden rule: try to present the idea in a way that you would have appreciated seeing it presented. It’s conditional because you have to adjust for context: as a data scientist who’s been immersed in a project for months or years, you have to step back and realize that the person you’re talking to may have never even heard of what you’re doing. They don’t know any details about the data. They don’t know your notation, and they may not even know statistics.

Also, read some of the classic design books; Edward Tufte’s The Visual Display of Quantitative Information is a famous example. Try to find and follow good examples.

What’s your opinion on his book and his philosophy on visualizing information?

I really like his books. In a sense, he’s a victim of his own fame, in that these books are so popular that it’s almost a visualization bible. So naturally, there’s going to be a backlash of people asking, “what gives him the right to say what you can or can’t do?” I wouldn’t take everything he says religiously, but these are important things to think about. Clear communication is incredibly important.

What are your favorite philosophies about visualization? What is your favorite piece of knowledge from this book, and what is your best advice for visualizing quantitative information?

I think the best advice is just to think hard about what you want your audience to take away from the visualization. It’s sad to think of how many talks I’ve been to, presentations on all kinds of subjects, where the speaker will make ridiculous mistakes, like not labeling their axes or having things so small that the audience can’t see what is going on.

Sometimes, presenters want to show some kind of comparison, but the things they’re trying to compare are on separate slides. Graphs are effective in showing something changing over time or a comparison between things, and it is more about relative information than absolute information most of the time. You want to make it as easy as possible to see those comparisons. Avoid something that looks really fancy but distracts attention from the fundamental comparison you’re trying to display.

Can you tell our readers more about your story behind the conditional golden rule?

There were two course reviews about Stat 110 that went well together. One of them said I designed the course according to the credo that it should be taught in the way I myself would like to take it as a student, which is the golden rule. Then, the other one, which is a counterpoint to that, said that the homework only induced pain, not learning. The joke is that if you combine those two things, it implies that I’m a masochist.

Obviously, I’m trying to induce learning, not pain, but it does require a lot of hard work to learn all these things. I try to make as many resources available as possible, in terms of having great Teaching Fellows, having lots of office hour times, and having large amounts of practice problems. It’s just like if you were practicing a sport or a musical instrument. It’s something that you need to practice, practice, practice. Just doing a few homework problems a week is not going to be enough.

It’s like learning a whole new language. Language courses tend to meet every day, and you have to go to labs. There are tons of things going on, but statistics and data science are new languages, too. They should be approached in the same way. You have to do the math and CS as well as learning grammar and syntax. You just have to immerse yourself in the learning process.

For my fellow students and me, we’re very fortunate to be in this environment where our only duty is to learn. But there are many data scientists out there who feel like they’re missing some knowledge and are trying hard to fill the gap. My question is in reaction to those data scientists. What’s the best way to keep on learning after university?

I noticed that’s a trap that people fall into, thinking, “I’m perpetually feeling unprepared.” It’s a dangerous way of thinking — that until you know X, Y, Z and W, you’re not going to be able to do data science. Once you start learning this thing, you realize there are four other things you need to learn. Then, you try to learn those things, and you realize you don’t have this, this, and this.

You do need some basic foundation in statistics and CS skills, but both statistics and computer science are enormous fields that are also rapidly evolving. So, you need durable concepts. Right now, for people that want to do data science, I highly recommend learning R and Python. But in 10 or 20 years, who knows what the main languages will be?

It’s a mistake to think, “why am I learning R now? R won’t be used in 20 years.” Well, first of all, R might still be used in 20 years, but even if it isn’t, there’s going to be a need for the thinking that produced R. The people who create the successors to R will have probably grown up using R. So, they’re still going to have that frame of reference.

You want the skills that are language-independent. You need fundamental ways of thinking about uncertainty and communicating those thoughts in a way that is not that dependent on any particular programming language. It’s definitely important to have that kind of foundation, but keep in mind that it’s hopeless for anyone to actually know all the relevant parts of statistics and CS, even for some small portion of data science. It’s not feasible for anyone, but it doesn’t mean that you can’t make useful contributions.

In fact, I think it’s a good idea to continue learning something new every day. The way you can learn something, and really remember it, is by using it in your work. Instead of saying, “I need to study these five books so that I will know enough to become a data scientist,” it should be about getting a basic level and foundation. Then, start immersing yourself in a real, applied problem. You will realize what types of methods you need. Then, go and study the books and papers that are relevant for that. You will understand them so much better because they’re in the context of a problem that you care about.

You have to be energetic and work really hard, but not get discouraged just because you don’t know everything. And just because you don’t know everything, it doesn’t mean you can’t contribute useful things while gradually expanding your understanding and knowledge.

To strengthen one’s understanding of a concept, would you also recommend teaching that concept to other people (stemming back to your philosophy on storytelling and communication)?

Yes. I think that’s a great way of checking your own understanding. It’s a lot of fun. You’re helping someone. You have to think about the important things to emphasize, the common misconceptions, etc. Think back to when you first learned the concept, the obstacles and conceptual roadblocks that you had to get past, and the most important things to emphasize. That is very useful for everyone.

What are the parallels between being a data scientist and being an educator?

Communication and feedback. If you’re just lecturing to a class and not paying attention to see if the students are actually understanding, that’s a pretty stupid way to teach. There’s a story of a professor who got really poor teaching evaluations, and the evaluations said his lectures were very unclear. He said, “My lectures aren’t unclear. The students just don’t understand.”

Communication is a two-way street and you have to pay attention in various ways, through feedback, watching people’s expressions, trying to get people to speak up and feel comfortable asking questions. Do whatever you can when you’re teaching to assess what people understand and what they don’t. A lot of that information stays the same from year to year.

That’s the reason why every week in Stat 110, I ask the teaching fellows for the most common mistakes from the homework. I can clarify those things or they can be clarified in the sections for that year. Those things tend to stay fairly constant from year to year, too. I don’t have a formal data set, but I am trying to gather as much information as I can about what the students understand and what they do not.

Data science is like that, too. You don’t compute something without getting feedback on whether it’s working or not. You’re communicating messages to people, but you need feedback on whether or not that message is getting across.

This is a very important idea in software development, too, with continuous deployment and instant feedback and quick iterations. It’s nice connecting data science and software engineering principles. As a data scientist, you’re always getting feedback and trying to improve.

I think that’s extremely important. Another mistake I’ve noticed in applied problems is that new students tend to want to fit one model and be done with it. But the world is too complicated. There are too many challenges with data. We know the saying, “All models are wrong, but some models are useful.”

It’s not realistic to expect that the first model you come up with will actually work well, but if it takes too long to figure out how to fit that model and run the computations on some massive data set, you may feel that you need to move on.

That’s very unsatisfying. What you need to do, first of all, is get comfortable with Python so that you can fit the models very quickly. If you have a large data set, fit it on a subset first so you can quickly get models and better intuition. You have to iterate and build something better.
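The idea of fitting on a subset first can be sketched in a few lines of plain Python. Everything here is an assumption made for illustration: the data set is synthetic (a noisy line with slope 2 and intercept 1), the subset size of 2,000 is arbitrary, and `fit_line` is a hypothetical helper that does ordinary least squares for one predictor.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


rng = random.Random(42)

# Synthetic stand-in for a large data set: y = 2x + 1 plus Gaussian noise.
xs = [rng.uniform(0, 10) for _ in range(100_000)]
ys = [2.0 * x + 1.0 + rng.gauss(0, 1) for x in xs]

# Fit on a small random subset first to get a quick answer.
subset_idx = rng.sample(range(len(xs)), 2_000)
sub_x = [xs[i] for i in subset_idx]
sub_y = [ys[i] for i in subset_idx]
slope_sub, intercept_sub = fit_line(sub_x, sub_y)

# Fit on everything once the model looks reasonable.
slope_full, intercept_full = fit_line(xs, ys)

print("subset fit:", slope_sub, intercept_sub)
print("full fit:  ", slope_full, intercept_full)
```

In this toy case the subset fit lands close to the full fit at a fraction of the cost, which leaves more time to try alternative models before committing to the full data.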

You have to manage your time so that you can actually go through a whole series of models and get feedback on which one is actually working through measures of fit or predictive capabilities. Even just explaining or communicating with someone else to try
