Harness the Power of Web Scraping

Ever heard this old joke?

“What’s the difference between minor and major surgery?”

“It’s minor if it’s you; it’s major if it’s me.”

The same concept usually applies to internet privacy. If my data is being mined without my consent, that’s major. But what about mining somebody else’s data for my own purposes?

Every time you Google something, you’re benefiting from web scraping, the automated process of extracting data from websites. Looking for the best airfare? Cheapest hotel? Yep, many of those price trackers that make your vacation affordable run on web scraping. But Priceline and Kayak don’t let you peer behind the coding curtain.

Have you ever tried to do your own web scraping for personal or business use? The big draw isn’t just finding valuable information; it’s downloading it into your own workable format for analysis and action. Imposing structure on unstructured data. There are plenty of web scraping tools and explorations of their relative merits, but I hadn’t found the right one for me until late last fall . . .
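To make "imposing structure on unstructured data" concrete, here is a minimal sketch using only Python's standard library. The sample markup, the class names (`listing`, `address`, `price`) and the helper `parse_listings` are all hypothetical, invented for illustration; a real page will look different, and a framework will do much of this work for you.

```python
from html.parser import HTMLParser

# Hypothetical markup standing in for a page you have already downloaded.
SAMPLE_HTML = """
<ul>
  <li class="listing"><span class="address">12 Oak St</span><span class="price">$350,000</span></li>
  <li class="listing"><span class="address">48 Elm Ave</span><span class="price">$425,500</span></li>
</ul>
"""


class ListingParser(HTMLParser):
    """Collect one dict per <li class="listing"> element."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._field = None  # name of the <span> currently being read, if any

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "li" and css_class == "listing":
            self.rows.append({})  # start a new record
        elif tag == "span" and css_class in ("address", "price") and self.rows:
            self._field = css_class

    def handle_data(self, data):
        if self._field:
            self.rows[-1][self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None


def parse_listings(html):
    """Turn raw HTML into a list of {"address": str, "price": int} dicts."""
    parser = ListingParser()
    parser.feed(html)
    for row in parser.rows:
        # "$350,000" becomes the integer 350000 so it can be sorted or filtered.
        row["price"] = int(row["price"].replace("$", "").replace(",", ""))
    return parser.rows


if __name__ == "__main__":
    for row in parse_listings(SAMPLE_HTML):
        print(row)
```

Once the prices are integers instead of strings, sorting, filtering and comparing become one-liners, which is the whole point of pulling the data into your own format.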

My Intro to Scrapy

I’ve never kept it a secret. I’m a big fan of the programming language Python. It’s flexible, logical and easy to read. But while it has libraries and tools that can be used for web scraping, I wasn’t aware of a full framework until it fell into my lap.

A friend embroiled in Ph.D. research found this script he couldn’t get to work, so he brought it to me. And Christmas came early! He’d stumbled upon Scrapy: “An open source and collaborative framework for extracting the data you need from websites. In a fast, simple, yet extensible way.”

Scrapy and I connected instantly. Learning it, and then having fun with it, filled all my personal project free time for two months before I put it to work in service to my next major life goal: homeownership.

Of course I know Zillow, Redfin and Trulia have all the property data anyone could ever want. Remember: Web scraping isn’t just accessing the data. It’s for people who want to stockpile the data, play with it and work it. Web scraping enables you to turn the data you need into the information you want.

Now my wife, mother-in-law and I get an email every morning with a table of all the new home listings for easy comparison of the factors we’ve deemed important. And all that data lives in my digital ecosystem. I can call on it anytime to inform decisions, formulate an offer, and push back on that greedy counteroffer.
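The morning email can be sketched roughly like this. It is not my actual script: the column names, the addresses and the function `build_listing_email` are assumptions made up for the example, and it only builds the message rather than sending it.

```python
from email.message import EmailMessage
from html import escape


def build_listing_email(listings, sender, recipients):
    """Return an EmailMessage whose HTML body is a table of listings."""
    columns = ["address", "price", "beds"]  # hypothetical comparison factors
    header = "".join(f"<th>{escape(col.title())}</th>" for col in columns)
    body_rows = "".join(
        "<tr>"
        + "".join(f"<td>{escape(str(item[col]))}</td>" for col in columns)
        + "</tr>"
        for item in listings
    )
    html = f"<table><tr>{header}</tr>{body_rows}</table>"

    msg = EmailMessage()
    msg["Subject"] = f"New home listings: {len(listings)}"
    msg["From"] = sender
    msg["To"] = ", ".join(recipients)
    # Plain-text fallback for mail clients that do not render HTML.
    msg.set_content("Your mail client does not display HTML.")
    msg.add_alternative(html, subtype="html")
    return msg


if __name__ == "__main__":
    sample = [{"address": "12 Oak St", "price": 350000, "beds": 3}]
    message = build_listing_email(sample, "me@example.com", ["family@example.com"])
    print(message["Subject"])
```

Actually delivering the message would be a job for something like `smtplib.SMTP.send_message`, run on a daily schedule; that part depends on your mail provider, so it is left out here.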

Spreading the Gospel

What’s even better than working smarter? Helping other people work smarter!

When one of the hosts of a favorite local meetup reached out to me looking for a guest presenter, I couldn’t help but gush about my current pet project. Apparently, my enthusiasm was contagious. I found myself on the spring agenda, staring down my first big technical presentation.

Community learning is the heart and soul of open-source programming. It’s also a diverse and dynamic beast. How do you energize and educate a room full of people with varying interests and skill levels? Some are just there for the free pizza and beer. A few newcomers know nothing, but are super eager to learn. And others, well, they’re just plain intimidating with how much they know.

I decided to give ’em the wow factor – a live code demonstration. But I wasn’t leaving anything to chance. I had a backup of the code readily available to copy and paste if my fingers happened to mess up while the rest of me provided the necessary presentation patter.

Beyond the code demo, some might say I over-prepped, but really, is there such a thing? I read through lots of documentation, got familiar with the Scrapy blog community, and plumbed the depths of the framework author’s website. I knew I wasn’t expected to know it all, but I wanted to be able to point people in the direction of whichever aspects most interested them.

We were able to cover a lot of ground at the meetup because I made sure to check in with the audience early and often. If there were concepts most people were familiar with, I could avoid belaboring the point. That saved time for the areas where people needed more explanation.

Go Forth & Work Smarter

All in all, the evening was a success. I got the wows I was looking for from the code demo, and no, I didn’t have to resort to my backup copy. So I’d call that a win. Even more encouraging, a handful of people came up to me with that same light I had in my eyes last November: “This is awesome! I can’t wait to get home and play with this!”

Chances are those meetup participants are now using Scrapy to mine Craigslist for cars, bikes, or the perfect job. Maybe some are implementing their own set of criteria for a real estate purchase. It’s liberating and empowering to harness the same web scraping techniques that mega-corporations use to improve our personal lives and achieve long-term goals.

I’m also hoping some of my attendees bring Scrapy to their workplace. At its simplest, it can save an intern a bunch of time populating a spreadsheet of potential leads/contacts, phone numbers and emails pilfered from various websites. More strategically, if a company is investigating alternative data models, Scrapy may provide a sense of what data is out there and how readily accessible it is.
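As a rough illustration of the spreadsheet case, the sketch below pulls email addresses and US-style phone numbers out of page text and writes them as CSV. The regular expressions are deliberately simple, and the source name, sample text and function names are hypothetical; real contact pages will need more forgiving patterns.

```python
import csv
import io
import re

# Simplified patterns, assumed for this example only.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4}")  # US-style numbers


def extract_contacts(pages):
    """pages maps a source name to its page text; return one row per source."""
    rows = []
    for source, text in pages.items():
        rows.append({
            "source": source,
            # De-duplicate and sort so the output is stable between runs.
            "emails": "; ".join(sorted(set(EMAIL_RE.findall(text)))),
            "phones": "; ".join(sorted(set(PHONE_RE.findall(text)))),
        })
    return rows


def write_csv(rows, handle):
    """Write the rows to any file-like object in spreadsheet-friendly CSV."""
    writer = csv.DictWriter(handle, fieldnames=["source", "emails", "phones"])
    writer.writeheader()
    writer.writerows(rows)


if __name__ == "__main__":
    sample_pages = {"acme": "Contact sales@acme.example or call (410) 555-0100."}
    buffer = io.StringIO()
    write_csv(extract_contacts(sample_pages), buffer)
    print(buffer.getvalue())
```

Swap the `io.StringIO` buffer for a file opened with `newline=""` and the result opens directly in Excel or any other spreadsheet tool.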

If you want to learn more about the Scrapy framework, check out the GitHub repository I set up for my presentation. Reach out if you come up with a new and interesting web scraping use case – personal or professional. I’d love to hear about either or both!

About Colin

Colin Reynolds is a modern-day philosopher and a self-professed “growth-minded productivity junky trying to automate everything.” Those two traits would seem to be at odds, but the more you automate, the more time you have to think. Thinking is one of Colin’s favorite pastimes. As a devotee of Cal Newport’s Deep Work, he consistently makes a point to turn off the distractions and enjoy some dedicated thought time.

While finishing up his bachelor’s degree in philosophy at Towson University, Colin took a few business courses and poured himself into learning Excel. It was his voluntary application of those self-taught Excel skills at a college internship that landed Colin his first paying gig. Since then, he has continued to ingratiate himself with employers and clients alike by showing them cool things they didn’t know were possible.

Colin now has a toddler showing him cool, new things he didn’t know were possible. His daughter loves to take Colin and his wife on weekend hikes, drawing their attention to the sights and sounds of nature with enthusiastic babbles of approval. Dedicated thought time may get harder to come by as family obligations increase, but where there’s a will, there’s a way (and it always involves automation).