Category: software

Do not use Selenium for web scraping

Published on: 15.12.2018

Disclaimer:
This is written primarily from the point of view of the Python ecosystem.

I have noticed that Selenium has become quite popular for scraping data from web pages.

Yes, you can use Selenium for web scraping, but it is not a good idea.

Personally, I also think that articles teaching how to use Selenium for web scraping set a bad example of which tool to use.

Why you should not use Selenium for web scraping

First, Selenium is not a web scraping tool.

It is “for automating web applications for testing purposes”, a statement taken from the Selenium homepage.

Second, in Python there is a better tool: Scrapy, an open-source web-crawling framework.

The intelligent reader will ask: “What is the benefit of using Scrapy over Selenium?”

You get speed, and a lot of it (not amphetamine :-)): speed in development and speed in scraping time.

There are articles with tips on how to make Selenium web scraping faster; if you use Scrapy, you do not have those kinds of problems and you are faster.

The mere existence of these articles is proof (at least for me) that people are using the wrong tool for the job, an example of “When your only tool is a hammer, everything looks like a nail”.

What you should use Selenium for

I personally only use Selenium for web page testing.

I would try to use it for automating web applications (if there were no other options), but I have not had that use case so far.

The exception: when you can use Selenium

The only exception I can see for using Selenium as a web scraping tool is when the website you are scraping uses JavaScript to fetch or display the data you need.

Scrapy does have a solution for JavaScript in Splash, but I have never used it; so far I have always found some workaround.

What to use instead of Selenium for web scraping

As you can guess, my advice is to use Scrapy.

I choose Scrapy because I spend less time developing web scraping programs (web spiders) and execution time is fast.

I have found Scrapy to be faster in development time because of the Scrapy shell and the cache.

In execution it is fast because multiple requests can be made simultaneously. This means that data will not be delivered in the same order as it was requested; keep that in mind so that you are not confused when debugging.
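The ordering effect can be illustrated without Scrapy at all. The following is a minimal sketch using only the standard library; `fake_fetch` and the delays are made up for illustration, and threads are used here only as a stand-in for Scrapy's own concurrency model.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_fetch(page, delay):
    """Pretend to download a page; the delay stands in for server response time."""
    time.sleep(delay)
    return page


# Hypothetical pages with different response times.
requests = [("page-1", 0.3), ("page-2", 0.1), ("page-3", 0.2)]

with ThreadPoolExecutor(max_workers=3) as pool:
    # All requests are sent at once.
    futures = [pool.submit(fake_fetch, page, delay) for page, delay in requests]
    # Results are collected as they arrive, not in the order they were sent.
    arrival_order = [future.result() for future in as_completed(futures)]

request_order = [page for page, _ in requests]
print("requested:", request_order)
print("arrived:  ", arrival_order)
```

The fastest response arrives first, so the log of processed pages will usually not match the order of the requests.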

What about Beautiful Soup + Requests

I have used this combination in the past before I decided to invest time in learning Scrapy.

Do not make the same mistake I did: both development and execution are much faster with Scrapy than with any other tool I have found so far.

Last words

This is not a rant about using Selenium for web scraping; for non-production systems and for learning or hobby projects it is fine.

I get it: Selenium is easy to start with and you can see what is happening in real time on your screen. That is a huge benefit for people starting to learn web scraping, and it is important to have this kind of early morale boost when you are learning something new.

But I do think that all these articles and tutorials using Selenium for web scraping should carry a disclaimer not to use Selenium in real life (if you need to scrape 100K pages in a day, that is not possible with a single Selenium instance).

Starting with Scrapy is harder: you have to write XPath selectors, and looking at the HTML source of a page to debug is not fun. But if you want fast web scraping, that is the price.

Conclusion

After you learn Scrapy you will be faster than with Selenium (Selenium just has a gentler learning curve). I personally needed a few days to get the basics.

Introduction to the Python package dataset

Published on: 01.12.2018

The Python package dataset describes itself as “databases for lazy people”, and the description is correct.

To save data with dataset, all you need is a Python dictionary; the keys of the dictionary become the columns of a table, and that is all.

dataset will automatically create all the necessary tables and columns.
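To make the idea concrete, here is a minimal sketch of "dictionary keys become columns" written with the standard sqlite3 module. It is not dataset's API or implementation; `save_row` and the `products` table are hypothetical names used only for illustration.

```python
import sqlite3


def save_row(conn, table, row):
    """Insert a dict, creating the table and any missing columns first.

    Table and column names are interpolated into the SQL text, so they
    must come from trusted code, never from user input.
    """
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" (id INTEGER PRIMARY KEY)')
    # PRAGMA table_info returns one row per column; index 1 is the column name.
    existing = {info[1] for info in conn.execute(f'PRAGMA table_info("{table}")')}
    for column in row:
        if column not in existing:
            conn.execute(f'ALTER TABLE "{table}" ADD COLUMN "{column}"')
    columns = ", ".join(f'"{c}"' for c in row)
    placeholders = ", ".join("?" for _ in row)
    conn.execute(
        f'INSERT INTO "{table}" ({columns}) VALUES ({placeholders})',
        list(row.values()),
    )


conn = sqlite3.connect(":memory:")
save_row(conn, "products", {"name": "lamp", "price": 20})
# The second dict has an extra key, so a "color" column is added on the fly.
save_row(conn, "products", {"name": "desk", "price": 150, "color": "oak"})
rows = conn.execute("SELECT name, price, color FROM products ORDER BY price").fetchall()
print(rows)
```

The first row has no value for the column that was added later, so it reads back as NULL (None in Python).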

Internally, data is stored in an SQLite, PostgreSQL or MySQL database; my experience so far has been only with SQLite.

My experience

In one project I use it just as an in-memory database: after data is scraped from a website, it is stored in an in-memory SQLite database.

Then I can use the standard dataset API to retrieve data matching certain criteria and sort it, before emailing it.

On another project, I use it to store data in SQLite and later to retrieve it.

I must admit that for anything beyond basic searching, filtering and sorting you have to write SQL queries.

One useful feature is upsert, a smart combination of insert and update.

If rows with matching keys exist they will be updated, otherwise a new row is inserted in the table.
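The update-or-insert logic can be sketched with the standard sqlite3 module as follows. This only illustrates the semantics described above, not how dataset implements it; the `upsert` helper and the `prices` table are hypothetical, and the sketch assumes every row has at least one non-key column.

```python
import sqlite3


def upsert(conn, row, keys):
    """Update rows whose key columns match `row`; insert `row` if none matched."""
    where = " AND ".join(f"{k} = ?" for k in keys)
    others = [c for c in row if c not in keys]
    assignments = ", ".join(f"{c} = ?" for c in others)
    cursor = conn.execute(
        f"UPDATE prices SET {assignments} WHERE {where}",
        [row[c] for c in others] + [row[k] for k in keys],
    )
    if cursor.rowcount == 0:
        # Nothing matched the keys, so this is a new row.
        columns = ", ".join(row)
        placeholders = ", ".join("?" for _ in row)
        conn.execute(
            f"INSERT INTO prices ({columns}) VALUES ({placeholders})",
            list(row.values()),
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prices (url TEXT, price REAL)")
upsert(conn, {"url": "example.com/a", "price": 10.0}, keys=["url"])
upsert(conn, {"url": "example.com/a", "price": 12.5}, keys=["url"])  # updates
upsert(conn, {"url": "example.com/b", "price": 7.0}, keys=["url"])   # inserts
rows = conn.execute("SELECT url, price FROM prices ORDER BY url").fetchall()
print(rows)
```

After the three calls the table holds two rows, because the second call updated the first row instead of adding a duplicate.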

There is also a feature to export data to CSV or JSON.

Conclusion

If you think that using a database on your next project is overkill, but you do need to filter, search or sort data, take a look at dataset.

It is much better than building a custom solution. I know because, before I learned about dataset, I stored data in pickle format and wrote custom functions for filtering, sorting and retrieving it.

The question of tradeoff in software, business, and life

Published on: 15.11.2018

In software development, it is common to have discussions about what technology is better or the best.

To beginners looking for the perfect solution, the holy grail, those discussions look wise.

But they are useless, because there is no perfect solution. The much more important question is: what tradeoffs are you making, and why?

Why are tradeoffs necessary?

In any system, improving one aspect has to come at the expense of some other aspect.

Let us take a car as an example.

I am taking a car as an example because I suppose it is easy to understand.

If you want a car to accelerate faster, you have to make it lighter, and fuel consumption will go up.

So, to increase acceleration you have to decrease weight and fuel efficiency.

This is a simplified example with many imperfections, but I hope the reader gets the point.

Basically, you have to make a tradeoff.

Back to the discussion on tradeoffs in software development

When you take the business aspect into consideration, it gets even more complicated.

Things that make sense from a technical standpoint can be a disaster for the business, and vice versa.

The hard thing about a tradeoff between business and technology is that it is almost impossible to find one person who understands even one side completely, let alone both at the same time.

Today's software systems are so complicated that often no single person understands everything.

That is why REST APIs are popular, but that is a discussion for another day.

Concrete software example

I have one personal program that I use every day. It saves me 1000$ per year on average, so I get real monetary value from it.

An SQLite DB is the main part of it, and I do not use any indexes in it (there is no cost benefit in doing so).

I know that, in terms of speed, SQLite is not the best option for my use case.

But I chose SQLite because it was fast to get started with, backups are just a matter of copying one file, and I run the SQL queries once per day while I am sleeping.

Currently, the total time for all SQL queries is around 30 seconds on average, and as the DB file gets larger the query time will also increase.

Even if it gets to 1 hour (which I do not expect even in the next 100 years), that would be fine for my use case.

My deployment platform is shared hosting with a flat monthly bill, so increased CPU time is not a problem for me either, although it could be if I used a serverless platform billed per CPU time.

Conclusion

Know what tradeoffs you are making and, even more important, why.

Introducing PyAutoGUI

Published on: 01.11.2018

I found out about the PyAutoGUI Python package from the book Automate the Boring Stuff with Python.

With PyAutoGUI you can automate GUI interaction on your computer.

PyAutoGUI works on Windows, OS X and Linux, although I have used it only on Windows.

Most GUI automation can be done just by automating mouse movements and keyboard input, and PyAutoGUI supports both.

There are also functions for showing message boxes, saving screenshots and locating an element on the screen from an image.

Personally, I have found PyAutoGUI useful for automating some of my workflows.

To be honest, I use it for 45 minutes of work every month (I know this is not much), but I found that doing the same interactions manually for 45 minutes was really killing my soul, so I decided to just automate it.

If you want to automate web page interaction, use Selenium, because with Selenium you can access elements on a web page independently of their position on the screen.

With PyAutoGUI you can only move the mouse to specific coordinates, which means that if the resolution or the layout of the GUI changes, you need to update the coordinates in your code.

Keep all coordinates in your code as constants in one place, so that you do not need to change them all over your code when a change is needed.
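A minimal sketch of that layout is below. The coordinate values, constant names and the `run_workflow` function are invented for illustration; the click function is passed in as a parameter so the sketch can run without a screen, and in real use it would be `pyautogui.click`.

```python
# All screen coordinates live here. If the resolution or layout changes,
# this is the only place that needs to be edited.
NAME_FIELD = (640, 300)
SAVE_BUTTON = (1200, 85)
CONFIRM_BUTTON = (700, 520)


def run_workflow(click):
    """Click through the (hypothetical) workflow using the given click function."""
    for position in (NAME_FIELD, SAVE_BUTTON, CONFIRM_BUTTON):
        click(*position)


# Real use (requires a display and the pyautogui package):
#     import pyautogui
#     run_workflow(pyautogui.click)

# Dry run that only records where the clicks would go.
recorded = []
run_workflow(lambda x, y: recorded.append((x, y)))
print(recorded)
```

The dry run is also handy for checking the order of clicks before letting the script take over the mouse.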

In theory this could be avoided by using images as references for finding elements in your GUI. But you will have the same problem if the appearance of the elements changes, and then instead of changing coordinates in the code you have to replace all the reference images, which is more work.

My personal preference is to use hardcoded coordinates instead of images as the reference.

Using Selenium has also become popular for scraping pieces of information from web pages, but it is better to use a specialized scraping framework like Scrapy, because of additional features such as caching HTTP responses (quite a time saver in development).

Making web apps with Jupyter notebook

Published on: 01.10.2018

This article explains how to turn a Jupyter notebook into a GUI app on the web.

What is Jupyter notebook

Jupyter notebook is a browser-based REPL.

A REPL lets you program in an interactive environment: you can write and then execute your next line of code while all previous lines are already in the executed state.

This trivial feature lets me cut prototyping time, because to test the next line of code I do not need to run the whole program again (a REPL is useful only in some types of situations).

I know this explanation is useless if you do not know what a programming language is and have no experience with REPL-style prototyping, but if you are in this category I do not know how to explain it (probably it is impossible).

The point is that it makes prototyping faster, because to test the next line of your code you do not need to run the previous code again and again.

Previously, Jupyter notebook was called IPython Notebook; at that time Python was the only available programming language.

Now it is possible to use Jupyter notebook with many programming languages, although my experience is only with Python.

Personally, I use Jupyter notebook for exploratory data analysis: loading data into Pandas and then trying to understand it with visualizations (Seaborn, Bokeh).

Sharing a Jupyter notebook with non-technical people

Often I would run the same code with different parameters, to produce slightly different visualizations.

If you are familiar with the Jupyter notebook environment, then you know that this means running the same cell with SHIFT + ENTER, from the Cell menu or with some other shortcut.

This got me thinking: if I wanted to give my notebook to a non-technical person (somebody who knows how to use Word, and Excel without knowing how to write formulas), it would not be trivial for that person to use it.

Also, that person could change the code and get an unintended outcome (a syntax error or a wrong result).

Ipywidgets

This problem can be solved with ipywidgets.

With ipywidgets you can build a GUI inside a Jupyter notebook. It is perfect when you want to expose some functionality of your notebook to somebody (even yourself) through GUI elements.

To get this kind of GUI:
[Image: Coin Toss GUI]

This is the necessary code:
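As a minimal sketch of what such a notebook cell might look like (this is not the original Coin Toss code; `toss_coin` and `build_gui` are hypothetical names, and the layout is a guess), a button and a label can be wired together with ipywidgets like this:

```python
import random


def toss_coin(rng=random):
    """Return "heads" or "tails" with equal probability."""
    return rng.choice(["heads", "tails"])


def build_gui():
    """Show a button that tosses the coin and a label with the result.

    ipywidgets is imported here so that the coin logic above can be
    used and tested without a notebook environment.
    """
    import ipywidgets as widgets
    from IPython.display import display

    button = widgets.Button(description="Toss coin")
    result = widgets.Label(value="")

    def on_click(_button):
        result.value = toss_coin()

    button.on_click(on_click)
    display(widgets.VBox([button, result]))


# In a notebook cell, call build_gui() to render the widgets.
print(toss_coin())
```

Keeping the logic (`toss_coin`) separate from the widget wiring makes it possible to check the logic outside the notebook.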

Appmode

This was good, but there were still two problems:

  • the user had to run all cells as the first step
  • the user still had access to the code

Fortunately, Appmode is a Jupyter extension that turns notebooks into web applications.

By default the user can still go back to “code mode”, but that option can easily be removed.

Hosting your Jupyter notebook

If you are hosting it inside your own network, then you just need to run a notebook server, as you would for local development, but with some added security.

GitHub will give you a view-only rendering of any notebook hosted on its servers, and there are many more websites with the same functionality.

If you want interactive hosting of your Jupyter notebooks, so that people can execute them, then there is Binder.

Currently, it is in beta and your Jupyter notebook needs to be in a public GitHub repository.

Conclusion

With the right combination of:

  • Jupyter notebook
  • ipywidgets
  • Appmode
  • Binder

you can execute your Jupyter notebook as a web app for free.

Coin Toss code can be used as an example of how to host a Jupyter notebook as a GUI app on the web.

Part of the inspiration came from Bloomberg's bqplot project.

Personally, I have found it useful for sharing interactive visualizations.

Sending email from Python

Published on: 15.09.2018

I will show how to send emails from the Python programming language using your Gmail account.

Mail server

When you want to send email from a GUI/CLI app or from a programming language (in this example Python), you need to have access to a mail server.

The mail server is a computer that is in charge of sending and receiving your email.

Usually, you have access to the mail server via username and password.

With Gmail, the username is your email address and the password should be known only to you :-).

Use yagmail

For sending emails from Python using your Gmail account, IMHO it is best to use the yagmail package.

You can waste time with the smtplib library, but if you are using a Gmail account, just use yagmail.

Install it with:
pipenv install yagmail

Code examples

These are just simple code examples; in production code, do not put your password in the source code.

Personally, I use one JSON file (this file is committed to source control with dummy credentials) per project for storing usernames and passwords for my personal projects.

This is a simple one-line example:

I usually use it like this:

To add an image to the body of the email, you need to use yagmail.inline:
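As a minimal sketch of the pattern described in this section (credentials kept in a JSON file, sending through yagmail, an optional inline image), assuming a hypothetical JSON layout with `username` and `password` keys and hypothetical function names:

```python
import json


def load_credentials(path):
    """Read the Gmail username and password from a JSON file.

    Assumed (hypothetical) layout: {"username": "...", "password": "..."}.
    """
    with open(path) as handle:
        data = json.load(handle)
    return data["username"], data["password"]


def send_report(credentials_path, to, subject, body, image_path=None):
    """Send one email, optionally with an image embedded in the body."""
    # Imported here so the helper above works without yagmail installed.
    import yagmail

    username, password = load_credentials(credentials_path)
    contents = [body]
    if image_path is not None:
        # yagmail.inline embeds the image in the body instead of attaching it.
        contents.append(yagmail.inline(image_path))
    yagmail.SMTP(username, password).send(to=to, subject=subject, contents=contents)


# Example call (not executed here, since it would really send an email):
#     send_report("credentials.json", "me@example.com", "Daily report", "All good.")
```

The copy of the credentials file committed to source control would hold dummy values, as described above, with the real values only on the machine that sends the emails.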

My advice and experience

Gmail has daily limits on how many emails you can send.

At the time of writing it is 500 emails per day; if you exceed it, you will not be able to send emails for the next 24 hours (at least that is what happened to me, and I had sent only 100 emails in a row).

So my advice is to have a separate Gmail account just for sending emails automatically from your code; you do not want to lose the ability to send email from your personal account (I have been there and it is not funny).

Also, put some delay between emails. When my account got locked for 24 hours I had sent only around 100 emails in a row, but after I added time.sleep(15) I never had that problem again.
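The delay can be kept in one small helper. This is a sketch only: `send_all` is a hypothetical name, `send` stands for whatever function actually sends one message, and the `sleep` parameter exists so the pause can be replaced in a dry run.

```python
import time


def send_all(messages, send, delay_seconds=15, sleep=time.sleep):
    """Send messages one by one, pausing between them (not before the first)."""
    for index, message in enumerate(messages):
        if index > 0:
            sleep(delay_seconds)
        send(message)


# Dry run: record what would happen instead of really sending or sleeping.
sent = []
pauses = []
send_all(["report 1", "report 2", "report 3"], send=sent.append, sleep=pauses.append)
print(sent, pauses)
```

With three messages there are two pauses, so sending N emails this way takes roughly (N - 1) × 15 seconds.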

I am not using Gmail to spam people; I am using it to send email reports to myself, mostly from web scraping tasks (around 20 per day).

If you have error 534

The solution is:

Google blocks sign-in attempts from apps which do not use modern security standards (mentioned on their support page). You can, however, turn on/off this safety feature by going to the link below:

Go to this link and select Turn On
https://www.google.com/settings/security/lesssecureapps

From:
https://stackoverflow.com/a/26852782/2006674

Conclusion

Sending emails from Python is easy with yagmail.

Understanding Monty Hall dilemma with hacker statistics

Published on: 01.09.2018

With the help of hacker statistics, I will show that in the Monty Hall game you should always switch doors, because you double your success rate (from 33% to 66%).

What is hacker statistics

In hacker statistics, you run simulations to calculate the probability of some event.

For example, in a coin toss the probability is 50% heads and 50% tails.

Let us say, for the sake of argument, that you do not believe it.

Believe me, there are people who do not believe.

You could calculate the probability, and after calculating you would get 50% for heads and 50% for tails.

But you still do not believe or you think that maybe your calculation is wrong.

Then you can test it in practice.

Take a coin, flip it 100 times (or even a million), divide the number of heads by the number of flips, and you have an estimate of the probability of heads.

As you can see, this is very time-consuming, especially if you want to run the experiment again.

An alternative to the manual coin toss is to write a computer program that will run the simulations and calculate the probabilities.

This is a much wiser approach because you can run it multiple times and it is faster (an average computer can do millions of coin toss simulations per second).
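Such a program can be very short. The following is a minimal sketch using only the standard library; the function name and the number of flips are arbitrary choices, and the seed is fixed only so that repeated runs give the same estimate.

```python
import random


def estimate_heads_probability(flips, seed=None):
    """Simulate coin flips and return the fraction that came up heads."""
    rng = random.Random(seed)
    # rng.random() is uniform on [0, 1), so it is below 0.5 half of the time.
    heads = sum(rng.random() < 0.5 for _ in range(flips))
    return heads / flips


estimate = estimate_heads_probability(100_000, seed=1)
print(f"Estimated probability of heads: {estimate:.3f}")
```

The estimate will not be exactly 0.5, but it gets closer as the number of flips grows.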

Monty Hall game

I first heard of the Monty Hall game a few years ago, in the movie 21.

Here is the scene from the movie 21 in which the Monty Hall game is described.

The basic idea is that you have three doors: behind one door is a car, and behind each of the other two is a goat.

You can choose only one door of three.

If you choose the door with the car you get the car; if you choose a door with a goat you get nothing (not even a goat).

The goal of the game is to choose the door with the car, but you do not know which door the car is behind.

So you choose some door, and the probability of success is 33%.

Monty Hall dilemma

After you have chosen one door, the game host will open one of the other two doors and a goat will be revealed.

The dilemma is the following: Is it in your interest to switch doors?

The important thing to understand is that the game host will always open a door with a goat behind it.

This happens every single time, whether you chose the door with the prize or not.

If the game host opened another door only when you had chosen the door with the prize, then the best solution would be not to switch.

If the game host opened another door only when you had chosen a door with a goat, then the best solution would be to switch.

The point is that the game host opens another door with a goat behind it every time.

The answer to the puzzle is that if you switch doors you will win 66% of the time.

This was counterintuitive to me.

The best explanation that I was able to understand is at https://www.youtube.com/watch?v=4Lb-6rxZxx0.

Monty Hall solution with hacker statistics

Although I understood the math and the logic of why switching doors gives a 66% probability of winning, I still did not believe that it was correct.

So I decided to write a computer program to simulate the game and solve this conundrum once and for all.

I did two simulations with a million iterations each.

The first simulation calculated the probability of a win if you do not switch doors.

No surprises there, the probability was 33%.

The second simulation calculated the probability of a win if you do switch doors.

To my amazement at the time, the simulation showed that the probability of a win if you switch doors really is 66%.

Due to my disbelief, I ran it a few dozen times, but each time the probability of a win was 66%.
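A simulation of this kind can be sketched as follows. This is not the original program: the function names are made up, and 100,000 games per strategy are used instead of a million simply to keep the run short.

```python
import random


def play(switch, rng):
    """Play one game and return True if the player ends up with the car."""
    doors = [0, 1, 2]
    car = rng.choice(doors)
    choice = rng.choice(doors)
    # The host always opens a door that is neither the player's pick nor the car.
    opened = rng.choice([d for d in doors if d != choice and d != car])
    if switch:
        # Move to the only door that is still closed and was not picked.
        choice = next(d for d in doors if d != choice and d != opened)
    return choice == car


def win_rate(switch, games=100_000, seed=0):
    """Return the fraction of games won with the given strategy."""
    rng = random.Random(seed)
    wins = sum(play(switch, rng) for _ in range(games))
    return wins / games


stay_rate = win_rate(switch=False)
switch_rate = win_rate(switch=True)
print(f"stay: {stay_rate:.3f}  switch: {switch_rate:.3f}")
```

Staying wins only when the first pick was the car (one door in three), and switching wins in every other case, which is where the two thirds come from.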

I understood the math behind it and I did the simulation that confirmed the math, but I still had a hard time accepting that it is correct to switch doors, because it was counterintuitive to my own (false) reasoning.

It is interesting to mention that I needed a few weeks to accept the results of my own simulation.

Some interesting facts on Monty Hall

The Wikipedia entry on the Monty Hall problem is extensive.

A lot of interesting facts can be found there.

Even experts have a hard time understanding it:
“After the problem appeared in Parade, approximately 10,000 readers, including nearly 1,000 with PhDs, wrote to the magazine, most of them claiming vos Savant was wrong (Tierney 1991). Even when given explanations, simulations, and formal mathematical proofs, many people still do not accept that switching is the best strategy (vos Savant 1991a). Paul Erdős, one of the most prolific mathematicians in history, remained unconvinced until he was shown a computer simulation demonstrating the predicted result (Vazsonyi 1999).”

One striking sentence is
“Pigeons repeatedly exposed to the problem show that they rapidly learn always to switch, unlike humans”

Are pigeons smarter than humans?

Interactive Monty Hall Game

If you still do not believe that switching doors is the best strategy, you can play it online.

Just do not switch tabs; you will need to wait a few minutes.

Conclusion

In the Monty Hall game you should always switch doors, because you double your success rate (from 33% to 66%).

How to back up personal GitHub repositories

Published on: 01.08.2018

I will show how to back up your GitHub repositories with python-github-backup.

Why bother with a backup of GitHub

I can already see the comments asking why anyone would back up GitHub.

  • “It is a waste of time.”
  • “GitHub already has backups internally.” (I hope so)
  • “They will not lose your code.” (But maybe I will)
  • “They will not go out of business overnight.”

My response to all those comments is:
You will not be worse off if you have your own backup.

If for whatever reason (GitHub goes under, all repositories are deleted by accident, alien attack) GitHub is no longer available, I have my own backup of the code that I have written.

Paid solution

If you are looking for a paid solution, BackHub looks like a good option.
I have no experience with BackHub, nor am I in any way associated with it.

Free solution

After researching all available options I have decided to go with python-github-backup because it had more stars and contributors on GitHub than other projects.

I took the number of stars and contributors on GitHub as an indication that python-github-backup is used more than the other solutions, so more people are likely to keep supporting it in the future.

In order to access your GitHub personal data, you need to have a personal access token.

After that, you can install it with pip/pipenv:
(I have installed it in separate virtualenv)

pipenv install github-backup
or
pip install github-backup

Run it with:
/full/path/github_backup/venv/bin/github-backup sasa-buklijas -t your_personal_access_token -o /full/path/github_backup --all

This command will back up all your GitHub information to the /full/path/github_backup directory.

It would be tiresome to run this command every day, so I have automated it on my online hosting.

Crontab

My use case:
15 00 * * * /full/path/github_backup/venv/bin/github-backup sasa-buklijas -t your_personal_access_token -o /full/path/github_backup --all > /full/path/github_backup/last_log.txt

> /full/path/github_backup/last_log.txt keeps the output of the last backup command only.
>> can be used instead to keep the output of all backup commands, but I have found that having just the last one is enough.

Conclusion

“You will not be worse off if you have your own backup.”

Do you have backups of your own personal GitHub repositories? If you do, what do you use for the backup?

What programming language should you learn?

Published on: 01.06.2018

This is written for people who do not know any programming language and are wondering which one they should learn first.

However, I think that the reasoning behind the decisions in this article can also help you choose your next programming language.

If you want to do something, first know why you want to do it

What are you trying to accomplish?

The same goes for learning a programming language.

I have listed a few of the main reasons why people want to learn a programming language:

  • get a job (on-site or freelance)
  • make some software app/website/web app
  • just to learn to do programming

I want to learn programming to get a job

Programming jobs (and salaries) are location dependent, so research which programming languages have jobs available in your area.

If you plan to move/migrate, do the same for that area.

Check local programming job listings to get a clue.

It is good to visit local programming meetups; if you plan to be a professional software developer, start on your networking too.

Meetups are also a good way to see who is hiring.

If you plan to freelance, then you are not location dependent.

Which, I would argue, makes things even more difficult, because you have no location constraint.

Anyway, do a cost/benefit analysis and pick a language that makes sense given your own constraints.

I want to learn programming to make software

You want to make some software (desktop app, website, web app, mobile app, etc).

You could pay a professional to do it for you, but for some reason (e.g. you are still in high school) you want to do it yourself.

Do research and find out which programming language is best for the software that you plan to build.

I personally optimize for time to market.

If you plan to make a web app, there is no reason for you to learn C++, believe me, there is not.

Currently, in 2018, there are already well-known programming languages (tools) for most use cases.

But you also need to be careful, because most software developers will suggest programming languages that they know.

So, do not ask just one person but at least a few dozen.

And always ask them what the reasoning behind their suggestion is.

I want to learn programming just to know how to program

You do not want a job, you have no idea what to make with programming, you just want to learn programming.

Then you can pick any language, although my suggestion is to pick something that has some real-life usage and is not complicated for beginners.

My humble suggestion is to choose Python “… is easy for beginners, practical for professionals, and exciting for hackers …” from Fluent Python.

Conclusion

Know what you are trying to accomplish and pick a programming language for that purpose.

The largest expense for the programmer

Published on: 01.05.2018

I will talk about the largest expense for somebody who develops software (primarily writes code), but from the business owner's perspective (not the employee's).

I got this idea after starting to develop my own software products, not at the time when I was writing code for others.

The topic could be rephrased as “The largest expense for the business owner who is also the sole developer”.

But, I also think that it is correct for software development in general.

Time is not on your side

After thinking about the subject for a long time, I have come to the conclusion (of course I can be wrong) that the largest expense is time.

By time I mean how much time you will spend making some software.

One can argue that this is also the only expense (along with some hardware, a room, and electricity).

So, the next question is: on what activity in software development is most of the time spent (or wasted)?

Hardware is cheap

Better hardware (SSD, more RAM, faster CPU, etc) will reduce time in development and you should use it.

But hardware is relatively cheap compared with the other time expenses.

Let’s say that by using an SSD you will get 10 more minutes of work per workday (although I would argue that it is at least double).

Multiplying by 260 workdays per year, that is 2600 minutes, or about 43 hours.

A 1 TB SSD is 400$, so even if you bill only 10$ per working hour, the investment pays for itself within about one year, which makes it a no-brainer (the numbers can change, but you get the point).
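The arithmetic behind this estimate can be written out explicitly. The figures are the ones used in the text above; only the variable names are new.

```python
MINUTES_SAVED_PER_WORKDAY = 10
WORKDAYS_PER_YEAR = 260
HOURLY_RATE_USD = 10
SSD_PRICE_USD = 400

hours_saved_per_year = MINUTES_SAVED_PER_WORKDAY * WORKDAYS_PER_YEAR / 60
value_per_year_usd = hours_saved_per_year * HOURLY_RATE_USD
payback_years = SSD_PRICE_USD / value_per_year_usd

print(f"Hours saved per year: {hours_saved_per_year:.1f}")
print(f"Value of that time:   {value_per_year_usd:.0f} USD")
print(f"Payback period:       {payback_years:.2f} years")
```

With these inputs the saved time is worth slightly more than the price of the disk, so the payback period comes out a little under one year.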

Anyway, the moral of the story is that hardware is cheap.

The largest expense is learning how to do something new

Most of the time is spent on learning how to do something new.

New programing language, new frameworks, new libraries, new tools, new new new …

An endless supply of new things that need to be learned.

Somebody could come to the conclusion that learning how to learn fast is the solution.

Certainly learning fast is useful, but it is not the solution, because there are more things to learn than there is time to learn them.

Temporary nature of the knowledge capital

Also, in software development you have the additional problem of the “temporary nature of the knowledge capital”.

Basically, what you learn today will probably not be useful in 5 years.

Maybe it will not even exist anymore.

So far this is nothing new. Anybody who has been programming for more than 5 years has practical experience of technologies (languages, frameworks, libraries, tools, etc) going away and new ones arriving, so that you need to spend time learning new ways to do the old things.

What is new, or at least what I am trying to argue in this essay, is that from the business standpoint we should look at it as an expense.

Especially as the largest expense in software development.

Somebody can say: “Hey douchebag, I like learning new technologies and using them in my job. You are just some old dinosaur who does not like programming and probably was never good at it.”

Well, I can agree that at 35 I am certainly not young anymore.

But my age does not change the fact that learning something is not the same as doing something.

If I want to do something, first I must know how to do it, and to know it, I need to learn it.

So, learning is an expense of doing.

And if, for everything that you need to do, you first need to learn, that is a lot of learning (and a big expense).

Is programming a young man's game?

I would say that part of the problem is that most programmers are young and do not have much life/work experience.

Statistics from the Stack Overflow 2018 Developer Survey results support that claim.

30% have 2 years or less of professional coding experience. That is almost one third.

57.5% have 5 years or less of professional coding experience.

In some fields of work, even after 5 years you are still considered just a beginner.

And only 12.7% have more than 15 years of professional coding experience.

Hobby or business

Also, 81% of professional developers code as a hobby.

I do not think that this is a bad thing.

But at the same time, if you consider something a hobby, then you will not treat it as a business.

In business, there is income and expense.

And by subtracting expense from income, you get profit.

And you should have some if you expect your business to survive.

In a hobby, there is only fun.

That is why it is called a hobby.

And a hobby is an expense from the bookkeeping perspective, but everybody considers his/her hobby as fun, not as an expense.

These two reasons, the second one more than the first, are probably the main reasons why most programmers do not see learning as an expense.

At least, until they have burnout and then they switch to something else, like project management (real job title should be: project reporting) or leave software development completely.

According to the previously mentioned survey, only 6.9% of developers are older than 45.

But it is well known that programming is mostly a young man's game, probably due to the “temporary nature of the knowledge capital”.

How to reduce your largest expense

I think that specialization is a large part of the solution.

Find your niche and stick to it.

Then you can reduce (or even completely eliminate) the expense of time spent on learning how to program something and can acquire domain-specific knowledge.

When to learn new things

I am not saying that you should never learn new things.

Not at all.

Just do a cost/benefit analysis first.

I will use my own experience as an example.

I do/did a lot of web scraping.

At first I did it with Beautiful Soup and a lot of custom code.

I wrote my own caching, ORM, etc.

In web scraping, the easiest part is writing XPath selectors. There are a lot of other things to handle, and over time I built my own small framework for all of that.

And with each new website that I scraped I had to add new features or improve old ones.

And this was taking a larger and larger percentage of my time as I was scraping more and more challenging websites.

After some time I decided to try Scrapy.

To scrape the first website with Scrapy I needed around one week, with 90% of that time spent learning Scrapy (learning how to do the old thing with the new framework).

If I had used my old custom framework, I would have finished in 3 days.

But I would have had to write more code than with Scrapy.

Currently, with Scrapy I can do in one day the scraping that took me at least 3-4 days with my old custom framework.
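The cost/benefit analysis for this case can be written as a rough break-even calculation. The day counts come from the text above, with two assumptions that are mine: "around one week" is read as 5 working days, and the lower end of "3-4 days" is used for the old framework.

```python
import math

FIRST_SITE_WITH_SCRAPY_DAYS = 5          # "around one week", assumed to be 5 working days
FIRST_SITE_WITH_OLD_FRAMEWORK_DAYS = 3   # what the first site would have taken the old way
LATER_SITE_WITH_SCRAPY_DAYS = 1
LATER_SITE_WITH_OLD_FRAMEWORK_DAYS = 3   # lower end of "3-4 days"

# Extra time paid once, for learning the new framework.
learning_cost_days = FIRST_SITE_WITH_SCRAPY_DAYS - FIRST_SITE_WITH_OLD_FRAMEWORK_DAYS
# Time gained on every later website.
saving_per_site_days = LATER_SITE_WITH_OLD_FRAMEWORK_DAYS - LATER_SITE_WITH_SCRAPY_DAYS
sites_to_break_even = math.ceil(learning_cost_days / saving_per_site_days)

print(f"Learning cost:       {learning_cost_days} days")
print(f"Saving per site:     {saving_per_site_days} days")
print(f"Sites to break even: {sites_to_break_even}")
```

Under these assumptions the learning cost is recovered by the very next website, and everything after that is pure gain.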

This is an example of when learning a new framework was useful.

Conclusion

In the software development business, the largest expense for a software developer is the time spent learning new technologies.

Reduce it by finding your niche and specializing.

Use a cost/benefit analysis to determine whether you should learn a new technology.