I have added a GitHub repository named Dutchie with all the code I used and the results that I mined. You can access the repo here: https://github.com/Lerie82/dutchie

Lerie Taylor – Bigfisher, LerieLab
This article is about mining data from dispensaries to build a product list for a freelance programming contract. I will be attempting to build and store a formatted list of Marijuana products. I will use free online resources, tools, and everything else a programmer (or your average hacker) would use to complete the contract.
You can find an article about cloning this website on this very same blog. The website the data is coming from (initially) is going to be a local dispensary that I frequent quite often. Dutchie is the dispensary’s online intermediary between the customer and the business; it’s a software-as-a-service company. I was a little disappointed at how easy it was to find a dashboard login, but I digress.
The goal is to have a product listing. Let’s break down the information we can get from the dispensary website. You should always browse the site first and see what raw data you can gather, either from the page source or visually. The visual route can even be automated with screenshots and some OCR software.
Using apt you can search for and install the tesseract-ocr package. Tesseract scans an image for text and writes what it finds to a file. I tried it on captures of a product page and a product listing page.
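As a quick sketch, assuming the screenshots are saved as product-page.png and product-list.png (hypothetical filenames), the OCR step might look like this:

```shell
# Install the OCR engine (Debian/Ubuntu)
sudo apt-get install -y tesseract-ocr

# Run OCR on each captured screenshot; tesseract writes the recognized
# text to <basename>.txt next to each image.
for img in product-page.png product-list.png; do
    tesseract "$img" "${img%.png}"
done
```

The `${img%.png}` expansion strips the extension so tesseract names its output file after the screenshot.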
Another option is to gather the product listing page links and pull them down with curl or wget. This leaves a big footprint on the server and may result in an IP ban. If your IP address gets banned, you will have to implement proxy rotation, or something like rotating Tor circuits, to keep collecting the data.
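A minimal sketch of that approach, assuming a local Tor SOCKS proxy listening on port 9050 and a hypothetical urls.txt containing one listing link per line:

```shell
# Fetch each listing page through the local Tor SOCKS proxy, sleeping
# between requests to soften the footprint on the server.
while read -r url; do
    curl --socks5-hostname 127.0.0.1:9050 -s "$url" -o "$(basename "$url").html"
    sleep 5
done < urls.txt
```

Requesting a new Tor circuit between batches (or pointing curl at a rotating proxy pool) is what gets you past a per-IP ban.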
You can select the product list and then copy and paste it into a text file. This is time-consuming if you have thousands of listings and pages to go through, and it is almost never an option for someone paying to have data mined. In this case, though, it seems to be the easiest solution.
As you can see in the image on the left, the format is pretty much already there, and we can simply strip out the data we don’t need. The only data missing from here are the images, and I think we can manage to grab those from the HTML source of each listing page. I viewed the source and there was a nicely formatted div which I can run through a regex tester like regex101.
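Once the pattern is worked out in regex101, the same idea can be applied from the shell. As an illustration only (the attribute layout and the dutchie-images path are guesses at the markup, based on the image URLs seen later), grep can pull the image URLs out of a saved listing page:

```shell
# Extract image URLs from a saved listing page, then strip the src=" wrapper.
# The pattern is a guess at the markup and will need adjusting against the
# real HTML.
grep -oE 'src="https://[^"]*dutchie-images[^"]*"' listing.html \
    | sed 's/^src="//; s/"$//' > images
```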
This is where the imgix service from earlier is going to haunt me. I have mined enough resources online to know when the rabbit hole is too deep. I then came across my salvation while viewing the HTTP headers: the following URI lets me get a JSON-formatted file to parse, which appears to be an inventory. Applying this schema to other dispensaries will let me pull data from any dispensary that uses Dutchie.
And then, when you URL-decode the string, you will see the light and understand what I mean when I say it was my salvation on this project. This method makes the project scalable.
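One quick way to URL-decode a query string from the shell is to lean on Python’s standard library (the sample string below is hypothetical, not the actual URI from the request):

```shell
# URL-decode the first argument using Python's urllib.
urldecode() {
    python3 -c 'import sys, urllib.parse; print(urllib.parse.unquote(sys.argv[1]))' "$1"
}

urldecode 'filter%5Btype%5D=inventory'   # prints filter[type]=inventory
```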
There is a package named jq that you can use to parse JSON-formatted files. This may be the fastest way to parse that JSON inventory information.
I am going to parse each piece of data into its own named file and then count the lines in each of them. Each file should have the same number of lines.
If the line counts don’t match, then this method we have been pursuing will not work, and that is another rabbit hole. Luckily, we have hit 50 lines in each file. Now we can merge the files into a single file. You can pull the first line of a file using the head command with the -n flag.
head -q -n1 brands grams images names prices qty status strains types
"Joilo" "3g" "https://s3-us-west-2.amazonaws.com/dutchie-images/4cbae5596a381791807aae2b2aea886c" "Mimosa Dream Solo+ Pre-Roll 6-Pack | 3g" 44 6 "Active" "Hybrid" "Pre-Rolls"