Mining a Dispensary

I have added a GitHub repository named Dutchie with all the code I used and the results I mined. You can access the repo here:

Lerie Taylor – Bigfisher, LerieLab

This article is about mining data from dispensaries to build a product list for a freelance programming contract. I will attempt to build and store a formatted list of marijuana products, using free online resources, tools, and everything else a programmer (or your average hacker) would use to complete the contract.

Imgix Dashboard Login Screen

You can find an article about cloning this website on this very same blog. The data will come (initially) from the website of a local dispensary that I frequent quite often. Dutchie is the dispensary’s online intermediary between the customer and the business; it appears to be a software-as-a-service business. I was a little disappointed at how easy it was to find a dashboard login, but I digress.

Product List

The goal is to have a product listing. Let’s break down the information we can get from the dispensary website. You should always go through the website first and see what raw data you can gather from the page source or just by looking at it. You can also automate the process with screenshots and some OCR software.

OCR Mine

Using apt, you can search for and install the tesseract package (named tesseract-ocr on Debian and Ubuntu). It scans an image for words and writes them to a text file. I tried capturing a product page and a product listing page.

OCR Tesseract Results
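
The OCR step can be sketched roughly as follows. This assumes Debian or Ubuntu and a folder of PNG screenshots; the function name ocr_screenshots and the directory names are hypothetical, not taken from the repository.

```shell
#!/usr/bin/env bash
# Install once (Debian/Ubuntu): sudo apt install tesseract-ocr

# ocr_screenshots SRC_DIR OUT_DIR
# Runs tesseract on every PNG in SRC_DIR and writes OUT_DIR/<name>.txt.
# Both directory arguments are illustrative placeholders.
ocr_screenshots() {
  local src_dir="$1" out_dir="$2" img base
  mkdir -p "$out_dir"
  for img in "$src_dir"/*.png; do
    [ -e "$img" ] || continue          # skip when no PNG files match
    base="$(basename "$img" .png)"
    # tesseract appends .txt to the output base name on its own
    tesseract "$img" "$out_dir/$base" >/dev/null 2>&1
  done
}

# Example: ocr_screenshots ./screenshots ./ocr-text
```

The text that comes back still needs cleaning by hand or by script, since OCR output follows the visual layout of the page rather than any data structure.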


Another option would be to gather the product listing pages and their links, then fetch the data with curl or wget. This leaves a big footprint on the server and may result in an IP ban. Once your IP address is banned, you will have to implement proxy rotation or something like Tor circuit rotation to keep getting the data.
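
One way to keep that footprint smaller is to pause between requests. The following is only a sketch: the function name fetch_pages, the urls.txt file, and the five-second default delay are all assumptions, and a delay alone is no guarantee against a ban.

```shell
#!/usr/bin/env bash
# fetch_pages URL_FILE OUT_DIR [DELAY_SECONDS]
# Downloads each URL listed in URL_FILE (one per line) into OUT_DIR,
# sleeping between requests so the server is not hammered.
fetch_pages() {
  local url_file="$1" out_dir="$2" delay="${3:-5}" url n=0
  mkdir -p "$out_dir"
  while IFS= read -r url; do
    [ -n "$url" ] || continue          # ignore blank lines
    n=$((n + 1))
    # -f: fail on HTTP errors, -sS: quiet but still report errors,
    # -L: follow redirects, -o: write the body to the named file
    curl -fsSL -o "$out_dir/page-$n.html" "$url" \
      || echo "failed: $url" >&2
    sleep "$delay"
  done < "$url_file"
}

# Example: fetch_pages urls.txt pages 5
```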

Copy Mine

Copy and Pasted Content

You can select the product list and then copy and paste it into a text file. This is time-consuming if you have thousands of listings and pages to go through, and it is almost never an option for someone paying to have data mined. In this case, though, it seems to be the easiest solution.

As you can see in the image on the left, the format is pretty much already there, and we can simply strip out the data we don’t need. The only data missing are the images, and I think we can grab those from the HTML source of each listing page. I viewed the source and found a nicely formatted div that I can run through a regex tester such as regex101.

HTML Source with Image Highlighted
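
Once a pattern works in a regex tester, it can be applied to saved pages from the command line. A minimal sketch, assuming the images appear as ordinary img tags with a double-quoted src attribute; the function name extract_image_urls is hypothetical and the real markup may need a different pattern.

```shell
#!/usr/bin/env bash
# extract_image_urls FILE
# Prints the src attribute of every <img> tag found in FILE, one per line.
extract_image_urls() {
  # grep -o keeps only the matching part: from "<img" up to the closing
  # quote of src; sed then strips everything except the URL itself.
  grep -oE '<img[^>]+src="[^"]+"' "$1" \
    | sed -E 's/.*src="([^"]+)"/\1/'
}

# Example: extract_image_urls listing.html > images
```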

This is where the imgix service from earlier is going to haunt me. I have mined enough resources online to know when the rabbit hole is too deep. I then came across my salvation link while viewing the HTTP headers. The following URI lets me get a JSON-formatted file to parse, which seems to be an inventory. Applying this schema to other dispensaries (that use Dutchie) will allow me to pull their data too.


And then when you URL decode the string you will see the light, and see what I am talking about when I say it was my salvation on this project. This method promises scalability in the project.

Screenshot of the JSON
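
URL decoding can be done in bash without extra tools. This is a sketch only: the encoded string below is a made-up stand-in, not the real request, and the function name urldecode is my own.

```shell
#!/usr/bin/env bash
# urldecode STRING
# Turns percent-escapes (%7B, %22, ...) back into characters
# and '+' into spaces.
urldecode() {
  local s="${1//+/ }"            # '+' encodes a space in query strings
  printf '%b\n' "${s//%/\\x}"    # %7B becomes \x7B, which printf %b expands
}

# Hypothetical encoded query string, for illustration only.
urldecode 'variables=%7B%22limit%22%3A50%7D'
# prints: variables={"limit":50}
```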

Parsing JSON

There is a package named jq that you can use to parse JSON-formatted files. It may be the fastest way to parse that JSON inventory information.

Parsing JSON with jq
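
A jq filter reads like a path through the document. The sketch below uses an invented two-product sample shaped after the field paths used in this article (data, products, Name, Status); the real response has many more fields, and the helper name list_field is hypothetical.

```shell
#!/usr/bin/env bash
# A tiny stand-in for results.json; the real file is far larger.
sample='{"data":[{"products":[
  {"Name":"Example Pre-Roll","Status":"Active"},
  {"Name":"Example Gummies","Status":"Active"}]}]}'

# list_field FILTER: apply a jq filter to the sample, one result per line.
list_field() {
  printf '%s' "$sample" | jq "$1"
}

# .data[] walks the outer array, .products[] walks each product, and
# .Name picks the field. Quoting the filter stops the shell from
# treating [] as a glob pattern.
# Example: list_field '.data[].products[].Name'
```

Adding -r to jq would print the strings without their surrounding quotes. Keeping the quotes is handy when the values are later pasted back into JSON.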

I am going to parse each piece of data into its own respectively named file and then count the lines in each one. Every file should have the same number of lines.

Counting the lines in each file
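
The line-count check can also be scripted so it fails loudly instead of relying on a visual comparison. A minimal sketch; the function name same_line_count is hypothetical.

```shell
#!/usr/bin/env bash
# same_line_count FILE...
# Succeeds only when every file has the same number of lines as the first.
same_line_count() {
  local expected="" f n
  for f in "$@"; do
    # the arithmetic expansion strips padding some wc versions add
    n=$(( $(wc -l < "$f") ))
    if [ -z "$expected" ]; then
      expected="$n"
    elif [ "$n" -ne "$expected" ]; then
      echo "$f has $n lines, expected $expected" >&2
      return 1
    fi
  done
}

# Example:
# same_line_count brands grams images names prices qty status strains types
```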

If the line counts are off, then the method we have been pursuing will not work and this is another rabbit hole. Luckily, we have hit 50 lines in each file. Now we can combine the files into a single file. You can pull the first line of a file using the head command with the -n flag.
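
Pulling a later line works the same way once tail is added. A minimal sketch with two hypothetical helpers, nth_line and nth_line_sed:

```shell
#!/usr/bin/env bash
# nth_line N FILE: keep the first N lines, then keep the last of those.
nth_line() {
  head -n "$1" "$2" | tail -n 1
}

# sed can do the same in one step: print line N, then quit.
nth_line_sed() {
  sed -n "${1}{p;q;}" "$2"
}

# Example: nth_line 3 names
```

If N is larger than the file, the head and tail version returns the last line instead of nothing, which is one more reason to confirm the line counts match first.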

head -q -n1 brands grams images names prices qty status strains types
"Mimosa Dream Solo+ Pre-Roll 6-Pack | 3g"
#Lerie Taylor / 2022 / parse stuff from the url below
#curl -O – "" >results.json
# Extract each field into its own file, one value per line
jq '.data[].products[].Status' results.json >status
jq '.data[].products[].Prices[]' results.json >prices
jq '.data[].products[].strainType' results.json >strains
jq '.data[].products[].Image' results.json >images
jq '.data[].products[].Name' results.json >names
jq '.data[].products[].POSMetaData.children[].quantityAvailable' results.json >qty
jq '.data[].products[].type' results.json >types
jq '.data[].products[].POSMetaData.canonicalBrandName' results.json >brands
jq '.data[].products[].POSMetaData.children[].option' results.json >grams
# Rebuild one JSON object per product from line i of every file
echo "[" >final.json
for i in {1..50}
do
  brand="\"brand\":"$(head -q -n$i brands |tail -n1)
  gram="\"gram\":"$(head -q -n$i grams |tail -n1)
  image="\"image\":"$(head -q -n$i images |tail -n1)
  name="\"name\":"$(head -q -n$i names |tail -n1)
  price="\"price\":"$(head -q -n$i prices |tail -n1)
  qty="\"qty\":"$(head -q -n$i qty |tail -n1)
  status="\"status\":"$(head -q -n$i status |tail -n1)
  strain="\"strain\":"$(head -q -n$i strains |tail -n1)
  type="\"type\":"$(head -q -n$i types |tail -n1)
  line="{$brand,$gram,$image,$name,$price,$qty,$status,$strain,$type}"
  # every object except the last needs a trailing comma
  if [ "$i" -lt 50 ]; then line="$line,"; fi
  echo "$line" >>final.json
done
echo "]" >>final.json
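
As an alternative to stitching the files together line by line, jq can build the objects itself. This is a sketch of a different technique from the script above. It assumes each product has exactly one entry in Prices and in POSMetaData.children, which is what the matching line counts suggest but which I have not verified for other dispensaries. The function name build_final is hypothetical.

```shell
#!/usr/bin/env bash
# build_final INPUT_JSON
# Emits one JSON array with one object per product, using the same
# field paths as the per-file extraction.
build_final() {
  jq '[ .data[].products[] | {
        "brand":  .POSMetaData.canonicalBrandName,
        "gram":   .POSMetaData.children[0].option,
        "image":  .Image,
        "name":   .Name,
        "price":  .Prices[0],
        "qty":    .POSMetaData.children[0].quantityAvailable,
        "status": .Status,
        "strain": .strainType,
        "type":   .type
      } ]' "$1"
}

# Example: build_final results.json > final.json
```

Because jq writes the whole array, there is no hard-coded count of 50 and no comma handling to get wrong.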