How to convert nested JSON to dictionary in Python

Question

I was asked to help parse a JSON file that is delivered by the iTunes Store Customer Reviews API JSON endpoint. It is not so important how this API works or if there are better APIs for this. Instead, let’s assume that we found our favorite API to work with and that our request makes perfect sense and now we have to deal with the API’s response, JSON in this case. This article will guide you through the necessary steps to parse this JSON response into a pandas

import pandas as pd

url = "https://itunes.apple.com/gb/rss/customerreviews/id=1500780518/sortBy=mostRecent/json"
pd.read_json(url)

3. I will focus heavily on the concepts and code development and less on explaining each line of code. Ideally, you should be already familiar with at least a little Python and its standard data types, most importantly dictionaries.

Nội dung chính Show

How to convert nested JSON file to dictionary in Python?
How to convert list of JSON to dictionary in Python?
How to create a dictionary from JSON in Python?
How to read a JSON file into a dictionary in Python?

First, I want to understand what I am dealing with and because the display of the JSON response is not so nice for the original URL, I use a JSON pretify tool like http://jsonprettify.com/.

This will give me the following reformatted JSON response

{
    "feed": {
        "author": {
            "name": {
                "label": "iTunes Store"
            },
            "uri": {
                "label": "http://www.apple.com/uk/itunes/"
            }
        },
        "entry": [
            {
                "author": {
                    "uri": {
                        "label": "https://itunes.apple.com/gb/reviews/id1413855597"
                    },
                    "name": {
                        "label": "VedantJM"
                    },
                    "label": ""
                },
                "updated": {
                    "label": "2022-05-31T14:20:49-07:00"
                },
                "im:rating": {
                    "label": "5"
                },
                "im:version": {
                    "label": "2.38"
                },
                "id": {
                    "label": "8727815893"
                },
                "title": {
                    "label": "Brilliant"
                },
                "content": {
                    "label": "Adonissss",
                    "attributes": {
                        "type": "text"
                    }
                },
                "link": {
                    "attributes": {
                        "rel": "related",
                        "href": "https://itunes.apple.com/gb/review?id=1500780518&type=Purple%20Software"
                    }
                },
                "im:voteSum": {
                    "label": "0"
                },
                "im:contentType": {
                    "attributes": {
                        "term": "Application",
                        "label": "Application"
                    }
                },
                "im:voteCount": {
                    "label": "0"
                }
            },
            ...

I’ve only shown the first

import pandas as pd

url = "https://itunes.apple.com/gb/rss/customerreviews/id=1500780518/sortBy=mostRecent/json"
pd.read_json(url)

4object of the

import pandas as pd

url = "https://itunes.apple.com/gb/rss/customerreviews/id=1500780518/sortBy=mostRecent/json"
pd.read_json(url)

5list. So the JSON response is structured in the following way:

we have a single root element “feed”
this root element has only two children, “author” and “entry”, from which I am only interested in “entry”
“entry” is a list of objects and each object has a set of properties like “author”, “link” and ,”im:rating”
Each property is again a JSON object
The most simple property is an object with just a “label” key and a value.
More complex properties like “author” are again nested

Before I dive deeper in how to parse this nested structure, let me try pandas

import pandas as pd

url = "https://itunes.apple.com/gb/rss/customerreviews/id=1500780518/sortBy=mostRecent/json"
pd.read_json(url)

6 method first.

import pandas as pd

url = "https://itunes.apple.com/gb/rss/customerreviews/id=1500780518/sortBy=mostRecent/json"
pd.read_json(url)

The output of this is the following table:

feedauthor{‘name’: {‘label’: ‘iTunes Store’}, ‘uri’: {‘l…entry[{‘author’: {‘uri’: {‘label’: ‘https://itunes….icon{‘label’: ‘http://itunes.apple.com/favicon.ico’}id{‘label’: ‘https://mzstoreservices-int-st.itun…link[{‘attributes’: {‘rel’: ‘alternate’, ‘type’: ‘…rights{‘label’: ‘Copyright 2008 Apple Inc.’}title{‘label’: ‘iTunes Store: Customer Reviews’}updated{‘label’: ‘2022-06-02T11:44:53-07:00’}

This is clearly not what I had in mind. The first problem I should eliminate is that pandas cannot possibly know that I am only interested in the “entry” list, so I will first fetch the JSON response, parse it into a dictionary and access the “entry” value:

import requests

url = "https://itunes.apple.com/gb/rss/customerreviews/id=1500780518/sortBy=mostRecent/json"

r = requests.get(url)

data = r.json()
entries = data["feed"]["entry"]

Thus,

import pandas as pd

url = "https://itunes.apple.com/gb/rss/customerreviews/id=1500780518/sortBy=mostRecent/json"
pd.read_json(url)

7 looks like this:

[{'author': {'label': '',
             'name': {'label': 'hdydgdbs'},
             'uri': {'label': 'https://itunes.apple.com/gb/reviews/id1351156521'}},
  'content': {'attributes': {'type': 'text'},
              'label': 'This meditation app is above all, it works and is '
                       'free, i reccomend it to everyone who wants to '
                       'meditate'},
  'id': {'label': '8730361700'},
  'im:contentType': {'attributes': {'label': 'Application',
                                    'term': 'Application'}},
  'im:rating': {'label': '5'},
  'im:version': {'label': '2.38'},
  'im:voteCount': {'label': '0'},
  'im:voteSum': {'label': '0'},
  'link': {'attributes': {'href': 'https://itunes.apple.com/gb/review?id=1500780518&type=Purple%20Software',
                          'rel': 'related'}},
  'title': {'label': 'Amazing app'},
  'updated': {'label': '2022-06-01T08:25:00-07:00'}},
  ...

Now, I can try pandas again. Note, that I no longer have a JSON string but a normal Python list, containing dictionaries. Therefore, I can directly use pandas

import pandas as pd

url = "https://itunes.apple.com/gb/rss/customerreviews/id=1500780518/sortBy=mostRecent/json"
pd.read_json(url)

3class:

df = pd.DataFrame(entries)

The first rows of this data frame looks as follows (

import pandas as pd

url = "https://itunes.apple.com/gb/rss/customerreviews/id=1500780518/sortBy=mostRecent/json"
pd.read_json(url)

9):

authorupdatedim:ratingim:versionidtitlecontentlinkim:voteSumim:contentTypeim:voteCount0{‘uri’: {‘label’: ‘https://itunes.apple.com/gb…{‘label’: ‘2022-06-01T08:25:00-07:00’}{‘label’: ‘5’}{‘label’: ‘2.38’}{‘label’: ‘8730361700’}{‘label’: ‘Amazing app’}{‘label’: ‘This meditation app is above all, i…{‘attributes’: {‘rel’: ‘related’, ‘href’: ‘htt…{‘label’: ‘0’}{‘attributes’: {‘term’: ‘Application’, ‘label’…{‘label’: ‘0’}1{‘uri’: {‘label’: ‘https://itunes.apple.com/gb…{‘label’: ‘2022-05-31T14:20:49-07:00’}{‘label’: ‘5’}{‘label’: ‘2.38’}{‘label’: ‘8727815893’}{‘label’: ‘Brilliant’}{‘label’: ‘Adonissss’, ‘attributes’: {‘type’: …{‘attributes’: {‘rel’: ‘related’, ‘href’: ‘htt…{‘label’: ‘0’}{‘attributes’: {‘term’: ‘Application’, ‘label’…{‘label’: ‘0’}2{‘uri’: {‘label’: ‘https://itunes.apple.com/gb…{‘label’: ‘2022-05-31T08:25:36-07:00’}{‘label’: ‘5’}{‘label’: ‘2.38’}{‘label’: ‘8726950116’}{‘label’: ‘Perfect’}{‘label’: ‘This app is the one for meditations…{‘attributes’: {‘rel’: ‘related’, ‘href’: ‘htt…{‘label’: ‘0’}{‘attributes’: {‘term’: ‘Application’, ‘label’…{‘label’: ‘0’}

Much better but still not there yet. We have the correct columns and each row is indeed one entry from the entries list. However, all values are strings and, worse, a string representation of the inner dictionaries (and sometimes multiple nested dictionaries). I cannot work with data like this so we have to manually parse the list of entries, which I will explain next.

Looking again at the structure of the entries (see Listing “JSON response”), the strategy is simple: go through each entry, and as long as the value is a dictionary, concatenate the keys to a single column name and the final value is the value for this column and row.

Now a very crude first attempt could be to hardcode all attribute names like this:

parsed_data = defaultdict(list)

for entry in entries:
    parsed_data["author_uri"].append(entry["author"]["uri"]["label"])
    parsed_data["author_name"].append(entry["author"]["name"]["label"])
    parsed_data["author_label"].append(entry["author"]["label"])
    parsed_data["content_label"].append(entry["content"]["label"])
    parsed_data["content_attributes_type"].append(entry["content"]["attributes"]["type"])
    ...

This implementation might be naive and does not generalize at all to any other use case, but it is still a highly effective method to begin with because it forces you to explicitly state the JSON structure down to the last element. That this method works can be tested again with the pandas

import pandas as pd

url = "https://itunes.apple.com/gb/rss/customerreviews/id=1500780518/sortBy=mostRecent/json"
pd.read_json(url)

3class that can create a data frame from a dictionary that has a list of values for each column:

pd.DataFrame(parsed_data)

The output will be a data frame like this:

author_uriauthor_nameauthor_labelcontent_labelcontent_attributes_type0https://itunes.apple.com/gb/reviews/id1351156521hdydgdbsThis meditation app is above all, it works and…text1https://itunes.apple.com/gb/reviews/id1413855597VedantJMAdonisssstext2https://itunes.apple.com/gb/reviews/id1413779831dtnvcgiifghThis app is the one for meditations, great sel…text

However, aiming at a more general solution that can deal automatically with all attributes/properties without knowing the structure (but relying on the fact that there are only two levels of nested dictionaries, at least for now), I derived at the following solution:

parsed_data = defaultdict(list)

for entry in entries:
    for key, val in entry.items():
        for subkey, subval in val.items():
            if not isinstance(subval, dict):
                parsed_data[f"{key}_{subkey}"].append(subval)
            else:
                for att_key, att_val in subval.items():
                    parsed_data[f"{key}_{subkey}_{att_key}"].append(att_val)

The code is not the most beautiful one but I will come to this later. For now let’s focus on the intend: For each entry I look at the first key-value pair, knowing that value is always a dictionary (object in JSON). Now I have to deal with two different cases. In the first case, the value dictionary is flat and does not contain another dictionary, only key-value pairs. This is the simple case in which I combine the outer key with the inner key to a column name and take the value as column value for each pair. In the second case, the dictionary contains a key-value pair where the value is again a dictionary. I rely on the fact that there are at most two levels of nested dictionaries so I iterate over the key-value pairs of the inner dictionary and again combine the outer key and the most inner key to a column name and take the inner value as column value.

This procedure gives me a dictionary where the keys are the column names of the data frame and each key has a list as value with the row values for this column. This is the perfect format for the pandas DataFrame class to create a data frame from:

df = pd.DataFrame(parsed_data)
df.head()

And the first rows look like this:

author_uri_labelauthor_name_labelauthor_labelupdated_labelim:rating_labelim:version_labelid_labeltitle_labelcontent_labelcontent_attributes_typelink_attributes_rellink_attributes_hrefim:voteSum_labelim:contentType_attributes_termim:contentType_attributes_labelim:voteCount_label0https://itunes.apple.com/gb/reviews/id1351156521hdydgdbs2022-06-01T08:25:00-07:0052.388730361700Amazing appThis meditation app is above all, it works and…textrelatedhttps://itunes.apple.com/gb/review?id=15007805…0ApplicationApplication01https://itunes.apple.com/gb/reviews/id1413855597VedantJM2022-05-31T14:20:49-07:0052.388727815893BrilliantAdonisssstextrelatedhttps://itunes.apple.com/gb/review?id=15007805…0ApplicationApplication02https://itunes.apple.com/gb/reviews/id1413779831dtnvcgiifgh2022-05-31T08:25:36-07:0052.388726950116PerfectThis app is the one for meditations, great sel…textrelatedhttps://itunes.apple.com/gb/review?id=15007805…0ApplicationApplication0

And there it is! I have a few more columns than originally expected because I decided to keep every bit of information by flattening the nested structure of dicts into a single dict where each combination of attributes is preserved by concatenating the different keys into a single column name, separated by an underscore “_”. This data frame has 50 rows and 16 columns, which is in accordance with the original JSON response. If you dislike the additional “label” part in the column names, it is easy to get rid of it:

df.columns = [col if not "label" in col else "_".join(col.split("_")[:-1]) for col in df.columns]

Right now, all columns have the data type

import requests

url = "https://itunes.apple.com/gb/rss/customerreviews/id=1500780518/sortBy=mostRecent/json"

r = requests.get(url)

data = r.json()
entries = data["feed"]["entry"]

1, which is not ideal memory-wise, but does not have a huge impact as long as the data set is as small as this. However, I can change the dtype with a simple one-liner:

import pandas as pd

url = "https://itunes.apple.com/gb/rss/customerreviews/id=1500780518/sortBy=mostRecent/json"
pd.read_json(url)

0

Pandas

import requests

url = "https://itunes.apple.com/gb/rss/customerreviews/id=1500780518/sortBy=mostRecent/json"

r = requests.get(url)

data = r.json()
entries = data["feed"]["entry"]

2 method confirms the cast:

import pandas as pd

url = "https://itunes.apple.com/gb/rss/customerreviews/id=1500780518/sortBy=mostRecent/json"
pd.read_json(url)

1

To conclude this article, I want to improve the reusability of my code. The first obvious thing to do would be to extract the parsing logic into one or several functions with proper type annotation and docstring. However, this is not the focus of this article so I will leave this part to the more practically inclined reader.

Instead, I want to stress that my solution (Listing “advanced implementation”) breaks for deeper nested JSON structures. That is because I had to explicitly iterate over the inner dictionaries with a

import requests

url = "https://itunes.apple.com/gb/rss/customerreviews/id=1500780518/sortBy=mostRecent/json"

r = requests.get(url)

data = r.json()
entries = data["feed"]["entry"]

3loop for each dictionary. A better solution to such a problem is a recursive approach where we apply a divide-and-conquer paradigm to handle the complexity. In other words what I really intend to do is to go into each dictionary as long as there are inner dictionaries and once I reach the end, add all values as separate columns:

import pandas as pd

url = "https://itunes.apple.com/gb/rss/customerreviews/id=1500780518/sortBy=mostRecent/json"
pd.read_json(url)

2

Isn’t that a beauty! Like often when a recursive approach is more natural to the task at hand the recursive implementation is more readable and often shorter than the iterative approach. You can verify yourself that the data frame obtained by this approach is identical to the data frame obtained from the previous iterative solution.

There are of course other approaches. A common strategy is to flatten the original JSON by doing something very similar like we did here: pull out all nested objects by concatenating all keys and keeping the final inner value. If you change the original JSON like this you obtain a JSON that can be directly fed into pandas. There is even a module you can use right out of the box: flatten_json. But where would be the fun in this…

I hope you enjoyed following me on this little journey and as always I am open for your comments, discussions and questions.

How to convert nested JSON file to dictionary in Python?

This function is used to parse the JSON string..

Syntax: json.load(file_name).

Parameter: It takes JSON file as the parameter..

Return type: It returns the python dictionary object..

How to convert list of JSON to dictionary in Python?

To convert Python JSON string to Dictionary, use json. loads() function. Note that only if the JSON content is a JSON Object, and when parsed using loads() function, we get Python Dictionary object. JSON content with array of objects will be converted to a Python list by loads() function.

How to create a dictionary from JSON in Python?

Convert JSON to a dictionary.

Import the json module in the program..

Open the sample JSON file which we created above..

Convert the file data into dictionary using json. ... .

Check the type of the value returned by the json. ... .

Print the key: value pairs inside the Python dictionary using a for loop..

How to read a JSON file into a dictionary in Python?

JSON String to Python Dictionary To do this, we will use the loads() function of the json module, passing the string as the argument. json. loads(data_JSON) creates a new dictionary with the key-value pairs of the JSON string and it returns this new dictionary.

How to convert nested JSON to dictionary in Python

How to convert nested JSON file to dictionary in Python?

How to convert list of JSON to dictionary in Python?

How to create a dictionary from JSON in Python?

How to read a JSON file into a dictionary in Python?

Bài Viết Liên Quan

Có thể bạn quan tâm

Xem Nhiều

Chúng tôi

Điều khoản

Trợ giúp

Mạng xã hội