git rekt — gemini-redirect.git (5d664f791772051dc11114d916bce7915232d197): blog/mdad/developing-a-python-application-for-cassandra/index.html

<!DOCTYPE html><html lang=en><head><meta charset=utf-8><meta name=description content="Official Lonami's website"><meta name=viewport content="width=device-width, initial-scale=1.0, user-scalable=yes"><title> Developing a Python application for Cassandra | Lonami's Blog </title><link rel=stylesheet href=/style.css><body><article><nav class=sections><ul><li><a href=/>lonami's site</a><li><a href=/blog class=selected>blog</a><li><a href=/golb>golb</a></ul></nav><main><h1 class=title>Developing a Python application for Cassandra</h1><div class=time><p>2020-03-23T00:00:00+00:00<p>last updated 2020-04-16T07:52:26+00:00</div><p><em><strong>Warning</strong>: this post is, in fact, a shameless self-plug for my own library. If you continue reading, you accept that you are okay with this. Otherwise, please close the tab, shut down your computer, and set it on fire. (Also, that was a joke. Please don’t do that.)</em><p>Let’s do some programming! Today we will be making a tiny CLI application in <a href=http://python.org/>Python</a> that queries <a href=https://core.telegram.org/api>Telegram’s API</a> and stores the data in <a href=http://cassandra.apache.org/>Cassandra</a>.<h2 id=our-goal>Our goal</h2><p>Our goal is to make a Python console application. This application will connect to <a href=https://telegram.org/>Telegram</a> and ask for your account credentials. 
Once you have logged in, the application will fetch all of your open conversations and store them in Cassandra.<p>With the data saved in Cassandra, we can efficiently look up information about your conversations by their identifier while offline (no need to query Telegram anymore).<p><strong>In short</strong>, we are making an application that performs efficient offline queries to Cassandra and prints out information about your Telegram conversations for the IDs you ask about.<h2 id=data-model>Data model</h2><p>The application itself is really simple, and we only need one table to store all the information we need. This table, called <code>users</code>, will contain the following columns:<ul><li><code>id</code>, of type <code>int</code>. This will also be the <code>primary key</code> and we’ll use it to query the database later on.<li><code>first_name</code>, of type <code>varchar</code>. This field contains the first name of the stored user.<li><code>last_name</code>, of type <code>varchar</code>. This field contains the last name of the stored user.<li><code>username</code>, of type <code>varchar</code>. This field contains the username of the stored user.</ul><p>Because Cassandra uses a <a href=https://cassandra.apache.org/doc/latest/architecture/overview.html>wide column storage model</a>, direct access through a key is the most efficient way to query the database. In our case, that key is the <code>id</code> column, the primary key of the <code>users</code> table. The index for the primary key is ready to be used as soon as we create the table, so we don’t need to create it on our own.<h2 id=dependencies>Dependencies</h2><p>Because we will program it in Python, you need Python installed. 
You can install it using a package manager of your choice or by heading over to the <a href=https://www.python.org/downloads/>Python downloads section</a>, but if you’re on Linux, chances are you have it installed already.<p>Once Python 3.7 or above is installed (the entry point relies on <code>asyncio.run</code>, which was added in 3.7), get a copy of the Cassandra driver for Python and Telethon through <code>pip</code>:<pre><code>pip install cassandra-driver telethon
</code></pre><p>For more details on that, see the <a href=https://docs.datastax.com/en/developer/python-driver/3.22/installation/>installation guide for <code>cassandra-driver</code></a>, or the <a href=https://docs.telethon.dev/en/latest/basic/installation.html>installation guide for <code>telethon</code></a>.<p>As we did in our <a href=/blog/mdad/cassandra-operaciones-basicas-y-arquitectura/>previous post</a>, we will set up a new keyspace for this application with <code>cqlsh</code>. We will also create a table to store the users in. This could all be automated in the Python code, but because it’s a one-time thing, we prefer to use <code>cqlsh</code>.<p>Make sure that Cassandra is running in the background. We can’t make queries to it if it’s not running.<pre><code>$ bin/cqlsh
Connected to Test Cluster at 127.0.0.1:9042.
[cqlsh 5.0.1 | Cassandra 3.11.6 | CQL spec 3.4.4 | Native protocol v4]
Use HELP for help.
cqlsh> create keyspace mdad with replication = {'class': 'SimpleStrategy', 'replication_factor': 3};
cqlsh> use mdad;
cqlsh:mdad> create table users(id int primary key, first_name varchar, last_name varchar, username varchar);
</code></pre><p>Python installed? Check. Python dependencies? Check. Cassandra ready? Check.<h2 id=the-code>The code</h2><h3 id=getting-users>Getting users</h3><p>The first step is connecting to <a href=https://core.telegram.org/api>Telegram’s API</a>, for which we’ll use <a href=https://telethon.dev/>Telethon</a>, a wonderful (wink, wink) Python library to interface with it.<p>As with most APIs, we need to supply <a href=https://my.telegram.org/>our API key</a> in order to use it (here <code>API_ID</code> and <code>API_HASH</code>). We will refer to them as constants (<code>SESSION</code>, also a constant, names the session file in which Telethon remembers your login). At the end, you may download the entire code and use my own key for this example. But please don’t use those values for your other applications!<p>It’s pretty simple: we create a client, and for every dialog (that is, open conversation) we have, we do some checks:<ul><li>If it’s a user, we just store that in a dictionary mapping <code>ID → User</code>.<li>Else if it’s a group, we iterate over the participants and store those users instead.</ul><pre><code>async def load_users():
    import sys
    from telethon import TelegramClient

    users = {}  # maps user ID to User, so each user is stored only once

    async with TelegramClient(SESSION, API_ID, API_HASH) as client:
        async for dialog in client.iter_dialogs():
            if dialog.is_user:
                # private conversation: the entity is the user itself
                user = dialog.entity
                users[user.id] = user
                print('found user:', user.id, file=sys.stderr)

            elif dialog.is_group:
                # group: store every participant instead
                async for user in client.iter_participants(dialog):
                    users[user.id] = user
                    print('found member:', user.id, file=sys.stderr)

    return list(users.values())
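
# A minimal, Telegram-free sketch of the de-duplication used above. FakeUser and
# dedupe_by_id are hypothetical stand-ins (not part of Telethon or of this script):
# keying a dict by ID keeps a single entry per user, however often they show up.
from collections import namedtuple

FakeUser = namedtuple('FakeUser', 'id first_name')

def dedupe_by_id(found):
    users = {}
    for user in found:
        users[user.id] = user  # a repeated ID overwrites the earlier entry
    return list(users.values())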
</code></pre><p>With this we have a mapping from ID to user, so we know we won’t have duplicates. We simply return the list of user values, because that’s all we care about.<h3 id=saving-users>Saving users</h3><p>Inserting users into Cassandra is pretty straightforward. We take the list of <code>User</code> objects as input, and prepare a new <code>INSERT</code> statement that we can reuse (we will be executing it in a loop, and a prepared statement only has to be parsed once).<p>For each user, we execute the statement with the user data as input parameters. Simple as that.<pre><code>def save_users(session, users):
    insert_stmt = session.prepare(
        'INSERT INTO users (id, first_name, last_name, username) '
        'VALUES (?, ?, ?, ?)')

    for user in users:
        row = (user.id, user.first_name, user.last_name, user.username)
        session.execute(insert_stmt, row)
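
# A dry-run sketch of the same prepare-once, execute-per-row pattern.
# RecordingSession is a hypothetical stand-in (not part of cassandra-driver) that
# only records the calls it receives, so the loop can be tried without a cluster.
class RecordingSession:
    def __init__(self):
        self.prepared = []
        self.executed = []

    def prepare(self, query):
        self.prepared.append(query)
        return query  # a real session returns a PreparedStatement instead

    def execute(self, statement, parameters):
        self.executed.append((statement, parameters))

# Passing a RecordingSession() to save_users() should leave one entry in
# .prepared and one entry per user in .executed.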
</code></pre><h3 id=fetching-users>Fetching users</h3><p>Given a list of user IDs, we yield the matching rows from the database. Similar to before, we prepare a <code>SELECT</code> statement and just execute it repeatedly over the input user IDs. IDs that are not in the table are skipped, because <code>one()</code> returns <code>None</code> for them.<pre><code>def fetch_users(session, users):
    select_stmt = session.prepare('SELECT * FROM users WHERE id = ?')

    for user_id in users:
        row = session.execute(select_stmt, (user_id,)).one()
        if row is not None:  # one() returns None when no row matches
            yield row
</code></pre><h3 id=parsing-arguments>Parsing arguments</h3><p>We’ll be making a little CLI application, so we need to parse console arguments. It won’t be anything fancy, though. For that we’ll be using <a href=https://docs.python.org/3/library/argparse.html>Python’s <code>argparse</code> module</a>:<pre><code>def parse_args():
    import argparse

    parser = argparse.ArgumentParser(
        description='Dump and query Telegram users')

    parser.add_argument('users', type=int, nargs='*',
        help='one or more user IDs to query for')

    parser.add_argument('--load-users', action='store_true',
        help='load users from Telegram (do this first run)')

    return parser.parse_args()
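
# A quick way to check the parser without touching the real command line. This is
# only a sketch: build_parser is a hypothetical helper mirroring parse_args() above,
# and the explicit argument list stands in for sys.argv[1:].
import argparse

def build_parser():
    parser = argparse.ArgumentParser(description='Dump and query Telegram users')
    parser.add_argument('users', type=int, nargs='*')
    parser.add_argument('--load-users', action='store_true')
    return parser

# build_parser().parse_args(['--load-users', '487158', '59794114'])
# gives load_users=True and users=[487158, 59794114]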
</code></pre><h3 id=all-together>All together</h3><p>Last, the entry point. We import the driver’s <code>Cluster</code> class, connect to the cluster, and open a session on our keyspace (we called it <code>mdad</code> earlier).<p>If the user wants to load the users into the database, we’ll do just that first.<p>Then, for each user we fetch from the database, we print it. Last names and usernames are optional, so we don’t print those if they’re missing (<code>None</code>).<pre><code>async def main(args):
    from cassandra.cluster import Cluster

    # CLUSTER_NODES is a list of contact points and KEYSPACE the keyspace name,
    # e.g. ['127.0.0.1'] and 'mdad' for the setup shown earlier
    cluster = Cluster(CLUSTER_NODES)
    session = cluster.connect(KEYSPACE)

    if args.load_users:
        users = await load_users()
        save_users(session, users)

    for user in fetch_users(session, args.users):
        print('User', user.id, ':')
        print('  First name:', user.first_name)
        if user.last_name:
            print('  Last name:', user.last_name)
        if user.username:
            print('  Username:', user.username)

        print()

if __name__ == '__main__':
    import asyncio

    asyncio.run(main(parse_args()))
</code></pre><p>Because Telethon is an <a href=https://docs.python.org/3/library/asyncio.html><code>asyncio</code></a> library, we define the entry point as <code>async def main(...)</code> and run it with <code>asyncio.run(main(...))</code>.<p>Here’s what it looks like in action:<pre><code>$ python data.py --help
usage: data.py [-h] [--load-users] [users [users ...]]

Dump and query Telegram users

positional arguments:
  users         one or more user IDs to query for

optional arguments:
  -h, --help    show this help message and exit
  --load-users  load users from Telegram (do this first run)

$ python data.py --load-users
found user: 487158
found member: 59794114
found member: 487158
found member: 191045991
(...a lot more output)

$ python data.py 487158 59794114
User 487158 :
  First name: Rick
  Last name: Pickle

User 59794114 :
  First name: Peter
  Username: pete
</code></pre><p>Telegram’s data now persists in Cassandra, and we can efficiently query it whenever we need to! I would’ve shown a video presenting its usage, but I’m afraid that would leak some of the data I want to keep private :-).<p>Feel free to download the code and try it yourself:<p><em>download removed</em><h2 id=references>References</h2><ul><li><a href=https://docs.datastax.com/en/developer/python-driver/3.22/getting_started/>DataStax Python Driver for Apache Cassandra – Getting Started</a><li><a href=https://docs.telethon.dev/en/latest/>Telethon’s Documentation</a></ul></main><footer><div><p>Share your thoughts, or simply come hang with me <a href=https://t.me/LonamiWebs><img src=img/telegram.svg alt=Telegram></a> <a href=mailto:totufals@hotmail.com><img src=img/mail.svg alt=Mail></a></div></footer></article><p class=abyss>Glaze into the abyss… Oh hi there!