git rekt — gemini-redirect.git (697f46b06bde25e3fbc5ea43480eedc1ac516769): blog/mdad/developing-a-python-application-for-cassandra/index.html

blog/mdad/developing-a-python-application-for-cassandra/index.html (view raw)
  1<!DOCTYPE html>
  2<html>
  3<head>
  4<meta charset="utf-8" />
  5<meta name="viewport" content="width=device-width, initial-scale=1" />
  6<title>Developing a Python application for Cassandra</title>
  7<link rel="stylesheet" href="../css/style.css">
  8</head>
  9<body>
 10<main>
 11<p><em><strong>Warning</strong>: this post is, in fact, a shameless self-plug to my own library. If you continue reading, you accept that you are okay with this. Otherwise, please close the tab, shut down your computer, and set it on fire.__(Also, that was a joke. Please don’t do that.)</em></p>
 12<div class="date-created-modified">Created 2020-03-23<br>
 13Modified 2020-04-16</div>
 14<p>Let’s do some programming! Today we will be making a tiny CLI application in <a href="http://python.org/">Python</a> that queries <a href="https://core.telegram.org/api">Telegram’s API</a> and stores the data in <a href="http://cassandra.apache.org/">Cassandra</a>.</p>
 15<h2 class="title" id="our_goal"><a class="anchor" href="#our_goal">¶</a>Our goal</h2>
 16<p>Our goal is to make a Python console application. This application will connect to <a href="https://telegram.org/">Telegram</a>, and ask for your account credentials. Once you have logged in, the application will fetch all of your open conversations and we will store these in Cassandra.</p>
 17<p>With the data saved in Cassandra, we can now very efficiently query information about your conversations given their identifier offline (no need to query Telegram anymore).</p>
 18<p><strong>In short</strong>, we are making an application that performs efficient offline queries to Cassandra to print out information about your Telegram conversations given the ID you want to query.</p>
 19<h2 id="data_model"><a class="anchor" href="#data_model">¶</a>Data model</h2>
 20<p>The application itself is really simple, and we only need one table to store all the relevant information we will be needing. This table called <code>**users**</code> will contain the following columns:</p>
 21<ul>
 22<li><code>**id**</code>, of type <code>int</code>. This will also be the <code>primary key</code> and we’ll use it to query the database later on.</li>
 23<li><code>**first_name**</code>, of type <code>varchar</code>. This field contains the first name of the stored user.</li>
 24<li><code>**last_name**</code>, of type <code>varchar</code>. This field contains the last name of the stored user.</li>
 25<li><code>**username**</code>, of type <code>varchar</code>. This field contains the username of the stored user.
 26Because Cassandra uses a <a href="https://cassandra.apache.org/doc/latest/architecture/overview.html">wide column storage model</a>, direct access through a key is the most efficient way to query the database. In our case, the key is the primary key of the <code>users</code> table, using the <code>id</code> column. The index for the primary key is ready to be used as soon as we create the table, so we don’t need to create it on our own.</li>
 27</ul>
 28<h2 id="dependencies"><a class="anchor" href="#dependencies">¶</a>Dependencies</h2>
 29<p>Because we will program it in Python, you need Python installed. You can install it using a package manager of your choice or heading over to the <a href="https://www.python.org/downloads/">Python downloads section</a>, but if you’re on Linux, chances are you have it installed already.</p>
 30<p>Once Python 3.5 or above is installed, get a copy of the Cassandra driver for Python and Telethon through <code>pip</code>:</p>
 31<pre><code>pip install cassandra-driver telethon
 32</code></pre>
 33<p>For more details on that, see the <a href="https://docs.datastax.com/en/developer/python-driver/3.22/installation/">installation guide for <code>cassandra-driver</code></a>, or the <a href="https://docs.telethon.dev/en/latest/basic/installation.html">installation guide for <code>telethon</code></a>.</p>
 34<p>As we did in our <a href="/blog/mdad/cassandra-operaciones-basicas-y-arquitectura/">previous post</a>, we will setup a new keyspace for this application with <code>cqlsh</code>. We will also create a table to store the users into. This could all be automated in the Python code, but because it’s a one-time thing, we prefer to use <code>cqlsh</code>.</p>
 35<p>Make sure that Cassandra is running in the background. We can’t make queries to it if it’s not running.</p>
 36<pre><code>$ bin/cqlsh
 37Connected to Test Cluster at 127.0.0.1:9042.
 38[cqlsh 5.0.1 | Cassandra 3.11.6 | CQL spec 3.4.4 | Native protocol v4]
 39Use HELP for help.
 40cqlsh&gt; create keyspace mdad with replication = {'class': 'SimpleStrategy', 'replication_factor': 3};
 41cqlsh&gt; use mdad;
 42cqlsh:mdad&gt; create table users(id int primary key, first_name varchar, last_name varchar, username varchar);
 43</code></pre>
 44<p>Python installed? Check. Python dependencies? Check. Cassandra ready? Check.</p>
 45<h2 id="the_code"><a class="anchor" href="#the_code">¶</a>The code</h2>
 46<h3 id="getting_users"><a class="anchor" href="#getting_users">¶</a>Getting users</h3>
 47<p>The first step is connecting to <a href="https://core.telegram.org/api">Telegram’s API</a>, for which we’ll use <a href="https://telethon.dev/">Telethon</a>, a wonderful (wink, wink) Python library to interface with it.</p>
 48<p>As with most APIs, we need to supply <a href="https://my.telegram.org/">our API key</a> in order to use it (here <code>API_ID</code> and <code>API_HASH</code>). We will refer to them as constants. At the end, you may download the entire code and use my own key for this example. But please don’t use those values for your other applications!</p>
 49<p>It’s pretty simple: we create a client, and for every dialog (that is, open conversation) we have, do some checks:</p>
 50<ul>
 51<li>If it’s an user, we just store that in a dictionary mapping <code>ID → User</code>.</li>
 52<li>Else if it’s a group, we iterate over the participants and store those users instead.</li>
 53</ul>
 54<pre><code>async def load_users():
 55    from telethon import TelegramClient
 56
 57    users = {}
 58
 59    async with TelegramClient(SESSION, API_ID, API_HASH) as client:
 60        async for dialog in client.iter_dialogs():
 61            if dialog.is_user:
 62                user = dialog.entity
 63                users[user.id] = user
 64                print('found user:', user.id, file=sys.stderr)
 65
 66            elif dialog.is_group:
 67                async for user in client.iter_participants(dialog):
 68                    users[user.id] = user
 69                    print('found member:', user.id, file=sys.stderr)
 70
 71    return list(users.values())
 72</code></pre>
 73<p>With this we have a mapping ID to user, so we know we won’t have duplicates. We simply return the list of user values, because that’s all we care about.</p>
 74<h3 id="saving_users"><a class="anchor" href="#saving_users">¶</a>Saving users</h3>
 75<p>Inserting users into Cassandra is pretty straightforward. We take the list of <code>User</code> objects as input, and prepare a new <code>INSERT</code> statement that we can reuse (because we will be using it in a loop, this is the best way to do it).</p>
 76<p>For each user, execute the statement with the user data as input parameters. Simple as that.</p>
 77<pre><code>def save_users(session, users):
 78    insert_stmt = session.prepare(
 79        'INSERT INTO users (id, first_name, last_name, username) ' 
 80        'VALUES (?, ?, ?, ?)')
 81
 82    for user in users:
 83        row = (user.id, user.first_name, user.last_name, user.username)
 84        session.execute(insert_stmt, row)
 85</code></pre>
 86<h3 id="fetching_users"><a class="anchor" href="#fetching_users">¶</a>Fetching users</h3>
 87<p>Given a list of users, yield all of them from the database. Similar to before, we prepare a <code>SELECT</code> statement and just execute it repeatedly over the input user IDs.</p>
 88<pre><code>def fetch_users(session, users):
 89    select_stmt = session.prepare('SELECT * FROM users WHERE id = ?')
 90
 91    for user_id in users:
 92        yield session.execute(select_stmt, (user_id,)).one()
 93</code></pre>
 94<h3 id="parsing_arguments"><a class="anchor" href="#parsing_arguments">¶</a>Parsing arguments</h3>
 95<p>We’ll be making a little CLI application, so we need to parse console arguments. It won’t be anything fancy, though. For that we’ll be using <a href="https://docs.python.org/3/library/argparse.html">Python’s <code>argparse</code> module</a>:</p>
 96<pre><code>def parse_args():
 97    import argparse
 98
 99    parser = argparse.ArgumentParser(
100        description='Dump and query Telegram users')
101
102    parser.add_argument('users', type=int, nargs='*',
103        help='one or more user IDs to query for')
104
105    parser.add_argument('--load-users', action='store_true',
106        help='load users from Telegram (do this first run)')
107
108    return parser.parse_args()
109</code></pre>
110<h3 id="all_together"><a class="anchor" href="#all_together">¶</a>All together</h3>
111<p>Last, the entry point. We import a Cassandra Cluster, and connect to some default keyspace (we called it <code>mdad</code> earlier).</p>
112<p>If the user wants to load the users into the database, we’ll do just that first.</p>
113<p>Then, for each user we fetch from the database, we print it. Last names and usernames are optional, so don’t print those if they’re missing (<code>None</code>).</p>
114<pre><code>async def main(args):
115    from cassandra.cluster import Cluster
116
117    cluster = Cluster(CLUSTER_NODES)
118    session = cluster.connect(KEYSPACE)
119
120    if args.load_users:
121        users = await load_users()
122        save_users(session, users)
123
124    for user in fetch_users(session, args.users):
125        print('User', user.id, ':')
126        print('  First name:', user.first_name)
127        if user.last_name:
128            print('  Last name:', user.last_name)
129        if user.username:
130            print('  Username:', user.username)
131
132        print()
133
134if __name__ == '__main__':
135    asyncio.run(main(parse_args()))
136</code></pre>
137<p>Because Telethon is an <code>[asyncio](https://docs.python.org/3/library/asyncio.html)</code> library, we define it as <code>async def main(...)</code> and run it with <code>asyncio.run(main(...))</code>.</p>
138<p>Here’s what it looks like in action:</p>
139<pre><code>$ python data.py --help
140usage: data.py [-h] [--load-users] [users [users ...]]
141
142Dump and query Telegram users
143
144positional arguments:
145  users         one or more user IDs to query for
146
147optional arguments:
148  -h, --help    show this help message and exit
149  --load-users  load users from Telegram (do this first run)
150
151$ python data.py --load-users
152found user: 487158
153found member: 59794114
154found member: 487158
155found member: 191045991
156(...a lot more output)
157
158$ python data.py 487158 59794114
159User 487158 :
160  First name: Rick
161  Last name: Pickle
162
163User 59794114 :
164  Firt name: Peter
165  Username: pete
166</code></pre>
167<p>Telegram’s data now persists in Cassandra, and we can efficiently query it whenever we need to! I would’ve shown a video presenting its usage, but I’m afraid that would leak some of the data I want to keep private :-).</p>
168<p>Feel free to download the code and try it yourself:</p>
169<p><em>download removed</em></p>
170<h2 id="references"><a class="anchor" href="#references">¶</a>References</h2>
171<ul>
172<li><a href="https://docs.datastax.com/en/developer/python-driver/3.22/getting_started/">DataStax Python Driver for Apache Cassandra – Getting Started</a></li>
173<li><a href="https://docs.telethon.dev/en/latest/">Telethon’s Documentation</a></li>
174</ul>
175</main>
176</body>
177</html>
178